[00:05:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79274 and previous config saved to /var/cache/conftool/dbconfig/20250717-000537-marostegui.json
[00:07:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:08:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170219
[00:08:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170219 (owner: TrainBranchBot)
[00:09:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:14:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:14:57] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:20:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399249)', diff saved to https://phabricator.wikimedia.org/P79275 and previous config saved to /var/cache/conftool/dbconfig/20250717-002045-marostegui.json
[00:20:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2226.codfw.wmnet with reason: Maintenance
[00:20:49] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[00:20:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79276 and previous config saved to /var/cache/conftool/dbconfig/20250717-002056-marostegui.json
[00:26:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:26:17] <wikibugs>	 (PS5) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[00:27:29] <wikibugs>	 (CR) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[00:29:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170219 (owner: TrainBranchBot)
[00:31:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:58:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:01:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79277 and previous config saved to /var/cache/conftool/dbconfig/20250717-010111-marostegui.json
[01:01:16] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:03:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:03:48] <wikibugs>	 SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11011645 (KartikMistry) @Dzahn Yes. We can remove MinT models from our home directori...
[01:08:10] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:16:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79278 and previous config saved to /var/cache/conftool/dbconfig/20250717-011619-marostegui.json
[01:31:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79279 and previous config saved to /var/cache/conftool/dbconfig/20250717-013127-marostegui.json
[01:46:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79280 and previous config saved to /var/cache/conftool/dbconfig/20250717-014635-marostegui.json
[01:46:40] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:46:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance
[01:46:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79281 and previous config saved to /var/cache/conftool/dbconfig/20250717-014658-marostegui.json
[01:51:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:56:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:04:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:21:07] <icinga-wm>	 PROBLEM - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 391035MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops
[02:25:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:30:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:50:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79282 and previous config saved to /var/cache/conftool/dbconfig/20250717-025002-marostegui.json
[02:50:08] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:52:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:54:25] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:57:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:05:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79283 and previous config saved to /var/cache/conftool/dbconfig/20250717-030511-marostegui.json
[03:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:14:21] <wikibugs>	 (CR) Krinkle: "Yeah, I've gone ahead and swapped the 2021 patch version for PS8 on deployment prep. Unlike the 2021 version, which fixed beta by breaking" [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: Krinkle)
[03:20:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79284 and previous config saved to /var/cache/conftool/dbconfig/20250717-032020-marostegui.json
[03:21:07] <icinga-wm>	 RECOVERY - Disk space on dbprov2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops
[03:22:13] <wikibugs>	 (PS12) Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318)
[03:24:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:25:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:30:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:31:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:32:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:35:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79285 and previous config saved to /var/cache/conftool/dbconfig/20250717-033528-marostegui.json
[03:35:32] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[03:36:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:42:40] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:47:40] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:51:45] <wikibugs>	 SRE, Traffic: "Backend fetch failed" on edit save - https://phabricator.wikimedia.org/T382790#11011741 (BCornwall) Hi, @MGChecker! I apologize for the long delay in getting back to you on this. Would you say that this is still an issue since you opened the task?
[03:54:52] <wikibugs>	 SRE, Commons, MediaWiki-Uploading, Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11011752 (BCornwall) @Underbar_dk Since some time has passed, have you observed similar difficulties with various sites on IPv6? Or would you say that...
[03:57:10] <wikibugs>	 SRE, Commons, Traffic: Backend fetch failed - https://phabricator.wikimedia.org/T383013#11011754 (BCornwall) Hi, @Jeff_G! I apologize for the delay in getting to you on this - Would you say this was a transient issue or a persistent one?
[04:13:40] <wikibugs>	 (CR) Dragoniez: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[04:14:57] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:23:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:26:46] <wikibugs>	 (PS6) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[04:27:39] <wikibugs>	 (CR) Tryvix1509: "Could you please re-review again, sorry but I didn't update commit message." [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[04:28:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:06:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:09:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:11:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:17:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:19:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:21:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[05:21:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:40:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:45:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:50:01] <wikibugs>	 (PS2) Effie Mouzeli: prometheus::ops add job to scrape hCaptcha proxy metrics [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211)
[05:52:55] <wikibugs>	 (CR) Effie Mouzeli: "Thank you for the review!" [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[05:53:07] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0600).
[06:02:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: maintenance
[06:05:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:06:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1258 with weight 0 T399699', diff saved to https://phabricator.wikimedia.org/P79286 and previous config saved to /var/cache/conftool/dbconfig/20250717-060629-root.json
[06:06:30] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x3 T399699
[06:06:34] <stashbot>	 T399699: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T399699
[06:07:47] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1258 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1170098 (https://phabricator.wikimedia.org/T399699) (owner: Gerrit maintenance bot)
[06:09:29] <marostegui>	 !log Starting x3 eqiad failover from db1255 to db1258 - T399699
[06:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:12:07] <wikibugs>	 (PS1) Marostegui: dbconfig.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699)
[06:12:35] <wikibugs>	 (CR) Marostegui: "This was breaking dbctl during the switchover" [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui)
[06:13:31] <wikibugs>	 (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui)
[06:14:22] <wikibugs>	 (CR) Marostegui: [C:+2] dbconfig.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui)
[06:15:13] <wikibugs>	 (CR) Ryan Kemper: Replace elasticsearch api with python requests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[06:17:04] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] prometheus::ops add job to scrape hCaptcha proxy metrics [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[06:18:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set x3 eqiad as read-only for maintenance - T399699', diff saved to https://phabricator.wikimedia.org/P79287 and previous config saved to /var/cache/conftool/dbconfig/20250717-061800-root.json
[06:18:05] <stashbot>	 T399699: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T399699
[06:18:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1258 to x3 primary and set section read-write T399699', diff saved to https://phabricator.wikimedia.org/P79288 and previous config saved to /var/cache/conftool/dbconfig/20250717-061832-marostegui.json
[06:19:05] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update x3-master alias [dns] - https://gerrit.wikimedia.org/r/1170099 (https://phabricator.wikimedia.org/T399699) (owner: Gerrit maintenance bot)
[06:19:08] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:19:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1255 T399699', diff saved to https://phabricator.wikimedia.org/P79289 and previous config saved to /var/cache/conftool/dbconfig/20250717-061943-marostegui.json
[06:20:02] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[06:22:29] <wikibugs>	 (PS1) Marostegui: db1211: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170227 (https://phabricator.wikimedia.org/T399298)
[06:24:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 10 hosts with reason: Maintenance
[06:24:43] <wikibugs>	 (CR) Marostegui: [C:+2] db1211: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170227 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui)
[06:25:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1255.eqiad.wmnet with reason: Maintenance
[06:26:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[06:27:49] <wikibugs>	 (PS1) Marostegui: db1255: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170228 (https://phabricator.wikimedia.org/T399298)
[06:28:16] <wikibugs>	 (CR) Marostegui: [C:+2] db1255: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170228 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui)
[06:29:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:29:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:30:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:30:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:33:20] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[06:33:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79291 and previous config saved to /var/cache/conftool/dbconfig/20250717-063327-root.json
[06:34:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:34:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance
[06:34:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:35:51] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: Jcrespo)
[06:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:39:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:39:16] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[06:41:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:48:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79292 and previous config saved to /var/cache/conftool/dbconfig/20250717-064833-root.json
[06:48:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: Abijeet Patro)
[06:51:29] <wikibugs>	 (PS1) Marostegui: db2205: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170229 (https://phabricator.wikimedia.org/T399548)
[06:54:18] <wikibugs>	 (CR) Jcrespo: [C:+2] raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: Jcrespo)
[06:54:25] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:54:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:33] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6296/co" [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[06:57:48] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0700).
[07:00:05] <jouncebot>	 georgekyz, Hide_on_rosie, and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2205: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170229 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui)
[07:00:55] <georgekyz>	 Hey folks, I am going to start the deployment right now 
[07:01:09] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2205.codfw.wmnet with reason: Maintenance
[07:01:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79293 and previous config saved to /var/cache/conftool/dbconfig/20250717-070112-marostegui.json
[07:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis)
[07:02:00] <Hide_on_rosie>	 Me too
[07:02:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79294 and previous config saved to /var/cache/conftool/dbconfig/20250717-070211-root.json
[07:02:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:02:36] <wikibugs>	 (03Merged) 10jenkins-bot: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis)
[07:03:06] <logmsgbot>	 !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]]
[07:03:10] <stashbot>	 T395668: [batch #1] Enable revertrisk filters in simplewiki & trwiki  - https://phabricator.wikimedia.org/T395668
[07:03:16] <wikibugs>	 (03PS1) 10Marostegui: db1175: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170232 (https://phabricator.wikimedia.org/T399548)
[07:03:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79295 and previous config saved to /var/cache/conftool/dbconfig/20250717-070338-root.json
[07:04:44] <abijeet>	 o/
[07:05:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1175: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170232 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui)
[07:05:41] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:06:06] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:06:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79296 and previous config saved to /var/cache/conftool/dbconfig/20250717-070609-marostegui.json
[07:07:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:19] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis: Continuing with sync
[07:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:12:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79297 and previous config saved to /var/cache/conftool/dbconfig/20250717-071201-root.json
[07:13:03] <wikibugs>	 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11011982 (10Underbar_dk) I have not seen similar problems in other sites, but I have not had the opportunity to test with Commons either, unfortunately.
[07:15:23] <wikibugs>	 (03CR) 10Wangombe: [C:03+1] CX: Remove unused config related to database and cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro)
[07:16:31] <logmsgbot>	 !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]] (duration: 13m 25s)
[07:16:35] <stashbot>	 T395668: [batch #1] Enable revertrisk filters in simplewiki & trwiki  - https://phabricator.wikimedia.org/T395668
[07:16:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79298 and previous config saved to /var/cache/conftool/dbconfig/20250717-071642-root.json
[07:17:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79299 and previous config saved to /var/cache/conftool/dbconfig/20250717-071717-root.json
[07:18:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79300 and previous config saved to /var/cache/conftool/dbconfig/20250717-071844-root.json
[07:19:53] <wikibugs>	 (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[07:20:01] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[07:20:07] <georgekyz>	 folks I am finished with my deployment. Feel free to proceed. Thnx
[07:20:30] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[07:21:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:22:15] <wikibugs>	 (03PS1) 10Elukey: Revert^2 "services: configure tegola in codfw to use maps-test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170234
[07:24:34] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Revert^2 "services: configure tegola in codfw to use maps-test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170234 (owner: 10Elukey)
[07:26:46] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803 (10OKryva-WMF) 03NEW
[07:27:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79301 and previous config saved to /var/cache/conftool/dbconfig/20250717-072709-root.json
[07:28:05] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[07:28:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:30:20] <abijeet>	 georgekyz, would you have time to help deploy my change?
[07:30:39] <georgekyz>	 yeap sure 
[07:30:43] <abijeet>	 thanks!
[07:30:52] <abijeet>	 Here's the patch: 1168757: CX: Remove unused config related to database and cluster | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757
[07:31:14] <wikibugs>	 (03PS1) 10Elukey: services: set user tegola for Tegola's codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170235 (https://phabricator.wikimedia.org/T381565)
[07:31:45] <georgekyz>	 I see this as the next deployment: Hide on Rosie (Hide_on_rosie)
[07:31:45] <georgekyz>	 [config] 1169603 (Deploy change) Create "abusefilter" editor user group for Vietnamese Wikipedia - task T399535
[07:31:45] <stashbot>	 T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535
[07:31:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79302 and previous config saved to /var/cache/conftool/dbconfig/20250717-073147-root.json
[07:32:04] <georgekyz>	 https://www.irccloud.com/pastebin/QUVqbFuf/
[07:32:05] <wikibugs>	 (03CR) 10Tacsipacsi: php8.1-cli: introduce opcache and JIT (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) (owner: 10Effie Mouzeli)
[07:32:11] <Hide_on_rosie>	 Hello, I'm here
[07:32:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79303 and previous config saved to /var/cache/conftool/dbconfig/20250717-073223-root.json
[07:32:41] <georgekyz>	 Hide_on_rosie: do you want to proceed with yours?
[07:32:52] <georgekyz>	 and then I can help @abijeet 
[07:33:02] <Hide_on_rosie>	 yes, thanks
[07:33:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:34:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:34:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:35:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79304 and previous config saved to /var/cache/conftool/dbconfig/20250717-073506-marostegui.json
[07:35:08] <kart_>	 abijeet: I'm also around if you need help.
[07:35:10] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:35:40] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: set user tegola for Tegola's codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170235 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[07:37:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:38:11] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[07:38:14] <Hide_on_rosie>	 kart_: georgekyz: hello, can you help with my change?
[07:38:30] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[07:38:49] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[07:39:07] <georgekyz>	 I was looking at the patch from @abijeet
[07:39:51] <kart_>	 georgekyz: go ahead. I'm on a bad network and shouldn't be deploying.
[07:40:21] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012023 (10SCherukuwada) I approve of this request.
[07:41:13] <georgekyz>	 alright I will go first with @abijeet patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757
[07:41:20] <georgekyz>	 is anybody around as well?
[07:42:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:42:11] <abijeet>	 hey
[07:42:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79305 and previous config saved to /var/cache/conftool/dbconfig/20250717-074214-root.json
[07:42:54] <abijeet>	 georgekyz, Here's the patch: 1168757: CX: Remove unused config related to database and cluster | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757
[07:43:16] <georgekyz>	 I am starting the deployment of that one, I was checking the patch
[07:43:28] <georgekyz>	 it seems ok, let's see
[07:43:29] <georgekyz>	 starting now 
[07:43:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro)
[07:44:37] <wikibugs>	 (03Merged) 10jenkins-bot: CX: Remove unused config related to database and cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro)
[07:44:40] <georgekyz>	 abijeet: so the only thing you are doing is removing the configs for the translation cluster and the database?
[07:44:44] <georgekyz>	 right ?
[07:44:59] <abijeet>	 georgekyz, yup, we should just check that CX still functions after this
[07:45:00] <logmsgbot>	 !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]]
[07:45:12] <stashbot>	 T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[07:45:19] <georgekyz>	 alright stay around for testing it
[07:45:24] <abijeet>	 ok
[07:45:35] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012027 (10OKryva-WMF) 05Open→03Invalid
[07:45:36] <georgekyz>	 the deployment started 
[07:45:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012028 (10OKryva-WMF) Requested through https://idm.wikimedia.org/permissions/ instead.
[07:46:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79306 and previous config saved to /var/cache/conftool/dbconfig/20250717-074653-root.json
[07:47:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:47:19] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis, abi: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:47:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79307 and previous config saved to /var/cache/conftool/dbconfig/20250717-074728-root.json
[07:48:25] <georgekyz>	 abijeet: now is the time to test it 
[07:48:39] <georgekyz>	 I am not clicking sync
[07:48:47] <Hide_on_rosie>	 georgekyz: how long does it take
[07:48:55] <abijeet>	 georgekyz, ok, on it
[07:49:54] <georgekyz>	 Hide_on_rosie: once @abijeet finishes testing, it will take around 5 mins. If something goes wrong, we need to revert it and deploy the revert, which means more time.
[07:49:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245
[07:50:06] <Hide_on_rosie>	 thanks
[07:50:20] <wikibugs>	 (03PS2) 10Effie Mouzeli: prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245
[07:50:27] <georgekyz>	 Hide_on_rosie: your patch seems to be kinda bigger and I need to review it first; I cannot take responsibility for deploying something without reviewing it
[07:51:27] <Hide_on_rosie>	 sure, go ahead :)
[07:54:13] <georgekyz>	 abijeet: how are you testing this? 
[07:54:26] <abijeet>	 georgekyz, need 1 more minute
[07:54:40] <georgekyz>	 no worries just asking take your time
[07:55:09] <abijeet>	 georgekyz, i think we are good
[07:55:26] <georgekyz>	 how can I test it as well?
[07:57:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79308 and previous config saved to /var/cache/conftool/dbconfig/20250717-075720-root.json
[07:57:48] <abijeet>	 georgekyz, go to Special:ContentTranslation and start translating an article, you can try publishing it to your namespace
[07:58:08] <abijeet>	 (on any wikipedia)
[08:00:05] <jouncebot>	 dancy and andre: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0800).
[08:00:17] <georgekyz>	 abijeet: it seems that it is working. I can just click on the text and see the automated translation on the right
[08:00:33] <abijeet>	 yup
[08:00:33] <georgekyz>	 abijeet: are we good to go? Click Sync? Do you need extra testing ?
[08:01:39] <abijeet>	 georgekyz, yup we can sync
[08:01:47] <georgekyz>	 alrighty ! 
[08:01:51] <logmsgbot>	 !log gkyziridis@deploy1003 gkyziridis, abi: Continuing with sync
[08:02:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79309 and previous config saved to /var/cache/conftool/dbconfig/20250717-080159-root.json
[08:02:40] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:02:55] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:05:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245 (owner: 10Effie Mouzeli)
[08:07:15] <logmsgbot>	 !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]] (duration: 22m 15s)
[08:07:20] <stashbot>	 T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513
[08:07:38] <georgekyz>	 Patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757 
[08:07:44] <georgekyz>	 deployed successfully! 
[08:08:13] <georgekyz>	 congrats @abijeet 
[08:08:30] <abijeet>	 georgekyz, thank you. I'll do another sanity check
[08:08:36] <georgekyz>	 yes please
[08:09:04] <georgekyz>	 abijeet: if you see something going wrong please create a revert patch and schedule it for deployment asap
[08:09:22] <georgekyz>	 let me know if everything is fine please :P 
[08:11:11] <Hide_on_rosie>	 Hi, are you all done
[08:11:18] <abijeet>	 georgekyz, looks ok.
[08:12:13] <georgekyz>	 abijeet: thnx a lot for sharing! congrats!
[08:12:58] <georgekyz>	 Hide_on_rosie: we are finished with @abijeet patch
[08:13:25] <Hide_on_rosie>	 okay, what do I have to do now
[08:13:34] <Hide_on_rosie>	 my WikimediaDebug is ready
[08:14:41] <georgekyz>	 Hide_on_rosie: the deployment window has already come to an end. I did not have the time to review your patch yet :(
[08:14:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:15:09] <Hide_on_rosie>	 :( 
[08:15:21] <georgekyz>	 Hide_on_rosie: it would be good to reschedule it for the next time window, and find someone to review it 
[08:15:26] <georgekyz>	 and deploy it
[08:15:36] <Hide_on_rosie>	 okay
[08:17:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[08:19:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:25:32] <wikibugs>	 (03PS1) 10Vgutierrez: pyrra: Limit istio latency SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534)
[08:25:47] <Hide_on_rosie>	 georgekyz: may I ask
[08:25:51] <Hide_on_rosie>	 what is your timezone
[08:26:30] <georgekyz>	 UTC+3, right now time is 11:26 in the morning
[08:27:40] <georgekyz>	 Hide_on_rosie: I would suggest finding another deployer who will be available, because I am kinda busy with other tasks and meetings today. I am not sure if @kart_ would be available to help you
[08:28:07] <Hide_on_rosie>	 okay
[08:28:16] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez)
[08:37:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Yeah, I normally just use -t." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans)
[08:40:21] <wikibugs>	 (03PS2) 10Vgutierrez: pyrra: Limit istio SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534)
[08:42:13] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez)
[08:43:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79310 and previous config saved to /var/cache/conftool/dbconfig/20250717-084308-marostegui.json
[08:43:14] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:43:30] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Looks great, thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez)
[08:46:16] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11012208 (10elukey) Adding some issues that I found when moving Tegola to the maps-test2* cluster, so I don't forget:  - For some reason the tegola user had the wrong password set, I...
[08:47:29] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11012262 (10elukey)
[08:50:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] pyrra: Limit istio SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez)
[08:57:22] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854)
[08:58:06] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278
[08:58:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P79311 and previous config saved to /var/cache/conftool/dbconfig/20250717-085815-marostegui.json
[08:59:19] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278 (owner: 10PipelineBot)
[09:00:55] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278 (owner: 10PipelineBot)
[09:11:42] <wikibugs>	 (03PS4) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[09:12:58] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:13:01] <wikibugs>	 (03PS1) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399779)
[09:13:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P79312 and previous config saved to /var/cache/conftool/dbconfig/20250717-091323-marostegui.json
[09:13:50] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:14:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11012409 (10elukey) @Jclark-ctr Hi! I think that these servers don't have the calvin password set up (sigh), so I'd need the BMC passwords to test a new version of the...
[09:18:39] <wikibugs>	 (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689)
[09:19:45] <wikibugs>	 (03PS2) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778)
[09:19:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:19:52] <wikibugs>	 (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob)
[09:24:13] <wikibugs>	 (03CR) 10Btullis: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol)
[09:24:23] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:24:53] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:27:13] <wikibugs>	 (03CR) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol)
[09:27:17] <wikibugs>	 (03PS3) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778)
[09:28:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79313 and previous config saved to /var/cache/conftool/dbconfig/20250717-092831-marostegui.json
[09:28:37] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:28:46] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:28:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79314 and previous config saved to /var/cache/conftool/dbconfig/20250717-092854-marostegui.json
[09:32:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11012447 (10cmooney) Arelion came back to say they no longer see CRC errors on their side: ` Please note we are not detecting errors in our interface on Dallas e...
[09:32:15] <wikibugs>	 (03CR) 10Jakob: [C:03+2] "deploying now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob)
[09:33:53] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob)
[09:34:39] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[09:34:54] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[09:35:22] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[09:35:41] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[09:36:02] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[09:36:18] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[09:40:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11012461 (10cmooney) Arelion have declared the situation is resolved: ` 7/17/2025 9:00:40 AM   Cause of Outage: This incident initially originated under a separ...
[09:44:01] <wikibugs>	 (03PS1) 10Vgutierrez: acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419)
[09:44:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez)
[09:46:01] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003)
[09:46:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[09:49:25] <wikibugs>	 (03PS2) 10Tiziano Fogli: prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003)
[09:50:20] <wikibugs>	 (03PS6) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[09:50:28] <wikibugs>	 (03PS2) 10Vgutierrez: acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419)
[09:50:46] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:36] <wikibugs>	 (03CR) 10Stang: zhwiki: Allow local securepoll setup (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[09:52:01] <wikibugs>	 (03CR) 10Stang: "Resolved" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[09:52:43] <wikibugs>	 (03PS7) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[09:53:10] <jinxer-wm>	 FIRING: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti6&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[09:53:22] <wikibugs>	 (03PS8) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[09:56:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:44] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez)
[09:57:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[09:58:10] <jinxer-wm>	 RESOLVED: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti6&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[09:58:10] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1000)
[10:09:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:11:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79315 and previous config saved to /var/cache/conftool/dbconfig/20250717-101156-marostegui.json
[10:12:02] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:13:34] <wikibugs>	 (03PS2) 10Ayounsi: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392)
[10:14:13] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:14:53] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 30182
[10:15:55] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:16:15] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:16:15] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30182
[10:16:22] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:18:04] <wikibugs>	 (03PS5) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:18:49] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:20:26] <wikibugs>	 (03PS1) 10Tiziano Fogli: prom/metamonitor: simplify PQL query to retrieve instance list [puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003)
[10:23:16] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:23:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[10:23:38] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:23:58] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:24:03] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:24:46] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:24:52] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[10:25:01] <wikibugs>	 (03CR) 10Ayounsi: Ganeti Bird BGP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[10:25:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:26:41] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: simplify PQL query to retrieve instance list [puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[10:27:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P79316 and previous config saved to /var/cache/conftool/dbconfig/20250717-102704-marostegui.json
[10:27:35] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:28:55] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:32:00] <wikibugs>	 (03PS1) 10FNegri: admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1170287
[10:32:45] <wikibugs>	 (03PS2) 10FNegri: admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1170287
[10:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:40:35] <wikibugs>	 (03PS1) 10Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069)
[10:40:50] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012714 (10brouberol)
[10:40:51] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012715 (10brouberol) 05Open→03In progress
[10:42:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P79317 and previous config saved to /var/cache/conftool/dbconfig/20250717-104211-marostegui.json
[10:42:24] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012730 (10brouberol) a:03brouberol
[10:47:25] <wikibugs>	 (03PS6) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:48:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:48:35] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:48:57] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:49:03] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[10:49:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol)
[10:49:19] <wikibugs>	 (03PS7) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[10:49:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[10:49:45] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:50:19] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[10:51:01] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[10:52:07] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:52:09] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[10:52:17] <wikibugs>	 (03CR) 10Ayounsi: "PCC shows some `neighbor  external;` I *think* that it's because of PCC and it would be fine in prod, but to be double checked." [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[10:52:49] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[10:52:59] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:54:08] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:54:25] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:55:15] <wikibugs>	 (03PS1) 10Btullis: Tweak the java options for hive-metastore on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711)
[10:55:17] <wikibugs>	 (03PS1) 10Btullis: Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711)
[10:55:41] <wikibugs>	 (03PS2) 10Btullis: Tweak the java options for hive-metastore on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711)
[10:55:56] <wikibugs>	 (03PS2) 10Btullis: Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711)
[10:56:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis)
[10:56:07] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:57:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79318 and previous config saved to /var/cache/conftool/dbconfig/20250717-105719-marostegui.json
[10:57:23] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:57:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[10:57:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79319 and previous config saved to /var/cache/conftool/dbconfig/20250717-105741-marostegui.json
[11:00:13] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[11:02:46] <wikibugs>	 (03PS1) 10Marostegui: db1166: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170297 (https://phabricator.wikimedia.org/T399548)
[11:03:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1166: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170297 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui)
[11:03:39] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.007 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[11:03:59] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:04:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[11:04:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79320 and previous config saved to /var/cache/conftool/dbconfig/20250717-110405-marostegui.json
[11:04:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!  Ping me if you've any issues" [puppet] - 10https://gerrit.wikimedia.org/r/1170287 (owner: 10FNegri)
[11:05:19] <wikibugs>	 (03PS1) 10Marostegui: db2227: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170300 (https://phabricator.wikimedia.org/T399548)
[11:06:00] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis)
[11:06:43] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis)
[11:08:42] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM install1004.wikimedia.org
[11:09:01] <wikibugs>	 (03CR) 10Jelto: [C:03+1] gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb)
[11:09:08] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:10:17] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2227: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170300 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui)
[11:11:28] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2227.codfw.wmnet with reason: Maintenance
[11:11:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2227 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79321 and previous config saved to /var/cache/conftool/dbconfig/20250717-111132-marostegui.json
[11:13:27] <icinga-wm>	 RECOVERY - MegaRAID on backup1007 is OK: OK: optimal, 1 logical, 24 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:14:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[11:14:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[11:14:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79323 and previous config saved to /var/cache/conftool/dbconfig/20250717-111454-root.json
[11:15:15] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install1004.wikimedia.org
[11:16:21] <wikibugs>	 (03CR) 10Marostegui: "I don't find any explanation for this: https://phabricator.wikimedia.org/P79324" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto)
[11:17:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance
[11:17:36] <marostegui>	 !log Restart pc4 T399540
[11:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:42] <stashbot>	 T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[11:19:39] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027)
[11:22:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79325 and previous config saved to /var/cache/conftool/dbconfig/20250717-112220-root.json
[11:22:27] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet
[11:23:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:23:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:24:18] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[11:24:18] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1176.eqiad.wmnet
[11:24:36] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1179.eqiad.wmnet
[11:24:37] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[11:25:51] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm
[11:26:53] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1179.eqiad.wmnet
[11:27:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:27:13] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1186.eqiad.wmnet
[11:28:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:29:17] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:30:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79326 and previous config saved to /var/cache/conftool/dbconfig/20250717-113000-root.json
[11:30:14] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1186.eqiad.wmnet
[11:33:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[11:34:17] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:37:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79327 and previous config saved to /var/cache/conftool/dbconfig/20250717-113726-root.json
[11:38:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[11:38:58] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm
[11:41:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[11:41:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[11:41:40] <wikibugs>	 (03CR) 10Marostegui: "Never mind this, I was using the wrong order!. All works fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto)
[11:42:06] <wikibugs>	 (03CR) 10Marostegui: "This is still something we should try to improve" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto)
[11:42:12] <wikibugs>	 (03PS1) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703)
[11:43:41] <wikibugs>	 (03PS2) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703)
[11:44:55] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[11:45:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79329 and previous config saved to /var/cache/conftool/dbconfig/20250717-114506-root.json
[11:45:40] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[11:50:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11012932 (10Jclark-ctr) 05Open→03Resolved Replaced Failed Drive Thanks for the assistance with this @BTullis
[11:52:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79330 and previous config saved to /var/cache/conftool/dbconfig/20250717-115232-root.json
[11:59:55] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1200)
[12:00:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79332 and previous config saved to /var/cache/conftool/dbconfig/20250717-120014-root.json
[12:02:40] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:04:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79333 and previous config saved to /var/cache/conftool/dbconfig/20250717-120444-marostegui.json
[12:04:49] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[12:05:24] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:05:36] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:07:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79334 and previous config saved to /var/cache/conftool/dbconfig/20250717-120738-root.json
[12:10:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "LGTM! Thanks" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French)
[12:13:17] <icinga-wm>	 RECOVERY - MinIO server processes on backup1007 is OK: PROCS OK: 1 process with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups
[12:18:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013011 (jcrespo) I told @Jclark-ctr not to replace the 13th disk yet, as I was more worried about the jbod ones than the RAID: ` root@backup1007:~$ megacli -PDList -aall | grep rro Media Error Count: 0 O...
[12:19:13] <wikibugs>	 (CR) Btullis: [C:+1] hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027) (owner: Stevemunene)
[12:19:38] <wikibugs>	 (CR) Btullis: [C:+2] Tweak the java options for hive-metastore on the test cluster [puppet] - https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) (owner: Btullis)
[12:19:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P79335 and previous config saved to /var/cache/conftool/dbconfig/20250717-121952-marostegui.json
[12:20:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013013 (jcrespo) Note my prediction is that we will need 3 new disks, not only 1 to be replaced (but this can be resolve for now).
[12:21:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013021 (Jclark-ctr) Open→Resolved Updated Firmware on idrac   while logged in  thanks for assistance @jcrespo
[12:23:16] <wikibugs>	 (PS1) Jforrester: PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641)
[12:26:54] <wikibugs>	 (CR) Jcrespo: [C:+2] "Icinga now says: "communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK"" [puppet] - https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: Jcrespo)
[12:28:25] <wikibugs>	 (CR) FNegri: [C:+2] admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - https://gerrit.wikimedia.org/r/1170287 (owner: FNegri)
[12:30:12] <wikibugs>	 (CR) Stevemunene: [C:+2] hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027) (owner: Stevemunene)
[12:30:14] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170326
[12:35:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P79336 and previous config saved to /var/cache/conftool/dbconfig/20250717-123459-marostegui.json
[12:35:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:36:36] <wikibugs>	 (CR) Btullis: [C:+2] Apply the hive-metastore GC changes to production [puppet] - https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: Btullis)
[12:36:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:41:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:43:21] <wikibugs>	 (PS2) Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069)
[12:50:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79337 and previous config saved to /var/cache/conftool/dbconfig/20250717-125007-marostegui.json
[12:50:13] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[12:50:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[12:50:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79338 and previous config saved to /var/cache/conftool/dbconfig/20250717-125029-marostegui.json
[12:51:45] <wikibugs>	 (CR) CI reject: [V:-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol)
[12:53:44] <wikibugs>	 (PS3) Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069)
[12:54:43] <wikibugs>	 (PS1) Btullis: "Fail over hive services to an-coord1004" [dns] - https://gerrit.wikimedia.org/r/1170331
[12:58:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:59:20] <wikibugs>	 (CR) Btullis: [C:+2] "Fail over hive services to an-coord1004" [dns] - https://gerrit.wikimedia.org/r/1170331 (owner: Btullis)
[12:59:31] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1300).
[13:00:05] <jouncebot>	 joelyrookewmde and Hide_on_rosie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:18] <joelyrookewmde>	 \o/
[13:00:26] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[13:00:51] <Lucas_WMDE>	 I can probably deploy in 15 minutes or so but not yet :)
[13:01:05] <Hide_on_rosie>	 oh no :(
[13:03:39] <wikibugs>	 (CR) CI reject: [V:-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol)
[13:04:24] <wikibugs>	 (CR) Brouberol: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol)
[13:09:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:09:33] <Lucas_WMDE>	 o/
[13:09:35] <Lucas_WMDE>	 now I can deploy ^^
[13:09:45] <Hide_on_rosie>	 hi
[13:10:18] <suzannewoodWMDE6>	 we are here for T388685
[13:10:19] <stashbot>	 T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[13:11:06] <Hide_on_rosie>	 and I'm here for T399535
[13:11:07] <stashbot>	 T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535
[13:11:20] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Activate feature to resolve changelist wikibase link labels in all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: Joely Rooke WMDE)
[13:11:34] <Lucas_WMDE>	 whoa that’s a lot of “PHP Deprecated” in logspam-watch
[13:11:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: Joely Rooke WMDE)
[13:12:25] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] Activate feature to resolve changelist wikibase link labels in all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: Joely Rooke WMDE)
[13:12:48] <wikibugs>	 (Merged) jenkins-bot: Activate feature to resolve changelist wikibase link labels in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: Joely Rooke WMDE)
[13:13:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169077|Activate feature to resolve changelist wikibase link labels in all wikis (T388685)]]
[13:14:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:15:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 joelyrookewmde, lucaswerkmeister-wmde: Backport for [[gerrit:1169077|Activate feature to resolve changelist wikibase link labels in all wikis (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:15:30] <stashbot>	 T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[13:16:02] <Lucas_WMDE>	 joelyrookewmde, suzannewoodWMDE6: please test :)
[13:16:18] <joelyrookewmde>	 can do, but I can't see any 1001 or 1002 servers in the extension
[13:16:23] <joelyrookewmde>	 which should we use for testing?
[13:16:53] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good but question: is there a reason you want to remove older than 365 days and a smaller interval like 6 months or something, given" [puppet] - https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: Vgutierrez)
[13:18:20] <seanleong-wmde>	 Lucas_WMDE: it's working
[13:18:51] <Lucas_WMDE>	 joelyrookewmde: k8s-mwdebug is the one you should be using these days
[13:18:53] <wikibugs>	 (CR) Ssingh: [C:+1] "*and _not_ a shorter interval like 6 months" [puppet] - https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: Vgutierrez)
[13:19:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 joelyrookewmde, lucaswerkmeister-wmde: Continuing with sync
[13:19:18] <wikibugs>	 (PS1) Jforrester: [metawiki] Set site name to 'Meta-Wiki', not just 'Meta' [mediawiki-config] - https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843)
[13:19:51] <wikibugs>	 (CR) Jforrester: [C:-2] "Waiting for community consensus first." [mediawiki-config] - https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: Jforrester)
[13:20:11] <wikibugs>	 (PS1) Andrew Bogott: cloudceph osd.yaml: update nic names for 1006 [puppet] - https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281)
[13:20:54] <wikibugs>	 (CR) David Caro: [C:+1] cloudceph osd.yaml: update nic names for 1006 [puppet] - https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281) (owner: Andrew Bogott)
[13:24:10] <seanleong-wmde>	 Lucas_WMDE Thanks for helping us with the deployment!
[13:24:16] <Lucas_WMDE>	 np :)
[13:24:43] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169077|Activate feature to resolve changelist wikibase link labels in all wikis (T388685)]] (duration: 11m 30s)
[13:24:47] <stashbot>	 T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[13:24:53] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "LGTM (AFAICT urbanecm’s concern was addressed)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[13:25:18] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[13:26:11] <wikibugs>	 (Merged) jenkins-bot: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[13:26:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]]
[13:26:39] <stashbot>	 T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535
[13:27:11] <Hide_on_rosie>	 Thanks, Lucas_WMDE!
[13:28:35] <wikibugs>	 (PS1) David Caro: prometheus-node-pinger: fix the script to return 1 on failure [puppet] - https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281)
[13:28:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tryvix1509: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:28:57] <wikibugs>	 (CR) Vgutierrez: "no good reason besides erring on the cautious side of things" [puppet] - https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: Vgutierrez)
[13:31:07] <Lucas_WMDE>	 Hide_on_rosie: please test :)
[13:32:46] <wikibugs>	 (CR) Andrew Bogott: [C:+1] prometheus-node-pinger: fix the script to return 1 on failure [puppet] - https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: David Caro)
[13:33:53] <Lucas_WMDE>	 https://vi.wikipedia.org/w/index.php?title=%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng&uselang=vi looks good to me FWIW (the abusefilter group gets four rights: changetags, managechangetags, abusefilter-modify, oathauth-enable)
[13:34:02] <Hide_on_rosie>	 seems ok
[13:34:08] <Hide_on_rosie>	 https://usercontent.irccloud-cdn.com/file/Esn3BO1r/image.png
[13:34:09] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tryvix1509: Continuing with sync
[13:34:11] <Lucas_WMDE>	 ok!
[13:35:09] <wikibugs>	 (CR) Ssingh: [C:+1] "OK, I guess 365 days is definitely a start, so +1." [puppet] - https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: Vgutierrez)
[13:35:17] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170345
[13:35:32] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Enable hCaptcha on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz)
[13:35:40] <Hide_on_rosie>	 https://usercontent.irccloud-cdn.com/file/NwZsydTQ/image.png
[13:36:43] <wikibugs>	 (CR) David Caro: "Tested:" [puppet] - https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: David Caro)
[13:36:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79340 and previous config saved to /var/cache/conftool/dbconfig/20250717-133641-marostegui.json
[13:36:45] <wikibugs>	 (CR) David Caro: [C:+2] prometheus-node-pinger: fix the script to return 1 on failure [puppet] - https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: David Caro)
[13:36:48] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:37:17] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudceph osd.yaml: update nic names for 1006 [puppet] - https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281) (owner: Andrew Bogott)
[13:38:42] <Hide_on_rosie>	 Lucas_WMDE: Since this is my first commit to gerrit, I would like to ask: does it sync to the beta cluster?
[13:39:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]] (duration: 13m 12s)
[13:39:51] <stashbot>	 T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535
[13:40:54] <Lucas_WMDE>	 yes, it will deploy to the beta cluster automatically
[13:40:57] <Lucas_WMDE>	 usually within ten minutes
[13:41:11] <Hide_on_rosie>	 Okay, thanks for your help
[13:42:08] <Lucas_WMDE>	 I can already see it at https://vi.wikipedia.beta.wmcloud.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng :)
[13:42:26] <Hide_on_rosie>	 :oo
[13:43:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11013297 (elukey) Ok so I have a provision script change that seems to work, but it doesn't touch anything on the network PXE / FixedBootOrder config (except ensuring...
[13:44:41] <Hide_on_rosie>	 https://usercontent.irccloud-cdn.com/file/JYWLAhSY/IMG_4264.PNG
[13:44:46] <wikibugs>	 (PS1) Andrew Bogott: cloudceph osd.yaml: update nic names for 1006 again [puppet] - https://gerrit.wikimedia.org/r/1170346 (https://phabricator.wikimedia.org/T399281)
[13:44:55] <Hide_on_rosie>	 Lucas_WMDE: Why does it have only 3 rights
[13:45:01] <Hide_on_rosie>	 on beta cluster
[13:45:56] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudceph osd.yaml: update nic names for 1006 again [puppet] - https://gerrit.wikimedia.org/r/1170346 (https://phabricator.wikimedia.org/T399281) (owner: Andrew Bogott)
[13:46:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:47:28] <Lucas_WMDE>	 hmm
[13:47:34] <Lucas_WMDE>	 that’s a fair question Hide_on_rosie
[13:47:43] <Lucas_WMDE>	 I guess the magic code adding the oathauth stuff isn’t active on beta?
[13:48:18] <Lucas_WMDE>	 which sounds unfortunate because it’s definitely still useful on beta (cf. T396061)
[13:48:19] <stashbot>	 T396061: Groups requiring 2FA via $wgOATHRequiredForGroups do not clearly warn users without 2FA that their permissions were truncated - https://phabricator.wikimedia.org/T396061
[13:48:23] * Lucas_WMDE looks a bit
[13:49:13] <Hide_on_rosie>	 hmm
[13:49:38] <Lucas_WMDE>	 aha
[13:49:44] <Lucas_WMDE>	 on beta, *everyone* has the oathauth-enable right
[13:49:56] <Lucas_WMDE>	 therefore there’s no need to give it to the $wmgPriviligedGroups in addition to that
[13:50:12] <Lucas_WMDE>	 you can see it in the Thành viên thông thường (user) group at https://vi.wikipedia.beta.wmcloud.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng
[13:50:23] <Hide_on_rosie>	 oh, nice
[13:50:44] <Lucas_WMDE>	 https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/92d68cca33cec238d9577899f35f21045628c835/wmf-config/CommonSettings.php#4024 is the code that reassigns the oathauth-enable right from user to privileged groups on production
[13:51:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:51:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P79341 and previous config saved to /var/cache/conftool/dbconfig/20250717-135150-marostegui.json
[13:53:28] <icinga-wm>	 PROBLEM - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:30] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T399847 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847 (ops-monitoring-bot) NEW
[13:56:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:06:16] <wikibugs>	 (PS1) Btullis: Revert ""Fail over hive services to an-coord1004"" [dns] - https://gerrit.wikimedia.org/r/1170351
[14:06:40] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:06:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P79342 and previous config saved to /var/cache/conftool/dbconfig/20250717-140658-marostegui.json
[14:07:55] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:09:42] <kostajh>	 jouncebot: nowandnext
[14:09:43] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[14:09:43] <jouncebot>	 In 0 hour(s) and 20 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1430)
[14:10:19] <wikibugs>	 (PS1) Kosta Harlan: Prevent submissions of forms using hCaptcha until ready [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619)
[14:22:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79343 and previous config saved to /var/cache/conftool/dbconfig/20250717-142205-marostegui.json
[14:22:10] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[14:22:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[14:22:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79344 and previous config saved to /var/cache/conftool/dbconfig/20250717-142228-marostegui.json
[14:30:04] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1430)
[14:31:19] <wikibugs>	 (CR) Vgutierrez: [C:+2] acme_chief: Remove certs older than 1 year [puppet] - https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: Vgutierrez)
[14:34:36] <wikibugs>	 (PS1) Tiziano Fogli: prom/metamonitor: hide DeadManSwitch alerts in Karma [puppet] - https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003)
[14:35:49] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[14:35:50] <wikibugs>	 (PS3) Ayounsi: Ganeti Bird BGP [puppet] - https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392)
[14:35:56] <wikibugs>	 (CR) Scott French: [C:+2] shellbox: revert to httpd-fcgi image [deployment-charts] - https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[14:36:42] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: Ayounsi)
[14:37:36] <wikibugs>	 (PS1) Eevans: data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856)
[14:38:19] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: depool eqsin to test backhaul cct packet loss, T399221]
[14:38:22] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: depool eqsin to test backhaul cct packet loss, T399221]
[14:38:23] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[14:38:31] <wikibugs>	 (Merged) jenkins-bot: shellbox: revert to httpd-fcgi image [deployment-charts] - https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[14:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:40:03] <wikibugs>	 (CR) Tiziano Fogli: "Patch ready for review." [puppet] - https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[14:40:29] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[14:40:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply
[14:40:49] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[14:41:20] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[14:41:28] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[14:41:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[14:42:07] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[14:42:38] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[14:42:52] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[14:42:57] <wikibugs>	 (CR) Eevans: [C:+2] data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[14:43:23] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[14:43:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[14:44:03] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[14:44:08] <wikibugs>	 (PS1) Stevemunene: dns: Add dse-k8s codfw urls [dns] - https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293)
[14:44:24] <logmsgbot>	 !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[14:48:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply
[14:49:07] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[14:49:38] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[14:49:54] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[14:50:25] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[14:50:41] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[14:51:12] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[14:51:32] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[14:52:04] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[14:52:28] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[14:52:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[14:53:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[14:53:57] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:54:25] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:55:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:59:13] <wikibugs>	 (PS1) Hasan Akgün (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1170367
[15:00:05] <jouncebot>	 dancy and andre: gettimeofday() says it's time for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500)
[15:00:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:01:44] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.189.0" for 2 host(s)
[15:03:31] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.189.0" completed for 2 hosts
[15:05:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:05:40] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79345 and previous config saved to /var/cache/conftool/dbconfig/20250717-150659-marostegui.json
[15:07:04] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:07:26] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1170367 (owner: Hasan Akgün (WMDE))
[15:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:10:40] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:12:10] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[15:12:59] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[15:13:15] <topranks>	 !log disable one of the 2x10G links connected to Equinix IXP Peering on cr1-codfw 
[15:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:31] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[15:13:48] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[15:14:19] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[15:14:28] <kostajh>	 jouncebot: nowandnext
[15:14:28] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500)
[15:14:28] <jouncebot>	 In 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600)
[15:14:34] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[15:15:03] <wikibugs>	 (CR) Btullis: [C:+2] Revert ""Fail over hive services to an-coord1004"" [dns] - https://gerrit.wikimedia.org/r/1170351 (owner: Btullis)
[15:15:05] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[15:15:23] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[15:15:24] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[15:15:56] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[15:16:19] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[15:16:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:52] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[15:17:21] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[15:17:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619) (owner: Kosta Harlan)
[15:19:26] <wikibugs>	 (Merged) jenkins-bot: Prevent submissions of forms using hCaptcha until ready [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619) (owner: Kosta Harlan)
[15:19:48] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]]
[15:19:53] <stashbot>	 T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619
[15:20:27] <wikibugs>	 (PS8) Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357)
[15:21:34] <dancy>	 kostajh: FYI your deployment will take a long time due to the l10n files being updated.
[15:21:42] <kostajh>	 Yes, I know 
[15:21:57] <kostajh>	 Hopefully that is OK for everyone else? 
[15:22:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P79346 and previous config saved to /var/cache/conftool/dbconfig/20250717-152207-marostegui.json
[15:22:18] <dancy>	 Yep.  No problem.  Sometimes it catches people by surprise so I thought I'd mention it.
[15:22:24] <kostajh>	 ack, thanks 
[15:23:34] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1170367 (owner: Hasan Akgün (WMDE))
[15:24:03] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+2] "deploying" [deployment-charts] - https://gerrit.wikimedia.org/r/1170367 (owner: Hasan Akgün (WMDE))
[15:25:47] <wikibugs>	 (Merged) jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1170367 (owner: Hasan Akgün (WMDE))
[15:26:07] <wikibugs>	 (CR) Ssingh: "Looking good, let's add a hiera to actually get a PCC output." [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[15:27:22] <Lucas_WMDE>	 jouncebot: now
[15:27:22] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500)
[15:27:46] <Lucas_WMDE>	 I’ll deploy an update to the wikidata query builder (helmfile.d stuff), shouldn’t affect train log triage or anything else I expect
[15:27:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[15:28:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[15:28:12] <topranks>	 !log un-drain Arelion transport circuit from codfw -> eqsin to test performance T399221
[15:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:16] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[15:28:18] <wikibugs>	 (PS5) Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912)
[15:28:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[15:29:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[15:29:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[15:29:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[15:31:15] * Lucas_WMDE done deploying
[15:35:23] <wikibugs>	 (PS1) Zabe: Set categorylinks to read new on remaining s2 large wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912)
[15:35:52] <wikibugs>	 (PS2) Zabe: Set categorylinks to read new on remaining large s2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912)
[15:37:14] <wikibugs>	 (CR) Btullis: dns: Add dse-k8s codfw urls (1 comment) [dns] - https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene)
[15:37:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P79347 and previous config saved to /var/cache/conftool/dbconfig/20250717-153715-marostegui.json
[15:37:52] <wikibugs>	 (CR) Btullis: dns: Add dse-k8s codfw urls (1 comment) [dns] - https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene)
[15:39:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11013970 (cmooney) Resolved→Open @Jclark-ctr as discussed in our call on Tuesday we will be connecting the second SFP port...
[15:42:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11013994 (cmooney) Not sure how to progress this one.  Still see zero packet loss over the link, even running for a longer period (5 mins this time): ` cmooney@...
[15:44:14] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:44:18] <stashbot>	 T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619
[15:45:32] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics@9fc3ae8]: Pushing new artifacts
[15:45:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11014023 (ssingh) > Arelion want to close the ticket as they see no issue.  I asked that they don't.  Perhaps for now we just leave eqsin depooled and the circu...
[15:46:13] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@9fc3ae8]: Pushing new artifacts (duration: 00m 41s)
[15:48:15] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:50:04] <logmsgbot>	 !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@9fc3ae8]: Pushing new artifacts
[15:50:22] <logmsgbot>	 !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@9fc3ae8]: Pushing new artifacts (duration: 00m 17s)
[15:51:53] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[15:52:04] <wikibugs>	 (CR) Samtar: ":3" [puppet] - https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) (owner: Samtar)
[15:52:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:52:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79348 and previous config saved to /var/cache/conftool/dbconfig/20250717-155223-marostegui.json
[15:52:28] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:52:39] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:54:25] <wikibugs>	 (PS3) Scott French: php8.3: initial release of 8.3 image stack [docker-images/production-images] - https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246)
[15:56:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11014121 (cmooney) >>! In T399221#11014023, @ssingh wrote: > I think leaving eqsin depooled given that it is off-peak there and observing this for a few hours i...
[15:57:00] <wikibugs>	 (PS1) Jforrester: ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170376 (https://phabricator.wikimedia.org/T399755)
[15:57:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:58:12] <wikibugs>	 SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#11014129 (elukey) We discovered this Pyrra bug https://github.com/pyrra-dev/pyrra/issues/667 that is affecting all the SLOs that are istio based. The Pyrra UI assumes that the metrics are in s...
[16:00:04] <jouncebot>	 jhathaway and moritzm: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[16:02:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[16:02:22] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
[16:02:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
[16:04:35] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]] (duration: 44m 46s)
[16:04:39] <stashbot>	 T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619
[16:07:55] <wikibugs>	 (CR) Scott French: "Thanks, Effie!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: Scott French)
[16:08:24] <wikibugs>	 (CR) BCornwall: [C:+1] hiera: service.yaml: use better aliasing for text/upload [puppet] - https://gerrit.wikimedia.org/r/1168192 (owner: Ssingh)
[16:10:08] <mszabo>	 jouncebot: nowandnext
[16:10:08] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600)
[16:10:08] <jouncebot>	 In 0 hour(s) and 49 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700)
[16:10:08] <jouncebot>	 In 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700)
[16:10:14] <mszabo>	 wonderful
[16:10:39] <wikibugs>	 (PS1) Máté Szabó: Load hCaptcha on first form interaction [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849)
[16:11:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849) (owner: Máté Szabó)
[16:12:34] <wikibugs>	 SRE-OnFire, cloud-services-team, Toolforge, Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11014189 (fnegri)
[16:12:40] <wikibugs>	 SRE-OnFire, Cloud-VPS, cloud-services-team (FY2025/26-Q1), Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11014190 (fnegri)
[16:14:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11014191 (Eevans) Open→Resolved This is now complete.  For posterity's sake: We weren't able to salvage the data, the cluster was reimaged and the data on it rebuilt.
[16:16:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014196 (Jclark-ctr) a:Jclark-ctr
[16:16:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014198 (Jclark-ctr) @jcrespo  just fyi automated ticket was opened again for this host
[16:18:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014215 (jcrespo) This time it's fully Failed, so please change it. Do I stop the server first?
[16:21:48] <wikibugs>	 ops-codfw, SRE, DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11014224 (cmooney) Crickets in the main from Arelion, one update earlier. ` 2025-07-17 14:08  Dear Customer,  Please be advised that we are seeing some errors...
[16:24:38] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:25:10] <wikibugs>	 (PS1) Aqu: Analyics: Refine restore monitor timerange [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845)
[16:25:21] <wikibugs>	 (Merged) jenkins-bot: Load hCaptcha on first form interaction [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849) (owner: Máté Szabó)
[16:25:46] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]]
[16:25:47] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup1007.eqiad.wmnet with reason: failed disk
[16:25:50] <stashbot>	 T399849: hCaptcha: Load hCaptcha JS after first form interaction - https://phabricator.wikimedia.org/T399849
[16:25:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014239 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ce8f4e27-d454-43c0-b1b5-892d46c710a6) set by jynus@cumin1003 for 1 day, 0:00:00 on 1 host(s) and their services with reason: faile...
[16:26:12] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[16:26:13] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1053.eqiad.wmnet with OS bookworm
[16:26:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm completed...
[16:26:38] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:27:27] <wikibugs>	 (CR) CI reject: [V:-1] Analyics: Refine restore monitor timerange [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[16:27:43] <wikibugs>	 (CR) ZhaoFJx: zhwiki: Allow local securepoll setup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[16:28:24] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:28:44] <wikibugs>	 (CR) Aqu: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[16:28:48] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:29:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014245 (jcrespo) I've stopped it anyway, if you could start it up again after finishing, it would help me a lot, thank you.
[16:30:28] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:31:14] <wikibugs>	 (PS2) Aqu: Analyics: Refine restore monitor timerange [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845)
[16:31:50] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11014254 (Jhancock.wm) checked the physical cables and everything lines up right. couldn't get into the BMC. re-ran the regular provisioning script and can access the BMC now. But won't let me set th...
[16:33:00] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[16:40:13] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]] (duration: 14m 26s)
[16:40:17] <stashbot>	 T399849: hCaptcha: Load hCaptcha JS after first form interaction - https://phabricator.wikimedia.org/T399849
[16:43:18] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014324 (Eevans)
[16:49:33] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014363 (Eevans) Has there been any progress toward goal #2?  I didn't see where anything had been added to the mentioned runbook.  For context: We replaced `sda` in aqs1012 recently (T39...
[16:53:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[16:53:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T399249)', diff saved to https://phabricator.wikimedia.org/P79350 and previous config saved to /var/cache/conftool/dbconfig/20250717-165345-marostegui.json
[16:53:50] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[16:56:47] <wikibugs>	 (CR) Scott French: [V:+2] "Built locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: Scott French)
[16:58:15] <wikibugs>	 (CR) Kosta Harlan: [C:-2] "Waiting for approval." [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz)
[16:58:25] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11014409 (cmooney) Regarding the jumbo-frame complication on the plan to move to one link we are arranging to connect a second 25G on each of...
[17:00:05] <jouncebot>	 bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700)
[17:00:05] <jouncebot>	 swfrench-wmf: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700).
[17:00:19] <swfrench-wmf>	 o/
[17:01:02] * bd808 looks to see if there is anything to push out today
[17:01:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:05:04] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128
[17:05:10] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:06:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:06:41] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:08:54] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Continuing with sync
[17:09:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014464 (VRiley-WMF) ganeti1054 has moved into A4 U38
[17:09:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014466 (VRiley-WMF)
[17:14:28] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 (duration: 09m 56s)
[17:14:33] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:15:58] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170390
[17:18:19] <swfrench-wmf>	 no further mediawiki deployments planned on my end for this infra window
[17:29:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:30:57] <wikibugs>	 (CR) Scott French: [V:+2 C:+2] php8.3: initial release of 8.3 image stack [docker-images/production-images] - https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: Scott French)
[17:32:07] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - https://gerrit.wikimedia.org/r/1170391
[17:32:58] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[17:34:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:34:27] <wikibugs>	 (CR) BryanDavis: [C:+2] developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - https://gerrit.wikimedia.org/r/1170391 (owner: BryanDavis)
[17:36:08] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - https://gerrit.wikimedia.org/r/1170391 (owner: BryanDavis)
[17:36:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1054 - vriley@cumin1002"
[17:36:23] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1054 - vriley@cumin1002"
[17:36:23] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:36:36] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:36:51] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:36:54] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1054
[17:37:03] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:37:38] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:37:58] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:38:10] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1054
[17:38:18] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:39:10] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:40:14] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:41:44] <wikibugs>	 (PS2) Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348)
[17:43:51] <wikibugs>	 (CR) Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: Reedy)
[17:44:20] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:57:16] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[17:57:30] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:59:54] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:00:04] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014678 (Eevans) As a follow-up, I did find a device with a missing bootloader: aqs1014, which went up after its partman recipe was fixed (it has had SSDs replaced in the years since tho...
[18:00:05] <jouncebot>	 dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1800).
[18:00:15] <dancy>	 o/
[18:00:26] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2016
[18:00:37] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2016
[18:02:21] <logmsgbot>	 vriley@cumin1002 provision (PID 1137132) is awaiting input
[18:02:22] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014692 (CDanis) >>! In T215183#11014363, @Eevans wrote: > Has there been any progress toward goal #2?  I didn't see where anything had been added to the mentioned runbook.  Good question...
[18:03:00] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180)
[18:03:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[18:03:32] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[18:03:53] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[18:03:55] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[18:05:30] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[18:07:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:08:09] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:09:19] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:10:00] <wikibugs>	 (CR) Brouberol: [C:+2] Analyics: Refine restore monitor timerange [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[18:10:03] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:12:01] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.10  refs T392180
[18:12:06] <stashbot>	 T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180
[18:12:06] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:12:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:12:32] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:15:56] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:19:39] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host pc2016.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:20:57] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:24:05] <wikibugs>	 (PS1) Eevans: date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856)
[18:24:36] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[18:25:18] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[18:26:05] <wikibugs>	 (PS2) Eevans: date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856)
[18:27:47] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[18:27:58] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[18:30:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm
[18:30:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014834 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm
[18:30:38] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014835 (Eevans) >>! In T215183#11014691, @CDanis wrote: >>>! In T215183#11014363, @Eevans wrote: > > [ ... ] > >> For context: We replaced `sda` in aqs1012 recently (T396970) and were (I...
[18:32:03] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2016.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[18:38:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:43:38] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1054.eqiad.wmnet with reason: host reimage
[18:44:56] <wikibugs>	 (CR) Dzahn: ":) had no idea, but thanks!" [puppet] - https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[18:45:43] <wikibugs>	 SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014882 (Eevans) >>! In T215183#11014691, @CDanis wrote: >>>! In T215183#11014363, @Eevans wrote: > > [ ... ] >  > I also never spent much time looking at or thinking about RAID10 hosts,...
[18:48:26] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1054.eqiad.wmnet with reason: host reimage
[18:54:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:54:25] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:54:59] <wikibugs>	 (PS2) Acamicamacaraca: Grant editpatrolprotected to sysops and bots [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407
[18:55:17] <wikibugs>	 (PS3) Acamicamacaraca: Grant editpatrolprotected to sysops and bots [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881)
[19:03:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:03:53] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:03:54] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1054.eqiad.wmnet with OS bookworm
[19:04:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm completed...
[19:04:23] <wikibugs>	 (CR) Scott French: [C:+1] date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[19:04:48] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: Acamicamacaraca)
[19:04:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014925 (VRiley-WMF)
[19:05:10] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014926 (VRiley-WMF) Open→Resolved These have been imaged
[19:07:23] <wikibugs>	 (CR) Cwhite: [C:+2] date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans)
[19:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:24:41] <wikibugs>	 SRE-swift-storage, MediaWiki-File-management: Undeleted file is an incorrect version - https://phabricator.wikimedia.org/T399892#11014984 (Bugreporter)
[19:28:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:32:07] <wikibugs>	 (PS1) Sbisson: CX3 Build 1.0.0+20250717 [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503)
[19:32:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: Sbisson)
[19:35:47] <stephanebisson>	 FYI, I have a patch scheduled in the upcoming deployment window in about 25 minutes. I'll be a little late but eventually I'll be there and I'll handle my patch.
[19:42:56] <wikibugs>	 (CR) Zoranzoki21: "@zivkovica006@gmail.com asked me to review this, but I'm unsure, so I'd like someone else with more knowledge to review this." [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: Acamicamacaraca)
[19:51:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899 (REsquito-WMF) NEW
[19:52:31] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015090 (REsquito-WMF) this ticket is a prerequisite for https://phabricator.wikimedia.org/T396672 and that @dr0ptp4kt  is also readying a patch for additional access in ht...
[19:58:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015111 (HShaikh) Approved. Thank you
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T2000).
[20:00:05] <jouncebot>	 Aca and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <Aca>	 *waves*
[20:00:28] <Kizule>	 *waves*
[20:01:33] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[20:02:56] <Aca>	 TheresNoTime Are you around for deployment? :)
[20:04:36] <stephanebisson>	 Aca are you deploying your patch?
[20:05:06] <TheresNoTime>	 Aca: I can in about 10 minutes 
[20:05:23] <TheresNoTime>	 stephanebisson: can you deploy your own patch?
[20:05:39] <stephanebisson>	 Yes, I'll go ahead if there are no objections
[20:05:51] <Kizule>	 TheresNoTime: I'm discussing Aca's patch with Aca, it might require adding messages to WikimediaMessages.
[20:06:00] <TheresNoTime>	 Kizule: ack
[20:06:09] <Aca>	 yep, it will require that
[20:06:10] <TheresNoTime>	 stephanebisson: please proceed with your patch
[20:06:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: Sbisson)
[20:08:52] <wikibugs>	 (Merged) jenkins-bot: CX3 Build 1.0.0+20250717 [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: Sbisson)
[20:09:08] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]]
[20:09:22] <stashbot>	 T388503: Section Translation: Support expanding the existing section if it already exists - https://phabricator.wikimedia.org/T388503
[20:09:22] <stashbot>	 T395417:  CX events EventGate validation errors: translation_source_title should be string - https://phabricator.wikimedia.org/T395417
[20:09:23] <stashbot>	 T395418: CX events EventGate validation errors: event_source should be string and equal to one of the enum values - https://phabricator.wikimedia.org/T395418
[20:11:08] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:13:11] <wikibugs>	 (PS2) Btullis: Disable all dumps timers on snapshot hosts [puppet] - https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438)
[20:14:38] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Continuing with sync
[20:15:22] <wikibugs>	 (CR) Dzahn: [V:+1] "compiled on entire "C:scap" and it's noop - https://puppet-compiler.wmflabs.org/output/1137818/6298/" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[20:15:33] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[20:16:00] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[20:17:09] <Kizule>	 TheresNoTime: WikimediaMessages patch is created by Aca as well. It might need a backport to wmf.10 as well. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1170420
[20:17:16] <Kizule>	 Relevant config patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1170407
[20:17:16] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[20:18:52] <wikibugs>	 (CR) Dzahn: "more sorry for the delay from my side - i'd still deploy this but it's low priority - maybe I should just reach out to you on IRC when is " [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[20:19:37] <TheresNoTime>	 Kizule: okay, I'll take a look. It will need backporting to wmf.10 yeah
[20:20:18] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]] (duration: 11m 10s)
[20:20:26] <stashbot>	 T388503: Section Translation: Support expanding the existing section if it already exists - https://phabricator.wikimedia.org/T388503
[20:20:26] <stashbot>	 T395417:  CX events EventGate validation errors: translation_source_title should be string - https://phabricator.wikimedia.org/T395417
[20:20:27] <stashbot>	 T395418: CX events EventGate validation errors: event_source should be string and equal to one of the enum values - https://phabricator.wikimedia.org/T395418
[20:20:39] <stephanebisson>	 I'm done
[20:20:48] <TheresNoTime>	 ack :)
[20:21:11] <Aca>	 nicee
[20:21:55] <wikibugs>	 (PS1) Bvibber: Database index hack to speed chartinfo API [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950)
[20:23:18] <bvibber>	 if things are clear i'll push that real quick
[20:23:28] <TheresNoTime>	 bvibber: go ahead, I'm waiting on CI :)
[20:23:48] <bvibber>	 tx <3
[20:23:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950) (owner: Bvibber)
[20:33:47] <wikibugs>	 (Merged) jenkins-bot: Database index hack to speed chartinfo API [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950) (owner: Bvibber)
[20:34:39] <logmsgbot>	 !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]]
[20:34:44] <stashbot>	 T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950
[20:35:33] <TheresNoTime>	 Kizule: can you take a look at & +1 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1170420 and I'll +2 it
[20:36:30] <Kizule>	 TheresNoTime: Done
[20:36:41] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:37:55] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Continuing with sync
[20:37:58] <bvibber>	 confirmed good
[20:38:25] <wikibugs>	 (PS1) Zoranzoki21: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881)
[20:38:45] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: Zoranzoki21)
[20:38:46] <wikibugs>	 (PS2) Samtar: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: Zoranzoki21)
[20:39:18] <Kizule>	 TheresNoTime: I made a cherry-pick so CI can finish in time.
[20:39:25] <Kizule>	 *on time
[20:39:33] <TheresNoTime>	 (oh, whoops, also did... hopefully that didn't mess anything up ^^')
[20:40:10] <Kizule>	 TheresNoTime: As we did it on Gerrit, it's fine. :D
[20:40:18] <Kizule>	 Just checked, nothing is different.
[20:40:24] <TheresNoTime>	 :)
[20:40:42] <Aca>	 nahhh, leave me the job of breaking wikis
[20:40:45] <Aca>	 :D
[20:41:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:43:26] <logmsgbot>	 !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]] (duration: 08m 47s)
[20:43:31] <stashbot>	 T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950
[20:43:34] <bvibber>	 done
[20:43:40] <TheresNoTime>	 :)
[20:44:10] <bvibber>	 ok now for the fun part of the day -- taking the cat into her vet appointment :D
[20:44:19] <bvibber>	 later all :D
[20:44:28] <TheresNoTime>	 good luck!
[20:44:36] <Aca>	 see ya
[20:45:09] <Aca>	 hell nah, this check is taking too long
[20:46:00] <TheresNoTime>	 Aca: still got time to deploy it this window? :)
[20:46:08] <Aca>	 yes
[20:46:24] <Aca>	 sorry for waiting
[20:46:44] <Kizule>	 In the worst case scenario, I'm here for Aca. :D
[20:47:09] <Aca>	 I actually thought the messages could be added separately, and then Kizule told me they should be prepared for the deploy as well
[20:47:46] <Aca>	 so that's the context
[20:50:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:50:58] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:51:08] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:52:23] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:52:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: Zoranzoki21)
[20:52:31] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[20:53:28] <wikibugs>	 (Merged) jenkins-bot: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: Zoranzoki21)
[20:53:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015246 (aranyap) Resolved→Open Hi @cmooney ! I'm having some trouble trying to access JupyterHub and after some poking around with @dr0ptp4...
[20:53:46] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]]
[20:53:50] <stashbot>	 T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881
[20:55:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:55:14] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[20:56:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:58:16] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015267 (cmooney) Hi @aranyap yeah you are not in that group. ` cmooney@ldap-maint1001:~$ ldapsearch -x cn=wmf | grep aprum  cmooney@ldap-maint1001:...
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T2100)
[21:01:40] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:05:38] <TheresNoTime>	 Kizule: Aca: the deploy is still ongoing for the messages patch, it's just being a bit slow having to rebuild some container images
[21:05:53] <Kizule>	 We started wondering what's going on.
[21:05:54] <Aca>	 ack
[21:05:57] <Kizule>	 It's okay, we can wait. Ack.
[21:06:13] <wikibugs>	 (PS1) Ryan Kemper: redfish: fix inconsequential typo [software/spicerack] - https://gerrit.wikimedia.org/r/1170428
[21:06:31] <wikibugs>	 (CR) Ryan Kemper: [C:+1] redfish: fix inconsequential typo [software/spicerack] - https://gerrit.wikimedia.org/r/1170428 (owner: Ryan Kemper)
[21:06:40] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:06:55] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:13:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015297 (ssingh) [Claiming this as the clinic duty person this week]  @aranyap: https://ldap.toolforge.org/user/aranyap indicates you are not part o...
[21:14:00] <wikibugs>	 (CR) Ssingh: "Hi. Sorry about this. Let's deploy this on Monday; please ping us whenever you are around and/or feel free to send a calendar invite." [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[21:16:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015321 (ssingh) a:ssingh
[21:16:55] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:18:11] <logmsgbot>	 !log samtar@deploy1003 zoranzoki21, samtar: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:18:15] <stashbot>	 T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881
[21:18:17] <Kizule>	 Finally!
[21:18:22] <Aca>	 oh god
[21:18:47] <TheresNoTime>	 will just continue, that doesn't need testing does it?
[21:18:57] <Kizule>	 Nope
[21:19:01] <Aca>	 checkin
[21:19:07] <Aca>	 LGTM
[21:19:11] <logmsgbot>	 !log samtar@deploy1003 zoranzoki21, samtar: Continuing with sync
[21:19:28] <Aca>	 MediaWiki pages exist now
[21:19:29] <wikibugs>	 (CR) Ryan Kemper: [C:+2] redfish: fix inconsequential typo [software/spicerack] - https://gerrit.wikimedia.org/r/1170428 (owner: Ryan Kemper)
[21:21:55] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:24:53] <wikibugs>	 (PS2) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860)
[21:25:55] <wikibugs>	 (CR) Ryan Kemper: "Pushing a patch to a" [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[21:28:32] <wikibugs>	 (Merged) jenkins-bot: redfish: fix inconsequential typo [software/spicerack] - https://gerrit.wikimedia.org/r/1170428 (owner: Ryan Kemper)
[21:31:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015340 (aranyap) @cmooney @ssingh I just requested access through the online system. Thank you!
[21:31:35] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]] (duration: 37m 49s)
[21:31:39] <stashbot>	 T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881
[21:31:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: Acamicamacaraca)
[21:32:17] <TheresNoTime>	 Kizule: Aca: now deploying the config change
[21:32:29] <Kizule>	 Nice :)
[21:33:01] <Aca>	 ack
[21:33:26] <wikibugs>	 (CR) Dzahn: "thank you, Sukhbir, sounds good:)" [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[21:33:50] <wikibugs>	 (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[21:33:59] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:35:08] <wikibugs>	 (Merged) jenkins-bot: Grant editpatrolprotected to sysops and bots [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: Acamicamacaraca)
[21:35:21] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]]
[21:35:59] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:39:26] <logmsgbot>	 !log samtar@deploy1003 aleksandar, samtar: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:39:31] <stashbot>	 T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881
[21:40:07] <TheresNoTime>	 Kizule: Aca: ready to test
[21:40:10] <Aca>	 checkin
[21:40:44] <Aca>	 protection level is now displayed in the menu, LGTM
[21:40:46] <Kizule>	 no-op on srwiki, so it's fine.
[21:41:06] <logmsgbot>	 !log samtar@deploy1003 aleksandar, samtar: Continuing with sync
[21:41:59] <icinga-wm>	 PROBLEM - SSH on bast7002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:42:59] <icinga-wm>	 RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:47:58] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]] (duration: 12m 37s)
[21:48:03] <stashbot>	 T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881
[21:48:09] <TheresNoTime>	 done finally!
[21:48:22] <Aca>	 Thank you for the deployyy
[21:48:30] <Kizule>	 Thanks, all good!
[21:48:32] <TheresNoTime>	 no worries :)
[21:56:55] <wikibugs>	 (PS1) Dzahn: gerrit: replace host names in replica config with variables [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833)
[21:57:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:59:33] <wikibugs>	 (PS2) Dzahn: gerrit: replace host names in replica config with variables [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833)
[22:01:23] <wikibugs>	 (PS3) Dzahn: gerrit: replace host names in replica config with variables [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833)
[22:01:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221]
[22:01:29] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221]
[22:01:29] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[22:02:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:03:15] <wikibugs>	 (CR) Dzahn: [C:+1] "noop per https://puppet-compiler.wmflabs.org/output/1170433/6302/" [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[22:03:23] <wikibugs>	 (CR) Dzahn: [V:+1] gerrit: replace host names in replica config with variables [puppet] - https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[22:08:00] <zabe>	 jouncebot: nowandnext
[22:08:00] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 51 minute(s)
[22:08:00] <jouncebot>	 In 7 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T0600)
[22:08:10] <wikibugs>	 (CR) Zabe: [C:+2] PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) (owner: Jforrester)
[22:16:53] <wikibugs>	 (Merged) jenkins-bot: PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) (owner: Jforrester)
[22:17:20] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]]
[22:17:25] <stashbot>	 T399641: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (PendingChangesPager)Query: SELECT  page_namespace,page_title,page_len,rev_len,page_latest,fp - https://phabricator.wikimedia.org/T399641
[22:19:19] <logmsgbot>	 !log zabe@deploy1003 jforrester, zabe: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:20:11] <logmsgbot>	 !log zabe@deploy1003 jforrester, zabe: Continuing with sync
[22:25:28] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]] (duration: 08m 08s)
[22:25:32] <stashbot>	 T399641: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (PendingChangesPager)Query: SELECT  page_namespace,page_title,page_len,rev_len,page_latest,fp - https://phabricator.wikimedia.org/T399641
[22:28:55] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:29:13] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:30:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:30:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:30:55] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:31:13] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:35:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:35:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:39:04] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:40:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:50:40] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:54:10] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 #page on db1229 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21562.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:54:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:54:35] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:55:40] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:56:11] <swfrench-wmf>	 o/
[22:56:16] <sukhe>	 can someone depool db1229?
[22:56:16] <swfrench-wmf>	 !incidents
[22:56:16] <sirenbot>	 6479 (UNACKED)  db1229 (paged)/MariaDB Replica Lag: s2 (paged)
[22:56:24] <swfrench-wmf>	 it's not pooled AFAICT
[22:56:24] <sukhe>	 thanks swfrench-wmf <3
[22:56:27] <sukhe>	 oh ok
[22:56:38] <swfrench-wmf>	 trying to figure out what's up
[22:56:45] <swfrench-wmf>	 !ack 6479
[22:56:46] <sirenbot>	 6479 (ACKED)  db1229 (paged)/MariaDB Replica Lag: s2 (paged)
[22:57:31] <mutante>	 !log [cumin1002:~] $ sudo dbctl instance db1229 depool
[22:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:57:46] <swfrench-wmf>	 mutante: it wasn't pooled
[22:58:05] <mutante>	 is it one that was just reimaged, similar to es* the other day?
[22:58:11] <mutante>	 eh, I mean.. kernel reboots!
[22:58:26] <cwhite>	 mutante: doesn't look like it: uptime 45d
[22:59:07] <swfrench-wmf>	 it was depooled at 16:53 today
[22:59:21] <swfrench-wmf>	 https://sal.toolforge.org/production?p=0&q=db1229&d=
[22:59:32] <swfrench-wmf>	 ah, downtime expired
[22:59:36] <swfrench-wmf>	 (6h)
[22:59:37] <mutante>	 https://phabricator.wikimedia.org/P79350
[23:00:18] <cwhite>	 swfrench-wmf: yep, I see that in the SAL as well
[23:00:21] <swfrench-wmf>	 any objections if I re-created the downtime and flag in -data-persistence?
[23:00:30] <cwhite>	 SGTM
[23:00:40] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:00:50] <mutante>	 I have the downtime cookbook open.. same thing
[23:00:54] <mutante>	 swfrench-wmf: ok, please do 
[23:01:12] <mutante>	 a comment on https://phabricator.wikimedia.org/T399249  should do
[23:01:39] <swfrench-wmf>	 ah, thanks for finding the task!
[23:03:23] <logmsgbot>	 !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance - T399249
[23:03:27] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[23:03:47] <mutante>	 thanks, left comment on ticket and IRC
[23:04:33] <mutante>	 cya all later again:)
[23:04:43] <cwhite>	 Resolving the page.
[23:04:53] <swfrench-wmf>	 awesome
[23:05:03] <mutante>	 oh, I thought it was already done based on cortobot, thanks
[23:05:06] <cwhite>	 {{done}}  thanks for the quick response y'all!
[23:05:11] <swfrench-wmf>	 cwhite: thank you, I always forget to do that and am unpleasantly surprised the next day :)
[23:05:12] <mutante>	 same to you
[23:05:40] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:05:59] <cwhite>	 glad to help!
[23:06:33] <mutante>	 in this context.. also: https://phabricator.wikimedia.org/T396816
[23:06:35] * mutante waves
[23:06:53] <swfrench-wmf>	 :)
[23:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:10:40] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:17:40] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:21:55] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:38:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:38:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1170443
[23:38:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1170443 (owner: TrainBranchBot)
[23:41:16] <wikibugs>	 (PS1) Dzahn: zuul::main: install apparmor-utils, needed for docker [puppet] - https://gerrit.wikimedia.org/r/1170444 (https://phabricator.wikimedia.org/T395938)
[23:42:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169737 (owner: Krinkle)
[23:42:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[23:43:01] <wikibugs>	 (Merged) jenkins-bot: multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169737 (owner: Krinkle)
[23:43:14] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1169737|multiversion: Fix "Class Wikimedia\MWConfig\Exception not found"]]
[23:45:10] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1169737|multiversion: Fix "Class Wikimedia\MWConfig\Exception not found"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:50:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1170443 (owner: TrainBranchBot)
[23:55:40] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:59:48] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync