[00:08:02] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172930
[00:08:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172930 (owner: TrainBranchBot)
[00:16:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:21:26] FIRING: [3x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:22:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:33:30] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172930 (owner: TrainBranchBot)
[00:52:04] PROBLEM - Disk space on an-worker1140 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 195204 MB (5% inode=99%): /var/lib/hadoop/data/b 198963 MB (5% inode=99%): /var/lib/hadoop/data/c 187944 MB (5% inode=99%): /var/lib/hadoop/data/d 193505 MB (5% inode=99%): /var/lib/hadoop/data/e 185668 MB (4% inode=99%): /var/lib/hadoop/data/f 191577 MB (5% inode=99%): /var/lib/hadoop/data/h 197227 MB (5% inode=99%): /var/lib/hadoop/data
[00:52:04] 8 MB (4% inode=99%): /var/lib/hadoop/data/j 133752 MB (3% inode=99%): /var/lib/hadoop/data/k 185046 MB (4% inode=99%): /var/lib/hadoop/data/l 192327 MB (5% inode=99%): /var/lib/hadoop/data/m 191112 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops
[01:08:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:11:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:27:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:56:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 147366 MB (3% inode=99%): /var/lib/hadoop/data/e 157246 MB (4% inode=99%): /var/lib/hadoop/data/f 152329 MB (4% inode=99%): /var/lib/hadoop/data/b 152261 MB (4% inode=99%): /var/lib/hadoop/data/g 156052 MB (4% inode=99%): /var/lib/hadoop/data/d 150126 MB (3% inode=99%): /var/lib/hadoop/data/j 158019 MB (4% inode=99%): /var/lib/hadoop/data
[01:56:14] 5 MB (4% inode=99%): /var/lib/hadoop/data/h 157141 MB (4% inode=99%): /var/lib/hadoop/data/l 154670 MB (4% inode=99%): /var/lib/hadoop/data/k 155654 MB (4% inode=99%): /var/lib/hadoop/data/m 157533 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops
[02:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:27:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:30:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2038.codfw.wmnet with reason: Maintenance
[06:30:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2038 T400436', diff saved to https://phabricator.wikimedia.org/P80008 and previous config saved to /var/cache/conftool/dbconfig/20250728-063039-root.json
[06:30:44] T400436: Switchover es6 master (es2037 -> es2035) - https://phabricator.wikimedia.org/T400436
[06:37:47] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11038118 (Marostegui) @Jhancock.wm es2038 is ready for you
[06:40:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance
[06:42:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[06:42:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[06:42:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T399249)', diff saved to https://phabricator.wikimedia.org/P80009 and previous config saved to /var/cache/conftool/dbconfig/20250728-064241-marostegui.json
[06:42:47] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:45:57] !log marostegui@cumin1002
dbctl commit (dc=all): 'Repooling after maintenance db1167 (T399249)', diff saved to https://phabricator.wikimedia.org/P80010 and previous config saved to /var/cache/conftool/dbconfig/20250728-064556-marostegui.json
[06:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:56:05] (PS1) Marostegui: mariadb: db2191 new candidate master for x1 [puppet] - https://gerrit.wikimedia.org/r/1173158 (https://phabricator.wikimedia.org/T400513)
[06:57:49] (CR) Marostegui: [C:+2] mariadb: db2191 new candidate master for x1 [puppet] - https://gerrit.wikimedia.org/r/1173158 (https://phabricator.wikimedia.org/T400513) (owner: Marostegui)
[06:58:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: Anzx)
[06:58:42] (PS3) Anzx: mnwwiktionary: update reconstruction namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441)
[06:58:47] (PS1) Michael Große: fix: avoid using wikitext that triggers ping notifications [extensions/GrowthExperiments] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173220 (https://phabricator.wikimedia.org/T400369)
[06:58:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) (owner: Anzx)
[06:58:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC morning backport
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173220 (https://phabricator.wikimedia.org/T400369) (owner: Michael Große)
[06:59:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) (owner: Anzx)
[06:59:55] (PS1) Marostegui: db2196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1173222 (https://phabricator.wikimedia.org/T400513)
[07:00:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: Michael Große)
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T0700). nyaa~
[07:00:05] anzx and MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] o/
[07:01:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80011 and previous config saved to /var/cache/conftool/dbconfig/20250728-070103-marostegui.json
[07:01:22] SRE, Hiddenparma, Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11038157 (Joe) p:Triage→High a:Joe→None
[07:01:43] * MichaelG_WMF is here
[07:02:09] SRE, Hiddenparma, Traffic: Better mapping of requests coming from datacenters/clouds - https://phabricator.wikimedia.org/T400120#11038159 (Joe) p:Triage→Medium a:Joe→None
[07:03:10] (CR) Marostegui: [C:+2] db2196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1173222 (https://phabricator.wikimedia.org/T400513) (owner: Marostegui)
[07:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:13:45] (PS1) Marostegui: db22220: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173231 (https://phabricator.wikimedia.org/T399955)
[07:14:48] (PS2) Marostegui: db2220: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173231 (https://phabricator.wikimedia.org/T399955)
[07:15:07] (PS1) Gerrit maintenance bot: mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1173232 (https://phabricator.wikimedia.org/T400591)
[07:15:34] (CR) Marostegui: [C:+2] db2220: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173231 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[07:16:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P80012 and previous config
saved to /var/cache/conftool/dbconfig/20250728-071611-marostegui.json
[07:16:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2220.codfw.wmnet with reason: Maintenance
[07:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2220 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P80013 and previous config saved to /var/cache/conftool/dbconfig/20250728-071643-marostegui.json
[07:23:38] (CR) Elukey: [C:+2] sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[07:24:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P80014 and previous config saved to /var/cache/conftool/dbconfig/20250728-072423-root.json
[07:24:27] (CR) Elukey: [C:+2] redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: Elukey)
[07:25:52] (CR) Elukey: [C:+2] insetup role report: update recipients [puppet] - https://gerrit.wikimedia.org/r/1172365 (owner: Volans)
[07:30:02] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11038190 (elukey) >>! In T393044#11004658, @Jhancock.wm wrote: > (not trying to rush, just making sure i didn't miss something) Is there anything I can help with on this one? @Jhancoc...
[07:30:49] (Merged) jenkins-bot: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: Elukey)
[07:31:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T399249)', diff saved to https://phabricator.wikimedia.org/P80015 and previous config saved to /var/cache/conftool/dbconfig/20250728-073119-marostegui.json
[07:31:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:31:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:31:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[07:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T399249)', diff saved to https://phabricator.wikimedia.org/P80016 and previous config saved to /var/cache/conftool/dbconfig/20250728-073203-marostegui.json
[07:33:19] (Merged) jenkins-bot: redfish: simplify change_user_password for Supermicro [software/spicerack] - https://gerrit.wikimedia.org/r/1172265 (https://phabricator.wikimedia.org/T396365) (owner: Elukey)
[07:34:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T399249)', diff saved to https://phabricator.wikimedia.org/P80018 and previous config saved to /var/cache/conftool/dbconfig/20250728-073417-marostegui.json
[07:37:33] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11038206 (elukey) This task needs a new Spicerack release, I hope to do one during the next couple of days!
[07:38:04] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:39:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P80019 and previous config saved to /var/cache/conftool/dbconfig/20250728-073929-root.json
[07:40:01] (CR) Kevin Bazira: [C:+1] ml-services: Deploy revertrisk-language-agnostic latest published image [deployment-charts] - https://gerrit.wikimedia.org/r/1172622 (https://phabricator.wikimedia.org/T400266) (owner: Gkyziridis)
[07:48:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:49:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P80020 and previous config saved to /var/cache/conftool/dbconfig/20250728-074924-marostegui.json
[07:52:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Secondary switchover s7 T400591
[07:52:17] T400591: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T400591
[07:54:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P80021 and previous config saved to /var/cache/conftool/dbconfig/20250728-075435-root.json
[08:00:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance
[08:00:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80022 and previous config saved to /var/cache/conftool/dbconfig/20250728-080026-fceratto.json
[08:00:34] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and
'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:04:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80023 and previous config saved to /var/cache/conftool/dbconfig/20250728-080418-fceratto.json
[08:04:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P80024 and previous config saved to /var/cache/conftool/dbconfig/20250728-080432-marostegui.json
[08:09:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P80025 and previous config saved to /var/cache/conftool/dbconfig/20250728-080940-root.json
[08:19:00] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:19:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80026 and previous config saved to /var/cache/conftool/dbconfig/20250728-081926-fceratto.json
[08:19:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T399249)', diff saved to https://phabricator.wikimedia.org/P80027 and previous config saved to /var/cache/conftool/dbconfig/20250728-081939-marostegui.json
[08:19:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:19:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[08:20:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T399249)', diff saved to https://phabricator.wikimedia.org/P80028 and previous config saved to /var/cache/conftool/dbconfig/20250728-082002-marostegui.json
[08:20:06] (PS1) Tiziano Fogli: thanos-store: fix
ThanosStoreSeriesGateLatencyHigh metrics name [alerts] - https://gerrit.wikimedia.org/r/1173329
[08:22:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T399249)', diff saved to https://phabricator.wikimedia.org/P80029 and previous config saved to /var/cache/conftool/dbconfig/20250728-082216-marostegui.json
[08:27:24] (PS1) Jelto: gitlab: adjust backup and restore schedules for failover [puppet] - https://gerrit.wikimedia.org/r/1173331 (https://phabricator.wikimedia.org/T400252)
[08:29:18] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:29:20] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6432/co" [puppet] - https://gerrit.wikimedia.org/r/1173331 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[08:29:35] (CR) Arnaudb: [C:+1] gitlab: adjust backup and restore schedules for failover [puppet] - https://gerrit.wikimedia.org/r/1173331 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[08:34:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80030 and previous config saved to /var/cache/conftool/dbconfig/20250728-083433-fceratto.json
[08:37:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P80031 and previous config saved to /var/cache/conftool/dbconfig/20250728-083724-marostegui.json
[08:38:21] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:38:31] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:41:17] (CR) Clément Goubert: [C:+2] deploy2003: Add to site.pp [puppet] -
https://gerrit.wikimedia.org/r/1172660 (https://phabricator.wikimedia.org/T400485) (owner: Clément Goubert)
[08:42:02] (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[08:42:35] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11038306 (Clement_Goubert) a:Clement_Goubert→None >>! In T400485#11035012, @RobH wrote: > @Clement_Goubert, > > Please update the site.pp file with the i...
[08:43:05] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11038310 (Clement_Goubert)
[08:45:49] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11038316 (elukey) I retried the provision cookbook but for some reason that I cannot explain, I am not able to trigger a host reboot/powercycle: * Tried via `reset /system1/pwrm...
[08:46:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:47:52] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1014.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:48:19] (PS1) Tiziano Fogli: thanos-query: add pint annotation to ThanosQueryHighDNSFailures Metric names don't seem to have changed, but they appear to be exported only after the first increment. Adding a pint annotation to avoid unwanted alerts.
[alerts] - https://gerrit.wikimedia.org/r/1173332
[08:48:47] !log hashar@deploy1003 Started deploy [integration/docroot@827d626]: build: Updating brace-expansion to 1.1.12, 2.0.2
[08:49:01] !log hashar@deploy1003 Finished deploy [integration/docroot@827d626]: build: Updating brace-expansion to 1.1.12, 2.0.2 (duration: 00m 13s)
[08:49:04] (PS2) Tiziano Fogli: thanos-query: add pint annotation to ThanosQueryHighDNSFailures [alerts] - https://gerrit.wikimedia.org/r/1173332
[08:49:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80032 and previous config saved to /var/cache/conftool/dbconfig/20250728-084941-fceratto.json
[08:49:46] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:49:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance
[08:50:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T399728)', diff saved to https://phabricator.wikimedia.org/P80033 and previous config saved to /var/cache/conftool/dbconfig/20250728-085004-fceratto.json
[08:52:11] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:52:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P80034 and previous config saved to /var/cache/conftool/dbconfig/20250728-085231-marostegui.json
[08:54:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399728)', diff saved to https://phabricator.wikimedia.org/P80035 and previous config saved to /var/cache/conftool/dbconfig/20250728-085359-fceratto.json
[08:54:02] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team:
Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038323 (elukey) @Jclark-ctr Hi! I tried to provision ml-serve10[13,14] but the BMC seems not reachable, I get connection timeouts if I try. Is there anything extra...
[08:54:17] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:57:28] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11038330 (elukey) The ticket to Dell seems not going in the right direction, but we have some direct contact with them so I hope for some good follow ups this week. O...
[08:58:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2220 with weight 0 T400591', diff saved to https://phabricator.wikimedia.org/P80036 and previous config saved to /var/cache/conftool/dbconfig/20250728-085840-root.json
[08:58:47] T400591: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T400591
[08:58:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T400591
[08:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T400591', diff saved to https://phabricator.wikimedia.org/P80037 and previous config saved to /var/cache/conftool/dbconfig/20250728-085912-root.json
[08:59:33] (CR) Marostegui: [C:+2] mariadb: Promote db2220 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1173232 (https://phabricator.wikimedia.org/T400591) (owner: Gerrit maintenance bot)
[09:02:37] !log Starting s7 codfw failover from db2218 to db2220 - T400591
[09:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2220 to s7 primary T400591', diff saved to
https://phabricator.wikimedia.org/P80038 and previous config saved to /var/cache/conftool/dbconfig/20250728-090314-root.json
[09:04:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T400591', diff saved to https://phabricator.wikimedia.org/P80039 and previous config saved to /var/cache/conftool/dbconfig/20250728-090407-marostegui.json
[09:04:12] T400591: Switchover s7 master (db2218 -> db2220) - https://phabricator.wikimedia.org/T400591
[09:07:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T399249)', diff saved to https://phabricator.wikimedia.org/P80040 and previous config saved to /var/cache/conftool/dbconfig/20250728-090739-marostegui.json
[09:07:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:07:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[09:08:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T399249)', diff saved to https://phabricator.wikimedia.org/P80041 and previous config saved to /var/cache/conftool/dbconfig/20250728-090802-marostegui.json
[09:08:13] (PS1) Marostegui: db2218: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173333 (https://phabricator.wikimedia.org/T399955)
[09:09:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80042 and previous config saved to /var/cache/conftool/dbconfig/20250728-090907-fceratto.json
[09:09:43] (CR) Marostegui: [C:+2] db2218: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173333 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[09:10:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T399249)', diff saved to https://phabricator.wikimedia.org/P80043 and previous config saved to
/var/cache/conftool/dbconfig/20250728-091016-marostegui.json
[09:10:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2218.codfw.wmnet with reason: Maintenance
[09:11:10] (PS1) Majavah: P:toolforge::proxy: Return custom error on HTTP 500 [puppet] - https://gerrit.wikimedia.org/r/1173334
[09:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:12] (CR) Filippo Giunchedi: [C:+1] thanos-store: fix ThanosStoreSeriesGateLatencyHigh metrics name [alerts] - https://gerrit.wikimedia.org/r/1173329 (owner: Tiziano Fogli)
[09:14:20] (CR) Filippo Giunchedi: [C:+1] thanos-query: add pint annotation to ThanosQueryHighDNSFailures [alerts] - https://gerrit.wikimedia.org/r/1173332 (owner: Tiziano Fogli)
[09:17:09] (CR) Arnaudb: [C:+2] gerrit: Bugfixes - dry run tests [cookbooks] - https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:17:26] (CR) Tiziano Fogli: [C:+2] thanos-query: add pint annotation to ThanosQueryHighDNSFailures [alerts] - https://gerrit.wikimedia.org/r/1173332 (owner: Tiziano Fogli)
[09:18:03] (CR) Tiziano Fogli: [C:+2] thanos-store: fix ThanosStoreSeriesGateLatencyHigh metrics name [alerts] - https://gerrit.wikimedia.org/r/1173329 (owner: Tiziano Fogli)
[09:19:39] (Merged) jenkins-bot: thanos-query: add pint annotation to ThanosQueryHighDNSFailures [alerts] - https://gerrit.wikimedia.org/r/1173332 (owner: Tiziano Fogli)
[09:19:40] (Merged) jenkins-bot: thanos-store: fix ThanosStoreSeriesGateLatencyHigh metrics name [alerts] - https://gerrit.wikimedia.org/r/1173329 (owner: Tiziano Fogli)
[09:19:49] !log marostegui@cumin1002 DONE (ERROR) -
Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on db2218.codfw.wmnet with reason: Maintenance [09:20:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2218,2243].codfw.wmnet with reason: Maintenance [09:21:13] (03PS1) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 [09:22:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P80044 and previous config saved to /var/cache/conftool/dbconfig/20250728-092237-root.json [09:23:26] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:23:36] (03Merged) 10jenkins-bot: gerrit: Bugfixes - dry run tests [cookbooks] - 10https://gerrit.wikimedia.org/r/1170160 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:24:09] (03CR) 10FNegri: [C:03+1] P:toolforge::proxy: Return custom error on HTTP 500 [puppet] - 10https://gerrit.wikimedia.org/r/1173334 (owner: 10Majavah) [09:24:10] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:24:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80045 and previous config saved to /var/cache/conftool/dbconfig/20250728-092414-fceratto.json [09:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P80046 and previous config saved to /var/cache/conftool/dbconfig/20250728-092421-root.json [09:24:41] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Return custom error on HTTP 500 [puppet] - 10https://gerrit.wikimedia.org/r/1173334 (owner: 
10Majavah) [09:24:46] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:25:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P80047 and previous config saved to /var/cache/conftool/dbconfig/20250728-092524-marostegui.json [09:26:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:50] elukey@cumin1003 provision (PID 3741833) is awaiting input [09:29:14] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#11038391 (10ABran-WMF) 05Open→03Resolved this can be considered as done with the merge of https://gerrit.wikimedi... 
[09:29:18] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:29:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:30:23] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:30:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:31:16] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:02] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: adjust backup and restore schedules for failover [puppet] - 10https://gerrit.wikimedia.org/r/1173331 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [09:34:15] (03PS1) 10Michael Große: Echo: be explicit about special wikis using Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171548 (https://phabricator.wikimedia.org/T400070) [09:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80048 and previous config saved to /var/cache/conftool/dbconfig/20250728-093743-root.json [09:39:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399728)', diff saved to https://phabricator.wikimedia.org/P80049 and previous config saved to /var/cache/conftool/dbconfig/20250728-093922-fceratto.json [09:39:27] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80050 and previous config saved to /var/cache/conftool/dbconfig/20250728-093926-root.json [09:39:27] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:39:37] (03PS1) 10Hnowlan: (WIP) rest-gateway: add rest.php routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173345 (https://phabricator.wikimedia.org/T400132) [09:39:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2178.codfw.wmnet with reason: Maintenance [09:39:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T399728)', diff saved to https://phabricator.wikimedia.org/P80051 and previous config saved to /var/cache/conftool/dbconfig/20250728-093945-fceratto.json [09:40:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P80052 and previous config saved to /var/cache/conftool/dbconfig/20250728-094031-marostegui.json [09:41:43] (03CR) 10CI reject: [V:04-1] (WIP) rest-gateway: add rest.php routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173345 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [09:41:46] (03PS1) 10Tiziano Fogli: statsv: fix StatsvThroughput linting alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173346 [09:42:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038457 (10Jclark-ctr) @elukey. 
That is correct these racks do not have power yet [09:43:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T399728)', diff saved to https://phabricator.wikimedia.org/P80053 and previous config saved to /var/cache/conftool/dbconfig/20250728-094333-fceratto.json [09:45:03] 06SRE-OnFire: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#11038475 (10fgiunchedi) [09:45:47] (03PS2) 10Hnowlan: (WIP) rest-gateway: add rest.php routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173345 (https://phabricator.wikimedia.org/T400132) [09:46:44] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#11038477 (10Multichill) >>! In T388809#11025715, @BCornwall wrote: > Thanks for your patience. Hopefully we're done-done now. :) https://www.pywikipedia.org/ redirects correctly,... [09:49:11] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2241 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1173347 (https://phabricator.wikimedia.org/T400598) [09:49:15] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11038490 (10fgiunchedi) Yes indeed trying a memcached flush seems easy and worth a try [09:49:27] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2241 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1173348 (https://phabricator.wikimedia.org/T400599) [09:50:06] (03Abandoned) 10Marostegui: mariadb: Promote db2241 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1173347 (https://phabricator.wikimedia.org/T400598) (owner: 10Gerrit maintenance bot) [09:50:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:52:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80054 and previous config saved to /var/cache/conftool/dbconfig/20250728-095249-root.json [09:53:43] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS trixie [09:54:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80055 and previous config saved to /var/cache/conftool/dbconfig/20250728-095432-root.json [09:55:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T399249)', diff saved to https://phabricator.wikimedia.org/P80056 and previous config saved to /var/cache/conftool/dbconfig/20250728-095539-marostegui.json [09:55:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:55:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 16 hosts with reason: Primary switchover x3 T400599 [09:55:53] T400599: Switchover x3 master (db2162 -> db2241) - https://phabricator.wikimedia.org/T400599 [09:55:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T399249)', diff saved to https://phabricator.wikimedia.org/P80057 and previous config saved to /var/cache/conftool/dbconfig/20250728-095601-marostegui.json [09:58:14] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2241 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1173348 (https://phabricator.wikimedia.org/T400599) (owner: 10Gerrit maintenance bot) [09:58:16] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T399249)', diff saved to https://phabricator.wikimedia.org/P80058 and previous config saved to /var/cache/conftool/dbconfig/20250728-095815-marostegui.json [09:58:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80059 and previous config saved to /var/cache/conftool/dbconfig/20250728-095841-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1000) [10:00:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:01:11] (03CR) 10Filippo Giunchedi: [C:03+1] statsv: fix StatsvThroughput linting alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173346 (owner: 10Tiziano Fogli) [10:01:30] !log btullis@deploy1003 Started scap build-images: Updating mediawiki-cli image for T400383 [10:01:35] T400383: Recent wikibase RDF dumps on Airflow have failed - https://phabricator.wikimedia.org/T400383 [10:01:40] elukey@cumin1003 provision (PID 3742126) is awaiting input [10:01:42] (03CR) 10Tiziano Fogli: [C:03+2] statsv: fix StatsvThroughput linting alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173346 (owner: 10Tiziano Fogli) [10:01:48] !log Starting x3 codfw failover from db2162 to db2241 - T400599 [10:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:52] T400599: Switchover x3 master (db2162 -> db2241) - https://phabricator.wikimedia.org/T400599 [10:02:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2241 to x3 primary T400599', diff saved to https://phabricator.wikimedia.org/P80060 and previous config 
saved to /var/cache/conftool/dbconfig/20250728-100208-root.json [10:02:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2162 T400599', diff saved to https://phabricator.wikimedia.org/P80061 and previous config saved to /var/cache/conftool/dbconfig/20250728-100243-marostegui.json [10:03:11] (03Merged) 10jenkins-bot: statsv: fix StatsvThroughput linting alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173346 (owner: 10Tiziano Fogli) [10:05:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:05:57] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [10:07:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80062 and previous config saved to /var/cache/conftool/dbconfig/20250728-100754-root.json [10:08:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80063 and previous config saved to /var/cache/conftool/dbconfig/20250728-100806-root.json [10:09:16] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:26] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [10:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80064 and previous config saved to /var/cache/conftool/dbconfig/20250728-100938-root.json [10:13:23] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P80065 and previous config saved to /var/cache/conftool/dbconfig/20250728-101322-marostegui.json [10:13:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80066 and previous config saved to /var/cache/conftool/dbconfig/20250728-101348-fceratto.json [10:14:13] (03PS2) 10Btullis: Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) [10:14:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:16:25] (03PS3) 10Btullis: Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) [10:17:02] (03CR) 10Btullis: [V:03+2 C:03+2] Bump the flink-operator image to version 1.12.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1172351 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [10:17:30] !log btullis@deploy1003 Finished scap build-images: Updating mediawiki-cli image for T400383 (duration: 16m 00s) [10:17:35] T400383: Recent wikibase RDF dumps on Airflow have failed - https://phabricator.wikimedia.org/T400383 [10:19:28] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: DiskSpace (instance netbox-dev2003:9100) - https://phabricator.wikimedia.org/T400601 (10tappof) 03NEW [10:22:33] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11038584 
(10elukey) After reviewing the code with some fresh eyes/brain I realized that for UEFI we have already started to use the BIOS settings for PXE, but we haven't... [10:23:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80067 and previous config saved to /var/cache/conftool/dbconfig/20250728-102300-root.json [10:23:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80068 and previous config saved to /var/cache/conftool/dbconfig/20250728-102311-root.json [10:24:12] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS trixie [10:24:30] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:24:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80069 and previous config saved to /var/cache/conftool/dbconfig/20250728-102444-root.json [10:25:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:25:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:26:40] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11038601 (10MatthewVernon) I'm afraid that these three images are long gone. They are not found in either swift cluster, nor in either site's... 
[10:26:44] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetmaster1001:9100) - https://phabricator.wikimedia.org/T400603 (10tappof) 03NEW [10:27:02] (03PS2) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [10:27:15] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetmaster1001:9100) - https://phabricator.wikimedia.org/T400603#11038616 (10tappof) The host doesn't seem to exist. I tagged the Infrastructure-Foundation group since it’s a Puppet-related al... [10:27:55] jouncebot: nowandnexr [10:27:57] jouncebot: nowandnext [10:27:57] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1000) [10:27:58] In 2 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1300) [10:28:01] (03PS1) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [10:28:10] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:28:25] (03CR) 10CI reject: [V:04-1] openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:28:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P80070 and previous config saved to /var/cache/conftool/dbconfig/20250728-102830-marostegui.json [10:28:56] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2178 (T399728)', diff saved to https://phabricator.wikimedia.org/P80071 and previous config saved to /var/cache/conftool/dbconfig/20250728-102856-fceratto.json [10:29:01] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:29:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance [10:29:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80072 and previous config saved to /var/cache/conftool/dbconfig/20250728-102918-fceratto.json [10:29:36] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: PyEz "ignore_warnings" does not work for port-block speed change warning - https://phabricator.wikimedia.org/T400261#11038624 (10cmooney) 05Open→03Resolved a:03cmooney [10:29:43] (03PS2) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [10:30:10] (03CR) 10Ladsgroup: "ping 😄" [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945) (owner: 10Ladsgroup) [10:30:21] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:31:42] (03PS3) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 
(https://phabricator.wikimedia.org/T395742) [10:32:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80073 and previous config saved to /var/cache/conftool/dbconfig/20250728-103208-fceratto.json [10:32:17] (03CR) 10Majavah: [C:03+2] wikireplicas: Use --replace instead of --replace-all [cookbooks] - 10https://gerrit.wikimedia.org/r/1172657 (owner: 10Majavah) [10:33:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038654 (10elukey) [10:33:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038655 (10elukey) [10:34:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:34:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11038656 (10elukey) Perfect! So ml-serve10[12,13] are ready to go, they are running Trixie though. 
[10:34:32] (03PS3) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [10:34:40] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:35:44] (03CR) 10Marostegui: [C:03+1] "You'll need to restart sanitarium hosts mariadb" [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945) (owner: 10Ladsgroup) [10:36:30] (03PS4) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [10:37:23] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:38:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80074 and previous config saved to /var/cache/conftool/dbconfig/20250728-103817-root.json [10:38:31] (03CR) 10Lucas Werkmeister (WMDE): "The current way these blocks are formatted doesn’t make it clear that they’re all Wikimania related… I’d remove the blank lines between bl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [10:39:03] (03PS5) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [10:39:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:33] (03Merged) 10jenkins-bot: wikireplicas: Use --replace instead of --replace-all [cookbooks] - 10https://gerrit.wikimedia.org/r/1172657 (owner: 10Majavah) [10:39:51] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:39:53] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:40:38] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:41:27] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:42:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:42:53] (03PS6) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [10:43:19] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:43:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T399249)', diff saved to https://phabricator.wikimedia.org/P80075 and previous config saved to /var/cache/conftool/dbconfig/20250728-104337-marostegui.json [10:43:44] T399249: Add cl_timestamp_id index to categorylinks table - 
https://phabricator.wikimedia.org/T399249 [10:43:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [10:44:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T399249)', diff saved to https://phabricator.wikimedia.org/P80076 and previous config saved to /var/cache/conftool/dbconfig/20250728-104400-marostegui.json [10:45:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T399249)', diff saved to https://phabricator.wikimedia.org/P80077 and previous config saved to /var/cache/conftool/dbconfig/20250728-104508-marostegui.json [10:47:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80078 and previous config saved to /var/cache/conftool/dbconfig/20250728-104715-fceratto.json [10:47:35] (03CR) 10Majavah: [C:04-1] openstack.neutron.metadata_agent: increase the number of open files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [10:52:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:53:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80079 and previous config saved to /var/cache/conftool/dbconfig/20250728-105323-root.json [10:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to 
expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:54:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:55:39] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173358 [10:57:18] (03PS4) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [10:58:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:58:22] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:00:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P80080 and previous config saved to /var/cache/conftool/dbconfig/20250728-110016-marostegui.json [11:00:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:00] (03PS4) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [11:01:12] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:56] (03CR) 10CI reject: [V:04-1] throttle: add rules for Wikimania 2025 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [11:02:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80081 and previous config saved to /var/cache/conftool/dbconfig/20250728-110222-fceratto.json [11:04:41] (03PS5) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [11:05:29] (03PS1) 10Ladsgroup: Drop references to flaggedrevs_tracking [puppet] - 10https://gerrit.wikimedia.org/r/1173359 (https://phabricator.wikimedia.org/T398936) [11:05:29] (03CR) 10CI reject: [V:04-1] throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [11:07:11] (03PS6) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [11:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:33] (03CR) 10Anzx: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [11:11:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:22] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:11:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy 
FORCED [11:13:20] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 7.867 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:14:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:22] (03PS3) 10R4356thwiki: Remove $wgCentralNoticeESITestString [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173360 (https://phabricator.wikimedia.org/T400472) [11:15:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:15:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P80082 and previous config saved to /var/cache/conftool/dbconfig/20250728-111524-marostegui.json [11:15:42] (03PS7) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [11:17:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399728)', diff saved to https://phabricator.wikimedia.org/P80083 and previous config saved to 
/var/cache/conftool/dbconfig/20250728-111730-fceratto.json [11:17:36] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:17:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance [11:19:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance [11:19:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T399728)', diff saved to https://phabricator.wikimedia.org/P80084 and previous config saved to /var/cache/conftool/dbconfig/20250728-111958-fceratto.json [11:20:10] marostegui: I'm about to deploy this change: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1167217 this will start deleting stuff from PC. Let me know if things break there [11:20:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167217 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:22:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T399728)', diff saved to https://phabricator.wikimedia.org/P80085 and previous config saved to /var/cache/conftool/dbconfig/20250728-112254-fceratto.json [11:23:00] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:23:46] (03Merged) 10jenkins-bot: ParserCache: Enable purgePeriod for 
SqlBagOStuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167217 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:24:10] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1167217|ParserCache: Enable purgePeriod for SqlBagOStuff (T398806)]] [11:24:15] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [11:25:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:29] (03CR) 10R4356thwiki: "This does not depend on I7c8e2325251a5aa7dc7711d068e14d4015ee7ae0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173360 (https://phabricator.wikimedia.org/T400472) (owner: 10R4356thwiki) [11:29:22] elukey@cumin1003 provision (PID 3753019) is awaiting input [11:30:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T399249)', diff saved to https://phabricator.wikimedia.org/P80086 and previous config saved to /var/cache/conftool/dbconfig/20250728-113031-marostegui.json [11:30:34] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1167217|ParserCache: Enable purgePeriod for SqlBagOStuff (T398806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:30:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:30:41] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [11:30:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance [11:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T399249)', diff saved to https://phabricator.wikimedia.org/P80087 and previous config saved to /var/cache/conftool/dbconfig/20250728-113054-marostegui.json [11:31:49] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T399249)', diff saved to https://phabricator.wikimedia.org/P80088 and previous config saved to /var/cache/conftool/dbconfig/20250728-113202-marostegui.json [11:34:18] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:35:59] (03CR) 10Daimona Eaytoy: [C:03+1] Echo: be explicit about special wikis using Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171548 (https://phabricator.wikimedia.org/T400070) (owner: 10Michael Große) [11:36:46] jouncebot: nowandnext [11:36:46] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [11:36:46] In 1 hour(s) and 23 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1300) [11:37:08] anyone mind if I do a quick service deploy? 
(mobileapps) [11:38:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80089 and previous config saved to /var/cache/conftool/dbconfig/20250728-113801-fceratto.json [11:39:35] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172110 (owner: 10PipelineBot) [11:41:33] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172110 (owner: 10PipelineBot) [11:43:29] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:43:53] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:44:01] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167217|ParserCache: Enable purgePeriod for SqlBagOStuff (T398806)]] (duration: 19m 51s) [11:44:06] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [11:45:27] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:46:12] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:46:21] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:47:06] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:47:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P80090 and previous config saved to /var/cache/conftool/dbconfig/20250728-114710-marostegui.json [11:50:02] (03PS4) 10Hnowlan: (WIP) rest-gateway: add rest.php routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173345 (https://phabricator.wikimedia.org/T400132) [11:50:30] (03PS1) 10Reedy: Allow index dump from non-managed cluster [extensions/CirrusSearch] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173363 
(https://phabricator.wikimedia.org/T400158) [11:50:54] jouncebot: nowandnext [11:50:54] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [11:50:54] In 1 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1300) [11:51:23] btullis: ^ I can just deploy that now if you want [11:53:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80091 and previous config saved to /var/cache/conftool/dbconfig/20250728-115309-fceratto.json [11:54:28] (03PS2) 10Btullis: Allow index dump from non-managed cluster [extensions/CirrusSearch] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173363 (https://phabricator.wikimedia.org/T400158) (owner: 10Reedy) [11:55:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CirrusSearch] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173363 (https://phabricator.wikimedia.org/T400158) (owner: 10Reedy) [11:57:29] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11038861 (10Jclark-ctr) For Row C, I’ve logged into each console port and verified they are correctly mapped. We’re still waiting on longer power cables [11:58:21] (03CR) 10Cathal Mooney: zarcillo: Add egress to dyna.w.o (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:01:16] Reedy: Thanks ever so much. That would be great. This is my first backport that I've ever submitted myself. 
[12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.175s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:02:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P80092 and previous config saved to /var/cache/conftool/dbconfig/20250728-120217-marostegui.json [12:03:48] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11038885 (10Jclark-ctr) @BTullis if you're able to assist with this today so I can finish this by today or tomorrow to help keep this within Dcops SLA for repairs [12:04:37] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11038887 (10BTullis) Yep, sorry. I will look now. 
[12:08:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T399728)', diff saved to https://phabricator.wikimedia.org/P80093 and previous config saved to /var/cache/conftool/dbconfig/20250728-120816-fceratto.json [12:08:22] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:08:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2223.codfw.wmnet with reason: Maintenance [12:08:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T399728)', diff saved to https://phabricator.wikimedia.org/P80094 and previous config saved to /var/cache/conftool/dbconfig/20250728-120839-fceratto.json [12:10:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.303s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:12:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T399728)', diff saved to https://phabricator.wikimedia.org/P80095 and previous config saved to /var/cache/conftool/dbconfig/20250728-121234-fceratto.json [12:13:18] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11038901 (10BTullis) @Jclark-ctr Apologies for 
the delay. You can go ahead and swap the drive whenever you're ready, now. Does the SLA depend on the Icinga check clearing, or when t... [12:15:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:04] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11038909 (10Jclark-ctr) from ticket creation to Ticket resolved We have 10days for repairs [12:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T399249)', diff saved to https://phabricator.wikimedia.org/P80096 and previous config saved to /var/cache/conftool/dbconfig/20250728-121725-marostegui.json [12:17:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:17:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [12:17:45] (03CR) 10Clément Goubert: "LGTM except commit title" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:17:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T399249)', diff saved to https://phabricator.wikimedia.org/P80097 and previous config saved to /var/cache/conftool/dbconfig/20250728-121747-marostegui.json [12:18:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T399249)', diff saved to https://phabricator.wikimedia.org/P80098 and previous config saved to /var/cache/conftool/dbconfig/20250728-121855-marostegui.json [12:20:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:38] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:22:44] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:24:06] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1186 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [12:24:10] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:43] FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:25:29] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11038926 (10Jclark-ctr) @BTullis the drive has been physically swapped [12:25:40] (03CR) 10Clément Goubert: (WIP) rest-gateway: add rest.php routes (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173345 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [12:26:34] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:27:28] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 
Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:27:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80099 and previous config saved to /var/cache/conftool/dbconfig/20250728-122741-fceratto.json [12:30:07] (03PS1) 10Ladsgroup: objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) [12:30:14] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11038949 (10GPSLeo) As there are likely many more of these cases is there a possibility to scan over all files on Commons to find all files a... [12:30:30] jouncebot: nowandnext [12:30:30] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [12:30:30] In 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1300) [12:30:51] (03CR) 10Ladsgroup: [C:03+2] objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [12:34:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P80100 and previous config saved to /var/cache/conftool/dbconfig/20250728-123403-marostegui.json [12:34:09] (03PS1) 10Arnaudb: gerrit: add service ip address for gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) [12:34:09] (03CR) 10Arnaudb: "we'll need to have a service address for gerrit-spare.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:34:10] RESOLVED: [4x] 
ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:53] (03CR) 10Reedy: [C:03+2] "Get it going so we're not waiting on CI" [extensions/CirrusSearch] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173363 (https://phabricator.wikimedia.org/T400158) (owner: 10Reedy) [12:41:26] (03Merged) 10jenkins-bot: Allow index dump from non-managed cluster [extensions/CirrusSearch] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173363 (https://phabricator.wikimedia.org/T400158) (owner: 10Reedy) [12:41:46] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173373 [12:42:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80101 and previous config saved to /var/cache/conftool/dbconfig/20250728-124249-fceratto.json [12:43:00] (03CR) 10CI reject: [V:04-1] objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [12:43:59] (03PS7) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [12:43:59] (03CR) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:44:23] (03CR) 10CI reject: [V:04-1] openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:45:08] 10ops-eqiad, 06SRE, 06DC-Ops, 
10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11039012 (10BTullis) 05Open→03Resolved We're good to go. Many thanks again @Jclark-ctr {F65686080,width=60%} [12:45:50] Then it only takes 2 minutes [12:45:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1186 - https://phabricator.wikimedia.org/T399991#11039017 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:45:59] (03PS8) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [12:47:02] (03CR) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:47:05] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:47:27] (03CR) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:47:59] (03PS1) 10Ladsgroup: diff: Avoid Phan warning with some Wikidiff2 versions [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173374 [12:48:04] (03CR) 10Ladsgroup: [C:03+2] diff: Avoid Phan warning with some Wikidiff2 versions [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173374 (owner: 10Ladsgroup) [12:49:00] (03CR) 10Lucas Werkmeister (WMDE): throttle: add rules for Wikimania 2025 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [12:49:10] (03PS9) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 
10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [12:49:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P80102 and previous config saved to /var/cache/conftool/dbconfig/20250728-124910-marostegui.json [12:49:35] (03CR) 10CI reject: [V:04-1] openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:51:11] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1173363|Allow index dump from non-managed cluster (T400158)]] [12:51:17] T400158: cirrussearch dumps have failed - 2025-07-21 - https://phabricator.wikimedia.org/T400158 [12:52:20] (03PS10) 10David Caro: openstack.neutron.metadata_agent: increase the number of open files [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) [12:52:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171548 (https://phabricator.wikimedia.org/T400070) (owner: 10Michael Große) [12:53:09] (03CR) 10Anzx: throttle: add rules for Wikimania 2025 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [12:54:42] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:55:15] !log reedy@deploy1003 reedy: Backport for [[gerrit:1173363|Allow index dump from non-managed cluster (T400158)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:55:15] (03CR) 10Federico Ceratto: [C:03+2] Data Persistence: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans) [12:56:53] !log reedy@deploy1003 reedy: Continuing with sync [12:57:39] (03CR) 10David Caro: [C:03+2] "Pcc is clean now, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1173352 (https://phabricator.wikimedia.org/T395742) (owner: 10David Caro) [12:57:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T399728)', diff saved to https://phabricator.wikimedia.org/P80103 and previous config saved to /var/cache/conftool/dbconfig/20250728-125756-fceratto.json [12:58:02] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:58:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2228.codfw.wmnet with reason: Maintenance [12:58:13] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] Data Persistence: simplify Phabricator usage (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans) [12:58:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T399728)', diff saved to https://phabricator.wikimedia.org/P80104 and previous config saved to /var/cache/conftool/dbconfig/20250728-125818-fceratto.json [12:58:43] (03CR) 10Lucas Werkmeister (WMDE): throttle: add rules for Wikimania 2025 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1300). [13:00:05] anzx, btullis, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] hi hi :) [13:00:24] o/ [13:01:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T399728)', diff saved to https://phabricator.wikimedia.org/P80105 and previous config saved to /var/cache/conftool/dbconfig/20250728-130109-fceratto.json [13:01:18] (03Merged) 10jenkins-bot: diff: Avoid Phan warning with some Wikidiff2 versions [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173374 (owner: 10Ladsgroup) [13:01:29] the two config patches are effectively no-ops as far as testing is concerned: one literally just changes a comment, the other affects a long-running maintenance script's next invocation [13:01:45] o/ [13:02:12] MichaelG_WMF: The CirrusSearch one is nearly done and out of the way [13:02:16] However, the backport includes crucially an i18n change: not pinging mentors that are away. So that might take some time to rebuild the language cache [13:02:27] let’s start with the mnwwiktionary and aswikisource config changes [13:02:34] yeahh, expect that one to be sloow [13:03:17] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [13:03:39] 10ops-eqiad, 06SRE, 06DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - https://phabricator.wikimedia.org/T400034#11039131 (10Jclark-ctr) updated firmware for bmc to 1.05.22 [13:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T399249)', diff saved to https://phabricator.wikimedia.org/P80106 and previous config saved to /var/cache/conftool/dbconfig/20250728-130418-marostegui.json [13:04:22] o/ [13:04:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:04:34] !log marostegui@cumin1002 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance [13:04:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T399249)', diff saved to https://phabricator.wikimedia.org/P80107 and previous config saved to /var/cache/conftool/dbconfig/20250728-130441-marostegui.json [13:04:48] (03CR) 10Lucas Werkmeister (WMDE): mnwwiktionary: update reconstruction namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) (owner: 10Anzx) [13:04:50] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173363|Allow index dump from non-managed cluster (T400158)]] (duration: 13m 39s) [13:04:57] T400158: cirrussearch dumps have failed - 2025-07-21 - https://phabricator.wikimedia.org/T400158 [13:05:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [13:05:21] I'm clear [13:05:31] oh, I didn’t notice that [13:05:37] heh [13:05:40] (scap would’ve pointed out the lock though ^^) [13:05:42] anyway, I can deploy ^^ [13:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T399249)', diff saved to https://phabricator.wikimedia.org/P80108 and previous config saved to /var/cache/conftool/dbconfig/20250728-130549-marostegui.json [13:06:09] (03Merged) 10jenkins-bot: aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172865 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [13:06:22] !log sukhe@idp1004:~$ sudo systemctl restart tomcat10.service [13:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] unexpected commits on wmf.11… [13:06:45] (03CR) 10Ladsgroup: [C:03+2] "try again" [core] (wmf/1.45.0-wmf.11) - 
10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [13:06:55] ok, apparently it’s a comment-only change [13:07:34] 10ops-eqiad, 06SRE, 06DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - https://phabricator.wikimedia.org/T400034#11039148 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [13:07:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1172865|aswikisource: add publisher (প্ৰকাশক) namespace (T399269)]] [13:07:54] T399269: Add "প্ৰকাশক" namespace and "Edition" field in Index page for Assamese Wikisource - https://phabricator.wikimedia.org/T399269 [13:07:59] Lucas_WMDE: it was broken [13:08:25] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1173374 [13:08:46] (context: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1173371) [13:09:10] Amir1: I was gonna ask what you’re doing there as well [13:09:26] (03CR) 10Ladsgroup: objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [13:09:43] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:09:47] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1172865|aswikisource: add publisher (প্ৰকাশক) namespace (T399269)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:09:58] anzx: please test the aswikisource change :) [13:10:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:19] Lucas_WMDE: looks good [13:10:22] hm [13:10:25] not to me… [13:10:40] to me it looks like the name of ns104 *changed* from পৃষ্ঠ to প্ৰকাশক ? [13:10:47] I removed my +2 for now (I was planning to deploy it before this window but it was broken) [13:10:59] thanks [13:11:13] anzx: AFAICT this isn’t adding a namespace, it’s renaming it [13:11:16] (03PS2) 10Arnaudb: Gerrit: Add service ip for gerrit2003 [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) [13:11:16] (03CR) 10Arnaudb: "This adds the DNS records to gerrit-spare, on gerrit2003" [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:11:18] without an alias for the old name [13:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:13] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11039161 (10jcrespo) >>! In T400567#11038949, @GPSLeo wrote: > As there are likely many more of these cases is there a possibility to scan ov... 
[13:12:22] google translate claims the previous name meant “page” [13:13:04] (the word is also listed at https://en.wiktionary.org/wiki/%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0#Sanskrit under alternative scripts) [13:13:30] so to me it looks like this unintentionally reused a namespace ID that was already assigned [13:14:08] Lucas_WMDE: please revert [13:14:13] !log lucaswerkmeister-wmde@deploy1003 Sync cancelled. [13:14:15] ack [13:14:25] hm, can you revert in spiderpig… [13:14:44] or do I need to do it manually [13:15:22] (03PS1) 10Lucas Werkmeister (WMDE): Revert "aswikisource: add publisher (প্ৰকাশক) namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173381 (https://phabricator.wikimedia.org/T399269) [13:15:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173381 (https://phabricator.wikimedia.org/T399269) (owner: 10Lucas Werkmeister (WMDE)) [13:16:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80109 and previous config saved to /var/cache/conftool/dbconfig/20250728-131617-fceratto.json [13:16:32] (03Merged) 10jenkins-bot: Revert "aswikisource: add publisher (প্ৰকাশক) namespace" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173381 (https://phabricator.wikimedia.org/T399269) (owner: 10Lucas Werkmeister (WMDE)) [13:16:32] apparently T396106 is the task for teaching spiderpig how to revert changes [13:16:33] T396106: spiderpig should give the revert procedure after a canceled deployment - https://phabricator.wikimedia.org/T396106 [13:16:46] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1173381|Revert "aswikisource: add publisher (প্ৰকাশক) namespace" (T399269)]] [13:16:52] T399269: Add "প্ৰকাশক" namespace and "Edition" field in Index page for Assamese Wikisource - 
https://phabricator.wikimedia.org/T399269 [13:17:01] Lucas_WMDE: it was because 104 ns page namespace was set 104 in https://gerrit.wikimedia.org/g/operations/mediawiki-config/+blame/master/wmf-config%2FInitialiseSettings.php#1767 [13:17:23] which i didn't notice earlier [13:17:49] I see, thanks [13:17:56] that also explains why it didn’t show up in diffConfig [13:18:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1173381|Revert "aswikisource: add publisher (প্ৰকাশক) namespace" (T399269)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:19:29] now I see no diff between mwdebug and production anymore (which is the expected outcome ^^) [13:19:31] (03PS29) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [13:19:37] anzx: anything you want to test for the revert? [13:19:43] no [13:19:58] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [13:20:00] ok, then let’s sync [13:20:07] (technically I guess we could skip the sync but I’d rather let it roll out) [13:20:52] btullis: do you still need a deployment? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1173363 says it’s already merged [13:20:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P80110 and previous config saved to /var/cache/conftool/dbconfig/20250728-132056-marostegui.json [13:21:25] Lucas_WMDE: i will schedule all my patches to next window [13:21:31] ok, good luck! [13:22:11] Lucas_WMDE: I think that change might have been what Reedy was referring to earlier? 
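(Side note on the aswikisource revert above: the patch assigned ID 104 to the new namespace while 104 was already in use for the page namespace elsewhere in the config, so the change renamed a namespace instead of adding one. A check along these lines could flag that before deployment. This is only a minimal sketch: the dict-shaped inputs, the function name, and the sample data are assumptions for illustration, and it is not how mediawiki-config, diffConfig, or scap validate namespaces.)

```python
def find_namespace_collisions(existing, proposed):
    """Return (ns_id, old_name, new_name) for every proposed ID that is already taken.

    existing: dict of namespace ID -> name already assigned on the wiki
    proposed: dict of namespace ID -> name the patch wants to add
    Both shapes are assumptions made for this sketch.
    """
    collisions = []
    for ns_id, new_name in sorted(proposed.items()):
        old_name = existing.get(ns_id)
        # A collision is an ID that already exists under a different name.
        if old_name is not None and old_name != new_name:
            collisions.append((ns_id, old_name, new_name))
    return collisions


# Placeholder data: 104 mirrors the ID from the discussion, 200 is a made-up free ID.
existing = {104: "Page"}
proposed = {104: "Publisher", 200: "Example"}

for ns_id, old_name, new_name in find_namespace_collisions(existing, proposed):
    print(f"namespace {ns_id} already assigned to {old_name!r}; patch would rename it to {new_name!r}")
```

(A reused ID with no alias for the old name is the case worth failing on, since existing titles in that namespace would otherwise change prefix.)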
[13:22:31] ah, indeed [13:22:44] then I guess you’re next once the current deploy finishes ^^ [13:23:00] I’d probably do the two config changes together first, and in the meantime start the gate-and-submit for the backport already [13:23:25] (03PS1) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1173388 (https://phabricator.wikimedia.org/T395742) [13:23:31] sounds good. As mentioned above, there is nothing to test for the config changes [13:24:22] (03CR) 10Majavah: [C:04-1] "let's wait and see if https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173352 works first?" [puppet] - 10https://gerrit.wikimedia.org/r/1173388 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [13:25:14] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173381|Revert "aswikisource: add publisher (প্ৰকাশক) namespace" (T399269)]] (duration: 08m 28s) [13:25:19] T399269: Add "প্ৰকাশক" namespace and "Edition" field in Index page for Assamese Wikisource - https://phabricator.wikimedia.org/T399269 [13:25:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [13:25:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171548 (https://phabricator.wikimedia.org/T400070) (owner: 10Michael Große) [13:26:18] also, I’ll just quickly plug my wikimedia-debug-diff script, which is how I noticed the problem with the aswikisource namespace change: https://github.com/lucaswerkmeister/home/blob/main/.bashrc.d/wikimedia-debug-diff [13:27:09] (03Merged) 10jenkins-bot: Growth: enable new way of refreshing LinkRecommendations for more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164287 
(https://phabricator.wikimedia.org/T386250) (owner: 10Michael Große) [13:27:12] (03Merged) 10jenkins-bot: Echo: be explicit about special wikis using Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171548 (https://phabricator.wikimedia.org/T400070) (owner: 10Michael Große) [13:27:24] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]] [13:27:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173220 (https://phabricator.wikimedia.org/T400369) (owner: 10Michael Große) [13:27:32] FIRING: [4x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:33] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [13:27:33] T392944: Enable the iterative way of refreshing LinkRecommendations for more wikis - https://phabricator.wikimedia.org/T392944 [13:27:33] T400070: Clean up Echo setting on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400070 [13:29:02] (03CR) 10Andrew Bogott: "The same lock-up has been happening in codfw1dev where there's a lot less traffic. 
So I suspect this is a leak and not just a limit that's" [puppet] - 10https://gerrit.wikimedia.org/r/1173388 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [13:29:21] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:30:09] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Continuing with sync [13:31:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80112 and previous config saved to /var/cache/conftool/dbconfig/20250728-133124-fceratto.json [13:31:40] Lucas_WMDE: once you're done, please let me know, my patch is somewhat important [13:31:48] ok [13:35:31] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164287|Growth: enable new way of refreshing LinkRecommendations for more wikis (T386250 T392944)]], [[gerrit:1171548|Echo: be explicit about special wikis using Wikipedia logo (T400070)]] (duration: 08m 06s) [13:35:39] T386250: Rewrite refreshLinkRecommendations to not iterate through article topics - https://phabricator.wikimedia.org/T386250 [13:35:39] T392944: Enable the iterative way of refreshing LinkRecommendations for more wikis - https://phabricator.wikimedia.org/T392944 [13:35:40] T400070: Clean up Echo setting on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400070 [13:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P80113 and previous config saved to /var/cache/conftool/dbconfig/20250728-133604-marostegui.json [13:36:58] FIRING: 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:37:45] @Amir1 could you specify "somewhat important"? The GrowthExperiments backport is only "Mentors getting pinged despite explicitly opting out"-important, but not "the stability of the wikis is at stake"-important, but it will take a while to merge due to i18n changes. Should we move your change in front of that? [13:39:01] (03PS8) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [13:39:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173220 (https://phabricator.wikimedia.org/T400369) (owner: 10Michael Große) [13:39:27] it can wait a bit, so far things are not down but yes, it's about stability: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1173371 [13:39:55] (03Merged) 10jenkins-bot: fix: avoid using wikitext that triggers ping notifications [extensions/GrowthExperiments] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173220 (https://phabricator.wikimedia.org/T400369) (owner: 10Michael Große) [13:40:11] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1173220|fix: avoid using wikitext that triggers ping notifications (T400369)]] [13:40:16] T400369: Do not mention mentors while they are away - https://phabricator.wikimedia.org/T400369 [13:41:06] Ok, thanks :) [13:41:15] (03CR) 10Anzx: throttle: add rules for Wikimania 2025 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 
(https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [13:42:10] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "should be okay to deploy later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [13:44:12] (03PS1) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) [13:44:12] (03CR) 10Arnaudb: [C:04-2] "no submission before the switchover" [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [13:44:15] (03PS1) 10Cyndywikime: Add GetStartedNotification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) [13:45:30] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:46:05] (03Abandoned) 10Arnaudb: mariadb: add instance metric polling [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091190 (https://phabricator.wikimedia.org/T376596) (owner: 10Arnaudb) [13:46:05] (03Abandoned) 10Arnaudb: mysql: add port number to MysqlClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100114 (https://phabricator.wikimedia.org/T381086) (owner: 10Arnaudb) [13:46:05] (03Abandoned) 10Arnaudb: peopleweb: disable envoy request timeout, enable log [puppet] - 10https://gerrit.wikimedia.org/r/1112205 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [13:46:05] (03Abandoned) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb) [13:46:06] (03Abandoned) 10Arnaudb: sre.mysql.upgrade: Switch to Host, apt-get and mysql helpers [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) 
[13:46:09] (03Abandoned) 10Arnaudb: dbtools: command line helper to evaluate a host, or a group of hosts [software] - 10https://gerrit.wikimedia.org/r/1091250 (https://phabricator.wikimedia.org/T378715) (owner: 10Arnaudb) [13:46:13] (03Abandoned) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [13:46:17] (03Abandoned) 10Arnaudb: mariadb: basic script to analyse general-log-file [software] - 10https://gerrit.wikimedia.org/r/1092832 (https://phabricator.wikimedia.org/T377451) (owner: 10Arnaudb) [13:46:21] (03Abandoned) 10Arnaudb: mariadb: productionize db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1075108 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [13:46:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T399728)', diff saved to https://phabricator.wikimedia.org/P80114 and previous config saved to /var/cache/conftool/dbconfig/20250728-134632-fceratto.json [13:46:37] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:46:51] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [13:47:33] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [13:48:04] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [13:48:45] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11039317 (10elukey) I fixed the above problems (see WIP patch), but then new ones popped up: ` {'error': {'@Message.ExtendedInfo': [{'Message': 'Invalid Attribute was '... 
[13:49:04] hm, not sure what to make of scap [13:49:14] it started build-and-push-container-images at 13:40:53 [13:49:26] and according to the log file, it’s been running docker-pusher since 13:43:12 [13:49:44] so… it built the image in 2¼ minutes, but then has been spending over 5 minutes pushing it? [13:49:49] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:49:58] (03CR) 10Btullis: [V:03+1 C:03+2] Disable all dumps timers on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:50:14] the docker-pusher process is still running, at least… [13:51:05] mh, no idea where in that new process the re-building of the language-cache is going to factor in [13:51:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T399249)', diff saved to https://phabricator.wikimedia.org/P80115 and previous config saved to /var/cache/conftool/dbconfig/20250728-135111-marostegui.json [13:51:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:51:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [13:51:47] (03CR) 10Xcollazo: [C:03+1] "CC @btullis@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1173359 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup) [13:51:56] ok, it finally made more progress [13:52:03] “Waiting 300 seconds for swift after full mediawiki image build (T390251)” oof [13:52:03] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:53:13] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [13:53:21] (03PS1) 10Bking: Revert^2 "dse-k8s: Add 
dse-k8s-codfw k8s configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1173399 [13:53:38] (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "dse-k8s: Add dse-k8s-codfw k8s configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1173399 (owner: 10Bking) [13:57:13] (03PS3) 10Anzx: aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173397 (https://phabricator.wikimedia.org/T399269) [13:58:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173397 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [14:00:37] looks like deploying the image is just as slow as building it 😬 [14:02:21] 4½ minutes to deploy it to 12 k8s testservers [14:04:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1173220|fix: avoid using wikitext that triggers ping notifications (T400369)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:04:23] T400369: Do not mention mentors while they are away - https://phabricator.wikimedia.org/T400369 [14:04:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:04:29] MichaelG_WMF: please test ^^ [14:04:41] will do [14:07:19] Looks good for me! [14:07:34] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, migr: Continuing with sync [14:07:35] @Lucas_WMDE ready to move forward [14:07:35] ok, thanks! 
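(Side note on the scap timings above: the phase durations can be worked out from the timestamps quoted in the channel. A minimal sketch, assuming all timestamps fall on the same day; the helper name and the milestone labels are made up for illustration, and the times are taken from the messages and log lines above, not from scap's own logs.)

```python
from datetime import datetime


def elapsed(start, end):
    """Return the time between two same-day HH:MM:SS stamps as 'Xm YYs'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds:02d}s"


# Timestamps as quoted in the channel for the GrowthExperiments backport.
milestones = [
    ("scap sync-world started", "13:40:11"),
    ("build-and-push-container-images started", "13:40:53"),
    ("docker-pusher started", "13:43:12"),
    ("synced to testservers", "14:04:17"),
]

# Print the duration of each phase between consecutive milestones.
for (prev_label, prev_time), (label, time) in zip(milestones, milestones[1:]):
    print(f"{prev_label} -> {label}: {elapsed(prev_time, time)}")
```

(The image build phase comes out at 2m 19s, which matches the "2¼ minutes" estimate; everything from the start of docker-pusher to the testserver sync takes 21m 05s, including the 300-second swift wait.)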
[14:11:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11039370 (10fnegri) I think the partman recipe is incompatible with the new servers, I'll look into it. [14:12:42] Amir1: maybe you already want to +2 your backport and kick off another gate-and-submit build? [14:12:51] (https://spiderpig.wikimedia.org/jobs/353 will still take a couple of minutes to finish) [14:16:36] Sure [14:17:18] (03CR) 10Ladsgroup: [C:03+2] "again" [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [14:17:56] (03PS30) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [14:20:04] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173220|fix: avoid using wikitext that triggers ping notifications (T400369)]] (duration: 39m 52s) [14:20:09] T400369: Do not mention mentors while they are away - https://phabricator.wikimedia.org/T400369 [14:20:10] wooh [14:20:22] backport+config window remains open for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1173371 [14:20:24] jouncebot: nowandnext [14:20:24] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [14:20:24] In 0 hour(s) and 9 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1430) [14:20:25] Lucas_WMDE: Thank you! [14:20:28] Amir1: over to you [14:20:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253#11039411 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm looks like errors ceased after cleaning. no increments since friday. 
[14:22:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:26:11] CI already failed 😩 [14:26:39] (03CR) 10CI reject: [V:04-1] objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [14:26:55] Amir1: ^ and again… [14:29:55] :/ [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1430) [14:30:16] (03CR) 10Ladsgroup: [C:03+2] "..." [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [14:30:44] backport+config window still ongoing [14:31:34] (03PS1) 10Btullis: Upgrade the flink-operator CRDs to match the upstream resease v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173403 (https://phabricator.wikimedia.org/T398162) [14:35:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11039493 (10jhathaway) p:05Triage→03Medium [14:35:48] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: DiskSpace (instance netbox-dev2003:9100) - https://phabricator.wikimedia.org/T400601#11039496 (10cmooney) a:03cmooney [14:36:19] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: DiskSpace (instance netbox-dev2003:9100) - https://phabricator.wikimedia.org/T400601#11039497 (10cmooney) p:05Triage→03Medium [14:37:37] (03CR) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [14:37:47] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetmaster1001:9100) - 
https://phabricator.wikimedia.org/T400603#11039502 (10elukey) 05Open→03Resolved a:03elukey ` elukey@puppetmaster1001:~$ sudo puppet cert destroy cloudcephosd20... [14:39:56] (03PS1) 10Fabfur: aptrepo: adding component/golang to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1173406 (https://phabricator.wikimedia.org/T400620) [14:43:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173406 (https://phabricator.wikimedia.org/T400620) (owner: 10Fabfur) [14:43:36] (03CR) 10Ssingh: [C:03+1] aptrepo: adding component/golang to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1173406 (https://phabricator.wikimedia.org/T400620) (owner: 10Fabfur) [14:44:20] (03Merged) 10jenkins-bot: objectcache: Only clean a subset of tables in SqlBagOStuff [core] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173371 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [14:44:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1173371|objectcache: Only clean a subset of tables in SqlBagOStuff (T398806)]] [14:44:58] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [14:48:53] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:48:53] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1173371|objectcache: Only clean a subset of tables in SqlBagOStuff (T398806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:50:11] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [14:51:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:52:10] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2038 [14:52:19] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2038 [14:52:35] (03PS1) 10Btullis: Update flink-operator helm chart to match the upstream release 1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) [14:53:04] (03PS2) 10Btullis: Update flink-operator helm chart to match the upstream release v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) [14:54:57] (03CR) 10Dzahn: "looks good. How did you acquire the IP? Was it marked in netbox as reserved/placeholder?" [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:57:17] (03CR) 10Dzahn: "aware of the "bind service IP: false" thing above? Seems like another change will be needed when to flip this to true." [puppet] - 10https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:57:22] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173371|objectcache: Only clean a subset of tables in SqlBagOStuff (T398806)]] (duration: 12m 30s) [14:57:28] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [14:59:07] 06SRE, 10Hiddenparma, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11039625 (10Joe) Given we recognize that enforcing our policy might cause disruption, we will proceed with care. First of all,** toolsforge is excluded from the block** for the... 
[14:59:18] !log UTC afternoon backport+config window done [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:12] (03CR) 10Fabfur: [C:03+2] aptrepo: adding component/golang to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1173406 (https://phabricator.wikimedia.org/T400620) (owner: 10Fabfur) [15:04:30] (03CR) 10Ottomata: [C:03+1] Upgrade the flink-operator CRDs to match the upstream resease v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173403 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [15:05:17] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link errors: ssw1-d1-codfw <-> ssw1-f1-codfw - https://phabricator.wikimedia.org/T400253#11039647 (10cmooney) Awesome, thank you! [15:06:38] (03CR) 10Ottomata: [C:03+1] Update flink-operator helm chart to match the upstream release v1.12 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173407 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [15:07:47] (03PS3) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) [15:07:48] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [15:07:56] (03PS4) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) [15:08:54] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11039668 (10Jhancock.wm) @Marostegui es2038 is moved, 
updated, and powered up! for es2039. it's not gonna fit cleanly into our racking scheme. There isn't a... [15:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10MinT: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626 (10Jclark-ctr) 03NEW [15:23:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10MinT: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11039727 (10Jclark-ctr) [15:23:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11039740 (10Jclark-ctr) 05Open→03Resolved I am closing out this ticket and opening second ticket T400626 for the remaining two servers since power will not be... 
[15:25:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10MinT: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11039754 (10Jclark-ctr) [15:27:07] (03PS5) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [15:27:37] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1530). [15:34:23] (03PS1) 10BryanDavis: shellbox: Bump image versions to 025-07-28-151806 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173411 (https://phabricator.wikimedia.org/T383018) [15:34:25] !log dancy@deploy1003 Installing scap version "4.192.0" for 180 host(s) [15:34:36] (03CR) 10CI reject: [V:04-1] shellbox: Bump image versions to 025-07-28-151806 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173411 (https://phabricator.wikimedia.org/T383018) (owner: 10BryanDavis) [15:34:59] elukey@cumin1003 provision (PID 3777865) is awaiting input [15:36:01] (03PS3) 10Ahmon Dancy: deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) [15:36:27] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#11039785 (10ABran-WMF) a:03Dzahn [15:36:34] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11039789 (10Jhancock.wm) note to myself. These must be racked in DH7 cage. 
[15:38:56] !log dancy@deploy1003 Installation of scap version "4.192.0" completed for 180 hosts [15:40:00] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173411 (https://phabricator.wikimedia.org/T383018) (owner: 10BryanDavis) [15:41:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2145.codfw.wmnet with reason: Maintenance [15:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T399249)', diff saved to https://phabricator.wikimedia.org/P80118 and previous config saved to /var/cache/conftool/dbconfig/20250728-154114-marostegui.json [15:41:20] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:50:41] (03CR) 10BryanDavis: [C:03+2] shellbox: Bump image versions to 025-07-28-151806 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173411 (https://phabricator.wikimedia.org/T383018) (owner: 10BryanDavis) [15:52:58] (03Merged) 10jenkins-bot: shellbox: Bump image versions to 025-07-28-151806 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173411 (https://phabricator.wikimedia.org/T383018) (owner: 10BryanDavis) [15:54:18] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#11039851 (10Dzahn) I revoked https://phabricator.wikimedia.org/auth/sshkey/view/516/ [16:00:32] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:00:53] (03PS6) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [16:00:56] (03CR) 10Ahmon Dancy: "Ready for SRE review/merge" [puppet] - 
10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [16:02:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:04:23] I have a new shellbox container for SyntaxHighlight, so I will be deploying updated shellbox containers everywhere soon. [16:07:12] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [16:07:47] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:07:53] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:08:11] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:08:17] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:08:53] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:08:59] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:09:32] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:09:38] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:10:04] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:10:10] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:10:46] elukey@cumin1003 provision (PID 3780801) is awaiting input [16:11:05] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:11:27] (03PS7) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [16:11:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:11:52] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:11:59] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [16:12:52] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [16:12:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:12:58] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:13:31] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:13:37] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:14:13] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:14:19] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:15:04] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:15:11] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [16:15:34] (03PS18) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) [16:15:41] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [16:15:48] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:16:24] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [16:16:41] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet 
with OS bullseye [16:16:55] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:18:06] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:18:51] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:18:57] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:19:19] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:19:25] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:20:01] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:20:07] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:20:40] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:20:47] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:21:15] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [16:21:21] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:22:19] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:23:45] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [16:24:04] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS bullseye [16:24:14] (03PS1) 10Elukey: DNM: reimage - test for new cp nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1173418 (https://phabricator.wikimedia.org/T392851) [16:24:59] (03PS2) 10Elukey: DNM: reimage - test for new cp nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1173418 (https://phabricator.wikimedia.org/T392851) [16:25:18] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye 
[16:28:01] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [16:29:34] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-07-24-232055-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173420 (https://phabricator.wikimedia.org/T400395) [16:30:00] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11039946 (10elukey) Last one: ` UEFI0417: When TLS is enabled, insecure HTTP boot without TLS is not allowed. It is recommended to use HTTP boot over TLS for better sec... [16:31:54] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-07-24-232055-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173420 (https://phabricator.wikimedia.org/T400395) (owner: 10BryanDavis) [16:33:37] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-07-24-232055-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173420 (https://phabricator.wikimedia.org/T400395) (owner: 10BryanDavis) [16:34:05] (03PS8) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) [16:34:05] (03PS3) 10Elukey: DNM: reimage - test for new cp nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1173418 (https://phabricator.wikimedia.org/T392851) [16:34:23] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:35:37] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [16:35:59] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [16:36:16] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [16:36:47] !log bd808@deploy1003 
helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [16:37:08] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [16:37:24] (03PS1) 10Kimberly Sarabia: Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) [16:37:38] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [16:37:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [16:43:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:44:34] (03CR) 10Dzahn: [C:03+1] "I asked if Jasmine would like to deploy this, with no rush at all. Based on the last time we touched redirects.dat." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [16:44:35] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [16:45:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:46:13] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11039994 (10BTullis) [16:46:25] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:47:08] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.26 - 2025.08.15), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11040026 (10BTullis) [16:47:28] (03PS1) 10CDanis: rudimentary support for decoding sha256: tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1173425 [16:48:46] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11040072 (10BTullis) [16:50:06] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [16:50:11] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): SSD firmware update for an-mariadb100[1-2] - https://phabricator.wikimedia.org/T394498#11040102 (10BTullis) [16:51:58] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#11040142 (10BTullis) [16:53:36] (03CR) 10CI reject: [V:04-1] rudimentary support for decoding sha256: tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1173425 (owner: 10CDanis) [16:55:02] (03PS1) 10Vgutierrez: traffic: Fix HaproxyKafkaNoMessages 
alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) [16:55:14] (03CR) 10Clément Goubert: "Can I trouble you to add httpbb testing for these URLs, at least for deployment time, we can probably remove them afterwards. It'll help m" [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [16:55:48] (03PS2) 10Kimberly Sarabia: Enable AA test on 50 wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) [16:57:53] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171705 (owner: 10PipelineBot) [16:57:56] (03PS4) 10Clément Goubert: deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [16:57:57] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [16:58:15] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11040242 (10elukey) I was able to reach Debian Install and this is what I got in the Partitions disks step: ` ┌───────────────────────┤ [!!] Partition disks ├───────... [16:59:28] (03CR) 10Dzahn: [C:03+1] "oh, yes, totally fair. I should add the tests. coming up soon." [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1700) [17:00:04] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T1700). 
nyaa~ [17:01:21] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11040320 (10MatthewVernon) >>! In T400567#11038949, @GPSLeo wrote: > As there are likely many more of these cases is there a possibility to s... [17:06:12] (03PS3) 10Dzahn: redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) [17:06:32] (03CR) 10Dzahn: "tests added" [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [17:07:35] (03CR) 10Clément Goubert: [C:03+2] deployment_server: Add pretrain systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1172678 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [17:10:54] (03PS2) 10Vgutierrez: traffic: Fix HaproxyKafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) [17:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:24] (03CR) 10Thcipriani: "You'll need to add the https://github.com/sourcegraph/zoekt code to this repo for it to find these files. Docker/blubber is trying to find" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:16:23] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#11040354 (10Dzahn) >>! In T177826#10030822, @hashar wrote: > For production I think that comes from: >... 
[17:24:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11040371 (10BTullis) [17:26:15] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11040378 (10Jhancock.wm) @elukey host was found powered off. pulled the power and then restarted. [17:27:17] RESOLVED: ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:02] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11040388 (10BTullis) [17:28:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11040390 (10BTullis) [17:29:27] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11040394 (10ssingh) >>! In T392851#11040242, @elukey wrote: > I was able to reach Debian Install and this is what I got in the Partitions disks step: > > ` > ┌──────... 
[17:34:08] (03PS1) 10Cathal Mooney: Add reverse delegations for codfw K8s dse ranges [dns] - 10https://gerrit.wikimedia.org/r/1173433 (https://phabricator.wikimedia.org/T400037) [17:35:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T399249)', diff saved to https://phabricator.wikimedia.org/P80119 and previous config saved to /var/cache/conftool/dbconfig/20250728-173548-marostegui.json [17:35:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:37:46] (03CR) 10VolkerE: [C:03+1] Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [17:37:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/5 (Transport: cr2-eqiad:xe-1/0/1:0 (Arelion, IC-314533 24ms 10Gbps wave) {#11371}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:41:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11040440 (10Jhancock.wm) i also checked the settings and it _looks_ okay. I'll try running the imaging script this afternoon and see what happens. 
[17:42:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/5 (Transport: cr2-eqiad:xe-1/0/1:0 (Arelion, IC-314533 24ms 10Gbps wave) {#11371}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:45:26] !log depooling cp4037 to upgrade to latest haproxykafka version (0.3.11) (T400620) [17:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:31] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [17:46:15] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [17:50:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2150.codfw.wmnet with reason: Maintenance [17:50:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P80120 and previous config saved to /var/cache/conftool/dbconfig/20250728-175055-marostegui.json [17:50:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T399728)', diff saved to https://phabricator.wikimedia.org/P80121 and previous config saved to /var/cache/conftool/dbconfig/20250728-175056-fceratto.json [17:51:04] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:51:07] !log repooling cp4037 after upgrading haproxykafka (T400620) [17:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:12] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [17:51:45] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [17:53:55] 
!log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T399728)', diff saved to https://phabricator.wikimedia.org/P80122 and previous config saved to /var/cache/conftool/dbconfig/20250728-175354-fceratto.json [17:54:27] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [dns] - 10https://gerrit.wikimedia.org/r/1173433 (https://phabricator.wikimedia.org/T400037) (owner: 10Cathal Mooney) [17:56:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 150094 MB (3% inode=99%): /var/lib/hadoop/data/e 162150 MB (4% inode=99%): /var/lib/hadoop/data/f 154719 MB (4% inode=99%): /var/lib/hadoop/data/b 159167 MB (4% inode=99%): /var/lib/hadoop/data/g 157651 MB (4% inode=99%): /var/lib/hadoop/data/d 152263 MB (4% inode=99%): /var/lib/hadoop/data/j 159001 MB (4% inode=99%): /var/lib/hadoop/data [17:56:14] 6 MB (4% inode=99%): /var/lib/hadoop/data/h 160883 MB (4% inode=99%): /var/lib/hadoop/data/l 158679 MB (4% inode=99%): /var/lib/hadoop/data/k 155482 MB (4% inode=99%): /var/lib/hadoop/data/m 152178 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [18:06:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P80123 and previous config saved to /var/cache/conftool/dbconfig/20250728-180603-marostegui.json [18:09:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P80124 and previous config saved to /var/cache/conftool/dbconfig/20250728-180901-fceratto.json [18:13:47] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [18:14:50] !log depooling cp4037 to upgrade new haproxykafka version (T400620) [18:14:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:55] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [18:15:15] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#11040552 (10BCornwall) It seems to be working for me: ` [~]$ curl -s https://pywikipedia.org -o/dev/null -w 'status: %{http_code}, location: %{redirect_url}\n' status: 301, locati... [18:15:29] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [18:15:46] !log repooled cp4037 (T400620) [18:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172711 (https://phabricator.wikimedia.org/T400510) (owner: 10Arlolra) [18:21:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T399249)', diff saved to https://phabricator.wikimedia.org/P80125 and previous config saved to /var/cache/conftool/dbconfig/20250728-182111-marostegui.json [18:21:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:21:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:21:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T399249)', diff saved to https://phabricator.wikimedia.org/P80126 and previous config saved to /var/cache/conftool/dbconfig/20250728-182134-marostegui.json [18:22:29] !log haproxykafka 0.3.11 uploaded to apt repo (T400620) [18:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:35] T400620: Can't build haproxykafka package anymore - 
https://phabricator.wikimedia.org/T400620 [18:24:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P80127 and previous config saved to /var/cache/conftool/dbconfig/20250728-182409-fceratto.json [18:28:15] (03PS1) 10DDesouza: Undeploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) [18:32:09] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637 (10RobH) 03NEW [18:32:11] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638 (10RobH) 03NEW [18:33:47] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11040677 (10RobH) [18:34:03] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11040680 (10RobH) [18:34:26] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11040686 (10RobH) [18:34:56] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11040688 (10RobH) [18:35:19] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11040692 (10RobH) [18:36:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [18:36:59] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11040704 (10RobH) a:03joanna_borun Joanna, As the order approval is still pending on order task 
T398372, I've gone ahead and filed the racking task here and now it needs to be updated by som... [18:37:46] 10ops-eqiad, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11040707 (10RobH) a:03joanna_borun Joanna, As the order approval is still pending on order task T398373, I've gone ahead and filed the racking task here and now it needs to be updated by som... [18:39:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T399728)', diff saved to https://phabricator.wikimedia.org/P80128 and previous config saved to /var/cache/conftool/dbconfig/20250728-183916-fceratto.json [18:39:23] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:39:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2159.codfw.wmnet with reason: Maintenance [18:39:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80129 and previous config saved to /var/cache/conftool/dbconfig/20250728-183939-fceratto.json [18:42:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80130 and previous config saved to /var/cache/conftool/dbconfig/20250728-184232-fceratto.json [18:53:43] (03PS1) 10Ahmon Dancy: data.yaml: Allow release-engineering to administer pretrain timer [puppet] - 10https://gerrit.wikimedia.org/r/1173446 (https://phabricator.wikimedia.org/T398873) [18:55:50] (03CR) 10Ssingh: [C:03+1] "Nice find and explains the case of the missing metrics." 
[alerts] - 10https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) (owner: 10Vgutierrez) [18:57:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P80131 and previous config saved to /var/cache/conftool/dbconfig/20250728-185739-fceratto.json [18:59:14] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11040781 (10bking) @Jhancock.wm I think you meant to ping me instead of Jesse ;). No worries either way though, I... [19:08:03] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [19:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:12:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P80132 and previous config saved to /var/cache/conftool/dbconfig/20250728-191247-fceratto.json [19:16:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 151440 MB (4% inode=99%): /var/lib/hadoop/data/e 161328 MB (4% inode=99%): /var/lib/hadoop/data/f 159658 MB (4% inode=99%): /var/lib/hadoop/data/b 160863 MB (4% inode=99%): /var/lib/hadoop/data/g 157723 MB (4% inode=99%): /var/lib/hadoop/data/d 156289 MB (4% inode=99%): /var/lib/hadoop/data/j 159400 MB (4% inode=99%): /var/lib/hadoop/data [19:16:14] 4 MB (4% inode=99%): /var/lib/hadoop/data/h 159600 MB (4% 
inode=99%): /var/lib/hadoop/data/l 159777 MB (4% inode=99%): /var/lib/hadoop/data/k 157886 MB (4% inode=99%): /var/lib/hadoop/data/m 150072 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [19:17:01] !log updating haproxykafka to v0.3.11 on A:cp-ulsfo (T400620) [19:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:06] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [19:27:36] !log restarting haproxykafka on A:cp-ulsfo (T400620) [19:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:41] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [19:27:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80133 and previous config saved to /var/cache/conftool/dbconfig/20250728-192754-fceratto.json [19:28:01] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:28:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2168.codfw.wmnet with reason: Maintenance [19:28:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80134 and previous config saved to /var/cache/conftool/dbconfig/20250728-192817-fceratto.json [19:31:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80135 and previous config saved to /var/cache/conftool/dbconfig/20250728-193110-fceratto.json [19:39:17] (03CR) 10Krinkle: [C:03+1] redirects: update SVN rewrite rules, do not link to 
Phabricator anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [19:46:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P80136 and previous config saved to /var/cache/conftool/dbconfig/20250728-194618-fceratto.json [19:46:42] !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@b89eed0] (releasing): Disabling the MediaWiki publish WMF single-version image job (T398873) [19:46:47] T398873: Move nightly image build from releases-jenkins to deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T398873 [19:47:13] (03PS2) 10Acamicamacaraca: Localize mk.wikibooks sitename and metanamespace name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) [19:47:30] !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@b89eed0] (releasing): Disabling the MediaWiki publish WMF single-version image job (T398873) (duration: 01m 11s) [19:48:14] (03PS1) 10Fabfur: haproxykafka: adding watchdog timeout option [puppet] - 10https://gerrit.wikimedia.org/r/1173459 (https://phabricator.wikimedia.org/T400199) [19:50:28] (03PS3) 10Acamicamacaraca: Localize mk.wikibooks sitename and metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) [19:51:41] (03PS2) 10CDanis: rudimentary support for decoding sha256: tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1173425 (https://phabricator.wikimedia.org/T397696) [19:53:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [19:53:39] (03CR) 10CI reject: [V:04-1] rudimentary 
support for decoding sha256: tags [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1173425 (https://phabricator.wikimedia.org/T397696) (owner: 10CDanis) [19:59:21] (03PS2) 10Dr0ptp4kt: WIP DNM: Access to airflow-platform-eng [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T2000). [20:00:04] bwang, anzx, kimberly_sarabia, cscott, danisztls, and Aca: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] (03PS3) 10Dr0ptp4kt: Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:00:26] hey [20:00:33] o/ [20:00:35] *waves* [20:00:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173459 (https://phabricator.wikimedia.org/T400199) (owner: 10Fabfur) [20:01:01] (03CR) 10CI reject: [V:04-1] Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [20:01:18] i'm here, and i can spiderpig [20:01:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P80137 and previous config saved to /var/cache/conftool/dbconfig/20250728-200125-fceratto.json [20:01:46] o/ [20:03:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11041020 (10dr0ptp4kt) 05Resolved→03Open [20:05:42] (03PS4) 10Dr0ptp4kt: Add access for platform engineering Airflow 
and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:06:24] (03CR) 10CI reject: [V:04-1] Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [20:07:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11041031 (10dr0ptp4kt) Hi @ssingh , over in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1165605 I added the `analytics-privatedata-users` piece for this ticket, as we... [20:10:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T399249)', diff saved to https://phabricator.wikimedia.org/P80138 and previous config saved to /var/cache/conftool/dbconfig/20250728-201013-marostegui.json [20:10:19] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:11:18] it will be my 1st time but I can self-deploy my change [20:12:18] (03PS5) 10Dr0ptp4kt: Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:12:59] (03CR) 10CI reject: [V:04-1] Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [20:13:13] sorry to be late - if anyone needs a deployer, feel free to ping me - otherwise self-deployers, please self-deploy [20:13:24] looks like bwang is first on the list? [20:13:35] but i haven't seen them here? [20:13:41] (03PS6) 10Dr0ptp4kt: Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:14:05] anzx is next, and they are here. anzx do you need cjming's help to deploy? [20:14:16] bwang is legit. 
[20:14:32] cjming I might prefer your deploy, as I haven't done self-deployments earlier. [20:14:36] oh, you mean they haven't arrive yet.. gotcha [20:14:59] yes need cjming deploy [20:15:00] yeah [20:15:32] ok - i'll start with anzx's patches then if that's ok [20:15:48] (03PS7) 10Dr0ptp4kt: Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) [20:15:58] 👍 [20:16:07] (03PS9) 10Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) [20:16:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80139 and previous config saved to /var/cache/conftool/dbconfig/20250728-201633-fceratto.json [20:16:39] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:16:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2182.codfw.wmnet with reason: Maintenance [20:16:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80140 and previous config saved to /var/cache/conftool/dbconfig/20250728-201655-fceratto.json [20:17:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) (owner: 10Anzx) [20:17:18] (03PS2) 10Fabfur: haproxykafka: adding watchdog timeout option [puppet] - 10https://gerrit.wikimedia.org/r/1173459 (https://phabricator.wikimedia.org/T400199) [20:18:10] (03Merged) 10jenkins-bot: throttle: add rules for Wikimania 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276) 
(owner: 10Anzx) [20:18:27] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1172659|throttle: add rules for Wikimania 2025 (T400276)]] [20:18:32] T400276: Request for IP throttling exemption for Novotel and Trademark/Tribe Venues – Wikimania 2025 - https://phabricator.wikimedia.org/T400276 [20:19:47] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11041086 (10ttaylor) - ssh ttaylor@bast1003.wikimedia.org succeeded - ssh ttaylor@stat1008.eqiad.wmnet succeeded - kinit succeeded. I set a password... [20:19:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80141 and previous config saved to /var/cache/conftool/dbconfig/20250728-201950-fceratto.json [20:20:24] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1172659|throttle: add rules for Wikimania 2025 (T400276)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:20:29] cjming: nothing to test, please continue [20:20:40] ack [20:21:20] 06SRE, 10SRE-Access-Requests: Requesting access to SSH login to analytics clients with Hadoop access for ttaylor - https://phabricator.wikimedia.org/T400277#11041090 (10jhathaway) 05Open→03Resolved a:03jhathaway great, please re-open if you have any issues [20:21:42] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:22:05] (03PS1) 10Dzahn: ci: remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) [20:22:11] (03PS4) 10Anzx: mnwwiktionary: update reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) [20:22:31] (03CR) 10CI reject: [V:04-1] ci: remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [20:22:45] (03CR) 10Clare Ming: [C:03+1] Enable AA test on 50 wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [20:23:40] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: DiskSpace (instance netbox-dev2003:9100) - https://phabricator.wikimedia.org/T400601#11041104 (10cmooney) I freed a few gigs on this by running `apt clean` and `journalctl --vacuum-time=10d` to remove logs older than 10 days. ` root@net... [20:25:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P80142 and previous config saved to /var/cache/conftool/dbconfig/20250728-202521-marostegui.json [20:26:16] anzx: do you want to address Lucas's comment on your next patch? 
[20:27:06] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172659|throttle: add rules for Wikimania 2025 (T400276)]] (duration: 08m 39s) [20:27:06] cjming: i will add old task numbers [20:27:12] T400276: Request for IP throttling exemption for Novotel and Trademark/Tribe Venues – Wikimania 2025 - https://phabricator.wikimedia.org/T400276 [20:27:16] cool - thanks [20:28:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11041114 (10CDobbins) 05Open→03In progress p:05Triage→03Medium a:03CDobbins [20:28:42] (03PS5) 10Anzx: mnwwiktionary: update reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) [20:29:51] (03PS6) 10Anzx: mnwwiktionary: update reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) [20:30:01] (03CR) 10Anzx: mnwwiktionary: update reconstruction namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) (owner: 10Anzx) [20:30:09] cjming: done [20:30:15] great [20:30:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) (owner: 10Anzx) [20:31:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11041123 (10CDobbins) @Ottomata or @Ahoelzl would either of you mind approving @SD0001's request? 
Please let me know if there's something I neglected to add or update [20:31:32] (03Merged) 10jenkins-bot: mnwwiktionary: update reconstruction namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172862 (https://phabricator.wikimedia.org/T400441) (owner: 10Anzx) [20:31:44] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1172862|mnwwiktionary: update reconstruction namespace (T400441)]] [20:31:49] T400441: Update Reconstruction namespace name in Mon Wiktionary - https://phabricator.wikimedia.org/T400441 [20:31:51] dancy: if you're around - i'm trying to ssh into the maintenance server to run the maintenance script for adding namespaces but i'm getting a public key denied - what is the current server? [20:32:34] (03CR) 10Sohom Datta: [C:03+1] "LGTM, (Aca asked me to take a look/double check on Discord!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [20:33:00] cjming: I'm no dancy but I believe the maintenance servers are no more per wikitech-l last week. [20:33:12] oh! thanks thcipriani [20:33:19] so no need to run those anymore? [20:33:24] we can now run them from the deploy server https://wikitech.wikimedia.org/wiki/Maintenance_scripts [20:33:39] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1172862|mnwwiktionary: update reconstruction namespace (T400441)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:33:40] @kimberly_sarabia was next I think, unless bwang has appeared? [20:33:54] im here [20:34:08] no such luck :) the scripts are still needed, but how we run them is a little different. [20:34:10] oh, i might have jumped the gun, the anzx backport is still running [20:34:26] ya - one more after this [20:34:45] thcipriani: gtk - thanks for the info! 
[20:34:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P80143 and previous config saved to /var/cache/conftool/dbconfig/20250728-203458-fceratto.json [20:35:18] cjming: looks good [20:35:31] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:35:40] looks like I need to update the Backport Deployers page, but namespaceDupes is correct: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes [20:36:06] * thcipriani edits [20:36:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 149855 MB (3% inode=99%): /var/lib/hadoop/data/e 157737 MB (4% inode=99%): /var/lib/hadoop/data/f 158233 MB (4% inode=99%): /var/lib/hadoop/data/b 158188 MB (4% inode=99%): /var/lib/hadoop/data/g 156114 MB (4% inode=99%): /var/lib/hadoop/data/d 156343 MB (4% inode=99%): /var/lib/hadoop/data/j 159605 MB (4% inode=99%): /var/lib/hadoop/data [20:36:14] 4 MB (4% inode=99%): /var/lib/hadoop/data/h 158244 MB (4% inode=99%): /var/lib/hadoop/data/l 156463 MB (4% inode=99%): /var/lib/hadoop/data/k 159687 MB (4% inode=99%): /var/lib/hadoop/data/m 149608 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [20:36:28] * cjming bows to thcipriani [20:38:06] (03PS1) 10DLynch: Tone check: don't cause an error when the model fails [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173470 [20:38:09] (03PS4) 10Anzx: aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173397 (https://phabricator.wikimedia.org/T399269) [20:38:14] (03PS1) 10DLynch: Edit check: skip collapsed ranges when computing modified content branch nodes [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173471 
(https://phabricator.wikimedia.org/T400573) [20:39:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173470 (owner: 10DLynch) [20:39:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173471 (https://phabricator.wikimedia.org/T400573) (owner: 10DLynch) [20:39:55] (I know, I'm jumping in late. I'll just keep an eye until others are done.) [20:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P80144 and previous config saved to /var/cache/conftool/dbconfig/20250728-204028-marostegui.json [20:40:40] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Wed 13 Aug 2025 08:40:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:40:53] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172862|mnwwiktionary: update reconstruction namespace (T400441)]] (duration: 09m 08s) [20:40:58] T400441: Update Reconstruction namespace name in Mon Wiktionary - https://phabricator.wikimedia.org/T400441 [20:41:37] oops, accidentally disconnected [20:42:02] just to note, my patch would also require namespaceDupes.php [20:43:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173397 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [20:43:08] Aca: ack [20:43:54] (03Merged) 10jenkins-bot: aswikisource: add publisher (প্ৰকাশক) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173397 (https://phabricator.wikimedia.org/T399269) (owner: 10Anzx) [20:44:09] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1173397|aswikisource: add publisher (প্ৰকাশক) namespace (T399269)]] [20:44:15] T399269: Add "প্ৰকাশক" namespace and "Edition" field in Index page for Assamese Wikisource - https://phabricator.wikimedia.org/T399269 [20:46:12] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1173397|aswikisource: add publisher (প্ৰকাশক) namespace (T399269)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:46:34] whoever knows - what does the '+' before the wiki name mean again in core-Namespaces.php? [20:47:07] it's at the top of one of the files I think? [20:47:30] cjming: looks good [20:47:45] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:47:54] i can just run `mwscript-k8s --comment=T399269 --follow -- namespaceDupes aswikisource --fix | tee ~/T399269` for patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1173397 right? 
[20:47:54] cjming: at the top of InitialiseSettings.php [20:47:57] + adds to config instead of overriding it [20:47:59] means the setting is added to an existing global array [20:48:11] ah that's right - thanks [20:48:30] (overriding what's set for the various db lists the wiki is in, that is) [20:48:34] right, only typically useful for array-valued configuration, which is why you don't see it as often [20:50:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P80145 and previous config saved to /var/cache/conftool/dbconfig/20250728-205005-fceratto.json [20:50:07] cjming: that command looks right, if you run without --fix it'll give you a bit of a report [20:50:24] right on - thx [20:53:06] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173397|aswikisource: add publisher (প্ৰকাশক) namespace (T399269)]] (duration: 08m 56s) [20:53:11] T399269: Add "প্ৰকাশক" namespace and "Edition" field in Index page for Assamese Wikisource - https://phabricator.wikimedia.org/T399269 [20:55:12] thcipriani: i wonder if i screwed up the maint script for mnwwiktionary -- at the end of it, i got "oh noeees" instead of "Looks good!" [20:55:23] should i re-run? 
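(Editor's sketch, not part of the channel traffic.) The 20:47-20:48 answers about the '+' prefix in InitialiseSettings.php can be illustrated with a toy model. The function name `resolve_setting`, the example setting, and its values are all hypothetical, and the merge rule here is a simplification written for illustration; it is not the real MediaWiki configuration code, which handles more cases.

```python
# Toy model only: NOT the real MediaWiki SiteConfiguration code.
def resolve_setting(values, wiki, dblists):
    """Return the effective list for `wiki` from one array-valued setting."""
    # A plain wiki key overrides everything the wiki would otherwise inherit.
    if wiki in values:
        return list(values[wiki])

    # Otherwise inherit from the first matching dblist, else from "default".
    inherited = list(values.get("default", []))
    for name in dblists:
        if name in values:
            inherited = list(values[name])
            break

    # A "+wiki" key adds its entries to the inherited list instead of replacing it.
    extra = values.get("+" + wiki, [])
    return inherited + [item for item in extra if item not in inherited]


# Hypothetical setting and values, for illustration only.
example = {
    "default": ["Main"],
    "wikisource": ["Main", "Index"],
    "+aswikisource": ["Publisher"],
    "mnwwiktionary": ["Reconstruction"],
}
```

In this model, `resolve_setting(example, "aswikisource", ["wikisource"])` keeps the `wikisource` entries and appends the wiki's own, whereas the plain `mnwwiktionary` key replaces whatever would have been inherited. That matches the remark that '+' is mostly useful for array-valued configuration.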
[20:55:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T399249)', diff saved to https://phabricator.wikimedia.org/P80146 and previous config saved to /var/cache/conftool/dbconfig/20250728-205536-marostegui.json [20:55:42] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:55:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [20:55:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2153.codfw.wmnet with reason: Maintenance [20:55:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T399249)', diff saved to https://phabricator.wikimedia.org/P80147 and previous config saved to /var/cache/conftool/dbconfig/20250728-205559-marostegui.json [20:56:04] kimberly_sarabia: doing your backport now [20:56:17] cjming: ty! [20:56:46] cjming: wanna make a phab paste and we can check? 
[20:57:15] it might just mean there were some conflicts that couldn't be auto-resolved (and will have to be manually fixed by the folks on-wiki) [20:57:19] (03Merged) 10jenkins-bot: Enable AA test on 50 wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1173421 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [20:57:20] thcipriani: `mwscript-k8s --comment=T400441 --follow -- namespaceDupes mnwwiktionary --fix | tee ~/T400441` [20:57:20] T400441: Update Reconstruction namespace name in Mon Wiktionary - https://phabricator.wikimedia.org/T400441 [20:57:21] ...maybe [20:57:32] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1173421|Enable AA test on 50 wikis (T399486)]] [20:57:37] T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486 [20:58:03] thcipriani: not sure what you mean by phab paste -- but the cmd above is what i ran on the deployment server [20:58:42] oh, I meant the output: cat ~/T400441 | phaste [20:59:18] thcipriani: https://phabricator.wikimedia.org/P80149 [20:59:28] !log cjming@deploy1003 ksarabia, cjming: Backport for [[gerrit:1173421|Enable AA test on 50 wikis (T399486)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:00:04] Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T2100). [21:00:04] kimberly_sarabia: ok to sync? [21:00:26] security deployers: ok for the current backport window to go over? 
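(Editor's sketch, not part of the channel traffic.) The namespaceDupes invocations quoted above share one shape, and the only difference between a report-only run and a fixing run is the `--fix` flag. The helper below just assembles that command string; the function name `namespace_dupes_command` is hypothetical, and only the flags that appear verbatim in the log (`--comment`, `--follow`, `--fix`) are used. It does not run anything.

```python
import shlex


def namespace_dupes_command(task, wiki, fix=False):
    """Build the mwscript-k8s command line quoted in the log, as a string."""
    argv = [
        "mwscript-k8s",
        f"--comment={task}",  # task ID recorded with the run, as in the log
        "--follow",
        "--",
        "namespaceDupes",
        wiki,
    ]
    if fix:
        # Without --fix the script only reports, per the 20:50:07 message.
        argv.append("--fix")
    return shlex.join(argv)


# Report first, then the fixing run, for the wiki discussed in the log.
report_cmd = namespace_dupes_command("T400441", "mnwwiktionary")
fix_cmd = namespace_dupes_command("T400441", "mnwwiktionary", fix=True)
```

Piping the output through `tee ~/T400441`, as done in the channel, keeps a copy that can later be pasted for review.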
[21:00:52] cjming: yup [21:00:55] !log cjming@deploy1003 ksarabia, cjming: Continuing with sync [21:02:37] Reedy, sbassett, Maryum: there are 3 more config patches left in the queue of the previous backport window - please lmk if it's ok to proceed or if you need the window [21:03:01] I believe Scott and Maryum are planning on deploying [21:03:14] But not sure if they're around just yet, so you're probably ok to continue for now [21:03:42] cool - thx [21:04:21] cscott: as soon as Kim's patch is done, please go ahead with your patch - i'm also happy to do it for you if you like [21:04:37] i like running the spiderpig :) [21:04:43] lol - me too [21:05:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80150 and previous config saved to /var/cache/conftool/dbconfig/20250728-210513-fceratto.json [21:05:18] i should say, i like running the spiderpig when i know I've got Real Deployers (tm) around who could step in to fix things if something did go off the rails. [21:05:21] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:05:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [21:06:12] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173421|Enable AA test on 50 wikis (T399486)]] (duration: 08m 39s) [21:06:18] T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486 [21:06:21] cscott: all you [21:06:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet with reason: Maintenance [21:06:57] ok! 
[21:07:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172711 (https://phabricator.wikimedia.org/T400510) (owner: 10Arlolra) [21:07:22] cjming: ah, I see, there were three pages with conflicts, I added your paste to the task for followup. [21:07:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2208.codfw.wmnet with reason: Maintenance [21:07:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T399728)', diff saved to https://phabricator.wikimedia.org/P80151 and previous config saved to /var/cache/conftool/dbconfig/20250728-210738-fceratto.json [21:08:02] cjming: Are we good to start the sec deployment window? We have one kinda quick-merge config patch to get out first :) [21:08:20] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 39 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172711 (https://phabricator.wikimedia.org/T400510) (owner: 10Arlolra) [21:08:34] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1172711|Deploy Parsoid Read Views to 39 Wikipedias (T400510)]] [21:08:36] sbassett: as soon as cscott is done, we can pass over to you [21:08:39] T400510: Parsoid Read Views to Wikipedia deploy ~2025-07-23 - https://phabricator.wikimedia.org/T400510 [21:09:05] sbassett: sorry about that - thought we could squeeze one more in [21:09:28] No prob, just let us know [21:10:32] !log cscott@deploy1003 arlolra, cscott: Backport for [[gerrit:1172711|Deploy Parsoid Read Views to 39 Wikipedias (T400510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:10:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T399728)', diff saved to https://phabricator.wikimedia.org/P80152 and previous config saved to /var/cache/conftool/dbconfig/20250728-211034-fceratto.json [21:10:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [21:10:43] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:11:39] sbassett: should i close the late UTC backport window after cscott's patch? presumably you have more to do than the quick config patch? [21:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:54] cjming: that’d be ideal [21:13:02] tho I don’t think we’ll need more than an hour [21:13:13] alrighty - np - thanks for your patience [21:13:36] !log cscott@deploy1003 arlolra, cscott: Continuing with sync [21:13:41] tested, looks good, continuing [21:13:55] danisztls, Aca: sorry for the news - i'm going to have to close the late UTC backport window after this last patch [21:14:39] i feel bad about squeezing danisztls' patch out of the window, since i did that last week as well :( [21:14:39] No prob, i'm rescheduling my patch for another window [21:14:41] cjming: no problem, it is Monday and it was a busy window [21:14:42] cjming: can you try `mwscript-k8s --comment=T400441 --follow -- namespaceDupes mnwwiktionary --fix --add-prefix=T400441 | tee ~/T400441` for mnwiktionary [21:14:43] T400441: 
Update Reconstruction namespace name in Mon Wiktionary - https://phabricator.wikimedia.org/T400441 [21:15:17] anzx: sure - 1 sec [21:15:35] cscott: no problem, this is a diff patch btw :) [21:16:20] cjming: was namespacedupes.php run on aswikisource [21:16:28] (03PS1) 10SBassett: SECURITY: Fix stored i18n XSS through href attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173481 [21:16:29] anzx: yes [21:16:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [21:18:58] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172711|Deploy Parsoid Read Views to 39 Wikipedias (T400510)]] (duration: 10m 23s) [21:19:03] T400510: Parsoid Read Views to Wikipedia deploy ~2025-07-23 - https://phabricator.wikimedia.org/T400510 [21:19:25] sbassett: all yours [21:19:28] ok, i'm done [21:19:32] have fun [21:19:38] scott handing off to scott [21:19:53] ^ :) [21:19:54] !log end of UTC late backport window [21:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:23] anzx: ran your script for mnwiktionary and it worked fine [21:20:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173481 (owner: 10SBassett) [21:20:40] cjming: Thank you [21:20:49] np! 
[21:21:14] (Merged) jenkins-bot: SECURITY: Fix stored i18n XSS through href attributes [mediawiki-config] - https://gerrit.wikimedia.org/r/1173481 (owner: SBassett)
[21:21:30] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1173481|SECURITY: Fix stored i18n XSS through href attributes]]
[21:23:29] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1173481|SECURITY: Fix stored i18n XSS through href attributes]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:24:25] !log sbassett@deploy1003 sbassett: Continuing with sync
[21:25:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P80153 and previous config saved to /var/cache/conftool/dbconfig/20250728-212541-fceratto.json
[21:29:55] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173481|SECURITY: Fix stored i18n XSS through href attributes]] (duration: 08m 24s)
[21:37:30] (PS4) Dzahn: redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846)
[21:38:43] sbassett: Are you all done with security patches, such that I could squeeze in my patches from the earlier backport window?
[21:40:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P80154 and previous config saved to /var/cache/conftool/dbconfig/20250728-214049-fceratto.json
[21:41:10] (CR) BCornwall: [C:+2] ncredir: Redirect wikipedialibrary.org [puppet] - https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: BCornwall)
[21:43:20] kemayo: about to wrap up security patches now
[21:43:25] should be 10 minutes or less
[21:43:56] No rush. I'll just be stealing the Web window, which should be uncontroversial since that team doesn't even exist now. :D
[21:49:17] excellent
[21:50:44] (PS1) CDobbins: admiin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374)
[21:51:10] (PS2) CDobbins: admin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374)
[21:55:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T399728)', diff saved to https://phabricator.wikimedia.org/P80155 and previous config saved to /var/cache/conftool/dbconfig/20250728-215556-fceratto.json
[21:56:03] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[21:56:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2218.codfw.wmnet with reason: Maintenance
[21:56:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80156 and previous config saved to /var/cache/conftool/dbconfig/20250728-215619-fceratto.json
[21:58:09] running scap sync-world
[21:58:34] (PS3) CDobbins: admin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374)
[21:59:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80157 and previous config saved to /var/cache/conftool/dbconfig/20250728-215914-fceratto.json
[22:01:08] (CR) Dzahn: "CCed users: if you think you should keep access please reach out to us" [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:01:44] (CR) SBassett: "Relevant bug: T400501" [mediawiki-config] - https://gerrit.wikimedia.org/r/1173481 (owner: SBassett)
[22:04:51] (PS4) CDobbins: admin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374)
[22:05:33] (CR) CI reject: [V:-1] admin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:06:59] (CR) Dzahn: "users who are removed need to be moved to the special "absent" group for former users at the top of the file" [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:08:17] (CR) Dzahn: admin: remove prod access for listed users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:08:33] (CR) BCornwall: admin: remove prod access for listed users (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:10:59] !log security deploy for multiple patches including T400526 T395858 T400500 T400545
[22:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:27] (CR) Dzahn: admin: remove prod access for listed users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:11:48] (CR) Dzahn: redirects: update SVN rewrite rules, do not link to Phabricator anymore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: Dzahn)
[22:14:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P80158 and previous config saved to /var/cache/conftool/dbconfig/20250728-221421-fceratto.json
[22:14:32] Kemayo enjoy your deploy
[22:14:41] maryum: Thanks!
[22:15:19] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173470 (owner: DLynch)
[22:15:19] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173471 (https://phabricator.wikimedia.org/T400573) (owner: DLynch)
[22:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T399249)', diff saved to https://phabricator.wikimedia.org/P80159 and previous config saved to /var/cache/conftool/dbconfig/20250728-222212-marostegui.json
[22:22:18] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:26:21] (Merged) jenkins-bot: Tone check: don't cause an error when the model fails [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173470 (owner: DLynch)
[22:26:22] (Merged) jenkins-bot: Edit check: skip collapsed ranges when computing modified content branch nodes [extensions/VisualEditor] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1173471 (https://phabricator.wikimedia.org/T400573) (owner: DLynch)
[22:28:06] Huh, I hadn't seen a spiderpig deploy just outright fail before.
[22:28:14] (CR) Matanya: [C:+1] admin: remove prod access for listed users [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:29:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P80160 and previous config saved to /var/cache/conftool/dbconfig/20250728-222929-fceratto.json
[22:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P80161 and previous config saved to /var/cache/conftool/dbconfig/20250728-223720-marostegui.json
[22:40:06] (CR) Jalexander: [C:+1] "fine with removal, noted on the phab that potential use down the road but can cross that bridge if we come to it." [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins)
[22:42:57] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1173470|Tone check: don't cause an error when the model fails]], [[gerrit:1173471|Edit check: skip collapsed ranges when computing modified content branch nodes (T400573)]]
[22:43:03] T400573: Tone check runs on adjacent paragraphs - https://phabricator.wikimedia.org/T400573
[22:44:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T399728)', diff saved to https://phabricator.wikimedia.org/P80162 and previous config saved to /var/cache/conftool/dbconfig/20250728-224436-fceratto.json
[22:44:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[22:44:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2221.codfw.wmnet with reason: Maintenance
[22:45:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80163 and previous config saved to /var/cache/conftool/dbconfig/20250728-224459-fceratto.json
[22:45:13] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1173470|Tone check: don't cause an error when the model fails]], [[gerrit:1173471|Edit check: skip collapsed ranges when computing modified content branch nodes (T400573)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:45:52] ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661 (RobH) NEW
[22:45:57] ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11041469 (RobH)
[22:46:44] ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11041473 (RobH) a:jasmine_ @jasimine_, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and ad...
[22:47:12] !log kemayo@deploy1003 kemayo: Continuing with sync
[22:47:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80164 and previous config saved to /var/cache/conftool/dbconfig/20250728-224754-fceratto.json
[22:52:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P80165 and previous config saved to /var/cache/conftool/dbconfig/20250728-225227-marostegui.json
[22:52:29] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173470|Tone check: don't cause an error when the model fails]], [[gerrit:1173471|Edit check: skip collapsed ranges when computing modified content branch nodes (T400573)]] (duration: 09m 31s)
[22:52:37] T400573: Tone check runs on adjacent paragraphs - https://phabricator.wikimedia.org/T400573
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250728T2300)
[23:03:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P80166 and previous config saved to /var/cache/conftool/dbconfig/20250728-230302-fceratto.json
[23:07:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T399249)', diff saved to https://phabricator.wikimedia.org/P80167 and previous config saved to /var/cache/conftool/dbconfig/20250728-230735-marostegui.json
[23:07:41] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[23:07:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2170.codfw.wmnet with reason: Maintenance
[23:07:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T399249)', diff saved to https://phabricator.wikimedia.org/P80168 and previous config saved to /var/cache/conftool/dbconfig/20250728-230758-marostegui.json
[23:08:03] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[23:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:18:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P80169 and previous config saved to /var/cache/conftool/dbconfig/20250728-231810-fceratto.json
[23:33:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80170 and previous config saved to /var/cache/conftool/dbconfig/20250728-233317-fceratto.json
[23:33:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[23:33:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2222.codfw.wmnet with reason: Maintenance
[23:33:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T399728)', diff saved to https://phabricator.wikimedia.org/P80171 and previous config saved to /var/cache/conftool/dbconfig/20250728-233340-fceratto.json
[23:36:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399728)', diff saved to https://phabricator.wikimedia.org/P80172 and previous config saved to /var/cache/conftool/dbconfig/20250728-233635-fceratto.json
[23:39:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:39:22] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:39:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:40:12] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:51:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P80173 and previous config saved to /var/cache/conftool/dbconfig/20250728-235143-fceratto.json