[00:06:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P80174 and previous config saved to /var/cache/conftool/dbconfig/20250729-000651-fceratto.json
[00:21:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T399728)', diff saved to https://phabricator.wikimedia.org/P80175 and previous config saved to /var/cache/conftool/dbconfig/20250729-002159-fceratto.json
[00:22:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[00:22:05] brb
[00:33:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T399249)', diff saved to https://phabricator.wikimedia.org/P80176 and previous config saved to /var/cache/conftool/dbconfig/20250729-003325-marostegui.json
[00:33:32] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[00:48:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P80177 and previous config saved to /var/cache/conftool/dbconfig/20250729-004833-marostegui.json
[01:03:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P80178 and previous config saved to /var/cache/conftool/dbconfig/20250729-010340-marostegui.json
[01:11:41] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:18:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T399249)', diff saved to https://phabricator.wikimedia.org/P80179 and previous config saved to /var/cache/conftool/dbconfig/20250729-011848-marostegui.json
[01:18:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:19:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2173.codfw.wmnet with reason: Maintenance
[01:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T399249)', diff saved to https://phabricator.wikimedia.org/P80180 and previous config saved to /var/cache/conftool/dbconfig/20250729-011911-marostegui.json
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0200)
[02:45:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T399249)', diff saved to https://phabricator.wikimedia.org/P80181 and previous config saved to /var/cache/conftool/dbconfig/20250729-024525-marostegui.json
[02:45:32] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0300)
[03:00:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P80182 and previous config saved to /var/cache/conftool/dbconfig/20250729-030033-marostegui.json
[03:03:06] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.12 refs T396373
[03:03:11] T396373: 1.45.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T396373
[03:06:26] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:04] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[03:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:15:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P80183 and previous config saved to /var/cache/conftool/dbconfig/20250729-031540-marostegui.json
[03:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T399249)', diff saved to https://phabricator.wikimedia.org/P80184 and previous config saved to /var/cache/conftool/dbconfig/20250729-033048-marostegui.json
[03:30:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[03:31:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2174.codfw.wmnet with reason: Maintenance
[03:31:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T399249)', diff saved to https://phabricator.wikimedia.org/P80185 and previous config saved to /var/cache/conftool/dbconfig/20250729-033111-marostegui.json
[03:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[03:48:55] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.12 refs T396373 (duration: 45m 49s)
[03:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0400)
[04:02:02] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.9 (duration: 01m 56s)
[04:26:18] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (11609 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:56:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T399249)', diff saved to https://phabricator.wikimedia.org/P80186 and previous config saved to /var/cache/conftool/dbconfig/20250729-045619-marostegui.json
[04:56:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[05:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P80187 and previous config saved to /var/cache/conftool/dbconfig/20250729-051127-marostegui.json
[05:11:54] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11041748 (Marostegui) >>! In T399927#11039668, @Jhancock.wm wrote: > @Marostegui es2038 is moved, updated, and powered up! Thank you! > > for es2039. it...
[05:12:05] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11041749 (Marostegui)
[05:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P80188 and previous config saved to /var/cache/conftool/dbconfig/20250729-052634-marostegui.json
[05:28:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P80189 and previous config saved to /var/cache/conftool/dbconfig/20250729-052843-root.json
[05:41:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T399249)', diff saved to https://phabricator.wikimedia.org/P80190 and previous config saved to /var/cache/conftool/dbconfig/20250729-054142-marostegui.json
[05:41:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[05:41:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2176.codfw.wmnet with reason: Maintenance
[05:42:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T399249)', diff saved to https://phabricator.wikimedia.org/P80191 and previous config saved to /var/cache/conftool/dbconfig/20250729-054206-marostegui.json
[05:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80192 and previous config saved to /var/cache/conftool/dbconfig/20250729-054349-root.json
[05:49:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[05:52:45] checking
[05:55:59] yep a little elevated 5xx for swift indeed
[05:58:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80193 and previous config saved to /var/cache/conftool/dbconfig/20250729-055855-root.json
[05:59:40] not really sure what's the best next action here, Emperor maybe ?
[05:59:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0600).
[06:02:30] godog: lets see how it unfolds if the alert is resolved
[06:03:06] indeed, I'll keep looking in the mean time
[06:07:10] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11041786 (ayounsi) You can probably skip 2039 for now and jump to 2040 until we figure out what's best for 2039. For 2039 as they're still using the row w...
[06:09:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:12:51] TIL the kafkatee -> webrequest writing to logstash and files is broken
[06:13:50] since may 20th to be exact
[06:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80194 and previous config saved to /var/cache/conftool/dbconfig/20250729-061401-root.json
[06:14:12] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11041789 (Marostegui) >>! In T399927#11041786, @ayounsi wrote: > You can probably skip 2039 for now and jump to 2040 until we figure out what's best for 20...
[06:16:26] (PS1) Jelto: gitlab: make sure config backup is scheduled before data backup [puppet] - https://gerrit.wikimedia.org/r/1173614 (https://phabricator.wikimedia.org/T400252)
[06:22:43] (PS1) Marostegui: mariadb: Add db126[0-3] [puppet] - https://gerrit.wikimedia.org/r/1173619 (https://phabricator.wikimedia.org/T400214)
[06:25:36] (CR) Marostegui: [C:+2] mariadb: Add db126[0-3] [puppet] - https://gerrit.wikimedia.org/r/1173619 (https://phabricator.wikimedia.org/T400214) (owner: Marostegui)
[06:26:23] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11041797 (Marostegui)
[06:26:32] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11041798 (Marostegui) Patches are done
[06:28:47] (PS1) Marostegui: site.pp: Add reference task [puppet] - https://gerrit.wikimedia.org/r/1173622 (https://phabricator.wikimedia.org/T400213)
[06:29:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80195 and previous config saved to /var/cache/conftool/dbconfig/20250729-062907-root.json
[06:29:16] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1173622 (https://phabricator.wikimedia.org/T400213) (owner: Marostegui)
[06:29:19] (CR) Marostegui: [C:+2] site.pp: Add reference task [puppet] - https://gerrit.wikimedia.org/r/1173622 (https://phabricator.wikimedia.org/T400213) (owner: Marostegui)
[06:31:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:35:10] (PS1) Marostegui: db1202: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173624 (https://phabricator.wikimedia.org/T399955)
[06:35:50] (CR) Marostegui: [C:+2] db1202: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1173624 (https://phabricator.wikimedia.org/T399955) (owner: Marostegui)
[06:36:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[06:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1202 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P80196 and previous config saved to /var/cache/conftool/dbconfig/20250729-063657-marostegui.json
[06:39:33] (CR) Arnaudb: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1173614 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[06:43:34] (PS2) Arnaudb: gerrit: add service ip address for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833)
[06:44:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P80197 and previous config saved to /var/cache/conftool/dbconfig/20250729-064405-root.json
[06:44:20] (CR) Arnaudb: "good catch! thanks! it's been fixed in the next PS" [puppet] - https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[06:51:44] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32390538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:56:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32390538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P80199 and previous config saved to /var/cache/conftool/dbconfig/20250729-065910-root.json
[06:59:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 28089952 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T0700). nyaa~
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:52] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5138328 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[07:05:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T399249)', diff saved to https://phabricator.wikimedia.org/P80200 and previous config saved to /var/cache/conftool/dbconfig/20250729-070549-marostegui.json
[07:05:55] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:07:06] (PS1) Bartosz Wójtowicz: ml-services: Increase `max_replicas` to 3 for edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1173796 (https://phabricator.wikimedia.org/T400606)
[07:08:04] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:09:34] (CR) Jelto: [C:+2] gitlab: make sure config backup is scheduled before data backup [puppet] - https://gerrit.wikimedia.org/r/1173614 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[07:09:53] !log upgrading haproxykafka to 0.3.11 on A:cp (T400620)
[07:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:59] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620
[07:14:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P80201 and previous config saved to /var/cache/conftool/dbconfig/20250729-071418-root.json
[07:16:53] * Emperor arrives
[07:19:47] (CR) Arnaudb: "As discussed off band this have been a conversation on #wikimedia-sre-foundation with @ayounsi@wikimedia.org and @cmooney@wikimedia.org (h" [dns] - https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[07:19:57] godog: I'm not sure what "overflow" means in envoy graph terms; but the odd spikes in 5xx from earlier look like the sort that's traffic related
[07:20:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P80202 and previous config saved to /var/cache/conftool/dbconfig/20250729-072057-marostegui.json
[07:21:03] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11041918 (elukey) I am wondering if the issue is related to how the current recipe defines the EFI partition: ` d-i partman-auto/expert_recipe string \ multiraid ::...
[07:21:12] Emperor: ack, thank you that makes sense
[07:21:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[07:21:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:21:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T399728)', diff saved to https://phabricator.wikimedia.org/P80203 and previous config saved to /var/cache/conftool/dbconfig/20250729-072153-fceratto.json
[07:21:59] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[07:24:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399728)', diff saved to https://phabricator.wikimedia.org/P80204 and previous config saved to /var/cache/conftool/dbconfig/20250729-072445-fceratto.json
[07:26:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:27:06] (PS3) Vgutierrez: traffic: Fix HaproxyKafkaNoMessages alerts [alerts] - https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039)
[07:27:19] (CR) Vgutierrez: traffic: Fix HaproxyKafkaNoMessages alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) (owner: Vgutierrez)
[07:29:12] (PS4) Effie Mouzeli: dsh: remove testservers from scap destinations 1 [puppet] - https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498)
[07:29:17] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[07:29:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P80205 and previous config saved to /var/cache/conftool/dbconfig/20250729-072924-root.json
[07:31:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32390538 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:33:28] (CR) Fabfur: [C:+1] traffic: Fix HaproxyKafkaNoMessages alerts [alerts] - https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) (owner: Vgutierrez)
[07:35:48] (CR) Vgutierrez: [C:+2] traffic: Fix HaproxyKafkaNoMessages alerts [alerts] - https://gerrit.wikimedia.org/r/1173427 (https://phabricator.wikimedia.org/T400039) (owner: Vgutierrez)
[07:36:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P80206 and previous config saved to /var/cache/conftool/dbconfig/20250729-073604-marostegui.json
[07:37:00] (PS1) Effie Mouzeli: etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498)
[07:38:20] !log haproxykafka upgraded to 0.3.11 on A:cp (T400620)
[07:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:25] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620
[07:39:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P80207 and previous config saved to /var/cache/conftool/dbconfig/20250729-073953-fceratto.json
[07:44:10] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2006.codfw.wmnet with reason: host reimage
[07:44:10] (CR) Marostegui: [C:+1] "The views will need to be dropped too." [puppet] - https://gerrit.wikimedia.org/r/1173359 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup)
[07:44:20] (PS1) Elukey: install_server: fix hwraid-1dev-nvme and modify boss_leavelvm.cfg [puppet] - https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044)
[07:46:31] (PS2) Elukey: install_server: fix hwraid-1dev-nvme and modify boss_leavelvm.cfg [puppet] - https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044)
[07:46:50] (CR) Vgutierrez: [C:-1] haproxykafka: adding alert for unexpected restarts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur)
[07:48:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2006.codfw.wmnet with reason: host reimage
[07:50:18] (PS1) Effie Mouzeli: conftool-data: remove testservers [puppet] - https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498)
[07:50:36] (PS2) Effie Mouzeli: conftool-data: remove testservers 3 [puppet] - https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498)
[07:51:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T399249)', diff saved to https://phabricator.wikimedia.org/P80208 and previous config saved to /var/cache/conftool/dbconfig/20250729-075112-marostegui.json
[07:51:18] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:51:20] (CR) Ozge: [C:+1] ml-services: Increase `max_replicas` to 3 for edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1173796 (https://phabricator.wikimedia.org/T400606) (owner: Bartosz Wójtowicz)
[07:51:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2188.codfw.wmnet with reason: Maintenance
[07:51:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T399249)', diff saved to https://phabricator.wikimedia.org/P80209 and previous config saved to /var/cache/conftool/dbconfig/20250729-075135-marostegui.json
[07:53:50] (PS1) Filippo Giunchedi: kafkatee: fix webrequest input [puppet] - https://gerrit.wikimedia.org/r/1173878 (https://phabricator.wikimedia.org/T371366)
[07:55:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P80210 and previous config saved to /var/cache/conftool/dbconfig/20250729-075500-fceratto.json
[07:55:13] (CR) AikoChou: [C:+1] ml-services: Increase `max_replicas` to 3 for edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1173796 (https://phabricator.wikimedia.org/T400606) (owner: Bartosz Wójtowicz)
[08:01:20] (CR) Jelto: "one question in line, also `hieradata/hosts/gerrit2003.yaml` has to be double checked maybe?" [puppet] - https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[08:03:06] (PS1) Effie Mouzeli: profile::hcaptcha::proxy: update upstream url [puppet] - https://gerrit.wikimedia.org/r/1173879 (https://phabricator.wikimedia.org/T399211)
[08:06:06] (PS1) Giuseppe Lavagetto: Introduce selectors [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1173880 (https://phabricator.wikimedia.org/T399058)
[08:06:17] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11041974 (elukey) Found the correct recipe, this is the result: ` elukey@sretest2006:~$ sudo fdisk -l Disk /dev/nvme0n1: 447.07 GiB, 480036519936 bytes, 93757132...
[08:06:18] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Introduce selectors [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1173880 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto)
[08:07:15] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[08:09:16] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Introduce selectors - oblivian@cumin1003"
[08:09:18] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce selectors - oblivian@cumin1003
[08:09:49] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003"
[08:09:49] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2006.codfw.wmnet with OS bookworm
[08:09:49] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce selectors - oblivian@cumin1003
[08:09:50] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Introduce selectors - oblivian@cumin1003"
[08:10:03] (CR) Bartosz Wójtowicz: [C:+2] ml-services: Increase `max_replicas` to 3 for edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1173796 (https://phabricator.wikimedia.org/T400606) (owner: Bartosz Wójtowicz)
[08:10:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399728)', diff saved to https://phabricator.wikimedia.org/P80211 and previous config saved to /var/cache/conftool/dbconfig/20250729-081007-fceratto.json
[08:10:16] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:10:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[08:10:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T399728)', diff saved to https://phabricator.wikimedia.org/P80212 and previous config saved to /var/cache/conftool/dbconfig/20250729-081033-fceratto.json
[08:11:52] (Merged) jenkins-bot: ml-services: Increase `max_replicas` to 3 for edit-check on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1173796 (https://phabricator.wikimedia.org/T400606) (owner: Bartosz Wójtowicz)
[08:12:10] SRE, SRE-SLO, Traffic: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675 (fgiunchedi) NEW
[08:13:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399728)', diff saved to https://phabricator.wikimedia.org/P80213 and previous config saved to /var/cache/conftool/dbconfig/20250729-081312-fceratto.json
[08:15:12] (PS3) Arnaudb: gerrit: add service ip address for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833)
[08:15:50] (PS4) Arnaudb: gerrit: add service ip address for gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833)
[08:16:29] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:17:12] (03PS2) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) [08:18:43] (03CR) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [08:19:26] (03PS1) 10Elukey: installserver: add preseed config for sretest2009 [puppet] - 10https://gerrit.wikimedia.org/r/1173883 (https://phabricator.wikimedia.org/T396365) [08:22:30] (03CR) 10Elukey: [C:03+2] installserver: add preseed config for sretest2009 [puppet] - 10https://gerrit.wikimedia.org/r/1173883 (https://phabricator.wikimedia.org/T396365) (owner: 10Elukey) [08:28:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P80214 and previous config saved to /var/cache/conftool/dbconfig/20250729-082819-fceratto.json [08:30:35] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS bookworm [08:32:53] (03PS3) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [08:32:59] (03CR) 10Fabfur: haproxykafka: adding alert for unexpected restarts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [08:34:33] (03CR) 10CI reject: [V:04-1] haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [08:41:58] (03PS4) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [08:42:20] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2009.codfw.wmnet with reason: host 
reimage [08:43:24] (03CR) 10CI reject: [V:04-1] haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [08:43:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P80215 and previous config saved to /var/cache/conftool/dbconfig/20250729-084327-fceratto.json [08:45:44] (03PS5) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [08:46:55] (03CR) 10CI reject: [V:04-1] haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [08:48:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2009.codfw.wmnet with reason: host reimage [08:48:24] (03CR) 10Effie Mouzeli: [C:03+2] profile::hcaptcha::proxy: update upstream url [puppet] - 10https://gerrit.wikimedia.org/r/1173879 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [08:48:35] (03PS6) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [08:48:38] (03PS1) 10Giuseppe Lavagetto: deploy: install the code so that cli scripts are available [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1173889 (https://phabricator.wikimedia.org/T399058) [08:52:43] (03PS19) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) [08:52:43] (03CR) 10Tiziano Fogli: "All functions and classes are now properly documented." 
[puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [08:58:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399728)', diff saved to https://phabricator.wikimedia.org/P80216 and previous config saved to /var/cache/conftool/dbconfig/20250729-085834-fceratto.json [08:58:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:58:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1182.eqiad.wmnet with reason: Maintenance [08:58:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80217 and previous config saved to /var/cache/conftool/dbconfig/20250729-085857-fceratto.json [08:59:01] (03PS2) 10Giuseppe Lavagetto: deploy: install the code so that cli scripts are available [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1173889 (https://phabricator.wikimedia.org/T399058) [09:01:05] (03CR) 10Elukey: [C:03+1] deploy: install the code so that cli scripts are available [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1173889 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [09:01:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80218 and previous config saved to /var/cache/conftool/dbconfig/20250729-090149-fceratto.json [09:03:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:03:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:05:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T399249)', diff saved to https://phabricator.wikimedia.org/P80219 and previous config saved to /var/cache/conftool/dbconfig/20250729-090507-marostegui.json [09:05:13] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:06:46] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:07:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:07:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2009.codfw.wmnet with OS bookworm [09:11:08] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11042143 (10elukey) I was able to reimage the host: ` elukey@sretest2009:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 63G 0 63G 0% /dev tmpfs... [09:16:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P80220 and previous config saved to /var/cache/conftool/dbconfig/20250729-091656-fceratto.json [09:20:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P80221 and previous config saved to /var/cache/conftool/dbconfig/20250729-092015-marostegui.json [09:22:47] (03CR) 10Tiziano Fogli: "Overall, the patch looks good to me. 
I’ve left one comment inline that also applies to other occurrences of $labels.instance, just to impr" [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [09:23:07] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb) [09:24:15] (CR) Cyndywikime: [C:-1] "[DNM]" [mediawiki-config] - https://gerrit.wikimedia.org/r/1173396 (https://phabricator.wikimedia.org/T400048) (owner: Cyndywikime) [09:30:39] (CR) Cathal Mooney: [C:+1] "Not fully familiar with this role but the IPs are correct for the vlan and it seems sane to me." [puppet] - https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [09:32:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P80222 and previous config saved to /var/cache/conftool/dbconfig/20250729-093204-fceratto.json [09:33:38] (CR) Cathal Mooney: [C:+2] Add reverse delegations for codfw K8s dse ranges [dns] - https://gerrit.wikimedia.org/r/1173433 (https://phabricator.wikimedia.org/T400037) (owner: Cathal Mooney) [09:34:19] !log cmooney@dns2005 START - running authdns-update [09:35:09] !log cmooney@dns2005 END - running authdns-update [09:35:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P80223 and previous config saved to /var/cache/conftool/dbconfig/20250729-093523-marostegui.json [09:36:41] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: DiskSpace (instance netbox-dev2003:9100) - https://phabricator.wikimedia.org/T400601#11042209 (cmooney) Open→Resolved [09:38:44] (CR) Filippo Giunchedi: [C:+1] "LGTM for first iteration" [puppet] - https://gerrit.wikimedia.org/r/1168150 
(https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:42:41] (03PS1) 10Vgutierrez: hiera,haproxykafka: Shrink socket batch size in cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/1173900 (https://phabricator.wikimedia.org/T400199) [09:43:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173900 (https://phabricator.wikimedia.org/T400199) (owner: 10Vgutierrez) [09:45:13] (03CR) 10Clément Goubert: [C:03+2] data.yaml: Allow release-engineering to administer pretrain timer [puppet] - 10https://gerrit.wikimedia.org/r/1173446 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [09:45:23] (03PS7) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [09:45:36] (03CR) 10Fabfur: haproxykafka: adding alert for unexpected restarts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:47:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399728)', diff saved to https://phabricator.wikimedia.org/P80224 and previous config saved to /var/cache/conftool/dbconfig/20250729-094711-fceratto.json [09:47:17] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:47:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1188.eqiad.wmnet with reason: Maintenance [09:47:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T399728)', diff saved to https://phabricator.wikimedia.org/P80225 and previous config saved to /var/cache/conftool/dbconfig/20250729-094733-fceratto.json [09:48:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [09:49:13] !log elukey@cumin1003 END (ERROR) - 
Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [09:50:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399728)', diff saved to https://phabricator.wikimedia.org/P80226 and previous config saved to /var/cache/conftool/dbconfig/20250729-095009-fceratto.json [09:50:22] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye [09:50:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T399249)', diff saved to https://phabricator.wikimedia.org/P80227 and previous config saved to /var/cache/conftool/dbconfig/20250729-095030-marostegui.json [09:50:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:50:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2202.codfw.wmnet with reason: Maintenance [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:42] (03PS1) 10Brouberol: deployment_server: remove all configuration related to airflow artefict deployment [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) [09:54:43] (03PS1) 10Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [09:55:07] (03CR) 10CI reject: [V:04-1] deployment_server: remove all configuration related to airflow artefict deployment [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [09:56:00] !log 
depooling cp5027 and upgrading haproxykafka to version 0.3.12 (T400620) [09:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:05] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp5027.eqsin.wmnet [09:56:05] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:59:34] (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1173900 (https://phabricator.wikimedia.org/T400199) (owner: Vgutierrez) [09:59:35] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye [09:59:38] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11042274 (elukey) While trying a custom recipe and getting the same d-i error, I checked the d-i shell and this seems problematic: ` ~ # ls /dev/s* /dev/snapshot /de... 
[09:59:45] (03CR) 10Vgutierrez: [C:03+2] hiera,haproxykafka: Shrink socket batch size in cp5027 [puppet] - 10https://gerrit.wikimedia.org/r/1173900 (https://phabricator.wikimedia.org/T400199) (owner: 10Vgutierrez) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1000) [10:03:37] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] deploy: install the code so that cli scripts are available [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1173889 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [10:04:51] !log repooling cp5027 (T400199) - note previous ticket # (T400620) was wrong [10:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:58] T400199: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 [10:04:59] T400620: Can't build haproxykafka package anymore - https://phabricator.wikimedia.org/T400620 [10:05:02] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet [10:05:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P80228 and previous config saved to /var/cache/conftool/dbconfig/20250729-100517-fceratto.json [10:15:44] (03PS2) 10Brouberol: deployment_server: remove all config related to airflow artifact deployment [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) [10:15:44] (03PS2) 10Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [10:20:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P80230 and previous config saved to /var/cache/conftool/dbconfig/20250729-102025-fceratto.json [10:21:46] (03PS1) 10Brouberol: Remove 
an-airflow host-specific hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173911 (https://phabricator.wikimedia.org/T390941) [10:21:49] (03PS1) 10Brouberol: Remove airflow-search role [puppet] - 10https://gerrit.wikimedia.org/r/1173912 (https://phabricator.wikimedia.org/T390941) [10:21:51] (03PS1) 10Brouberol: Remove references to deprecated airflow roles [puppet] - 10https://gerrit.wikimedia.org/r/1173913 (https://phabricator.wikimedia.org/T390941) [10:22:07] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [10:24:07] (03PS1) 10Stevemunene: dse-k8s: Add dse-k8s-codfw etcd cluster configuration [puppet] - 10https://gerrit.wikimedia.org/r/1173914 (https://phabricator.wikimedia.org/T397293) [10:24:53] (03PS8) 10Fabfur: haproxykafka: adding alert for unexpected restarts [alerts] - 10https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) [10:25:44] jouncebot: nowandnext [10:25:44] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1000) [10:25:44] In 1 hour(s) and 34 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1200) [10:29:15] 06SRE, 10Hiddenparma, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11042337 (10Joe) a:03Joe [10:30:32] (03PS1) 10Ladsgroup: Reduce frequency of parsercache purge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173917 (https://phabricator.wikimedia.org/T398806) [10:33:44] (03CR) 10Ladsgroup: [C:03+2] Reduce frequency of parsercache purge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173917 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [10:34:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173917 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [10:34:56] (03Merged) 10jenkins-bot: Reduce frequency of parsercache purge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173917 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [10:35:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399728)', diff saved to https://phabricator.wikimedia.org/P80231 and previous config saved to /var/cache/conftool/dbconfig/20250729-103532-fceratto.json [10:35:36] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1173917|Reduce frequency of parsercache purge (T398806)]] [10:35:38] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:35:43] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [10:35:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1197.eqiad.wmnet with reason: Maintenance [10:35:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T399728)', diff saved to https://phabricator.wikimedia.org/P80232 and previous config saved to /var/cache/conftool/dbconfig/20250729-103555-fceratto.json [10:36:09] 06SRE: Add ability to validate JWTs in haproxy - https://phabricator.wikimedia.org/T400238#11042374 (10Vgutierrez) a:03Vgutierrez [10:37:51] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1173917|Reduce frequency of parsercache purge (T398806)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:38:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399728)', diff saved to https://phabricator.wikimedia.org/P80233 and previous config saved to /var/cache/conftool/dbconfig/20250729-103831-fceratto.json [10:38:56] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:40:30] (03PS1) 10Vgutierrez: hiera,haproxykafka: Shrink socket batch size in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1173919 (https://phabricator.wikimedia.org/T400199) [10:41:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173919 (https://phabricator.wikimedia.org/T400199) (owner: 10Vgutierrez) [10:45:03] (03PS4) 10Federico Ceratto: zarcillo: Add egress to Netbox and config-master [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) [10:45:52] (03CR) 10Fabfur: [C:03+1] hiera,haproxykafka: Shrink socket batch size in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1173919 (https://phabricator.wikimedia.org/T400199) (owner: 10Vgutierrez) [10:46:02] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173917|Reduce frequency of parsercache purge (T398806)]] (duration: 10m 26s) [10:46:03] (03PS5) 10Federico Ceratto: zarcillo: Add egress to Netbox and config-master [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) [10:46:07] T398806: Retire purge-parsercache periodic jobs - https://phabricator.wikimedia.org/T398806 [10:46:26] (03CR) 10Federico Ceratto: "Updated commit message and rebased." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [10:46:35] (03CR) 10Vgutierrez: [C:03+2] hiera,haproxykafka: Shrink socket batch size in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1173919 (https://phabricator.wikimedia.org/T400199) (owner: 10Vgutierrez) [10:46:54] (03CR) 10Clément Goubert: [C:03+1] redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [10:47:00] (03PS1) 10Sergio Gimeno: Growth: remove conditional user options for obsolete experiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 [10:47:03] (03CR) 10Federico Ceratto: zarcillo: Add egress to Netbox and config-master (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [10:48:06] 06SRE, 10Phabricator: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11042397 (10Aklapper) [10:48:10] (03CR) 10Cathal Mooney: [C:03+1] "LGTM. Need to discuss the authentication side of it but no problem to allow the traffic." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [10:48:59] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: Add egress to Netbox and config-master [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [10:49:54] !log upgrading haproxykafka to 0.3.12 on A:cp-eqsin (and applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173919 too) (T400199) [10:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:00] T400199: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 [10:53:18] (03CR) 10MVernon: [C:03+1] "I don't know if the ordering is important when listing two preseed files for a host/set of hosts; assuming that will work as I hope it wil" [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [10:53:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P80234 and previous config saved to /var/cache/conftool/dbconfig/20250729-105339-fceratto.json [10:58:05] (03PS3) 10Elukey: install_server: fix hwraid-1dev-nvme and modify boss_leavelvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) [10:58:21] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:58:36] (03CR) 10Elukey: "Rebased on top of production since another preseed change caused a conflict :)" [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [11:00:19] (03PS1) 10Ladsgroup: mediawiki: Retire purge parser cahce maint scripts [puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) [11:00:32] (03PS1) 10Elukey: redfish: expand is_uefi for Dells [software/spicerack] - 10https://gerrit.wikimedia.org/r/1173923 (https://phabricator.wikimedia.org/T392851) [11:02:50] (03CR) 10MVernon: [C:03+1] "My question / suggestion from my previous comment still applies, but +1 if you'd like to go ahead this way :)" [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [11:03:23] !log done upgrading haproxykafka to 0.3.12 on A:cp-eqsin (T400199) [11:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:29] T400199: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 [11:05:11] (03CR) 10Jforrester: Enable sitemaps API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [11:06:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2203.codfw.wmnet with reason: Maintenance [11:06:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T399249)', diff saved to https://phabricator.wikimedia.org/P80235 and previous config saved to /var/cache/conftool/dbconfig/20250729-110637-marostegui.json [11:06:43] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:08:04] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [11:08:05] (03PS2) 10Clément Goubert: mediawiki: Retire purge parser cahce maint scripts [puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:08:06] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:08:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P80236 and previous config saved to /var/cache/conftool/dbconfig/20250729-110846-fceratto.json [11:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:12:41] FIRING: [16x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:15:26] (03PS1) 10Urbanecm: Add CommunityConfigurationExample to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) [11:15:27] (03PS1) 10Urbanecm: [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) [11:15:29] (03PS1) 10Urbanecm: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) [11:15:36] (03CR) 10Urbanecm: [C:04-2] "not yet deployable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 
(https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:16:53] (03CR) 10Clément Goubert: [C:03+1] "LGTM, deployment will require, on `deploy1003`:" [puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:17:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:23:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399728)', diff saved to https://phabricator.wikimedia.org/P80237 and previous config saved to /var/cache/conftool/dbconfig/20250729-112354-fceratto.json [11:24:00] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:24:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [11:25:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1229.eqiad.wmnet with reason: Maintenance [11:25:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T399728)', diff saved to https://phabricator.wikimedia.org/P80238 and previous config saved to /var/cache/conftool/dbconfig/20250729-112515-fceratto.json [11:27:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T399728)', diff saved to https://phabricator.wikimedia.org/P80239 and previous config saved to /var/cache/conftool/dbconfig/20250729-112759-fceratto.json [11:33:15] (03CR) 10Ladsgroup: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:33:39] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Retire purge parser cahce maint scripts [puppet] - 10https://gerrit.wikimedia.org/r/1173922 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [11:40:12] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Remove tables that are not cataloged from filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [11:43:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P80240 and previous config saved to /var/cache/conftool/dbconfig/20250729-114306-fceratto.json [11:46:44] !log ladsgroup@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:48:09] !log ladsgroup@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:49:09] (03PS2) 10Ladsgroup: Drop references to flaggedrevs_tracking [puppet] - 10https://gerrit.wikimedia.org/r/1173359 (https://phabricator.wikimedia.org/T398936) [11:49:21] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Drop references to flaggedrevs_tracking [puppet] - 10https://gerrit.wikimedia.org/r/1173359 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup) [11:55:49] 06SRE, 06collaboration-services, 10Gerrit, 06Traffic: Document how to deploy changes to DNS repo without Gerrit working - https://phabricator.wikimedia.org/T336754#11042563 (10ABran-WMF) [11:56:46] !log dropping flaggedrevs tables in frwiki, bawiki, siwiki (T398944) [11:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:51] T398944: Expand the check of table catalog to detect tables where they shouldn't be - https://phabricator.wikimedia.org/T398944 [11:57:41] FIRING: [112x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [11:57:57] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:58:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P80241 and previous config saved to /var/cache/conftool/dbconfig/20250729-115814-fceratto.json [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1200) [12:01:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [12:07:45] (03PS1) 10Ladsgroup: mediawiki-dumps-legacy: Drop flaggedrevs_tracking job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) [12:13:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T399728)', diff saved to https://phabricator.wikimedia.org/P80242 and previous config saved to /var/cache/conftool/dbconfig/20250729-121321-fceratto.json [12:13:27] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:13:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1233.eqiad.wmnet with reason: Maintenance [12:13:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T399728)', diff saved to https://phabricator.wikimedia.org/P80243 and previous config saved to /var/cache/conftool/dbconfig/20250729-121343-fceratto.json [12:15:01] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173931 [12:15:02] 06SRE, 10SRE-swift-storage, 06Traffic: Cannot upload on Commons or even here - https://phabricator.wikimedia.org/T349671#11042628 
(10MatthewVernon) 05Open→03Declined [I think we're beyond the point of useful investigation of this particular incident, and generally uploading does work] [12:16:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T399728)', diff saved to https://phabricator.wikimedia.org/P80244 and previous config saved to /var/cache/conftool/dbconfig/20250729-121635-fceratto.json [12:17:22] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173931 (owner: 10PipelineBot) [12:17:45] (03PS2) 10Brouberol: common/search/airflow: drop hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) [12:19:00] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173931 (owner: 10PipelineBot) [12:20:42] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:21:03] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:23:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T399249)', diff saved to https://phabricator.wikimedia.org/P80245 and previous config saved to /var/cache/conftool/dbconfig/20250729-122352-marostegui.json [12:23:58] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:25:23] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:26:10] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:26:19] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:27:02] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:28:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[12:31:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P80246 and previous config saved to /var/cache/conftool/dbconfig/20250729-123142-fceratto.json [12:35:46] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:37:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Undeploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [12:42:53] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:43:47] (03CR) 10Phuedx: "Thanks! This shouldn't impact ongoing everyone experiments." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [12:46:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P80248 and previous config saved to /var/cache/conftool/dbconfig/20250729-124650-fceratto.json [12:48:35] (03CR) 10Lucas Werkmeister (WMDE): "LGTM per T400644#11042718; note that this will need a maintenance script run, probably `cleanupTitles` if I’m not mistaken, after deployme" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [12:48:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Localize mk.wikibooks sitename and metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [12:49:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(But maybe try `namespaceDupes` first and see if that’s enough to fix the titles.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [12:51:17] 
10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#11042722 (10Jclark-ctr) a:05brouberol→03Jclark-ctr [12:51:33] 10ops-eqiad, 06DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#11042724 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:51:38] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [12:52:11] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#11042737 (10Jclark-ctr) a:05Stevemunene→03Jclark-ctr [12:52:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#11042752 (10Jclark-ctr) a:05cmooney→03Jclark-ctr [12:53:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#11042757 (10Jclark-ctr) a:05Marostegui→03VRiley-WMF [12:53:42] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11042765 (10Jclark-ctr) a:05FCeratto-WMF→03VRiley-WMF [12:54:24] (03PS28) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [12:55:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6b [12:56:27] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11042785 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:56:55] (03PS29) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 
10https://gerrit.wikimedia.org/r/1146007 [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1300) [13:00:04] danisztls, Aca, and bwang: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:13] I can self-deploy [13:00:14] o/ [13:00:18] I can deploy in a moment [13:00:32] danisztls: then you can go ahead imho :) [13:00:39] and I can take over afterwards [13:01:15] ok [13:01:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [13:01:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T399728)', diff saved to https://phabricator.wikimedia.org/P80249 and previous config saved to /var/cache/conftool/dbconfig/20250729-130157-fceratto.json [13:02:07] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:02:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:02:38] (03Merged) 10jenkins-bot: Undeploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173441 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [13:02:55] hello i’m here for the deploy window [13:03:03] !log dani@deploy1003 Started scap sync-world: Backport for [[gerrit:1173441|Undeploy Readers Use Cases Survey v2 (T399736)]] [13:03:08] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [13:03:09] !log fceratto@cumin1002 DONE (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1254.eqiad.wmnet with reason: Maintenance [13:03:16] (03CR) 10Btullis: [C:03+1] "I wonder why the PCC check for pupet v5 failed." [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:03:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T399728)', diff saved to https://phabricator.wikimedia.org/P80250 and previous config saved to /var/cache/conftool/dbconfig/20250729-130316-fceratto.json [13:03:43] (03CR) 10Brouberol: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:04:46] o/ [13:04:53] (03CR) 10Btullis: [C:03+1] Remove airflow-search role [puppet] - 10https://gerrit.wikimedia.org/r/1173912 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:05:07] (03PS2) 10Brouberol: data: remove any privilege related to airlfow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1173909 (https://phabricator.wikimedia.org/T390941) [13:05:12] (03PS1) 10Brouberol: an-airflow: remove any role-speciific hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173908 (https://phabricator.wikimedia.org/T390941) [13:05:13] !log dani@deploy1003 dani: Backport for [[gerrit:1173441|Undeploy Readers Use Cases Survey v2 (T399736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:05:14] (03PS1) 10Brouberol: an-airflow: remove roles [puppet] - 10https://gerrit.wikimedia.org/r/1173907 (https://phabricator.wikimedia.org/T390941) [13:05:17] (03PS1) 10Brouberol: preseed: remove an-airflow preseed mapping [puppet] - 10https://gerrit.wikimedia.org/r/1173906 (https://phabricator.wikimedia.org/T390941) [13:05:19] (03PS1) 10Brouberol: site: assign the insetup::data_platform_ferm role to an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) [13:06:07] 06SRE, 10Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#11042819 (10Jclark-ctr) [13:06:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T399728)', diff saved to https://phabricator.wikimedia.org/P80251 and previous config saved to /var/cache/conftool/dbconfig/20250729-130608-fceratto.json [13:06:15] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:06:49] (03CR) 10Btullis: site: assign the insetup::data_platform_ferm role to an-airflow hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:06:55] !log dani@deploy1003 dani: Continuing with sync [13:07:04] (03CR) 10Btullis: [C:03+1] preseed: remove an-airflow preseed mapping [puppet] - 10https://gerrit.wikimedia.org/r/1173906 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:07:53] lmk whenever we deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1172683 [13:07:56] (03PS2) 10Tim Starling: Enable sitemaps API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) [13:08:15] (03CR) 10Btullis: [C:03+1] an-airflow: remove roles [puppet] - 10https://gerrit.wikimedia.org/r/1173907 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:08:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6b [13:08:36] (03CR) 10Btullis: [C:03+1] an-airflow: remove any role-speciific hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173908 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:08:43] (03CR) 10Btullis: [C:03+1] Remove an-airflow host-specific hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173911 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:08:44] (03CR) 10Tim Starling: Enable sitemaps API (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [13:08:52] (03CR) 10Brouberol: site: assign the insetup::data_platform_ferm role to an-airflow hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:09:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 10:00:00 on db2216.codfw.wmnet with reason: Maintenance [13:09:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T399249)', diff saved to https://phabricator.wikimedia.org/P80252 and previous config saved to /var/cache/conftool/dbconfig/20250729-130936-marostegui.json [13:09:42] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:10:55] (03CR) 10Btullis: "I am not sure whether or not we keep this, for historical reasons." [puppet] - 10https://gerrit.wikimedia.org/r/1173913 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:11:23] (03CR) 10Btullis: "+1 from me in principe, though." [puppet] - 10https://gerrit.wikimedia.org/r/1173913 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:12:23] !log dani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173441|Undeploy Readers Use Cases Survey v2 (T399736)]] (duration: 09m 19s) [13:12:28] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [13:12:32] (03CR) 10MVernon: "Changes made, CI happy, test-cookbook happy, going to merge." [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [13:12:35] (03CR) 10MVernon: [C:03+2] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [13:12:40] Lucas_WMDE: all yours [13:13:46] ok! [13:13:50] (03CR) 10Btullis: deployment_server: remove all config related to airflow artifact deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [13:14:12] let’s continue with Aca then :) [13:14:22] sure thing! 
[13:14:53] (03CR) 10Brouberol: deployment_server: remove all config related to airflow artifact deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [13:14:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [13:15:05] (03CR) 10Xcollazo: "CC @btullis@wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup) [13:15:52] (03Merged) 10jenkins-bot: Localize mk.wikibooks sitename and metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173456 (https://phabricator.wikimedia.org/T400644) (owner: 10Acamicamacaraca) [13:16:16] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1173456|Localize mk.wikibooks sitename and metanamespace (T400644)]] [13:16:21] T400644: Localize mk.wikibooks sitename and metanamespace - https://phabricator.wikimedia.org/T400644 [13:16:25] (03PS3) 10Brouberol: deployment_server: remove all config related to airflow artifact deployment [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) [13:16:25] (03PS3) 10Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [13:17:00] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11042889 (10Ottomata) Approved. 
[13:17:14] (03CR) 10Brouberol: deployment_server: remove all config related to airflow artifact deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [13:17:31] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.00 [13:17:36] (03CR) 10Ottomata: [C:03+1] Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [13:18:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.00 [13:18:23] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aleksandar: Backport for [[gerrit:1173456|Localize mk.wikibooks sitename and metanamespace (T400644)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:18:37] checkin [13:19:57] sitenotice changed as expected, namespace name changed, LGTM, we might also want to run namespaceDupes.php [13:20:03] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, aleksandar: Continuing with sync [13:20:04] (03CR) 10Elukey: "Looks good! 
Left a couple of comments :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:20:05] yup [13:20:39] (03CR) 10Ssingh: [C:03+2] Add access for platform engineering Airflow and data [puppet] - 10https://gerrit.wikimedia.org/r/1165605 (https://phabricator.wikimedia.org/T396672) (owner: 10Dr0ptp4kt) [13:21:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P80254 and previous config saved to /var/cache/conftool/dbconfig/20250729-132116-fceratto.json [13:22:04] (03CR) 10Btullis: cumin: remove any mention of an-airflow hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [13:22:05] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:23:30] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [13:24:09] (03CR) 10Brouberol: Kafka: expose a KafkaAdminClient builder method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:24:17] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.01 [13:24:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.01 [13:24:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.02 [13:25:30] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173456|Localize mk.wikibooks sitename and metanamespace (T400644)]] (duration: 09m 14s) [13:25:32] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.02 [13:25:34] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.03 [13:25:35] T400644: Localize mk.wikibooks sitename and metanamespace - https://phabricator.wikimedia.org/T400644 [13:25:52] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:26:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.03 [13:26:08] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.04 [13:26:23] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:26:29] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:26:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.04 [13:27:39] “49 pages to fix, 48 were resolvable.”; “1440 links to fix, 1440 were resolvable, 0 were deleted.” [13:27:41] and at the end [13:27:45] “Oh noeees” [13:27:51] is that referring to the one unresolvable page? 
[13:28:15] guess so https://gerrit.wikimedia.org/g/mediawiki/core/+/63a86dd64afe2882b20fd55daddc8fe7505ceb2f/maintenance/namespaceDupes.php#126 [13:28:19] anyway, let’s run it properly [13:29:12] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --follow --comment=T400644 -- namespaceDupes mkwikibooks --fix | tee T400644 [13:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:17] I have sysop rights there, so if anything is left, I can fix it manually [13:29:56] ah, found it [13:29:59] id=3092 ns=0 dbk=Викикниги:Портал_на_заедницата *** dest title exists and --add-prefix not specified [13:30:25] seems like one is just a redirect to the other [13:30:58] yep [13:31:06] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --follow --comment=T400644 -- namespaceDupes mkwikibooks --fix --add-prefix=T400644/ | tee T40064-2 [13:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:14] T400644: Localize mk.wikibooks sitename and metanamespace - https://phabricator.wikimedia.org/T400644 [13:31:14] T40064: WLM App upload shows no progress indicator - https://phabricator.wikimedia.org/T40064 [13:31:31] oops [13:32:02] wrong task ID [13:32:03] meh [13:32:13] anyway, Aca you probably want to move https://mk.wikibooks.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8:T400644/%D0%9F%D0%BE%D1%80%D1%82%D0%B0%D0%BB_%D0%BD%D0%B0_%D0%B7%D0%B0%D0%B5%D0%B4%D0%BD%D0%B8%D1%86%D0%B0%D1%82%D0%B0 to the right title now [13:32:25] lemme just dump the maintenance script output on phab [13:32:34] and then we can probably move on with bwang after that [13:33:37] hold up [13:34:16] Done [13:35:43] (also I genuinely can’t tell what Reedy changed in https://phabricator.wikimedia.org/T400644#11042949, I see the exact same string on both sides o_O) [13:36:15] 06SRE, 10SRE-swift-storage: Integrity check of commons' original images container dbs - https://phabricator.wikimedia.org/T400700 (10MatthewVernon) 
03NEW [13:36:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [13:36:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P80257 and previous config saved to /var/cache/conftool/dbconfig/20250729-133623-fceratto.json [13:36:28] bwang: deploying ^ now :) [13:36:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.00 [13:37:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.00 [13:37:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.01 [13:37:06] !log check container dbs for all commons original containers T400700 [13:37:08] (03Merged) 10jenkins-bot: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172683 (https://phabricator.wikimedia.org/T380515) (owner: 10Bernard Wang) [13:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:14] T400700: Integrity check of commons' original images container dbs - https://phabricator.wikimedia.org/T400700 [13:37:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.01 [13:37:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.02 [13:37:29] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1172683|Enable new mobile search experience everywhere (not including empty search recommendations) (T380515)]] [13:37:35] T380515: 
Deployment: Enable new mobile experience - https://phabricator.wikimedia.org/T380515 [13:37:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.02 [13:37:46] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.03 [13:38:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.03 [13:38:07] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.04 [13:38:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.04 [13:38:28] (03CR) 10Scott French: [C:03+1] "Thanks for sequencing this first. One thing I'm not sure about is whether this will take two agent runs to "converge" given the handoff fr" [puppet] - 10https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:38:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.05 [13:38:59] ok ty! [13:39:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.05 [13:39:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.06 [13:39:37] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, bwang: Backport for [[gerrit:1172683|Enable new mobile search experience everywhere (not including empty search recommendations) (T380515)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:39:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.06 [13:39:46] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.07 [13:40:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.07 [13:40:25] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.08 [13:40:29] bwang: please test :) [13:40:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.08 [13:41:02] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.09 [13:41:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.09 [13:41:35] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0a [13:42:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0a [13:42:12] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0b [13:42:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0b [13:42:46] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0c [13:43:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0c [13:43:23] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0d 
[13:43:30] look good! [13:43:52] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, bwang: Continuing with sync [13:43:52] ok! [13:43:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0d [13:43:59] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0e [13:44:06] (03CR) 10Scott French: [C:03+1] etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:44:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0e [13:44:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.0f [13:44:43] (03CR) 10Scott French: [C:03+1] conftool-data: remove testservers 3 [puppet] - 10https://gerrit.wikimedia.org/r/1173877 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [13:45:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.0f [13:45:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.10 [13:45:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.10 [13:45:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.11 [13:46:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.11 [13:46:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.12 [13:47:06] !log mvernon@cumin2002 END (PASS) - 
Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.12 [13:47:09] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.13 [13:47:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.13 [13:47:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.14 [13:48:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.14 [13:48:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.15 [13:48:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.15 [13:48:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.16 [13:49:15] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172683|Enable new mobile search experience everywhere (not including empty search recommendations) (T380515)]] (duration: 11m 45s) [13:49:20] T380515: Deployment: Enable new mobile experience - https://phabricator.wikimedia.org/T380515 [13:49:29] !log UTC afternoon backport+config window done [13:49:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.16 [13:49:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.17 [13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:52] (PS5) Ssingh: hiera: service.yaml: use better aliasing for text/upload [puppet] - https://gerrit.wikimedia.org/r/1168192 [13:50:03] !log
mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.17 [13:50:06] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.18 [13:50:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.18 [13:50:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.19 [13:51:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.19 [13:51:20] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1a [13:51:24] (CR) Ssingh: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6440/" [puppet] - https://gerrit.wikimedia.org/r/1168192 (owner: Ssingh) [13:51:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T399728)', diff saved to https://phabricator.wikimedia.org/P80258 and previous config saved to /var/cache/conftool/dbconfig/20250729-135131-fceratto.json [13:51:37] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:51:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1259.eqiad.wmnet with reason: Maintenance [13:51:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1a [13:51:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1259 (T399728)', diff saved to https://phabricator.wikimedia.org/P80259 and previous config saved to
/var/cache/conftool/dbconfig/20250729-135154-fceratto.json [13:51:56] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1b [13:52:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1b [13:52:39] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1c [13:53:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1c [13:53:18] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1d [13:53:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1d [13:53:55] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1e [13:54:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1e [13:54:35] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.1f [13:54:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T399728)', diff saved to https://phabricator.wikimedia.org/P80260 and previous config saved to /var/cache/conftool/dbconfig/20250729-135445-fceratto.json [13:55:09] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:55:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.1f [13:55:12] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.20 [13:55:43] (CR) Elukey: Kafka: expose a KafkaAdminClient builder method (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol) [13:55:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.20 [13:55:52] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.21 [13:56:17] elukey: I'm planning to add kafka-jumbo101[7-8] to the cluster tomorrow, through the same procedure as before: a) change role b) run puppet on node + ZK + kafka cluster c) run puppet on deployment server d) apply new external-services rules e) RR the cluster. Do you want to be around and/or notified when it happens?
[13:56:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.21 [13:56:29] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.22 [13:57:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.22 [13:57:10] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.23 [13:57:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.23 [13:57:48] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.24 [13:58:04] (note: the RR cookbook now bounces the controller last, so we should not replay the last outage [13:58:06] ) [13:58:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.24 [13:58:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.25 [13:58:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.25 [13:58:57] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.26 [13:59:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.26 [13:59:35] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.27 [14:00:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.27 [14:00:10] !log mvernon@cumin2002 START - 
Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.28 [14:00:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.28 [14:00:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.29 [14:01:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.29 [14:01:22] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2a [14:01:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2a [14:01:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2b [14:02:11] brouberol: o/ I totally trust you, and the new cookbook should cover issues, but if you want me around I can be online when you do the maintenance [14:02:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2b [14:02:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2c [14:02:45] elukey: understood! 
[14:03:06] I'll proceed by myself but will ping you when things start, just as an FYI [14:03:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2c [14:03:10] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2d [14:03:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2d [14:03:47] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2e [14:04:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2e [14:04:24] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.2f [14:04:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.2f [14:04:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.30 [14:05:27] (CR) Btullis: deployment_server: remove all config related to airflow artifact deployment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: Brouberol) [14:05:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.30 [14:05:38] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.31 [14:06:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.31 [14:06:18] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of
wikipedia-commons-local-public.32 [14:06:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.32 [14:06:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.33 [14:07:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.33 [14:07:29] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.34 [14:07:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.34 [14:08:02] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.35 [14:08:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.35 [14:08:36] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.36 [14:09:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.36 [14:09:15] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.37 [14:09:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.37 [14:09:52] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.38 [14:09:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P80261 and previous config saved to /var/cache/conftool/dbconfig/20250729-140953-fceratto.json [14:10:28] !log mvernon@cumin2002 END 
(PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.38 [14:10:30] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.39 [14:10:52] (CR) Btullis: [C:+1] dse-k8s: Add dse-k8s-codfw etcd cluster configuration [puppet] - https://gerrit.wikimedia.org/r/1173914 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene) [14:11:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.39 [14:11:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3a [14:11:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.3a [14:11:43] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3b [14:12:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.3b [14:12:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3c [14:12:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.3c [14:12:57] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3d [14:13:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.3d [14:13:37] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3e [14:14:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of
wikipedia-commons-local-public.3e [14:14:13] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.3f [14:14:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.3f [14:14:50] (PS4) Brouberol: deployment_server: remove all config related to airflow artifact deployment [puppet] - https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) [14:14:50] (PS4) Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [14:14:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.40 [14:15:27] (CR) Brouberol: deployment_server: remove all config related to airflow artifact deployment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: Brouberol) [14:15:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.40 [14:15:31] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.41 [14:16:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.41 [14:16:06] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.42 [14:16:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.42 [14:16:40] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.43 [14:17:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0)
Checking container DBs of wikipedia-commons-local-public.43 [14:17:18] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.44 [14:17:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.44 [14:17:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.45 [14:18:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.45 [14:18:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.46 [14:18:50] (PS2) Zabe: Stop writing to cl_to and cl_collation on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1169769 (https://phabricator.wikimedia.org/T399579) [14:19:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.46 [14:19:09] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.47 [14:19:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.47 [14:19:46] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.48 [14:20:21] (PS1) Bking: Introduce opensearch-operator-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) [14:20:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.48 [14:20:27] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.49 [14:21:02] !log mvernon@cumin2002 END
(PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.49 [14:21:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4a [14:21:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4a [14:21:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4b [14:22:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4b [14:22:22] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4c [14:22:33] (CR) Vgutierrez: [C:+1] haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [14:22:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4c [14:22:59] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4d [14:23:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4d [14:23:38] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4e [14:23:40] jouncebot: nowandnext [14:23:40] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [14:23:40] In 0 hour(s) and 6 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1430) [14:24:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4e [14:24:17]
!log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.4f [14:24:22] (CR) Zabe: [C:+2] Stop writing to cl_to and cl_collation on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1169769 (https://phabricator.wikimedia.org/T399579) (owner: Zabe) [14:24:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.4f [14:24:52] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.50 [14:24:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399249)', diff saved to https://phabricator.wikimedia.org/P80262 and previous config saved to /var/cache/conftool/dbconfig/20250729-142452-marostegui.json [14:24:59] (PS1) Majavah: P:toolforge::k8s::haproxy: Set custom runbook URL [puppet] - https://gerrit.wikimedia.org/r/1173949 [14:25:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P80263 and previous config saved to /var/cache/conftool/dbconfig/20250729-142500-fceratto.json [14:25:01] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:25:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.50 [14:25:27] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.51 [14:25:33] (Merged) jenkins-bot: Stop writing to cl_to and cl_collation on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1169769 (https://phabricator.wikimedia.org/T399579) (owner: Zabe) [14:26:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.51 [14:26:07]
!log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.52 [14:26:13] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169769|Stop writing to cl_to and cl_collation on testwiki (T399579)]] [14:26:18] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [14:26:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.52 [14:26:44] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.53 [14:26:46] (CR) Stevemunene: [C:+2] dse-k8s: Add dse-k8s-codfw etcd cluster configuration [puppet] - https://gerrit.wikimedia.org/r/1173914 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene) [14:27:10] (PS1) Fabfur: hiera,haproxykafka: shrink socket batch size on all DCs [puppet] - https://gerrit.wikimedia.org/r/1173950 (https://phabricator.wikimedia.org/T400199) [14:27:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.53 [14:27:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.54 [14:27:58] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1173950 (https://phabricator.wikimedia.org/T400199) (owner: Fabfur) [14:28:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.54 [14:28:03] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.55 [14:28:25] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169769|Stop writing to cl_to and cl_collation on testwiki (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug).
Changes can now be verified there. [14:28:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.55 [14:28:42] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.56 [14:29:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.56 [14:29:19] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.57 [14:29:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.57 [14:29:55] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.58 [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1430) [14:30:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.58 [14:30:31] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.59 [14:31:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.59 [14:31:15] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5a [14:31:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5a [14:31:50] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5b [14:31:58] !log zabe@deploy1003 zabe: Continuing with sync [14:32:18] SRE-SLO, EditCheck, Lift-Wing, Machine-Learning-Team,
Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11043355 (elukey) Today I tried to review the graphs in the Tone Check's latency SLO page, and this is what I found:... [14:32:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5b [14:32:27] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5c [14:32:47] (CR) Vgutierrez: [C:+1] hiera,haproxykafka: shrink socket batch size on all DCs [puppet] - https://gerrit.wikimedia.org/r/1173950 (https://phabricator.wikimedia.org/T400199) (owner: Fabfur) [14:32:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5c [14:33:02] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5d [14:33:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5d [14:33:37] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5e [14:33:37] (CR) Fabfur: [C:+2] hiera,haproxykafka: shrink socket batch size on all DCs [puppet] - https://gerrit.wikimedia.org/r/1173950 (https://phabricator.wikimedia.org/T400199) (owner: Fabfur) [14:34:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5e [14:34:07] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.5f [14:34:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.5f [14:34:43] !log applying
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173950 and upgrading haproxykafka to 0.3.12 on A:cp (T400199) [14:34:43] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.60 [14:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:48] T400199: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 [14:35:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.60 [14:35:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.61 [14:35:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.61 [14:35:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.62 [14:36:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.62 [14:36:29] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.63 [14:37:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.63 [14:37:06] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.64 [14:37:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169769|Stop writing to cl_to and cl_collation on testwiki (T399579)]] (duration: 11m 12s) [14:37:29] (CR) Fabfur: [C:+2] haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [14:37:31] T399579: Stop writing to cl_to and cl_collation -
https://phabricator.wikimedia.org/T399579 [14:37:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.64 [14:37:43] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.65 [14:38:13] (CR) Tiziano Fogli: [C:+1] haproxykafka: adding alert for unexpected restarts [alerts] - https://gerrit.wikimedia.org/r/1172347 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [14:38:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.65 [14:38:23] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.66 [14:38:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.66 [14:39:00] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.67 [14:39:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.67 [14:39:36] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.68 [14:40:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P80264 and previous config saved to /var/cache/conftool/dbconfig/20250729-144000-marostegui.json [14:40:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T399728)', diff saved to https://phabricator.wikimedia.org/P80265 and previous config saved to /var/cache/conftool/dbconfig/20250729-144008-fceratto.json [14:40:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-public.68 [14:40:15] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.69 [14:40:15] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:40:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:40:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.69 [14:40:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6a [14:41:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.6a [14:41:30] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6b [14:42:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.6b [14:42:10] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6c [14:42:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.6c [14:42:47] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6d [14:43:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.6d [14:43:24] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6e [14:43:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking 
container DBs of wikipedia-commons-local-public.6e [14:43:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.6f [14:44:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.6f [14:44:37] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.70 [14:45:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.70 [14:45:12] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.71 [14:45:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.71 [14:45:47] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.72 [14:46:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.72 [14:46:25] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.73 [14:46:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.73 [14:47:01] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.74 [14:47:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.74 [14:47:34] SRE-SLO, EditCheck, Editing-team (Kanban Board), Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11043446 (DLynch) @elukey I'm not actually sure what rate we're expecting 
at the moment. That doesn't look implausible, at least.... [14:47:35] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.75 [14:48:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.75 [14:48:12] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.76 [14:48:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.76 [14:48:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.77 [14:49:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.77 [14:49:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.78 [14:49:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.78 [14:49:59] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.79 [14:50:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.79 [14:50:38] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7a [14:51:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.7a [14:51:14] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7b [14:51:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-public.7b [14:51:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7c [14:52:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.7c [14:52:30] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7d [14:53:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.7d [14:53:07] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7e [14:53:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.7e [14:53:49] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.7f [14:54:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.7f [14:54:23] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.80 [14:54:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.80 [14:54:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.81 [14:55:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P80266 and previous config saved to /var/cache/conftool/dbconfig/20250729-145507-marostegui.json [14:55:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.81 [14:55:36] !log mvernon@cumin2002 
START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.82 [14:56:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.82 [14:56:09] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.83 [14:56:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.83 [14:56:42] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.84 [14:57:03] !log haproxykafka updated to 0.3.12 on A:cp (T400199) [14:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:08] T400199: Prevent HaproxyKafka from hanging - https://phabricator.wikimedia.org/T400199 [14:57:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.84 [14:57:13] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.85 [14:57:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.85 [14:57:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.86 [14:58:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.86 [14:58:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.87 [14:58:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.87 [14:59:01] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of 
wikipedia-commons-local-public.88 [14:59:31] (CR) Jforrester: Enable sitemaps API (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling) [14:59:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.88 [14:59:40] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.89 [15:00:05] jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1500). Please do the needful. [15:00:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.89 [15:00:19] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8a [15:00:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8a [15:00:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8b [15:00:59] SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#11043543 (elukey) The Grafana calendar dashboard is currently broken for Istio based SLOs like citoid: https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarterly-drilldown?forceLogin&from=2025... 
[15:01:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8b [15:01:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8c [15:02:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8c [15:02:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8d [15:02:06] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [15:02:32] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [15:02:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8d [15:02:44] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8e [15:02:47] !log brennen@deploy1003 Started deploy [phabricator/deployment@1df7631]: test deploy phab2002 for T400718 [15:02:53] T400718: Deploy Phabricator/Phorge 2025-07-29 - https://phabricator.wikimedia.org/T400718 [15:03:01] !log jelto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1005.eqiad.wmnet with reason: Phabricator deploy [15:03:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8e [15:03:17] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.8f [15:03:29] !log brennen@deploy1003 Finished deploy [phabricator/deployment@1df7631]: test deploy phab2002 for T400718 (duration: 00m 42s) [15:03:48] !log mvernon@cumin2002 END (PASS) - 
Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.8f [15:03:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.90 [15:04:03] !log brennen@deploy1003 Started deploy [phabricator/deployment@1df7631]: deploy phab1004 for T400718 [15:04:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.90 [15:04:25] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.91 [15:04:53] !log brennen@deploy1003 Finished deploy [phabricator/deployment@1df7631]: deploy phab1004 for T400718 (duration: 00m 50s) [15:05:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.91 [15:05:04] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.92 [15:05:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.92 [15:05:39] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.93 [15:06:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.93 [15:06:14] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.94 [15:06:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.94 [15:06:48] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.95 [15:07:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-public.95 [15:07:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.96 [15:07:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.96 [15:07:59] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.97 [15:08:04] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [15:08:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.97 [15:08:36] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.98 [15:08:45] SRE-SLO: The Pyrra SLO Duration panel is broken when the latency metric is in milliseconds - https://phabricator.wikimedia.org/T400724 (elukey) NEW [15:09:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.98 [15:09:13] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.99 [15:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.99 [15:09:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9a [15:10:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T399249)', diff saved to https://phabricator.wikimedia.org/P80267 and previous config saved to /var/cache/conftool/dbconfig/20250729-151015-marostegui.json [15:10:21] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:10:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9a [15:10:30] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9b [15:11:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9b [15:11:03] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9c [15:11:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9c [15:11:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9d [15:11:58] SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#11043670 (elukey) I had a chat with Filippo and https://github.com/thanos-io/thanos/issues/1598#issuecomment-2610564533 seems telling us that it is not really possible :( We should probably c... 
[15:12:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9d [15:12:20] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9e [15:12:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9e [15:12:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.9f [15:13:00] (CR) JHathaway: [C:+1] redfish: expand is_uefi for Dells [software/spicerack] - https://gerrit.wikimedia.org/r/1173923 (https://phabricator.wikimedia.org/T392851) (owner: Elukey) [15:13:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.9f [15:13:40] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a0 [15:14:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a0 [15:14:15] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a1 [15:14:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a1 [15:14:55] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a2 [15:15:03] SRE-SLO, EditCheck, Editing-team (Kanban Board), Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11043676 (DLynch) I have realized I was parsing those queries as if they were Lua, but it's probably more likely that `=~` means p... 
[15:15:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a2 [15:15:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a3 [15:15:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:16:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a3 [15:16:08] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a4 [15:16:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a4 [15:16:41] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a5 [15:17:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a5 [15:17:20] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a6 [15:18:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a6 [15:18:03] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a7 [15:18:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a7 [15:18:40] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a8 [15:19:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a8 [15:19:20] !log mvernon@cumin2002 START - 
Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.a9 [15:19:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.a9 [15:19:55] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.aa [15:20:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.aa [15:20:39] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ab [15:21:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ab [15:21:19] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ac [15:21:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ac [15:21:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad [15:22:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ad [15:22:38] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ae [15:23:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ae [15:23:17] !log mvernon@cumin2002 START - Cookbook 
sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.af [15:23:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.af [15:23:55] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b0 [15:24:28] (CR) Brouberol: Kafka: expose a KafkaAdminClient builder method (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol) [15:24:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b0 [15:24:34] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b1 [15:25:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b1 [15:25:16] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b2 [15:25:45] (PS4) Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [15:25:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b2 [15:25:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b3 [15:26:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b3 [15:26:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b4 [15:26:53] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) 
hosts to 10G - https://phabricator.wikimedia.org/T399927#11043718 (Jhancock.wm) since the server is relatively new, I'd prefer to move it to a new rack. If we leave it where it is, and break convention, it'll bre... [15:27:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b4 [15:27:03] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b5 [15:27:12] (PS5) Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [15:27:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b5 [15:27:37] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b6 [15:28:11] (PS6) Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [15:28:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b6 [15:28:17] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b7 [15:28:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b7 [15:28:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b8 [15:29:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b8 [15:29:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.b9 
[15:30:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.b9 [15:30:06] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ba [15:30:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ba [15:30:43] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.bb [15:31:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.bb [15:31:20] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.bc [15:31:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.bc [15:31:53] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.bd [15:32:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.bd [15:32:31] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.be [15:33:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.be [15:33:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.bf [15:33:37] (CR) Dzahn: [C:+1] Gerrit: Add service ip for gerrit2003 [dns] - https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [15:33:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-public.bf [15:33:54] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c0 [15:34:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1171239 (https://phabricator.wikimedia.org/T385286) (owner: Matthias Mullie) [15:34:26] (PS2) Dzahn: ci: remove old jenkins@gallium RSA key SSH key [puppet] - https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) [15:34:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c0 [15:34:30] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c1 [15:34:54] (CR) CI reject: [V:-1] ci: remove old jenkins@gallium RSA key SSH key [puppet] - https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: Dzahn) [15:35:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c1 [15:35:13] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c2 [15:35:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c2 [15:35:52] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c3 [15:36:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c3 [15:36:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c4 [15:36:56] (CR) 
Dzahn: "thanks a lot! just one extra comment. so we have 2 upstreams here. one is the sourcebot upstream Dockerfile which I was trying to translat" [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn) [15:36:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c4 [15:37:00] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c5 [15:37:14] (CR) CI reject: [V:-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: Brouberol) [15:37:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c5 [15:37:35] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c6 [15:38:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c6 [15:38:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c7 [15:38:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c7 [15:38:47] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c8 [15:39:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.c8 [15:39:25] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.c9 [15:39:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking 
container DBs of wikipedia-commons-local-public.c9 [15:39:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ca [15:40:15] (PS4) Dzahn: add blubber builder config to build zoekt [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [15:40:28] (CR) CI reject: [V:-1] add blubber builder config to build zoekt [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn) [15:40:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ca [15:40:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cb [15:41:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cb [15:41:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cc [15:41:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cc [15:41:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [15:42:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd [15:42:27] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ce [15:43:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ce [15:43:10] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cf [15:43:40] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cf [15:43:43] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d0 [15:44:00] (PS5) Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [15:44:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d0 [15:44:16] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d1 [15:44:20] (CR) CI reject: [V:-1] add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn) [15:44:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d1 [15:44:52] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d2 [15:45:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d2 [15:45:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d3 [15:46:08] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d3 [15:46:11] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d4 [15:46:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d4 [15:46:46] !log 
mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d5 [15:47:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d5 [15:47:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d6 [15:47:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d6 [15:47:57] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d7 [15:48:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d7 [15:48:36] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d8 [15:48:59] (PS6) Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [15:49:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d8 [15:49:12] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.d9 [15:49:12] (CR) CI reject: [V:-1] add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn) [15:49:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.d9 [15:49:51] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.da [15:50:25] !log mvernon@cumin2002 END (PASS) - 
Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.da [15:50:28] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.db [15:51:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.db [15:51:06] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.dc [15:51:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.dc [15:51:48] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.dd [15:52:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.dd [15:52:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.de [15:53:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.de [15:53:04] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.df [15:53:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.df [15:53:41] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e0 [15:54:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e0 [15:54:14] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e1 [15:54:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs 
(exit_code=0) Checking container DBs of wikipedia-commons-local-public.e1 [15:54:48] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e2 [15:55:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e2 [15:55:19] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e3 [15:55:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e3 [15:55:57] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e4 [15:56:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e4 [15:56:29] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e5 [15:57:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e5 [15:57:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e6 [15:57:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e6 [15:57:42] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e7 [15:57:56] FIRING: [56x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:58:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs 
of wikipedia-commons-local-public.e7 [15:58:14] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e8 [15:58:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e8 [15:58:48] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.e9 [15:59:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.e9 [15:59:25] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ea [15:59:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ea [15:59:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.eb [16:00:04] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.eb [16:00:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ec [16:01:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ec [16:01:09] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ed [16:01:30] (CR) Vgutierrez: "wmfuniq should be added to https://wikitech.wikimedia.org/wiki/X-Analytics" [puppet] - https://gerrit.wikimedia.org/r/1172402 (owner: CDanis) [16:01:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ed [16:01:46] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ee [16:02:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ee [16:02:23] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ef [16:02:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ef [16:02:58] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f0 [16:03:17] (CR) Vgutierrez: [C:+1] haproxy: scrub part of x-analytics even when xwd debug [puppet] - https://gerrit.wikimedia.org/r/1172401 (owner: CDanis) [16:03:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f0 [16:03:26] SRE, envoy, serviceops, Traffic: Upgrade Envoy to >= 1.24 - 
https://phabricator.wikimedia.org/T380211#11043915 (hnowlan) Ideally we'd need to go to 1.33 or later with this work - I have not scoped how many more complications this work will entail but for WE5.1.3 and future rate limiting work we'l... [16:03:26] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f1 [16:03:56] ops-codfw, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11043916 (Jhancock.wm) this one is gonna be for an-worker or an-presto servers to test on once we wrap up with this. I can test out the reimage when you say go. [16:03:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f1 [16:04:00] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f2 [16:04:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f2 [16:04:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f3 [16:05:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f3 [16:05:04] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f4 [16:05:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f4 [16:05:36] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f5 [16:06:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f5 [16:06:12] !log mvernon@cumin2002 
START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f6 [16:06:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f6 [16:06:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f7 [16:07:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f7 [16:07:21] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f8 [16:07:32] (CR) Vgutierrez: [C:+1] varnish: include wmfuniq count in x-analytics [puppet] - https://gerrit.wikimedia.org/r/1172402 (owner: CDanis) [16:07:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f8 [16:07:57] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f9 [16:08:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f9 [16:08:32] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.fa [16:09:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.fa [16:09:05] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.fb [16:09:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.fb [16:09:33] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.fc [16:10:06] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.fc [16:10:08] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.fd [16:10:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.fd [16:10:45] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.fe [16:11:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.fe [16:11:17] !log mvernon@cumin2002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ff [16:11:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ff [16:15:36] (PS1) Giuseppe Lavagetto: Fix haproxy no hdr [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1173973 [16:15:47] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Fix haproxy no hdr [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1173973 (owner: Giuseppe Lavagetto) [16:16:15] (PS3) Dzahn: ci: remove old jenkins@gallium RSA key SSH key [puppet] - https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) [16:16:16] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix haproxy no header condition - oblivian@cumin1003" [16:16:17] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix haproxy no header condition - oblivian@cumin1003 [16:16:41] (CR) CI reject: [V:-1] ci: remove old jenkins@gallium RSA key SSH key [puppet] - https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: Dzahn) 
[16:17:00] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix haproxy no header condition - oblivian@cumin1003 [16:17:01] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix haproxy no header condition - oblivian@cumin1003" [16:17:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:18:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [16:18:59] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11043977 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [16:22:41] RESOLVED: [56x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:24:33] (PS1) FNegri: installserver: setup new hosts clouddb102[2-5] [puppet] - https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) [16:24:57] (PS2) FNegri: installserver: setup new hosts clouddb102[2-5] [puppet] - https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) [16:25:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:25:17] (PS3) FNegri: installserver: setup new hosts clouddb102[2-5] [puppet] - https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) [16:36:47] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11044044 (Jhancock.wm) Thank you so much for your help on these! everything looks great. This one is a 1 CPU config F. It'll be an alternative to the CP servers. So once everything is finalized I can... [16:40:32] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [16:40:45] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11044052 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2010.codfw.wmnet with OS bookworm [16:41:53] andrew@cumin2002 reimage (PID 1077602) is awaiting input [16:47:31] (PS2) Scott French: httpd-fcgi: add missing image build dependency [docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 [16:47:52] (CR) Scott French: [V:+2] "Built locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 (owner: Scott French) [16:53:02] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [16:53:11] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044097 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... 
[16:53:12] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044098 (Andrew) The Debian installer fails on cloudcephosd1042, complaining that no drives can be found. This seems to be because all drives are marked as 'raid' drives but not assigned t... [16:56:34] (PS1) Bking: cirrussearch: remove soon-to-be-decommed hosts [puppet] - https://gerrit.wikimedia.org/r/1173981 (https://phabricator.wikimedia.org/T395855) [16:56:35] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [16:56:46] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1700) [17:03:50] PROBLEM - Host ps1-e4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:05:52] RECOVERY - Host ps1-e4-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.42 ms [17:07:16] (CR) Urbanecm: [C:+1] "No objections. This is similar to enwiki's config, and T&S+Security gave their thumbs-up on the task. The only risk is the group, but the " [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang) [17:16:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11044183 (ssingh) Open→Resolved @dr0ptp4kt : Closing this task as part of the clinic duty week this week (@CDobbins' first week). Please re-open if there are any... 
[17:16:11] (PS2) Bking: cirrussearch: remove soon-to-be-decommed hosts [puppet] - https://gerrit.wikimedia.org/r/1173981 (https://phabricator.wikimedia.org/T395855) [17:16:47] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1173981 (https://phabricator.wikimedia.org/T395855) (owner: Bking) [17:21:59] (PS3) Scott French: httpd-fcgi: add missing image build dependency [docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 [17:21:59] (CR) Scott French: [V:+2] "Built locally with docker-pkg." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 (owner: Scott French) [17:26:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:28:45] andrew@cumin2002 reimage (PID 1095957) is awaiting input [17:36:05] (CR) BCornwall: [V:+1 C:+2] acme-chief: Add wikipedialibrary.org to certs [puppet] - https://gerrit.wikimedia.org/r/1172393 (https://phabricator.wikimedia.org/T400367) (owner: BCornwall) [17:42:13] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [17:42:26] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044304 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... [17:43:04] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11044305 (Jhancock.wm) There's two things i can think of. 
One is converting the disks to raid capable and either setting two RAID0 or one RAID1. Setting a "First Dev... [17:43:29] !log dancy@deploy1003 Installing scap version "4.193.0" for 2 host(s) [17:43:35] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [17:43:51] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044306 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with... [17:45:02] (PS4) Scott French: httpd-fcgi: add missing image build dependency [docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 [17:45:15] !log dancy@deploy1003 Installation of scap version "4.193.0" completed for 2 hosts [17:48:41] (PS1) Ahmon Dancy: pretrain: Use bash to execute multiple commands [puppet] - https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) [17:48:45] (CR) Scott French: [V:+2] "Built locally with docker-pkg." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1173977 (owner: Scott French) [17:49:16] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1032.eqiad.wmnet [17:49:30] (PS2) Ahmon Dancy: pretrain: Use bash to execute multiple commands [puppet] - https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) [17:49:38] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy) [17:51:01] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:51:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [17:51:16] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044322 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [17:54:40] (PS3) Ahmon Dancy: pretrain: Use bash to execute multiple commands [puppet] - https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) [17:54:48] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy) [17:56:33] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:56:42] vriley@cumin1002 netbox (PID 3371761) is awaiting input [17:57:40] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1023 - vriley@cumin1002" [17:58:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1023 - vriley@cumin1002" [17:58:03] !log vriley@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [17:58:43] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host clouddb1023 [17:59:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:59:58] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host clouddb1023 [18:00:00] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1173991/7114/deploy1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [18:00:04] brennen and dduvall: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1800). [18:00:27] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173994 [18:00:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:01:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044388 (10VRiley-WMF) [18:01:58] o/ [18:02:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044393 (10VRiley-WMF) Tried to run reimage on clouddb1022, to no avail. Running through clouddb1023 to see if there is a difference. 
[18:03:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044395 (10VRiley-WMF) [18:03:49] !log train 1.45.0-wmf.12 status: no current blockers, rolling to group0 using spiderpig [18:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:30] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173995 (https://phabricator.wikimedia.org/T396373) [18:04:32] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173995 (https://phabricator.wikimedia.org/T396373) (owner: 10TrainBranchBot) [18:05:20] vriley@cumin1002 provision (PID 3374494) is awaiting input [18:05:45] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173995 (https://phabricator.wikimedia.org/T396373) (owner: 10TrainBranchBot) [18:08:52] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:10:50] vriley@cumin1002 provision (PID 3374494) is awaiting input [18:11:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:13:26] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.12 refs T396373 [18:13:32] T396373: 1.45.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T396373 [18:18:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:21:30] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: add entries for nokia switches eqiad - cmooney@cumin1003" [18:21:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches eqiad - cmooney@cumin1003" [18:21:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:29:01] (03PS1) 10Kgraessle: Add experiment code to group by toggle - adding config for beta to override enrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) [18:29:19] (03PS2) 10Kgraessle: Add experiment code to group by toggle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) [18:30:05] (03CR) 10CI reject: [V:04-1] Add experiment code to group by toggle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [18:30:34] (03CR) 10Btullis: [C:03+1] cirrussearch: remove soon-to-be-decommed hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173981 (https://phabricator.wikimedia.org/T395855) (owner: 10Bking) [18:31:36] (03PS3) 10Kgraessle: Add experiment code to group by toggle - adding config for beta to override enrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) [18:31:51] vriley@cumin1002 reimage (PID 3371317) is awaiting input [18:33:26] (03CR) 10Clare Ming: [C:03+1] Add experiment code to group by toggle - adding config for beta to override enrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [18:34:47] (03PS4) 10Kgraessle: Add experiment code to group by toggle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) [18:35:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in 
the [Tuesday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [18:37:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... [18:38:19] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:38:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:41:54] (03Abandoned) 10CDanis: haproxy: use_benthos=>true [puppet] - 10https://gerrit.wikimedia.org/r/1165582 (https://phabricator.wikimedia.org/T329332) (owner: 10CDanis) [18:43:14] (03PS1) 10Btullis: Allow the Airflow webserver to support long requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) [18:43:56] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:44:39] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:45:38] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:46:40] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:48:19] !log vriley@cumin1002 END 
(FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:54:30] (03CR) 10Phuedx: Add experiment code to group by toggle (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [18:54:35] (03CR) 10Phuedx: [C:03+1] Add experiment code to group by toggle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [18:55:28] (03PS1) 10BCornwall: Add batch of pay-for-edit domains [dns] - 10https://gerrit.wikimedia.org/r/1174007 (https://phabricator.wikimedia.org/T400731) [18:57:18] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:57:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... 
[18:57:31] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1042'] [18:57:51] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1042'] [18:58:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1042'] [18:59:00] !log andrew@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1042.eqiad.wmnet'] [18:59:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044655 (10VRiley-WMF) [18:59:08] !log andrew@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1042.eqiad.wmnet'] [18:59:30] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1042.eqiad.wmnet'] [18:59:39] !log andrew@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcephosd1042.eqiad.wmnet'] [19:00:33] (03CR) 10Effie Mouzeli: [C:03+1] httpd-fcgi: add missing image build dependency [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1173977 (owner: 10Scott French) [19:01:31] (03CR) 10Dzahn: [C:03+2] pretrain: Use bash to execute multiple commands [puppet] - 10https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [19:02:12] (03PS1) 10BCornwall: acme-chief: Add batch of pay-for-edit domains [puppet] - 10https://gerrit.wikimedia.org/r/1174010 (https://phabricator.wikimedia.org/T400731) [19:02:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044661 (10Andrew) Please note that the partman recipe user here cannot be run more than once in a 
row on Bullseye (cc: @BTullis ) so when rerunning you have to take the tedious steps of swi... [19:03:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11044662 (10Andrew) Now I'm stuck on getting it to boot to the OS drive after the debian install completes. [19:04:01] (03PS4) 10Dzahn: ci: remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) [19:04:25] (03CR) 10CI reject: [V:04-1] ci: remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [19:05:41] jouncebot: nowandnext [19:05:41] For the next 0 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T1800) [19:05:42] In 0 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T2000) [19:05:47] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:07:34] (03CR) 10Xcollazo: [C:03+1] Allow the Airflow webserver to support long requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) (owner: 10Btullis) [19:08:04] FIRING: PuppetDisabled: Puppet disabled on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=wdqs-main&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [19:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:16] (03PS5) 10Dzahn: ci: 
remove old jenkins@gallium RSA key SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) [19:14:34] (03PS1) 10Gmodena: data-engineering: eventgate: raise basline for anomalies [alerts] - 10https://gerrit.wikimedia.org/r/1174012 (https://phabricator.wikimedia.org/T398437) [19:15:37] (03PS2) 10Gmodena: data-engineering: eventgate: baseline for anomalies [alerts] - 10https://gerrit.wikimedia.org/r/1174012 (https://phabricator.wikimedia.org/T398437) [19:16:42] (03CR) 10Bking: [C:03+2] cirrussearch: remove soon-to-be-decommed hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173981 (https://phabricator.wikimedia.org/T395855) (owner: 10Bking) [19:19:05] (03PS1) 10Ahmon Dancy: data.yaml: Fix copy-pasted typo in journalctl command [puppet] - 10https://gerrit.wikimedia.org/r/1174014 [19:19:33] (03CR) 10Ottomata: [C:03+2] data-engineering: eventgate: baseline for anomalies [alerts] - 10https://gerrit.wikimedia.org/r/1174012 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [19:19:35] (03CR) 10Ahmon Dancy: "Thanks Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1173991 (https://phabricator.wikimedia.org/T398873) (owner: 10Ahmon Dancy) [19:19:57] vriley@cumin1002 provision (PID 3381528) is awaiting input [19:21:00] (03Merged) 10jenkins-bot: data-engineering: eventgate: baseline for anomalies [alerts] - 10https://gerrit.wikimedia.org/r/1174012 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [19:21:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [19:21:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159610 (owner: 10Bartosz Dziewoński) [19:30:52] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15), 13Patch-For-Review: Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855#11044824 (10bking) a:05bking→03None [19:31:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:31:31] (03CR) 10Dzahn: "re: "go.sum not found" -> https://golangtutorial.dev/tips/fixing-missing-go-sum-entry-for-module-providing-package-in-golang/" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [19:32:32] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11044850 (10RLazarus) [19:34:34] (03PS1) 10CDanis: probenet: Report CDN host handling each measure request [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174016 
(https://phabricator.wikimedia.org/T398596) [19:35:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174016 (https://phabricator.wikimedia.org/T398596) (owner: 10CDanis) [19:35:51] question for ops folks: is there a deployment freeze during Wikimania this year? The deployment calendar doesn't seem to have one, but I recall we've often done that in the past. [19:37:21] thcipriani or other relengeers: ^ was there an explicit decision about this? [19:40:39] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:42:13] (03CR) 10Dzahn: [C:04-1] "it's /usr/bin/" [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [19:42:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-gf95s - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [19:42:50] (03CR) 10Dzahn: [C:04-1] "actually, same for systemctl lines above this" [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [19:43:04] https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar is the doc I would expect to document it [19:43:40] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#11044870 (10VRiley-WMF) Hey @Marostegui I should be available to update this at any time. Please let me know when would be a good time for this activity. [19:45:15] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11044872 (10RLazarus) a:03RLazarus >>!
In T380211#11043915, @hnowlan wrote: > Ideally we'd need to go to 1.33 or later with this work Ack. We're running 1.23 and the current release is 1.35, so... [19:45:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [19:45:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11044874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS b... [19:49:41] vriley@cumin1002 provision (PID 3385041) is awaiting input [19:51:04] (03CR) 10Ayounsi: [C:03+1] Gerrit: Add service ip for gerrit2003 [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:53:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host clouddb1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:59:20] (03CR) 10Dzahn: [C:03+2] Gerrit: Add service ip for gerrit2003 [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:59:32] !log dzahn@dns1004 START - running authdns-update [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T2000). [20:00:05] MatmaRex, katherine_g, and cdanis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:30] a lot of things [20:00:43] o/ I'm happy to go last [20:00:44] hi [20:00:48] o/ [20:01:08] !log dzahn@dns1004 END - running authdns-update [20:01:11] katherine_g: will you want to deploy your patch yourself? 
[20:01:22] sure! I can start that now with spiderpig [20:01:36] i have some routine maintenance scripts i need someone to run for me, please be my hands :) [20:01:55] katherine_g: in that case, maybe i can do MatmaRex's stuff and then transfer it to you? :) [20:01:55] (and two config changes that should have no visible effect) [20:02:36] sure [20:02:59] !log Run mwscript-k8s --comment="T400618" --follow -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=dewiki --logwiki=metawiki 'Editor Socks' 'Socks' [20:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:05] T400618: Unblock stuck global rename of Socks - https://phabricator.wikimedia.org/T400618 [20:03:23] (03PS2) 10Bartosz Dziewoński: Use FallbackContentHandler for another undeployed content handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) [20:03:32] (03CR) 10Urbanecm: [C:03+2] Use FallbackContentHandler for another undeployed content handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [20:03:35] (03CR) 10Urbanecm: [C:03+2] Simplify $wgContactConfig required checkboxes validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159610 (owner: 10Bartosz Dziewoński) [20:04:08] katherine_g: you're holding the deployment lock though :) [20:04:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [20:04:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159610 (owner: 10Bartosz Dziewoński) [20:04:28] (03Merged) 10jenkins-bot: Use FallbackContentHandler for another undeployed content handler [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [20:04:30] (03Merged) 10jenkins-bot: Simplify $wgContactConfig required checkboxes validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159610 (owner: 10Bartosz Dziewoński) [20:04:32] ooh wasn't sure that would lock it, try now? [20:04:37] (03PS1) 10Andrew Bogott: Add puppet role and preseed for cloudcephosd1052 [puppet] - 10https://gerrit.wikimedia.org/r/1174022 (https://phabricator.wikimedia.org/T394333) [20:04:53] katherine_g: works, ty! [20:04:56] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1166286|Use FallbackContentHandler for another undeployed content handler (T124748)]], [[gerrit:1159610|Simplify $wgContactConfig required checkboxes validation]] [20:05:02] T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [20:05:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11044975 (10Andrew) hostname should be cloudcephosd1052. Attached patch sets up initial puppet and partman. 
[20:05:23] MatmaRex: your second script complains `ERROR: Auto migration is disabled and email addresses do not match for: Inverted Pages` :/ [20:05:52] (03CR) 10Dzahn: [C:03+2] "dns1004:~] $ host gerrit-spare.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1173376 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [20:06:02] urbanecm: try with --auto, i think that's somewhat expected [20:06:16] was looking for the right param, but you were quicker [20:06:19] that works [20:06:51] i'm not sure why exactly that happens, but given that we want to lock the account anyway, i don't think it matters [20:07:05] !log urbanecm@deploy1003 matmarex, urbanecm: Backport for [[gerrit:1166286|Use FallbackContentHandler for another undeployed content handler (T124748)]], [[gerrit:1159610|Simplify $wgContactConfig required checkboxes validation]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:35] the first one isn't easily testable, having a look at the second [20:07:38] ty [20:07:55] 06SRE: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753 (10CDanis) 03NEW [20:07:58] 06SRE: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753#11044988 (10CDanis) [20:08:07] !log Run `mwscript-k8s --follow -- extensions/CentralAuth/maintenance/migrateAccount.php --wiki=aawiki --username='Inverted Pages' --auto` (T396091) [20:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:12] T396091: Fix accidentally unmerged global account - https://phabricator.wikimedia.org/T396091 [20:08:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11044990 (10Andrew) @Jclark-ctr are we waiting on more DACs before we can move ahead with these? 
[20:08:25] (03PS3) 10CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401 (https://phabricator.wikimedia.org/T400753) [20:08:27] !log Run `mwscript-k8s --file=users.txt --follow -- extensions/CentralAuth/maintenance/attachAccount.php --wiki=aawiki --userlist users.txt` (T396091; users.txt is `Inverted Pages`) [20:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:33] (03PS3) 10CDanis: varnish: include wmfuniq count in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1172402 (https://phabricator.wikimedia.org/T400753) [20:09:25] urbanecm: looks good [20:10:35] urbanecm: btw, i think i figured out why the renames get stuck so often, or at least one of the reasons why. they should be retried, but instead they get stuck in a bad state. i'll try to fix this [20:10:42] MatmaRex: ran attachAccount.php to fully fix it, and locked via the web interface. hopefully all good now? [20:11:12] !log urbanecm@deploy1003 matmarex, urbanecm: Continuing with sync [20:11:21] looks good, proceeding with patches [20:11:21] urbanecm: yeah, that account looks good [20:11:42] i thought attachAccount.php wouldn't be necessary when the global account didn't exist in the first place [20:11:50] (03CR) 10Ahmon Dancy: "All other existing references to journalctl and systemctl in this file use /bin, so I followed suit.
/bin/journalctl and /usr/bin/journal" [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [20:11:58] (03CR) 10Andrew Bogott: [C:03+2] Add puppet role and preseed for cloudcephosd1052 [puppet] - 10https://gerrit.wikimedia.org/r/1174022 (https://phabricator.wikimedia.org/T394333) (owner: 10Andrew Bogott) [20:12:16] MatmaRex: i first created it via migrate account, but that attached only _one_ account (out of the three existing local ones) [20:12:47] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1023.eqiad.wmnet with OS bookworm [20:13:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11045008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with... [20:14:22] i see [20:16:39] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166286|Use FallbackContentHandler for another undeployed content handler (T124748)]], [[gerrit:1159610|Simplify $wgContactConfig required checkboxes validation]] (duration: 11m 43s) [20:16:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [20:16:45] T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [20:16:58] a lot of errors like `PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Pager\IndexPager::buildQueryInfo (MediaWiki\Pager\LogPager): [1969] Query execution was interrupted (max_statement_time exceeded)`, but that seems to be pre-existing [20:17:28] yeah… i don't see how any of this could touch IndexPager [20:17:35] Nod. Those are existing.
[20:17:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11045018 (10VRiley-WMF) I have the same issue with the same error on clouddb1023 [20:18:02] dancy: fairly-high volume though! :/ [20:18:03] that looks like a timeout from one of the usual suspects [20:18:09] MatmaRex: anyway, should be done now. [20:18:24] thanks urbanecm [20:18:27] np [20:18:30] katherine_g: over to you :) [20:18:39] alright, ty! [20:19:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [20:19:39] https://phabricator.wikimedia.org/search/query/SnTi9j2QtbxM/#R "IndexPager::buildQueryInfo" "ignored an error originally raised from" [20:20:06] (03Merged) 10jenkins-bot: Add experiment code to group by toggle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174002 (https://phabricator.wikimedia.org/T397728) (owner: 10Kgraessle) [20:20:48] urbanecm: That deprecation warning was hidden by logspam-watch for quite a while until I uncovered it recently. [20:20:57] 20:20:12 concurrent prep is locked by mwpresync (pid 3827323) on Tue Jul 29 20:16:42 2025; reason is "Publishing wmf/next image". [20:21:05] that one is new to me [20:21:15] we...probably should not run automated jobs during a window [20:21:17] oh that's me. lemme kill that. [20:21:24] (or is it not automated?) [20:21:24] I thought I had cleanly terminated it earlier [20:22:32] cdanis: Do you have root on deploy1003? If so, can you `kill 3849855` ? [20:22:46] (03CR) 10Dzahn: [C:03+1] "you are right, they are hard links. both exist. "which" says /usr/bin which made me comment that." [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [20:23:00] dancy: looks like it's gone?
[20:23:11] (03CR) 10Dzahn: [C:03+2] data.yaml: Fix copy-pasted typo in journalctl command [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [20:23:25] cdanis: and `kill 3827323` please. [20:23:50] dancy: done [20:23:58] the first one I didn't kill btw, it was dead already [20:24:11] Gotcha. Thanks for the assist [20:24:16] np [20:24:22] Sorry for the disruption folks! [20:25:14] np [20:29:46] cdanis: I would like to use the window when you're done with your deployment. [20:29:54] ah, is katherine_g all done? [20:30:01] yep! [20:30:07] yeah, that was a beta-only change. [20:30:10] (Quickie) [20:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174016 (https://phabricator.wikimedia.org/T398596) (owner: 10CDanis) [20:30:49] (03CR) 10Majavah: "/usr/bin is the canonical path:" [puppet] - 10https://gerrit.wikimedia.org/r/1174014 (owner: 10Ahmon Dancy) [20:30:50] got it, thanks :) [20:31:57] (03Merged) 10jenkins-bot: probenet: Report CDN host handling each measure request [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174016 (https://phabricator.wikimedia.org/T398596) (owner: 10CDanis) [20:32:19] !log cdanis@deploy1003 Started scap sync-world: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]] [20:32:25] T398596: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596 [20:34:27] !log cdanis@deploy1003 cdanis: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:36:27] MatmaRex: fyi T400755. we probably do not need the permission to begin with. 
[20:36:28] T400755: Remove the stewards ability to delete/unmerge global accounts - https://phabricator.wikimedia.org/T400755 [20:37:30] !log cdanis@deploy1003 cdanis: Continuing with sync [20:37:56] urbanecm: +1 [20:38:18] urbanecm: I've been meaning to file something similar for a while, thanks [20:41:27] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [20:41:34] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11045097 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... [20:42:07] (PS1) Zabe: Remove centralauth-unmerge from stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) [20:42:25] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [20:42:32] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11045100 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [20:42:47] !log cdanis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]] (duration: 10m 27s) [20:42:53] T398596: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596 [20:44:18] (CR) Btullis: "This will need a chart version bump, but +1 in principle from me."
[deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup) [20:46:10] (CR) Btullis: site: assign the insetup::data_platform_ferm role to an-airflow hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) (owner: Brouberol) [20:47:20] (CR) Btullis: [C:+1] site: assign the insetup::data_platform_ferm role to an-airflow hosts [puppet] - https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) (owner: Brouberol) [20:49:21] cdanis: Good to go [20:49:23] ? [20:53:22] (PS1) Bking: opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [20:57:31] (CR) Urbanecm: [C:-2] "code lgtm, but let's leave the task on for a while, to ensure there are no objections" [mediawiki-config] - https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: Zabe) [20:59:02] (PS2) Bking: Introduce opensearch-operator-crds chart [deployment-charts] - https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250729T2100) [21:00:09] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159#11045143 (wiki_willy) a:Jclark-ctr [21:01:00] vriley@cumin1002 reimage (PID 3388460) is awaiting input [21:01:31] dancy: sorry!
yes [21:01:40] thx [21:02:11] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11045152 (wiki_willy) a:VRiley-WMF [21:02:19] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [21:02:57] ops-eqiad, SRE, DC-Ops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#11045153 (wiki_willy) a:Jclark-ctr [21:03:17] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 00m 57s) [21:04:14] (PS4) CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - https://gerrit.wikimedia.org/r/1172401 [21:04:46] (PS1) CDanis: haproxy: tests: env var for which docker [puppet] - https://gerrit.wikimedia.org/r/1174043 [21:05:06] ops-codfw, ops-eqiad, SRE, DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11045162 (wiki_willy) a:VRiley-WMF [21:05:44] (PS5) CDanis: haproxy: scrub part of x-analytics even when xwd debug [puppet] - https://gerrit.wikimedia.org/r/1172401 (https://phabricator.wikimedia.org/T400753) [21:05:45] (PS4) CDanis: varnish: include wmfuniq count in x-analytics [puppet] - https://gerrit.wikimedia.org/r/1172402 (https://phabricator.wikimedia.org/T400753) [21:06:17] (Abandoned) CDanis: haproxy: Allow empty ring defs as a placeholder [puppet] - https://gerrit.wikimedia.org/r/1118211 (owner: CDanis) [21:08:35] ops-eqiad, SRE, DC-Ops: Outbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T398006#11045167 (wiki_willy) a:Jclark-ctr [21:08:39] ops-eqiad, SRE, DC-Ops: Outbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://phabricator.wikimedia.org/T398006#11045168
(Jclark-ctr) Open→Declined [21:09:20] !log ryankemper@cumin1002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [21:13:43] SRE, cloud-services-team, DC-Ops, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11045171 (wiki_willy) [21:16:47] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#11045175 (wiki_willy) Open→Resolved a:RobH Resolving task, we will be installing two new Fundraising cabinets as a solution instead. [21:16:49] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250714/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) [21:18:18] SRE, Data-Platform-SRE, DC-Ops: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#11045178 (wiki_willy) [21:19:35] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714/ using stat1009.eqiad.wmnet) [21:20:34] (CR) Fabfur: [C:+1] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1174043 (owner: CDanis) [21:20:45] (CR) CDanis: [C:+2] haproxy: tests: env var for which docker [puppet] - https://gerrit.wikimedia.org/r/1174043 (owner: CDanis) [21:20:57] (PS2) Bking: opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:24:32] (PS1) Mstyles: WebAuthn: Add config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) [21:27:42] (PS1) Mstyles: OATHAuth: Add Config Variable [mediawiki-config] - https://gerrit.wikimedia.org/r/1174049 (https://phabricator.wikimedia.org/T400579) [21:29:39] (PS1) Bking: cirrussearch2091: add back into production [puppet] - https://gerrit.wikimedia.org/r/1174050 (https://phabricator.wikimedia.org/T400640) [21:30:20] (CR) Ryan Kemper: [C:+1] cirrussearch2091: add back into production [puppet] - https://gerrit.wikimedia.org/r/1174050 (https://phabricator.wikimedia.org/T400640) (owner: Bking) [21:30:49] (CR) Bking: [C:+2] cirrussearch2091: add back into production [puppet] - https://gerrit.wikimedia.org/r/1174050 (https://phabricator.wikimedia.org/T400640) (owner: Bking) [21:34:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [21:36:42] (PS1) Ahmon Dancy: pretrain: Use && instead of ';' to separate commands [puppet] - https://gerrit.wikimedia.org/r/1174051 (https://phabricator.wikimedia.org/T398873) [21:39:03] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1173510 (owner: TrainBranchBot) [21:39:33] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1174051 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy) [21:44:56] ops-eqiad, DC-Ops: Alert for device
lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400758 (phaultfinder) NEW [21:47:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174057 [21:47:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174057 (owner: TrainBranchBot) [21:47:40] (CR) Melos: [C:-1] admin: remove prod access for listed users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins) [21:50:39] (CR) Dzahn: [C:+2] pretrain: Use && instead of ';' to separate commands [puppet] - https://gerrit.wikimedia.org/r/1174051 (https://phabricator.wikimedia.org/T398873) (owner: Ahmon Dancy) [21:51:25] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [21:51:50] (CR) Ladsgroup: "Thanks! Would you mind doing that? I'm not super comfortable with the codebase and this is last piece left to call the ticket done." [deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup) [21:58:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2091.codfw.wmnet with reason: host reimage [21:59:57] ops-eqiad, DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400763 (phaultfinder) NEW [22:01:11] vriley@cumin1002 reimage (PID 3392182) is awaiting input [22:01:47] I'm grabbing the tail of the Web deployment window.
[22:01:59] (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1172397 (https://phabricator.wikimedia.org/T366095) (owner: DLynch) [22:02:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-gf95s - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [22:02:51] (Merged) jenkins-bot: Enable DiscussionTools thanks on existing "report incident" wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1172397 (https://phabricator.wikimedia.org/T366095) (owner: DLynch) [22:03:15] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1172397|Enable DiscussionTools thanks on existing "report incident" wikis (T366095)]] [22:03:21] T366095: Deploy comment thanking to all wikis - https://phabricator.wikimedia.org/T366095 [22:05:26] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1172397|Enable DiscussionTools thanks on existing "report incident" wikis (T366095)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:07:57] PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f63f96d21c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [22:07:57] org/wiki/Search%23Administration [22:08:31] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174057 (owner: TrainBranchBot) [22:08:35] cwhite@cumin2002 reimage (PID 1248291) is awaiting input [22:09:50] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2035.codfw.wmnet with OS bookworm [22:10:18] !log kemayo@deploy1003 kemayo: Continuing with sync [22:10:19] !log cwhite@cumin2002 START - Cookbook sre.hosts.move-vlan for host logstash2035 [22:10:29] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [22:14:07] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714/ using stat1009.eqiad.wmnet) [22:15:01] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2035 - cwhite@cumin2002" [22:15:06] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2035 - cwhite@cumin2002" [22:15:06] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:15:07] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2035.codfw.wmnet 28.32.192.10.in-addr.arpa
8.2.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:15:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2035.codfw.wmnet 28.32.192.10.in-addr.arpa 8.2.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:15:11] !log cwhite@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2035 [22:15:44] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172397|Enable DiscussionTools thanks on existing "report incident" wikis (T366095)]] (duration: 12m 28s) [22:15:49] T366095: Deploy comment thanking to all wikis - https://phabricator.wikimedia.org/T366095 [22:18:13] cwhite@cumin2002 reimage (PID 1248291) is awaiting input [22:19:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2091.codfw.wmnet with OS bullseye [22:21:40] (CR) Dzahn: "maybe start with the users who have already opted in, jamesur and matanya and then do another change?" [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins) [22:23:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2035 [22:23:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host logstash2035 [22:24:30] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714/ using stat1009.eqiad.wmnet) [22:31:49] (CR) Btullis: "I'm happy to do it, but I'm out for a few days."
[deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup) [22:37:59] (PS1) Zabe: CommonSettings: Stop setting wgDBuser [mediawiki-config] - https://gerrit.wikimedia.org/r/1174071 [22:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:42:18] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2035.codfw.wmnet with reason: host reimage [22:48:54] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2035.codfw.wmnet with reason: host reimage [22:54:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [22:54:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [22:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:03:23] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1174080 [23:04:40] (CR) Urbanecm: [C:-1] admin: remove prod access for listed users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: CDobbins) [23:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet
CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:26] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2035.codfw.wmnet with OS bookworm [23:12:14] PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:38:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1174091 [23:38:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1174091 (owner: TrainBranchBot) [23:49:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:51:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1174091 (owner: TrainBranchBot) [23:57:04] (PS3) Andrea Denisse: centrallog: Add sampling rules for debug logging [puppet] - https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) [23:57:05] (CR) Andrea Denisse: "Hi folks, after testing various approaches in Pontoon, I’ve integrated the debug sampling rules directly into the main rsyslog-receiver co" [puppet] - https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner:
Andrea Denisse) [23:58:11] (PS4) Andrea Denisse: centrallog: Add sampling rules for debug logging [puppet] - https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309)