[00:05:23] 06SRE, 10SRE-Access-Requests: Superset / LDAP access for aude - https://phabricator.wikimedia.org/T402022#11104951 (10Dzahn) [00:05:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [00:08:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P81618 and previous config saved to /var/cache/conftool/dbconfig/20250821-000817-fceratto.json [00:08:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:08:46] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [00:09:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045 [00:09:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1180685 [00:09:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1180685 (owner: 10TrainBranchBot) [00:09:50] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:14:34] (03CR) 10Cathal Mooney: [C:03+1] [WIP] Routed ganeti: improve firewalling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [00:23:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T399249)', diff saved to https://phabricator.wikimedia.org/P81619 and previous config saved to /var/cache/conftool/dbconfig/20250821-002325-fceratto.json [00:23:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [00:23:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1221.eqiad.wmnet with reason: Maintenance [00:23:59] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:24:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T399249)', diff saved to https://phabricator.wikimedia.org/P81620 and previous config saved to /var/cache/conftool/dbconfig/20250821-002406-fceratto.json [00:24:28] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:26:30] (03PS1) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180689 [00:28:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11104988 (10VRiley-WMF) I was able to make a bit more progress with cloudcephosd1045. There was some foam that was stuck inside the port, and once removed it seemingly came up and the card st... 
[00:35:44] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1180685 (owner: 10TrainBranchBot) [00:37:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T399249)', diff saved to https://phabricator.wikimedia.org/P81621 and previous config saved to /var/cache/conftool/dbconfig/20250821-003707-fceratto.json [00:37:12] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [00:52:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P81622 and previous config saved to /var/cache/conftool/dbconfig/20250821-005215-fceratto.json [01:00:43] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:07:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P81623 and previous config saved to /var/cache/conftool/dbconfig/20250821-010723-fceratto.json [01:11:49] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:12:54] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 11s) [01:22:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T399249)', diff saved to https://phabricator.wikimedia.org/P81624 and previous config saved to /var/cache/conftool/dbconfig/20250821-012230-fceratto.json [01:22:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [01:22:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [01:22:54] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Depooling db1206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81625 and previous config saved to /var/cache/conftool/dbconfig/20250821-012253-fceratto.json [01:26:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:05:29] (03PS1) 10Andrew Bogott: cloudceph: add new OSDs: cloudcephosd1042-1051 [puppet] - 10https://gerrit.wikimedia.org/r/1180693 (https://phabricator.wikimedia.org/T395910) [02:06:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180693 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott) [02:09:40] (03PS6) 10Novem Linguae: Enable electionclerk user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [02:10:43] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: add new OSDs: cloudcephosd1042-1051 [puppet] - 10https://gerrit.wikimedia.org/r/1180693 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott) [02:21:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11105108 (10Andrew) [02:33:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:35:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81626 and previous config saved to /var/cache/conftool/dbconfig/20250821-023536-fceratto.json [02:39:11] (03PS1) 10Andrew Bogott: cloudceph: mark out some OSD nodes 
not yet ready for action [puppet] - 10https://gerrit.wikimedia.org/r/1180696 (https://phabricator.wikimedia.org/T401693) [02:40:00] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: mark out some OSD nodes not yet ready for action [puppet] - 10https://gerrit.wikimedia.org/r/1180696 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [02:49:17] (03PS1) 10Andrew Bogott: cloudceph: further mark out OSD nodes not yet ready for action [puppet] - 10https://gerrit.wikimedia.org/r/1180697 (https://phabricator.wikimedia.org/T401693) [02:50:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P81627 and previous config saved to /var/cache/conftool/dbconfig/20250821-025044-fceratto.json [02:50:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180697 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [02:53:20] (03PS2) 10Andrew Bogott: cloudceph: further mark out OSD nodes not yet ready for action [puppet] - 10https://gerrit.wikimedia.org/r/1180697 (https://phabricator.wikimedia.org/T401693) [02:54:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180697 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [02:57:17] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: further mark out OSD nodes not yet ready for action [puppet] - 10https://gerrit.wikimedia.org/r/1180697 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [03:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:05:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P81628 and previous config saved to /var/cache/conftool/dbconfig/20250821-030552-fceratto.json [03:06:46] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11105131 (10Andrew) [03:23:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81629 and previous config saved to /var/cache/conftool/dbconfig/20250821-032059-fceratto.json [03:23:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [03:23:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T399249)', diff saved to https://phabricator.wikimedia.org/P81630 and previous config saved to /var/cache/conftool/dbconfig/20250821-032111-fceratto.json [04:03:08] 10ops-codfw, 06SRE, 06DC-Ops: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310#11105165 (10Papaul) p:05Triage→03Medium [04:32:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399249)', diff saved to https://phabricator.wikimedia.org/P81631 and previous config saved to /var/cache/conftool/dbconfig/20250821-043224-fceratto.json [04:32:29] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [04:33:44] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from 
wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [04:33:49] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [04:47:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P81632 and previous config saved to /var/cache/conftool/dbconfig/20250821-044731-fceratto.json [05:01:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180701 (https://phabricator.wikimedia.org/T402134) (owner: 10Anzx) [05:02:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P81633 and previous config saved to /var/cache/conftool/dbconfig/20250821-050239-fceratto.json [05:03:22] (03PS1) 10Arnaudb: gitlab: throttling policy toggle [puppet] - 10https://gerrit.wikimedia.org/r/1180703 (https://phabricator.wikimedia.org/T400971) [05:04:22] (03CR) 10Arnaudb: [C:03+2] "throttling policy switch back to drop" [puppet] - 10https://gerrit.wikimedia.org/r/1180703 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [05:08:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T399249)', diff saved to https://phabricator.wikimedia.org/P81634 and previous config saved to /var/cache/conftool/dbconfig/20250821-051746-fceratto.json [05:17:52] T399249: Add cl_timestamp_id index to categorylinks 
table - https://phabricator.wikimedia.org/T399249 [05:18:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance [05:18:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81635 and previous config saved to /var/cache/conftool/dbconfig/20250821-051809-fceratto.json [05:24:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [05:25:01] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [05:25:27] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:47:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0600). 
[06:18:38] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:19:20] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:19:25] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [06:22:42] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2026.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:35:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T399249)', diff saved to https://phabricator.wikimedia.org/P81636 and previous config saved to /var/cache/conftool/dbconfig/20250821-063550-fceratto.json [06:35:55] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [06:36:46] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11105337 (10ayounsi) Note we need to keep in mind that the main goal here is to move the mgmt routers to use BGP instead of OSPF. It's fine to do some light recabling if it m... 
[06:45:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81637 and previous config saved to /var/cache/conftool/dbconfig/20250821-064539-fceratto.json [06:45:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [06:47:16] (03PS1) 10Giuseppe Lavagetto: Add deprecation scope [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180713 [06:50:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P81638 and previous config saved to /var/cache/conftool/dbconfig/20250821-065057-fceratto.json [06:55:52] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Add deprecation scope - oblivian@cumin1003" [06:55:54] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add deprecation scope - oblivian@cumin1003 [06:56:39] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add deprecation scope - oblivian@cumin1003 [06:56:40] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add deprecation scope - oblivian@cumin1003" [06:58:03] (03CR) 10Kevin Bazira: [C:03+2] "Proceeding to +2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180506 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [06:59:38] (03Merged) 10jenkins-bot: ml-services: stop using weighted_tags.rc0 stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180506 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [07:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0700) [07:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] o/ [07:00:31] kevinbazira: thanks for the +2 :) [07:00:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P81639 and previous config saved to /var/cache/conftool/dbconfig/20250821-070047-fceratto.json [07:01:17] dcausse: o/ np... going to proceed with a deployment [07:01:21] !log installing openjdk-21 security updates [07:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:19] (03PS2) 10Giuseppe Lavagetto: varnish: add new requestctl file for deprecations [puppet] - 10https://gerrit.wikimedia.org/r/1180711 (https://phabricator.wikimedia.org/T398161) [07:04:10] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6679/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180712 (https://phabricator.wikimedia.org/T398161) (owner: 10Giuseppe Lavagetto) [07:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:06:06] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P81640 and previous config saved to /var/cache/conftool/dbconfig/20250821-070605-fceratto.json [07:08:54] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [07:09:16] deployment complete --^ [07:12:03] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database rkiwiki (T392502) [07:12:07] T392502: [wikireplicas] Create views for new wiki rkiwiki - https://phabricator.wikimedia.org/T392502 [07:12:49] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:13:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2026.codfw.wmnet w/ force delete existing files, repooling both afterwards [07:13:24] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [07:15:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P81641 and previous config saved to /var/cache/conftool/dbconfig/20250821-071554-fceratto.json [07:16:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11105392 (10ayounsi) [07:17:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [07:18:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new 
structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11105396 (10ayounsi) Sounds good ! It would also be fine to route temporarily through the core routers depend... [07:21:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T399249)', diff saved to https://phabricator.wikimedia.org/P81642 and previous config saved to /var/cache/conftool/dbconfig/20250821-072113-fceratto.json [07:21:18] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:21:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1238.eqiad.wmnet with reason: Maintenance [07:21:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T399249)', diff saved to https://phabricator.wikimedia.org/P81643 and previous config saved to /var/cache/conftool/dbconfig/20250821-072136-fceratto.json [07:22:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:28:38] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:31:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81644 and previous config saved to /var/cache/conftool/dbconfig/20250821-073102-fceratto.json [07:31:07] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:31:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 8:00:00 on db1232.eqiad.wmnet with reason: Maintenance [07:31:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T399249)', diff saved to https://phabricator.wikimedia.org/P81645 and previous config saved to /var/cache/conftool/dbconfig/20250821-073125-fceratto.json [07:39:09] Hi! I want to run a maintenance script to add wikidata support for bewwiktionary. Let me know if this is a bad time, otherwise i will proceed https://phabricator.wikimedia.org/T402130 [07:39:58] gonna do it with mwscript-k8s for the first time so i'm crossing fingers =# [07:43:25] (03PS1) 10Ayounsi: Create temp test VM in magru [puppet] - 10https://gerrit.wikimedia.org/r/1180724 (https://phabricator.wikimedia.org/T396864) [07:52:33] !log joelyrookewmde@deploy1003 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https # [Add wikidata support ticket PhabId] [07:53:21] !log ^for T402130 [07:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:25] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [08:00:05] jnuche and jeena: Time to snap out of that daydream and deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0800). [08:00:35] hi, deploying train in ~m [08:00:40] *5m [08:04:50] joelyrookewmde: has your script completed? [08:05:09] we're in the 'l's [08:05:14] so maybe halfway [08:05:39] sorry about that! 
[08:05:41] joelyrookewmde: ok, please let me know when I can start the train :) [08:05:53] will do [08:08:43] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database rkiwiki (T392502) [08:08:48] T392502: [wikireplicas] Create views for new wiki rkiwiki - https://phabricator.wikimedia.org/T392502 [08:09:23] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database bewwiktionary (T402137) [08:09:27] T402137: [wikireplicas] Create views for new wiki bewwiktionary - https://phabricator.wikimedia.org/T402137 [08:09:34] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database bewwiktionary (T402137) [08:09:55] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database zghwiktionary (T399788) [08:09:59] T399788: [wikireplicas] Create views for new wiki zghwiktionary - https://phabricator.wikimedia.org/T399788 [08:10:06] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database zghwiktionary (T399788) [08:10:12] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database minwikibooks (T395502) [08:10:16] T395502: [wikireplicas] Create views for new wiki minwikibooks - https://phabricator.wikimedia.org/T395502 [08:10:22] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database minwikibooks (T395502) [08:10:35] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database madwikisource (T391770) [08:10:39] T391770: [wikireplicas] Create views for new wiki madwikisource - https://phabricator.wikimedia.org/T391770 [08:10:45] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database madwikisource (T391770) [08:10:53] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database tlwikisource (T388657) [08:10:58] T388657: [wikireplicas] Create views for new wiki tlwikisource - 
https://phabricator.wikimedia.org/T388657 [08:11:04] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database tlwikisource (T388657) [08:11:26] !log installing openjdk-17 security updates [08:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:02] (03CR) 10Alexandros Kosiaris: [C:03+1] php: remove deprecated ${} string interpolation [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: 10Scott French) [08:15:50] jnuche script is done! [08:16:07] joelyrookewmde: thx! [08:17:07] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180812 (https://phabricator.wikimedia.org/T396376) [08:17:11] I enjoyed my 16 mins of holding up a train a la wild west '=D [08:18:32] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host testvm7001.magru.wmnet [08:18:33] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [08:18:44] log Finished populateSitesTable for bewwiktionary https://phabricator.wikimedia.org/T402130 [08:18:51] !log Finished populateSitesTable for bewwiktionary https://phabricator.wikimedia.org/T402130 [08:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:57] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm7001.magru.wmnet - ayounsi@cumin1003" [08:24:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm7001.magru.wmnet - ayounsi@cumin1003" [08:24:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:24:01] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache testvm7001.magru.wmnet on all recursors [08:24:04] !log ayounsi@cumin1003 END 
(PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm7001.magru.wmnet on all recursors [08:24:34] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm7001.magru.wmnet - ayounsi@cumin1003" [08:24:39] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm7001.magru.wmnet - ayounsi@cumin1003" [08:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:27:41] ayounsi@cumin1003 makevm (PID 3021153) is awaiting input [08:28:54] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.15 refs T396376 [08:28:58] T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376 [08:29:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:31:24] ^ the esams alert is caused by Puppet server restarts, will recover shortly [08:31:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm7001.magru.wmnet with OS bookworm [08:37:17] (03PS2) 10Ayounsi: nftables: Configure a directory with rules affecting the forward chain [puppet] - 10https://gerrit.wikimedia.org/r/1180734 (https://phabricator.wikimedia.org/T402372) (owner: 10Muehlenhoff) [08:44:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399249)', diff saved to https://phabricator.wikimedia.org/P81646 and previous config saved to 
/var/cache/conftool/dbconfig/20250821-084426-fceratto.json [08:44:32] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:51:39] (03CR) 10Thiemo Kreuz (WMDE): "Yes. I realized too late. 😄" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [08:54:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:58:27] (03PS1) 10Bartosz Wójtowicz: statistics: Update model upload script to check for correct boto3 version. [puppet] - 10https://gerrit.wikimedia.org/r/1180823 (https://phabricator.wikimedia.org/T394301) [08:58:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [08:59:14] (03PS1) 10KartikMistry: Filter non-top-level sections during section title assignment [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180824 (https://phabricator.wikimedia.org/T387427) [08:59:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P81647 and previous config saved to /var/cache/conftool/dbconfig/20250821-085933-fceratto.json [08:59:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180824 (https://phabricator.wikimedia.org/T387427) (owner: 10KartikMistry) [09:03:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [09:04:13] (03PS4) 
10Ayounsi: [WIP] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [09:04:43] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: add query to track migrated nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1180826 (https://phabricator.wikimedia.org/T395446) [09:05:46] (03CR) 10Vgutierrez: [C:03+1] "looking good:" [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [09:05:49] 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11105664 (10fnegri) [09:06:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:07:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:09:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:14:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P81648 and previous config saved to /var/cache/conftool/dbconfig/20250821-091441-fceratto.json [09:15:59] (03PS5) 10Muehlenhoff: [WIP] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:16:03] (03PS1) 10Vgutierrez: admin: Add sadiyamohammed13 to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/1180827 (https://phabricator.wikimedia.org/T401118) [09:16:08] (03CR) 10Tiziano Fogli: [C:03+2] "Self-merging since this only adds a query to an existing exporter." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180826 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:16:42] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product, 13Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T401118#11105719 (10Vgutierrez) a:03Vgutierrez [09:17:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:17:53] (03CR) 10Vgutierrez: "note to reviewers: It's my understanding that LDAP groups should be modified after merging this CR per https://wikitech.wikimedia.org/wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1180827 (https://phabricator.wikimedia.org/T401118) (owner: 10Vgutierrez) [09:18:56] (03CR) 10Clément Goubert: [C:03+1] "nit: maybe add a comment explaining exclusion is done at the node exporter level?" [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [09:19:36] (03CR) 10Cathal Mooney: [WIP] Routed ganeti: improve firewalling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:19:39] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm7001.magru.wmnet with OS bookworm [09:19:39] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm7001.magru.wmnet [09:21:32] (03CR) 10Clément Goubert: [C:03+1] php: remove deprecated ${} string interpolation [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: 10Scott French) [09:24:44] (03CR) 10Ayounsi: [WIP] Routed ganeti: improve firewalling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:25:51] (03PS6) 10Ayounsi: Routed ganeti: improve firewalling 
[puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [09:26:56] (03PS7) 10Ayounsi: Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [09:27:00] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:29:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T399249)', diff saved to https://phabricator.wikimedia.org/P81650 and previous config saved to /var/cache/conftool/dbconfig/20250821-092948-fceratto.json [09:29:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:30:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1234.eqiad.wmnet with reason: Maintenance [09:30:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T399249)', diff saved to https://phabricator.wikimedia.org/P81651 and previous config saved to /var/cache/conftool/dbconfig/20250821-093011-fceratto.json [09:37:06] (03CR) 10Muehlenhoff: "Two nits, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:39:36] (03CR) 10David Caro: [C:03+2] aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [09:39:48] (03CR) 10Ayounsi: Routed ganeti: improve firewalling (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [09:39:51] (03PS8) 10Ayounsi: Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [09:45:12] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Add dse-k8s-codfw to service list 
[puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [09:47:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:48:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:48:49] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:56:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.837 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:56:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.994 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1000) [10:03:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:04:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:04:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [10:05:46] PROBLEM - mailman list info 
on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.791 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:10:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 8.670 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:13:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:14:06] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:14:16] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 81 connections established with conf2004.codfw.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [10:14:20] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:16:18] PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 95 connections established with conf2004.codfw.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [10:17:27] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm7001.magru.wmnet with OS bookworm [10:19:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:20:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:23:02] ayounsi@cumin1003 reimage (PID 3036867) is awaiting input [10:23:41] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:25:52] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm7001.magru.wmnet with OS bookworm [10:26:16] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm7001.magru.wmnet with OS bookworm [10:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:32:15] pybal alerts are related to ongoing work by stevemunene [10:37:58] !log stevemunene@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[2013-2014].codfw.wmnet} and A:lvs (T397301) [10:38:02] T397301: Bootstrap the dse-k8s-codfw cluster - https://phabricator.wikimedia.org/T397301 [10:38:30] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:43:18] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 83 connections established with conf2004.codfw.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [10:44:36] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:44:55] 10SRE-swift-storage, 10Observability-Logging, 10SRE Observability (FY2025/2026-Q1): rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11105891 (10MatthewVernon) Hi folks, thanks for 
the investigation while I was away! `sdg` is indeed failing on this host, I'll star... [10:45:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399249)', diff saved to https://phabricator.wikimedia.org/P81652 and previous config saved to /var/cache/conftool/dbconfig/20250821-104516-fceratto.json [10:45:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:48:59] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [10:49:24] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 97 connections established with conf2004.codfw.wmnet:4001 (min=97) https://wikitech.wikimedia.org/wiki/PyBal [10:49:58] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[2013-2014].codfw.wmnet} and A:lvs (T397301) [10:50:03] T397301: Bootstrap the dse-k8s-codfw cluster - https://phabricator.wikimedia.org/T397301 [10:51:58] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:52:28] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:52:58] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:53:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:53:30] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:53:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:54:01] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [10:58:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:58:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:58:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:00:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P81653 and previous config saved to /var/cache/conftool/dbconfig/20250821-110024-fceratto.json [11:01:32] (03CR) 10Stevemunene: [C:03+2] dse-k8s: setup the 
dse-k8s-codfw helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178827 (https://phabricator.wikimedia.org/T397297) (owner: 10Stevemunene) [11:02:46] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180827 (https://phabricator.wikimedia.org/T401118) (owner: 10Vgutierrez) [11:03:54] (03CR) 10Vgutierrez: [C:03+2] admin: Add sadiyamohammed13 to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/1180827 (https://phabricator.wikimedia.org/T401118) (owner: 10Vgutierrez) [11:04:22] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:06:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:06:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:09:03] (03Merged) 10jenkins-bot: dse-k8s: setup the dse-k8s-codfw helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178827 (https://phabricator.wikimedia.org/T397297) (owner: 
10Stevemunene) [11:09:31] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:09:56] (03PS1) 10KartikMistry: CX3 Build 1.0.0+20250821 [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180838 (https://phabricator.wikimedia.org/T387427) [11:10:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180838 (https://phabricator.wikimedia.org/T387427) (owner: 10KartikMistry) [11:10:39] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm7001.magru.wmnet with OS bookworm [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.879 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:14:17] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product, 13Patch-For-Review: Grant Access to for  - https://phabricator.wikimedia.org/T401118#11105954 (10Vgutierrez) 05Open→03Resolved ` vgutierrez@ldap-maint1001:~$ ldapsearch -x cn=nda |grep sad member:... 
[11:15:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P81654 and previous config saved to /var/cache/conftool/dbconfig/20250821-111531-fceratto.json [11:15:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.979 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:22] (Merged) jenkins-bot: dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [11:22:48] SRE, SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11105977 (Dima_Koushha_WMDE) Hi @Vgutierrez done! Should I link the change to the ticket? Thanks!
[11:29:33] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:30:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T399249)', diff saved to https://phabricator.wikimedia.org/P81655 and previous config saved to /var/cache/conftool/dbconfig/20250821-113039-fceratto.json [11:30:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:30:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance [11:31:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T399249)', diff saved to https://phabricator.wikimedia.org/P81656 and previous config saved to /var/cache/conftool/dbconfig/20250821-113101-fceratto.json [11:33:17] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. 
[11:34:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:34:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:37:40] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:39:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.033 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:42:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:42:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:43:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.714 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:44:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.631 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:47:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:47:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:50:27] 06SRE, 
06Infrastructure-Foundations: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511 (10cmooney) 03NEW p:05Triage→03Medium [11:50:49] 06SRE, 06Infrastructure-Foundations: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511#11106017 (10cmooney) [11:51:52] (03PS4) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) [11:52:17] (03PS3) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to support using JSON-RPC API [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) [11:52:40] (03PS2) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) [11:52:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.056 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:52:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:53:04] (03PS5) 10Cathal Mooney: wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T402511) [11:55:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:55:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.881 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.659 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1200) [12:01:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:27] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511#11106030 (10cmooney) [12:03:31] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511#11106032 (10cmooney) [12:06:24] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:06:42] (03CR) 10D3r1ck01: "@umherirrender_de.wp@web.de, so based on yesterday's activity, this should be abandoned, and we move forward with @thiemo.kreuz@wikimedia." [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [12:11:26] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[12:11:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.987 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.811 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:52] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [12:11:59] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [12:13:48] SRE, SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11106081 (Vgutierrez) SSH key verified out of band via https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1180839 [12:14:06] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [12:14:57] SRE, SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11106086 (Vgutierrez) [12:15:12] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [12:15:28] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [12:18:16] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [12:18:29] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [12:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:21:34] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [12:21:44] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[12:21:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180848 [12:23:15] the magru is caused by a puppetserver restart and expected, will recover soon [12:24:50] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host testvm7001.magru.wmnet with OS bookworm [12:24:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.489 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:25:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.745 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:15] (03PS1) 10Vgutierrez: admin: Add dimakoushha to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1180851 (https://phabricator.wikimedia.org/T402384) [12:33:51] (03CR) 10Volans: [C:03+1] "Change LGTM based on Supermicro's feedback, ofc to be tested on our actual hosts with the various firmware versions." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 (https://phabricator.wikimedia.org/T387577) (owner: 10JHathaway) [12:37:38] (03CR) 10Ayounsi: Nokia JSON-RPC: Add secrets to support using JSON-RPC API (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:40:43] (03CR) 10Ayounsi: [C:03+1] wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:41:03] (03CR) 10Ayounsi: [C:04-1] Nokia JSON-RPC: Add secrets to support using JSON-RPC API [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:41:14] (03CR) 10Ayounsi: Nokia JSON-RPC: Add secrets to support using JSON-RPC API [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:43:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399249)', diff saved to https://phabricator.wikimedia.org/P81657 and previous config saved to /var/cache/conftool/dbconfig/20250821-124318-fceratto.json [12:43:23] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:46:15] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Nokia: Support Python config generation and JSON-RPC transport in Homer - https://phabricator.wikimedia.org/T402511#11106160 (10cmooney) [12:48:47] (03CR) 10Volans: [C:03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:48:55] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [12:50:13] (03PS1) 10Ayounsi: Add mock homer password [labs/private] - 
10https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511) [12:50:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:51:25] (03PS1) 10Ayounsi: Homer: add password to config file [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) [12:51:59] (03PS2) 10Ayounsi: Add mock homer password [labs/private] - 10https://gerrit.wikimedia.org/r/1180855 (https://phabricator.wikimedia.org/T402511) [12:52:06] (03CR) 10CI reject: [V:04-1] Homer: add password to config file [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [12:53:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm7001.magru.wmnet with reason: host reimage [12:54:20] (03PS2) 10Ayounsi: Homer: add password to config file [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) [12:57:48] (03PS1) 10Daimona Eaytoy: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180858 (https://phabricator.wikimedia.org/T397476) [12:58:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180858 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [12:58:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P81658 and previous config saved to /var/cache/conftool/dbconfig/20250821-125825-fceratto.json [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: 
gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1300) [13:00:05] kart_, anzx, and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:17] o/ [13:00:30] here [13:00:35] I can self deploy.. [13:00:53] (03CR) 10Ayounsi: Nokia JSON-RPC: Add secrets to support using JSON-RPC API (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:01:01] i can deploy the remaining 2 patches once kart_ is done, unless anyone else wants to [13:01:19] o/ [13:01:41] oh, kart_'s a backport [13:01:52] so, let's start with the config changes and then turn over to kart_ [13:02:15] urbanecm: doable. Please let me know once done with configs. [13:02:18] will do! [13:02:22] (03CR) 10Ayounsi: Nokia JSON-RPC: Add secrets to support using JSON-RPC API (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180701 (https://phabricator.wikimedia.org/T402134) (owner: 10Anzx) [13:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180700 (https://phabricator.wikimedia.org/T402134) (owner: 10Anzx) [13:04:10] (03Merged) 10jenkins-bot: bewwiktionary: set sitename, project namespace & timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180701 (https://phabricator.wikimedia.org/T402134) (owner: 10Anzx) [13:04:13] (03Merged) 10jenkins-bot: bewwiktionary: add logos [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1180700 (https://phabricator.wikimedia.org/T402134) (owner: 10Anzx) [13:04:38] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1180701|bewwiktionary: set sitename, project namespace & timezone (T402134)]], [[gerrit:1180700|bewwiktionary: add logos (T402134)]] [13:04:42] T402134: Post-creation work for bewwiktionary - https://phabricator.wikimedia.org/T402134 [13:07:48] !log urbanecm@deploy1003 urbanecm, anzx: Backport for [[gerrit:1180701|bewwiktionary: set sitename, project namespace & timezone (T402134)]], [[gerrit:1180700|bewwiktionary: add logos (T402134)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:52] urbanecm: checking [13:08:06] Ty [13:08:50] urbanecm, kart_: might as well start gate-and-submit for that backport already if you’re expecting it to be slow? ^^ [13:09:08] urbanecm: all looks good, ok to sync [13:09:52] fair [13:09:54] !log urbanecm@deploy1003 urbanecm, anzx: Continuing with sync [13:10:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm7001.magru.wmnet with OS bookworm [13:10:15] Lucas_WMDE: although i usually attempt to start only one merge ahead [13:10:23] that way i can actually control what gets pulled :/ [13:10:31] (03PS2) 10Daimona Eaytoy: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180858 (https://phabricator.wikimedia.org/T397476) [13:10:34] (03CR) 10Urbanecm: [C:03+2] Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180858 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [13:11:20] oh, I didn’t realize there was still a config change in the queue [13:11:23] sorry [13:11:29] no worries [13:11:32] I thought you’d done them all together ^^ [13:11:56] (03Merged) 10jenkins-bot: Set 
wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180858 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [13:13:13] urbanecm: that's fine. we can wait :) [13:13:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P81659 and previous config saved to /var/cache/conftool/dbconfig/20250821-131333-fceratto.json [13:16:12] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180701|bewwiktionary: set sitename, project namespace & timezone (T402134)]], [[gerrit:1180700|bewwiktionary: add logos (T402134)]] (duration: 11m 34s) [13:16:16] T402134: Post-creation work for bewwiktionary - https://phabricator.wikimedia.org/T402134 [13:16:39] (03PS1) 10Giuseppe Lavagetto: Bugfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180861 [13:16:53] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1180858|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW (T397476)]] [13:16:58] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [13:19:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2212.codfw.wmnet with reason: Maintenance [13:19:37] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Bugfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180861 (owner: 10Giuseppe Lavagetto) [13:19:45] (03PS1) 10Tiziano Fogli: pdb_resource_exporter: add query to track migrated nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/1180862 (https://phabricator.wikimedia.org/T395446) [13:20:09] (03CR) 10Urbanecm: [C:03+2] CX3 Build 1.0.0+20250821 [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180838 
(https://phabricator.wikimedia.org/T387427) (owner: 10KartikMistry) [13:20:26] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003" [13:20:27] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003 [13:21:15] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003 [13:21:16] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003" [13:21:17] !log urbanecm@deploy1003 urbanecm, daimona: Backport for [[gerrit:1180858|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW (T397476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:38] Daimona: do you mind double checking? :) [13:21:45] Yup, thank you! [13:22:03] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250821 [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180838 (https://phabricator.wikimedia.org/T387427) (owner: 10KartikMistry) [13:22:36] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [13:23:39] (03CR) 10Tiziano Fogli: [C:03+2] "Self-merging since this only adds a query to an existing exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1180862 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:25:14] Tested, LGTM [13:25:30] !log urbanecm@deploy1003 urbanecm, daimona: Continuing with sync [13:25:30] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. 
[13:25:33] (CR) Majavah: [C:+1] mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1037766 (owner: Muehlenhoff)
[13:25:33] proceeding
[13:26:09] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[13:26:42] urbanecm: wmf branch merges are quicker now since it doesn't run CI tests already passed. Is that also same for master branch?
[13:27:22] kart_: i don't think so, because the assumption for `wmf` is that there is a corresponding `master` branch. that said, this is just my expectation, maybe it works in some other way.
[13:27:34] SRE, Infrastructure-Foundations, netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11106403 (Papaul) Understood
[13:27:35] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[13:27:42] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[13:27:55] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'.
[13:28:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:28:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T399249)', diff saved to https://phabricator.wikimedia.org/P81660 and previous config saved to /var/cache/conftool/dbconfig/20250821-132839-fceratto.json [13:28:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:28:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [13:30:55] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180858|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_NEW (T397476)]] (duration: 14m 01s) [13:31:00] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [13:31:01] Daimona: deployed [13:31:09] kart_: go ahead with your patch :) [13:31:21] cool [13:32:39] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1180838|CX3 Build 1.0.0+20250821 (T387427)]] [13:32:43] T387427: Section selector shows html markup in section title and such sections fails to load - https://phabricator.wikimedia.org/T387427 [13:33:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:36:43] !log kartik@deploy1003 kartik: Backport for [[gerrit:1180838|CX3 Build 1.0.0+20250821 (T387427)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:36:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:36:58] !incidents [13:36:59] 6627 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:37:01] !ack 6627 [13:37:01] 6627 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:37:07] quicker than me :) [13:37:29] heh [13:37:31] same issue as yesterday? [13:37:37] kinda [13:37:48] <_joe_> what's going on? [13:37:48] does turnilo works for you? [13:38:01] yeah https://grafana.wikimedia.org/goto/t5wunvXNR?orgId=1 [13:38:18] _joe_: haproxy rejecting traffic cause it's maxing out the 20k connections to varnish [13:38:29] we might want to roll back that maxconn I set [13:38:33] <_joe_> the same actors again? [13:38:43] it might be causing more harm than good [13:38:47] <_joe_> cdanis: not sure tbh, unless this has to do with not-an-attack [13:38:48] _joe_: apparently [13:39:22] <_joe_> ok, can someone from traffic actually take care of https://phabricator.wikimedia.org/T402501, and extend the ban to any other involved actors? [13:39:30] (03PS5) 10Tchanders: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) [13:40:19] (03CR) 10Tchanders: "Resolving this, following general agreement." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [13:40:33] moving to -sec [13:40:42] (03CR) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to support using JSON-RPC API (032 comments) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:40:57] urbanecm: belated thank you :) (Sorry, am in a call) [13:41:33] npo [13:41:36] *np [13:41:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:42:24] (03Abandoned) 10KartikMistry: Filter non-top-level sections during section title assignment [extensions/ContentTranslation] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180824 (https://phabricator.wikimedia.org/T387427) (owner: 10KartikMistry) [13:46:12] (03PS1) 10Muehlenhoff: Add an additional FIDO ecdsa-sk SSH key for me [puppet] - 10https://gerrit.wikimedia.org/r/1180869 [13:46:53] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [13:47:03] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [13:47:05] !log kartik@deploy1003 kartik: Continuing with sync [13:52:17] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180838|CX3 Build 1.0.0+20250821 (T387427)]] (duration: 19m 38s) [13:52:22] T387427: Section selector shows html markup in section title and such sections fails to load - https://phabricator.wikimedia.org/T387427 [13:53:25] (03CR) 10Urbanecm: [C:04-2] "the task is not clear whether we want 100% of new users or all users. 
i asked in https://wikimedia.slack.com/archives/G0101329ZC7/p1755784" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [13:55:04] All done. [13:56:39] (03CR) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to support using JSON-RPC API (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:57:03] (03CR) 10Slyngshede: [C:03+1] "Verified OK via Slack" [puppet] - 10https://gerrit.wikimedia.org/r/1180869 (owner: 10Muehlenhoff) [13:59:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [14:00:55] (03CR) 10Thiemo Kreuz (WMDE): "Please feel free to backport this here, in case it's still helpful in this branch. It's a valid fix for the immediate problem." [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [14:06:56] (03PS1) 10CDanis: haproxy: disable maxconn [puppet] - 10https://gerrit.wikimedia.org/r/1180873 (https://phabricator.wikimedia.org/T401695) [14:09:22] (03PS3) 10Cwhite: resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) [14:13:50] (03CR) 10Muehlenhoff: [C:03+2] Add an additional FIDO ecdsa-sk SSH key for me [puppet] - 10https://gerrit.wikimedia.org/r/1180869 (owner: 10Muehlenhoff) [14:14:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11106641 (10MatthewVernon) @Jclark-ctr I've hopefully highlighted the relevant drive with `sudo megacli -PDLocate -PhysDrv [32:3] -a0`, and it's ready for you to... 
[14:16:23] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[14:16:35] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11106653 (Ladsgroup) Sorry. Manuel is out. es2040 is easily doable. For when do you want it done?
[14:16:40] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11106654 (Jclark-ctr)
[14:17:57] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[14:18:45] (CR) Ssingh: [C:+1] "Worth a shot." [puppet] - https://gerrit.wikimedia.org/r/1180873 (https://phabricator.wikimedia.org/T401695) (owner: CDanis)
[14:20:24] (CR) CDanis: [C:+2] haproxy: disable maxconn [puppet] - https://gerrit.wikimedia.org/r/1180873 (https://phabricator.wikimedia.org/T401695) (owner: CDanis)
[14:28:52] (CR) Jforrester: [C:+1] php: remove deprecated ${} string interpolation [docker-images/production-images] - https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: Scott French)
[14:29:22] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11106748 (Ladsgroup) Talking to Papaul. I'm doing es2040 right now. Will do es2039 later.
[14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1430)
[14:30:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool es2040 T399927', diff saved to https://phabricator.wikimedia.org/P81661 and previous config saved to /var/cache/conftool/dbconfig/20250821-143039-ladsgroup.json
[14:30:44] T399927: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927
[14:33:40] !log bking@cumin1002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[14:33:42] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11106771 (Eevans) >>! In T402346#11106641, @MatthewVernon wrote: > [ ... ] > Showing my working: > `lshw -C disk` says `/dev/sdg` is `bus info: scsi@0:2.5.0`....
[14:33:42] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[14:34:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es2040.codfw.wmnet with reason: 10GB-fication
[14:40:53] (CR) Muehlenhoff: [C:+2] mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1037766 (owner: Muehlenhoff)
[14:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:42:09] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11106802 (Ladsgroup) @Papaul Shut down and ready for you.
[14:43:42] (03CR) 10Ssingh: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172056/15 sorry if I am mistaken, but we didn't carry over the changes from this pat" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:44:46] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [14:47:13] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [14:48:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:49:38] fceratto@cumin1002 sanitize-wiki (PID 1619964) is awaiting input [14:50:03] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [14:50:07] 10ops-codfw, 06DC-Ops: codfw netbox cable cleanup - https://phabricator.wikimedia.org/T402535 (10RobH) 03NEW p:05Triage→03Medium [14:50:11] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [14:51:38] 10ops-eqiad, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536 (10RobH) 03NEW p:05Triage→03Medium [14:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:51:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:52:04] (03PS1) 10Zabe: Do not bypass LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180884 (https://phabricator.wikimedia.org/T402494) [14:52:32] jouncebot: nowandnext [14:52:32] For the next 0 hour(s) and 7 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1430) [14:52:32] In 0 hour(s) and 7 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1500) [14:52:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:52:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:52:42] (03CR) 10Zabe: [C:03+2] Do not bypass LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180884 (https://phabricator.wikimedia.org/T402494) (owner: 10Zabe) [14:54:44] fceratto@cumin1002 sanitize-wiki (PID 1628231) is awaiting input [14:57:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:00:05] jnuche and jeena: Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1500). Please do the needful. 
[15:00:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T399249)', diff saved to https://phabricator.wikimedia.org/P81662 and previous config saved to /var/cache/conftool/dbconfig/20250821-150021-fceratto.json [15:00:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:03:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11106916 (10MatthewVernon) Yes, I think that's correct. [15:03:33] (03CR) 10Dreamy Jazz: [C:03+1] Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [15:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:48] fceratto@cumin1002 sanitize-wiki (PID 1628231) is awaiting input [15:05:51] !log joal@deploy1003 Started deploy [analytics/refinery@9fc3b38] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9fc3b380] [15:06:47] !log joal@deploy1003 Finished deploy [analytics/refinery@9fc3b38] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9fc3b380] (duration: 00m 55s) [15:07:28] !log joal@deploy1003 Started deploy [analytics/refinery@9fc3b38]: Regular 
analytics weekly train [analytics/refinery@9fc3b380] [15:08:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [15:10:50] (03Merged) 10jenkins-bot: Do not bypass LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180884 (https://phabricator.wikimedia.org/T402494) (owner: 10Zabe) [15:11:08] !log joal@deploy1003 Finished deploy [analytics/refinery@9fc3b38]: Regular analytics weekly train [analytics/refinery@9fc3b380] (duration: 03m 40s) [15:11:13] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [15:11:26] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Disable build_mw_next_container_image [puppet] - 10https://gerrit.wikimedia.org/r/1180887 (https://phabricator.wikimedia.org/T402508) [15:11:40] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180884|Do not bypass LinksMigration for categorylinks (T402494)]] [15:11:43] !log joal@deploy1003 Started deploy [analytics/refinery@9fc3b38] (thin): Regular analytics weekly train THIN [analytics/refinery@9fc3b380] [15:11:44] T402494: RelatedChanges doesn't show page creations after 19th August 2025 - https://phabricator.wikimedia.org/T402494 [15:12:39] 
!log joal@deploy1003 Finished deploy [analytics/refinery@9fc3b38] (thin): Regular analytics weekly train THIN [analytics/refinery@9fc3b380] (duration: 00m 56s) [15:12:57] (03Abandoned) 10Ahmon Dancy: scap: drop unused parameters from the configuration [puppet] - 10https://gerrit.wikimedia.org/r/810048 (owner: 10Giuseppe Lavagetto) [15:13:38] (03Abandoned) 10Ahmon Dancy: scap: Make wmflabs php7 behaviour match prod's [puppet] - 10https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242) (owner: 10Alex Monk) [15:14:40] (03PS1) 10Bking: cirrussearch: bring cirrussearch2089 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1180888 (https://phabricator.wikimedia.org/T399943) [15:14:51] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180884|Do not bypass LinksMigration for categorylinks (T402494)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:15:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P81663 and previous config saved to /var/cache/conftool/dbconfig/20250821-151528-fceratto.json [15:15:35] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#11106971 (10MoritzMuehlenhoff) [15:15:42] fceratto@cumin1002 sanitize-wiki (PID 1662459) is awaiting input [15:17:11] !log zabe@deploy1003 zabe: Continuing with sync [15:18:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180888 (https://phabricator.wikimedia.org/T399943) (owner: 10Bking) [15:20:43] (03PS1) 10Daimona Eaytoy: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180890 (https://phabricator.wikimedia.org/T397476) [15:21:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:59] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:27] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180884|Do not bypass LinksMigration for categorylinks (T402494)]] (duration: 10m 47s) [15:22:32] T402494: RelatedChanges doesn't show page creations after 19th August 2025 - https://phabricator.wikimedia.org/T402494 [15:22:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 6.596 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:23:38] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:34] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:43] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [15:25:47] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [15:26:01] (03PS4) 10Cathal Mooney: Nokia: Add examples for Nokia password hashes commonly used [homer/mock-private] - 
10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) [15:26:31] (03PS5) 10Cathal Mooney: Nokia: Add examples for Nokia password hashes commonly used [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) [15:29:09] (03CR) 10Cathal Mooney: Nokia: Add examples for Nokia password hashes commonly used (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [15:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:30:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P81664 and previous config saved to /var/cache/conftool/dbconfig/20250821-153036-fceratto.json [15:30:52] jouncebot: nowandnext [15:30:52] For the next 0 hour(s) and 29 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1500) [15:30:52] In 0 hour(s) and 29 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1600) [15:31:22] (03CR) 10Scott French: "Thanks, Ahmon!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1180887 (https://phabricator.wikimedia.org/T402508) (owner: 10Ahmon Dancy) [15:31:24] (03CR) 10Scott French: [C:03+2] scap.cfg.erb: Disable build_mw_next_container_image [puppet] - 10https://gerrit.wikimedia.org/r/1180887 (https://phabricator.wikimedia.org/T402508) (owner: 10Ahmon Dancy) [15:32:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:32:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:32:29] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [15:33:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:33:51] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:34:16] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [15:35:26] (03PS1) 10Zabe: Set categorylinks to read new on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180895 (https://phabricator.wikimedia.org/T397912) [15:45:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1238 (T399249)', diff saved to https://phabricator.wikimedia.org/P81666 and previous config saved to /var/cache/conftool/dbconfig/20250821-154543-fceratto.json [15:45:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:45:52] (03PS1) 10Kosta Harlan: hCaptcha: Fix topic name for frontend metrics [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180898 [15:45:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1241.eqiad.wmnet with reason: Maintenance [15:46:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T399249)', diff saved to https://phabricator.wikimedia.org/P81667 and previous config saved to /var/cache/conftool/dbconfig/20250821-154605-fceratto.json [15:50:00] (03CR) 10Cathal Mooney: [C:03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [15:50:31] jouncebot: nowandnext [15:50:31] For the next 0 hour(s) and 9 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1500) [15:50:31] In 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1600) [15:51:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180898 (owner: 10Kosta Harlan) [15:51:54] (03CR) 10Bking: [C:03+2] cirrussearch: bring cirrussearch2089 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1180888 (https://phabricator.wikimedia.org/T399943) (owner: 10Bking) [15:52:17] (03CR) 10CDobbins: [V:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:52:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:01] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:07] !log set cirrussearch2089 to active in netbox T399943 [15:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:11] T399943: Unresponsive management for cirrussearch2089.mgmt:22 - https://phabricator.wikimedia.org/T399943 [15:53:24] (03CR) 10Bking: [C:03+2] "self-merging to quiet down some alerts and so we don't forget." [puppet] - 10https://gerrit.wikimedia.org/r/1180888 (https://phabricator.wikimedia.org/T399943) (owner: 10Bking) [15:53:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 3.664 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.702 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:55:30] 06SRE, 10SRE-Access-Requests: Superset / LDAP access for aude - https://phabricator.wikimedia.org/T402022#11107337 (10Dzahn) pretty sure this is about being added to "analytics-privatedata-users". not sure about Kerberos or not. We could start with the group and see or we may need to reach out to analytics p... 
[15:56:49] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:01] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:05] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11107340 (10Papaul) [15:57:35] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11107341 (10Papaul) @Ladsgroup es2040 is done and update. Thank you [15:57:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:57:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.223 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:58:54] (03CR) 10Tiziano Fogli: [C:03+1] k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [15:59:26] (03CR) 10Tiziano Fogli: [C:03+1] resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [16:00:05] jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:05] (03PS1) 10MVernon: swift: remove 3 drained codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1180901 (https://phabricator.wikimedia.org/T354872) [16:02:14] (03Merged) 10jenkins-bot: hCaptcha: Fix topic name for frontend metrics [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180898 (owner: 10Kosta Harlan) [16:02:30] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1180898|hCaptcha: Fix topic name for frontend metrics]] [16:03:09] (03CR) 10Tiziano Fogli: "An option to reduce duplication and improve readability could be the use of YAML anchors, since I think only the expr will change between " [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [16:03:38] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:09] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Managing sanitization for wikis amwikimedia, cnwikimedia, donatewiki, gewikimedia, grwikimedia, hiwikimedia, idwikimedia, maiwikimedia, ngwikimedia, nostalgiawiki, punjabiwikimedia, romdwikimedia, rswikimedia, votewiki, wbwikimedia in section s5 [16:05:07] kostajh: do you have anything lined up for after your ongoing backport? if not, I might sneak in a deployment that picks up a new base image to clean up some log spam. [16:05:37] swfrench-wmf: no, I'll be finished after this is done syncing [16:05:53] kostajh: great, thanks! 
[16:06:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:55] (03CR) 10Federico Ceratto: [C:03+1] "I checked that the 3 hostnames flagged for draining in the yaml file related to codfw match the descriptions and the 2 related tasks where" [puppet] - 10https://gerrit.wikimedia.org/r/1180901 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [16:06:59] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1180898|hCaptcha: Fix topic name for frontend metrics]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:07:02] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:17] swfrench-wmf: I'd like to deploy a scap update before you run your part. [16:07:33] (03CR) 10Scott French: [V:03+2] "Thanks for the reviews, all!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: 10Scott French) [16:07:33] !log kharlan@deploy1003 kharlan: Continuing with sync [16:07:53] (03CR) 10BCornwall: [V:03+1 C:03+2] mediawiki: Remove unused wikidata.org vhost and fix beta redirect [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [16:08:02] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.748 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:08:13] dancy: sure! any concerns if the first build on the new scap version is a full build? 
[16:08:26] Nope [16:08:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1251.eqiad.wmnet with reason: Maintenance [16:08:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T399249)', diff saved to https://phabricator.wikimedia.org/P81668 and previous config saved to /var/cache/conftool/dbconfig/20250821-160838-fceratto.json [16:08:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:08:44] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11107455 (10Jclark-ctr) @cmooney @ayounsi It looks like there’s nothing I or Juniper can do unless the OS is updated. A reboot might clear the alarms, but there’s a chance they could return, per Ju... [16:08:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.802 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:09:25] dancy: great, just give me a heads-up when I should proceed :) [16:09:40] swfrench-wmf: Nevermind. My deployment can wait until later [16:10:05] (03PS1) 10Giuseppe Lavagetto: haproxy: move ua policy enforcement to the requestctl backends [puppet] - 10https://gerrit.wikimedia.org/r/1180902 [16:10:05] (03CR) 10MVernon: [C:03+2] swift: remove 3 drained codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1180901 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [16:10:33] dancy: sounds good. 
let me know if you happen to change your mind in the interim [16:10:38] Will do [16:11:02] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:17] (03CR) 10Scott French: [V:03+2 C:03+2] php: remove deprecated ${} string interpolation [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: 10Scott French) [16:11:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:47] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180898|hCaptcha: Fix topic name for frontend metrics]] (duration: 10m 17s) [16:12:59] swfrench-wmf: I'm done now [16:13:00] (03PS2) 10Giuseppe Lavagetto: haproxy: move ua policy enforcement to the requestctl backends [puppet] - 10https://gerrit.wikimedia.org/r/1180902 [16:13:07] kostajh: great, thanks! [16:15:51] (03PS3) 10Giuseppe Lavagetto: haproxy: move ua policy enforcement to the requestctl backends [puppet] - 10https://gerrit.wikimedia.org/r/1180902 [16:16:16] (03CR) 10Herron: [C:03+1] resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [16:16:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11107506 (10MatthewVernon) @VRiley-WMF I'm afraid not; the broken disk in ms-be1071 (T402346) is currently blocking the ring manager from making any chang... 
[16:16:25] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling both afterwards [16:16:29] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [16:16:34] (03CR) 10Dzahn: [C:03+2] phabricator: bump APCu shared memory size to 4096M [puppet] - 10https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) (owner: 10Brennen Bearnes) [16:18:38] FIRING: [6x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:18:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.010 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:56] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6680/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180902 (owner: 10Giuseppe Lavagetto) [16:19:14] off we go [16:19:25] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new PHP production images and drop unused metadata label - T402424 T401254 [16:19:31] T402424: PHP Deprecated: Using ${var} in strings is deprecated, use {$var} instead in /srv/monitoring/lib.php on line 99 - https://phabricator.wikimedia.org/T402424 [16:19:32] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [16:19:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.172 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:19:44] (03CR) 10Vgutierrez: [V:03+1 C:03+1] haproxy: move ua policy enforcement to the requestctl backends [puppet] - 10https://gerrit.wikimedia.org/r/1180902 (owner: 10Giuseppe Lavagetto) [16:20:36] (03PS1) 10RLazarus: aptrepo: Add envoy-future component for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1180904 (https://phabricator.wikimedia.org/T380211) [16:23:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2024.codfw.wmnet w/ force delete existing files, repooling both afterwards [16:23:59] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [16:24:02] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 57533.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:24:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 57557.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:24:26] PROBLEM - MariaDB Replica Lag: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 57558.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:24:28] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [16:24:33] FIRING: [10x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:44] 
FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:25:28] (03PS1) 10Reedy: Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() [extensions/wikihiero] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180905 (https://phabricator.wikimedia.org/T402370) [16:25:38] jouncebot: nowandnext [16:25:38] For the next 0 hour(s) and 34 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1600) [16:25:38] In 0 hour(s) and 34 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1700) [16:25:39] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1700) [16:26:02] (03CR) 10Dzahn: "I would say "nowadays" is actually "we used to" and this is like one of the last remnants that we never converted, fwiw." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180255 (owner: 10BCornwall) [16:27:05] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: move ua policy enforcement to the requestctl backends [puppet] - 10https://gerrit.wikimedia.org/r/1180902 (owner: 10Giuseppe Lavagetto) [16:27:45] (03CR) 10Dzahn: gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [16:29:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11107597 (10VRiley-WMF) No problem! Thank you for the update! [16:29:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2089-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:30:03] (03CR) 10Scott French: [C:03+1] aptrepo: Add envoy-future component for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1180904 (https://phabricator.wikimedia.org/T380211) (owner: 10RLazarus) [16:31:05] Reedy: I have a somewhat long-running deploy ongoing due to a base image change. ETA probably another 20-30m. [16:32:01] np :) [16:32:13] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1180904 (https://phabricator.wikimedia.org/T380211) (owner: 10RLazarus) [16:32:16] I would've thrown that patch out if there was nothing ongoing [16:37:44] (03PS1) 10Reedy: Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) [extensions/CampaignEvents] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180907 (https://phabricator.wikimedia.org/T402441) [16:42:09] !log swfrench@deploy1003 swfrench: Deployment to pick up new PHP production images and drop unused metadata label - T402424 T401254 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:42:18] T402424: PHP Deprecated: Using ${var} in strings is deprecated, use {$var} instead in /srv/monitoring/lib.php on line 99 - https://phabricator.wikimedia.org/T402424 [16:42:18] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [16:42:50] (03CR) 10Krinkle: [C:04-1] "In Beta with this patch applied, https://test.wikipedia.beta.wmcloud.org/ with a mobile user agent returns "404 Domain not served here" wh" [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:42:56] hi operations folks! 
[16:43:20] I've written a patch to change the attributes of CentralNotice's hide cookies: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1180165 [16:43:39] and I wanted to get more input before deployment [16:43:46] !log swfrench@deploy1003 swfrench: Continuing with sync [16:43:54] I /think/ it should be safe, and is moving in the direction we need to go [16:44:07] but since it could have such a wide impact I'd love to get more eyes on it [16:44:52] ejegg: usually if you're looking for code review, that happens before a patch is +2'd :-) [16:45:53] taavi: yeah, I should have let my team members know I wanted to shop it around some more before merging [16:45:57] (03CR) 10Krinkle: [C:04-1] "https://test.wikipedia.beta.wmcloud.org/wiki/Main_Page?Sdf" [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:46:13] but fortunately CentralNotice has a special deploy branch that doesn't hit the train automatically [16:47:34] although now I'm looking at that, I'm a bit surprised to see a direct call to `setcookie()`, I thought mediawiki's request/response handler classes had wrappers for that which generally should be used instead [16:47:58] ah, is that so? [16:48:17] https://gerrit.wikimedia.org/g/mediawiki/core/+/015c0678eae1906e6444b2dd2710e04283298a40/includes/Request/WebResponse.php#157 [16:50:27] oh, only since 1.22 :) [16:55:30] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new PHP production images and drop unused metadata label - T402424 T401254 (duration: 36m 37s) [16:55:36] T402424: PHP Deprecated: Using ${var} in strings is deprecated, use {$var} instead in /srv/monitoring/lib.php on line 99 - https://phabricator.wikimedia.org/T402424 [16:55:37] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [16:55:46] (03CR) 10BryanDavis: [C:04-1] "Breaking VCL compilation in beta. I am removing this set of cherry-picks there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:57:37] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: bring cirrussearch2089 back to production [puppet] - 10https://gerrit.wikimedia.org/r/1180888 (https://phabricator.wikimedia.org/T399943) (owner: 10Bking) [17:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1700) [17:00:44] (03CR) 10Stevemunene: [C:03+2] analytics: Refine remove systemd job [puppet] - 10https://gerrit.wikimedia.org/r/1180149 (https://phabricator.wikimedia.org/T392698) (owner: 10Aqu) [17:05:37] (03CR) 10Reedy: [C:03+2] Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() [extensions/wikihiero] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180905 (https://phabricator.wikimedia.org/T402370) (owner: 10Reedy) [17:05:38] (03CR) 10Reedy: [C:03+2] Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) [extensions/CampaignEvents] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180907 (https://phabricator.wikimedia.org/T402441) (owner: 10Reedy) [17:07:48] (03Merged) 10jenkins-bot: Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() [extensions/wikihiero] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180905 (https://phabricator.wikimedia.org/T402370) (owner: 10Reedy) [17:07:50] (03Merged) 10jenkins-bot: Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) [extensions/CampaignEvents] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180907 (https://phabricator.wikimedia.org/T402441) (owner: 10Reedy) [17:08:39] !log reedy@deploy1003 Started scap 
sync-world: Backport for [[gerrit:1180905|Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() (T402370)]], [[gerrit:1180907|Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) (T402441)]] [17:08:45] T402370: PHP Deprecated: Use of Wikimedia\Parsoid\Ext\ParsoidExtensionAPI::addModuleStyles was deprecated in Parsoid 0.20. [Called from WikiHiero\Hooks::sourceToDom] - https://phabricator.wikimedia.org/T402370 [17:08:45] T402441: PHP Warning: Undefined array key "EventMeetingCountry" - https://phabricator.wikimedia.org/T402441 [17:15:02] !log reedy@deploy1003 reedy: Backport for [[gerrit:1180905|Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() (T402370)]], [[gerrit:1180907|Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) (T402441)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:15:08] T402370: PHP Deprecated: Use of Wikimedia\Parsoid\Ext\ParsoidExtensionAPI::addModuleStyles was deprecated in Parsoid 0.20. [Called from WikiHiero\Hooks::sourceToDom] - https://phabricator.wikimedia.org/T402370 [17:15:08] T402441: PHP Warning: Undefined array key "EventMeetingCountry" - https://phabricator.wikimedia.org/T402441 [17:15:21] !log reedy@deploy1003 reedy: Continuing with sync [17:22:51] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180905|Replace use of deprecated ParsoidExtensionAPI::addModuleStyles() (T402370)]], [[gerrit:1180907|Avoid PHP notice in AbstractEventRegistrationSpecialPage (country field) (T402441)]] (duration: 14m 12s) [17:22:57] T402370: PHP Deprecated: Use of Wikimedia\Parsoid\Ext\ParsoidExtensionAPI::addModuleStyles was deprecated in Parsoid 0.20. 
[Called from WikiHiero\Hooks::sourceToDom] - https://phabricator.wikimedia.org/T402370 [17:22:57] T402441: PHP Warning: Undefined array key "EventMeetingCountry" - https://phabricator.wikimedia.org/T402441 [17:23:10] (03PS1) 10Andrew Bogott: openstack serverpackages: don't pin systemd [puppet] - 10https://gerrit.wikimedia.org/r/1180913 (https://phabricator.wikimedia.org/T247013) [17:23:17] that should fix a decent amount of logspam... [17:24:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399249)', diff saved to https://phabricator.wikimedia.org/P81670 and previous config saved to /var/cache/conftool/dbconfig/20250821-172448-fceratto.json [17:24:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:26:30] (03PS4) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [17:26:45] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [17:32:50] (03PS2) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [17:36:20] Thanks Reedy! 
[17:38:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180913 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [17:39:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P81671 and previous config saved to /var/cache/conftool/dbconfig/20250821-173955-fceratto.json [17:40:32] (03PS5) 10RLazarus: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:41:08] (03CR) 10Krinkle: varnish: Implement new direct routing for mobile views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [17:41:26] (03PS6) 10RLazarus: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:41:52] (03CR) 10CI reject: [V:04-1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:44:29] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-08-21-122456-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180916 [17:46:57] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-08-21-122456-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180916 (owner: 10BryanDavis) [17:48:20] (03CR) 10Umherirrender: "The backport is not needed, the JsonConfig change avoids the request, that was failing with this error." 
[core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [17:48:24] (03Abandoned) 10Umherirrender: libs: Handle null domain in Cookie::canServeDomain [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [17:48:52] (03PS7) 10RLazarus: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:49:10] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-08-21-122456-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180916 (owner: 10BryanDavis) [17:49:48] (03CR) 10Cathal Mooney: [C:03+1] "Thanks for this. I think it should be safe to merge any time now as I've added the required hiera key to the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/1180856 (https://phabricator.wikimedia.org/T402511) (owner: 10Ayounsi) [17:49:56] (03PS1) 10Vgutierrez: cache::haproxy: make log sequence numbers unique across ports [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) [17:52:29] (03PS2) 10Vgutierrez: cache::haproxy: make log sequence numbers unique across ports [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) [17:52:44] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6681/console" [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:53:04] (03PS2) 10Meno25: Update redirected link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180917 [17:53:05] (03CR) 10Meno25: "Minor change. Please review. 
Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180917 (owner: 10Meno25) [17:54:28] (03CR) 10Herron: [C:03+1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:54:50] (03CR) 10RLazarus: [V:03+1 C:03+2] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [17:55:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P81672 and previous config saved to /var/cache/conftool/dbconfig/20250821-175503-fceratto.json [17:57:18] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [18:00:04] jnuche and jeena: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T1800). 
[18:00:05] (03PS2) 10Krinkle: varnish: Merge m-dot and X-Subdomain block in cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) [18:00:05] (03PS5) 10Krinkle: varnish: Document mobile user agent regexen and mobile_redirect logic [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [18:00:05] (03PS3) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:00:05] (03PS1) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T402557) [18:01:25] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:01:54] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:02:06] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:02:29] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:02:37] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:03:05] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:05:14] (03PS3) 10Krinkle: varnish: Merge m-dot and X-Subdomain block in cluster_fe_recv_pre_purge [puppet] - 10https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) [18:05:14] (03PS6) 10Krinkle: varnish: Document mobile user agent regexen and mobile_redirect logic [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [18:05:14] (03PS2) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T402557) [18:05:15] (03PS4) 10Krinkle: varnish: 
Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:07:14] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180921 [18:07:48] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6683/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:08:19] (03CR) 10BCornwall: [C:03+2] varnish: Merge m-dot and X-Subdomain block in cluster_fe_recv_pre_purge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:08:38] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:10:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T399249)', diff saved to https://phabricator.wikimedia.org/P81673 and previous config saved to /var/cache/conftool/dbconfig/20250821-181010-fceratto.json [18:10:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:10:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [18:11:26] (03CR) 10BCornwall: [C:03+2] varnish: Document mobile user agent regexen and mobile_redirect logic [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:14:35] (03PS5) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API 
[software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) [18:18:12] !log Upgrading LibreNMS to v25.8.0 - T402263 [18:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:49] !log denisse@deploy1003 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.8.0 - T402263 [18:18:57] !log denisse@deploy1003 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.8.0 - T402263 (duration: 00m 08s) [18:19:20] (03CR) 10BCornwall: [V:03+2 C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180921 (owner: 10Ncmonitor) [18:20:43] !log dancy@deploy1003 Installing scap version "4.208.0" for 169 host(s) [18:22:49] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: Add Python modules to configure Nokia SR Linux switches - https://phabricator.wikimedia.org/T402577 (10cmooney) 03NEW p:05Triage→03Medium [18:22:51] !log denisse@deploy1003 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.8.0 - T402263 [18:22:59] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: Add Python modules to configure Nokia SR Linux switches - https://phabricator.wikimedia.org/T402577#11108196 (10cmooney) [18:23:10] !log denisse@deploy1003 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.8.0 - T402263 (duration: 00m 18s) [18:23:15] (03CR) 10Andrew Bogott: [C:03+2] puppetserver: check for rebase in puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis) [18:23:25] (03PS1) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [18:24:39] !log dancy@deploy1003 Installation of scap version "4.208.0" completed for 169 hosts [18:26:16] (03PS3) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 
10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T402557) [18:26:16] (03PS5) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:27:31] (03PS4) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) [18:27:33] (03PS6) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:29:49] (03CR) 10CI reject: [V:04-1] trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:30:10] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [18:34:05] (03PS2) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [18:34:56] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [18:36:26] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1042 [18:36:51] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1042 [18:38:34] (03PS3) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) [18:39:08] (03PS5) 10Krinkle: 
trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) [18:39:08] (03PS7) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:41:19] (03CR) 10CI reject: [V:04-1] trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:44:31] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1043 [18:44:56] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1043 [18:48:29] (03PS4) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) [18:48:42] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180931 [18:49:15] (03PS5) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) [18:49:25] (03PS6) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) [18:49:25] (03PS8) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [18:49:37] (03CR) 10Urbanecm: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [18:51:36] (03CR) 10CI reject: [V:04-1] trafficserver: Check 
x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [18:52:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:53:22] (03PS2) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [18:53:58] (03PS3) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [18:54:48] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1044 [18:55:04] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:55:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1044 [18:55:24] (03CR) 10Urbanecm: [C:04-1] "once the A/Cs were clarified, this is now reviewable. i noted an issue inline." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [18:55:44] !log bking@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch2113. [18:55:49] !log bking@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch2113. [18:56:02] !log bking@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch2091. 
[18:57:02] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180933 [18:57:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:59:44] (03PS4) 10Scott French: image-suggestion: cleanup unused refs to service listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [18:59:45] (03CR) 10Scott French: [C:03+1] "@eevans@wikimedia.org - I think we're good to move forward on this, as long as you're alright with @cparle@wikimedia.org's change to remov" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [18:59:56] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180934 [19:00:47] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1047 [19:01:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1047 [19:02:17] (03PS3) 10JHathaway: provision: always set NIC to EFI in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 (https://phabricator.wikimedia.org/T387577) [19:03:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:03:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.526 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:04:31] !log bking@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2091\.codfw\.wmnet [19:04:33] FIRING: [2x] 
SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:43] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180934 (owner: 10Ncmonitor) [19:04:49] !log bking@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2113\.codfw\.wmnet [19:08:50] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581 (10phaultfinder) 03NEW [19:11:08] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: supermicro [19:13:16] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:13:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582 (10phaultfinder) 03NEW [19:14:53] (03PS3) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [19:15:46] (03CR) 10Bernard Wang: Update vector search config with new wgVectorTypeahead (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 
10Bernard Wang) [19:15:51] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [19:16:37] (03PS4) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [19:17:06] (03PS7) 10Krinkle: trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) [19:17:06] (03PS9) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [19:17:31] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [19:18:05] PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 173 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [19:21:19] jhathaway@cumin1002 provision (PID 2045166) is awaiting input [19:21:47] (03CR) 10BCornwall: [C:03+2] trafficserver: Check x-dt-host in rb-mw-mangling.lua before using [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:21:58] (03PS4) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [19:23:15] (03PS5) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [19:23:28] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6685/console" [puppet] - 10https://gerrit.wikimedia.org/r/1180919 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [19:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:30:25] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 (10RLazarus) 03NEW [19:35:06] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:35:25] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:35:35] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180946 [19:35:40] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180947 [19:35:45] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180948 [19:36:05] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180946 (owner: 10Ncmonitor) [19:36:07] (03CR) 10BCornwall: [V:03+2 C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180946 (owner: 10Ncmonitor) [19:36:20] !log brett@dns1004 START - running authdns-update [19:36:35] FIRING: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - 
https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh [19:37:27] !log brett@dns1004 END - running authdns-update [19:38:07] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180947 (owner: 10Ncmonitor) [19:38:09] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:38:22] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180947 (owner: 10Ncmonitor) [19:38:23] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:38:24] (03CR) 10BCornwall: [V:03+2 C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180947 (owner: 10Ncmonitor) [19:39:32] (03CR) 10BCornwall: [V:03+2 C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180948 (owner: 10Ncmonitor) [19:39:40] (03PS6) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [19:41:02] (03PS7) 10Cathal Mooney: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) [19:41:46] (03PS1) 10RLazarus: Update to v1.26.8 and drop buster [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180949 (https://phabricator.wikimedia.org/T402584) [19:43:37] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy 
GRACEFUL_RESTART [19:44:16] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:47:38] jhathaway@cumin1002 provision (PID 2095261) is awaiting input [19:48:15] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:48:24] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: supermicro [19:48:35] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180951 [19:48:46] (03PS1) 10Dzahn: admin: add user aude to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) [19:49:05] (03CR) 10CI reject: [V:04-1] admin: add user aude to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) (owner: 10Dzahn) [19:49:10] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180951 (owner: 10Ncmonitor) [19:50:07] (03PS2) 10Dzahn: admin: add user aude to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) [19:50:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:52:07] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 (10cmooney) 03NEW p:05Triage→03Medium [19:52:21] (03PS1) 10Cathal Mooney: Add new Nokia switches to IBGP spine/leaf pod definitions in sites [homer/public] - 10https://gerrit.wikimedia.org/r/1180953 (https://phabricator.wikimedia.org/T402588) [19:52:26] (03CR) 10Dzahn: 
[C:03+1] "group approval should not be needed anymore per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#analytics-privatedata-" [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) (owner: 10Dzahn) [19:52:51] (03CR) 10Scott French: [C:03+1] Update to v1.26.8 and drop buster (031 comment) [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180949 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [19:53:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Superset / LDAP access for aude - https://phabricator.wikimedia.org/T402022#11108512 (10Dzahn) [19:55:08] (03PS2) 10Cathal Mooney: Add new IBGP cluster in eqiad with pod for row C/D Nokia switches [homer/public] - 10https://gerrit.wikimedia.org/r/1180953 (https://phabricator.wikimedia.org/T402588) [19:55:37] (03PS2) 10RLazarus: Update to v1.26.8 and drop buster [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180949 (https://phabricator.wikimedia.org/T402584) [19:56:18] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Superset / LDAP access for aude - https://phabricator.wikimedia.org/T402022#11108515 (10Dzahn) 05Open→03In progress uploaded a patch. Tagged with "Data Engineering" per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requ... 
[19:59:03] (03PS1) 10Krinkle: trafficserver: Fix confusion in rb-mw-mangling_test.lua cases [puppet] - 10https://gerrit.wikimedia.org/r/1180957 (https://phabricator.wikimedia.org/T401595) [19:59:33] (03CR) 10RLazarus: [C:03+2] Update to v1.26.8 and drop buster (031 comment) [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180949 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [20:00:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11108542 (10cmooney) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588#11108541 (10cmooney) [20:00:34] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:01:15] (03CR) 10Dzahn: "looks good to me, key matches, UID matches, has approval by known WMDE engineering manager. but may need NDA process. 
usually WMDE staff d" [puppet] - 10https://gerrit.wikimedia.org/r/1180851 (https://phabricator.wikimedia.org/T402384) (owner: 10Vgutierrez) [20:01:22] 06SRE, 06Infrastructure-Foundations, 10netops: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590 (10cmooney) 03NEW p:05Triage→03Medium [20:01:37] RECOVERY - MariaDB Replica Lag: s4 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:01:38] 06SRE, 06Infrastructure-Foundations, 10netops: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590#11108572 (10cmooney) [20:02:15] (03PS1) 10Cathal Mooney: Add new Nokia switches to ibgp pod e/f in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1180958 (https://phabricator.wikimedia.org/T402590) [20:02:44] (03CR) 10Dzahn: [V:03+1 C:03+1] "nevermind, also already on NDA spreadsheet. looks ready to merge to me" [puppet] - 10https://gerrit.wikimedia.org/r/1180851 (https://phabricator.wikimedia.org/T402384) (owner: 10Vgutierrez) [20:04:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11108600 (10Dzahn) 05Open→03In progress [20:05:44] jhathaway@cumin1002 reimage (PID 2125126) is awaiting input [20:07:22] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [20:10:32] (03CR) 10Aude: [C:03+1] "thanks for helping with this" [puppet] - 10https://gerrit.wikimedia.org/r/1180952 (https://phabricator.wikimedia.org/T402022) (owner: 10Dzahn) [20:13:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:35] jouncebot: nowandnext [20:15:35] For the next 0 hour(s) and 44 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2000) [20:15:35] In 0 hour(s) and 44 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2100) [20:15:50] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180895 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [20:16:38] (03Merged) 10jenkins-bot: Set categorylinks to read new on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180895 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [20:16:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.331 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:17:05] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180895|Set categorylinks to read new on enwiki (T397912)]] [20:17:09] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [20:19:02] !log lists1004 - sudo exim4 -qf - forced delivery attempt as reaction to alerting about large mail queue [20:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:52] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [20:19:57] (03PS2) 10Zabe: Stop writing to cl_to and cl_collation on large s7 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180609 (https://phabricator.wikimedia.org/T399579) [20:21:59] (03CR) 10Zabe: [C:03+1] Update redirected link [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/1180917 (owner: 10Meno25) [20:22:42] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180895|Set categorylinks to read new on enwiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:22:46] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [20:23:05] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [20:23:43] !log zabe@deploy1003 zabe: Continuing with sync [20:24:59] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.646 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:28:02] (03CR) 10Zabe: [C:03+2] Update redirected link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180917 (owner: 10Meno25) [20:28:53] (03Merged) 10jenkins-bot: Update redirected link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180917 (owner: 10Meno25) [20:29:02] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180895|Set categorylinks to read new on enwiki (T397912)]] (duration: 11m 58s) [20:29:07] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [20:29:48] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180917|Update redirected link]] [20:32:22] (03PS1) 10Krinkle: [WIP] varnish: Improve 
08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [20:32:29] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:32:30] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:32:34] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [20:33:44] (03CR) 10Cwhite: [C:03+2] resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [20:34:56] (03Merged) 10jenkins-bot: resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [20:35:22] !log zabe@deploy1003 zabe, meno25: Backport for [[gerrit:1180917|Update redirected link]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:35:24] (03Abandoned) 10Jforrester: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [20:35:45] !log zabe@deploy1003 zabe, meno25: Continuing with sync [20:37:35] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [20:38:05] RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [20:38:25] !log deleted a bunch of old bounce messages in the exim queue on lists1004 [20:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:29] (03CR) 10Zabe: [C:03+2] Stop writing to cl_to and cl_collation on large s7 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180609 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [20:39:16] (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on large s7 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180609 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [20:39:34] mutante: I wonder if keeping a registry of the addresses filing the list and eventually removing them would be a good idea. Some of those accounts may be inactive already. 
[20:39:45] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro [20:39:53] (03PS1) 10Reedy: CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180970 (https://phabricator.wikimedia.org/T382148) [20:40:54] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180917|Update redirected link]] (duration: 11m 06s) [20:41:08] (03PS2) 10Reedy: CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180970 (https://phabricator.wikimedia.org/T382148) [20:41:27] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180609|Stop writing to cl_to and cl_collation on large s7 and s8 wikis (T399579)]] [20:41:31] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [20:41:35] RESOLVED: MailmanBounceQueueHigh: Mailman bounce queue on lists1004:9100 has more than 50 messages - https://wikitech.wikimedia.org/wiki/Mailman/Runbooks#MailmanBounceQueueHigh - https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=2 - https://alerts.wikimedia.org/?q=alertname%3DMailmanBounceQueueHigh [20:41:50] jouncebot: nowandnext [20:41:50] For the next 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2000) [20:41:50] In 0 hour(s) and 18 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2100) [20:42:14] (03PS3) 10Reedy: CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180970 (https://phabricator.wikimedia.org/T382148) [20:42:19] (03CR) 10Reedy: [C:03+2] CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1180970 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [20:43:13] (03Merged) 10jenkins-bot: CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180970 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [20:43:48] (03PS9) 10BCornwall: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [20:46:44] (03PS38) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [20:47:11] (03CR) 10CI reject: [V:04-1] dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [20:47:13] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180609|Stop writing to cl_to and cl_collation on large s7 and s8 wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:47:18] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [20:47:49] !log zabe@deploy1003 zabe: Continuing with sync [20:48:28] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:48:33] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [20:49:16] (03PS1) 10Andrea Denisse: grafana: Disable dashboard sync for a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1180972 (https://phabricator.wikimedia.org/T402544) [20:49:16] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1180972/6687/" [puppet] - 10https://gerrit.wikimedia.org/r/1180972 (https://phabricator.wikimedia.org/T402544) (owner: 10Andrea Denisse) [20:50:37] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180973 [20:50:52] (03CR) 10Dzahn: [C:03+1] grafana: Disable dashboard sync for a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1180972 (https://phabricator.wikimedia.org/T402544) (owner: 10Andrea Denisse) [20:51:56] (03PS39) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [20:53:13] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180609|Stop writing to cl_to and cl_collation on large s7 and s8 wikis (T399579)]] (duration: 11m 46s) [20:53:18] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [20:53:27] Reedy: feel free to take over [20:56:14] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6689/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T2100) [21:03:33] (03CR) 10BCornwall: [C:03+1] "Beautiful" [puppet] - 10https://gerrit.wikimedia.org/r/1180957 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [21:05:21] (03CR) 10BCornwall: [C:03+2] trafficserver: Fix confusion in rb-mw-mangling_test.lua cases [puppet] - 10https://gerrit.wikimedia.org/r/1180957 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [21:09:32] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1180970|CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains (T382148)]] [21:09:37] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [21:12:22] 07Puppet, 10Beta-Cluster-Infrastructure: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#11108801 (10bd808) 05In progress→03Resolved [21:14:01] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host people1005.eqiad.wmnet [21:14:02] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [21:15:21] !log reedy@deploy1003 reedy: Backport for [[gerrit:1180970|CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains (T382148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:15:25] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [21:16:03] !log reedy@deploy1003 reedy: Continuing with sync [21:18:28] ohoho! 
[21:18:45] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:18:57] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1005.eqiad.wmnet - dzahn@cumin1002" [21:20:47] (03PS1) 10Dzahn: site: add people1005 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1180977 (https://phabricator.wikimedia.org/T402596) [21:21:02] (03PS2) 10Dzahn: site: add people1005 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1180977 (https://phabricator.wikimedia.org/T402596) [21:21:11] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180970|CommonSettings: Add hcaptcha.wikimedia.org to $wgCrossSiteAJAXdomains (T382148)]] (duration: 11m 39s) [21:21:17] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [21:21:18] (03CR) 10Dzahn: [C:03+2] site: add people1005 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1180977 (https://phabricator.wikimedia.org/T402596) (owner: 10Dzahn) [21:21:34] 10ops-esams, 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11108817 (10RobH) 05In progress→03Resolved After discussion within both Traffic and DC Ops we're going to resolve this with the fans just running faster. 
[21:21:50] !log ryankemper@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-internal-scholarly,name=eqiad [21:22:02] dzahn@cumin1002 makevm (PID 2239507) is awaiting input [21:22:34] !log T386098 Depooled eqiad `wdqs-internal-scholarly` in preparation for data transfer [21:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:39] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:22:54] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:23:08] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:23:16] ^ accidentally started outside a tmux, manually killed [21:23:34] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:25:20] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [21:25:33] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM people1005.eqiad.wmnet - dzahn@cumin1002" [21:25:33] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:34] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache people1005.eqiad.wmnet on all recursors [21:25:37] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) people1005.eqiad.wmnet 
on all recursors [21:25:48] !log bking@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on 55 hosts with reason: T395571 [21:25:53] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [21:25:56] (03CR) 10Cwhite: "Ack. I personally try to avoid yaml anchors, except in a few cases:" [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [21:26:07] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1005.eqiad.wmnet - dzahn@cumin1002" [21:26:12] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM people1005.eqiad.wmnet - dzahn@cumin1002" [21:26:13] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:26:14] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180973 (owner: 10Ncmonitor) [21:26:22] (03PS2) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [21:26:30] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:26:44] (03PS1) 10Cathal Mooney: Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) [21:26:51] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T400160 [21:26:56] T400160: Investigate eqiad cluster quorum failure issues - https://phabricator.wikimedia.org/T400160 [21:27:32] !log dzahn@cumin1002 START - 
Cookbook sre.hosts.reimage for host people1005.eqiad.wmnet with OS trixie [21:28:52] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:30:04] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:33:39] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:35:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2024.codfw.wmnet -> wdqs2027.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:35:34] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:37:33] (03CR) 10Andrew Bogott: [C:03+2] openstack serverpackages: don't pin systemd [puppet] - 10https://gerrit.wikimedia.org/r/1180913 (https://phabricator.wikimedia.org/T247013) (owner: 10Andrew Bogott) [21:38:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:38:01] (03PS3) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [21:38:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:38:29] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:38:38] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 8.972 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:41:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 1.426 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:06] (03CR) 10JHathaway: provision: always set NIC to EFI in UEFI mode (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 (https://phabricator.wikimedia.org/T387577) (owner: 10JHathaway) [21:46:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:46:11] (03CR) 10JHathaway: [C:03+2] provision: always set NIC to EFI in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 (https://phabricator.wikimedia.org/T387577) (owner: 10JHathaway) [21:48:26] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on people1005.eqiad.wmnet with reason: host reimage [21:48:40] PROBLEM - HTTPS non-canonical-redirect-28 on ncredir7003 is CRITICAL: SSL CRITICAL - failed to verify wikimedia.li against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, voyagewiki.com [21:48:40] wiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org 
https://wikitech.wikimedia.org/wiki/Ncredir [21:49:57] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [21:50:21] PROBLEM - HTTPS non-canonical-redirect-28 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to verify wikimedia.li against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, voyagewiki.com [21:50:21] wiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [21:51:17] (03CR) 10Andrea Denisse: [C:03+2] grafana: Disable dashboard sync for a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1180972 (https://phabricator.wikimedia.org/T402544) (owner: 10Andrea Denisse) [21:51:21] RECOVERY - HTTPS non-canonical-redirect-28 on ncredir1001 is OK: SSL OK - Certificate wikimedia.li valid until 2025-11-19 20:49:28 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [21:52:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:52:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.853 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:53:32] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on people1005.eqiad.wmnet with reason: host reimage [21:54:37] PROBLEM - HTTPS non-canonical-redirect-28 on ncredir1002 is CRITICAL: SSL CRITICAL - failed to verify wikimedia.li against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, 
*.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, voyagewiki.com [21:54:37] wiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [21:55:51] (03PS1) 10RLazarus: Rewrite checksum paths to filenames in get-envoy-release.sh [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180984 (https://phabricator.wikimedia.org/T402584) [21:56:05] PROBLEM - HTTPS non-canonical-redirect-28 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to verify wikimedia.li against wikipedia.com, *.en-wp.com, *.en-wp.org, *.mediawiki.com, *.voyagewiki.com, *.voyagewiki.org, *.wiikipedia.com, *.wikibook.com, *.wikibooks.com, *.wikiepdia.com, *.wikiepdia.org, *.wikiipedia.org, *.wikijunior.com, *.wikijunior.net, *.wikijunior.org, *.wikipedia.com, en-wp.com, en-wp.org, mediawiki.com, voyagewiki.com [21:56:05] wiki.org, wiikipedia.com, wikibook.com, wikibooks.com, wikiepdia.com, wikiepdia.org, wikiipedia.org, wikijunior.com, wikijunior.net, wikijunior.org https://wikitech.wikimedia.org/wiki/Ncredir [21:57:11] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS bookworm [21:57:38] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180985 [21:57:53] ack the above [22:00:11] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [22:00:14] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [22:00:40] RECOVERY - HTTPS non-canonical-redirect-28 on 
ncredir7003 is OK: SSL OK - Certificate wikimedia.li valid until 2025-11-19 20:49:28 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [22:01:07] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [22:01:10] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [22:01:38] RECOVERY - HTTPS non-canonical-redirect-28 on ncredir1002 is OK: SSL OK - Certificate wikimedia.li valid until 2025-11-19 20:49:28 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [22:03:06] RECOVERY - HTTPS non-canonical-redirect-28 on ncredir4002 is OK: SSL OK - Certificate wikimedia.li valid until 2025-11-19 20:49:28 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Ncredir [22:04:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host people1005.eqiad.wmnet with OS trixie [22:04:51] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people1005.eqiad.wmnet [22:05:31] (03CR) 10Scott French: [C:03+1] Rewrite checksum paths to filenames in get-envoy-release.sh [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180984 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:07:25] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/1180984 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:09:55] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180985 (owner: 10Ncmonitor) [22:10:18] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bookworm [22:13:41] (03PS1) 10Dzahn: site: add peopleweb role to people1005 [puppet] - 10https://gerrit.wikimedia.org/r/1180990 (https://phabricator.wikimedia.org/T402596) [22:13:52] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:13:57] (03CR) 10CI reject: [V:04-1] site: add peopleweb role to people1005 [puppet] - 10https://gerrit.wikimedia.org/r/1180990 (https://phabricator.wikimedia.org/T402596) (owner: 10Dzahn) [22:15:04] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:15:22] (03PS2) 10Dzahn: site: add peopleweb role to people1005 [puppet] - 10https://gerrit.wikimedia.org/r/1180990 (https://phabricator.wikimedia.org/T402596) [22:15:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1027.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:15:59] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:18:38] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on cirrussearch2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[22:19:13] !log reprepro -C component/envoy-future include bullseye-wikimedia envoyproxy_1.26.8-1_source.changes # T402584 [22:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:18] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [22:19:18] !log Upgrading to Grafana 12.1.1 - T402544 [22:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:28] !log Upgrading to Grafana 12.1.1 in grafana - T402544 [22:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:09] (03PS1) 10Andrea Denisse: grafana: Enable dashboard sync after a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1180992 [22:26:57] (03CR) 10Andrea Denisse: [C:03+2] grafana: Enable dashboard sync after a version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1180992 (owner: 10Andrea Denisse) [22:30:48] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11109033 (10Ladsgroup) Thanks! Replication has caught up. I'm repooling it now. [22:31:16] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool es2040* gradually with 4 steps - Work done [22:44:12] (03PS1) 10Dzahn: mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 [22:57:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:04:33] (03CR) 10Andrea Denisse: [C:03+1] "Gerrit shows a merge conflict, the patch probably needs rebasing but it LGTM, thank you!" 
[alerts] - https://gerrit.wikimedia.org/r/1179226 (owner: Cwhite)
[23:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:09:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T399249)', diff saved to https://phabricator.wikimedia.org/P81679 and previous config saved to /var/cache/conftool/dbconfig/20250821-230916-fceratto.json
[23:09:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[23:13:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11109152 (phaultfinder)
[23:16:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040* gradually with 4 steps - Work done
[23:18:08] (CR) Dzahn: [V:+1] "debugging why this stopped working on trixie even though we are on the same puppet version, we are on a newer ruby version though." [puppet] - https://gerrit.wikimedia.org/r/1180999 (owner: Dzahn)
[23:18:17] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:18:17] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:18:57] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11109180 (phaultfinder)
[23:19:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.399 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:19:17] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 9.559 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:23:19] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:23:19] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:24:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P81681 and previous config saved to /var/cache/conftool/dbconfig/20250821-232425-fceratto.json
[23:29:13] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:29:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:35:54] jhathaway@cumin1002 reimage (PID 2331907) is awaiting input
[23:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181005
[23:38:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181005 (owner: TrainBranchBot)
[23:39:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P81682 and previous config saved to /var/cache/conftool/dbconfig/20250821-233932-fceratto.json
[23:40:06] (PS1) RLazarus: envoy-future: Update to 1.26.8 and bump envoy-future.list to bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/1181006 (https://phabricator.wikimedia.org/T402584)
[23:43:48] !log ryankemper@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal-scholarly,name=eqiad
[23:46:02] (CR) Papaul: [C:+1] Add new Nokia switches to ibgp pod e/f in codfw [homer/public] - https://gerrit.wikimedia.org/r/1180958 (https://phabricator.wikimedia.org/T402590) (owner: Cathal Mooney)
[23:49:27] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:49:27] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:50:05] (CR) Scott French: [C:+1] envoy-future: Update to 1.26.8 and bump envoy-future.list to bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/1181006 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[23:50:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:50:19] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:51:31] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181005 (owner: TrainBranchBot)
[23:53:19] (CR) RLazarus: [C:+2] envoy-future: Update to 1.26.8 and bump envoy-future.list to bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/1181006 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[23:53:34] (CR) RLazarus: [V:+2 C:+2] envoy-future: Update to 1.26.8 and bump envoy-future.list to bullseye [docker-images/production-images] - https://gerrit.wikimedia.org/r/1181006 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[23:54:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T399249)', diff saved to https://phabricator.wikimedia.org/P81683 and previous config saved to /var/cache/conftool/dbconfig/20250821-235440-fceratto.json
[23:54:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[23:54:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[23:55:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T399249)', diff saved to https://phabricator.wikimedia.org/P81684 and previous config saved to /var/cache/conftool/dbconfig/20250821-235503-fceratto.json
[23:59:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:59:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring