[00:08:06] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1180249
[00:08:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1180249 (owner: TrainBranchBot)
[00:25:21] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11100742 (andrea.denisse) While investigating T402247 I left a smartctl test running for drive 3 (which is the one I suspect was failing due to the high number...
[00:26:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{dse-k8s-worker10[15-19].eqiad.wmnet} and (A:dse-k8s-master or A:dse-k8s-worker)
[00:29:10] SRE-swift-storage, Observability-Logging, SRE Observability (FY2025/2026-Q1): rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247#11100745 (andrea.denisse) >>! In T402247#11096685, @andrea.denisse wrote: >>>! In T402247#11096653, @andrea.denisse wrote: >> I th...
[00:31:31] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1180249 (owner: TrainBranchBot)
[00:33:36] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1180251
[00:46:05] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:46:43] (PS3) BCornwall: varnish: Fix tests that rely on User-Agent header [puppet] - https://gerrit.wikimedia.org/r/1180233 (https://phabricator.wikimedia.org/T400119)
[00:46:44] (CR) BCornwall: [V:+2] "When used with https://gitlab.wikimedia.org/repos/sre/varnish/-/merge_requests/11" [puppet] - https://gerrit.wikimedia.org/r/1180233 (https://phabricator.wikimedia.org/T400119) (owner: BCornwall)
[00:48:54] (CR) Cwhite: "From testing, it appears the reservation is excluded from node_filesystem_avail_bytes. IOW, ext4 still has space in its reservation for r" [alerts] - https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[00:50:18] (CR) BCornwall: [V:+2 C:+2] "NS records are proper and no dnssec enabled." [puppet] - https://gerrit.wikimedia.org/r/1180251 (owner: Ncmonitor)
[01:15:51] (CR) BCornwall: [V:+2 C:+1] "`" [puppet] - https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[01:16:43] (CR) BCornwall: [V:+2 C:+1] "`" [puppet] - https://gerrit.wikimedia.org/r/1180166 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[01:40:05] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:44:33] (PS1) BCornwall: ncredir: update vim modeline options for dat file [puppet] - https://gerrit.wikimedia.org/r/1180255
[01:48:06] (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6659/console" [puppet] - https://gerrit.wikimedia.org/r/1180255 (owner: BCornwall)
[02:01:19] (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6662/console" [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: Krinkle)
[02:16:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:21:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:21:49] :/
[02:29:39] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:30:21] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:31:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:31:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (195.200.68.151) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:32:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:36:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:27:34] jouncebot: nowandnext
[05:27:34] No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[05:27:34] In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0600)
[05:30:26] (CR) Kosta Harlan: "We can move this forward now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz)
[05:39:21] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:39:45] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:41:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:41:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:42:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:46:11] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11100913 (ayounsi) a:Papaul
[05:49:22] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz)
[05:50:13] (Merged) jenkins-bot: Enable hCaptcha on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz)
[05:50:57] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1168178|Enable hCaptcha on test2wiki (T382148)]]
[05:51:01] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148
[05:53:14] !log kharlan@deploy1003 dreamyjazz, kharlan: Backport for [[gerrit:1168178|Enable hCaptcha on test2wiki (T382148)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[05:57:33] !log kharlan@deploy1003 dreamyjazz, kharlan: Continuing with sync
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0600)
[06:02:45] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1168178|Enable hCaptcha on test2wiki (T382148)]] (duration: 11m 48s)
[06:02:49] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148
[06:07:11] (PS4) Ayounsi: Add sandbox vlan to routed ganeti [puppet] - https://gerrit.wikimedia.org/r/1180143
[06:08:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:13:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:27] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180143 (owner: Ayounsi)
[06:18:37] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:24:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T399249)', diff saved to https://phabricator.wikimedia.org/P81578 and previous config saved to /var/cache/conftool/dbconfig/20250820-062457-fceratto.json
[06:25:03] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:26:30] (PS1) Giuseppe Lavagetto: Introduce policies [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1180266
[06:27:07] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Introduce policies [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1180266 (owner: Giuseppe Lavagetto)
[06:27:26] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[06:27:41] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Introduce policies - oblivian@cumin1003"
[06:27:43] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce policies - oblivian@cumin1003
[06:27:53] (CR) Muehlenhoff: [C:+2] Failover idp.w.o [dns] - https://gerrit.wikimedia.org/r/1180122 (owner: Muehlenhoff)
[06:28:01] !log jmm@dns1004 START - running authdns-update
[06:28:28] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Introduce policies - oblivian@cumin1003
[06:28:29] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Introduce policies - oblivian@cumin1003"
[06:28:58] (CR) Giuseppe Lavagetto: [C:+2] hiddenparma: add policy file [puppet] - https://gerrit.wikimedia.org/r/1179971 (owner: Giuseppe Lavagetto)
[06:29:06] !log jmm@dns1004 END - running authdns-update
[06:31:57] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru sandbox to routed ganeti - ayounsi@cumin1003"
[06:32:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru sandbox to routed ganeti - ayounsi@cumin1003"
[06:32:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:33:08] (PS1) Giuseppe Lavagetto: profile::conftool::hiddenparma: fix file path [puppet] - https://gerrit.wikimedia.org/r/1180268
[06:33:24] (CR) Giuseppe Lavagetto: [V:+2 C:+2] profile::conftool::hiddenparma: fix file path [puppet] - https://gerrit.wikimedia.org/r/1180268 (owner: Giuseppe Lavagetto)
[06:40:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P81579 and previous config saved to /var/cache/conftool/dbconfig/20250820-064005-fceratto.json
[06:43:31] (CR) Muehlenhoff: [C:+2] durum: Enable bird component in magru [puppet] - https://gerrit.wikimedia.org/r/1179972 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:48:25] (PS2) Muehlenhoff: wmcs::novaproxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1072751
[06:55:07] (CR) Muehlenhoff: [C:+2] doh: Enable bird component in magru [puppet] - https://gerrit.wikimedia.org/r/1179981 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff)
[06:55:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P81580 and previous config saved to /var/cache/conftool/dbconfig/20250820-065513-fceratto.json
[07:00:04] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:19] I'm here.
[07:00:26] and will self deploy
[07:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:05:03] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan)
[07:06:01] (Merged) jenkins-bot: MinT: Add stream configuration and registration [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan)
[07:06:31] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]]
[07:06:37] T397600: MinT for Wiki Readers: pagevisit instrumentation for experiment - https://phabricator.wikimedia.org/T397600
[07:06:37] T397043: MinT for Readers: pre-experiment analytics setup - https://phabricator.wikimedia.org/T397043
[07:08:39] !log kartik@deploy1003 kartik, hueitan: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:10:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T399249)', diff saved to https://phabricator.wikimedia.org/P81581 and previous config saved to /var/cache/conftool/dbconfig/20250820-071020-fceratto.json
[07:10:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:10:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[07:10:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T399249)', diff saved to https://phabricator.wikimedia.org/P81582 and previous config saved to /var/cache/conftool/dbconfig/20250820-071043-fceratto.json
[07:12:38] !log kartik@deploy1003 kartik, hueitan: Continuing with sync
[07:15:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:17:52] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]] (duration: 11m 21s)
[07:17:58] T397600: MinT for Wiki Readers: pagevisit instrumentation for experiment - https://phabricator.wikimedia.org/T397600
[07:17:59] T397043: MinT for Readers: pre-experiment analytics setup - https://phabricator.wikimedia.org/T397043
[07:24:44] (PS10) Arnaudb: nftables: throttle debugging [puppet] - https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971)
[07:24:58] (CR) Arnaudb: [C:+1] gerrit: blocking Huawei cloud Singapore subnet [puppet] - https://gerrit.wikimedia.org/r/1180193 (owner: Dzahn)
[07:25:12] (CR) Arnaudb: [C:+1] gerrit: block another Huawei subnet for scraping [puppet] - https://gerrit.wikimedia.org/r/1180194 (owner: Dzahn)
[07:48:51] (CR) Filippo Giunchedi: [C:+1] wmcs::novaproxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1072751 (owner: Muehlenhoff)
[07:49:34] (CR) Tiziano Fogli: [C:+1] "ok, thank you" [alerts] - https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[07:50:56] (CR) Tiziano Fogli: [C:+1] "Acknowledged" [alerts] - https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[07:55:21] (CR) Hashar: "Oups sorry, thank you for the verification and fix!" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[08:00:05] jnuche and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T0800)
[08:00:11] morning, train will roll out in a short bit
[08:03:15] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180498 (https://phabricator.wikimedia.org/T396376)
[08:03:17] (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180498 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[08:04:10] (Merged) jenkins-bot: group1 to 1.45.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1180498 (https://phabricator.wikimedia.org/T396376) (owner: TrainBranchBot)
[08:05:30] (CR) Clément Goubert: [C:+1] hcaptcha: Unset Referer header [puppet] - https://gerrit.wikimedia.org/r/1180204 (https://phabricator.wikimedia.org/T397841) (owner: Kosta Harlan)
[08:09:09] SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11101054 (dang) @Vgutierrez Yes you can re-use it, it's better that way, and that's my account anw
[08:10:12] (CR) Filippo Giunchedi: [C:+2] openstack: acquire cfssl certs for libvirt [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[08:10:37] SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11101057 (Vgutierrez) >>! In T402191#11101054, @dang wrote: > @Vgutierrez Yes you can re-use it, it's better that way, and that's my account anw great,...
[08:11:08] sre-alert-triage, Data-Platform-SRE, Wikidata, Wikidata-Omega, Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11101058 (SuzanneWood-WMDE)
[08:11:46] SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11101059 (dang) Please use the existing one :) Thanks a bunch! I don't remember that I created extra accounts, any chance I can delete them?
[08:13:32] SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11101060 (Vgutierrez)
[08:15:50] sre-alert-triage, Data-Platform-SRE, Wikidata, Wikidata-Omega, Wikidata-Query-Service: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11101062 (SuzanneWood-WMDE) @tappof - could you please...
[08:18:31] (PS1) Vgutierrez: admin: Add dang to analytics-(wmde|privatedata)-users [puppet] - https://gerrit.wikimedia.org/r/1180499 (https://phabricator.wikimedia.org/T402191)
[08:18:31] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.15 refs T396376
[08:18:36] T396376: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376
[08:19:07] (CR) Clément Goubert: [C:+2] hcaptcha: Unset Referer header [puppet] - https://gerrit.wikimedia.org/r/1180204 (https://phabricator.wikimedia.org/T397841) (owner: Kosta Harlan)
[08:19:34] (PS1) Giuseppe Lavagetto: cache: move banning of requests with no UA to haproxy [puppet] - https://gerrit.wikimedia.org/r/1180500
[08:21:34] (PS2) Tiziano Fogli: prometheus::alert::rule: use title to deduplicate resources [puppet] - https://gerrit.wikimedia.org/r/1178883 (https://phabricator.wikimedia.org/T381665)
[08:21:35] (PS13) Tiziano Fogli: nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446)
[08:21:35] (PS1) Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_ferm_active (testing) [puppet] - https://gerrit.wikimedia.org/r/1180501 (https://phabricator.wikimedia.org/T395446)
[08:21:38] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6664/co" [puppet] - https://gerrit.wikimedia.org/r/1180500 (owner: Giuseppe Lavagetto)
[08:21:48] (PS1) David Caro: aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869)
[08:22:10] (PS2) David Caro: aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869)
[08:22:46] (CR) CI reject: [V:-1] aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869) (owner: David Caro)
[08:22:46] SRE, Infrastructure-Foundations, Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11101087 (fnegri) @Dzahn fine with me, but if there's an easy way to keep e.g. a 5-minute cache it could be nice to have. I'll let @Joe have the final word.
[08:22:50] (CR) Tiziano Fogli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180501 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli)
[08:26:48] (CR) Tiziano Fogli: [C:+2] prometheus::alert::rule: use title to deduplicate resources [puppet] - https://gerrit.wikimedia.org/r/1178883 (https://phabricator.wikimedia.org/T381665) (owner: Tiziano Fogli)
[08:35:35] (CR) Tiziano Fogli: [C:+2] nrpe wrapper: enable nrpe2nodexp for check_ferm_active (testing) [puppet] - https://gerrit.wikimedia.org/r/1180501 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli)
[08:36:05] (CR) Tiziano Fogli: [C:+2] nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli)
[08:37:15] SRE-OnFire, Cite, VisualEditor, WMDE-TechWish-Maintenance, and 4 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11101141 (thiemowmde)
[08:38:21] (PS3) David Caro: aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869)
[08:43:58] (CR) Vgutierrez: cache: move banning of requests with no UA to haproxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1180500 (owner: Giuseppe Lavagetto)
[08:47:03] (PS1) DCausse: ml-services: stop using weighted_tags.rc0 stream [deployment-charts] - https://gerrit.wikimedia.org/r/1180506 (https://phabricator.wikimedia.org/T375821)
[08:47:36] (CR) Vgutierrez: "varnish implementation blocked the U-A header value `-` and the haproxy implementation will allow that one" [puppet] - https://gerrit.wikimedia.org/r/1180500 (owner: Giuseppe Lavagetto)
[08:50:59] SRE-SLO, Citoid, VisualEditor, Editing-team (Kanban Board), Patch-For-Review: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11101165 (Mvolz) It turns out mw.api actually already sends this h...
[08:53:21] (CR) FNegri: [C:+1] aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869) (owner: David Caro)
[08:54:39] (CR) Arnaudb: [C:+2] nftables: throttle debugging [puppet] - https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971) (owner: Arnaudb)
[08:59:20] (CR) Majavah: [C:+1] "see inline, but this can be merged as is too" [puppet] - https://gerrit.wikimedia.org/r/1072751 (owner: Muehlenhoff)
[09:00:06] (CR) Kevin Bazira: [C:+1] "Thank you for updating this David. Once it is merged, I can help to deploy the change to LiftWing." [deployment-charts] - https://gerrit.wikimedia.org/r/1180506 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[09:01:08] (CR) Majavah: [C:+1] "This seems fine, but I also wonder whether this should be usable for non-pontoon users, either in `profile::ldap::client::labs` (which nee" [puppet] - https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: Filippo Giunchedi)
[09:03:21] (PS5) Ayounsi: Add sandbox vlan to routed ganeti [puppet] - https://gerrit.wikimedia.org/r/1180143 (https://phabricator.wikimedia.org/T402372)
[09:03:30] (CR) Ayounsi: Add sandbox vlan to routed ganeti (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1180143 (https://phabricator.wikimedia.org/T402372) (owner: Ayounsi)
[09:03:58] (PS2) Giuseppe Lavagetto: cache: move banning of requests with no UA to haproxy [puppet] - https://gerrit.wikimedia.org/r/1180500
[09:03:59] (PS1) Giuseppe Lavagetto: varnishtests: ux improvements [puppet] - https://gerrit.wikimedia.org/r/1180507
[09:06:49] (CR) Muehlenhoff: wmcs::novaproxy: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072751 (owner: Muehlenhoff)
[09:08:13] (CR) Vgutierrez: [C:+1] varnishtests: ux improvements (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1180507 (owner: Giuseppe Lavagetto)
[09:08:19] (PS2) Ayounsi: Add magru sandbox prefixes to routed ganeti ranges [homer/public] - https://gerrit.wikimedia.org/r/1180150 (https://phabricator.wikimedia.org/T402372)
[09:10:55] (CR) Majavah: [C:+1] wmcs::novaproxy: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1072751 (owner: Muehlenhoff)
[09:11:52] !log derick@deploy1003 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=foundationwiki --logwiki=metawiki Selfkilla666 Cowsheepcool # T402364
[09:11:56] T402364: Unblock stuck global rename of Cowsheepcool - https://phabricator.wikimedia.org/T402364
[09:13:09] (CR) David Caro: [C:+2] aptrepo: add k8s 1.30 packages and remove unused 1.28 [puppet] - https://gerrit.wikimedia.org/r/1180502 (https://phabricator.wikimedia.org/T362869) (owner: David Caro)
[09:14:26] (PS1) Brouberol: yarn: allow the analytics-ml user to send jobs in the production queue [puppet] - https://gerrit.wikimedia.org/r/1180509 (https://phabricator.wikimedia.org/T400902)
[09:15:39] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11101301 (fnegri) Resolved→Open @Jclark-ctr the new drive does not show up in `lsblk`, I tried rebooting but that didn'...
[09:15:52] (CR) Joal: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1180509 (https://phabricator.wikimedia.org/T400902) (owner: Brouberol)
[09:16:18] (CR) Ozge: [C:+1] yarn: allow the analytics-ml user to send jobs in the production queue [puppet] - https://gerrit.wikimedia.org/r/1180509 (https://phabricator.wikimedia.org/T400902) (owner: Brouberol)
[09:18:43] (PS25) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:18:56] (PS1) Filippo Giunchedi: openstack: disable cfssl certs in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1180510 (https://phabricator.wikimedia.org/T355145)
[09:20:57] (CR) CI reject: [V:-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:24:04] (PS26) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:24:05] (CR) Brouberol: [C:+2] yarn: allow the analytics-ml user to send jobs in the production queue [puppet] - https://gerrit.wikimedia.org/r/1180509 (https://phabricator.wikimedia.org/T400902) (owner: Brouberol)
[09:24:26] (CR) David Caro: [C:+2] openstack: disable cfssl certs in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1180510 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[09:24:31] (CR) David Caro: [C:+2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1180510 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[09:24:49] (CR) David Caro: [C:+1] openstack: disable cfssl certs in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1180510 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[09:25:01] (PS1) Mvolz: Whitelist api-user-agent header for logs [deployment-charts] - https://gerrit.wikimedia.org/r/1180511 (https://phabricator.wikimedia.org/T345627)
[09:26:09] (CR) CI reject: [V:-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede)
[09:26:21] (CR) Filippo Giunchedi: [C:+2] openstack: disable cfssl certs in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1180510 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[09:26:23] (PS2) Mvolz: Whitelist api-user-agent header for logs [deployment-charts] - https://gerrit.wikimedia.org/r/1180511 (https://phabricator.wikimedia.org/T345627)
[09:27:29] (PS1) Stevemunene: dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role [puppet] - https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301)
[09:28:44] (PS27) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161)
[09:35:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:36:22] (CR) Alexandros Kosiaris: [C:+2] profile::docker::firewall: Remove unused profile [puppet] - https://gerrit.wikimedia.org/r/1114661 (owner: Muehlenhoff)
[09:36:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:44:23] !log Running `/usr/local/bin/foreachwikiindblist mediamoderation-continuous-scan.dblist extensions/MediaModeration/maintenance/importExistingFilesToScanTable.php --force --start-timestamp "20230701010101" --batch-size "5000"`
[09:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:40] (PS1) Tiziano Fogli: Revert "nrpe wrapper: define Prometheus alerts via Puppet resources" [puppet] - https://gerrit.wikimedia.org/r/1180520
[09:48:14] (CR) Majavah: [C:+1] Revert "nrpe wrapper: define Prometheus alerts via Puppet resources" [puppet] - https://gerrit.wikimedia.org/r/1180520 (owner: Tiziano Fogli)
[09:50:34] (CR) Tiziano Fogli: [C:+2] Revert "nrpe wrapper: define Prometheus alerts via Puppet resources" [puppet] - https://gerrit.wikimedia.org/r/1180520 (owner: Tiziano Fogli)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1000)
[10:00:52] (PS2) Stevemunene: dse-k8s: Add dse-k8s-codfw to service list [puppet] - https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298)
[10:14:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service -
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:19:23] (03CR) 10Stevemunene: [C:03+2] dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [10:19:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:28:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11101515 (10BTullis) [10:31:57] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1009.eqiad.wmnet [10:32:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11101604 (10BTullis) [10:32:49] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1010.eqiad.wmnet [10:33:01] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1011.eqiad.wmnet [10:33:36] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11101639 (10BTullis) [10:34:03] (03PS1) 10Seanleong-wmde: Set Alias entity usage modifier limit to 10. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) [10:35:51] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11101698 (10BTullis) [10:37:54] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1012.eqiad.wmnet [10:38:00] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1013.eqiad.wmnet [10:38:06] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1015.eqiad.wmnet [10:38:11] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1016.eqiad.wmnet [10:38:17] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1017.eqiad.wmnet [10:38:26] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1018.eqiad.wmnet [10:38:33] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker1019.eqiad.wmnet [10:48:28] (03PS2) 10Stevemunene: dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301) [10:50:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11101783 (10Jclark-ctr) @cmooney Would you be able to assist with setting up eth1 link? servers are already imaged on eth0. I believe this ticket will be... 
[10:54:20] (03PS1) 10D3r1ck01: libs: Handle null domain in Cookie::canServeDomain [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) [10:57:16] (03PS1) 10Stevemunene: Replace an-druid100[1-2] [alerts] - 10https://gerrit.wikimedia.org/r/1180526 (https://phabricator.wikimedia.org/T401116) [11:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1100). [11:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:09:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384 (10Dima_Koushha_WMDE) 03NEW [11:09:19] (03CR) 10Thiemo Kreuz (WMDE): libs: Handle null domain in Cookie::canServeDomain (032 comments) [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [11:10:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11101852 (10cmooney) @Jclark-ctr that is done now and all four host's second ports are connected and running at 25G now. @Andrew you can proceed to set th... 
[11:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:10:50] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Record api-user-agent in metrics; filter by MediaWikiJs - https://phabricator.wikimedia.org/T402385 (10Mvolz) 03NEW [11:10:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11101863 (10Jclark-ctr) 05Open→03Resolved [11:11:18] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 5 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11101868 (10Bugreporter) >>! In T400881#11100318, @B... [11:11:24] jouncebot: now [11:11:24] For the next 0 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1100) [11:12:17] (03PS1) 10Brouberol: yarn: grant analytics-ml the acls required to send jobs in the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1180528 (https://phabricator.wikimedia.org/T400902) [11:12:23] !log jnuche@deploy1003 Installing scap version "4.207.0" for 4 host(s) [11:12:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11101871 (10Jclark-ctr) 05Open→03Resolved [11:13:34] (03CR) 10Brouberol: [C:03+1] "I'd probably have submitted a stack of patches, adding a node at a time. 
If you want to control that with puppet agent --disable, I won't " [puppet] - 10https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [11:15:11] !log jnuche@deploy1003 Installation of scap version "4.207.0" completed for 4 hosts [11:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:16:27] 07sre-alert-triage, 10Wikidata, 06Wikidata-Omega, 10Wikidata-Query-Service, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402292#11101875 (10BTullis) p:05Triag... [11:20:34] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11101896 (10Jclark-ctr) ` Hi John. The system reboot could temporarily clear the alarms, but chances are that these alarms will reactivate in a couple of days or less. Kind regards Esteban Moral... 
[11:22:50] (03PS1) 10Stevemunene: Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) [11:27:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11101921 (10Jclark-ctr) Removed new drive it was causing errors [11:30:48] (03PS2) 10Ladsgroup: tables-catalog: Catalog BounceHandler and LoginNotify tables [puppet] - 10https://gerrit.wikimedia.org/r/1180247 (https://phabricator.wikimedia.org/T399302) [11:30:55] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Catalog BounceHandler and LoginNotify tables [puppet] - 10https://gerrit.wikimedia.org/r/1180247 (https://phabricator.wikimedia.org/T399302) (owner: 10Ladsgroup) [11:31:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346#11101939 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty we do have spare 8tb drives onhand at Eqiad @MatthewVernon Message when you return and... 
[11:33:06] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [11:33:26] (03PS5) 10Ladsgroup: configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) [11:33:32] (03CR) 10Ladsgroup: [V:03+2 C:03+2] configmaster: Expose tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1180134 (https://phabricator.wikimedia.org/T398943) (owner: 10Ladsgroup) [11:35:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11101958 (10karapayneWMDE) As an EM for Wikidata at Wikimedia Deutschland, I approve this request [11:36:18] (03PS1) 10Tchanders: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) [11:42:39] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178619 (owner: 10Dzahn) [11:42:47] (03PS2) 10Tchanders: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) [11:45:44] (03CR) 10Tchanders: "I generated this patch by:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [11:51:32] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6666/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:52:28] (03CR) 10Ayounsi: [C:03+2] Add sandbox vlan to routed ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1180143 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [11:52:59] (03CR) 10Ozge: [C:03+1] yarn: grant analytics-ml the acls required to send jobs in the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1180528 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [11:53:20] (03CR) 10Brouberol: [C:03+2] yarn: grant analytics-ml the acls required to send jobs in the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1180528 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [11:57:23] (03PS1) 10Filippo Giunchedi: openstack: install intermediate CA cert for libvirtd [puppet] - 10https://gerrit.wikimedia.org/r/1180534 (https://phabricator.wikimedia.org/T355145) [11:57:25] (03PS1) 10Filippo Giunchedi: openstack: clarify libvirtd debug levels [puppet] - 10https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) [11:58:53] (03PS1) 10Clément Goubert: ipoid: Raise memory limit for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180536 (https://phabricator.wikimedia.org/T402388) [11:59:43] (03PS2) 10Filippo Giunchedi: 
openstack: install intermediate CA cert for libvirtd [puppet] - 10https://gerrit.wikimedia.org/r/1180534 (https://phabricator.wikimedia.org/T355145) [11:59:43] (03PS2) 10Filippo Giunchedi: openstack: clarify libvirtd debug levels [puppet] - 10https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) [12:00:23] (03CR) 10Ayounsi: [C:03+2] Add magru sandbox prefixes to routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1180150 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [12:00:23] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:00:31] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180534 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:01:04] (03Merged) 10jenkins-bot: Add magru sandbox prefixes to routed ganeti ranges [homer/public] - 10https://gerrit.wikimedia.org/r/1180150 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [12:01:04] (03CR) 10Slyngshede: [V:03+1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:02:13] (03CR) 10Slyngshede: [V:03+1] "I've avoided hooking up the timer for now. That will allow me to test, but without having an alert about a broken timer." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:13:38] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180534 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:14:25] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: install intermediate CA cert for libvirtd [puppet] - 10https://gerrit.wikimedia.org/r/1180534 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:15:33] (03PS1) 10Jaime Nuche: Omit empty username in JCApiUtils::initApiRequestObj [extensions/JsonConfig] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180540 (https://phabricator.wikimedia.org/T402273) [12:15:43] (03CR) 10Clément Goubert: [C:03+2] ipoid: Raise memory limit for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180536 (https://phabricator.wikimedia.org/T402388) (owner: 10Clément Goubert) [12:19:16] (03CR) 10Tchanders: "Since tempaccounts.dblist now contains 95% of wikis, it might be better to define a list of wikis that do not have temp accounts enabled, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [12:23:39] (03Merged) 10jenkins-bot: ipoid: Raise memory limit for cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180536 (https://phabricator.wikimedia.org/T402388) (owner: 10Clément Goubert) [12:24:13] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:24:32] (03PS3) 10Stevemunene: dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301) [12:26:12] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:26:21] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:26:50] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:26:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180540 (https://phabricator.wikimedia.org/T402273) (owner: 10Jaime Nuche) [12:27:58] (03Merged) 10jenkins-bot: Omit empty username in JCApiUtils::initApiRequestObj [extensions/JsonConfig] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180540 (https://phabricator.wikimedia.org/T402273) (owner: 10Jaime Nuche) [12:28:23] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:28:31] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1180540|Omit empty username in JCApiUtils::initApiRequestObj (T402273)]] [12:28:35] T402273: PHP Deprecated: str_starts_with(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T402273 [12:30:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:30:44] !log jnuche@deploy1003 jnuche: Backport for [[gerrit:1180540|Omit empty username in JCApiUtils::initApiRequestObj (T402273)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:31:12] !log jnuche@deploy1003 jnuche: Continuing with sync [12:31:23] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 27.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:35:00] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [12:35:11] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [12:35:40] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/ipoid: apply [12:36:02] (03CR) 10Brouberol: [C:03+1] dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [12:36:17] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [12:36:26] !log jnuche@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180540|Omit empty username in JCApiUtils::initApiRequestObj (T402273)]] (duration: 07m 55s) [12:36:31] T402273: PHP Deprecated: str_starts_with(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T402273 [12:36:42] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/ipoid: apply [12:37:38] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host atlas7001.wikimedia.org [12:37:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:38:01] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [12:41:36] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas7001.wikimedia.org - ayounsi@cumin1003" [12:41:40] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas7001.wikimedia.org - ayounsi@cumin1003" [12:41:40] 
!log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:40] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache atlas7001.wikimedia.org on all recursors [12:41:43] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas7001.wikimedia.org on all recursors [12:42:15] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas7001.wikimedia.org - ayounsi@cumin1003" [12:42:19] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas7001.wikimedia.org - ayounsi@cumin1003" [12:42:19] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas7001.wikimedia.org [12:45:03] !log Restarting ipoid-daily-update job - T402388 [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] T402388: Increase RAM available to ipoid's updater container - https://phabricator.wikimedia.org/T402388 [12:45:35] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11102081 (10Jclark-ctr) |Device A|Device A Port|Device B|Device B Port|Type|cableID|Length required| |----------|-----------------|----------|----------|-------|-----|---------... 
[12:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:48:07] (03PS3) 10Tchanders: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) [12:48:34] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11102083 (10Jclark-ctr) @cmooney @ayounsi I have pre-ran all the cables for the Spine in D1 and am waiting for the optics. The 1m fiber will have its cable ID filled in once it... [12:51:37] (03CR) 10Thiemo Kreuz (WMDE): libs: Handle null domain in Cookie::canServeDomain (032 comments) [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [12:54:11] (03PS1) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T371088) [12:55:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11102113 (10Jclark-ctr) a:05Marostegui→03VRiley-WMF [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:01:24] nothing to deploy :) [13:01:32] (which is good because I’m in a meeting ^^) [13:03:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:06:14] (03PS1) 10Cathal Mooney: wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T371088) [13:07:35] (03PS2) 10Cathal Mooney: wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T371088) [13:08:59] (03PS3) 10Cathal Mooney: wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T371088) [13:10:19] (03CR) 10CI reject: [V:04-1] wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:11:46] (03PS4) 10Cathal Mooney: wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T371088) [13:11:48] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:13:31] (03PS1) 10Filippo Giunchedi: openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) [13:14:05] (03CR) 10CI reject: [V:04-1] openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [13:14:53] (03CR) 10Filippo Giunchedi: 
[C:03+2] "Good point re: making the feature more generic. It is easy to move when/if the time comes!" [puppet] - 10https://gerrit.wikimedia.org/r/1180089 (https://phabricator.wikimedia.org/T402261) (owner: 10Filippo Giunchedi) [13:16:00] (03PS1) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to support using JSON-RPC API [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 [13:16:46] (03PS2) 10Filippo Giunchedi: openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) [13:17:19] (03CR) 10CI reject: [V:04-1] openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [13:18:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:18:48] ah [13:18:50] here it comes [13:18:51] !incidents [13:18:51] 6621 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:18:51] 6619 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:18:55] !ack 6621 [13:18:55] 6621 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:19:06] * hnowlan here [13:19:52] * claime here [13:20:34] <_joe_> what is going on? 
graphs look horrible, but only CDN ones [13:21:01] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11102214 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:21:09] * jhathaway here [13:23:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [13:24:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11102219 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:26:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402221#11102221 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:26:33] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11102223 (10Jhancock.wm) [13:26:35] (03PS1) 10Brouberol: stat: deploy an analytics-ml keytab on each host [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) [13:27:19] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099#11102226 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:30:45] (03PS1) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T371088) [13:32:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:32:20] (03PS3) 10Slyngshede: cache: move banning of requests with no UA to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [13:32:50] (03CR) 10Slyngshede: cache: move banning of requests with no UA to haproxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [13:34:42] (03PS1) 10Brouberol: Add dummy analytics-ml.keytab files to the stat hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1180563 [13:35:24] (03CR) 10Brouberol: [C:03+2] Add dummy analytics-ml.keytab files to the stat hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1180563 (owner: 10Brouberol) [13:35:26] (03CR) 10Brouberol: [V:03+2 C:03+2] Add dummy analytics-ml.keytab files to the stat hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1180563 (owner: 10Brouberol) [13:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:36:00] (03PS3) 10Filippo Giunchedi: openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) [13:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:37:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6668/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [13:37:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6669/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [13:38:43] (03CR) 10Filippo Giunchedi: "To be paired with I5f367d4b99" [puppet] - 10https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [13:40:27] (03PS2) 10Krinkle: varnish: Write docs for some mobile user agent regexen [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [13:41:55] (03PS2) 10Brouberol: stat: deploy an analytics-ml keytab on each host [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) [13:45:28] (03CR) 10Dzahn: ncredir: update vim modeline options for dat file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180255 (owner: 10BCornwall) [13:52:29] (03CR) 10Stevemunene: [C:03+2] dse-k8s: Add dse-k8s-codfw ctrl and worker nodes to role [puppet] - 10https://gerrit.wikimedia.org/r/1180512 (https://phabricator.wikimedia.org/T397301) (owner: 10Stevemunene) [13:52:30] (03PS1) 10CDanis: haproxy: maxconn bump [puppet] - 10https://gerrit.wikimedia.org/r/1180568 (https://phabricator.wikimedia.org/T401695) [13:52:37] (03CR) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to 
support using JSON-RPC API (031 comment) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (owner: 10Cathal Mooney) [13:52:57] (03CR) 10Vgutierrez: [C:03+1] haproxy: maxconn bump [puppet] - 10https://gerrit.wikimedia.org/r/1180568 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [13:53:07] (03CR) 10CDanis: [C:03+2] haproxy: maxconn bump [puppet] - 10https://gerrit.wikimedia.org/r/1180568 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [13:56:06] cdanis: is it ok to merge your changes? [13:56:12] stevemunene: please [13:56:23] (03PS1) 10Muehlenhoff: transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) [13:56:27] Ack. [13:57:14] (03CR) 10CI reject: [V:04-1] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [13:57:31] (03PS1) 10Tiziano Fogli: cloud: add missing hiera values to enable nrpe wrapper on vps [puppet] - 10https://gerrit.wikimedia.org/r/1180565 [13:57:42] (03PS2) 10Muehlenhoff: transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) [13:58:30] (03CR) 10CI reject: [V:04-1] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [13:58:57] (03PS2) 10Arnaudb: gitlab: nftables monitoring new thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1180569 (https://phabricator.wikimedia.org/T400971) [13:58:57] (03CR) 10Arnaudb: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1180569/7268/gitlab1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1180569 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [13:59:03] (03CR) 10Arnaudb: [C:03+2] gitlab: nftables monitoring new 
thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1180569 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [13:59:25] (03PS2) 10Tiziano Fogli: cloud: add missing hiera values to enable nrpe wrapper on vps [puppet] - 10https://gerrit.wikimedia.org/r/1180565 (https://phabricator.wikimedia.org/T395446) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1400) [14:00:32] (03PS4) 10Slyngshede: cache: move banning of requests with no UA to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:00:46] (03CR) 10Klausman: "One small nit, otherwise LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) (owner: 10Bking) [14:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:54] (03PS3) 10Muehlenhoff: transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) [14:06:03] (03CR) 10Vgutierrez: cache: move banning of requests with no UA to haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:06:40] (03PS3) 10Krinkle: varnish: Document mobile user agent regexen and mobile_redirect logic [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [14:06:43] (03CR) 10CI reject: [V:04-1] transferpy: Add support for nftables [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [14:07:13] (03CR) 
10Btullis: [C:03+1] "This looks good to me, but I think that it might need to get reviewed by the security team because of the additional sudoers rights. Maybe" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:07:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:08:09] (03PS5) 10Slyngshede: cache: move banning of requests with no UA to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:08:34] (03CR) 10Slyngshede: cache: move banning of requests with no UA to haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:10:21] (03CR) 10Muehlenhoff: "Patch is ready for review, there is some entirely unrelated flake8 nitpicking somewhere in the tests (possibly because flake8 became stric" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [14:12:01] (03PS6) 10Slyngshede: cache: move banning of requests with no UA to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:12:11] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11102397 (10Andrew) p:05Triage→03Medium [14:12:58] (03CR) 10Vgutierrez: [C:03+1] cache: move banning of requests with no UA to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1180500 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:17:53] (03CR) 10Ozge: [C:03+1] stat: deploy an analytics-ml keytab on each host [puppet] - 
10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:19:14] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubemaster),name=dse-k8s-ctrl2001.codfw.wmnet [14:19:27] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubemaster),name=dse-k8s-ctrl2002.codfw.wmnet [14:25:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1400) [14:30:06] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1430) [14:30:41] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#11102470 (10Clement_Goubert) 05In progress→03Resolved [14:32:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:32:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81586 and previous config saved to /var/cache/conftool/dbconfig/20250820-143222-fceratto.json [14:32:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:36:48] (03PS1) 10Arnaudb: Revert "gitlab: nftables monitoring new thresholds" [puppet] - 10https://gerrit.wikimedia.org/r/1180576 [14:37:44] (03CR) 10Arnaudb: [C:03+2] Revert "gitlab: nftables monitoring new thresholds" [puppet] - 
10https://gerrit.wikimedia.org/r/1180576 (owner: 10Arnaudb) [14:43:11] (03PS4) 10Krinkle: varnish: Document mobile user agent regexen and mobile_redirect logic [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) [14:43:11] (03PS1) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [14:44:14] (03CR) 10Muehlenhoff: "New sudo rules need to be approved in the weekly SRE IF meeting (which is Monday), if that needs to happen quicker, we can escalate to som" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:44:55] (03CR) 10Muehlenhoff: "check" [software/transferpy] - 10https://gerrit.wikimedia.org/r/1143539 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [14:45:18] (03CR) 10Brouberol: "Monday is completely fine!" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:45:24] (03PS1) 10Ayounsi: [WIP] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [14:45:49] (03PS3) 10Muehlenhoff: mariadb::ferm_wmcs: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1037766 [14:45:50] (03CR) 10CI reject: [V:04-1] [WIP] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [14:46:51] (03PS2) 10Ayounsi: [WIP] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) [14:47:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11102657 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:48:58] 06SRE, 06Traffic, 13Patch-For-Review: 
apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11102663 (10MoritzMuehlenhoff) [14:50:16] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406 (10MoritzMuehlenhoff) 03NEW [14:50:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [14:52:06] (03CR) 10Filippo Giunchedi: [C:03+1] cloud: add missing hiera values to enable nrpe wrapper on vps [puppet] - 10https://gerrit.wikimedia.org/r/1180565 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:52:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:08] ^ we restarted gitlab (which fixed CI/CD pipelines) [14:55:26] (03CR) 10Tiziano Fogli: [C:03+2] cloud: add missing hiera values to enable nrpe wrapper on vps [puppet] - 10https://gerrit.wikimedia.org/r/1180565 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:55:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037766 (owner: 10Muehlenhoff) [14:55:57] (03PS2) 10Tiziano Fogli: Reapply "nrpe wrapper: define Prometheus alerts via Puppet resources" [puppet] - 10https://gerrit.wikimedia.org/r/1180566 (https://phabricator.wikimedia.org/T395446) [14:55:57] (03CR) 10Tiziano Fogli: [C:03+2] "I’m self-merging the patch since there are no code changes from the previous one; the required modifications were made in a dedicated patc" [puppet] - 10https://gerrit.wikimedia.org/r/1180566 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:57:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:40] (03PS2) 10Scott French: deployment_server: switch mw-debug/next to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) [15:01:02] (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM, one optional nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [15:01:33] (03PS1) 10Arnaudb: gitlab: nftables monitoring new thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1180580 (https://phabricator.wikimedia.org/T400971) [15:01:36] (03CR) 10Arnaudb: [C:03+2] gitlab: nftables monitoring new thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1180580 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [15:02:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T399249)', diff saved to https://phabricator.wikimedia.org/P81587 and previous config saved to /var/cache/conftool/dbconfig/20250820-150204-fceratto.json [15:02:09] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:02:44] jhathaway@cumin1002 provision (PID 4078785) is awaiting input [15:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface 
down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:05:39] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2001.codfw.wmnet [15:05:45] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2002.codfw.wmnet [15:05:52] !log stevemunene@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: service=(kubesvc),name=dse-k8s-worker2003.codfw.wmnet [15:07:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:57] (03CR) 10Herron: [C:03+1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:13:54] (03PS1) 10David Caro: aptrepo: add k8s 1.30 to trixie-wikimedia repo [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) [15:17:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P81588 and previous config saved to /var/cache/conftool/dbconfig/20250820-151712-fceratto.json [15:18:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1180499 (https://phabricator.wikimedia.org/T402191) (owner: 10Vgutierrez) [15:18:46] (03PS3) 10Stevemunene: dse-k8s: Add dse-k8s-codfw to service list [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) [15:19:04] (03CR) 10Stevemunene: "The previous changes were merged and the kubesvc and kubemaster services set as active." [puppet] - 10https://gerrit.wikimedia.org/r/1180116 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [15:20:42] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#11102786 (10RobH) 05Open→03Resolved [15:20:53] (03CR) 10Ayounsi: [WIP] Routed ganeti: improve firewalling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [15:23:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:33] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:50] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro [15:31:00] (03PS2) 10David Caro: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) [15:31:16] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:32:13] (03CR) 
10Muehlenhoff: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [15:32:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P81589 and previous config saved to /var/cache/conftool/dbconfig/20250820-153219-fceratto.json [15:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:36:29] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:38:44] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:39:28] 06SRE, 06Infrastructure-Foundations: Test HTTP Boot as an installation method with Trixie - https://phabricator.wikimedia.org/T402409 (10MoritzMuehlenhoff) 03NEW [15:39:36] 06SRE, 06Infrastructure-Foundations: Test HTTP Boot as an installation method with Trixie - https://phabricator.wikimedia.org/T402409#11102838 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:41:37] (03CR) 10David Caro: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [15:43:34] (03CR) 10Muehlenhoff: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [15:43:35] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:43:57] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:44:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:47:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T399249)', diff saved to https://phabricator.wikimedia.org/P81590 and previous config saved to /var/cache/conftool/dbconfig/20250820-154726-fceratto.json [15:47:32] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:47:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1199.eqiad.wmnet with reason: Maintenance [15:47:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T399249)', diff saved to https://phabricator.wikimedia.org/P81591 and previous config saved to /var/cache/conftool/dbconfig/20250820-154749-fceratto.json [15:51:55] (03CR) 10David Caro: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [15:54:17] FIRING: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81593 and previous config saved to /var/cache/conftool/dbconfig/20250820-155420-fceratto.json [15:54:25] 
T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:56:52] (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180594 (https://phabricator.wikimedia.org/T399579) [15:57:06] (03CR) 10Vgutierrez: [C:03+2] admin: Add dang to analytics-(wmde|privatedata)-users [puppet] - 10https://gerrit.wikimedia.org/r/1180499 (https://phabricator.wikimedia.org/T402191) (owner: 10Vgutierrez) [15:58:03] (03CR) 10Muehlenhoff: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [15:59:17] RESOLVED: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:21] (03CR) 10David Caro: aptrepo: add k8s 1.30 and helm to trixie-wikimedia repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180584 (https://phabricator.wikimedia.org/T362869) (owner: 10David Caro) [16:00:53] jouncebot: nowandnext [16:00:53] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [16:00:54] In 0 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1700) [16:01:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11102930 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez The change granting access to the requested groups has been merge... 
[16:01:40] (03CR) 10Zabe: [C:03+2] Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180594 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [16:02:50] (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180594 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [16:03:39] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180594|Stop writing to cl_to and cl_collation on more wikis (T399579)]] [16:03:43] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [16:03:47] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6671/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180220 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:06:10] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180594|Stop writing to cl_to and cl_collation on more wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:06:39] !log zabe@deploy1003 zabe: Continuing with sync [16:07:18] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11102953 (10BTullis) Hi @Jclark-ctr - It's a bit complicated - sorry :-) We are currently in the middle of re-assessing whether or not we are going to go ahead with a project to upg... [16:08:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for Dima_Koushha_WMDE - https://phabricator.wikimedia.org/T402384#11102962 (10Vgutierrez) @Dima_Koushha_WMDE could you create a gerrit change that contains your public SSH key (it can be immediately abandoned)? we can use that as a way of verifying t... 
[16:09:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81594 and previous config saved to /var/cache/conftool/dbconfig/20250820-160927-fceratto.json [16:09:31] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180597 [16:11:55] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180594|Stop writing to cl_to and cl_collation on more wikis (T399579)]] (duration: 08m 16s) [16:12:00] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [16:15:16] (03PS1) 10Giuseppe Lavagetto: Action edit UX improvement [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180598 [16:15:26] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Action edit UX improvement [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180598 (owner: 10Giuseppe Lavagetto) [16:15:38] (03CR) 10Herron: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [16:15:48] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003" [16:15:49] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003 [16:16:37] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003 [16:16:39] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003" [16:20:26] (03CR) 10Vgutierrez: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [16:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:24:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81595 and previous config saved to /var/cache/conftool/dbconfig/20250820-162435-fceratto.json [16:31:52] (03PS1) 10Jdlrobson: Temporarily use production for summary endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180599 (https://phabricator.wikimedia.org/T400694) [16:32:50] (03CR) 10Herron: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [16:39:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81596 and previous config saved to /var/cache/conftool/dbconfig/20250820-163942-fceratto.json [16:39:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:39:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:40:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T399249)', diff saved to https://phabricator.wikimedia.org/P81597 and previous config saved to /var/cache/conftool/dbconfig/20250820-164005-fceratto.json [16:41:00] (03PS4) 10Tchanders: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 
(https://phabricator.wikimedia.org/T402181) [16:46:34] (03CR) 10Tchanders: "I changed this patch to enable temp accounts by default and instead disable them for wikis on a (much shorter) exclusion list, as agreed w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [16:47:10] (03PS1) 10Giuseppe Lavagetto: bugfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180601 [16:47:18] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] bugfix [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1180601 (owner: 10Giuseppe Lavagetto) [16:47:30] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003" [16:47:31] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003 [16:48:16] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003 [16:48:17] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003" [17:00:05] swfrench-wmf: gettimeofday() says it's time for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1700) [17:00:12] o/ [17:00:18] I'll get started in a bit [17:05:50] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up build-report cleanup - T401721 [17:05:55] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [17:08:10] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up build-report cleanup - T401721 (duration: 02m 41s) [17:09:49] (03CR) 10Scott French: "Thank you both for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) (owner: 10Scott French) [17:10:08] (03CR) 10Scott French: [C:03+2] deployment_server: switch mw-debug/next to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) (owner: 10Scott French) [17:18:46] !log swfrench@deploy1003 Started scap sync-world: No-sync deployment to verify mw-debug/next helmfile diffs - T401254 [17:18:51] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [17:19:11] !log swfrench@deploy1003 Stopping before sync operations [17:22:15] (03CR) 10BCornwall: [V:03+2 C:03+2] "Records are right, dnssec is disabled" [puppet] - 10https://gerrit.wikimedia.org/r/1180597 (owner: 10Ncmonitor) [17:23:01] (03CR) 10Dreamy Jazz: "Non-blocking: Could we consider naming this `tempaccounts_disabled.dblist`?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [17:23:43] !log swfrench@deploy1003 Started scap sync-world: No-build deployment to apply mw-debug/next helmfile diffs - T401254 [17:24:25] (03CR) 10Bking: [C:03+2] stat hosts: Alert on memory stalls (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [17:25:10] gs [17:28:32] !log swfrench@deploy1003 Finished scap sync-world: No-build deployment to apply mw-debug/next helmfile diffs - T401254 (duration: 05m 57s) [17:28:36] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [17:32:22] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180608 [17:33:57] (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on large s7 and s8 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180609 (https://phabricator.wikimedia.org/T399579) [17:34:55] (03CR) 10BCornwall: "Records are sound." 
[puppet] - 10https://gerrit.wikimedia.org/r/1180608 (owner: 10Ncmonitor) [17:34:59] (03CR) 10BCornwall: [V:03+2 C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180608 (owner: 10Ncmonitor) [17:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:40:20] (03CR) 10Eevans: [C:03+2] sessionstore: upgrade production to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180227 (owner: 10Eevans) [17:42:17] (03Merged) 10jenkins-bot: sessionstore: upgrade production to Kask v1.0.18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180227 (owner: 10Eevans) [17:42:25] (03PS1) 10Ebernhardson: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) [17:43:18] (03PS2) 10Ebernhardson: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) [17:44:42] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply [17:45:11] alright, I believe I'm finished with the infra window [17:46:47] (03PS1) 10Dzahn: zuul::main: include profile::pki::client [puppet] - 10https://gerrit.wikimedia.org/r/1180613 (https://phabricator.wikimedia.org/T401614) [17:48:15] (03CR) 10Umherirrender: "(Getting review comments on backport is a bit curious)" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180525 (https://phabricator.wikimedia.org/T402273) (owner: 10D3r1ck01) [17:50:17] (03CR) 10Bking: "I've built the image successfully using the Dockerfile at https://phabricator.wikimedia.org/P81598 (same as here, but with hard-coded 
valu" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) (owner: 10Bking) [17:51:48] (03PS4) 10Bking: golang: add trixie-based golang-1.24 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) [17:52:27] (03CR) 10Bking: golang: add trixie-based golang-1.24 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) (owner: 10Bking) [17:54:51] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [17:56:53] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1180613/6672/" [puppet] - 10https://gerrit.wikimedia.org/r/1180613 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [18:00:05] jnuche and jeena: MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T1800). Please do the needful. 
[18:07:17] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399249)', diff saved to https://phabricator.wikimedia.org/P81599 and previous config saved to /var/cache/conftool/dbconfig/20250820-180759-fceratto.json [18:08:05] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:12:17] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:24] (03PS1) 10Dzahn: pki: create a new intermediate CA for zuul [puppet] - 10https://gerrit.wikimedia.org/r/1180616 (https://phabricator.wikimedia.org/T395938) [18:19:44] (03CR) 10Dzahn: "context of all of this is project https://phabricator.wikimedia.org/project/view/7592/" [puppet] - 10https://gerrit.wikimedia.org/r/1180616 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:20:33] (03CR) 10Dzahn: [V:03+1 C:03+2] various: fix puppet-lint legacy_fact warnings for collab services [puppet] - 10https://gerrit.wikimedia.org/r/1178619 (owner: 10Dzahn) [18:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:23:07] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81600 and previous config saved to /var/cache/conftool/dbconfig/20250820-182307-fceratto.json [18:24:35] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro [18:25:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11103449 (10VRiley-WMF) Many thanks to @ayounsi set this secondary cable cloudcephosd1042 C8 U12 CableID 5204 Port 29 CableID 20220266 Port 28 [18:27:24] (03PS35) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [18:28:18] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6673/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:29:45] (03CR) 10CI reject: [V:04-1] dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:34:42] (03PS2) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T371088) [18:36:40] (03PS3) 10Cathal Mooney: Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 
(https://phabricator.wikimedia.org/T371088) [18:38:07] jhathaway@cumin1002 provision (PID 226056) is awaiting input [18:38:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81603 and previous config saved to /var/cache/conftool/dbconfig/20250820-183814-fceratto.json [18:38:32] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6674/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:39:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:39:33] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6675/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:42:19] (03PS1) 10Dzahn: add dummy private key for zuul intermediate CA [labs/private] - 10https://gerrit.wikimedia.org/r/1180621 (https://phabricator.wikimedia.org/T401614) [18:43:05] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:43:05] (03PS2) 10Cathal Mooney: Nokia JSON-RPC: Add secrets to support using JSON-RPC API [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 [18:43:19] (03CR) 10Dzahn: [V:03+2 C:03+2] "to go with I89486e5bafee3d4" [labs/private] - 10https://gerrit.wikimedia.org/r/1180621 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [18:43:52] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:43:56] (03CR) 10Cathal Mooney: 
Nokia JSON-RPC: Add secrets to support using JSON-RPC API (032 comments) [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (owner: 10Cathal Mooney) [18:47:08] (03CR) 10CDanis: [C:03+1] add dummy private key for zuul intermediate CA [labs/private] - 10https://gerrit.wikimedia.org/r/1180621 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [18:47:13] (03CR) 10CDanis: [C:03+1] pki: create a new intermediate CA for zuul [puppet] - 10https://gerrit.wikimedia.org/r/1180616 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:47:56] (03PS36) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [18:50:34] (03PS1) 10Eevans: sessionstore: (explicitly) disable TLS host verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180622 (https://phabricator.wikimedia.org/T352647) [18:50:48] (03CR) 10CI reject: [V:04-1] Nokia: Add support for Python config generation and JSON-RPC API [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [18:51:18] (03CR) 10CDanis: [C:03+1] sessionstore: (explicitly) disable TLS host verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180622 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:51:37] (03PS5) 10Bking: golang: add trixie-based golang-1.24 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) [18:52:29] (03CR) 10Bking: [V:03+2 C:03+2] "I corrected the nit mentioned above. Self-merging in the interest of time..." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174751 (https://phabricator.wikimedia.org/T400295) (owner: 10Bking) [18:52:49] (03CR) 10Eevans: [C:03+2] sessionstore: (explicitly) disable TLS host verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180622 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:53:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T399249)', diff saved to https://phabricator.wikimedia.org/P81604 and previous config saved to /var/cache/conftool/dbconfig/20250820-185321-fceratto.json [18:53:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:53:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [18:53:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T399249)', diff saved to https://phabricator.wikimedia.org/P81605 and previous config saved to /var/cache/conftool/dbconfig/20250820-185344-fceratto.json [18:54:05] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [18:54:28] (03Merged) 10jenkins-bot: sessionstore: (explicitly) disable TLS host verification [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180622 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [18:54:57] (03CR) 10Dzahn: [C:03+2] pki: create a new intermediate CA for zuul [puppet] - 10https://gerrit.wikimedia.org/r/1180616 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:55:37] (03PS37) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [18:55:55] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: apply [18:56:18] !log eevans@deploy1003 helmfile [codfw] 
DONE helmfile.d/services/sessionstore: apply [18:56:48] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6677/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [19:01:27] (03CR) 10Dzahn: [C:03+2] "Info: Applying configuration version '(d7f7ab1b7f) Dzahn - pki: create a new intermediate CA for zuul'" [puppet] - 10https://gerrit.wikimedia.org/r/1180616 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:09:13] (03PS1) 10Dzahn: pki: add cert for new zuul intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1180626 (https://phabricator.wikimedia.org/T401614) [19:10:34] (03PS1) 10JHathaway: WIP: always use EFI [cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 [19:11:28] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [19:12:17] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [19:13:31] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:13:44] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host an-test-coord1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [19:14:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432 (10cmooney) 03NEW p:05Triage→03Medium [19:20:59] (03CR) 10Dzahn: [C:03+2] pki: add cert for new zuul intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1180626 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [19:29:04] (03PS1) 10Dzahn: pki::multirootca: add entry for zuul intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) [19:32:59] (03CR) 10CDanis: pki::multirootca: add entry for zuul intermediate CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [19:33:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11103889 (10cmooney) [19:33:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11103890 (10cmooney) [19:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:38:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad: new structured cabling needed between cages to eqiad 2025/6 switch refresh - https://phabricator.wikimedia.org/T402432#11103905 (10cmooney) Ok so I spoke to traffic and while they are close to ditching the need for L2 adjacency... 
[19:38:52] (03CR) 10Dzahn: pki::multirootca: add entry for zuul intermediate CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [19:38:57] (03PS2) 10Dzahn: pki::multirootca: add entry for zuul intermediate CA [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) [19:44:37] (03PS2) 10JHathaway: provision: always set NIC to EFI in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/1180627 (https://phabricator.wikimedia.org/T387577) [19:46:01] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11103963 (10Jgreen) LGTM! [19:49:07] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11103984 (10Jgreen) 05Open→03Resolved a:03Jgreen trial is in progress [19:55:37] (03CR) 10CDanis: [C:03+1] "You might need additional default_usages, but we can try and see" [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [19:55:57] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1180636 [19:56:00] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180637 [19:56:02] (03Abandoned) 10Scott French: mw-debug: switch php.version to 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177421 (https://phabricator.wikimedia.org/T401254) (owner: 10Scott French) [19:56:04] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180638 [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:20] (03CR) 10BCornwall: [V:03+2 C:03+2] "NS records are properly set in MM" [dns] - 10https://gerrit.wikimedia.org/r/1180636 (owner: 10Ncmonitor) [20:01:35] !log brett@dns1004 START - running authdns-update [20:02:42] !log brett@dns1004 END - running authdns-update [20:09:09] (03CR) 10Dzahn: [C:03+2] "thank you for the help" [puppet] - 10https://gerrit.wikimedia.org/r/1180630 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [20:09:43] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:16:04] (03CR) 10BCornwall: [V:03+2 C:03+2] "Records are right" [puppet] - 10https://gerrit.wikimedia.org/r/1180638 (owner: 10Ncmonitor) [20:19:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399249)', diff saved to https://phabricator.wikimedia.org/P81606 and previous config saved to /var/cache/conftool/dbconfig/20250820-201914-fceratto.json [20:19:20] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:23:12] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[20:23:19] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:25:17] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:25:18] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:34:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81607 and previous config saved to /var/cache/conftool/dbconfig/20250820-203422-fceratto.json [20:34:34] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:41:48] (03CR) 10Cwhite: [C:03+2] DiskSpace: add DiskSpace critical alert [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [20:43:25] (03Merged) 10jenkins-bot: DiskSpace: add DiskSpace critical alert [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [20:44:29] (03PS1) 10Cwhite: monitoring: ensure disk space check is absent [puppet] - 10https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) [20:46:48] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:48:44] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11104228 (10RLazarus) How's this looking? If you're happy with it, do you want to try adding the second of the firs... 
[20:49:17] (03PS1) 10Zabe: Initial configuration for bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180644 (https://phabricator.wikimedia.org/T402130) [20:49:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81608 and previous config saved to /var/cache/conftool/dbconfig/20250820-204929-fceratto.json [20:49:58] (03PS1) 10Zabe: Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) [20:52:22] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [20:52:28] (03PS1) 10Dzahn: zuul::main: add TLS cert/key for nodepool->zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1180646 (https://phabricator.wikimedia.org/T401614) [20:55:49] (03PS1) 10Zabe: tlwikisource: Enable parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180649 (https://phabricator.wikimedia.org/T388654) [20:56:43] (03Abandoned) 10Zabe: tlwikisource: Enable parsoid mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180649 (https://phabricator.wikimedia.org/T388654) (owner: 10Zabe) [20:58:42] (03PS2) 10Brennen Bearnes: phabricator: bump APCu shared memory size to 4096M [puppet] - 10https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) [20:59:28] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-08-05-075031 to 2025-08-20-203801 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180651 (https://phabricator.wikimedia.org/T400652) [20:59:37] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-13-113934 to 2025-08-20-182519 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180652 (https://phabricator.wikimedia.org/T400652) [21:00:05] Deploy window Wikifunctions Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T2100) [21:00:37] (03CR) 10CI reject: [V:04-1] phabricator: bump APCu shared memory size to 4096M [puppet] - 10https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) (owner: 10Brennen Bearnes) [21:02:14] (03PS2) 10Cwhite: resources: remove most filters [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) [21:02:41] (03CR) 10Cwhite: "I agree, it looks like most of the filters are out of date except tmpfs. I'll rewrite this patch." [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [21:03:42] (03PS3) 10Brennen Bearnes: phabricator: bump APCu shared memory size to 4096M [puppet] - 10https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) [21:04:00] (03PS4) 10Brennen Bearnes: phabricator: bump APCu shared memory size to 4096M [puppet] - 10https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) [21:04:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T399249)', diff saved to https://phabricator.wikimedia.org/P81609 and previous config saved to /var/cache/conftool/dbconfig/20250820-210437-fceratto.json [21:04:42] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:04:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1195.eqiad.wmnet with reason: Maintenance [21:05:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T399249)', diff saved to https://phabricator.wikimedia.org/P81610 and previous config saved to /var/cache/conftool/dbconfig/20250820-210500-fceratto.json [21:06:49] (03PS3) 10Cwhite: k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) [21:06:50] (03PS3) 10Cwhite: 
cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) [21:08:37] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2025-08-05-075031 to 2025-08-20-203801 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180651 (https://phabricator.wikimedia.org/T400652) (owner: 10Jforrester) [21:10:27] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-08-05-075031 to 2025-08-20-203801 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180651 (https://phabricator.wikimedia.org/T400652) (owner: 10Jforrester) [21:12:55] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:13:17] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:13:42] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:14:41] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:15:28] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:15:49] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:16:30] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:16:34] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [21:18:17] RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:20:08] !log vriley@cumin1003 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1045 - vriley@cumin1003" [21:20:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1045 - vriley@cumin1003" [21:20:12] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:29] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-13-113934 to 2025-08-20-210742 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180652 (https://phabricator.wikimedia.org/T400652) [21:20:33] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [21:20:48] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045 [21:20:53] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-08-13-113934 to 2025-08-20-210742 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180652 (https://phabricator.wikimedia.org/T400652) (owner: 10Jforrester) [21:21:28] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:22:36] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-08-13-113934 to 2025-08-20-210742 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180652 (https://phabricator.wikimedia.org/T400652) (owner: 10Jforrester) [21:23:03] 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11104345 (10Andrew) 05Stalled→03In progress [21:23:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> 
wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:23:17] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:23:20] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:23:31] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:23:55] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:24:24] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:24:56] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:25:05] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:25:36] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:26:40] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1180646/6678/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1180646 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [21:26:57] (03CR) 10Dzahn: [V:03+1 C:03+2] zuul::main: add TLS cert/key for nodepool->zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/1180646 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [21:27:24] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye [21:29:54] (03PS3) 10Scott French: php: remove deprecated ${} string interpolation [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) [21:30:09] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer 
wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:30:14] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:32:12] (03CR) 10Scott French: [V:03+2] "* Built locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1180653 (https://phabricator.wikimedia.org/T402424) (owner: 10Scott French) [21:33:25] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:35:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:36:39] (03CR) 10Cwhite: [C:03+2] logstash: remove udp in error alerts [alerts] - 10https://gerrit.wikimedia.org/r/1179221 (owner: 10Cwhite) [21:37:04] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:38:07] (03Merged) 10jenkins-bot: logstash: remove udp in error alerts [alerts] - 10https://gerrit.wikimedia.org/r/1179221 (owner: 10Cwhite) [21:39:04] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:39:08] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:41:59] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer 
wikidata_main from wdqs1013.eqiad.wmnet -> wdqs1021.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:42:55] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:44:18] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:46:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye [21:51:39] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:51:43] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:52:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2008.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:53:13] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:54:13] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:56:18] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [21:57:47] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:57:48] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank 
node labels - https://phabricator.wikimedia.org/T386098 [22:00:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T2200) [22:02:01] FIRING: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:27] o/ gonna use deploy window for a few things [22:04:43] !log [WDQS] `ryankemper@wdqs1016:~$ sudo systemctl restart wdqs-blazegraph` [22:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:01] RESOLVED: [2x] ProbeDown: Service wdqs1016:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180599 (https://phabricator.wikimedia.org/T400694) (owner: 10Jdlrobson) [22:08:38] (03Merged) 10jenkins-bot: Temporarily use production for summary endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180599 (https://phabricator.wikimedia.org/T400694) (owner: 10Jdlrobson) [22:09:15] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [22:10:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" 
[extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180244 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:10:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:10:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:11:51] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [22:13:23] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [22:13:40] (03Merged) 10jenkins-bot: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180244 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:13:41] (03Merged) 10jenkins-bot: Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180205 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:14:08] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1180244|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]], [[gerrit:1180205|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]] [22:14:12] T402050: Shareable Reading list items should not be visible on the Special:ReadingLists page - https://phabricator.wikimedia.org/T402050 [22:14:51] 
(03PS1) 10Jdlrobson: Hide content for wgReadingListsAnonymizedPreviews = true [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180670 (https://phabricator.wikimedia.org/T402050) [22:15:13] (03PS1) 10Jdlrobson: Hide content for wgReadingListsAnonymizedPreviews = true [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180671 (https://phabricator.wikimedia.org/T402050) [22:17:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye [22:18:33] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1180244|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]], [[gerrit:1180205|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:19:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [22:20:01] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:21:54] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1011.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:21:58] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:22:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [22:23:44] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye [22:25:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS 
bullseye [22:25:15] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180244|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]], [[gerrit:1180205|Restore $wgReadingListsAnonymizedPreviews feature flag for shared lists (T402050)]] (duration: 11m 07s) [22:25:19] T402050: Shareable Reading list items should not be visible on the Special:ReadingLists page - https://phabricator.wikimedia.org/T402050 [22:25:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399249)', diff saved to https://phabricator.wikimedia.org/P81611 and previous config saved to /var/cache/conftool/dbconfig/20250820-222544-fceratto.json [22:25:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:26:07] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11104623 (10Andrew) [22:26:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:39] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bullseye [22:31:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1012.eqiad.wmnet -> wdqs1017.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:31:59] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:32:00] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1013.eqiad.wmnet -> 
wdqs1021.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:32:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180670 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:32:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180671 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:33:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:33:56] (03Merged) 10jenkins-bot: Hide content for wgReadingListsAnonymizedPreviews = true [extensions/ReadingLists] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1180670 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:33:57] (03Merged) 10jenkins-bot: Hide content for wgReadingListsAnonymizedPreviews = true [extensions/ReadingLists] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1180671 (https://phabricator.wikimedia.org/T402050) (owner: 10Jdlrobson) [22:34:27] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1180670|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]], [[gerrit:1180671|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]] [22:34:31] T402050: Shareable Reading list items should not be visible on the Special:ReadingLists page - https://phabricator.wikimedia.org/T402050 [22:37:50] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [22:38:44] !log resize prometheus/k8s-dse +25G on 
prometheus100[78] [22:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:51] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1180670|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]], [[gerrit:1180671|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:39:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye [22:40:13] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:40:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P81612 and previous config saved to /var/cache/conftool/dbconfig/20250820-224052-fceratto.json [22:40:54] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1044.eqiad.wmnet with OS bullseye [22:41:18] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bullseye [22:42:14] RECOVERY - Disk space on prometheus1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1008&var-datasource=eqiad+prometheus/ops [22:45:21] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180670|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]], [[gerrit:1180671|Hide content for wgReadingListsAnonymizedPreviews = true (T402050)]] (duration: 10m 54s) [22:45:26] T402050: Shareable Reading list items should not be visible on the Special:ReadingLists page - https://phabricator.wikimedia.org/T402050 [22:45:41] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from 
wdqs2008.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:45:46] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:47:43] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:48:00] RECOVERY - Disk space on prometheus1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1007&var-datasource=eqiad+prometheus/ops [22:48:14] (03CR) 10Dzahn: "I do not feel strongly, it's easy to convince me guys. Just go ahead regardless." [puppet] - 10https://gerrit.wikimedia.org/r/1180637 (owner: 10Ncmonitor) [22:48:28] okay done with spiderpig! [22:49:05] (03CR) 10Scott French: [C:03+1] k8s-ops: add disk space check overrides (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [22:50:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1023.eqiad.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:51:09] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [22:52:39] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [22:54:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [22:56:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff 
saved to https://phabricator.wikimedia.org/P81613 and previous config saved to /var/cache/conftool/dbconfig/20250820-225600-fceratto.json [22:58:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [22:59:20] jouncebot: nowandnext [22:59:21] For the next 0 hour(s) and 0 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250820T2200) [22:59:21] In 7 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0600) [22:59:21] In 7 hour(s) and 0 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250821T0600) [23:00:19] (03CR) 10Zabe: [C:03+2] Initial configuration for bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180644 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:01:24] (03Merged) 10jenkins-bot: Initial configuration for bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180644 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:01:53] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180644|Initial configuration for bewwiktionary (T402130)]] [23:01:58] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:06:22] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180644|Initial configuration for bewwiktionary (T402130)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:08:27] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage [23:08:43] !log zabe@deploy1003 zabe: Continuing with sync [23:11:05] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bullseye [23:11:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T399249)', diff saved to https://phabricator.wikimedia.org/P81614 and previous config saved to /var/cache/conftool/dbconfig/20250820-231107-fceratto.json [23:11:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:11:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [23:11:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:11:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T399249)', diff saved to https://phabricator.wikimedia.org/P81615 and previous config saved to /var/cache/conftool/dbconfig/20250820-231148-fceratto.json [23:11:58] 07Puppet, 06SRE, 03Readers Essential Work 2025: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#11104804 (10Jdlrobson-WMF) I'll keep an eye on this during the mobile domain switchover. 
[23:12:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1044.eqiad.wmnet with reason: host reimage [23:13:53] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180644|Initial configuration for bewwiktionary (T402130)]] (duration: 11m 59s) [23:13:57] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:15:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye [23:16:41] !log create Wiktionary Betawi # T402130 [23:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:46] (03CR) 10Zabe: [C:03+2] Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:16:54] (03CR) 10CI reject: [V:04-1] Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:17:14] (03PS2) 10Zabe: Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) [23:17:24] (03CR) 10Zabe: [C:03+2] Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:18:22] (03Merged) 10jenkins-bot: Activate bewwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180645 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:18:50] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180645|Activate bewwiktionary (T402130)]] [23:23:14] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180645|Activate bewwiktionary (T402130)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:23:19] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:24:46] !log zabe@deploy1003 zabe: Continuing with sync [23:26:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:03] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180645|Activate bewwiktionary (T402130)]] (duration: 11m 12s) [23:30:07] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:30:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1044.eqiad.wmnet with OS bullseye [23:31:40] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180680 (https://phabricator.wikimedia.org/T402130) [23:31:42] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180680 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:32:32] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180680 (https://phabricator.wikimedia.org/T402130) (owner: 10Zabe) [23:32:58] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1180680|Update interwiki cache (T402130)]] [23:37:28] !log zabe@deploy1003 zabe: Backport for [[gerrit:1180680|Update interwiki cache (T402130)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[23:37:33] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:37:49] !log zabe@deploy1003 zabe: Continuing with sync [23:38:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T399249)', diff saved to https://phabricator.wikimedia.org/P81616 and previous config saved to /var/cache/conftool/dbconfig/20250820-233802-fceratto.json [23:38:07] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:38:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1180683 [23:38:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1180683 (owner: 10TrainBranchBot) [23:43:01] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180680|Update interwiki cache (T402130)]] (duration: 10m 03s) [23:43:06] T402130: Create Wiktionary Betawi - https://phabricator.wikimedia.org/T402130 [23:46:17] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis bewwiktionary in section s5 [23:52:36] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1180683 (owner: 10TrainBranchBot) [23:53:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P81617 and previous config saved to /var/cache/conftool/dbconfig/20250820-235310-fceratto.json [23:55:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis bewwiktionary in section s5