[00:02:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T402925)', diff saved to https://phabricator.wikimedia.org/P81815 and previous config saved to /var/cache/conftool/dbconfig/20250827-000203-ladsgroup.json
[00:02:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:02:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[00:02:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81816 and previous config saved to /var/cache/conftool/dbconfig/20250827-000227-ladsgroup.json
[00:04:36] (PS1) Zabe: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244
[00:04:46] (PS1) Zabe: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245
[00:05:11] (CR) Zabe: [C:+2] BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244 (owner: Zabe)
[00:05:13] (CR) Zabe: [C:+2] BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245 (owner: Zabe)
[00:08:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247
[00:08:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247 (owner: TrainBranchBot)
[00:10:58] (CR) Papaul: "recheck" [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[00:16:20] (PS8) Papaul: Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845)
[00:17:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81817 and previous config saved to /var/cache/conftool/dbconfig/20250827-001701-ladsgroup.json
[00:17:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:22:47] (Merged) jenkins-bot: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244 (owner: Zabe)
[00:22:51] (Merged) jenkins-bot: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245 (owner: Zabe)
[00:23:40] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]]
[00:29:41] !log zabe@deploy1003 zabe: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:30:10] !log zabe@deploy1003 zabe: Continuing with sync
[00:32:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81818 and previous config saved to /var/cache/conftool/dbconfig/20250827-003208-ladsgroup.json
[00:35:26] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247 (owner: TrainBranchBot)
[00:35:29] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]] (duration: 11m 49s)
[00:47:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81819 and previous config saved to /var/cache/conftool/dbconfig/20250827-004716-ladsgroup.json
[00:50:23] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2039
[00:50:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2039
[00:51:33] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11121998 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[01:00:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:01:06] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:02:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81821 and previous config saved to /var/cache/conftool/dbconfig/20250827-010223-ladsgroup.json
[01:02:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:02:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[01:02:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81822 and previous config saved to /var/cache/conftool/dbconfig/20250827-010246-ladsgroup.json
[01:12:32] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 25s)
[01:15:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81823 and previous config saved to /var/cache/conftool/dbconfig/20250827-011501-ladsgroup.json
[01:15:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:19:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:24:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:26:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81824 and previous config saved to /var/cache/conftool/dbconfig/20250827-013008-ladsgroup.json
[01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:53] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122052 (phaultfinder)
[01:45:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81826 and previous config saved to /var/cache/conftool/dbconfig/20250827-014516-ladsgroup.json
[01:48:51] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122065 (phaultfinder)
[02:00:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81827 and previous config saved to /var/cache/conftool/dbconfig/20250827-020023-ladsgroup.json
[02:00:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:00:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance
[02:00:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81828 and previous config saved to /var/cache/conftool/dbconfig/20250827-020046-ladsgroup.json
[02:10:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81829 and previous config saved to /var/cache/conftool/dbconfig/20250827-021006-ladsgroup.json
[02:10:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:25:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P81833 and previous config saved to /var/cache/conftool/dbconfig/20250827-022513-ladsgroup.json
[02:40:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P81834 and previous config saved to /var/cache/conftool/dbconfig/20250827-024021-ladsgroup.json
[02:53:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:55:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81835 and previous config saved to /var/cache/conftool/dbconfig/20250827-025529-ladsgroup.json
[02:55:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:55:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance
[03:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:07:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[03:07:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81836 and previous config saved to /var/cache/conftool/dbconfig/20250827-030713-ladsgroup.json
[03:07:18] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:10:25] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122138 (Zache) > However, I remain concerned that a determined attacker or a widely used non-compliant script could create the same load again. This risk hi...
[03:20:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81837 and previous config saved to /var/cache/conftool/dbconfig/20250827-032019-ladsgroup.json
[03:20:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:35:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81838 and previous config saved to /var/cache/conftool/dbconfig/20250827-033527-ladsgroup.json
[03:50:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81839 and previous config saved to /var/cache/conftool/dbconfig/20250827-035035-ladsgroup.json
[03:59:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:03:05] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:03:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:04:01] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:04:05] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:04:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:05:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81840 and previous config saved to /var/cache/conftool/dbconfig/20250827-040542-ladsgroup.json
[04:05:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:05:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Maintenance
[04:06:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81841 and previous config saved to /var/cache/conftool/dbconfig/20250827-040605-ladsgroup.json
[04:08:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:09:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:19:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81842 and previous config saved to /var/cache/conftool/dbconfig/20250827-041900-ladsgroup.json
[04:19:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:27:11] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (install6003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:34:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81843 and previous config saved to /var/cache/conftool/dbconfig/20250827-043407-ladsgroup.json
[04:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:46:40] SRE, Traffic, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11122213 (Abbe98) Affected SPARQL backends appear to at least include Fuseki and Virtuoso.
[04:49:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81844 and previous config saved to /var/cache/conftool/dbconfig/20250827-044915-ladsgroup.json
[05:02:22] (PS1) Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182258
[05:04:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81845 and previous config saved to /var/cache/conftool/dbconfig/20250827-050423-ladsgroup.json
[05:04:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:04:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2228.codfw.wmnet with reason: Maintenance
[05:04:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81846 and previous config saved to /var/cache/conftool/dbconfig/20250827-050446-ladsgroup.json
[05:04:57] (CR) CI reject: [V:-1] Revert^2 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182258 (owner: Arnaudb)
[05:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81848 and previous config saved to /var/cache/conftool/dbconfig/20250827-051540-ladsgroup.json
[05:15:46] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:17:24] (CR) Ayounsi: [C:+1] "nice, lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[05:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:29:35] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81849 and previous config saved to /var/cache/conftool/dbconfig/20250827-053047-ladsgroup.json
[05:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:38:32] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11122268 (ayounsi) Resolved→Open Good job all!! @Ladsgroup from {T378715} do you need to upgrade any listed db* hosts to 10G?
[05:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:45:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81850 and previous config saved to /var/cache/conftool/dbconfig/20250827-054555-ladsgroup.json
[05:48:56] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122281 (phaultfinder)
[05:53:53] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122283 (phaultfinder)
[05:56:53] (PS8) Ayounsi: Nokia: Add initial Python files for nokia switch system config [homer/public] - https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[05:56:53] (PS11) Ayounsi: Nokia: module for interface configuration [homer/public] - https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[05:56:53] (PS4) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[05:56:53] (PS3) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[05:58:24] (CR) CI reject: [V:-1] Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[05:58:25] (CR) CI reject: [V:-1] Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0600)
[06:01:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81851 and previous config saved to /var/cache/conftool/dbconfig/20250827-060103-ladsgroup.json
[06:01:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:25:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:30:48] (PS4) Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182258 (https://phabricator.wikimedia.org/T402611)
[06:36:34] (PS1) Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182440
[06:39:08] (CR) Arnaudb: [C:+2] Revert^3 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182440 (owner: Arnaudb)
[06:52:06] (CR) Muehlenhoff: [C:+2] Assign installserver role to install2005 [puppet] - https://gerrit.wikimedia.org/r/1182167 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[06:56:16] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122371 (Josve05a) (I meant to edit my comment but deleted it… ugh)
[06:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:04] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:16] (CR) Brouberol: [C:+1] opensearch-k8s: allow setting vm.max_map_count [puppet] - https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: Ryan Kemper)
[07:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:02:28] (CR) Muehlenhoff: [C:+2] Update DHCP server in codfw [puppet] - https://gerrit.wikimedia.org/r/1182168 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:02:44] (CR) Muehlenhoff: [C:+2] homer: Update DHCP server in codfw [homer/public] - https://gerrit.wikimedia.org/r/1182165 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:11:18] (CR) Filippo Giunchedi: [C:+1] Update the proxies used by cloudcumin to install2005 [puppet] - https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:14:36] (CR) MVernon: [C:+2] swift: re-add 3 codfw hosts, drain the next 3 [puppet] - https://gerrit.wikimedia.org/r/1182174 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[07:14:54] (CR) MVernon: [C:+2] thanos - put thanos-be2005 back into rings [puppet] - https://gerrit.wikimedia.org/r/1182182 (https://phabricator.wikimedia.org/T400876) (owner: MVernon)
[07:22:30] SRE, SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11122388 (MatthewVernon)
[07:23:09] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11122389 (MatthewVernon)
[07:28:56] (PS7) Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161)
[07:29:20] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:37:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:57:16] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1182189 (owner: Andrew Bogott)
[07:57:35] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1182190 (owner: Andrew Bogott)
[07:59:47] (CR) David Caro: "LGTM, does tofu wait/expect the domain to be active?" [puppet] - https://gerrit.wikimedia.org/r/1182188 (https://phabricator.wikimedia.org/T398712) (owner: Andrew Bogott)
[08:00:05] andre and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0800)
[08:00:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:00:09] (PS1) TrainBranchBot: group1 to 1.45.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377)
[08:00:11] (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377) (owner: TrainBranchBot)
[08:01:09] (Merged) jenkins-bot: group1 to 1.45.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377) (owner: TrainBranchBot)
[08:02:33] (CR) Muehlenhoff: [C:+2] Point webproxy in codfw to install2005 [dns] - https://gerrit.wikimedia.org/r/1182166 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:02:39] !log jmm@dns1004 START - running authdns-update
[08:03:50] !log jmm@dns1004 END - running authdns-update
[08:07:45] (CR) Fabfur: [C:+1] varnish: Remove unused header X-Analytics-TLS [puppet] - https://gerrit.wikimedia.org/r/1181134 (owner: Vgutierrez)
[08:09:29] (CR) Filippo Giunchedi: "Good questions; I have not looked into how exactly prometheus-openstack-exporter gather metrics, though I'm assuming a nova API call indee" [alerts] - https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: Filippo Giunchedi)
[08:13:35] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.16 refs T396377
[08:13:40] T396377: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377
[08:20:07] (CR) Muehlenhoff: [C:+2] Update the proxies used by cloudcumin to install2005 [puppet] - https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:26:36] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122458 (Aklapper) @Josve05a: I could re-post your comment from my bugmail copy, if you want me to?
[08:30:45] PROBLEM - Squid on install2004 is CRITICAL: connect to address 208.80.153.105 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy
[08:31:09] PROBLEM - HTTP on install2004 is CRITICAL: connect to address 208.80.153.105 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers
[08:31:29] PROBLEM - TFTP service on install2004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[08:33:26] is install2004 a new host being setup?
[08:33:39] FIRING: [2x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:11] no, it's an old host being taken down, install2005 is the new one, I'll silence it some more [08:34:19] no worries [08:34:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:35:03] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on install2004.wikimedia.org with reason: being replaced by install2005 [08:35:43] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [08:36:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS trixie [08:56:55] (03PS1) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1182450 (https://phabricator.wikimedia.org/T402611) [08:56:55] (03CR) 10Arnaudb: [C:03+2] "the previous iteration was installing the wrong version of mod-qos on bookworm" [puppet] - 10https://gerrit.wikimedia.org/r/1182450 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [08:57:40] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182496 [09:02:03] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182496 (owner: 10Arnaudb) [09:05:05] (03PS8) 10Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 
(https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [09:08:41] (03PS2) 10Tiziano Fogli: mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) [09:09:07] (03CR) 10CI reject: [V:04-1] mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [09:09:58] (03PS8) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:10:37] (03PS3) 10Tiziano Fogli: mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) [09:15:37] (03CR) 10Tiziano Fogli: "I decided to split the checks into two different profiles because I wasn’t happy about grooming the parameters with regexps, as it didn’t " [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [09:17:15] (03PS5) 10Ayounsi: Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:17:15] (03PS4) 10Ayounsi: Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney) [09:17:29] (03CR) 10Ayounsi: Nokia: module for network-instance configuration (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:19:18] (03PS1) 10Muehlenhoff: Remove my old Neo-based key [puppet] - 10https://gerrit.wikimedia.org/r/1182498 [09:24:47] (03PS9) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:32:32] 
!log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:32:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T401906)', diff saved to https://phabricator.wikimedia.org/P81854 and previous config saved to /var/cache/conftool/dbconfig/20250827-093239-fceratto.json [09:32:44] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:33:02] (03CR) 10Tiziano Fogli: [C:03+2] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:34:22] (03PS1) 10Cathal Mooney: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - 10https://gerrit.wikimedia.org/r/1182500 [09:34:55] (03CR) 10Ayounsi: [C:03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1182500 (owner: 10Cathal Mooney) [09:35:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T401906)', diff saved to https://phabricator.wikimedia.org/P81855 and previous config saved to /var/cache/conftool/dbconfig/20250827-093507-fceratto.json [09:35:29] (03PS2) 10Cathal Mooney: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - 10https://gerrit.wikimedia.org/r/1182500 [09:37:01] (03CR) 10Cathal Mooney: [C:03+2] Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - 10https://gerrit.wikimedia.org/r/1182500 (owner: 10Cathal Mooney) [09:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:20] (03CR) 10Vgutierrez: 
P:puppetserver::volatile generate datacenter database (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:38:23] 06SRE, 06Commons, 06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122675 (10Bugreporter) >>! In T402749#11122138, @Zache wrote: >> However, I remain concerned that a determined attacker or a widely used non-compliant script... [09:39:39] 06SRE, 06Traffic, 10Wikidata, 10Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11122678 (10Lucas_Werkmeister_WMDE) [09:39:54] (03Merged) 10jenkins-bot: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - 10https://gerrit.wikimedia.org/r/1182500 (owner: 10Cathal Mooney) [09:42:47] (03PS1) 10Sergio Gimeno: Revert "changeprop beta: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182501 [09:43:04] (03PS1) 10Sergio Gimeno: Revert "changeprop: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182502 [09:43:48] (03CR) 10Vgutierrez: P:puppetserver::volatile generate datacenter database (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:43:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2005.codfw.wmnet with OS trixie [09:49:39] jmm@cumin2002 reimage (PID 474023) is awaiting input [09:49:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [09:50:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81856 and previous config saved to 
/var/cache/conftool/dbconfig/20250827-095014-fceratto.json [09:53:05] (03PS1) 10Peter Fischer: SUP: upgrade to flink 1.20.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182503 (https://phabricator.wikimedia.org/T398159) [09:53:25] (03PS5) 10Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 [09:53:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122817 (10phaultfinder) [09:54:14] (03CR) 10Vgutierrez: [C:03+2] varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez) [09:56:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie [09:56:44] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [09:58:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122825 (10phaultfinder) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1000) [10:00:45] (03CR) 10Vgutierrez: "we moved this to leverage `X-Provenance` signaled from HAProxy to Varnish: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppe" [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [10:01:57] !log installing libxslt security updates [10:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:05:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81857 and previous config saved to /var/cache/conftool/dbconfig/20250827-100521-fceratto.json [10:07:17] (03PS1) 10Kevin Bazira: ml-services: update revscoring staging image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182506 (https://phabricator.wikimedia.org/T400350) [10:10:06] (03PS1) 10Federico Ceratto: Prepare new es2* nodes to replace old ones [puppet] - 10https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) [10:10:06] (03CR) 10Federico Ceratto: "Deploying new es2* nodes (as discussed on IRC)" [puppet] - 10https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:00] (03CR) 10Majavah: [C:03+2] openstack: puppet: Set user-agent for ENC client script [puppet] - 10https://gerrit.wikimedia.org/r/1179121 (owner: 10Majavah) [10:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:20] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [10:20:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P81859 and previous config saved to /var/cache/conftool/dbconfig/20250827-102029-fceratto.json [10:20:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:20:35] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:20:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81860 and previous config saved to /var/cache/conftool/dbconfig/20250827-102041-fceratto.json [10:21:15] (03PS1) 10Filippo Giunchedi: wmcs: add JobUnavailable alert [alerts] - 10https://gerrit.wikimedia.org/r/1182508 (https://phabricator.wikimedia.org/T402778) [10:23:40] (03PS9) 10Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [10:24:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81861 and previous config saved to /var/cache/conftool/dbconfig/20250827-102414-fceratto.json [10:29:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [10:31:39] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [10:33:00] (03CR) 10FNegri: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1182508 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [10:38:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:38:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T402925)', 
diff saved to https://phabricator.wikimedia.org/P81862 and previous config saved to /var/cache/conftool/dbconfig/20250827-103834-ladsgroup.json [10:38:40] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:39:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81863 and previous config saved to /var/cache/conftool/dbconfig/20250827-103921-fceratto.json [10:41:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [10:41:31] (03CR) 10Urbanecm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182502 (owner: 10Sergio Gimeno) [10:41:43] (03CR) 10Urbanecm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182501 (owner: 10Sergio Gimeno) [10:42:16] (03CR) 10Máté Szabó: "pcc fail seems to be from Ic32d387689d6faabd233c2f357d7a34c7c083949" [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [10:42:29] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11122997 (10Ladsgroup) I looked at them and they seems to be random replicas in random sections. I think they probably need rebalanacing to reduce their load... 
[10:44:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie [10:47:33] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [10:49:32] !log idm2001.wikimedia.org - Update EnvoyProxy to version 1.26.8 - https://phabricator.wikimedia.org/T402584 [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:13] (03PS6) 10STran: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [10:51:01] (03CR) 10CI reject: [V:04-1] Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [10:51:19] (03PS4) 10FNegri: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: 10Zabe) [10:51:21] (03CR) 10Ladsgroup: [C:03+2] maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: 10Zabe) [10:51:23] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: 10Zabe) [10:51:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T402925)', diff saved to https://phabricator.wikimedia.org/P81865 and previous config saved to /var/cache/conftool/dbconfig/20250827-105151-ladsgroup.json [10:51:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:54:03] !log idm1001.wikimedia.org - Update EnvoyProxy to version 
1.26.8 - https://phabricator.wikimedia.org/T402584 [10:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:15] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [10:54:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81866 and previous config saved to /var/cache/conftool/dbconfig/20250827-105428-fceratto.json [10:54:35] (03CR) 10Vgutierrez: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [10:55:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS bookworm [10:55:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031 (10cmooney) 03NEW p:05Triage→03Medium [10:56:01] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181123 (owner: 10PipelineBot) [10:56:56] (03PS7) 10STran: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [10:57:36] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [10:57:42] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181123 (owner: 10PipelineBot) [10:58:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS bookworm [10:59:16] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1100). Please do the needful. 
[11:00:24] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:00:53] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [11:01:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [11:03:03] (03PS1) 10Tiziano Fogli: nrpewrapper: fix max parameters [puppet] - 10https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) [11:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:05:09] (03PS10) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) [11:05:23] (03CR) 10Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [11:05:52] (03CR) 10Vgutierrez: [C:03+1] hcaptcha: Remap upstream Set-Cookie headers to use 
the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [11:06:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P81867 and previous config saved to /var/cache/conftool/dbconfig/20250827-110659-ladsgroup.json [11:07:54] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:08:47] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:08:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035 (10cmooney) 03NEW p:05Triage→03Medium [11:09:07] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:09:20] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:09:36] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6771/console" [puppet] - 10https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [11:09:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81868 and previous config saved to /var/cache/conftool/dbconfig/20250827-110936-fceratto.json [11:09:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11123210 (10cmooney) [11:09:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:09:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2178.codfw.wmnet with reason: Maintenance [11:09:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11123211 (10cmooney) [11:09:42] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:09:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T401906)', diff saved to https://phabricator.wikimedia.org/P81869 and previous config saved to /var/cache/conftool/dbconfig/20250827-110948-fceratto.json [11:10:54] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6772/console" [puppet] - 10https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [11:11:04] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:11:43] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:12:20] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:52] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:12:56] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:13:03] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. 
[11:13:03] (03Abandoned) 10Tiziano Fogli: nrpewrapper: fix max parameters [puppet] - 10https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [11:13:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T401906)', diff saved to https://phabricator.wikimedia.org/P81870 and previous config saved to /var/cache/conftool/dbconfig/20250827-111320-fceratto.json [11:13:23] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:13:43] (03PS1) 10Tiziano Fogli: Revert "nrpewrapper: correlate Prometheus "for:" duration with Icinga timing" [puppet] - 10https://gerrit.wikimedia.org/r/1182512 [11:13:52] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:14:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:15:35] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:16:17] (03CR) 10Tiziano Fogli: [C:03+2] Revert "nrpewrapper: correlate Prometheus "for:" duration with Icinga timing" [puppet] - 10https://gerrit.wikimedia.org/r/1182512 (owner: 10Tiziano Fogli) [11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:44] (03PS11) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) [11:19:04] (03CR) 10Máté Szabó: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [11:19:40] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:20:01] jmm@cumin2002 reimage (PID 519131) is awaiting input [11:22:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P81871 and previous config saved to /var/cache/conftool/dbconfig/20250827-112206-ladsgroup.json [11:28:15] (03CR) 10Dreamy Jazz: [C:03+1] Enable temporary accounts on remaining small-sized projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: 10Tchanders) [11:28:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81872 and previous config saved to /var/cache/conftool/dbconfig/20250827-112827-fceratto.json [11:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:29:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm [11:33:37] jmm@cumin2002 reimage (PID 533938) is awaiting input [11:34:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS bookworm