[00:08:08] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178109 [00:08:08] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178109 (owner: TrainBranchBot) [00:09:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P81171 and previous config saved to /var/cache/conftool/dbconfig/20250813-000859-ladsgroup.json [00:24:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T400854)', diff saved to https://phabricator.wikimedia.org/P81172 and previous config saved to /var/cache/conftool/dbconfig/20250813-002407-ladsgroup.json [00:24:11] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [00:24:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [00:24:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81173 and previous config saved to /var/cache/conftool/dbconfig/20250813-002430-ladsgroup.json [00:27:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81174 and previous config saved to /var/cache/conftool/dbconfig/20250813-002722-ladsgroup.json [00:29:19] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178109 (owner: TrainBranchBot) [00:37:12] !log Deployed security mitigation for T401266 [00:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P81175
and previous config saved to /var/cache/conftool/dbconfig/20250813-004230-ladsgroup.json [00:56:40] !log Deployed updated security mitigation for T401266 [00:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P81176 and previous config saved to /var/cache/conftool/dbconfig/20250813-005737-ladsgroup.json [01:00:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:27] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 44s) [01:12:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81177 and previous config saved to /var/cache/conftool/dbconfig/20250813-011245-ladsgroup.json [01:12:50] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [01:13:01] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [01:13:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T400854)', diff saved to https://phabricator.wikimedia.org/P81178 and previous config saved to /var/cache/conftool/dbconfig/20250813-011308-ladsgroup.json [01:28:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T400854)', diff saved to https://phabricator.wikimedia.org/P81179 and previous config saved to /var/cache/conftool/dbconfig/20250813-012849-ladsgroup.json [01:28:55] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [01:35:29] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:35:31] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:40:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:40:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:43:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P81180 and previous config saved to /var/cache/conftool/dbconfig/20250813-014357-ladsgroup.json [01:51:56] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11080875 (Midleading) @TheDJ Thanks. Hope this will help more people who are confused like me.
[01:59:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P81181 and previous config saved to /var/cache/conftool/dbconfig/20250813-015904-ladsgroup.json [02:06:52] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [02:14:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T400854)', diff saved to https://phabricator.wikimedia.org/P81182 and previous config saved to /var/cache/conftool/dbconfig/20250813-021411-ladsgroup.json [02:14:17] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [02:14:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [02:14:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81183 and previous config saved to /var/cache/conftool/dbconfig/20250813-021434-ladsgroup.json [02:17:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81184 and previous config saved to /var/cache/conftool/dbconfig/20250813-021708-ladsgroup.json [02:29:17] andrew@cumin2002 reimage (PID 2829625) is awaiting input [02:32:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P81185 and previous config saved to /var/cache/conftool/dbconfig/20250813-023215-ladsgroup.json [02:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [02:47:24] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P81186 and previous config saved to /var/cache/conftool/dbconfig/20250813-024723-ladsgroup.json [03:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:02:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81187 and previous config saved to /var/cache/conftool/dbconfig/20250813-030231-ladsgroup.json [03:02:37] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [03:02:47] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [03:02:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T400854)', diff saved to https://phabricator.wikimedia.org/P81188 and previous config saved to /var/cache/conftool/dbconfig/20250813-030254-ladsgroup.json [03:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:07:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T400854)', diff saved to https://phabricator.wikimedia.org/P81189 and previous config saved to /var/cache/conftool/dbconfig/20250813-030729-ladsgroup.json [03:22:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to 
https://phabricator.wikimedia.org/P81190 and previous config saved to /var/cache/conftool/dbconfig/20250813-032237-ladsgroup.json [03:33:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 1.003e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [03:37:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P81191 and previous config saved to /var/cache/conftool/dbconfig/20250813-033745-ladsgroup.json [03:48:44] (PS1) Anzx: Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178146 [03:48:53] (CR) CI reject: [V:-1] Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178146 (owner: Anzx) [03:49:36] (PS1) Anzx: Revert^2 "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178148 [03:52:26] (PS2) Anzx: Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178146 [03:52:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T400854)', diff saved to https://phabricator.wikimedia.org/P81192 and previous config saved to /var/cache/conftool/dbconfig/20250813-035252-ladsgroup.json [03:53:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [03:53:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T400854)', diff saved to https://phabricator.wikimedia.org/P81193 and previous config saved to
/var/cache/conftool/dbconfig/20250813-035315-ladsgroup.json [03:55:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T400854)', diff saved to https://phabricator.wikimedia.org/P81194 and previous config saved to /var/cache/conftool/dbconfig/20250813-035552-ladsgroup.json [04:11:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P81195 and previous config saved to /var/cache/conftool/dbconfig/20250813-041100-ladsgroup.json [04:17:04] SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11080991 (ecarg) Hi @RLazarus ~ 1. I believe we will use the metrics we have been using to start. Based off of @el... [04:24:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81196 and previous config saved to /var/cache/conftool/dbconfig/20250813-042458-fceratto.json [04:25:03] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [04:26:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P81197 and previous config saved to /var/cache/conftool/dbconfig/20250813-042607-ladsgroup.json [04:40:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P81198 and previous config saved to /var/cache/conftool/dbconfig/20250813-044006-fceratto.json [04:41:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T400854)', diff saved to https://phabricator.wikimedia.org/P81199 and previous config saved to /var/cache/conftool/dbconfig/20250813-044115-ladsgroup.json [04:41:19] T400854: Add
rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [04:41:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance [04:41:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T400854)', diff saved to https://phabricator.wikimedia.org/P81200 and previous config saved to /var/cache/conftool/dbconfig/20250813-044138-ladsgroup.json [04:44:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T400854)', diff saved to https://phabricator.wikimedia.org/P81201 and previous config saved to /var/cache/conftool/dbconfig/20250813-044408-ladsgroup.json [04:55:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P81202 and previous config saved to /var/cache/conftool/dbconfig/20250813-045514-fceratto.json [04:59:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P81203 and previous config saved to /var/cache/conftool/dbconfig/20250813-045915-ladsgroup.json [05:08:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81204 and previous config saved to /var/cache/conftool/dbconfig/20250813-051022-fceratto.json [05:10:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [05:10:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 
db2210.codfw.wmnet with reason: Maintenance [05:10:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T399249)', diff saved to https://phabricator.wikimedia.org/P81205 and previous config saved to /var/cache/conftool/dbconfig/20250813-051045-fceratto.json [05:14:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P81206 and previous config saved to /var/cache/conftool/dbconfig/20250813-051422-ladsgroup.json [05:18:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:29:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T400854)', diff saved to https://phabricator.wikimedia.org/P81207 and previous config saved to /var/cache/conftool/dbconfig/20250813-052930-ladsgroup.json [05:29:35] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [05:29:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [05:30:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1251.eqiad.wmnet with reason: Maintenance [05:30:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T400854)', diff saved to https://phabricator.wikimedia.org/P81208 and previous config saved to /var/cache/conftool/dbconfig/20250813-053052-ladsgroup.json [05:33:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T400854)', diff saved to https://phabricator.wikimedia.org/P81209 and previous config saved to 
/var/cache/conftool/dbconfig/20250813-053330-ladsgroup.json [05:46:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490) (owner: KartikMistry) [05:48:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P81210 and previous config saved to /var/cache/conftool/dbconfig/20250813-054839-ladsgroup.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T0600) [06:03:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P81211 and previous config saved to /var/cache/conftool/dbconfig/20250813-060347-ladsgroup.json [06:13:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:14:32] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:14:42] (PS2) Arnaudb: gerrit: add spare fqdn to apache vhost [puppet] - https://gerrit.wikimedia.org/r/1178172 (https://phabricator.wikimedia.org/T387833) [06:16:48] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1178181 [06:18:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T400854)', diff saved to
https://phabricator.wikimedia.org/P81212 and previous config saved to /var/cache/conftool/dbconfig/20250813-061854-ladsgroup.json [06:18:59] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [06:19:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [06:19:20] (CR) Muehlenhoff: [V:+2 C:+2] Remove dummy keytab for sretest1001 (decommed) [labs/private] - https://gerrit.wikimedia.org/r/1169040 (owner: Muehlenhoff) [06:20:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [06:20:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T400854)', diff saved to https://phabricator.wikimedia.org/P81213 and previous config saved to /var/cache/conftool/dbconfig/20250813-062018-ladsgroup.json [06:20:28] (PS15) Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1172625 (https://phabricator.wikimedia.org/T338470) [06:20:56] (CR) Arnaudb: "done!
thanks for the highlight 😊" [puppet] - https://gerrit.wikimedia.org/r/1174672 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [06:22:38] (PS1) Muehlenhoff: Remove obsolete Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1178183 [06:23:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T400854)', diff saved to https://phabricator.wikimedia.org/P81214 and previous config saved to /var/cache/conftool/dbconfig/20250813-062303-ladsgroup.json [06:24:04] (CR) Arnaudb: [C:+1] lists: delete unused apache.conf.erb [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [06:27:41] (CR) Muehlenhoff: [C:+1] "LGTM, one final nit inline" [puppet] - https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: Dzahn) [06:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [06:38:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81215 and previous config saved to /var/cache/conftool/dbconfig/20250813-063811-ladsgroup.json [06:53:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81216 and previous config saved to /var/cache/conftool/dbconfig/20250813-065318-ladsgroup.json [06:57:18] (PS2) KartikMistry: Section Translation: Add Arakan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490) [07:00:04] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:25] here. I'll go ahead with this simple deployment.. [07:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:02:26] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490) (owner: KartikMistry) [07:03:19] (Merged) jenkins-bot: Section Translation: Add Arakan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490) (owner: KartikMistry) [07:04:09] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1178036|Section Translation: Add Arakan Wikipedia (T392490)]] [07:04:12] T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490 [07:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:05:42] !incidents [07:05:43] 6591 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet eqsin) [07:05:43] 6584 (RESOLVED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:44] 6583 (RESOLVED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:44] 6589 (RESOLVED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:44] 6588 (RESOLVED) db2152
(paged)/MariaDB Replica Lag: s8 (paged) [07:05:44] 6587 (RESOLVED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:45] 6585 (RESOLVED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:45] 6590 (RESOLVED) db2181 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:45] 6586 (RESOLVED) db2167 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:46] 6580 (RESOLVED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:46] 6574 (RESOLVED) db2167 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:47] 6579 (RESOLVED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:47] 6578 (RESOLVED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [07:05:48] 6576 (RESOLVED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [07:06:00] ok... long night :) [07:06:30] !log kartik@deploy1003 kartik: Backport for [[gerrit:1178036|Section Translation: Add Arakan Wikipedia (T392490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:08:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T400854)', diff saved to https://phabricator.wikimedia.org/P81218 and previous config saved to /var/cache/conftool/dbconfig/20250813-070826-ladsgroup.json [07:08:31] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [07:08:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [07:08:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T400854)', diff saved to https://phabricator.wikimedia.org/P81219 and previous config saved to /var/cache/conftool/dbconfig/20250813-070849-ladsgroup.json [07:10:00] !log kartik@deploy1003 kartik: Continuing with sync [07:11:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T400854)', diff saved to https://phabricator.wikimedia.org/P81220 and previous config 
saved to /var/cache/conftool/dbconfig/20250813-071135-ladsgroup.json [07:15:15] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178036|Section Translation: Add Arakan Wikipedia (T392490)]] (duration: 11m 06s) [07:15:19] T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490 [07:15:59] (CR) Vgutierrez: [C:+1] profile,prometheus,haproxykafka: support for rdkafka metrics [puppet] - https://gerrit.wikimedia.org/r/1178001 (https://phabricator.wikimedia.org/T400978) (owner: Fabfur) [07:26:32] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11081290 (VRiley-WMF) Hey @MatthewVernon Is there a specific time or order you'd like to schedule these upgrades? [07:26:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81221 and previous config saved to /var/cache/conftool/dbconfig/20250813-072643-ladsgroup.json [07:33:31] (CR) Fabfur: [C:+2] profile,prometheus,haproxykafka: support for rdkafka metrics [puppet] - https://gerrit.wikimedia.org/r/1178001 (https://phabricator.wikimedia.org/T400978) (owner: Fabfur) [07:41:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81222 and previous config saved to /var/cache/conftool/dbconfig/20250813-074150-ladsgroup.json [07:42:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet [07:43:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [07:46:06] (PS3) Arnaudb: nftables: throttle debugging [puppet] - https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) [07:46:06] (CR) Arnaudb: "we could try that config on the remaining instances with the policy `accept` so we can ensure
there is no unintended behavior" [puppet] - https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: Arnaudb) [07:48:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [07:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet [07:52:49] !log manually upgrading haproxykafka on cp1111 to test new metrics (T400978) [07:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:53] T400978: HaproxyKafka: expose librdkafka metrics - https://phabricator.wikimedia.org/T400978 [07:56:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T400854)', diff saved to https://phabricator.wikimedia.org/P81223 and previous config saved to /var/cache/conftool/dbconfig/20250813-075658-ladsgroup.json [07:57:03] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [07:57:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [07:57:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T400854)', diff saved to https://phabricator.wikimedia.org/P81224 and previous config saved to /var/cache/conftool/dbconfig/20250813-075721-ladsgroup.json [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T0800) [08:00:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T400854)', diff saved to https://phabricator.wikimedia.org/P81225 and previous config saved to /var/cache/conftool/dbconfig/20250813-080008-ladsgroup.json [08:14:28] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy -
https://phabricator.wikimedia.org/T400119#11081407 (Joe) >>! In T400119#11076853, @TheDJ wrote: > @Midleading you are always supposed to have a user-agent. Api-user-agent is just for situations where you are unable to... [08:16:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81226 and previous config saved to /var/cache/conftool/dbconfig/20250813-081516-ladsgroup.json [08:17:36] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11081409 (Joe) >>! In T400119#11076470, @Midleading wrote: > Please be more clear about the UA policy enforced here. I am always setting the `Api-User-Agent` header in my code... [08:18:11] (CR) Fabfur: [C:+1] varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: Giuseppe Lavagetto) [08:21:04] (PS2) Ilias Sarantopoulos: ores-extension: add threshold for revertrisk in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590) [08:30:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81227 and previous config saved to /var/cache/conftool/dbconfig/20250813-083023-ladsgroup.json [08:34:07] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1065.eqiad.wmnet [08:34:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-worker1065.eqiad.wmnet [08:35:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet [08:37:32] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1065.eqiad.wmnet [08:37:47] !log
btullis@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-worker1065.eqiad.wmnet [08:42:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet [08:43:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet [08:45:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T400854)', diff saved to https://phabricator.wikimedia.org/P81228 and previous config saved to /var/cache/conftool/dbconfig/20250813-084530-ladsgroup.json [08:45:47] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [08:45:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T400854)', diff saved to https://phabricator.wikimedia.org/P81229 and previous config saved to /var/cache/conftool/dbconfig/20250813-084554-ladsgroup.json [08:48:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T400854)', diff saved to https://phabricator.wikimedia.org/P81230 and previous config saved to /var/cache/conftool/dbconfig/20250813-084838-ladsgroup.json [08:50:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet [08:52:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [08:55:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1053.eqiad.wmnet to cluster eqiad and group A [08:56:21] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" [08:56:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" 
[08:56:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:00:30] jmm@cumin2002 addnode (PID 3027142) is awaiting input [09:01:45] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... [09:01:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [09:02:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11081707 (10MatthewVernon) Hi @VRiley-WMF, this is a good question, with a slightly tedious answer: - Before a system can have its controller swapped, it... [09:03:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81231 and previous config saved to /var/cache/conftool/dbconfig/20250813-090346-ladsgroup.json [09:06:24] (03PS2) 10Stevemunene: zookeeper: Remove an-druid1002 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) [09:06:30] (03PS1) 10DCausse: search: CirrusStreamingUpdaterFlinkNoRegisteredTask do not check backfill jobs [alerts] - 10https://gerrit.wikimedia.org/r/1178483 [09:06:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1053.eqiad.wmnet to cluster eqiad and group A [09:06:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1054.eqiad.wmnet to cluster eqiad and group A [09:07:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) 
{#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:08:06] (03CR) 10Stevemunene: [C:03+2] zookeeper: Remove an-druid1002 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene) [09:08:14] (03CR) 10Peter Fischer: [C:03+2] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1178483 (owner: 10DCausse) [09:08:29] (03CR) 10Guilherme Gonçalves: [C:03+1] "Thanks for getting to this so quickly!" [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [09:09:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:10:04] (03Merged) 10jenkins-bot: search: CirrusStreamingUpdaterFlinkNoRegisteredTask do not check backfill jobs [alerts] - 10https://gerrit.wikimedia.org/r/1178483 (owner: 10DCausse) [09:10:06] !log Set newprojects mailman list to moderate posts from nonmembers (previous: discard) to debug an issue with new projects announcements (T393444) [09:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:10] T393444: Wiki creations not being reported to newprojects list - https://phabricator.wikimedia.org/T393444 [09:10:23] jmm@cumin2002 addnode (PID 3034190) is awaiting input [09:11:54] (03PS1) 10MVernon: thanos: remove thanos-be1005 from rings [puppet] - 10https://gerrit.wikimedia.org/r/1178484 (https://phabricator.wikimedia.org/T400877) [09:11:55] !log restarting ATS on cp5017 [09:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:09] jmm@cumin2002 addnode (PID 3034190) is awaiting input [09:17:10] !log btullis@cumin1003 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1065.eqiad.wmnet [09:17:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-worker1065.eqiad.wmnet [09:18:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81233 and previous config saved to /var/cache/conftool/dbconfig/20250813-091853-ladsgroup.json [09:19:42] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:22:15] (03PS4) 10Arnaudb: nftables: throttle debugging [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) [09:25:43] (03CR) 10Federico Ceratto: [C:03+1] "I reviewed the hostname and it matches the description." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178484 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [09:25:45] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [09:29:06] !log restarting varnish on cp5017 [09:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti1053 / ganeti1054 to the production cluster - https://phabricator.wikimedia.org/T401691#11081789 (10MoritzMuehlenhoff) [09:31:28] (03CR) 10MVernon: [C:03+2] thanos: remove thanos-be1005 from rings [puppet] - 10https://gerrit.wikimedia.org/r/1178484 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [09:32:13] (03PS1) 10Muehlenhoff: Remove obsolete mw1-raid1-lvm Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1178488 (https://phabricator.wikimedia.org/T156955) [09:32:35] (03CR) 10Jelto: nftables: throttle debugging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [09:32:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:33:58] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1065.eqiad.wmnet [09:34:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T400854)', diff saved to https://phabricator.wikimedia.org/P81234 and previous config saved to /var/cache/conftool/dbconfig/20250813-093401-ladsgroup.json [09:34:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 
[09:34:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [09:34:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T400854)', diff saved to https://phabricator.wikimedia.org/P81235 and previous config saved to /var/cache/conftool/dbconfig/20250813-093423-ladsgroup.json [09:34:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:37:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T400854)', diff saved to https://phabricator.wikimedia.org/P81236 and previous config saved to /var/cache/conftool/dbconfig/20250813-093710-ladsgroup.json [09:39:31] (03PS1) 10Majavah: P:toolforge::proxy: Collect network error reports [puppet] - 10https://gerrit.wikimedia.org/r/1178489 (https://phabricator.wikimedia.org/T303725) [09:40:05] (03PS2) 10Majavah: P:toolforge::proxy: Collect network error reports [puppet] - 10https://gerrit.wikimedia.org/r/1178489 (https://phabricator.wikimedia.org/T400994) [09:41:06] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:44:04] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178488 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:44:31] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [09:46:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1065.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [09:46:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:46:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1065.eqiad.wmnet [09:52:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81237 and previous config saved to /var/cache/conftool/dbconfig/20250813-095217-ladsgroup.json [09:56:09] (03CR) 10Clément Goubert: [C:03+1] P:docker: Add trixie as a known base image [puppet] - 10https://gerrit.wikimedia.org/r/1177995 (owner: 10Majavah) [09:56:09] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1193.eqiad.wmnet with reason: Maintenance [09:56:31] (03CR) 10Majavah: [C:03+2] P:docker: Add trixie as a known base image [puppet] - 10https://gerrit.wikimedia.org/r/1177995 (owner: 10Majavah) [09:57:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1000) [10:00:54] jmm@cumin2002 addnode (PID 3034190) is awaiting input [10:05:28] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:07:25] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81238 and previous config saved to /var/cache/conftool/dbconfig/20250813-100724-ladsgroup.json [10:11:17] btullis@cumin1003 netbox (PID 1832158) is awaiting input [10:12:03] (03PS1) 10Fabfur: profile,prometheus,haproxykafka: added producer metrics [puppet] - 10https://gerrit.wikimedia.org/r/1178500 (https://phabricator.wikimedia.org/T400978) [10:12:54] !log installing openssl updates on Bookworm [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:13:06] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:15:12] (03PS2) 10Fabfur: profile,prometheus,haproxykafka: added producer metrics [puppet] - 10https://gerrit.wikimedia.org/r/1178500 (https://phabricator.wikimedia.org/T400978) [10:16:46] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:17:34] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:18:28] (03CR) 10Vgutierrez: [C:03+1] "[nitpick] update commit message since you're dropping consumer metrics for haproxykafka" [puppet] - 10https://gerrit.wikimedia.org/r/1178500 (https://phabricator.wikimedia.org/T400978) (owner: 10Fabfur) [10:22:13] (03CR) 10Fabfur: [C:03+2] profile,prometheus,haproxykafka: added producer metrics [puppet] - 10https://gerrit.wikimedia.org/r/1178500 (https://phabricator.wikimedia.org/T400978) (owner: 10Fabfur) [10:22:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T400854)', diff saved to https://phabricator.wikimedia.org/P81239 and previous config saved to /var/cache/conftool/dbconfig/20250813-102232-ladsgroup.json [10:22:36] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:22:37] !log ladsgroup@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:22:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T400854)', diff saved to https://phabricator.wikimedia.org/P81240 and previous config saved to /var/cache/conftool/dbconfig/20250813-102243-ladsgroup.json [10:23:25] btullis@cumin1003 netbox (PID 1832546) is awaiting input [10:24:36] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9618 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:25:10] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete mw1-raid1-lvm Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1178488 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:25:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T400854)', diff saved to https://phabricator.wikimedia.org/P81241 and previous config saved to /var/cache/conftool/dbconfig/20250813-102527-ladsgroup.json [10:25:32] fabfur: I'll merge your Puppet patch along, ok? 
[10:26:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti1053 / ganeti1054 to the production cluster - https://phabricator.wikimedia.org/T401691#11082068 (10MoritzMuehlenhoff) [10:26:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti1053 / ganeti1054 to the production cluster - https://phabricator.wikimedia.org/T401691#11082069 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium All done [10:30:22] moritzm: thanks sorry [10:30:39] fabfur: ack, now merged [10:31:10] !log upgrading haproxykafka to v 0.3.14+deb11u2 on A:cp [10:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:16] !log installing openssl updates on Bookworm [10:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming and reprovisioning an-worker1065 as an-backup-datanode1001 - btullis@cumin1003" [10:33:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming and reprovisioning an-worker1065 as an-backup-datanode1001 - btullis@cumin1003" [10:33:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:42] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [10:39:33] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1001 [10:40:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P81242 and previous config saved to 
/var/cache/conftool/dbconfig/20250813-104034-ladsgroup.json [10:40:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1001 [10:45:00] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11082131 (10Midleading) I shut down the bot and upgraded it. [10:52:30] (03PS1) 10Btullis: Add the new an-backup-datanode servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1178507 (https://phabricator.wikimedia.org/T397166) [10:55:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P81243 and previous config saved to /var/cache/conftool/dbconfig/20250813-105542-ladsgroup.json [10:57:36] (03CR) 10Jforrester: "This'll need a bump to the chart version number, as we're changing the base chart rather than the run-time config." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178076 (https://phabricator.wikimedia.org/T400515) (owner: 10Cory Massaro) [11:00:04] mvolz: #bothumor My software never has bugs. It just develops random features. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1100). [11:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:02:34] PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:03:12] (03CR) 10Stevemunene: [C:03+1] "looka good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178507 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [11:04:06] (03PS3) 10Jforrester: wikifunctions: Bump orchestrator memory to the max allowed 3GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178076 (https://phabricator.wikimedia.org/T400515) (owner: 10Cory Massaro) [11:04:33] jouncebot: nowandnext [11:04:34] For the next 0 hour(s) and 55 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1100) [11:04:34] In 1 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1300) [11:04:42] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:04:43] (03CR) 10Jforrester: [C:03+2] wikifunctions: Bump orchestrator memory to the max allowed 3GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178076 (https://phabricator.wikimedia.org/T400515) (owner: 10Cory Massaro) [11:04:58] (03CR) 10Fabfur: [C:03+1] "lgtm, considering also vgutierrez comments" [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [11:06:32] (03Merged) 10jenkins-bot: wikifunctions: Bump orchestrator memory to the max allowed 3GiB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178076 (https://phabricator.wikimedia.org/T400515) (owner: 10Cory Massaro) [11:06:55] (03PS1) 10Majavah: P:wmcs::terraform: Refresh registry name [puppet] - 10https://gerrit.wikimedia.org/r/1178508 (https://phabricator.wikimedia.org/T401814) [11:07:13] (03PS1) 10Giuseppe Lavagetto: hiddenparma: Add datacenters to the config [puppet] - 
10https://gerrit.wikimedia.org/r/1178509 [11:07:21] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:07:29] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:08:33] (03CR) 10Clément Goubert: [C:03+1] hiddenparma: Add datacenters to the config [puppet] - 10https://gerrit.wikimedia.org/r/1178509 (owner: 10Giuseppe Lavagetto) [11:09:05] (03CR) 10Vgutierrez: [C:03+1] hiddenparma: Add datacenters to the config [puppet] - 10https://gerrit.wikimedia.org/r/1178509 (owner: 10Giuseppe Lavagetto) [11:09:57] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] hiddenparma: Add datacenters to the config [puppet] - 10https://gerrit.wikimedia.org/r/1178509 (owner: 10Giuseppe Lavagetto) [11:09:57] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:10:20] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:10:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T400854)', diff saved to https://phabricator.wikimedia.org/P81244 and previous config saved to /var/cache/conftool/dbconfig/20250813-111049-ladsgroup.json [11:10:54] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [11:11:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1054.eqiad.wmnet to cluster eqiad and group A [11:11:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [11:11:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T400854)', diff saved to https://phabricator.wikimedia.org/P81245 and previous config saved to /var/cache/conftool/dbconfig/20250813-111112-ladsgroup.json [11:11:26] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply 
[11:11:59] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:12:19] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:13:24] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:13:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T400854)', diff saved to https://phabricator.wikimedia.org/P81246 and previous config saved to /var/cache/conftool/dbconfig/20250813-111351-ladsgroup.json [11:13:54] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:25] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1178183 (owner: 10Muehlenhoff) [11:18:30] (03CR) 10Cathal Mooney: [C:03+1] "Wow impressive work! Been a while since I looked at expect or rancid config (damn it is ugly). 
LGTM overall, I will be honest and say I'" [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [11:19:40] (03CR) 10Muehlenhoff: [C:03+2] cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [11:22:43] (03CR) 10Cathal Mooney: [C:03+1] Rancid: add SR-Linux support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [11:22:50] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11082227 (10MoritzMuehlenhoff) [11:27:08] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11082254 (10MatthewVernon) [11:27:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11082255 (10MatthewVernon) @VRiley-WMF thanos-be1005 is now ready to have its controller swapped, so please go ahead - it's out of production service now. 
[11:28:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P81247 and previous config saved to /var/cache/conftool/dbconfig/20250813-112858-ladsgroup.json [11:29:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:29:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:29:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [11:31:25] (03Abandoned) 10Slyngshede: data.yaml: add users as ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374) (owner: 10Slyngshede) [11:33:12] (03PS1) 10Gkyziridis: ml-services: Deploy new edit-check model version. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178518 (https://phabricator.wikimedia.org/T401696) [11:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [11:35:53] (03CR) 10Gkyziridis: "Lets deploy it on staging first and then on prod." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1178518 (https://phabricator.wikimedia.org/T401696) (owner: 10Gkyziridis) [11:36:10] (03PS1) 10Stevemunene: dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178526 (https://phabricator.wikimedia.org/T397293) [11:39:49] (03PS1) 10Muehlenhoff: Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) [11:44:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P81248 and previous config saved to /var/cache/conftool/dbconfig/20250813-114406-ladsgroup.json [11:45:00] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-01-154925 to 2025-08-13-113934 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178529 (https://phabricator.wikimedia.org/T399424) [11:47:31] (03CR) 10Majavah: [V:03+2 C:03+2] Add python-trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah) [11:56:19] (03PS5) 10Arnaudb: nftables: throttle debugging [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) [11:58:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T399249)', diff saved to https://phabricator.wikimedia.org/P81249 and previous config saved to /var/cache/conftool/dbconfig/20250813-115803-fceratto.json [11:58:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:59:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T400854)', diff saved to https://phabricator.wikimedia.org/P81250 and previous config saved to /var/cache/conftool/dbconfig/20250813-115913-ladsgroup.json [11:59:18] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 
[11:59:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [11:59:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T400854)', diff saved to https://phabricator.wikimedia.org/P81251 and previous config saved to /var/cache/conftool/dbconfig/20250813-115937-ladsgroup.json [12:02:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T400854)', diff saved to https://phabricator.wikimedia.org/P81252 and previous config saved to /var/cache/conftool/dbconfig/20250813-120212-ladsgroup.json [12:03:32] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178518 (https://phabricator.wikimedia.org/T401696) (owner: 10Gkyziridis) [12:13:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P81253 and previous config saved to /var/cache/conftool/dbconfig/20250813-121311-fceratto.json [12:15:30] (03CR) 10Btullis: [C:03+1] dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178526 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [12:17:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81254 and previous config saved to /var/cache/conftool/dbconfig/20250813-121719-ladsgroup.json [12:17:45] (03CR) 10Stevemunene: [C:03+2] dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178526 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [12:23:46] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [12:26:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1001.eqiad.wmnet with OS 
bookworm [12:28:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P81255 and previous config saved to /var/cache/conftool/dbconfig/20250813-122818-fceratto.json [12:30:16] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica [12:32:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81256 and previous config saved to /var/cache/conftool/dbconfig/20250813-123226-ladsgroup.json [12:39:29] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica [12:40:04] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica [12:41:43] (03PS1) 10Stevemunene: dse-k8s: dibable dse-k8s-codfw bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1178534 (https://phabricator.wikimedia.org/T397293) [12:43:15] (03PS2) 10Stevemunene: dse-k8s: disable dse-k8s-codfw bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1178534 (https://phabricator.wikimedia.org/T397293) [12:43:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T399249)', diff saved to https://phabricator.wikimedia.org/P81257 and previous config saved to /var/cache/conftool/dbconfig/20250813-124326-fceratto.json [12:43:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:43:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2219.codfw.wmnet with reason: Maintenance [12:43:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81258 and previous config saved to 
/var/cache/conftool/dbconfig/20250813-124348-fceratto.json [12:47:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T400854)', diff saved to https://phabricator.wikimedia.org/P81259 and previous config saved to /var/cache/conftool/dbconfig/20250813-124734-ladsgroup.json [12:47:38] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [12:47:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2202.codfw.wmnet with reason: Maintenance [12:48:43] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2203.codfw.wmnet with reason: Maintenance [12:48:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81260 and previous config saved to /var/cache/conftool/dbconfig/20250813-124849-ladsgroup.json [12:49:03] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica [12:50:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81261 and previous config saved to /var/cache/conftool/dbconfig/20250813-125036-ladsgroup.json [12:51:24] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401651#11082425 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm moved one server to a different breaker. 
has not alerted again since 2025-08-12 15:48:27 [12:52:54] (03CR) 10Clément Goubert: [C:03+1] Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [12:53:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178146 (owner: 10Anzx) [12:54:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178148 (owner: 10Anzx) [13:00:06] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1300). [13:00:06] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:00:26] o/ [13:01:09] I guess I can deploy ^^ though I need a moment to look at the changes first [13:01:48] PROBLEM - SSH on build2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:00] FIRING: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... [13:02:00] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [13:02:36] hm.. 
I thought we adjusted this alert ^, looking [13:03:38] RECOVERY - SSH on build2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:04:41] ok, based on the log yesterday I think it’s reasonable to assume that the deployments were indeed not related to the 5xx errors https://wm-bot.wmcloud.org/browser/index.php?start=08%2F12%2F2025&end=08%2F13%2F2025&display=%23wikimedia-operations [13:04:54] so unless someone shouts I’ll go ahead with those deploys to restore the config changes [13:05:14] (cc cjming but it’s probably too early in their timezone yet ^^) [13:06:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178148 (owner: 10Anzx) [13:06:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178146 (owner: 10Anzx) [13:07:13] (03Merged) 10jenkins-bot: Revert^2 "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178148 (owner: 10Anzx) [13:07:14] (03Merged) 10jenkins-bot: Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178146 (owner: 10Anzx) [13:07:42] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1178148|Revert^2 "madwikisource: set metanamespace, sitename and timezone"]], [[gerrit:1178146|Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases"]] [13:08:02] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [13:08:25] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:30] (03PS1) 10DCausse: search: fix CirrusStreamingUpdaterFlinkNoRegisteredTask [alerts] - 10https://gerrit.wikimedia.org/r/1178540 [13:09:16] 👀 /var/lib/spiderpig/scap-image-build-and-push-log**.next** [13:09:21] (03CR) 10DCausse: [C:03+2] search: fix CirrusStreamingUpdaterFlinkNoRegisteredTask [alerts] - 10https://gerrit.wikimedia.org/r/1178540 (owner: 10DCausse) [13:09:33] (are we building php 8.3 images already?) [13:09:53] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1178148|Revert^2 "madwikisource: set metanamespace, sitename and timezone"]], [[gerrit:1178146|Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:09:58] Lucas_WMDE: checking [13:10:56] weird, I don’t see any difference in the namespaces on minwikibooks [13:10:58] (03Merged) 10jenkins-bot: search: fix CirrusStreamingUpdaterFlinkNoRegisteredTask [alerts] - 10https://gerrit.wikimedia.org/r/1178540 (owner: 10DCausse) [13:11:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2165.codfw.wmnet [13:12:56] AFAICT the minwikibooks project talk namespace was already Rundiang Wikibuku before 🤷 [13:12:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2165 - Upgrading db2165.codfw.wmnet [13:13:06] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2165 - Upgrading db2165.codfw.wmnet [13:14:10] (03CR) 10Muehlenhoff: "The role is also used on phab1004/2002, which are still on Bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [13:14:14] (03CR) 10BBlack: [C:03+1] varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [13:14:15] (re 
php 8.3, looks like there’s been movement at T401254 and related tasks indeed, niiiiice) [13:14:15] T401254: Upgrade mw-debug/next to PHP 8.3 - https://phabricator.wikimedia.org/T401254 [13:16:07] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:16:09] Lucas_WMDE: i see change from yesterday, looks good to sync [13:16:13] ok, thanks! [13:16:17] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:17:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590) (owner: 10Ilias Sarantopoulos) [13:17:04] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11082555 (10tappof) 05In progress→03Resolved Closing this task for now. Don’t hesitate to reach out if anything unexpected...
[13:17:59] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11082559 (10tappof) [13:19:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590) (owner: 10Ilias Sarantopoulos) [13:20:47] (03PS1) 10Tiziano Fogli: sre-access-requests: add egardner to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1178542 [13:20:55] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2165 gradually with 4 steps - Upgrade of db2165.codfw.wmnet completed [13:21:16] (03PS2) 10Tiziano Fogli: sre-access-requests: add egardner to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1178542 (https://phabricator.wikimedia.org/T401622) [13:21:23] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178148|Revert^2 "madwikisource: set metanamespace, sitename and timezone"]], [[gerrit:1178146|Revert^2 "minwikibooks , zghwiktionary : add project talk namespace aliases"]] (duration: 13m 40s) [13:21:27] Lucas_WMDE: thanks for deploying, please run namespacedupes for zghwiktionary and minwikibooks [13:22:46] ack [13:23:25] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:34] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes minwikibooks --fix # T395499 [13:23:38] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [13:24:26] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes 
zghwiktionary --fix # T399785 [13:24:29] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [13:25:22] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes madwikisource --fix # T391767 [13:25:25] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:26] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [13:27:45] ok, I think we’re done then! [13:27:49] !log UTC afternoon backport+config window done [13:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1178542 (https://phabricator.wikimedia.org/T401622) (owner: 10Tiziano Fogli) [13:30:15] (03CR) 10Muehlenhoff: [C:03+2] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:31:05] (03CR) 10Tiziano Fogli: [C:03+2] sre-access-requests: add egardner to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1178542 (https://phabricator.wikimedia.org/T401622) (owner: 10Tiziano Fogli) [13:33:00] PROBLEM - Check whether ferm is active by checking the default input chain on db2165 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:34:31] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab, 13Patch-For-Review: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11082600 (10tappof) 05Open→03Resolved a:03tappof The patch has been merged and access granted. Please feel free to reac... 
[13:35:35] btullis@cumin1003 reimage (PID 1846369) is awaiting input [13:36:10] (03PS11) 10Tiziano Fogli: nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) [13:36:12] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance [13:36:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T400854)', diff saved to https://phabricator.wikimedia.org/P81264 and previous config saved to /var/cache/conftool/dbconfig/20250813-133619-ladsgroup.json [13:36:23] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [13:36:45] RESOLVED: CirrusStreamingUpdaterFlinkNoRegisteredTask: ... [13:36:45] cirrus-streaming-updater job in eqiad (k8s) is running without any taskmanagers - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic-backfill - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkNoRegisteredTask [13:39:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T400854)', diff saved to https://phabricator.wikimedia.org/P81265 and previous config saved to /var/cache/conftool/dbconfig/20250813-133859-ladsgroup.json [13:40:12] (03PS1) 10Muehlenhoff: ssh/trixie: Also pass ssh_ca_key_available to the EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) [13:44:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:49:41] (03CR) 10Bking: [C:03+2] stat hosts: Alert on memory stalls [alerts] - 
10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [13:50:23] (03PS1) 10TChin: [eventstreams] Bump version 0.18.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178550 (https://phabricator.wikimedia.org/T390140) [13:51:10] (03Merged) 10jenkins-bot: stat hosts: Alert on memory stalls [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [13:52:13] (03CR) 10Bking: "I think we should, particularly the I/O alerts. But let's leave that for a follow-up, if that's OK." [alerts] - 10https://gerrit.wikimedia.org/r/1177410 (https://phabricator.wikimedia.org/T401589) (owner: 10Bking) [13:54:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81266 and previous config saved to /var/cache/conftool/dbconfig/20250813-135407-ladsgroup.json [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1400) [14:03:00] RECOVERY - Check whether ferm is active by checking the default input chain on db2165 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:03:32] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 100574 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [14:04:20] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-08-01-154925 to 2025-08-13-113934 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178529 (https://phabricator.wikimedia.org/T399424) (owner: 10Jforrester) [14:06:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-08-01-154925 to 2025-08-13-113934 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1178529 (https://phabricator.wikimedia.org/T399424) (owner: 10Jforrester) [14:06:22] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2165 gradually with 4 steps - Upgrade of db2165.codfw.wmnet completed [14:06:23] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2165.codfw.wmnet [14:07:40] PROBLEM - Disk space on an-druid1005 is CRITICAL: DISK CRITICAL - free space: /srv 84146 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1005&var-datasource=eqiad+prometheus/ops [14:09:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81267 and previous config saved to /var/cache/conftool/dbconfig/20250813-140914-ladsgroup.json [14:10:00] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:14] (03PS2) 10Muehlenhoff: ssh/trixie: Also pass ssh_ca_key_available to the EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) [14:11:06] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy new edit-check model version. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1178518 (https://phabricator.wikimedia.org/T401696) (owner: 10Gkyziridis) [14:13:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [14:15:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:08] PROBLEM - Disk space on an-druid1006 is CRITICAL: DISK CRITICAL - free space: /srv 83385 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1006&var-datasource=eqiad+prometheus/ops [14:16:48] (03Merged) 10jenkins-bot: ml-services: Deploy new edit-check model version. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178518 (https://phabricator.wikimedia.org/T401696) (owner: 10Gkyziridis) [14:17:30] (03PS1) 10Muehlenhoff: Failover IDP to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1178556 [14:20:14] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:20:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:10] (03CR) 10Slyngshede: [C:03+2] Failover IDP to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1178556 (owner: 10Muehlenhoff) [14:21:19] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1178556 (owner: 10Muehlenhoff) [14:21:31] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on 
namespace 'edit-check' for release 'main' . [14:22:03] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:22:20] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:22:46] (03CR) 10Muehlenhoff: [C:03+2] Failover IDP to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1178556 (owner: 10Muehlenhoff) [14:23:48] (03PS1) 10MVernon: thanos: add thanos-be1005 (JBOD), drain thanos-be2005 [puppet] - 10https://gerrit.wikimedia.org/r/1178557 (https://phabricator.wikimedia.org/T400877) [14:24:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T400854)', diff saved to https://phabricator.wikimedia.org/P81268 and previous config saved to /var/cache/conftool/dbconfig/20250813-142421-ladsgroup.json [14:24:26] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [14:25:05] !log jmm@dns1004 START - running authdns-update [14:25:59] !log jmm@dns1004 END - running authdns-update [14:26:19] !log jmm@dns1004 START - running authdns-update [14:27:16] !log jmm@dns1004 END - running authdns-update [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1430) [14:30:24] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:36:42] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [14:37:29] (03PS1) 10DLynch: Edit check: 
selectionmanager/gutter merge follow-ups [extensions/VisualEditor] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178558 (https://phabricator.wikimedia.org/T400905) [14:40:27] !log installing PHP 8.2 security updates [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:26] (03PS1) 10Gkyziridis: ml-services: Deploy new edit-check model version on production. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178563 (https://phabricator.wikimedia.org/T401696) [14:55:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/VisualEditor] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178558 (https://phabricator.wikimedia.org/T400905) (owner: 10DLynch) [14:55:25] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:30] (03PS1) 10Alexandros Kosiaris: staging: Bump wikifunctions quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178566 (https://phabricator.wikimedia.org/T401833) [15:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:42] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:07:29] (03CR) 10Jelto: [C:03+1] "lgtm, two nits in-line" [puppet] - 
10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [15:07:58] (03CR) 10Ssingh: "I think this looks good but question on why not simply using /bin/sh -c?" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [15:09:06] (03CR) 10Clément Goubert: [C:03+1] staging: Bump wikifunctions quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178566 (https://phabricator.wikimedia.org/T401833) (owner: 10Alexandros Kosiaris) [15:10:17] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178566 (https://phabricator.wikimedia.org/T401833) (owner: 10Alexandros Kosiaris) [15:14:42] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:06] (03CR) 10Vgutierrez: acme-chief: Move clean-stale-certs to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [15:17:45] (03Merged) 10jenkins-bot: staging: Bump wikifunctions quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178566 (https://phabricator.wikimedia.org/T401833) (owner: 10Alexandros Kosiaris) [15:18:55] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: Deploy new edit-check model version on production. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1178563 (https://phabricator.wikimedia.org/T401696) (owner: 10Gkyziridis) [15:20:05] (03PS1) 10Jforrester: wikifunctions: Pull down the memory limits for staging instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178571 (https://phabricator.wikimedia.org/T401833) [15:24:10] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:25:19] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:27:05] (03CR) 10Vgutierrez: [C:04-1] acme-chief: Move clean-stale-certs to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [15:27:05] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:28:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:29:15] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:36:02] 06SRE, 06Traffic-Icebox, 07User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891#11083072 (10CDanis) →14Duplicate dup:03T400119 [15:36:07] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11083074 (10CDanis) [15:45:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:53:20] 06SRE-OnFire, 06Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 07Sustainability: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574#11083131 (10BTullis) [15:54:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [15:54:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T399249)', diff saved to https://phabricator.wikimedia.org/P81274 and previous config saved to /var/cache/conftool/dbconfig/20250813-155440-fceratto.json [15:54:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:56:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T399249)', diff saved to https://phabricator.wikimedia.org/P81275 and previous config saved to /var/cache/conftool/dbconfig/20250813-155650-fceratto.json [16:01:02] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:01:48] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
cloudcephosd1042.eqiad.wmnet with OS bullseye [16:02:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1047.eqiad.wmnet with OS bookworm [16:02:50] 10SRE-Access-Requests, 06Data-Engineering, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11083202 (10Ottomata) [16:02:51] 10SRE-Access-Requests, 06Data-Engineering, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11083204 (10Ottomata) `cmelo` is in `analytics-privatedata-users`: https://gerrit.wikimedia.org/r/plugin... [16:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11083206 (10VRiley-WMF) Since this unit is out of warrenty, Will locate another disk to use as a replacement. [16:04:48] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11083207 (10VRiley-WMF) a:03VRiley-WMF [16:06:11] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2147 quickly with 2 steps - Repooling [16:07:50] 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11083211 (10Ottomata) [16:08:53] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [16:09:02] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:09:24] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [16:10:06] (03PS9) 10Arnaudb: nftables: throttle debugging [puppet] - 
10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) [16:10:07] (03CR) 10Arnaudb: "nits handled!" [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [16:14:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11083230 (10VRiley-WMF) 05Open→03In progress Commencing with thanos-be1005 controller swap out now. [16:16:08] RECOVERY - Disk space on an-druid1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1006&var-datasource=eqiad+prometheus/ops [16:17:34] PROBLEM - Host thanos-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:02] (03PS2) 10Aaron Schulz: Add restbase spec JSON files to which /rest_v1/?spec can be routed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175942 (https://phabricator.wikimedia.org/T397203) [16:20:30] (03CR) 10JHathaway: [C:03+1] ssh/trixie: Also pass ssh_ca_key_available to the EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [16:21:28] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2147 quickly with 2 steps - Repooling [16:23:04] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr2-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:33:09] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage [16:33:12] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:35:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch 
core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [16:37:46] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:39:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1047.eqiad.wmnet with reason: host reimage [16:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:41:27] (03CR) 10Majavah: [C:03+2] P:wmcs::terraform: Refresh registry name [puppet] - 10https://gerrit.wikimedia.org/r/1178508 (https://phabricator.wikimedia.org/T401814) (owner: 10Majavah) [16:43:26] (03PS2) 10Aaron Schulz: [DNM] Route old /api/rest_v1/?specs endpoints to static JSON files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177514 (https://phabricator.wikimedia.org/T397203) [16:45:01] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:46:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2219 gradually with 4 steps - Repooling [16:47:59] (03PS1) 10Majavah: ntp: Enable IPv6 on Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) [16:48:34] (03CR) 10BCornwall: [C:03+1] "Dunno about the name but +1 for the change" [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:49:20] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6574/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [16:51:13] (03CR) 10Jforrester: [C:03+2] wikifunctions: Pull down the memory limits for staging instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178571 (https://phabricator.wikimedia.org/T401833) (owner: 10Jforrester) [16:52:14] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#11083346 (10RobH) All affected servers (except an-mariadb1001) have been taken care of via their various sub-tasks. Followed up on the specific linked task for that host directly. [16:52:45] FIRING: [3x] Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:53:09] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:54:16] (03Merged) 10jenkins-bot: wikifunctions: Pull down the memory limits for staging instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178571 (https://phabricator.wikimedia.org/T401833) (owner: 10Jforrester) [16:55:27] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:55:38] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:56:31] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:57:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11083360 (10VRiley-WMF) 05In progress→03Open thanos-be1005 controller has been swapped out. Please test it out and let me know.... [16:57:32] (03CR) 10Ssingh: [C:03+1] "Looks good from a prod DNS perspective." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [16:57:45] RESOLVED: Traffic bill over quota: Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:58:06] RECOVERY - Host thanos-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [16:58:15] (03CR) 10BCornwall: [V:03+2 C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1177469 (owner: 10Ncmonitor) [16:58:17] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1047.eqiad.wmnet with OS bookworm [16:59:00] !log brett@dns1004 START - running authdns-update [17:00:01] !log brett@dns1004 END - running authdns-update [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1700) [17:06:38] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:14:36] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. 
[17:17:41] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:23:05] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [17:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:29:19] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:32:17] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2219 gradually with 4 steps - Repooling [17:35:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:45:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:50:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:59:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm [18:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T1800) [18:00:51] FIRING: [10x] 
SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:04:35] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178592 (https://phabricator.wikimedia.org/T396375) [18:04:37] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178592 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:05:44] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178592 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:13:27] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.14 refs T396375 [18:13:31] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [18:18:57] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11083604 (10Mike_Peel) >>! In T401438#11069994, @BCornwall wrote: > Emails sent to @Mike_Peel and @Geraki and brazenly subbed them here too :) Thanks for the messages.... [18:24:45] andrew@cumin2002 reimage (PID 3298348) is awaiting input [18:26:27] (03CR) 10Dzahn: [C:03+2] "thanks! 
yes, we agree on the name -> https://phabricator.wikimedia.org/T395938#11080325 https://phabricator.wikimedia.org/T395938#1108138" [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:26:56] !log dzahn@dns1004 START - running authdns-update [18:28:00] !log dzahn@dns1004 END - running authdns-update [18:30:41] (03CR) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [18:30:59] (03PS5) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) [18:31:23] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm [18:31:38] (03CR) 10Dzahn: "oh, good point, but we don't actually use that anymore since we split aphlict into seperate VMs.. hmm.. 
we can either remove that code or " [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [18:31:42] 07Puppet, 06Infrastructure-Foundations: alert1002.wikimedia.org: Puppet warning of too many entries in /etc/acmecerts/icinga - https://phabricator.wikimedia.org/T401858#11083678 (10jhathaway) p:05Triage→03Low a:03jhathaway [18:32:21] (03CR) 10Dzahn: [V:03+1 C:03+2] lists: delete unused apache.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/1178029 (owner: 10Dzahn) [18:36:37] (03PS1) 10JHathaway: acme_chief: purge old certs [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) [18:36:37] (03PS1) 10Andrew Bogott: ceph codfw1dev: revert back to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1178598 (https://phabricator.wikimedia.org/T399858) [18:36:42] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [18:36:54] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [18:37:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [18:38:09] (03CR) 10Muehlenhoff: [C:04-1] "Then maybe just drop the profile for aphlict from the main Phabricator role." [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [18:38:44] (03CR) 10JHathaway: "@vgutierrez@wikimedia.org, I wasn't sure of the best way to test this. Let me know if you have any ideas." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [18:40:02] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for maps2009.mgmt:22 - https://phabricator.wikimedia.org/T390659#11083704 (10Jhancock.wm) maps2009 idrac continues to fail and requires a main board replacement to fix. The server is due to be refreshed later this quarter (Q1 25-26). Gonna leave the ticke... [18:40:28] (03CR) 10Andrew Bogott: [C:03+2] ceph codfw1dev: revert back to pacific [puppet] - 10https://gerrit.wikimedia.org/r/1178598 (https://phabricator.wikimedia.org/T399858) (owner: 10Andrew Bogott) [18:40:50] (03CR) 10Dzahn: [V:03+1 C:03+2] "confirmed this was a noop on both lists servers" [puppet] - 10https://gerrit.wikimedia.org/r/1178029 (owner: 10Dzahn) [18:41:00] (03CR) 10Dzahn: [C:03+2] admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [18:41:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [18:41:36] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm [18:47:09] (03CR) 10Dzahn: [C:03+2] "confirmed on testreduce1002 and parsoidtest1001 - no change except Roan's access was removed: Notice: /Stage[main]/Ssh::Server/File[/etc/" [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [18:49:04] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11083736 (10Dzahn) parsoid-roots, parsoid-admins and parsoid-test-admins groups have now been removed to simplify this. In the future you... 
[18:50:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [18:54:17] (03CR) 10Dzahn: [C:03+1] gerrit: add spare fqdn to apache vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178172 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:55:01] (03PS1) 10Dzahn: wikistats: add support for trixie/PHP8.4 [puppet] - 10https://gerrit.wikimedia.org/r/1178603 (https://phabricator.wikimedia.org/T401859) [18:55:46] (03CR) 10Dzahn: [C:03+2] wikistats: add support for trixie/PHP8.4 [puppet] - 10https://gerrit.wikimedia.org/r/1178603 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [19:00:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:02:48] (03CR) 10Hashar: [C:03+1] gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [19:04:43] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:05:49] 
10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11083766 (10RLazarus) 1. Yeah, the benefit of using Istio metrics is Istio exports them for you, so you don't have t... [19:05:57] (03PS1) 10Dzahn: wmflib: add 8.4 as a valid PHP version string, for trixie support [puppet] - 10https://gerrit.wikimedia.org/r/1178604 (https://phabricator.wikimedia.org/T401859) [19:06:00] andrew@cumin2002 reimage (PID 3317378) is awaiting input [19:09:54] (03PS2) 10Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [19:10:20] (03CR) 10CI reject: [V:04-1] zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:11:27] (03PS3) 10Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [19:11:53] (03CR) 10CI reject: [V:04-1] zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:12:38] (03PS4) 10Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [19:13:24] (03CR) 10Dzahn: [V:03+1 C:03+2] lists: add NEL headers to apache.conf.epp template [puppet] - 10https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [19:14:37] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11083815 (10Dzahn) [19:15:36] (03CR) 10Scott French: [C:03+1] 
"Ah, interesting! I guess 8.3 will need added to this at some point when non-containerized mediawiki workloads migrate. Not your problem, t" [puppet] - 10https://gerrit.wikimedia.org/r/1178604 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [19:15:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:16:29] (03CR) 10RhinosF1: [C:03+1] wmflib: add 8.4 as a valid PHP version string, for trixie support [puppet] - 10https://gerrit.wikimedia.org/r/1178604 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [19:16:45] (03CR) 10Dzahn: [C:03+2] "thank you!:)" [puppet] - 10https://gerrit.wikimedia.org/r/1178604 (https://phabricator.wikimedia.org/T401859) (owner: 10Dzahn) [19:18:39] !log aqu@deploy1003 Started deploy [analytics/refinery@f09c763] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f09c7633] [19:19:40] !log lists.wikimedia.org - restarted apache2, added NEL headers [19:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:58] 10ops-codfw, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863 (10phaultfinder) 03NEW [19:21:45] !log aqu@deploy1003 Finished deploy [analytics/refinery@f09c763] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f09c7633] (duration: 03m 05s) [19:26:59] (03CR) 10Dzahn: [C:04-1] "allow 80 for httpbb and 443 for envoy" [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:28:09] !log aqu@deploy1003 Started deploy [analytics/refinery@f09c763]: Regular analytics weekly train [analytics/refinery@f09c7633] [19:28:35] (03PS9) 10Dzahn: phabricator: increase APCu shared memory 
segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) [19:29:02] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:30:11] (03PS1) 10Scott French: Reduce log level to 'info' on ImageSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178601 (https://phabricator.wikimedia.org/T368096) [19:30:22] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11083861 (10Jhancock.wm) [19:31:18] !log aqu@deploy1003 Finished deploy [analytics/refinery@f09c763]: Regular analytics weekly train [analytics/refinery@f09c7633] (duration: 03m 09s) [19:32:23] !log aqu@deploy1003 Started deploy [analytics/refinery@f09c763] (thin): Regular analytics weekly train THIN [analytics/refinery@f09c7633] [19:32:33] (03CR) 10Dzahn: "does anyone see how I am supposed to do this correctly? the key of the key/value pair for this PHP extension config contains a "." 
but tha" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:32:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:33:57] !log aqu@deploy1003 Finished deploy [analytics/refinery@f09c763] (thin): Regular analytics weekly train THIN [analytics/refinery@f09c7633] (duration: 01m 33s) [19:35:12] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11083911 (10Jhancock.wm) [19:35:51] (03PS2) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177471 (owner: 10Ncmonitor) [19:36:07] (03CR) 10Eevans: [C:03+1] Reduce log level to 'info' on ImageSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178601 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [19:39:27] (03PS1) 10Dzahn: zuul::main: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) [19:40:17] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm [19:40:29] (03PS2) 10Dzahn: zuul::main: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) [19:43:21] (03PS1) 10Andrew Bogott: Revert "ceph codfw1dev: revert back to pacific" [puppet] - 10https://gerrit.wikimedia.org/r/1178610 [19:44:16] (03CR) 10Dzahn: [C:04-1] "missing profile::tlsproxy::envoy::global_cert_name" [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:44:57] 
10ops-codfw, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867 (10phaultfinder) 03NEW [19:45:30] (03PS1) 10Eevans: data-gateway: enable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) [19:45:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:45:54] (03CR) 10BCornwall: "Removed the cricket domains as we're going to dead-park those (ignored in ncmonitor via Iaa057e10d0aa9888751365e9cecc47cb004c47a" [puppet] - 10https://gerrit.wikimedia.org/r/1177471 (owner: 10Ncmonitor) [19:46:32] (03CR) 10Andrew Bogott: [C:03+2] Revert "ceph codfw1dev: revert back to pacific" [puppet] - 10https://gerrit.wikimedia.org/r/1178610 (owner: 10Andrew Bogott) [19:47:46] (03CR) 10Eevans: [C:03+1] data-gateway: enable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [19:48:10] (03CR) 10Eevans: data-gateway: enable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [19:48:26] (03PS3) 10Dzahn: zuul::main: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) [19:49:51] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11083949 (10phaultfinder) [19:51:02] (03PS1) 10Dzahn: zuul::main: set a role description [puppet] - 10https://gerrit.wikimedia.org/r/1178612 (https://phabricator.wikimedia.org/T395938) [19:51:33] (03CR) 10Dzahn: "invalid secret" [puppet] - 
10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:51:50] (03PS2) 10Dzahn: zuul::main: set a role description [puppet] - 10https://gerrit.wikimedia.org/r/1178612 (https://phabricator.wikimedia.org/T395938) [19:52:07] (03PS3) 10BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) [19:52:16] (03CR) 10BCornwall: "General cleanliness and consistency like 'clean-stale-puppet-certs'. IMO once there's a desire to chain commands in a systemd unit file it" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [19:53:54] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6577/co" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [19:56:43] (03CR) 10Dzahn: [C:03+2] zuul::main: set a role description [puppet] - 10https://gerrit.wikimedia.org/r/1178612 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:57:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:58:00] (03CR) 10Brennen Bearnes: "Hrm, digging... I see this handled in modules/profile/manifests/mediawiki/php.pp, but that seems to be a special case?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:58:23] (03PS1) 10Bking: WIP: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:58:51] (03CR) 10CI reject: [V:04-1] WIP: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [19:58:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2000). Please do the needful. [20:00:06] kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] o/ [20:00:39] Looks like it's just me, so I can run it myself. [20:01:01] (03CR) 10Dzahn: "there is a style rule thing against creating shell scripts straight from erb templates. 
The reason is that CI can't validate shell scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [20:01:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178558 (https://phabricator.wikimedia.org/T400905) (owner: 10DLynch) [20:03:16] (03Merged) 10jenkins-bot: Edit check: selectionmanager/gutter merge follow-ups [extensions/VisualEditor] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178558 (https://phabricator.wikimedia.org/T400905) (owner: 10DLynch) [20:03:40] (03CR) 10Dzahn: acme-chief: Move clean-stale-certs to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [20:03:46] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1178558|Edit check: selectionmanager/gutter merge follow-ups (T400905)]] [20:03:50] T400905: Refactor EditCheck gutter markers into SelectionManager - https://phabricator.wikimedia.org/T400905 [20:04:44] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11083979 (10Dzahn) a:05CDobbins→03None [20:04:50] 10ops-codfw, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11083981 (10phaultfinder) [20:05:51] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1178558|Edit check: selectionmanager/gutter merge follow-ups (T400905)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:06:09] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11083984 (10Dzahn) a:05karapayneWMDE→03None [20:07:46] (03PS4) 10Dzahn: zuul::main: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) [20:07:48] !log kemayo@deploy1003 kemayo: Continuing with sync [20:09:33] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1178609/6578/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:10:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:10:59] (03CR) 10Scott French: [C:03+1] data-gateway: enable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [20:11:07] (03PS2) 10Hashar: admin: allow systemctl status for MediaWiki train [puppet] - 10https://gerrit.wikimedia.org/r/1177958 [20:11:16] (03CR) 10Dzahn: [C:03+2] zuul::main: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/1178609 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:11:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:12:35] (03CR) 10CDanis: [C:03+2] admin: allow systemctl status for MediaWiki train [puppet] - 10https://gerrit.wikimedia.org/r/1177958 (owner: 10Hashar) 
[20:13:03] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178558|Edit check: selectionmanager/gutter merge follow-ups (T400905)]] (duration: 09m 17s) [20:13:07] T400905: Refactor EditCheck gutter markers into SelectionManager - https://phabricator.wikimedia.org/T400905 [20:14:19] Kemayo: if you're done with your backport(s), I might sneak one in as well. did you have anything else planned? [20:14:30] swfrench-wmf: That was everything for me. [20:14:55] Kemayo: great, I'll get started shortly. thanks! [20:15:23] (03CR) 10Scott French: "Thanks for the review, Eric!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178601 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [20:15:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:16:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178601 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [20:17:36] (03Merged) 10jenkins-bot: Reduce log level to 'info' on ImageSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178601 (https://phabricator.wikimedia.org/T368096) (owner: 10Scott French) [20:17:57] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]] [20:18:01] (03CR) 10Eevans: [C:03+2] data-gateway: enable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [20:18:01] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [20:19:49] (03Merged) 10jenkins-bot: data-gateway: enable 
debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178611 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [20:20:13] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:21:15] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:21:33] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:21:53] !log swfrench@deploy1003 swfrench: Continuing with sync [20:21:54] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11084019 (10VRiley-WMF) [20:24:52] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [20:26:09] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [20:27:19] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178601|Reduce log level to 'info' on ImageSuggestions (T368096)]] (duration: 09m 22s) [20:27:23] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [20:29:47] (03CR) 10BCornwall: [V:03+1] "Would that not just end up being a `.` directive, just shifting the issue to a separate file?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [20:31:38] (03PS1) 10Zabe: UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1178616 (https://phabricator.wikimedia.org/T401633) [20:31:47] (03PS1) 10Zabe: UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178617 (https://phabricator.wikimedia.org/T401633) [20:31:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:33:22] (03CR) 10Dzahn: "CI would be able to check actual check code. Since code and config data would be separated just the data would remain unchecked. 
(I know i" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [20:33:40] (03CR) 10Dzahn: "meant to say "check the shell code"" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [20:35:38] jouncebot: nowandnext [20:35:39] For the next 0 hour(s) and 24 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2000) [20:35:39] In 0 hour(s) and 24 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2100) [20:35:58] (03CR) 10Zabe: [C:03+2] UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1178616 (https://phabricator.wikimedia.org/T401633) (owner: 10Zabe) [20:36:00] (03CR) 10Zabe: [C:03+2] UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178617 (https://phabricator.wikimedia.org/T401633) (owner: 10Zabe) [20:36:13] !log swfrench@deploy1003 mwscript-k8s job started: extensions/ImageSuggestions/maintenance/SendNotificationsForUnillustratedWatchedTitles.php --wiki=cawiki --min-edit-count=500 --min-confidence=80 --max-notifications-per-user=2 --exclude-instance-of=Q5 --queue --quiet --dry-run [20:36:17] !log start manual equivalent of imagesuggestions-notifyunillustratedwatched-ca cronjob in --dry-run mode - T368096 [20:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:21] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [20:43:09] (03CR) 10Dzahn: "hmm, that's a good pointer but I think it gets away with it because it is using "config_by_sapi" with class php, rather than passing the k" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 
(https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [20:44:09] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11084060 (10GhostInTheMachine) Just repeating these for clarity: - https://wikitech.wikimedia.org/wiki/Robot_policy - https://foundation.wikimedia.org/wiki/Policy:Wikimedia... [20:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084061 (10phaultfinder) [20:47:12] (03PS10) 10Dzahn: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) [20:47:39] (03CR) 10CI reject: [V:04-1] phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [20:49:17] (03Merged) 10jenkins-bot: UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1178616 (https://phabricator.wikimedia.org/T401633) (owner: 10Zabe) [20:49:58] (03PS11) 10Dzahn: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) [20:50:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:52:35] (03Merged) 10jenkins-bot: UpdateSearchIndexConfig get the writable clusters not all of them [extensions/CirrusSearch] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178617 (https://phabricator.wikimedia.org/T401633) (owner: 10Zabe) [20:52:57] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [20:53:12] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1178617|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]], [[gerrit:1178616|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]] [20:53:16] T401633: UpdateSearchIndexConfig.php fails with "Named cluster (dnsdisc) is not configured for maintenance operations" - https://phabricator.wikimedia.org/T401633 [20:54:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance [20:55:14] !log zabe@deploy1003 zabe: Backport for [[gerrit:1178617|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]], [[gerrit:1178616|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:55:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:55:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [20:56:15] !log zabe@deploy1003 zabe: Continuing with sync [20:59:17] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [20:59:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T400854)', diff saved to https://phabricator.wikimedia.org/P81283 and previous config saved to /var/cache/conftool/dbconfig/20250813-205923-ladsgroup.json [20:59:28] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2100) [21:01:54] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178617|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]], [[gerrit:1178616|UpdateSearchIndexConfig get the writable clusters not all of them (T401633)]] (duration: 08m 41s) [21:01:58] T401633: UpdateSearchIndexConfig.php fails with "Named cluster (dnsdisc) is not configured for maintenance operations" - https://phabricator.wikimedia.org/T401633 [21:02:02] (03PS1) 10Dzahn: various: fix puppet-lint legacy_fact warnings for collab services [puppet] - 10https://gerrit.wikimedia.org/r/1178619 [21:04:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T400854)', diff saved to https://phabricator.wikimedia.org/P81284 and previous config saved to 
/var/cache/conftool/dbconfig/20250813-210418-ladsgroup.json [21:05:14] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084119 (10phaultfinder) [21:05:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:06:11] (03CR) 10Dzahn: [C:04-1] "Php::Extension[apc]: has no parameter named 'shm_size'" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [21:10:53] !log zabe@deploy1003 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki rkiw --cluster=all # T392490 [21:10:57] T392490: Create Wikipedia Arakan - https://phabricator.wikimedia.org/T392490 [21:11:08] !log zabe@deploy1003 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki rkiwiki --cluster=all # T392490 [21:12:13] !log zabe@deploy1003 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki zghwiktionary --cluster=all # T399684 [21:12:17] T399684: Create Wiktionary Standard Moroccan Tamazight - https://phabricator.wikimedia.org/T399684 [21:13:24] !log zabe@deploy1003 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki tlwikisource --cluster=all # T388639 [21:13:28] T388639: Create Wikisource Tagalog - https://phabricator.wikimedia.org/T388639 [21:14:13] !log zabe@deploy1003 mwscript-k8s job started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki madwikisource --cluster=all # T391747 [21:14:16] T391747: Create Wikisource Madurese - https://phabricator.wikimedia.org/T391747 [21:14:56] !log zabe@deploy1003 mwscript-k8s job 
started: extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki minwikibooks --cluster=all # T395452 [21:14:59] T395452: Create Wikibooks Minangkabau - https://phabricator.wikimedia.org/T395452 [21:19:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P81285 and previous config saved to /var/cache/conftool/dbconfig/20250813-211925-ladsgroup.json [21:20:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:21:18] jouncebot nowandnext [21:21:18] For the next 0 hour(s) and 38 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2100) [21:21:18] In 0 hour(s) and 38 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2200) [21:22:11] all clear for a scap deploy? [21:23:31] (seeing nothing current in spiderpig or on deploy box, going ahead.) 
[21:24:04] !log brennen@deploy1003 Installing scap version "4.201.0" for 169 host(s) [21:24:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084145 (10phaultfinder) [21:25:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:29:14] !log brennen@deploy1003 Installation of scap version "4.201.0" completed for 169 hosts [21:34:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P81286 and previous config saved to /var/cache/conftool/dbconfig/20250813-213433-ladsgroup.json [21:34:43] (03PS15) 10Bking: WIP: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [21:34:47] (03PS16) 10Ryan Kemper: WIP: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [21:35:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:40:36] (03PS17) 10Ryan Kemper: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [21:40:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:41:00] (03CR) 10Ryan Kemper: [C:03+1] "+1 pending review from datahubsearch and logstash owners" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [21:45:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:48:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11084184 (10VRiley-WMF) @elukey I was able to locate a spare 480 gig SSD for this unit. Would you be able to let me know a good time to replace this? [21:49:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T400854)', diff saved to https://phabricator.wikimedia.org/P81287 and previous config saved to /var/cache/conftool/dbconfig/20250813-214940-ladsgroup.json [21:49:45] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:49:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1190.eqiad.wmnet with reason: Maintenance [21:50:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T400854)', diff saved to https://phabricator.wikimedia.org/P81288 and previous config saved to /var/cache/conftool/dbconfig/20250813-215003-ladsgroup.json [21:54:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [21:54:46] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device ssw1-d8-eqiad [21:54:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d8-eqiad [21:54:59] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T400854)', diff saved to https://phabricator.wikimedia.org/P81289 and previous config saved to /var/cache/conftool/dbconfig/20250813-215458-ladsgroup.json [21:55:03] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250813T2200) [22:00:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:04:03] (03CR) 10Cwhite: [C:04-1] "Some questions inline. -1 for visibility." [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [22:10:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P81290 and previous config saved to /var/cache/conftool/dbconfig/20250813-221006-ladsgroup.json [22:15:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [22:18:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2004-dev.codfw.wmnet with reason: host reimage [22:25:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P81291 and previous config saved to /var/cache/conftool/dbconfig/20250813-222513-ladsgroup.json [22:25:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:35:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:36:42] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [22:39:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084278 (10phaultfinder) [22:40:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T400854)', diff saved to https://phabricator.wikimedia.org/P81292 and previous config saved to /var/cache/conftool/dbconfig/20250813-224021-ladsgroup.json [22:40:26] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [22:40:37] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1199.eqiad.wmnet with reason: Maintenance [22:40:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T400854)', diff saved to https://phabricator.wikimedia.org/P81293 and previous config saved to /var/cache/conftool/dbconfig/20250813-224044-ladsgroup.json [22:40:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown 
[22:43:20] (03PS1) 10RLazarus: pyrra: Add Wikifunctions backend API combined latency-availability [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) [22:45:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T400854)', diff saved to https://phabricator.wikimedia.org/P81294 and previous config saved to /var/cache/conftool/dbconfig/20250813-224508-ladsgroup.json [22:45:51] FIRING: [10x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [22:48:40] (03PS1) 10Andrew Bogott: Move cloudcephosd2004-dev to ceph version 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1178629 [22:48:51] (03CR) 10RLazarus: "Adding Grace for the semantics, Valentín for the config. Thanks both!" [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) (owner: 10RLazarus) [22:49:12] (03CR) 10Andrew Bogott: [C:03+2] Move cloudcephosd2004-dev to ceph version 'reef' [puppet] - 10https://gerrit.wikimedia.org/r/1178629 (owner: 10Andrew Bogott) [22:50:20] (03PS2) 10Mstyles: WebAuthn: Limit passkeys to roaming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) [22:50:35] (03CR) 10Mstyles: WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [22:52:04] (03CR) 10Catrope: [C:03+1] WebAuthn: Limit passkeys to roaming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [22:55:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [22:56:51] (03PS1) 10Krinkle: alertmanager: Remove unused wikimedia-perf-bots [puppet] - 10https://gerrit.wikimedia.org/r/1178631 [22:57:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2004-dev.codfw.wmnet with OS bookworm [23:00:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P81295 and previous config saved to /var/cache/conftool/dbconfig/20250813-230015-ladsgroup.json [23:00:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:43] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:05:07] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084326 (10phaultfinder) [23:05:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:10:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:12:02] (03PS1) 10Andrew Bogott: cloudcephosd2004-dev: update nic names for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1178636 [23:12:38] (03CR) 10Andrew Bogott: [C:03+2] cloudcephosd2004-dev: update nic names for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1178636 (owner: 10Andrew Bogott) [23:15:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P81296 and previous config saved to /var/cache/conftool/dbconfig/20250813-231523-ladsgroup.json [23:15:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:20:51] FIRING: [9x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:25:51] FIRING: [8x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-codfw:et-0/0/1 (Core: lsw1-e2-codfw:ethernet-1/56 {#130117100030}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:29:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - 
PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084332 (10phaultfinder) [23:30:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T400854)', diff saved to https://phabricator.wikimedia.org/P81297 and previous config saved to /var/cache/conftool/dbconfig/20250813-233031-ladsgroup.json [23:30:35] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [23:30:47] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [23:30:55] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:31:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T400854)', diff saved to https://phabricator.wikimedia.org/P81298 and previous config saved to /var/cache/conftool/dbconfig/20250813-233102-ladsgroup.json [23:36:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T400854)', diff saved to https://phabricator.wikimedia.org/P81299 and previous config saved to /var/cache/conftool/dbconfig/20250813-233614-ladsgroup.json [23:36:18] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [23:38:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178640 [23:38:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178640 (owner: 10TrainBranchBot) [23:51:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P81300 and previous config saved to 
/var/cache/conftool/dbconfig/20250813-235121-ladsgroup.json [23:54:42] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178640 (owner: 10TrainBranchBot)