[00:06:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P81303 and previous config saved to /var/cache/conftool/dbconfig/20250814-000629-ladsgroup.json
[00:07:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 145593760 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:08:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178648
[00:08:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178648 (owner: TrainBranchBot)
[00:08:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6013024 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:09:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084357 (phaultfinder)
[00:09:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084358 (phaultfinder)
[00:21:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T400854)', diff saved to https://phabricator.wikimedia.org/P81304 and previous config saved to /var/cache/conftool/dbconfig/20250814-002136-ladsgroup.json
[00:21:41] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[00:21:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[00:22:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T400854)', diff saved to
https://phabricator.wikimedia.org/P81305 and previous config saved to /var/cache/conftool/dbconfig/20250814-002159-ladsgroup.json
[00:26:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T400854)', diff saved to https://phabricator.wikimedia.org/P81306 and previous config saved to /var/cache/conftool/dbconfig/20250814-002629-ladsgroup.json
[00:30:58] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1178648 (owner: TrainBranchBot)
[00:34:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084388 (phaultfinder)
[00:41:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P81307 and previous config saved to /var/cache/conftool/dbconfig/20250814-004137-ladsgroup.json
[00:56:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P81308 and previous config saved to /var/cache/conftool/dbconfig/20250814-005644-ladsgroup.json
[00:59:56] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084395 (phaultfinder)
[01:00:45] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:00:51] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11084396 (Andrew)
[01:01:59] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11084399 (Andrew) I'm still unable to reimage cloudcephosd1042; it still PXE boots every time, never landing back in the OS. I tried with 1047 and that worked fine, so I suspect something i...
[01:09:56] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084403 (phaultfinder)
[01:10:24] (CR) BCornwall: "Does PCC work against it?" [puppet] - https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: Majavah)
[01:11:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T400854)', diff saved to https://phabricator.wikimedia.org/P81309 and previous config saved to /var/cache/conftool/dbconfig/20250814-011152-ladsgroup.json
[01:11:56] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[01:12:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[01:12:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T400854)', diff saved to https://phabricator.wikimedia.org/P81310 and previous config saved to /var/cache/conftool/dbconfig/20250814-011215-ladsgroup.json
[01:12:34] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 48s)
[01:16:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T400854)', diff saved to https://phabricator.wikimedia.org/P81311 and previous config saved to /var/cache/conftool/dbconfig/20250814-011642-ladsgroup.json
[01:24:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084429 (phaultfinder)
[01:29:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084434 (phaultfinder)
[01:31:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to
https://phabricator.wikimedia.org/P81312 and previous config saved to /var/cache/conftool/dbconfig/20250814-013149-ladsgroup.json
[01:32:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:37:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:39:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084436 (phaultfinder)
[01:46:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P81313 and previous config saved to /var/cache/conftool/dbconfig/20250814-014657-ladsgroup.json
[01:49:26] (PS1) RLazarus: mediawiki: Add support for mounting a custom dblist [deployment-charts] - https://gerrit.wikimedia.org/r/1178666 (https://phabricator.wikimedia.org/T401737)
[01:49:40] (PS1) RLazarus: deployment_server: Add --local-dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737)
[01:55:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084458 (phaultfinder)
[01:57:51] (PS2) RLazarus: deployment_server: Add --local_dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737)
[02:01:10] (CR) RLazarus: "Tried all of these and they worked as expected:" [puppet] - https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[02:02:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T400854)', diff saved to https://phabricator.wikimedia.org/P81314 and previous config saved to /var/cache/conftool/dbconfig/20250814-020205-ladsgroup.json
[02:02:09] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[02:02:20] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[02:02:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T400854)', diff saved to https://phabricator.wikimedia.org/P81315 and previous config saved to /var/cache/conftool/dbconfig/20250814-020228-ladsgroup.json
[02:07:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T400854)', diff saved to https://phabricator.wikimedia.org/P81316 and previous config saved to /var/cache/conftool/dbconfig/20250814-020737-ladsgroup.json
[02:07:41] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[02:22:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P81317 and previous config saved to /var/cache/conftool/dbconfig/20250814-022245-ladsgroup.json
[02:29:56] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084498 (phaultfinder)
[02:34:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084501 (phaultfinder)
[02:36:42] FIRING:
GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[02:37:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P81318 and previous config saved to /var/cache/conftool/dbconfig/20250814-023753-ladsgroup.json
[02:38:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:44:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084504 (phaultfinder)
[02:53:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T400854)', diff saved to https://phabricator.wikimedia.org/P81319 and previous config saved to /var/cache/conftool/dbconfig/20250814-025300-ladsgroup.json
[02:53:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[02:53:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[02:53:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T400854)', diff saved to https://phabricator.wikimedia.org/P81320 and previous config saved to /var/cache/conftool/dbconfig/20250814-025323-ladsgroup.json
[02:53:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:58:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T400854)', diff saved to https://phabricator.wikimedia.org/P81321 and previous config saved to /var/cache/conftool/dbconfig/20250814-025808-ladsgroup.json
[02:58:13] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[03:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:42] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:13:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P81322 and previous config saved to /var/cache/conftool/dbconfig/20250814-031316-ladsgroup.json
[03:20:12] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084520 (phaultfinder)
[03:26:06] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 -
https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:28:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P81323 and previous config saved to /var/cache/conftool/dbconfig/20250814-032824-ladsgroup.json
[03:36:28] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11084530 (Samwilson) Will GitLab CI be excluded from this policy? While working on T395398 I'm getting "429 Please set a proper user-agent…" in CI for URLs like https://wikis...
[03:41:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:41:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:43:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T400854)', diff saved to https://phabricator.wikimedia.org/P81324 and previous config saved to /var/cache/conftool/dbconfig/20250814-034332-ladsgroup.json
[03:43:37] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[03:43:48] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[03:46:06] PROBLEM - Druid historical on
an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[03:46:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:46:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:47:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[03:47:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T400854)', diff saved to https://phabricator.wikimedia.org/P81325 and previous config saved to /var/cache/conftool/dbconfig/20250814-034734-ladsgroup.json
[03:51:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T400854)', diff saved to https://phabricator.wikimedia.org/P81326 and previous config saved to /var/cache/conftool/dbconfig/20250814-035155-ladsgroup.json
[03:52:00] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[03:53:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:03:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:07:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P81327 and previous config saved to /var/cache/conftool/dbconfig/20250814-040703-ladsgroup.json
[04:09:06] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[04:20:09] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084551 (phaultfinder)
[04:22:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P81328 and previous config saved to /var/cache/conftool/dbconfig/20250814-042211-ladsgroup.json
[04:24:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084552 (phaultfinder)
[04:37:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T400854)', diff saved to https://phabricator.wikimedia.org/P81329 and previous config saved to /var/cache/conftool/dbconfig/20250814-043719-ladsgroup.json
[04:37:24] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production -
https://phabricator.wikimedia.org/T400854
[04:37:24] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[04:37:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T400854)', diff saved to https://phabricator.wikimedia.org/P81330 and previous config saved to /var/cache/conftool/dbconfig/20250814-043732-ladsgroup.json
[04:42:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T400854)', diff saved to https://phabricator.wikimedia.org/P81331 and previous config saved to /var/cache/conftool/dbconfig/20250814-044246-ladsgroup.json
[04:42:50] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[04:55:38] (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[04:57:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P81332 and previous config saved to /var/cache/conftool/dbconfig/20250814-045753-ladsgroup.json
[05:03:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:05:07] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084563 (phaultfinder)
[05:13:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P81333 and previous config saved to /var/cache/conftool/dbconfig/20250814-051301-ladsgroup.json
[05:15:04] ops-codfw, SRE,
DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084564 (phaultfinder)
[05:22:54] (CR) Arnaudb: gerrit: add spare fqdn to apache vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178172 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:23:08] (CR) Arnaudb: [C:+2] gerrit: add spare fqdn to apache vhost [puppet] - https://gerrit.wikimedia.org/r/1178172 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:28:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T400854)', diff saved to https://phabricator.wikimedia.org/P81334 and previous config saved to /var/cache/conftool/dbconfig/20250814-052809-ladsgroup.json
[05:28:14] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[05:28:24] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[05:28:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T400854)', diff saved to https://phabricator.wikimedia.org/P81335 and previous config saved to /var/cache/conftool/dbconfig/20250814-052831-ladsgroup.json
[05:32:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T400854)', diff saved to https://phabricator.wikimedia.org/P81336 and previous config saved to /var/cache/conftool/dbconfig/20250814-053252-ladsgroup.json
[05:37:46] (CR) Ayounsi: [C:+2] Replace SONIC grpc port with Nokia's in MR ACLs [homer/public] - https://gerrit.wikimedia.org/r/1175872 (owner: Ayounsi)
[05:38:27] (Merged) jenkins-bot: Replace SONIC grpc port with Nokia's in MR ACLs [homer/public] - https://gerrit.wikimedia.org/r/1175872 (owner: Ayounsi)
[05:44:59] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over
limit - https://phabricator.wikimedia.org/T401863#11084576 (phaultfinder)
[05:45:01] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084577 (phaultfinder)
[05:48:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P81337 and previous config saved to /var/cache/conftool/dbconfig/20250814-054800-ladsgroup.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T0600)
[06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T0600)
[06:03:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P81338 and previous config saved to /var/cache/conftool/dbconfig/20250814-060308-ladsgroup.json
[06:14:58] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084588 (phaultfinder)
[06:17:01] (CR) Ayounsi: [C:+2] gNMI: initial Nokia support [puppet] - https://gerrit.wikimedia.org/r/1175887 (owner: Ayounsi)
[06:18:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T400854)', diff saved to https://phabricator.wikimedia.org/P81339 and previous config saved to /var/cache/conftool/dbconfig/20250814-061816-ladsgroup.json
[06:18:20] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[06:18:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1252.eqiad.wmnet with reason: Maintenance
[06:18:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling
db1252 (T400854)', diff saved to https://phabricator.wikimedia.org/P81340 and previous config saved to /var/cache/conftool/dbconfig/20250814-061838-ladsgroup.json
[06:22:45] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 262725
[06:23:29] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262725
[06:24:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T400854)', diff saved to https://phabricator.wikimedia.org/P81341 and previous config saved to /var/cache/conftool/dbconfig/20250814-062409-ladsgroup.json
[06:24:13] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[06:24:47] FIRING: Emergency syslog message: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[06:29:47] RESOLVED: Emergency syslog message: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[06:32:01] !log lsw1-d2-codfw> restart analytics-agent gracefully
[06:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:42] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[06:39:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P81342 and previous config saved to /var/cache/conftool/dbconfig/20250814-063916-ladsgroup.json
[06:39:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit -
https://phabricator.wikimedia.org/T401863#11084629 (phaultfinder)
[06:51:30] (CR) Muehlenhoff: [C:+2] ssh/trixie: Also pass ssh_ca_key_available to the EPP template [puppet] - https://gerrit.wikimedia.org/r/1178547 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[06:54:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P81343 and previous config saved to /var/cache/conftool/dbconfig/20250814-065424-ladsgroup.json
[06:55:00] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084638 (phaultfinder)
[06:56:36] Hey folks, I am gonna start the backport deployment in a while
[06:56:39] is anybody around?
[06:57:59] ready to fire it
[07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T0700).
[07:00:05] georgekyz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:39] alright folks I am firing the deployment
[07:01:14] moritzm: vgutierrez: are you around? should I start it?
[07:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:02:46] folks I am ready, should I start it?
[07:04:43] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:04:54] starting the deployment
[07:05:46] (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590) (owner: Ilias Sarantopoulos)
[07:06:34] (Merged) jenkins-bot: ores-extension: add threshold for revertrisk in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1177446 (https://phabricator.wikimedia.org/T400590) (owner: Ilias Sarantopoulos)
[07:07:24] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]]
[07:07:28] T400590: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590
[07:09:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T400854)', diff saved to https://phabricator.wikimedia.org/P81344 and previous config saved to /var/cache/conftool/dbconfig/20250814-070932-ladsgroup.json
[07:09:36] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[07:09:46] !log gkyziridis@deploy1003 gkyziridis, isaranto: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:09:47] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:14:31] !log gkyziridis@deploy1003 gkyziridis, isaranto: Continuing with sync
[07:19:31] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177446|ores-extension: add threshold for revertrisk in enwiki (T400590)]] (duration: 12m 07s)
[07:19:36] T400590: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590
[07:19:52] ops-codfw, SRE, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084681 (phaultfinder)
[07:26:06] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:34:09] (CR) Gkyziridis: [C:+2] ml-services: Deploy new edit-check model version on production. [deployment-charts] - https://gerrit.wikimedia.org/r/1178563 (https://phabricator.wikimedia.org/T401696) (owner: Gkyziridis)
[07:35:48] (Merged) jenkins-bot: ml-services: Deploy new edit-check model version on production. [deployment-charts] - https://gerrit.wikimedia.org/r/1178563 (https://phabricator.wikimedia.org/T401696) (owner: Gkyziridis)
[07:47:34] ops-codfw, SRE, DC-Ops: Add scs-e3-codfw to monitoring - https://phabricator.wikimedia.org/T401310#11084792 (ayounsi) Resolved→Open Rancid complains with: ` The following routers have not been successfully contacted for more than 24 hours. -rw-r--r-- 1 rancid rancid 0 Aug 11 16:44 scs-e3-codf...
[07:48:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:49:52] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [07:50:01] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [07:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:54:58] (03CR) 10Vgutierrez: pyrra: Add Wikifunctions backend API combined latency-availability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) (owner: 10RLazarus) [07:58:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:00:08] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T0800) [08:06:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:07:19] (03CR) 
10Vgutierrez: acme-chief: Move clean-stale-certs to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [08:09:54] (03CR) 10Ecarg: pyrra: Add Wikifunctions backend API combined latency-availability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178627 (https://phabricator.wikimedia.org/T394057) (owner: 10RLazarus) [08:14:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11084833 (10phaultfinder) [08:23:18] (03CR) 10Jelto: [C:03+1] "lgtm now, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [08:24:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11084847 (10phaultfinder) [08:26:23] !log installing Java 8 security updates on kafka-test* [08:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:36] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [08:27:12] 10ops-eqiad, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886 (10ayounsi) 03NEW p:05Triage→03High [08:27:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [08:29:06] (03CR) 10Arnaudb: [C:03+2] nftables: throttle debugging [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [08:30:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [08:31:25] !log lsw1-d2-codfw> restart jsd gracefully [08:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:32] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [08:33:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [08:35:43] (03PS1) 10Arnaudb: Revert "nftables: throttle debugging" [puppet] - 10https://gerrit.wikimedia.org/r/1178812 [08:36:23] RESOLVED: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:38:18] (03CR) 10Arnaudb: [C:03+2] Revert "nftables: throttle debugging" [puppet] - 10https://gerrit.wikimedia.org/r/1178812 (owner: 10Arnaudb) [08:44:01] (03CR) 10Btullis: [C:03+1] dse-k8s: disable dse-k8s-codfw bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1178534 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [08:44:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [08:45:40] (03CR) 10Stevemunene: [C:03+2] dse-k8s: disable dse-k8s-codfw bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1178534 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [08:57:07] (03PS1) 10Clément Goubert: wikifunctions: Bump staging quota to 20G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178817 (https://phabricator.wikimedia.org/T401833) [08:59:06] !log uploaded openjdk-8 8u462-ga-1 to bullseye-wikimedia (backport of latest Java 8 security fixes) [08:59:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:24] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177956 (https://phabricator.wikimedia.org/T400971) (owner: 10Arnaudb) [09:01:57] (03CR) 10Joal: team-data-engineering: fixed alert HaproxykafkaNoMessages (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:02:52] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:02:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:03:10] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:03:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54680 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:03:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:06:58] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11085004 (10ayounsi) That's a bit annoying. To not waste time I've done the steps myself. But we should look at removing that blocker. 
[09:07:59] !log installing Java 8 security updates on Bullseye [09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:46] (03PS3) 10Fabfur: team-data-engineering: fixed alert HaproxykafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) [09:15:51] (03CR) 10Fabfur: team-data-engineering: fixed alert HaproxykafkaNoMessages (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:16:12] (03CR) 10Joal: [C:03+2] "LGTM! Thanks Fabrizio" [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:17:00] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host thanos-be1005.eqiad.wmnet with OS bullseye [09:17:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11085046 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host thanos-be1005.eqiad.w... 
[09:17:37] (03CR) 10Joal: [C:03+2] team-data-engineering: fixed alert HaproxykafkaNoMessages (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:17:47] (03Merged) 10jenkins-bot: team-data-engineering: fixed alert HaproxykafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:22:29] (03PS3) 10Ayounsi: Rancid: add SR-Linux support [puppet] - 10https://gerrit.wikimedia.org/r/1176216 [09:23:12] (03CR) 10Ayounsi: Rancid: add SR-Linux support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [09:23:42] (03CR) 10Ayounsi: [C:03+2] Rancid: add SR-Linux support [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [09:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11085083 (10phaultfinder) [09:30:04] (03CR) 10Clément Goubert: [C:03+2] wikifunctions: Bump staging quota to 20G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178817 (https://phabricator.wikimedia.org/T401833) (owner: 10Clément Goubert) [09:35:04] (03PS1) 10Alexandros Kosiaris: admin: Brown paper bag fix for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178821 (https://phabricator.wikimedia.org/T401833) [09:35:30] (03Abandoned) 10Clément Goubert: admin: Brown paper bag fix for wikifunctions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178821 (https://phabricator.wikimedia.org/T401833) (owner: 10Alexandros Kosiaris) [09:37:40] (03Merged) 10jenkins-bot: wikifunctions: Bump staging quota to 20G [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178817 (https://phabricator.wikimedia.org/T401833) (owner: 10Clément Goubert) [09:38:55] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[09:39:09] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage [09:39:46] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:39:56] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:41:31] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:43:01] !log installing Java 17 security updates [09:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:48] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1005.eqiad.wmnet with reason: host reimage [09:46:14] (03PS1) 10Federico Ceratto: aptrepo: relax wmfmariadbpy regexp [puppet] - 10https://gerrit.wikimedia.org/r/1178824 (https://phabricator.wikimedia.org/T397305) [09:46:14] (03CR) 10Federico Ceratto: "Relax the regexp around wmfmariadbpy: it seems that we need to match the binary package names rather than the source package. It should be" [puppet] - 10https://gerrit.wikimedia.org/r/1178824 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [09:48:19] (03CR) 10Ayounsi: Nokia ZTP: small fixes and better python script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175141 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [09:48:42] (03PS3) 10Ayounsi: Nokia ZTP: small fixes and better python script [puppet] - 10https://gerrit.wikimedia.org/r/1175141 (https://phabricator.wikimedia.org/T401013) [09:49:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11085225 (10phaultfinder) [09:51:58] (03CR) 10Ayounsi: [C:03+2] "Deploying that one, then will pause ZTP work. To be resumed once we receive more devices." 
[puppet] - 10https://gerrit.wikimedia.org/r/1175141 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [09:55:04] (03CR) 10Cathal Mooney: [C:03+1] sre.network.provision: add Nokia support [cookbooks] - 10https://gerrit.wikimedia.org/r/1175471 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [09:55:06] (03CR) 10Jelto: [C:03+1] "the regex looks reasonable for the four different packages and could explain why the packages are not updates properly. But let's see what" [puppet] - 10https://gerrit.wikimedia.org/r/1178824 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1000) [10:00:33] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1005.eqiad.wmnet with OS bullseye [10:00:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11085282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host thanos-be1005.eqiad.wmnet... [10:02:12] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11085315 (10MatthewVernon) [10:02:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11085321 (10MatthewVernon) >>! In T400877#11083360, @VRiley-WMF wrote: > thanos-be1005 controller has been swapped out. Please test... 
[10:03:13] (03PS1) 10Stevemunene: dse-k8s: setup the dse-k8s-codfw helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178827 (https://phabricator.wikimedia.org/T397297) [10:04:44] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-backup-datanode1001.eqiad.wmnet with OS bookworm [10:04:46] (03PS1) 10Brouberol: airflow: make it possible to inject datahub ingestion config files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T306903) [10:04:48] (03PS1) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) [10:06:55] (03CR) 10CI reject: [V:04-1] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) (owner: 10Brouberol) [10:09:45] (03PS2) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) [10:12:07] (03CR) 10CI reject: [V:04-1] airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) (owner: 10Brouberol) [10:12:25] (03CR) 10Btullis: airflow: make it possible to inject datahub ingestion config files in secrets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T306903) (owner: 10Brouberol) [10:13:21] (03PS3) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) [10:17:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/1178824 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [10:19:45] (03PS1) 10Btullis: Typo in site.pp matching an-backup-datanode hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178832 (https://phabricator.wikimedia.org/T397166) [10:20:02] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11085477 (10phaultfinder) [10:20:04] 06SRE, 06Traffic: Possible SSL certificate expiration - https://phabricator.wikimedia.org/T401902#11085478 (10Josve05a) [10:20:06] btullis@cumin1003 rename (PID 1976485) is awaiting input [10:22:47] (03CR) 10Btullis: [C:03+2] Typo in site.pp matching an-backup-datanode hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178832 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:23:24] (03CR) 10Btullis: [C:03+2] Add the new an-backup-datanode servers to site.pp and preseed.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1178507 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:28:15] (03PS2) 10Brouberol: airflow: make it possible to inject datahub ingestion config files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T306903) [10:28:15] (03PS4) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T306903) [10:28:49] (03CR) 10Brouberol: airflow: make it possible to inject datahub ingestion config files in secrets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T306903) (owner: 10Brouberol) [10:29:27] 06SRE, 06Traffic: Possible SSL certificate expiration - https://phabricator.wikimedia.org/T401902#11085524 (10Aklapper) Which website is this about? 
[10:32:41] (03CR) 10Btullis: airflow: make it possible to inject datahub ingestion config files in secrets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T306903) (owner: 10Brouberol) [10:33:55] (03PS3) 10Brouberol: airflow: make it possible to inject datahub ingestion config files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) [10:33:58] (03PS5) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) [10:34:01] (03PS1) 10Tiziano Fogli: krb: add principal for user cmelo [puppet] - 10https://gerrit.wikimedia.org/r/1178833 (https://phabricator.wikimedia.org/T401827) [10:34:22] (03PS1) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) [10:37:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1001.eqiad.wmnet with OS bookworm [10:37:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11085580 (10tappof) Created KRB principal and notified user by email. Submitted patch to set KRB fla... 
[10:41:05] (03CR) 10Btullis: airflow-main: define a custom superset datahub ingestion configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [10:41:21] (03PS1) 10Hokwelum: ResourceLoader: Temporily track cache usage of preloaded NS_USER title info [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178835 (https://phabricator.wikimedia.org/T393835) [10:43:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178835 (https://phabricator.wikimedia.org/T393835) (owner: 10Hokwelum) [10:45:09] (03CR) 10Btullis: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [10:45:12] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11085641 (10phaultfinder) [10:49:19] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-backup-datanode1001.eqiad.wmnet with OS bookworm [10:50:39] (03CR) 10Ayounsi: [C:03+2] sre.network.provision: add Nokia support [cookbooks] - 10https://gerrit.wikimedia.org/r/1175471 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [10:54:12] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1001.eqiad.wmnet with OS bookworm [10:55:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2219.codfw.wmnet with reason: Maintenance [10:55:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81350 and previous config saved to /var/cache/conftool/dbconfig/20250814-105514-fceratto.json [10:55:18] 
T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:55:26] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178827 (https://phabricator.wikimedia.org/T397297) (owner: 10Stevemunene) [10:58:05] (03Merged) 10jenkins-bot: sre.network.provision: add Nokia support [cookbooks] - 10https://gerrit.wikimedia.org/r/1175471 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [10:59:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1178833 (https://phabricator.wikimedia.org/T401827) (owner: 10Tiziano Fogli) [10:59:26] (03CR) 10Tiziano Fogli: [C:03+2] krb: add principal for user cmelo [puppet] - 10https://gerrit.wikimedia.org/r/1178833 (https://phabricator.wikimedia.org/T401827) (owner: 10Tiziano Fogli) [11:00:30] (03PS4) 10Muehlenhoff: Stop applying maps-admins to maps Bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) [11:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:02:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:08:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [11:08:19] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:09:35] jouncebot: nowandnext [11:09:36] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [11:09:36] In 0 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1200) [11:11:50] (03CR) 10Muehlenhoff: [C:03+2] Stop applying maps-admins to maps Bookworm roles [puppet] - 
10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:13:33] !log installing openssl security updates [11:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:49] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1001.eqiad.wmnet with reason: host reimage [11:23:24] !log copy thanos package to trixie-wikimedia T401813 [11:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:28] T401813: Migrate metricsinfra project off of Bullseye - https://phabricator.wikimedia.org/T401813 [11:23:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1001.eqiad.wmnet with reason: host reimage [11:24:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11085897 (10phaultfinder) [11:26:06] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:27:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2142.codfw.wmnet [11:28:03] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2142 - Upgrading db2142.codfw.wmnet [11:28:27] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) db2142 - Upgrading db2142.codfw.wmnet [11:28:36] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2142.codfw.wmnet [11:29:48] (03PS1) 10Muehlenhoff: Remove Ganeti role from ganeti7004 [puppet] - 10https://gerrit.wikimedia.org/r/1178839 
(https://phabricator.wikimedia.org/T394263) [11:29:49] (03PS1) 10Muehlenhoff: netbox: Remove ganeti02/magru cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178840 (https://phabricator.wikimedia.org/T394263) [11:33:00] (03PS1) 10Urbanecm: mariadb: Document GrowthExperiments tables [puppet] - 10https://gerrit.wikimedia.org/r/1178841 (https://phabricator.wikimedia.org/T399302) [11:33:11] !log mszabo Deployed security patch for T280413 [11:35:55] (03CR) 10CI reject: [V:04-1] mariadb: Document GrowthExperiments tables [puppet] - 10https://gerrit.wikimedia.org/r/1178841 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:37:59] (03PS2) 10Urbanecm: mariadb: Document GrowthExperiments tables [puppet] - 10https://gerrit.wikimedia.org/r/1178841 (https://phabricator.wikimedia.org/T399302) [11:40:44] (03CR) 10Ayounsi: [C:03+1] netbox: Remove ganeti02/magru cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178840 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:41:26] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:42:37] (03CR) 10Ladsgroup: [C:03+2] mariadb: Document GrowthExperiments tables [puppet] - 10https://gerrit.wikimedia.org/r/1178841 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:44:07] (03CR) 10Ayounsi: [C:03+1] "Don't forget to delete the VIP once done: https://netbox.wikimedia.org/ipam/ip-addresses/16701/" [puppet] - 10https://gerrit.wikimedia.org/r/1178839 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:44:30] btullis@cumin1003 reimage (PID 1981627) is awaiting input [11:45:11] (03CR) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [11:45:15] !log btullis@cumin1003 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:45:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1001.eqiad.wmnet with OS bookworm [11:45:40] (03PS1) 10Urbanecm: mariadb: Document echo_unread_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1178843 (https://phabricator.wikimedia.org/T399302) [11:46:08] (03PS6) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) [11:46:41] (03PS7) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) [11:46:47] (03CR) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [11:47:23] (03PS4) 10Brouberol: airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) [11:47:24] (03PS8) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 (https://phabricator.wikimedia.org/T309622) [11:49:30] (03PS1) 10Urbanecm: mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) [11:50:08] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11085992 (10phaultfinder) [11:50:17] (03PS1) 10Ayounsi: SR-Linux Rancid, adapt secret masking for info flat format [puppet] - 
10https://gerrit.wikimedia.org/r/1178847 [11:51:55] (03CR) 10Ladsgroup: [C:04-1] mariadb: Document UrlShortener tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:52:05] (03CR) 10CI reject: [V:04-1] mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:53:08] (03CR) 10Ladsgroup: mariadb: Document echo_unread_wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178843 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:54:03] (03CR) 10Ladsgroup: [C:04-1] mariadb: Document UrlShortener tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:55:13] Amir1: fwiw, some of the growth_* tables are also marked as private (i vaguely recall new x1 tables were added to private_tables automatically, as they can't be _made_ public) [11:55:17] should i fix that too? [11:55:34] (03PS2) 10Urbanecm: mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) [11:55:42] (03CR) 10Urbanecm: mariadb: Document UrlShortener tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:55:59] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2142.codfw.wmnet [11:56:15] urbanecm: we are planning to set up x1 wikireplicas (hopefully soooon) so I think it shouldn't be marked as private for the sake of private [11:56:17] (03PS2) 10Urbanecm: mariadb: Document echo_unread_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1178843 (https://phabricator.wikimedia.org/T399302) [11:56:32] which contradicts the discussion on urlshortener [11:56:41] contradicts? 
[11:56:50] (03CR) 10Urbanecm: mariadb: Document echo_unread_wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178843 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [11:56:55] (03PS3) 10Urbanecm: mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) [11:57:00] (I asked url shortener to be private, I mean I'm contradicting myself :D) [11:57:12] kind of :-D [11:57:25] I think url shortener should be marked as partially public [11:57:34] that'll help setting up x1 to wikireplicas much easier [11:58:00] (03PS4) 10Urbanecm: mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) [11:58:03] and remove growthexperiments_* from hiera. makes sense. [11:58:23] yeah [11:58:42] PROBLEM - MariaDB Replica IO: ms1 on db1152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2142.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2142.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:43] well. all catalogued tables. 
[11:58:55] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, didn't double-check the regex in huge depth but seems ok" [puppet] - 10https://gerrit.wikimedia.org/r/1178847 (owner: 10Ayounsi) [11:59:45] (03CR) 10Ladsgroup: [C:03+2] mariadb: Document echo_unread_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1178843 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1200) [12:01:23] (03PS1) 10Urbanecm: mariadb: Remove catalogued tables from private_tables in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1178849 (https://phabricator.wikimedia.org/T399302) [12:01:28] Amir1: this sounds good? ^^ [12:02:04] (03PS2) 10Ladsgroup: mariadb: Remove catalogued tables from private_tables in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1178849 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [12:02:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178849 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [12:02:19] let me run PCC [12:03:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2142.codfw.wmnet [12:04:42] well, it wasn't useful. 
Let me double check which ones are marked as private in the catalog [12:04:42] RECOVERY - MariaDB Replica IO: ms1 on db1152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:34] LGTM [12:06:55] (03CR) 10Ladsgroup: [C:03+2] mariadb: Document UrlShortener tables [puppet] - 10https://gerrit.wikimedia.org/r/1178845 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [12:07:19] (03CR) 10Ladsgroup: [C:03+2] mariadb: Remove catalogued tables from private_tables in hiera [puppet] - 10https://gerrit.wikimedia.org/r/1178849 (https://phabricator.wikimedia.org/T399302) (owner: 10Urbanecm) [12:08:17] (03CR) 10Ayounsi: [C:03+2] SR-Linux Rancid, adapt secret masking for info flat format [puppet] - 10https://gerrit.wikimedia.org/r/1178847 (owner: 10Ayounsi) [12:09:06] urbanecm: https://going-merry.toolforge.org/?table=echo_unread_wikis [12:10:11] for some reason i read that as "going-marry"... [12:10:18] anyway, not sure what should i notice there? [12:11:42] the tool can now load the information on this table [12:11:45] since it's cataloged [12:11:57] https://onepiece.fandom.com/wiki/Going_Merry [12:11:59] ah, perfect! 
:) [12:12:55] !log installing PHP 7.4 security updates [12:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:38] (03PS2) 10Cathal Mooney: User management: create new RO login class and allow to view logs [homer/public] - 10https://gerrit.wikimedia.org/r/1176443 (https://phabricator.wikimedia.org/T401378) [12:20:13] (03CR) 10Cathal Mooney: [C:03+1] Add hostname to a couple errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1175867 (owner: 10Ayounsi) [12:22:56] (03PS2) 10Anzx: rkiwiki: set sitename, project namespace and time zone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178552 (https://phabricator.wikimedia.org/T392499) [12:23:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178552 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [12:23:36] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178858 [12:23:40] (03PS1) 10Anzx: minwikibooks: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178551 (https://phabricator.wikimedia.org/T395499) [12:23:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178551 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [12:23:58] (03PS1) 10Anzx: rkiwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178555 (https://phabricator.wikimedia.org/T392499) [12:24:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178555 
(https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [12:26:34] (03CR) 10Muehlenhoff: [C:03+2] Remove check_user script [puppet] - 10https://gerrit.wikimedia.org/r/1170096 (https://phabricator.wikimedia.org/T394072) (owner: 10Muehlenhoff) [12:27:16] XioNoX: I'll merge your SR-Linux patch along, ok? [12:27:26] moritzm: go for it [12:27:51] (03CR) 10Ayounsi: [C:03+2] Add hostname to a couple errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1175867 (owner: 10Ayounsi) [12:29:42] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1176443 (https://phabricator.wikimedia.org/T401378) (owner: 10Cathal Mooney) [12:29:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086067 (10phaultfinder) [12:31:25] XioNoX: merged [12:31:29] thx [12:35:59] (03PS2) 10Majavah: ntp: Enable IPv6 on Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) [12:36:16] (03CR) 10Majavah: ntp: Enable IPv6 on Cloud VPS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [12:39:34] (03PS1) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenter [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [12:39:44] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1152.eqiad.wmnet [12:39:59] (03CR) 10CI reject: [V:04-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenter [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:40:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.857s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:24] PROBLEM - MariaDB Replica IO: ms1 on db2142 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1152.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1152.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:59] (03PS2) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [12:43:11] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:30] (03CR) 10CI reject: [V:04-1] C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:43:42] (03CR) 10Brouberol: airflow: make it possible to inject custom files in secrets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) (owner: 10Brouberol) [12:43:42] (03PS5) 10Brouberol: airflow: make it possible to inject custom files in secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178828 (https://phabricator.wikimedia.org/T309622) [12:43:42] (03PS9) 10Brouberol: airflow-main: define a custom superset datahub ingestion configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178829 
(https://phabricator.wikimedia.org/T309622) [12:45:07] (03CR) 10Filippo Giunchedi: [C:03+1] ntp: Enable IPv6 on Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [12:45:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.857s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:47:06] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1152.eqiad.wmnet [12:47:24] RECOVERY - MariaDB Replica IO: ms1 on db2142 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:48:44] (03CR) 10Ssingh: [C:03+1] "Feel free to merge and I will take care of the NTP restarts on prod DNS." 
[puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [12:49:18] (03CR) 10Majavah: [C:03+2] ntp: Enable IPv6 on Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1178585 (https://phabricator.wikimedia.org/T401848) (owner: 10Majavah) [12:50:52] !log sudo cumin "A:dnsbox" "run-puppet-agent": T401848 [12:50:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:56] T401848: Refresh Cloud VPS NTP servers to run on Trixie and enable IPv6 - https://phabricator.wikimedia.org/T401848 [12:54:05] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-eqiad [12:54:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d6-eqiad [12:55:29] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox [12:56:29] FIRING: SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:11] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [12:57:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [12:57:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [12:57:31] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [13:00:48] (03CR) 10Federico Ceratto: [C:03+2] aptrepo: relax wmfmariadbpy regexp [puppet] - 
10https://gerrit.wikimedia.org/r/1178824 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [13:00:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:02:00] jouncebot: nowandnext [13:02:00] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [13:02:00] In 1 hour(s) and 27 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1430) [13:03:29] (03PS1) 10Majavah: hieradata: Update Cloud VPS NTP servers [puppet] - 10https://gerrit.wikimedia.org/r/1178871 (https://phabricator.wikimedia.org/T401848) [13:03:38] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts doh7002.wikimedia.org [13:04:40] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts durum7002.magru.wmnet [13:05:31] o_O [13:05:41] what happened to the deployment window [13:06:58] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:07:06] something's turned it into an UTC late window? [13:07:15] looks like https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2332515#deploycal-item-20250814T1200 broke it [13:07:25] (oops, ignore the anchor in that URL) [13:08:15] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:08:17] georgekyz: what happened there? do you remember what you did in VE which produced that? [13:08:23] Lucas_WMDE: o/ [13:09:19] jouncebot: refresh [13:09:20] I refreshed my knowledge about deployments. 
[13:09:22] jouncebot: nowandnext [13:09:22] For the next 0 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1300) [13:09:22] In 1 hour(s) and 20 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1430) [13:09:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086179 (10phaultfinder) [13:10:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086180 (10phaultfinder) [13:10:21] anyway, I guess I can deploy [13:10:35] (03PS1) 10Majavah: openstack: vendordata: Use puppet-agent as Puppet package [puppet] - 10https://gerrit.wikimedia.org/r/1178873 (https://phabricator.wikimedia.org/T401913) [13:10:42] maryum, JustHannah: are you around? [13:10:50] yes please [13:11:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178835 (https://phabricator.wikimedia.org/T393835) (owner: 10Hokwelum) [13:12:08] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [13:12:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [13:12:26] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:28] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7002.wikimedia.org [13:12:45] 06SRE, 10Ganeti, 
06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086196 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `doh7002.wikimedia.org` - doh7002.wikimedia.org (**PASS**)... [13:13:06] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [13:13:22] I think we can deploy the first change by anzx while that gate-and-submit runs [13:13:31] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] rkiwiki: set sitename, project namespace and time zone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178552 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:13:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178552 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:14:50] (03Merged) 10jenkins-bot: rkiwiki: set sitename, project namespace and time zone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178552 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:15:14] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1178552|rkiwiki: set sitename, project namespace and time zone (T392499)]] [13:15:17] T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499 [13:15:45] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7002.magru.wmnet [13:15:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086200 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: `durum7002.magru.wmnet` - durum7002.magru.wmnet (**PASS**)... 
[13:16:08] (03Merged) 10jenkins-bot: ResourceLoader: Temporily track cache usage of preloaded NS_USER title info [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1178835 (https://phabricator.wikimedia.org/T393835) (owner: 10Hokwelum) [13:16:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086204 (10ssingh) [13:17:05] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] minwikibooks: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178551 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:17:33] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1178552|rkiwiki: set sitename, project namespace and time zone (T392499)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:17:39] Lucas_WMDE: checking [13:17:50] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] rkiwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178555 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:19:14] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [13:19:38] Lucas_WMDE: looks good, ok to sync [13:19:43] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Request for Kerberos identity for querying SSAC table via statmachines - https://phabricator.wikimedia.org/T401827#11086211 (10tappof) 05Open→03Resolved a:03tappof [13:19:58] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:20:00] thanks! 
[13:20:48] (I guess we don’t need an alias for the old “"Wikipedia ဆွီးနွီးချက်" namespace name, if this is such a new wiki) [13:22:14] I wonder if maryum meant to schedule that change for the actual UTC late backport window [13:22:29] Slack claims it’s quite early in the morning for them at the moment ^^ [13:22:44] (though the CR-1 I left, see above, would still apply in that case anyway) [13:23:25] so I think my plan for the rest of this window is JustHannah’s backport and then anzx’ other two config changes (at once) [13:24:04] Lucas_WMDE: yeah old alias not required for new wiki, would probably be redundant, as wikipedia_talk redirect to project talk page [13:24:25] yeah, it’s only the combination of untranslated “Wikipedia” and translated “talk” that gets “lost” [13:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086216 (10phaultfinder) [13:25:05] (03CR) 10Muehlenhoff: [C:03+2] Remove Ganeti role from ganeti7004 [puppet] - 10https://gerrit.wikimedia.org/r/1178839 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:25:14] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178552|rkiwiki: set sitename, project namespace and time zone (T392499)]] (duration: 10m 00s) [13:25:18] T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499 [13:26:40] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1178835|ResourceLoader: Temporily track cache usage of preloaded NS_USER title info (T393835)]] [13:26:44] T393835: Explore removing WikModuleTitleInfo in ResourceLoader, in favour of standard LinkCache - https://phabricator.wikimedia.org/T393835 [13:28:52] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hokwelum: Backport for [[gerrit:1178835|ResourceLoader: Temporily track cache usage of preloaded NS_USER title info (T393835)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:29:23] JustHannah: can you test the backported change? [13:29:26] (03PS1) 10AOkoth: vrts: Create test role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1178874 (https://phabricator.wikimedia.org/T378028) [13:29:56] Lucas_WMDE: yes, I can! [13:30:04] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086228 (10phaultfinder) [13:30:17] ok! [13:30:34] (03CR) 10Muehlenhoff: [C:03+1] "I don't have 100% of the contact what Nova uses this for the, but the underlying change LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1178873 (https://phabricator.wikimedia.org/T401913) (owner: 10Majavah) [13:33:22] Lucas_WMDE: Hi, I accidentally scheduled the backport deployment in the afternoon window. [13:37:04] okay… I’m still confused how that resulted in this edit [13:37:30] (03CR) 10Muehlenhoff: [C:03+2] netbox: Remove ganeti02/magru cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178840 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:38:39] Lucas_WMDE: Thank you! Everything looks good! [13:38:45] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hokwelum: Continuing with sync [13:38:47] alright, thanks! 
[13:39:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086268 (10phaultfinder) [13:43:46] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178835|ResourceLoader: Temporily track cache usage of preloaded NS_USER title info (T393835)]] (duration: 17m 06s) [13:43:50] T393835: Explore removing WikModuleTitleInfo in ResourceLoader, in favour of standard LinkCache - https://phabricator.wikimedia.org/T393835 [13:44:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178551 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:44:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178555 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:45:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:45:48] (03Merged) 10jenkins-bot: minwikibooks: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178551 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:45:56] (03Merged) 10jenkins-bot: rkiwiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178555 (https://phabricator.wikimedia.org/T392499) (owner: 10Anzx) [13:46:21] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1178551|minwikibooks: add logos (T395499)]], [[gerrit:1178555|rkiwiki: add logos (T392499)]] [13:46:26] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [13:46:26] T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499 [13:48:42] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1178551|minwikibooks: add logos (T395499)]], [[gerrit:1178555|rkiwiki: add 
logos (T392499)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:49:30] (03CR) 10Vgutierrez: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [13:49:38] Lucas_WMDE: new logo appears, ok to sync [13:49:44] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:49:46] great, thanks! [13:51:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removed VIP for magru02 - jmm@cumin2002" [13:51:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: removed VIP for magru02 - jmm@cumin2002" [13:51:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:45] (03PS12) 10Tiziano Fogli: nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) [13:51:45] (03PS1) 10Tiziano Fogli: prometheus::alert::rule: use title to deduplicate resources [puppet] - 10https://gerrit.wikimedia.org/r/1178883 (https://phabricator.wikimedia.org/T381665) [13:55:15] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178551|minwikibooks: add logos (T395499)]], [[gerrit:1178555|rkiwiki: add logos (T392499)]] (duration: 08m 54s) [13:55:20] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [13:55:21] T392499: Post-creation work for 
rkiwiki - https://phabricator.wikimedia.org/T392499 [13:55:23] (03PS1) 10Muehlenhoff: Revert "cloudweb: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1178884 [13:55:57] !log UTC afternoon backport+config window done [13:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:08] I’ll message maryum in slack about the confusing window [13:56:25] (03CR) 10Majavah: [C:03+1] Revert "cloudweb: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1178884 (owner: 10Muehlenhoff) [13:56:25] (03CR) 10Ssingh: [C:03+1] Revert "cloudweb: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1178884 (owner: 10Muehlenhoff) [13:56:29] Lucas_WMDE: thanks for deploying, logo appears without debug enabled [13:56:51] (03CR) 10Tiziano Fogli: "Tested deduplication feature on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:57:46] np :) [13:59:53] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes rkiwiki --fix # T392499 [14:00:22] (03CR) 10Muehlenhoff: [C:03+2] Revert "cloudweb: Restrict access to Envoy port" [puppet] - 10https://gerrit.wikimedia.org/r/1178884 (owner: 10Muehlenhoff) [14:03:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [14:03:15] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:03:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2143.codfw.wmnet [14:04:04] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:04:24] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:07:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7004.magru.wmnet with OS bookworm [14:07:08] PROBLEM - MariaDB Replica IO: 
ms3 on db1153 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2143.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2143.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:07:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm [14:09:15] (03CR) 10Gergő Tisza: WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [14:10:50] (03CR) 10Lucas Werkmeister (WMDE): WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [14:10:58] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2143.codfw.wmnet [14:11:07] (03CR) 10Gergő Tisza: WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [14:12:08] RECOVERY - MariaDB Replica IO: ms3 on db1153 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:16:13] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1153.eqiad.wmnet [14:18:47] PROBLEM - MariaDB Replica IO: ms3 on db2143 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1153.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1153.eqiad.wmnet (111 Connection 
refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:20:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086434 (10phaultfinder) [14:22:11] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178666 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [14:23:22] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1153.eqiad.wmnet [14:23:47] RECOVERY - MariaDB Replica IO: ms3 on db2143 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:24:55] (03PS1) 10Muehlenhoff: Add ganeti7004 to the routed Ganeti cluster in magru [puppet] - 10https://gerrit.wikimedia.org/r/1178887 (https://phabricator.wikimedia.org/T394263) [14:25:21] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086449 (10MoritzMuehlenhoff) [14:27:16] (03CR) 10Ayounsi: [C:03+1] Add ganeti7004 to the routed Ganeti cluster in magru [puppet] - 10https://gerrit.wikimedia.org/r/1178887 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:28:31] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [14:28:31] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:29:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [14:29:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:29:50] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401924 (10phaultfinder) 03NEW [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1430) [14:30:37] !log 
jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [14:31:47] !log installing libxml2 security updates [14:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:01] (03CR) 10Federico Ceratto: [C:03+1] "Hostnames are matching, object list matches `df -hl` as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1178557 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [14:32:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [14:32:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [14:33:24] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [14:34:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [14:35:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [14:41:07] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [14:43:23] (03CR) 10CDobbins: [V:03+1] dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:44:51] 10ops-codfw, 06DC-Ops: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401926 (10phaultfinder) 03NEW [14:51:24] (03CR) 10MVernon: [C:03+2] thanos: add thanos-be1005 (JBOD), drain thanos-be2005 [puppet] - 10https://gerrit.wikimedia.org/r/1178557 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [14:53:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11086556 (10MatthewVernon) [14:54:18] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [14:55:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7004.magru.wmnet with OS bookworm [14:55:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#11086568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm completed: - ganeti7... [14:56:42] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11086574 (10VRiley-WMF) @ayounsi This seems to be powered on when I checked this. Is it still showing down? [14:57:12] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11086575 (10VRiley-WMF) a:03VRiley-WMF [14:57:15] (03PS33) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [14:58:06] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:00:05] jeena and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1500) [15:01:02] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11086595 (10VRiley-WMF) @LSobanski it seems that Luca is out, but you were listed on the original install ticket for this unit. Would you be able to assist in scheduling time for this maintenance? [15:01:12] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4004 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:01:34] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" [15:01:39] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" [15:01:39] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:01:53] (03PS1) 10Peter Fischer: Add AirFlow connection configuration for kafka_test_eqiad_external [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178889 (https://phabricator.wikimedia.org/T372912) [15:01:53] ^ the NTP one should be resolved soon once the cookbook reaches dns4004. 
it's a NOOP change, so nothing to worry [15:03:16] (03CR) 10Brouberol: [C:03+1] Add AirFlow connection configuration for kafka_test_eqiad_external [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178889 (https://phabricator.wikimedia.org/T372912) (owner: 10Peter Fischer) [15:05:07] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086611 (10phaultfinder) [15:05:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086612 (10phaultfinder) [15:05:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T401611#11086613 (10VRiley-WMF) a:03VRiley-WMF [15:05:25] (03CR) 10Brouberol: [C:03+2] Add AirFlow connection configuration for kafka_test_eqiad_external [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178889 (https://phabricator.wikimedia.org/T372912) (owner: 10Peter Fischer) [15:08:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T401611#11086619 (10VRiley-WMF) [15:10:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [15:10:46] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:11:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [15:11:28] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:11:53] (03PS1) 10Vgutierrez: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 
(https://phabricator.wikimedia.org/T398869) [15:13:27] (03PS2) 10Vgutierrez: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) [15:14:31] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1151.eqiad.wmnet [15:14:50] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:14:50] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7001 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:14:58] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:15:12] (03CR) 10Vgutierrez: Introduce v1 xLab / MPIC SLOs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:15:13] ^ expected, resolving, nothing to worry about (NOOP) [15:15:15] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2144.codfw.wmnet [15:17:21] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1116 to an-backup-datanode1046 [15:17:41] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [15:19:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11086657 (10Jhancock.wm) @MatthewVernon @Papaul is gonna be taking over for me on this next week. I'll be OoO. 
[15:21:34] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1151.eqiad.wmnet [15:22:10] PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error connecting to master repl2024@db2144.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2144.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:22:21] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2144.codfw.wmnet [15:22:36] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6001 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:23:10] RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:23:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:24] btullis@cumin1003 rename (PID 2014818) is awaiting input [15:24:08] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401924#11086671 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm moved power of a server that was just racked to a new breaker. alert cleared. 
[15:26:06] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:26:54] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns5004 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:27:00] PROBLEM - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is CRITICAL: CRITICAL: Service ntpsec.service has not been restarted after /etc/ntpsec/ntp.conf was changed (gt 2h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:27:11] ^ resolving soon :) [15:27:26] (03CR) 10Andrea Denisse: [C:03+2] centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [15:29:36] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1116 to an-backup-datanode1046 - btullis@cumin1003" [15:29:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086709 (10phaultfinder) [15:29:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1116 to an-backup-datanode1046 - btullis@cumin1003" [15:29:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:57] !log btullis@cumin1003 
START - Cookbook sre.dns.wipe-cache an-backup-datanode1046 on all recursors [15:30:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1046 on all recursors [15:30:01] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1046 [15:30:25] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [15:31:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1046 [15:31:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1116 to an-backup-datanode1046 [15:32:46] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1115 to an-backup-datanode1045 [15:32:54] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [15:35:05] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401930 (10phaultfinder) 03NEW [15:36:08] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:37:04] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1115 to an-backup-datanode1045 - btullis@cumin1003" [15:40:09] btullis@cumin1003 rename (PID 2017588) is awaiting input [15:40:51] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11086758 (10VRiley-WMF) Looking at the unit, reseated power on the unit, and reseated the power supply on the switch as well. Swapped out the power supply with a backup. Checked it, and it seems like... 
[15:41:18] (03PS1) 10Andrea Denisse: Revert "centrallog: Add sampling rules for debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/1178897 [15:41:28] (03CR) 10Andrea Denisse: [C:03+2] Revert "centrallog: Add sampling rules for debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/1178897 (owner: 10Andrea Denisse) [15:41:30] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "centrallog: Add sampling rules for debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/1178897 (owner: 10Andrea Denisse) [15:41:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1115 to an-backup-datanode1045 - btullis@cumin1003" [15:41:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:45] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1045 on all recursors [15:41:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1045 on all recursors [15:41:48] !log jhancock@cumin1002 START - Cookbook sre.dns.netbox [15:41:49] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1045 [15:44:54] btullis@cumin1003 rename (PID 2017588) is awaiting input [15:45:11] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2050 to codfw - jhancock@cumin1002" [15:45:16] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2050 to codfw - jhancock@cumin1002" [15:45:16] !log jhancock@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:50] !log jhancock@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es2050 [15:45:56] RECOVERY - Check if ntpsec.service has been 
restarted after /etc/ntpsec/ntp.conf was changed on dns5004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [15:45:59] !log jhancock@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2050 [15:46:20] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host es2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:46:22] (03PS1) 10Zabe: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) [15:48:15] (03PS2) 10Zabe: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) [15:48:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:49:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:50:11] 10ops-codfw, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401931 (10phaultfinder) 03NEW [15:50:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:51:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:51:40] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1045 [15:52:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1115 to an-backup-datanode1045 [15:54:01] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [15:54:23] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not 
powered - https://phabricator.wikimedia.org/T401886#11086806 (10VRiley-WMF) p:05High→03Medium Changing the status to medium for now. Will need to obtain a replacement PEM. However, the unit is completely healthy now. [15:54:28] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1114 to an-backup-datanode1044 [15:54:46] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [15:54:53] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401930#11086809 (10phaultfinder) [15:55:27] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [15:55:41] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [15:58:37] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1114 to an-backup-datanode1044 - btullis@cumin1003" [15:58:38] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:59:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1114 to an-backup-datanode1044 - btullis@cumin1003" [15:59:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:59:00] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1044 on all recursors [15:59:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1044 on all recursors [15:59:04] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1044 [16:00:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T401863#11086837 (10phaultfinder) [16:00:05] jhathaway and moritzm: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1044 [16:00:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1114 to an-backup-datanode1044 [16:01:26] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:01:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:39] (03PS1) 10Brouberol: airflow-dev: don't report dag runs to datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178905 (https://phabricator.wikimedia.org/T401932) [16:03:13] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrading to Java 11.0.28 - eevans@cumin1002 [16:05:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [16:05:38] (03CR) 10Btullis: [C:03+1] airflow-dev: don't report dag runs to datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178905 
(https://phabricator.wikimedia.org/T401932) (owner: 10Brouberol) [16:06:22] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1113 to an-backup-datanode1043 [16:06:42] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:17] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1113 to an-backup-datanode1043 - btullis@cumin1003" [16:10:42] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host es2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:12:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1113 to an-backup-datanode1043 - btullis@cumin1003" [16:12:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:11] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1043 on all recursors [16:12:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1043 on all recursors [16:12:14] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1043 [16:13:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1043 [16:13:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1113 to an-backup-datanode1043 [16:14:44] !log jhancock@cumin1002 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:15:21] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1112 to an-backup-datanode1042 [16:15:41] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:17:04] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns6002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:17:31] !log jhancock@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2050'] [16:17:41] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2050'] [16:18:19] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host es2050.codfw.wmnet with OS bookworm [16:18:33] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11086907 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host es2050.codfw.wmnet with OS bookworm [16:21:23] btullis@cumin1003 rename (PID 2021920) is awaiting input [16:24:18] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401863#11086933 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm balanced power across breakers. alert cleared. 
[16:26:13] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401930#11086938 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:27:18] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1112 to an-backup-datanode1042 - btullis@cumin1003" [16:27:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1112 to an-backup-datanode1042 - btullis@cumin1003" [16:27:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:47] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1042 on all recursors [16:27:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1042 on all recursors [16:27:51] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1042 [16:28:00] 10ops-codfw, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401931#11086944 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm removed limits. need to work with networks to fix the underlying issue. Will add limiters back afterwards. [16:28:31] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c7-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401926#11086951 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm need to move server that was just racked to a new rack. Rack is out of power. 
[16:30:12] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086957 (10Jhancock.wm) check [16:30:54] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401867#11086958 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:30:57] btullis@cumin1003 rename (PID 2021920) is awaiting input [16:32:18] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937 (10Papaul) 03NEW [16:32:34] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7001 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:32:41] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11086974 (10Papaul) p:05Triage→03Medium [16:33:15] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11086976 (10Jhancock.wm) i may remove reporting on some of these that are making issues so as not to ticket spam. will check back for new tickets tomorrow morning and make adjustments. [16:33:33] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11086977 (10bd808) >>! In T400119#11084530, @Samwilson wrote: > ~~Will GitLab CI be excluded from this policy?~~ I know you added the ignore edit to this, but as this thread is... 
[16:33:54] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11086979 (10Jhancock.wm) [16:41:15] (03CR) 10Andrew Bogott: [C:03+2] "I'm not 100% sure that this won't break building of new base images but it's a lot easier to merge and test than to test in place." [puppet] - 10https://gerrit.wikimedia.org/r/1178873 (https://phabricator.wikimedia.org/T401913) (owner: 10Majavah) [16:42:09] (03PS4) 10BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) [16:42:10] (03CR) 10BCornwall: acme-chief: Move clean-stale-certs to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [16:43:22] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6581/co" [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [16:43:41] (03Abandoned) 10Ottomata: Add support for airflow filesystem backend variables [puppet] - 10https://gerrit.wikimedia.org/r/811986 (https://phabricator.wikimedia.org/T309622) (owner: 10Ottomata) [16:43:45] (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178914 (https://phabricator.wikimedia.org/T399579) [16:44:14] (03PS1) 10BryanDavis: developer-portal: Bump to 2025-08-14-122553-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178915 [16:44:43] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:52] 
jouncebot: nowandnext [16:44:53] For the next 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1600) [16:44:53] In 0 hour(s) and 15 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1700) [16:44:53] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1700) [16:45:11] (03CR) 10Zabe: [C:03+2] Stop writing to cl_to and cl_collation on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178914 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [16:45:12] 10ops-codfw, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938 (10phaultfinder) 03NEW [16:46:10] (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178914 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [16:46:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T401611#11087018 (10VRiley-WMF) 05Open→03Resolved [16:46:41] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1178914|Stop writing to cl_to and cl_collation on medium wikis (T399579)]] [16:46:45] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [16:48:02] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns7002 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [16:48:12] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [16:48:55] !log zabe@deploy1003 zabe: Backport for [[gerrit:1178914|Stop writing to cl_to and cl_collation on medium wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:49:50] !log zabe@deploy1003 zabe: Continuing with sync [16:49:58] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2025-08-14-122553-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178915 (owner: 10BryanDavis) [16:51:32] (03Merged) 10jenkins-bot: developer-portal: Bump to 2025-08-14-122553-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178915 (owner: 10BryanDavis) [16:55:04] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178914|Stop writing to cl_to and cl_collation on medium wikis (T399579)]] (duration: 08m 23s) [16:55:08] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [16:56:44] (03CR) 10BryanDavis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [16:57:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11087058 (10VRiley-WMF) @Marostegui This will be for the install of all 9 of these servers? 1049 - 1057? The ticket only lists 1049 - 1053. I didn't know if the rest were going... [17:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1700) [17:00:17] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:00:39] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:00:51] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:01:26] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:01:28] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [17:01:55] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:02:19] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:05:34] (03CR) 10Cathal Mooney: [C:03+2] User management: create new RO login class and allow to view logs [homer/public] - 10https://gerrit.wikimedia.org/r/1176443 (https://phabricator.wikimedia.org/T401378) (owner: 10Cathal Mooney) [17:06:12] (03Merged) 10jenkins-bot: User management: create new RO login class and allow to view logs [homer/public] - 10https://gerrit.wikimedia.org/r/1176443 (https://phabricator.wikimedia.org/T401378) (owner: 10Cathal Mooney) [17:10:14] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6582/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [17:11:19] (03PS1) 10Dzahn: zuul::main: add a httpd with proxy modules loaded [puppet] - 10https://gerrit.wikimedia.org/r/1178918 (https://phabricator.wikimedia.org/T395938) [17:12:18] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:21:03] 
(CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1178918/6583/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1178918 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [17:23:05] (CR) CDobbins: [V:+1] dnsrecursor: add recursor.yml.erb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins) [17:23:12] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11087133 (cmooney) Hey @VRiley-WMF just a reminder to update me about port 42 on cloudsw1-d5-eqiad. Currently has config on it for cloudcephosd1046 but in Netbox there is no cable attached... [17:24:54] ops-codfw, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087136 (phaultfinder) [17:25:28] (PS1) Dzahn: zuul::main: configure envoy ports [puppet] - https://gerrit.wikimedia.org/r/1178921 [17:27:14] (CR) Dzahn: "this is more FYI for the reviewers. I will compile it on all to show nothing changes." [puppet] - https://gerrit.wikimedia.org/r/1178619 (owner: Dzahn) [17:28:09] (PS2) Dzahn: zuul::main: configure envoy ports [puppet] - https://gerrit.wikimedia.org/r/1178921 [17:32:50] (CR) Dzahn: [C:+2] zuul::main: configure envoy ports [puppet] - https://gerrit.wikimedia.org/r/1178921 (owner: Dzahn) [17:37:20] !log jhancock@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2050.codfw.wmnet with OS bookworm [17:37:27] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11087174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host es2050.codfw.wmnet with OS bookworm executed with errors: - es2050 (*... 
[17:39:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087178 (10phaultfinder) [17:40:00] 10ops-eqiad, 06SRE, 06DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11087179 (10VRiley-WMF) Thanks for this information! Yes, this will be compatible with it. The only impact that it would have would be the clockspeed... [17:47:15] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1042 [17:47:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1112 to an-backup-datanode1042 [17:51:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:06] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T1800) [18:01:48] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178924 (https://phabricator.wikimedia.org/T396375) [18:01:49] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178924 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:02:42] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178924 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:07:49] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrading to Java 11.0.28 - eevans@cumin1002 [18:08:26] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81357 and previous config saved to /var/cache/conftool/dbconfig/20250814-180825-fceratto.json [18:08:30] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:08:31] (CR) RLazarus: thanos: Add recording rules for xlab SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez) [18:10:26] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.14 refs T396375 [18:10:30] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [18:10:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:41] (CR) RLazarus: Introduce v1 xLab / MPIC SLOs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt) [18:13:06] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:00] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54680 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:15:47] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11087349 (10cmooney) >>! In T400783#11085004, @ayounsi wrote: > That's a bit annoying. To not waste time I've done the steps myself. But we should look at removing that blocker. Should be sorted no... [18:20:55] (03PS1) 10Robertsky: update wikimaniawiki extendedconfirmed promotion config: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 [18:22:50] (03PS2) 10Robertsky: update wikimaniawiki extendedconfirmed promotion config: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) [18:23:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P81358 and previous config saved to /var/cache/conftool/dbconfig/20250814-182333-fceratto.json [18:28:45] (03PS5) 10Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [18:28:52] (03PS6) 10Dzahn: zuul::main: allow caching layer to connect to https backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [18:29:23] (03PS7) 10Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) [18:29:26] (03PS1) 10Robertsky: wikimaniawiki: remove 2026-2028 namespace protection [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1178929 (https://phabricator.wikimedia.org/T401948) [18:29:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:31:01] (03CR) 10Dzahn: [C:03+2] zuul::main: allow caching layer to connect to http backend [puppet] - 10https://gerrit.wikimedia.org/r/1178093 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [18:34:20] (03PS3) 10Robertsky: wikimaniawiki: update extendedconfirmed promotion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) [18:35:32] (03PS4) 10Robertsky: wikimaniawiki: update extendedconfirmed promotion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) [18:36:38] (03CR) 10Chlod Alejandro: [C:03+1] wikimaniawiki: remove 2026-2028 namespace protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178929 (https://phabricator.wikimedia.org/T401948) (owner: 10Robertsky) [18:38:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P81359 and previous config saved to /var/cache/conftool/dbconfig/20250814-183840-fceratto.json [18:38:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178929 (https://phabricator.wikimedia.org/T401948) (owner: 10Robertsky) [18:39:04] (03PS1) 10Andrea Denisse: centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1178932 (https://phabricator.wikimedia.org/T383309) [18:39:56] 10ops-codfw, 06SRE, 
DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087440 (phaultfinder) [18:40:00] (CR) Chlod Alejandro: [C:+1] wikimaniawiki: update extendedconfirmed promotion config [mediawiki-config] - https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) (owner: Robertsky) [18:40:22] (CR) Andrea Denisse: "Hi folks, I had to revert this change as I didn't had SSH access after merging due to an issue on my side." [puppet] - https://gerrit.wikimedia.org/r/1178932 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [18:40:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) (owner: Robertsky) [18:42:02] (CR) Andrea Denisse: [C:+2] centrallog: Add sampling rules for debug logging [puppet] - https://gerrit.wikimedia.org/r/1178932 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [18:48:13] SRE, Traffic, Patch-For-Review: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#11087466 (ssingh) [18:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:53:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T399249)', diff saved to https://phabricator.wikimedia.org/P81360 and previous config saved to /var/cache/conftool/dbconfig/20250814-185348-fceratto.json [18:53:52] T399249: Add cl_timestamp_id index 
to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:54:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2236.codfw.wmnet with reason: Maintenance [18:54:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T399249)', diff saved to https://phabricator.wikimedia.org/P81361 and previous config saved to /var/cache/conftool/dbconfig/20250814-185410-fceratto.json [18:54:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:01:42] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:02:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host es2049.codfw.wmnet with OS bookworm [19:02:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11087510 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host es2049.codfw.wmnet with OS bookworm [19:08:37] (03PS1) 10Dzahn: zuul::main: allow deployment hosts to speak http to it for testing [puppet] - 10https://gerrit.wikimedia.org/r/1178939 (https://phabricator.wikimedia.org/T395938) [19:17:08] (03PS18) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:17:25] (03CR) 10Bking: cirrussearch: Fix logstash/log4j config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [19:17:34] 
(CR) CI reject: [V:-1] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [19:19:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087553 (phaultfinder) [19:21:03] (PS19) Bking: cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:21:29] (CR) CI reject: [V:-1] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [19:23:40] (PS20) Bking: cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:24:07] (CR) CI reject: [V:-1] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [19:25:44] (PS1) Pppery: MediaWiki.org: Restrict creation of empty categories using Translate [mediawiki-config] - https://gerrit.wikimedia.org/r/1178941 (https://phabricator.wikimedia.org/T401878) [19:26:06] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:26:16] (PS2) Pppery: MediaWiki.org: Restrict creation of empty categories using Translate [mediawiki-config] - https://gerrit.wikimedia.org/r/1178941 (https://phabricator.wikimedia.org/T401878) [19:28:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the 
[Thursday, August 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178941 (https://phabricator.wikimedia.org/T401878) (owner: 10Pppery) [19:28:44] (03PS21) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:31:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [19:33:09] (03PS22) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:34:12] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [19:38:22] (03PS23) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [19:41:07] (03CR) 10Dzahn: "the regex in site.pp seems wrong. [] should be () ?" [puppet] - 10https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195) (owner: 10Marostegui) [19:42:15] (03PS1) 10Dzahn: site: fix a broken regex [puppet] - 10https://gerrit.wikimedia.org/r/1178947 [19:42:44] it seems entire site.pp is broken [19:47:43] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.28 - eevans@cumin1002 [19:51:06] (03CR) 10Cathal Mooney: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178947 (owner: 10Dzahn) [19:55:06] (03CR) 10Dzahn: [C:03+2] site: fix a broken regex [puppet] - 10https://gerrit.wikimedia.org/r/1178947 (owner: 10Dzahn) [19:55:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1128 - https://phabricator.wikimedia.org/T401504#11087634 (10wiki_willy) Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128 >>! In T401504#11086594, @VRiley-WMF wrote: > @LSobanski it seems that Luca is out, but yo... [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T2000). [20:00:05] maryum, robertsky, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] here [20:00:13] here [20:00:19] here [20:00:59] Should I use spiderpig to deploy myself? [20:01:22] (03CR) 10Dzahn: "deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178947" [puppet] - 10https://gerrit.wikimedia.org/r/1172192 (https://phabricator.wikimedia.org/T400195) (owner: 10Marostegui) [20:05:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2049.codfw.wmnet with OS bookworm [20:05:20] maryum: you can if you want to! [20:05:23] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11087640 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host es2049.codfw.wmnet with OS bookworm executed with errors: - es2049 (**F... 
[20:06:19] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11087641 (10KFrancis) The NDA has been signed. Thanks! [20:06:53] Pppery: robertsky maryum you can ping me if you need any help deploying [20:07:05] jeena will do thanks [20:07:14] (03PS24) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [20:07:35] I'm going to go ahead and +2 my patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1174048) and then use spiderpig [20:07:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [20:09:08] jeena, yeah. will need help with deploying my patches. i have no deploy rights. [20:09:28] (03CR) 10Mstyles: [C:03+2] WebAuthn: Limit passkeys to roaming (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [20:09:52] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087642 (10phaultfinder) [20:10:17] (03Merged) 10jenkins-bot: WebAuthn: Limit passkeys to roaming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [20:10:46] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sat 30 Aug 2025 08:10:00 PM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:11:09] !log mstyles@deploy1003 Started scap sync-world: Backport for [[gerrit:1174048|WebAuthn: Limit passkeys to roaming (T399665)]] [20:11:13] T399665: Restrict WebAuthn to hardware security keys only - https://phabricator.wikimedia.org/T399665 [20:11:17] just learned that you don't need to merge before using spiderpig [20:11:21] robertsky okay I can do yours [20:11:23] (03PS25) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [20:11:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [20:11:26] Oh yeah it will do it for you [20:11:37] And I don't have deploy rights either, so someone will need to handle mine [20:12:02] Pppery: 👍 [20:12:16] thanks! [20:13:09] !log mstyles@deploy1003 mstyles: Backport for [[gerrit:1174048|WebAuthn: Limit passkeys to roaming (T399665)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:16:24] !log mstyles@deploy1003 mstyles: Continuing with sync [20:17:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [20:20:23] (03PS26) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [20:20:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [20:21:32] !log mstyles@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174048|WebAuthn: Limit passkeys to roaming (T399665)]] (duration: 10m 23s) [20:21:36] T399665: Restrict WebAuthn to hardware security keys only - https://phabricator.wikimedia.org/T399665 [20:22:27] robertsky I'll deploy both your config changes together, is that fine with you? [20:22:36] yes [20:24:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178929 (https://phabricator.wikimedia.org/T401948) (owner: 10Robertsky) [20:24:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) (owner: 10Robertsky) [20:25:52] (03Merged) 10jenkins-bot: wikimaniawiki: remove 2026-2028 namespace protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178929 (https://phabricator.wikimedia.org/T401948) (owner: 10Robertsky) [20:25:54] (03Merged) 10jenkins-bot: wikimaniawiki: update extendedconfirmed promotion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178927 (https://phabricator.wikimedia.org/T401537) (owner: 10Robertsky) [20:26:10] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1178929|wikimaniawiki: remove 2026-2028 namespace protection (T401948)]], 
[[gerrit:1178927|wikimaniawiki: update extendedconfirmed promotion config (T401537)]] [20:26:16] T401948: wikimaniawiki: decrease future namespaces protection - https://phabricator.wikimedia.org/T401948 [20:26:16] T401537: wikimaniawiki: adjust autopromotion for extendedconfirmed group - https://phabricator.wikimedia.org/T401537 [20:27:23] (CR) Bking: cirrussearch: Fix logstash/log4j config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [20:28:11] !log jhuneidi@deploy1003 robertsky, jhuneidi: Backport for [[gerrit:1178929|wikimaniawiki: remove 2026-2028 namespace protection (T401948)]], [[gerrit:1178927|wikimaniawiki: update extendedconfirmed promotion config (T401537)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:24] (CR) Dzahn: [C:+2] zuul::main: allow deployment hosts to speak http to it for testing [puppet] - https://gerrit.wikimedia.org/r/1178939 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [20:29:31] (PS2) Dzahn: zuul::main: allow deployment hosts to speak http to it for testing [puppet] - https://gerrit.wikimedia.org/r/1178939 (https://phabricator.wikimedia.org/T395938) [20:30:35] jeena: looks ok for 1178929, 1178927 will require next user edits to check for autopromotion, let's assume that the changes are fine for now. 
[20:30:57] okay, I'll continue the backport [20:31:10] !log jhuneidi@deploy1003 robertsky, jhuneidi: Continuing with sync [20:32:19] (03CR) 10Dzahn: [C:03+2] zuul::main: allow deployment hosts to speak http to it for testing [puppet] - 10https://gerrit.wikimedia.org/r/1178939 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:33:27] (03PS27) 10Bking: cirrussearch: Fix logstash/log4j config [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) [20:35:31] (03PS1) 10Dzahn: create zuul.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1178957 (https://phabricator.wikimedia.org/T395938) [20:36:04] (03PS2) 10Dzahn: create zuul.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1178957 (https://phabricator.wikimedia.org/T395938) [20:36:23] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178929|wikimaniawiki: remove 2026-2028 namespace protection (T401948)]], [[gerrit:1178927|wikimaniawiki: update extendedconfirmed promotion config (T401537)]] (duration: 10m 12s) [20:36:28] T401948: wikimaniawiki: decrease future namespaces protection - https://phabricator.wikimedia.org/T401948 [20:36:28] T401537: wikimaniawiki: adjust autopromotion for extendedconfirmed group - https://phabricator.wikimedia.org/T401537 [20:39:47] Pppery: ready for the backport? [20:39:50] Yep [20:43:09] jeena: You there? 
[20:43:20] yeah sorry i am just about to click the button [20:43:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178941 (https://phabricator.wikimedia.org/T401878) (owner: 10Pppery) [20:44:27] (03Merged) 10jenkins-bot: MediaWiki.org: Restrict creation of empty categories using Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178941 (https://phabricator.wikimedia.org/T401878) (owner: 10Pppery) [20:44:41] !log jhuneidi@deploy1003 Started scap sync-world: Backport for [[gerrit:1178941|MediaWiki.org: Restrict creation of empty categories using Translate (T401878)]] [20:44:43] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:45] T401878: Restrict `translate-empty-category` rights on MediaWIki.org to sysop/translationadmin - https://phabricator.wikimedia.org/T401878 [20:46:37] !log jhuneidi@deploy1003 jhuneidi, pppery: Backport for [[gerrit:1178941|MediaWiki.org: Restrict creation of empty categories using Translate (T401878)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:46:43] looking [20:47:12] (CR) Dzahn: [C:+2] create zuul.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/1178957 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [20:47:26] !log dzahn@dns1004 START - running authdns-update [20:48:35] !log dzahn@dns1004 END - running authdns-update [20:48:59] (CR) Cwhite: cirrussearch: Fix logstash/log4j config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [20:50:39] Looks good, proceed [20:52:51] 👍 [20:52:56] !log jhuneidi@deploy1003 jhuneidi, pppery: Continuing with sync [20:58:17] !log jhuneidi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178941|MediaWiki.org: Restrict creation of empty categories using Translate (T401878)]] (duration: 13m 36s) [20:58:21] T401878: Restrict `translate-empty-category` rights on MediaWIki.org to sysop/translationadmin - https://phabricator.wikimedia.org/T401878 [20:59:38] (PS1) Jdlrobson: Restore access for Jon [puppet] - https://gerrit.wikimedia.org/r/1178960 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250814T2100) [21:00:29] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11087775 (RobH) [21:00:37] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11087776 (RobH) p:Triage→Medium a:VRiley-WMF→None [21:03:11] FIRING: [2x] SystemdUnitFailed: opensearch.service on relforge1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:21] Pppery: I'm seeing some new errors and I'm not sure if they have any relation to your change: `v/w/p/s/Wt2H/T/TemplateHandler:803 PHP Warning: Attempt to read property "key" 
on null` [21:03:27] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bullseye [21:03:55] (03CR) 10Bking: cirrussearch: Fix logstash/log4j config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [21:05:25] there was only one spike and I don't see any more so far [21:05:29] That seems highly unlikely to be related [21:05:35] That's something from Parsoid [21:05:43] And all my patch did is change user rights in the Translate extension [21:05:51] okay thanks for checking [21:07:21] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for - https://phabricator.wikimedia.org/T401964 (10RobH) 03NEW [21:08:21] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for - https://phabricator.wikimedia.org/T401964#11087854 (10RobH) a:03klausman @klausman, Can you advise when would be a good time to roll through and correct the hosts listed in the task description? I'd available to handle the cookbook run and c... [21:08:54] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for ml-lab and ml-serve hosts - https://phabricator.wikimedia.org/T401964#11087857 (10RobH) p:05Triage→03Medium [21:11:08] Hey all - would like to get a security patch out right now, unless there are any objections. 
[21:12:11] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:12:27] sbassett: backports are finished [21:19:28] 10ops-eqiad, 06SRE, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966 (10RobH) 03NEW [21:22:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11087928 (10RobH) [21:23:41] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: PXE provision script needed for data-persistence hosts - https://phabricator.wikimedia.org/T401966#11087932 (10RobH) a:03Marostegui @Marostegui: Would you be able to advise on behalf of #data-persistence a schedule for updating the hosts in the task descr... [21:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11087937 (10RobH) [21:24:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11087941 (10phaultfinder) [21:24:59] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11087942 (10RobH) [21:27:07] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11087944 (10Andrew) [21:29:21] (03CR) 10Cwhite: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: 10Bking) [21:30:46] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage [21:31:54] jeena: ok, thanks. 
Turns out the sec patch isn’t quite ready to go today :) [21:33:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1043.eqiad.wmnet with reason: host reimage [21:39:11] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:41:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host es2049.codfw.wmnet with OS bookworm [21:41:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11087961 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host es2049.codfw.wmnet with OS bookworm [21:43:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2049.codfw.wmnet with reason: host reimage [21:48:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2049.codfw.wmnet with reason: host reimage [21:51:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_lldpd.service on install1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1043.eqiad.wmnet with OS bullseye [21:52:41] (03CR) 10Cwhite: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178631 (owner: 10Krinkle) [21:53:00] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.28 - eevans@cumin1002 [21:53:20] (03PS1) 10Andrea Denisse: centrallog: Remove debug sampling [puppet] - 10https://gerrit.wikimedia.org/r/1178969 (https://phabricator.wikimedia.org/T383309) [21:59:39] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on an-worker1190:9290 - https://phabricator.wikimedia.org/T401969 (10phaultfinder) 03NEW [22:04:22] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, and 2 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11088017 (10Tgr) @Joe which format do you think woul... [22:07:44] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:09:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:09:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2049.codfw.wmnet with OS bookworm [22:09:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11088031 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host es2049.codfw.wmnet with OS bookworm completed: - es2049 (**PASS**) -... 
[22:14:03] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Java 11.0.28 - eevans@cumin1002 [22:15:13] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088035 (10phaultfinder) [22:17:34] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11088050 (10Papaul) We were getting the error below while re-image es2049 my thinking was that the entry in site.pp for the new es host was not right to be sure i ping @Dzahn to... [22:19:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Arelion - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:19:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:24:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:24:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:28:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:37:26] (03PS1) 10Cwhite: DiskSpace: add DiskSpace critical alert [alerts] - 10https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) [22:48:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:03:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:09:35] (03PS1) 10Cwhite: resources: Exclude docker|containerd|kubelet mounts from alerts [alerts] - 10https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) [23:11:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:20:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b8-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401938#11088142 (10phaultfinder) [23:25:18] (03PS1) 10Dzahn: zuul::main: add website config with 
proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/1178985 (https://phabricator.wikimedia.org/T395938) [23:27:42] (03CR) 10CI reject: [V:04-1] zuul::main: add website config with proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/1178985 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:28:36] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:28:36] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:32] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:35] (03PS2) 10Dzahn: zuul::main: add website config with proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/1178985 (https://phabricator.wikimedia.org/T395938) [23:31:44] (03PS3) 10RLazarus: deployment_server: Add --local_dblist to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737) [23:32:37] jouncebot: nowandnext [23:32:37] No deployments scheduled for the next 6 hour(s) and 27 minute(s) [23:32:38] In 6 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250815T0600) [23:32:45] (03CR) 10Dzahn: [C:03+2] 
zuul::main: add website config with proxy settings [puppet] - 10https://gerrit.wikimedia.org/r/1178985 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:33:36] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:33:52] (03CR) 10RLazarus: [C:03+2] "Yeah, I'm going to just roll them out coordinatedly, so the gap should be just a couple of moments. But even better, we don't actually use" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178666 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [23:34:44] (putting out a MW helm patch that only affects mwscript-k8s, I'll do a scap deploy just to clean up the chart version diff) [23:36:40] (03Merged) 10jenkins-bot: mediawiki: Add support for mounting a custom dblist [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178666 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [23:37:17] (03CR) 10RLazarus: [C:03+2] "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178667 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [23:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178986 [23:38:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178986 (owner: 10TrainBranchBot) [23:39:43] (03PS1) 10Dzahn: zuul::main: fix location of httpd config file [puppet] - 10https://gerrit.wikimedia.org/r/1178987 [23:40:36] (03CR) 10Dzahn: [C:03+2] zuul::main: fix location of httpd config file [puppet] - 10https://gerrit.wikimedia.org/r/1178987 (owner: 10Dzahn) [23:41:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrading to Java 11.0.28 - eevans@cumin1002 [23:46:39] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1178666 [23:46:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Arelion (2001:2035:0:699::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:48:59] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1178666 (duration: 03m 28s) [23:49:40] done [23:49:48] (03PS1) 10Dzahn: httpbb: add test file for zuul.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1178989 (https://phabricator.wikimedia.org/T395938) [23:52:15] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1178986 (owner: 10TrainBranchBot) [23:54:03] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1178989/6586/deploy1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1178989 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:55:37] (03PS1) 
10Zabe: Reduce default recentchanges query time on large wikis to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178990 (https://phabricator.wikimedia.org/T399455)