[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175225 (owner: TrainBranchBot)
[00:03:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P80620 and previous config saved to /var/cache/conftool/dbconfig/20250804-000337-ladsgroup.json
[00:08:07] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175271
[00:08:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175271 (owner: TrainBranchBot)
[00:18:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T400854)', diff saved to https://phabricator.wikimedia.org/P80621 and previous config saved to /var/cache/conftool/dbconfig/20250804-001845-ladsgroup.json
[00:18:48] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[00:19:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance
[00:19:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T400854)', diff saved to https://phabricator.wikimedia.org/P80622 and previous config saved to /var/cache/conftool/dbconfig/20250804-001908-ladsgroup.json
[00:21:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T400854)', diff saved to https://phabricator.wikimedia.org/P80623 and previous config saved to /var/cache/conftool/dbconfig/20250804-002159-ladsgroup.json
[00:29:17] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175271 (owner: TrainBranchBot)
[00:37:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P80624 and previous config saved to /var/cache/conftool/dbconfig/20250804-003706-ladsgroup.json
[00:52:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P80625 and previous config saved to /var/cache/conftool/dbconfig/20250804-005214-ladsgroup.json
[01:00:39] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:07:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T400854)', diff saved to https://phabricator.wikimedia.org/P80626 and previous config saved to /var/cache/conftool/dbconfig/20250804-010722-ladsgroup.json
[01:07:25] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[01:07:29] (PS1) Andrew Bogott: Toolforge docker registry: add some comments to help my future self [puppet] - https://gerrit.wikimedia.org/r/1175272
[01:11:42] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 03s)
[03:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:38:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:38:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[04:35:29] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175121 (https://phabricator.wikimedia.org/T400023) (owner: Krinkle)
[04:47:54] (Merged) jenkins-bot: In sitemap responses set CC: public [core] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175121 (https://phabricator.wikimedia.org/T400023) (owner: Krinkle)
[04:48:23] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]]
[04:48:26] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[05:09:13] !log tstarling@deploy1003 krinkle, tstarling: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[05:09:16] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[05:09:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:53] (PS1) Giuseppe Lavagetto: New UX release [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1175274
[05:11:42] (CR) Giuseppe Lavagetto: [V:+2 C:+2] New UX release [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1175274 (owner: Giuseppe Lavagetto)
[05:13:02] !log tstarling@deploy1003 krinkle, tstarling: Continuing with sync
[05:13:04] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003"
[05:13:05] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003
[05:13:56] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: UX improvements - oblivian@cumin1003
[05:13:57] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "UX improvements - oblivian@cumin1003"
[05:25:26] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]] (duration: 37m 03s)
[05:25:29] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[05:27:58] PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[05:28:14] PROBLEM - Etcd cluster health on wikikube-ctrl1002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[05:28:20] PROBLEM - Etcd cluster health on wikikube-ctrl1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[05:28:20] PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[05:49:02] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:55:07] (PS2) Ayounsi: Nokia ZTP: small fixes and better python script [puppet] - https://gerrit.wikimedia.org/r/1175141 (https://phabricator.wikimedia.org/T401013)
[06:09:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:10:24] (CR) Giuseppe Lavagetto: text-frontend: enforcement of UA policy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[06:10:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-ntmkv - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[06:18:44] FIRING: [5x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:25:58] RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[06:26:14] RECOVERY - Etcd cluster health on wikikube-ctrl1002 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[06:26:20] RECOVERY - Etcd cluster health on wikikube-ctrl1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[06:26:20] RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[06:26:29] <_joe_> !log defragmented etcd k8s cluster in eqiad
[06:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:44] FIRING: [6x] KubernetesDeploymentUnavailableReplicas: Deployment helm-state-metrics in kube-system at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:33:28] <_joe_> all the k8s alerts should clear soon
[06:33:44] FIRING: [6x] KubernetesDeploymentUnavailableReplicas: Deployment helm-state-metrics in kube-system at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:34:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:46:50] <_joe_> I don't get why the unavailable replicas alert for thumbor is still firing
[06:49:04] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:49:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:50:05] <_joe_> !log deleting unhealthy thumbor pods
[06:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:31] (PS1) Ayounsi: dhcp: add support for Nokia switches [software/spicerack] - https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013)
[06:52:39] (PS2) Ayounsi: dhcp: add support for Nokia switches [software/spicerack] - https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013)
[06:58:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[06:58:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T0700). nyaa~
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:38] (CR) CI reject: [V:-1] dhcp: add support for Nokia switches [software/spicerack] - https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013) (owner: Ayounsi)
[07:04:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:39:28] (PS3) Ayounsi: dhcp: add support for Nokia switches [software/spicerack] - https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013)
[07:43:08] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[07:43:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:43:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T400854)', diff saved to https://phabricator.wikimedia.org/P80627 and previous config saved to /var/cache/conftool/dbconfig/20250804-074333-ladsgroup.json
[07:43:37] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[07:51:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T400854)', diff saved to https://phabricator.wikimedia.org/P80628 and previous config saved to /var/cache/conftool/dbconfig/20250804-075101-ladsgroup.json
[07:51:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[07:52:08] (PS4) Ayounsi: dhcp: add support for Nokia switches [software/spicerack] - https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013)
[07:54:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: STran)
[07:54:48] (PS1) Ayounsi: sre.network.provision: add Nokia support [cookbooks] - https://gerrit.wikimedia.org/r/1175471 (https://phabricator.wikimedia.org/T401013)
[07:55:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: Kosta Harlan)
[07:55:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681) (owner: Kosta Harlan)
[08:04:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:06:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P80629 and previous config saved to /var/cache/conftool/dbconfig/20250804-080608-ladsgroup.json
[08:08:08] (PS1) Clément Goubert: etcd::v3: Allow setting quota-backend-bytes [puppet] - https://gerrit.wikimedia.org/r/1175472
[08:08:12] (PS1) Clément Goubert: O:kubernetes::master_stacked: Set etcd quota to 8G [puppet] - https://gerrit.wikimedia.org/r/1175473
[08:21:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P80630 and previous config saved to /var/cache/conftool/dbconfig/20250804-082116-ladsgroup.json
[08:22:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[08:22:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80631 and previous config saved to /var/cache/conftool/dbconfig/20250804-082237-fceratto.json
[08:22:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[08:25:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80632 and previous config saved to /var/cache/conftool/dbconfig/20250804-082524-fceratto.json
[08:28:55] (CR) Vgutierrez: text-frontend: enforcement of UA policy (6 comments) [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[08:30:12] SRE, SRE-SLO, Traffic: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675#11056694 (fgiunchedi)
[08:30:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-ntmkv - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[08:34:49] (CR) Harroyo-wmf: UserInfoCard: Add config var for making UIC available (1 comment) [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: Kosta Harlan)
[08:36:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T400854)', diff saved to https://phabricator.wikimedia.org/P80633 and previous config saved to /var/cache/conftool/dbconfig/20250804-083623-ladsgroup.json
[08:36:27] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[08:36:39] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[08:36:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T400854)', diff saved to https://phabricator.wikimedia.org/P80634 and previous config saved to /var/cache/conftool/dbconfig/20250804-083646-ladsgroup.json
[08:37:18] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:39:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T400854)', diff saved to https://phabricator.wikimedia.org/P80635 and previous config saved to /var/cache/conftool/dbconfig/20250804-083921-ladsgroup.json
[08:40:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80636 and previous config saved to /var/cache/conftool/dbconfig/20250804-084032-fceratto.json
[08:45:10] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:45:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54368 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:48:33] (CR) Jelto: "one question in line, thanks for working on the rename-project issue!" [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar)
[08:52:08] (PS1) Hashar: build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475
[08:52:50] (CR) CI reject: [V:-1] build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475 (owner: Hashar)
[08:54:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P80637 and previous config saved to /var/cache/conftool/dbconfig/20250804-085430-ladsgroup.json
[08:55:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80638 and previous config saved to /var/cache/conftool/dbconfig/20250804-085540-fceratto.json
[09:02:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:04:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:53] SRE, SRE-SLO, Traffic: Page on ATS backend errors relative to traffic - https://phabricator.wikimedia.org/T400675#11056765 (Vgutierrez) This alert was created by @Joe back in the day, adding him to the discussion :)
[09:06:15] (CR) Jelto: [C:+2] Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[09:09:11] (PS2) Hashar: build: upgrade QUnit [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1175475
[09:09:29] (PS1) Gkyziridis: ml-services: Deploy latest image for langid on production. [deployment-charts] - https://gerrit.wikimedia.org/r/1175477 (https://phabricator.wikimedia.org/T400347)
[09:09:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P80639 and previous config saved to /var/cache/conftool/dbconfig/20250804-090938-ladsgroup.json
[09:10:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T399728)', diff saved to https://phabricator.wikimedia.org/P80640 and previous config saved to /var/cache/conftool/dbconfig/20250804-091048-fceratto.json
[09:10:52] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:11:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[09:11:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:11:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T399728)', diff saved to https://phabricator.wikimedia.org/P80641 and previous config saved to /var/cache/conftool/dbconfig/20250804-091128-fceratto.json
[09:14:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T399728)', diff saved to https://phabricator.wikimedia.org/P80642 and previous config saved to /var/cache/conftool/dbconfig/20250804-091413-fceratto.json
[09:19:22] (PS1) Gkyziridis: ml-services: Deploy latest images for articletopic-outlink-model on production. [deployment-charts] - https://gerrit.wikimedia.org/r/1175478 (https://phabricator.wikimedia.org/T400349)
[09:22:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:22:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[09:24:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T400854)', diff saved to https://phabricator.wikimedia.org/P80643 and previous config saved to /var/cache/conftool/dbconfig/20250804-092445-ladsgroup.json
[09:24:49] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[09:25:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[09:25:59] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[09:26:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T400854)', diff saved to https://phabricator.wikimedia.org/P80644 and previous config saved to /var/cache/conftool/dbconfig/20250804-092606-ladsgroup.json
[09:26:42] (PS1) Majavah: P:toolforge::legacy_redirector: Add NEL headers [puppet] - https://gerrit.wikimedia.org/r/1175480 (https://phabricator.wikimedia.org/T400994)
[09:27:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[09:27:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[09:28:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T400854)', diff saved to https://phabricator.wikimedia.org/P80645 and previous config saved to /var/cache/conftool/dbconfig/20250804-092836-ladsgroup.json
[09:28:56] (CR) STran: "I got the list of wikis from the task and ran the following command:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: STran)
[09:29:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80646 and previous config saved to /var/cache/conftool/dbconfig/20250804-092920-fceratto.json
[09:31:34] (CR) Giuseppe Lavagetto: text-frontend: enforcement of UA policy (6 comments) [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[09:31:39] (PS2) Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:38:48] (PS2) Btullis: Allow the Airflow webserver to support long requests [deployment-charts] - https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493)
[09:39:04] (CR) Btullis: Allow the Airflow webserver to support long requests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) (owner: Btullis)
[09:43:32] (PS1) Majavah: P:toolforge::static: Remove port from redirect header [puppet] - https://gerrit.wikimedia.org/r/1175482 (https://phabricator.wikimedia.org/T401024)
[09:43:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P80647 and previous config saved to /var/cache/conftool/dbconfig/20250804-094343-ladsgroup.json
[09:44:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80648 and previous config saved to /var/cache/conftool/dbconfig/20250804-094428-fceratto.json
[09:44:29] (PS3) Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[09:46:37] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:49:38] (CR) FNegri: [C:+1] P:toolforge::static: Remove port from redirect header [puppet] - https://gerrit.wikimedia.org/r/1175482 (https://phabricator.wikimedia.org/T401024) (owner: Majavah)
[09:53:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-vkrsk - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[09:58:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P80649 and previous config saved to /var/cache/conftool/dbconfig/20250804-095851-ladsgroup.json
[09:59:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T399728)', diff saved to https://phabricator.wikimedia.org/P80650 and previous config saved to /var/cache/conftool/dbconfig/20250804-095935-fceratto.json
[09:59:39] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728
[09:59:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[09:59:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T399728)', diff saved to https://phabricator.wikimedia.org/P80651 and previous config saved to /var/cache/conftool/dbconfig/20250804-095958-fceratto.json
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1000)
[10:02:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:02:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T399728)', diff saved to https://phabricator.wikimedia.org/P80652 and previous config saved to /var/cache/conftool/dbconfig/20250804-100237-fceratto.json
[10:03:30] (CR) Majavah: [C:+2] P:toolforge::static: Remove port from redirect header [puppet] - https://gerrit.wikimedia.org/r/1175482 (https://phabricator.wikimedia.org/T401024) (owner: Majavah)
[10:04:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:07] (CR) Vgutierrez: [C:-1] text-frontend: enforcement of UA policy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: Giuseppe Lavagetto)
[10:07:33] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:09:09] (03PS1) 10Majavah: P:toolforge::static: Produce relative redirects instead [puppet] - 10https://gerrit.wikimedia.org/r/1175483 (https://phabricator.wikimedia.org/T401024) [10:13:17] (03CR) 10Majavah: [C:03+2] P:toolforge::static: Produce relative redirects instead [puppet] - 10https://gerrit.wikimedia.org/r/1175483 (https://phabricator.wikimedia.org/T401024) (owner: 10Majavah) [10:13:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T400854)', diff saved to https://phabricator.wikimedia.org/P80653 and previous config saved to /var/cache/conftool/dbconfig/20250804-101358-ladsgroup.json [10:14:02] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:14:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:14:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T400854)', diff saved to https://phabricator.wikimedia.org/P80654 and previous config saved to /var/cache/conftool/dbconfig/20250804-101421-ladsgroup.json [10:17:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80655 and previous config saved to /var/cache/conftool/dbconfig/20250804-101745-fceratto.json [10:18:24] (03CR) 10Federico Ceratto: [C:03+2] sanitize-wiki: Support sections other than s5 [cookbooks] - 10https://gerrit.wikimedia.org/r/1167895 (https://phabricator.wikimedia.org/T399178) (owner: 10Federico Ceratto) [10:19:14] (03PS1) 10Effie Mouzeli: thumbor: add SUBPROCESS_TIMEOUT_KILL_AFTER to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175484 (https://phabricator.wikimedia.org/T374350) [10:20:14] (03PS2) 10Federico Ceratto: Add wmfmariadbpy package generation [puppet] - 10https://gerrit.wikimedia.org/r/1172025 (https://phabricator.wikimedia.org/T397305) 
[10:20:49] (03CR) 10Federico Ceratto: [C:03+2] Add wmfmariadbpy package generation [puppet] - 10https://gerrit.wikimedia.org/r/1172025 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [10:21:27] (03CR) 10Federico Ceratto: [C:03+2] "Discussed with @jelto on IRC: OK to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1172025 (https://phabricator.wikimedia.org/T397305) (owner: 10Federico Ceratto) [10:22:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T400854)', diff saved to https://phabricator.wikimedia.org/P80656 and previous config saved to /var/cache/conftool/dbconfig/20250804-102248-ladsgroup.json [10:22:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:23:42] (03CR) 10Effie Mouzeli: [C:03+1] etcd::v3: Allow setting quota-backend-bytes [puppet] - 10https://gerrit.wikimedia.org/r/1175472 (owner: 10Clément Goubert) [10:29:13] (03CR) 10Effie Mouzeli: "I would suggest we first assess why we ran out of space recently, before increasing the quota, in case we have other underlying issues suc" [puppet] - 10https://gerrit.wikimedia.org/r/1175473 (owner: 10Clément Goubert) [10:32:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80657 and previous config saved to /var/cache/conftool/dbconfig/20250804-103252-fceratto.json [10:37:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P80658 and previous config saved to /var/cache/conftool/dbconfig/20250804-103756-ladsgroup.json [10:44:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057086 (10MatthewVernon) @Jhancock.wm in case of any confusion, I would like you to swap the controller in ms-be2088 
ASAP if possible please :) [10:45:28] (03PS2) 10Effie Mouzeli: O:kubernetes::master_stacked: Set etcd quota to 8G [puppet] - 10https://gerrit.wikimedia.org/r/1175473 (https://phabricator.wikimedia.org/T401107) (owner: 10Clément Goubert) [10:46:31] (03PS3) 10Marco Fossati: image-suggestion: reconfigure for data-gateway listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [10:46:31] (03CR) 10Marco Fossati: "Naïve question: is it right to say that this doesn't change the API endpoint URL?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [10:48:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T399728)', diff saved to https://phabricator.wikimedia.org/P80659 and previous config saved to /var/cache/conftool/dbconfig/20250804-104800-fceratto.json [10:48:04] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:48:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:48:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T399728)', diff saved to https://phabricator.wikimedia.org/P80660 and previous config saved to /var/cache/conftool/dbconfig/20250804-104823-fceratto.json [10:48:36] (03CR) 10Btullis: [C:03+2] Allow the Airflow webserver to support long requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) (owner: 10Btullis) [10:50:04] (03PS2) 10Effie Mouzeli: thumbor: add SUBPROCESS_TIMEOUT_KILL_AFTER to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175484 (https://phabricator.wikimedia.org/T374350) [10:51:01] (03Merged) 10jenkins-bot: Allow the Airflow webserver to support long requests 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) (owner: 10Btullis) [10:51:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T399728)', diff saved to https://phabricator.wikimedia.org/P80661 and previous config saved to /var/cache/conftool/dbconfig/20250804-105101-fceratto.json [10:53:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P80662 and previous config saved to /var/cache/conftool/dbconfig/20250804-105303-ladsgroup.json [10:57:37] (03CR) 10Clément Goubert: [C:03+1] thumbor: add SUBPROCESS_TIMEOUT_KILL_AFTER to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175484 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli) [10:57:51] (03CR) 10Clément Goubert: [C:03+2] etcd::v3: Allow setting quota-backend-bytes [puppet] - 10https://gerrit.wikimedia.org/r/1175472 (owner: 10Clément Goubert) [10:59:51] (03PS4) 10Federico Ceratto: Add MariaDB test-s8 section VMs [puppet] - 10https://gerrit.wikimedia.org/r/1171597 (https://phabricator.wikimedia.org/T390087) [11:02:13] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: add SUBPROCESS_TIMEOUT_KILL_AFTER to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175484 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli) [11:03:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:04:18] (03Merged) 10jenkins-bot: thumbor: add SUBPROCESS_TIMEOUT_KILL_AFTER to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175484 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli) [11:04:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on 
cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80663 and previous config saved to /var/cache/conftool/dbconfig/20250804-110609-fceratto.json [11:06:11] !incidents [11:06:11] No incidents occurred in the past 24 hours for team SRE [11:07:24] (03PS1) 10Btullis: Add another 10 TB to the dumps v1 PVC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175486 (https://phabricator.wikimedia.org/T401098) [11:08:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T400854)', diff saved to https://phabricator.wikimedia.org/P80664 and previous config saved to /var/cache/conftool/dbconfig/20250804-110811-ladsgroup.json [11:08:17] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [11:08:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:08:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T400854)', diff saved to https://phabricator.wikimedia.org/P80665 and previous config saved to /var/cache/conftool/dbconfig/20250804-110834-ladsgroup.json [11:09:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:41] (03CR) 10Brouberol: [C:03+1] Add another 10 TB to the dumps v1 PVC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175486 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [11:11:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T400854)', diff saved to https://phabricator.wikimedia.org/P80666 and previous config saved to /var/cache/conftool/dbconfig/20250804-111103-ladsgroup.json [11:11:09] (03CR) 10Btullis: [C:03+2] Add another 10 TB to the dumps v1 PVC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175486 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [11:11:10] (03CR) 10Stevemunene: [C:03+2] Add keytabs for new an-druid100[67] hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [11:11:24] (03CR) 10Stevemunene: [V:03+2 C:03+2] Add keytabs for new an-druid100[67] hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [11:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:12:41] (03Merged) 10jenkins-bot: Add another 10 TB to the dumps v1 PVC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175486 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [11:13:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:15:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:16:04] !log 
btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:21:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80667 and previous config saved to /var/cache/conftool/dbconfig/20250804-112118-fceratto.json [11:21:56] (03CR) 10Slyngshede: [C:03+1] "We already have 443 open, so not really concerned about also allowing 6443." [puppet] - 10https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: 10BryanDavis) [11:25:00] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:26:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P80668 and previous config saved to /var/cache/conftool/dbconfig/20250804-112612-ladsgroup.json [11:28:51] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:34:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T399728)', diff saved to https://phabricator.wikimedia.org/P80670 and previous config saved to /var/cache/conftool/dbconfig/20250804-113625-fceratto.json [11:36:29] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:36:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [11:36:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1207 
(T399728)', diff saved to https://phabricator.wikimedia.org/P80671 and previous config saved to /var/cache/conftool/dbconfig/20250804-113649-fceratto.json [11:38:10] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:38:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:38:56] (03CR) 10Federico Ceratto: [C:03+2] thanos: drain thanos-be1005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1175120 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [11:39:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:39:14] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:39:19] (03CR) 10Federico Ceratto: [C:03+1] "The hostname matched the task description." 
[puppet] - 10https://gerrit.wikimedia.org/r/1175120 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [11:39:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T399728)', diff saved to https://phabricator.wikimedia.org/P80672 and previous config saved to /var/cache/conftool/dbconfig/20250804-113931-fceratto.json [11:41:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P80673 and previous config saved to /var/cache/conftool/dbconfig/20250804-114119-ladsgroup.json [11:42:12] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:43:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-vkrsk - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [11:49:11] (03CR) 10Stevemunene: [C:03+2] turnilo: replace turnilo druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/1171209 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [11:49:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:54:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80674 and previous config saved to /var/cache/conftool/dbconfig/20250804-115438-fceratto.json [11:56:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T400854)', diff saved to https://phabricator.wikimedia.org/P80675 and previous config saved to 
/var/cache/conftool/dbconfig/20250804-115626-ladsgroup.json [11:56:35] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [11:56:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:56:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80676 and previous config saved to /var/cache/conftool/dbconfig/20250804-115649-ladsgroup.json [11:59:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80677 and previous config saved to /var/cache/conftool/dbconfig/20250804-115917-ladsgroup.json [11:59:53] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.183s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[12:09:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80678 and previous config saved to /var/cache/conftool/dbconfig/20250804-120946-fceratto.json [12:10:38] !log depooling & restarting blazegraph on wdqs1016 (stuck for 7days) [12:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:05] (03CR) 10Elukey: [C:03+1] dhcp: add support for Nokia switches [software/spicerack] - 10https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [12:14:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P80679 and previous config saved to /var/cache/conftool/dbconfig/20250804-121424-ladsgroup.json [12:15:47] jouncebot: nowandnext [12:15:47] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [12:15:48] In 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1300) [12:17:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:22:18] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:25] !log depooling & restarting blazegraph on wdqs1011 (stuck for 3hours) [12:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency 
high: eqiad mw-parsoid releases routed via main (k8s) 1.434s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:24:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T399728)', diff saved to https://phabricator.wikimedia.org/P80680 and previous config saved to /var/cache/conftool/dbconfig/20250804-122454-fceratto.json [12:24:58] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:25:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [12:26:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:26:14] (03CR) 10MVernon: [C:03+2] thanos: drain thanos-be1005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1175120 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [12:26:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T399728)', diff saved to https://phabricator.wikimedia.org/P80681 and previous config saved to /var/cache/conftool/dbconfig/20250804-122614-fceratto.json [12:27:18] RESOLVED: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T399728)', diff saved to https://phabricator.wikimedia.org/P80682 and 
previous config saved to /var/cache/conftool/dbconfig/20250804-122855-fceratto.json [12:29:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P80683 and previous config saved to /var/cache/conftool/dbconfig/20250804-122931-ladsgroup.json [12:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.309s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:30:39] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11057352 (10Jclark-ctr) Alarms are still present, but the fans are currently spinning at normal speeds. I created a Juniper account per the registration page. My account request will be reviewed wit... 
[12:31:31] !log repooling wdqs1011 [12:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:34:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:35:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:35:38] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:35:38] (03PS1) 10Daimona Eaytoy: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175507 (https://phabricator.wikimedia.org/T397476) [12:36:56] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:37:40] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:38:42] PROBLEM - Host archiva1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.33s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:40:40] RECOVERY - Host archiva1002 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [12:40:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [12:41:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [12:44:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80684 and previous config saved to /var/cache/conftool/dbconfig/20250804-124402-fceratto.json [12:44:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T400854)', diff saved to https://phabricator.wikimedia.org/P80685 and previous config saved to /var/cache/conftool/dbconfig/20250804-124438-ladsgroup.json [12:44:42] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [12:44:48] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11057371 (10elukey) >>! In T391852#11046485, @Mvolz wrote: > If citoid calls zotero, sees a 404, and it reports a 404 back, yes, we want to count... [12:44:54] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [12:45:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T400854)', diff saved to https://phabricator.wikimedia.org/P80686 and previous config saved to /var/cache/conftool/dbconfig/20250804-124500-ladsgroup.json [12:45:20] (03CR) 10Brouberol: [C:03+2] site: assign the insetup::data_platform_ferm role to an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173905 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [12:45:22] (03CR) 10Brouberol: [C:03+2] preseed: remove an-airflow preseed mapping [puppet] - 10https://gerrit.wikimedia.org/r/1173906 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [12:45:37] (03CR) 10Brouberol: [C:03+2] an-airflow: remove roles [puppet] - 10https://gerrit.wikimedia.org/r/1173907 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [12:45:41] (03CR) 10Brouberol: [C:03+2] an-airflow: remove any role-speciific hieradata [puppet] - 
10https://gerrit.wikimedia.org/r/1173908 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [12:45:45] (03CR) 10Brouberol: [C:03+2] Remove an-airflow host-specific hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173911 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [12:46:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175507 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [12:47:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T400854)', diff saved to https://phabricator.wikimedia.org/P80687 and previous config saved to /var/cache/conftool/dbconfig/20250804-124729-ladsgroup.json [12:48:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [12:49:00] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [12:50:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [12:50:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [12:56:34] (03PS1) 10Bartosz Dziewoński: Clear edit count when unattaching local users for rename [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175511 (https://phabricator.wikimedia.org/T313900) [12:56:42] (03PS1) 10Bartosz Dziewoński: fixStuckGlobalRename: Fix using actor_id from the wrong wiki [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175512 (https://phabricator.wikimedia.org/T398177) [12:56:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" 
[extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175511 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [12:57:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175512 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [12:57:10] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11057407 (10Jclark-ctr) From what i have found online on juniper docs alarms are not cleared automatically, even if the issue resolves. The system only removes them when it re-evaluates the hardwar... [12:57:50] jouncebot: next [12:57:50] In 0 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1300) [12:58:17] i scheduled a couple of things, but i see the window is already kind of packed. i don't mind rescheduling them for the evening deploy if we run out of time [12:59:09] we’ll see how far we get [12:59:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80688 and previous config saved to /var/cache/conftool/dbconfig/20250804-125909-fceratto.json [12:59:16] (03PS1) 10Mhorsey: Add ce_event_topics to mariadb tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1175514 (https://phabricator.wikimedia.org/T399302) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1300). [13:00:04] Tran, Daimona, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] 👋 [13:00:11] o/ [13:00:18] o/ [13:00:22] I can deploy [13:01:21] let’s start with that dblist [13:01:26] 🙇 [13:01:57] claime / effie / others: just checking, is it okay to deploy despite T401107? [13:01:59] T401107: etcdserver: mvcc: database space exceeded - https://phabricator.wikimedia.org/T401107 [13:02:25] (or _joe_ who reportedly mitigated the issue) [13:02:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P80689 and previous config saved to /var/cache/conftool/dbconfig/20250804-130236-ladsgroup.json [13:02:54] <_joe_> Lucas_WMDE: yes things are ok since this morning [13:02:58] ok thanks! [13:03:07] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, diffConfig confirms nothing changes :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:03:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:03:27] let’s see how the first spiderpig of the week goes [13:05:10] (03Merged) 10jenkins-bot: Use tempaccounts.dblist to manage rollout wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:05:28] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1175062|Use tempaccounts.dblist to manage rollout wikis (T400672)]] [13:05:33] T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672 [13:06:14] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CheckUser] (wmf/1.45.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: 10Kosta Harlan) [13:07:33] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, stran: Backport for [[gerrit:1175062|Use tempaccounts.dblist to manage rollout wikis (T400672)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:46] Tran: please test :) [13:09:08] (03CR) 10Vgutierrez: "Btw this can be flagged on Gerrit adding `Depends-On: I765c6b00b15010822b200491209eb474f2034c40` to the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (owner: 10BCornwall) [13:09:23] Testing, please hold [13:09:27] ack [13:11:14] (03CR) 10Vgutierrez: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1175146 (owner: 10BCornwall) [13:12:51] (03PS1) 10Btullis: Update the dashboard for the dumps cephfs volume [alerts] - 10https://gerrit.wikimedia.org/r/1175515 (https://phabricator.wikimedia.org/T401098) [13:13:37] Looks good, thank you! [13:14:15] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, stran: Continuing with sync [13:14:17] ok, thanks! [13:14:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T399728)', diff saved to https://phabricator.wikimedia.org/P80690 and previous config saved to /var/cache/conftool/dbconfig/20250804-131417-fceratto.json [13:14:21] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:14:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:14:41] Tran: and should the next two changes (backport + config) be deployed together? or separately? 
[13:14:51] if I understand correctly, the backport does nothing on its own [13:15:09] so maybe together makes more sense [13:15:19] Yeah they can go together as I can't test the backport until the config is enabled [13:15:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:15:24] ok, thanks! [13:17:21] (03Merged) 10jenkins-bot: UserInfoCard: Add config var for making UIC available [extensions/CheckUser] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: 10Kosta Harlan) [13:17:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P80691 and previous config saved to /var/cache/conftool/dbconfig/20250804-131744-ladsgroup.json [13:21:10] Lucas_WMDE: sorry, you caught us during our lunch break :) [13:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:21:59] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175062|Use tempaccounts.dblist to manage rollout wikis (T400672)]] (duration: 16m 31s) [13:22:05] T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672 [13:22:59] (03CR) 10Btullis: [C:03+2] Upgrade the flink-operator CRDs to match the upstream resease v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173403 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [13:23:03] effie: np :P [13:23:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175042 
(https://phabricator.wikimedia.org/T398681) (owner: 10Kosta Harlan) [13:24:21] (03Merged) 10jenkins-bot: CheckUser: Make user info card feature discoverable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681) (owner: 10Kosta Harlan) [13:24:36] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1175002|UserInfoCard: Add config var for making UIC available (T400627)]], [[gerrit:1175042|CheckUser: Make user info card feature discoverable (T398681)]] [13:24:43] T400627: UserInfoCard: Provide configuration to allow the preference to be visible on wikis where it's not enabled by default - https://phabricator.wikimedia.org/T400627 [13:24:43] T398681: UserInfo: Deployment Tracker - https://phabricator.wikimedia.org/T398681 [13:26:33] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, kharlan: Backport for [[gerrit:1175002|UserInfoCard: Add config var for making UIC available (T400627)]], [[gerrit:1175042|CheckUser: Make user info card feature discoverable (T398681)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:26:48] Tran: please test :) [13:26:59] (is it just me or are changes taking less time to reach mwdebug than they used to?) [13:27:08] Testing, please hold [13:28:50] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1175515 (https://phabricator.wikimedia.org/T401098) (owner: 10Btullis) [13:29:15] (I’m aware of T378740 and I feel like it got even faster than that would account for today) [13:29:15] T378740: scap: announce testserver sync complete before running checks - https://phabricator.wikimedia.org/T378740 [13:30:13] (03Merged) 10jenkins-bot: Upgrade the flink-operator CRDs to match the upstream resease v1.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173403 (https://phabricator.wikimedia.org/T398162) (owner: 10Btullis) [13:30:47] Looks good, thank you! [13:30:51] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, kharlan: Continuing with sync [13:30:52] thanks! [13:31:10] (03PS1) 10MVernon: thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 [13:31:42] (03CR) 10CI reject: [V:04-1] thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 (owner: 10MVernon) [13:32:48] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06MediaWiki-Platform-Team, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11057528 (10Tgr) #mediawiki-platform-team will pick u... 
[13:32:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T400854)', diff saved to https://phabricator.wikimedia.org/P80692 and previous config saved to /var/cache/conftool/dbconfig/20250804-133251-ladsgroup.json [13:32:56] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [13:33:07] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [13:33:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80693 and previous config saved to /var/cache/conftool/dbconfig/20250804-133314-ladsgroup.json [13:33:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:33:18] (03CR) 10Brouberol: [C:03+2] Remove airflow-search role [puppet] - 10https://gerrit.wikimedia.org/r/1173912 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:33:27] (03CR) 10Brouberol: [C:03+2] data: remove any privilege related to airlfow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1173909 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:33:47] (03Abandoned) 10Brouberol: Remove references to deprecated airflow roles [puppet] - 10https://gerrit.wikimedia.org/r/1173913 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:33:56] (03CR) 10Brouberol: [V:03+1 C:03+2] common/search/airflow: drop hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [13:33:58] (03PS1) 10Btullis: Stop deploying mediawiki to snapshot hosts with scap [puppet] - 10https://gerrit.wikimedia.org/r/1175518 (https://phabricator.wikimedia.org/T398438) [13:34:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:34:32] (03PS2) 10MVernon: thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 [13:35:00] (03CR) 10CI reject: [V:04-1] thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 (owner: 10MVernon) [13:35:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80694 and previous config saved to /var/cache/conftool/dbconfig/20250804-133547-ladsgroup.json [13:36:25] (03PS3) 10MVernon: thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 [13:37:58] PROBLEM - Etcd cluster health on wikikube-ctrl1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:38:14] PROBLEM - Etcd cluster health on wikikube-ctrl1002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:38:20] PROBLEM - Etcd cluster health on wikikube-ctrl1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:38:20] PROBLEM - Etcd cluster health on wikikube-ctrl1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:38:31] (03PS1) 10Btullis: Set all snapshot and dumpsdata servers into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1175519 (https://phabricator.wikimedia.org/T398438) [13:39:05] Well here we go again [13:39:26] same thing happened, sharp bump in db size [13:39:29] oh no [13:39:44] (03PS2) 10Btullis: Set all snapshot and dumpsdata servers into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1175519 (https://phabricator.wikimedia.org/T398438) [13:39:45] sync k8s production is currently at 95% [13:40:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [13:40:42] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/airflow-ml: apply [13:41:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [13:41:43] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@b89eed0] (releasing): check fix for releases2003 [13:41:51] (I assume I can’t do much about it right now, otherwise let me know) [13:42:09] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@b89eed0] (releasing): check fix for releases2003 (duration: 00m 26s) [13:42:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [13:42:59] (03PS3) 10Brouberol: common/search/airflow: drop hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) [13:43:05] Lucas_WMDE: nope [13:43:10] need to compact and defrag [13:43:21] spiderpig failed now (with… enough output to overwhelm the console, seemingly) [13:43:24] ack [13:44:00] _joe_: how did you determine what revision you compacted to? [13:44:09] ok, failed is the wrong word, it just vomited output to the console but seemingly is still running [13:44:13] claiming 24% atm [13:44:25] <_joe_> claime: the last one available [13:44:39] _joe_: so the one in the endpoint status? [13:45:01] (03PS1) 10Ottomata: Disable EventgateProduceRateAnomaly for eventgate-main [alerts] - 10https://gerrit.wikimedia.org/r/1175520 (https://phabricator.wikimedia.org/T398437) [13:45:52] <_joe_> claime: yes, need me to do something? [13:45:56] _joe_: no I'm on it [13:45:58] do we need a incident doc or is T401107 enough for tracking the etcd issues? 
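[note] The exchange above settles on compacting etcd to the revision shown in the endpoint status and then defragmenting. A minimal sketch of that sequence, assuming the JSON layout printed by `etcdctl endpoint status --write-out=json`; the endpoint names and numbers in SAMPLE_STATUS are hypothetical, and taking the lowest reported revision is a conservative assumption, not something stated in the channel. The sketch only builds the command lines and does not run them.

```python
import json

# Hypothetical sample of `etcdctl endpoint status --write-out=json` output.
# Endpoint names, revisions and sizes are made up for illustration.
SAMPLE_STATUS = json.dumps([
    {"Endpoint": "https://etcd-a.example:2379",
     "Status": {"header": {"revision": 1200}, "dbSize": 2147483648}},
    {"Endpoint": "https://etcd-b.example:2379",
     "Status": {"header": {"revision": 1201}, "dbSize": 2147480000}},
])


def compaction_revision(endpoint_status_json: str) -> int:
    """Pick the revision to compact to from the endpoint status output.

    Assumption: use the lowest revision reported, so the target already
    exists on every member that answered.
    """
    statuses = json.loads(endpoint_status_json)
    return min(s["Status"]["header"]["revision"] for s in statuses)


def maintenance_commands(endpoint_status_json: str) -> list[str]:
    """Build the compact-then-defrag command sequence as plain strings."""
    statuses = json.loads(endpoint_status_json)
    commands = [f"etcdctl compact {compaction_revision(endpoint_status_json)}"]
    # Defragment one member at a time, since a member blocks reads and
    # writes while it rebuilds its backend file.
    for status in statuses:
        commands.append(f"etcdctl --endpoints={status['Endpoint']} defrag")
    return commands


if __name__ == "__main__":
    for command in maintenance_commands(SAMPLE_STATUS):
        print(command)
```

Compaction drops old key revisions but does not shrink the file on disk; the per-member defrag is what returns the freed space, which is why both steps are needed before the space quota stops being a problem.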
[13:45:59] T401107: etcdserver: mvcc: database space exceeded - https://phabricator.wikimedia.org/T401107 [13:46:06] please track in the existing task [13:46:09] Running compaction now [13:46:09] ack [13:46:28] (03CR) 10CI reject: [V:04-1] Disable EventgateProduceRateAnomaly for eventgate-main [alerts] - 10https://gerrit.wikimedia.org/r/1175520 (https://phabricator.wikimedia.org/T398437) (owner: 10Ottomata) [13:47:08] Running defrag sequentially on etcd nodes [13:47:33] ok, *now* spiderpig failed. waiting before doing anything else [13:48:05] Tran: maybe you can check if it ended up getting deployed or not? if it’s not a bother [13:48:45] (03PS2) 10Ottomata: Disable EventgateProduceRateAnomaly for eventgate-main [alerts] - 10https://gerrit.wikimedia.org/r/1175520 (https://phabricator.wikimedia.org/T398437) [13:49:14] RECOVERY - Etcd cluster health on wikikube-ctrl1002 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:49:20] RECOVERY - Etcd cluster health on wikikube-ctrl1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:49:20] RECOVERY - Etcd cluster health on wikikube-ctrl1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:49:27] ok, etcd recovered [13:49:54] let me merge something before running another deploy [13:49:58] RECOVERY - Etcd cluster health on wikikube-ctrl1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [13:50:07] effie: I'll merge an increase of the db quota to 4GB [13:50:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P80695 and previous config saved to /var/cache/conftool/dbconfig/20250804-135054-ladsgroup.json [13:50:59] (03PS3) 10Clément Goubert: O:kubernetes::master_stacked: Set etcd quota to 4G [puppet] - 10https://gerrit.wikimedia.org/r/1175473 [13:51:13] (03CR) 10Andrew Bogott: [C:03+2] Toolforge docker registry: add some comments 
to help my future self [puppet] - 10https://gerrit.wikimedia.org/r/1175272 (owner: 10Andrew Bogott) [13:51:22] Lucas_WMDE: I'll need to merge, then run puppet on all the etcd nodes before you can move forward [13:51:28] I'll tell you when it's done [13:51:55] (03CR) 10Ottomata: [C:03+2] Disable EventgateProduceRateAnomaly for eventgate-main [alerts] - 10https://gerrit.wikimedia.org/r/1175520 (https://phabricator.wikimedia.org/T398437) (owner: 10Ottomata) [13:52:21] ack [13:53:21] (03Merged) 10jenkins-bot: Disable EventgateProduceRateAnomaly for eventgate-main [alerts] - 10https://gerrit.wikimedia.org/r/1175520 (https://phabricator.wikimedia.org/T398437) (owner: 10Ottomata) [13:53:29] claime: go for it [13:53:45] (03CR) 10Clément Goubert: [C:03+2] O:kubernetes::master_stacked: Set etcd quota to 4G [puppet] - 10https://gerrit.wikimedia.org/r/1175473 (owner: 10Clément Goubert) [13:55:29] Sorry I stepped away let me recheck [13:58:15] (03CR) 10Brouberol: [C:03+1] Set all snapshot and dumpsdata servers into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1175519 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:58:25] (03CR) 10Brouberol: [C:03+1] Stop deploying mediawiki to snapshot hosts with scap [puppet] - 10https://gerrit.wikimedia.org/r/1175518 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:58:39] Lucas_WMDE From what I can tell, it seems okay. I see my preference on enwiki off of the canary. 
[13:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:06] High latency is probably due to me restarting the etcd servers [13:59:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 (https://phabricator.wikimedia.org/T400586) (owner: 10Krinkle) [14:00:18] !log T317599 start full-cluster reindex for eqiad/codfw/cloudelastic opensearch clusters [14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] T317599: Allow ^ and $ in intitle regex search - https://phabricator.wikimedia.org/T317599 [14:01:05] (03PS3) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) [14:02:19] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11057608 (10Jhancock.wm) @Marostegui do you have time to knock out es2040 this week? 
[14:02:43] Tran: ok thanks [14:02:52] jouncebot: now [14:02:52] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [14:03:08] (03CR) 10Brouberol: [C:03+2] common/search/airflow: drop hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1173910 (https://phabricator.wikimedia.org/T390941) (owner: 10Brouberol) [14:03:09] I think once we’re clear I’d try to deploy Daimona’s change, just to make sure everything is on the latest version [14:03:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057612 (10Jhancock.wm) i tried to reach out to you on Friday [14:03:22] yeah, done restarting etcds in eqiad, doing codfw now [14:03:29] but we definitely won’t have time for MatmaRex, sorry :S [14:03:44] no problem [14:03:45] (03CR) 10Btullis: [C:03+2] Stop deploying mediawiki to snapshot hosts with scap [puppet] - 10https://gerrit.wikimedia.org/r/1175518 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:03:51] i have a meeting right now anyway [14:05:04] (03CR) 10Brouberol: cumin: remove any mention of an-airflow hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [14:06:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P80696 and previous config saved to /var/cache/conftool/dbconfig/20250804-140602-ladsgroup.json [14:06:42] (03CR) 10Federico Ceratto: "I see the change switching type from a mapping to a list." 
[puppet] - 10https://gerrit.wikimedia.org/r/1175517 (owner: 10MVernon) [14:08:58] PROBLEM - Host ms-be2088 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:21] Lucas_WMDE: we should be g2g [14:10:43] (03CR) 10Btullis: [C:03+2] Set all snapshot and dumpsdata servers into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1175519 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:11:09] ok [14:11:14] thanks [14:11:16] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11057650 (10Ottomata) @CDobbins Data-Platform-Engineering doesn't manage logstash access. Perhaps #observability team? [14:11:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175507 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [14:12:22] (03Merged) 10jenkins-bot: Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175507 (https://phabricator.wikimedia.org/T397476) (owner: 10Daimona Eaytoy) [14:12:35] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1175507|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_BOTH (T397476)]] [14:12:38] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [14:13:40] 06SRE, 10LDAP-Access-Requests, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118 (10Sadiya.Mohammed_WMDE) 03NEW [14:15:28] (03CR) 10Hashar: [C:03+1] gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [14:16:41] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde: Backport 
for [[gerrit:1175507|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_BOTH (T397476)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:16:48] 06SRE, 06Data-Engineering: WE 5.4 FY 25/26: Improve automata detection at the edge and pass it to the refinery pipeline - https://phabricator.wikimedia.org/T396562#11057686 (10Ottomata) [14:16:56] Daimona: please test [14:17:07] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057689 (10MatthewVernon) >>! In T400876#11057612, @Jhancock.wm wrote: > i tried to reach out to you on Friday Ah, sorry, I left around 16:00 UTC on Fri... [14:17:17] (03CR) 10Brouberol: [C:03+2] deployment_server: remove all config related to airflow artifact deployment [puppet] - 10https://gerrit.wikimedia.org/r/1173902 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [14:17:28] (probably would’ve been smart for me to confirm that you’re still around before starting the late deployment…) [14:18:56] Looks good to me AFAICT, thank you [14:19:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057699 (10Jhancock.wm) @MatthewVernon card has been replaced [14:19:05] I'm always around :P [14:19:11] !log lucaswerkmeister-wmde@deploy1003 daimona, lucaswerkmeister-wmde: Continuing with sync [14:19:14] ok, thanks :D [14:20:42] (03PS4) 10MVernon: thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 [14:20:44] (03PS5) 10Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [14:21:08] (03CR) 10Federico Ceratto: [C:03+1] thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 (owner: 10MVernon) 
[14:21:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T400854)', diff saved to https://phabricator.wikimedia.org/P80697 and previous config saved to /var/cache/conftool/dbconfig/20250804-142109-ladsgroup.json [14:21:13] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [14:21:25] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1253.eqiad.wmnet with reason: Maintenance [14:21:30] (03PS6) 10Brouberol: cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) [14:21:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T400854)', diff saved to https://phabricator.wikimedia.org/P80698 and previous config saved to /var/cache/conftool/dbconfig/20250804-142132-ladsgroup.json [14:22:11] (03CR) 10Brouberol: cumin: remove any mention of an-airflow hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [14:22:22] RECOVERY - Host ms-be2088 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [14:22:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:23:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:23:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T399728)', diff saved to https://phabricator.wikimedia.org/P80699 and previous config saved to /var/cache/conftool/dbconfig/20250804-142314-fceratto.json [14:23:18] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - 
https://phabricator.wikimedia.org/T399728 [14:23:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:23:57] !log push pfw policies - https://phabricator.wikimedia.org/T400936 [14:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:21] Thank you! [14:25:42] (03PS2) 10Brouberol: kafka: reach out to the newly introduce spicerack.kafka.admin_client method [cookbooks] - 10https://gerrit.wikimedia.org/r/1174727 [14:25:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399728)', diff saved to https://phabricator.wikimedia.org/P80700 and previous config saved to /var/cache/conftool/dbconfig/20250804-142542-fceratto.json [14:26:15] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11057769 (10dr0ptp4kt) Thanks @ssingh! [14:26:30] (03CR) 10Ayounsi: [C:03+2] dhcp: add support for Nokia switches [software/spicerack] - 10https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [14:26:56] hmm, sync-apaches has been running for ca 2½ minutes [14:27:04] it shouldn’t take that long AFAIK [14:27:09] I hope etcd is okay [14:27:55] (03PS2) 10Effie Mouzeli: service: add discovery active/active config for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1164458 (https://phabricator.wikimedia.org/T397618) (owner: 10Hnowlan) [14:28:18] It shouldn't have apaches to sync actually [14:28:20] or nearly 0 [14:28:24] well, 7 [14:28:35] four mwdebugs that aren’t quite dead yet, and, idk what else? 
[14:28:48] (hopefully without traffic, given they’re still on php7.4 IIUC) [14:28:52] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175507|Set wgCampaignEventsCountrySchemaMigrationStage to MIGRATION_WRITE_BOTH (T397476)]] (duration: 16m 17s) [14:28:55] T397476: Country of event data migration (free text -> code; optional -> required; remove country from address) - https://phabricator.wikimedia.org/T397476 [14:29:02] ok scap said the ssh connections to snapshot* timed out… [14:29:07] output in https://spiderpig.wikimedia.org/jobs/389 [14:29:18] Lucas_WMDE: sigh which brings to something else that dropped in priority [14:29:20] that’s technically an error but hopefully a relatively harmless one? [14:29:34] The 2 mwmaint should be out of scap [14:29:43] Lucas_WMDE: Ah, that's btullis [14:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T400854)', diff saved to https://phabricator.wikimedia.org/P80701 and previous config saved to /var/cache/conftool/dbconfig/20250804-143001-ladsgroup.json [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1430) [14:30:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [14:30:05] ok [14:30:06] btullis: You may have forgotten to remove hosts from dsh before decomming them? [14:30:20] I will do it along with mwdebug [14:30:51] is it still okay to declare the backport+config window done? (without a further “cleanup” deploy?) 
[14:31:13] * Lucas_WMDE gets permission denied (publickey) when trying to SSH to snapshot1010 [14:31:30] (03CR) 10Effie Mouzeli: [C:03+2] dsh: remove testservers from scap destinations 1 [puppet] - 10https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:32:17] !log UTC afternoon backport+config window hopefully done after some difficulties [14:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] Lucas_WMDE: as long as deployments went fine for k8s you're good [14:32:56] (03CR) 10CI reject: [V:04-1] kafka: reach out to the newly introduce spicerack.kafka.admin_client method [cookbooks] - 10https://gerrit.wikimedia.org/r/1174727 (owner: 10Brouberol) [14:32:57] scap didn't roll back ? [14:33:17] I don’t think so [14:33:31] I don't think so either, the pods would be younger than they are [14:33:33] there’s no more k8s output after the failed apaches [14:33:37] (~10 min) [14:34:59] (03Merged) 10jenkins-bot: dhcp: add support for Nokia switches [software/spicerack] - 10https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [14:38:26] brouberol@cumin1003 decommission (PID 556816) is awaiting input [14:38:42] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-airflow1002.eqiad.wmnet [14:40:40] Would it be okay for me to deploy something now, or is there still cleanup going on? 
[14:40:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P80702 and previous config saved to /var/cache/conftool/dbconfig/20250804-144050-fceratto.json [14:40:51] (03CR) 10JHathaway: [C:03+1] dhcp: add support for Nokia switches (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1175395 (https://phabricator.wikimedia.org/T401013) (owner: 10Ayounsi) [14:42:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:44:47] jouncebot: nowandnext [14:44:47] For the next 0 hour(s) and 15 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1430) [14:44:48] In 0 hour(s) and 45 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1530) [14:44:51] Kemayo: should be fine [14:44:56] claime: Thanks! 
[14:45:09] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [14:45:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P80703 and previous config saved to /var/cache/conftool/dbconfig/20250804-144509-ladsgroup.json [14:45:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175253 (owner: 10DLynch) [14:46:45] (03CR) 10MVernon: [C:03+2] thanos: fix typo in hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1175517 (owner: 10MVernon) [14:46:45] (03PS2) 10Effie Mouzeli: etcd::tlsproxy: Remove testserver ACLs 2 [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) [14:46:53] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1173871 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:49:16] Kemayo: are you going to run scap by any chance? [14:49:37] effie: Only in the sense that I'm going to let spiderpig do it for me. [14:49:48] potato potato [14:49:49] ok tx [14:50:30] effie: I'm monitoring the etcd db size btw [14:50:34] * Lucas_WMDE wonders if that’s an “on your head be it” ok [14:50:55] brouberol@cumin1003 decommission (PID 556816) is awaiting input [14:50:58] I wonder if it's going to keep going up at every deploy [14:51:00] claime: aye, we have quite a few horses in that race [14:51:01] I am but a lowly consumer of these tools, alas. [14:51:27] more deploys = more data to test that hypothesis? 
*hides* [14:51:27] but I wanted to see if scap will attempt to deploy to mwdebug* :p [14:51:38] heh fair [14:51:44] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [14:51:47] effie: you should be able to follow through spiderpig [14:52:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057874 (10MatthewVernon) Thanks! [14:52:14] jenkins estimates that I've got around 5 minutes until the pre-merge tests on my patches are done, so that's about how long you have to wait to see. [14:52:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11057877 (10MatthewVernon) [14:52:22] !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [14:52:22] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:52:24] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1002.eqiad.wmnet [14:54:40] (03PS1) 10Btullis: Begin cleanup up the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175529 (https://phabricator.wikimedia.org/T398438) [14:54:42] (03PS1) 10Btullis: Complete cleanup of clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) [14:54:47] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-airflow1004.eqiad.wmnet [14:54:58] Lucas_WMDE: I think the scap snapshot thing was just a race condition, 
b.tullis did remove the snapshot hosts from scap, but puppet must not have yet run on the deployment server when you started scap [14:55:11] ok [14:55:53] yeah apparently puppet-agent-timer.timer fired 6 min ago [14:55:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P80704 and previous config saved to /var/cache/conftool/dbconfig/20250804-145557-fceratto.json [14:56:13] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6485/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175529 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:56:58] * Lucas_WMDE is mildly amused that b.tullis got a real ping for “probably your fault” but a dot-disarmed ping for “ok not your fault after all” ;) [14:57:13] lol I didn't even [14:57:19] btullis apology ping [14:57:20] (03Merged) 10jenkins-bot: GutterSidebarEditCheckDialog: Guard against null bounding rects [extensions/VisualEditor] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175253 (owner: 10DLynch) [14:57:36] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1175253|GutterSidebarEditCheckDialog: Guard against null bounding rects]] [14:57:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:24] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:39] * claime glares at kubernetes secrets API [14:59:13] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1175253|GutterSidebarEditCheckDialog: Guard against null bounding rects]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:59:27] (03CR) 10Scott French: [C:03+1] "Thanks for your patience!" [puppet] - 10https://gerrit.wikimedia.org/r/1174579 (https://phabricator.wikimedia.org/T376776) (owner: 10RLazarus) [14:59:37] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [15:00:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P80705 and previous config saved to /var/cache/conftool/dbconfig/20250804-150018-ladsgroup.json [15:00:28] !log kemayo@deploy1003 kemayo: Continuing with sync [15:01:42] (03CR) 10CDobbins: [C:03+2] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [15:03:24] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6486/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:03:38] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1004.eqiad.wmnet decommissioned, removing all IPs except 
the asset tag one - brouberol@cumin1003" [15:04:25] !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [15:04:25] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:04:26] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1004.eqiad.wmnet [15:05:52] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175253|GutterSidebarEditCheckDialog: Guard against null bounding rects]] (duration: 08m 16s) [15:05:55] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11057922 (10Andrew) 05Open→03Stalled [15:06:14] (03CR) 10Stevemunene: [C:03+1] cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [15:06:35] (03CR) 10Brouberol: [C:03+2] cumin: remove any mention of an-airflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1173903 (https://phabricator.wikimedia.org/T395296) (owner: 10Brouberol) [15:06:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:06:44] Deployment cert-manager-cainjector in cert-manager at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cert-manager-cainjector - ... 
[15:06:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:06:55] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-airflow1005.eqiad.wmnet [15:07:43] cert manager oomkill crashloopbackoff [15:07:45] awesome [15:09:28] claime: need a hand? [15:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:38] cdanis: I think I'll just bump the memory limit [15:09:41] 👍 [15:09:44] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:47] but 2GB per cainjector is weird [15:10:41] (03CR) 10Brouberol: Complete cleanup of clouddumps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:10:53] (03CR) 10Brouberol: [C:03+1] Begin cleanup up the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175529 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:11:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399728)', diff saved to https://phabricator.wikimedia.org/P80706 and previous config saved to /var/cache/conftool/dbconfig/20250804-151105-fceratto.json [15:11:08] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:11:11] (03PS1) 10Clément Goubert: Bump cert-manager-cainjector memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175533 [15:11:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance [15:11:21] (03CR) 10CDanis: [C:03+1] Bump cert-manager-cainjector memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175533 (owner: 10Clément Goubert) [15:11:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80707 and previous config saved to /var/cache/conftool/dbconfig/20250804-151127-fceratto.json [15:11:30] (03CR) 10Effie Mouzeli: [C:03+1] Bump cert-manager-cainjector memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175533 (owner: 10Clément Goubert) [15:11:46] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [15:13:02] (03CR) 10Stevemunene: [C:03+1] Begin cleanup up the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175529 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:13:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80708 and previous config saved to /var/cache/conftool/dbconfig/20250804-151355-fceratto.json [15:14:47] (03CR) 10Btullis: [V:03+1 C:03+2] Begin cleanup up the clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175529 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:15:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T400854)', diff saved to https://phabricator.wikimedia.org/P80709 and previous config saved to /var/cache/conftool/dbconfig/20250804-151526-ladsgroup.json [15:15:29] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [15:15:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:16:14] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [15:16:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T400854)', diff saved to https://phabricator.wikimedia.org/P80710 and previous config saved to /var/cache/conftool/dbconfig/20250804-151621-ladsgroup.json [15:16:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [15:16:44] Deployment cert-manager-cainjector in cert-manager at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cert-manager-cainjector - ... [15:16:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:16:56] jouncebot: nowandnext [15:16:56] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [15:16:56] In 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1530) [15:17:24] brouberol@cumin1003 decommission (PID 560466) is awaiting input [15:17:40] I might try to borrow that deploy window from jan_drewniak for a security patch [15:17:40] ^ that is a welcome addition [15:18:03] (03PS1) 10CDobbins: Revert "admin: add SD0001 to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/1175535 [15:18:07] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [15:18:09] (03CR) 10Clément Goubert: [C:03+2] Bump cert-manager-cainjector memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175533 (owner: 10Clément Goubert) [15:18:24] !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [15:18:24] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:25] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1005.eqiad.wmnet [15:18:42] And now certmanager is fine [15:18:46] wth [15:19:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T400854)', diff saved to https://phabricator.wikimedia.org/P80712 and previous config saved to /var/cache/conftool/dbconfig/20250804-151910-ladsgroup.json [15:19:12] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-airflow1006.eqiad.wmnet [15:19:44] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:15] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:20:31] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11057968 (10Papaul) I checked this this, it looks like all the transceivers showing unspecified are Cisco's [15:21:30] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:22:01] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [15:22:06] (03PS1) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175536 (https://phabricator.wikimedia.org/T400405) [15:22:41] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:23:44] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:24:24] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:24:38] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:24:48] (03Merged) 10jenkins-bot: Bump cert-manager-cainjector memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175533 (owner: 10Clément Goubert) [15:24:50] (03CR) 10Jasmine: [C:03+2] mwmaint: decommission mwmaint servers [puppet] - 10https://gerrit.wikimedia.org/r/1174753 (https://phabricator.wikimedia.org/T400442) (owner: 10Jasmine) [15:25:30] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1043 - vriley@cumin1002" [15:25:48] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1043 - vriley@cumin1002" [15:25:48] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:26:07] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1043 [15:26:21] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [15:26:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1043 [15:27:22] (03PS1) 10Jgreen: Switch payments.wm.o to new load balancer IP. [dns] - 10https://gerrit.wikimedia.org/r/1175539 [15:27:25] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:28:38] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:28:41] (03CR) 10Jgreen: [C:03+2] Switch payments.wm.o to new load balancer IP. 
[dns] - 10https://gerrit.wikimedia.org/r/1175539 (owner: 10Jgreen) [15:28:59] !log jgreen@dns1004 START - running authdns-update [15:29:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P80713 and previous config saved to /var/cache/conftool/dbconfig/20250804-152903-fceratto.json [15:29:06] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:07] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1006.eqiad.wmnet [15:29:17] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:29:54] !log jgreen@dns1004 END - running authdns-update [15:29:57] !log brouberol@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-airflow1007.eqiad.wmnet [15:30:03] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1530). Please do the needful. [15:30:18] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:31:11] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:31:52] I need to run a maintenance script for T397270, basically a repeat of T397270#11024395. Read-only, no writes, in wikishared DBs; it takes about 10 seconds. Can I go ahead now? (I accidentally started the script ~5 minutes ago while looking for the command, but aborted it immediately before it actually started) [15:31:52] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [15:31:53] jan_drewniak: do you need this deploy window? 
otherwise I’d like to deploy a security fix during it [15:33:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:34:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P80714 and previous config saved to /var/cache/conftool/dbconfig/20250804-153418-ladsgroup.json [15:34:48] !log brouberol@cumin1003 START - Cookbook sre.dns.netbox [15:34:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1042.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:35:43] Daimona: can you please hold on ? [15:35:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058015 (10VRiley-WMF) [15:36:14] (03PS2) 10BCornwall: Revert "acme-chief: Remove nc domains with DNSSEC enabled" [puppet] - 10https://gerrit.wikimedia.org/r/1175146 (https://phabricator.wikimedia.org/T400731) [15:36:35] (03CR) 10BCornwall: Revert "acme-chief: Remove nc domains with DNSSEC enabled" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175146 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [15:36:39] jasmine@cumin1003 decommission (PID 564753) is awaiting input [15:37:45] Yep sure. 
LMK when I can go ahead [15:38:21] (03PS2) 10BCornwall: Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 [15:38:32] (03CR) 10BCornwall: Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (owner: 10BCornwall) [15:39:11] !log jasmine@cumin1003 START - Cookbook sre.hosts.decommission for hosts mwmaint1002.eqiad.wmnet [15:41:01] !log brouberol@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [15:41:56] (03CR) 10BCornwall: [C:03+2] Revert "acme-chief: Remove nc domains with DNSSEC enabled" [puppet] - 10https://gerrit.wikimedia.org/r/1175146 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [15:43:42] effie: does that mean I shouldn’t security-deploy either? [15:43:44] (03PS3) 10BCornwall: Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (https://phabricator.wikimedia.org/T400731) [15:44:04] (03CR) 10Ahmon Dancy: "Thank you claime and hashar!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy) [15:44:06] brouberol@cumin1003 decommission (PID 563661) is awaiting input [15:44:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P80715 and previous config saved to /var/cache/conftool/dbconfig/20250804-154410-fceratto.json [15:44:39] (03CR) 10BCornwall: [C:03+1] Revert "admin: add SD0001 to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/1175535 (owner: 10CDobbins) [15:44:43] !log brouberol@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-airflow1007.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1003" [15:44:43] !log brouberol@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:44] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-airflow1007.eqiad.wmnet [15:44:45] !log jasmine@cumin1003 START - Cookbook sre.dns.netbox [15:45:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Follow up on lists.wm.o TLS usage - https://phabricator.wikimedia.org/T398018#11058048 (10Arnoldokoth) a:03Vgutierrez [15:45:29] Lucas_WMDE: it would be great if you could hold on for a bit there [15:48:05] ack [15:49:24] !log jasmine@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwmaint1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [15:49:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P80716 and previous config saved to /var/cache/conftool/dbconfig/20250804-154925-ladsgroup.json [15:49:27] !log jasmine@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwmaint1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [15:49:28] !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:49:29] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwmaint1002.eqiad.wmnet [15:50:28] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [15:50:51] 06SRE, 10SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127 (10MatthewVernon) 03NEW [15:52:41] !log jasmine@cumin1003 START - Cookbook sre.hosts.decommission for hosts mwmaint2002.codfw.wmnet [15:57:12] 06SRE, 10SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11058086 (10MatthewVernon) p:05Triage→03High High priority because we can't start deploying hosts with these JBOD controllers until how to refer to them in the swift rings is resolved. 
[15:57:22] !log jasmine@cumin1003 START - Cookbook sre.dns.netbox [15:57:40] 06SRE, 10SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11058102 (10MatthewVernon) [15:57:42] 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11058103 (10MatthewVernon) [15:57:46] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11058104 (10MatthewVernon) [15:57:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11058105 (10MatthewVernon) [15:58:51] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [15:59:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399728)', diff saved to https://phabricator.wikimedia.org/P80717 and previous config saved to /var/cache/conftool/dbconfig/20250804-155919-fceratto.json [15:59:22] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:59:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:59:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T399728)', diff saved to https://phabricator.wikimedia.org/P80718 and previous config saved to /var/cache/conftool/dbconfig/20250804-155941-fceratto.json [16:02:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399728)', diff saved to https://phabricator.wikimedia.org/P80719 and previous config saved to /var/cache/conftool/dbconfig/20250804-160210-fceratto.json [16:03:04] jasmine@cumin1003 
decommission (PID 567830) is awaiting input [16:03:53] !log jasmine@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwmaint2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [16:04:23] !log jasmine@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwmaint2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1003" [16:04:23] !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:25] !log jasmine@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwmaint2002.codfw.wmnet [16:04:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T400854)', diff saved to https://phabricator.wikimedia.org/P80720 and previous config saved to /var/cache/conftool/dbconfig/20250804-160433-ladsgroup.json [16:04:39] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [16:04:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [16:04:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80721 and previous config saved to /var/cache/conftool/dbconfig/20250804-160456-ladsgroup.json [16:07:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80722 and previous config saved to /var/cache/conftool/dbconfig/20250804-160746-ladsgroup.json [16:13:27] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - 
https://phabricator.wikimedia.org/T400897#11058164 (10HCoplin-WMF) Added to my metawiki user page! See here: https://meta.wikimedia.org/wiki/User:HCoplin-WMF [16:14:58] (03CR) 10Dzahn: [C:03+2] "thank you, reviewers, going ahead with it then" [puppet] - 10https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: 10BryanDavis) [16:15:45] Daimona: please go ahead, sorry for the wait [16:16:44] jouncebot: nowandnext [16:16:44] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [16:16:44] In 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1700) [16:16:44] In 0 hour(s) and 43 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1700) [16:16:51] I’ll deploy some security patches [16:17:00] Thank you! Nothing to be sorry about, thanks for keeping the lights on ;) [16:17:03] (shouldn’t conflict with Daimona’s maintenance script hopefully) [16:17:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P80723 and previous config saved to /var/cache/conftool/dbconfig/20250804-161718-fceratto.json [16:18:13] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:19:43] !log Running maintenance script for T397270 in x1: testwiki, test2wiki, officewiki, wikishared [16:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:49] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [16:22:56] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P80725 and previous config saved to /var/cache/conftool/dbconfig/20250804-162255-ladsgroup.json [16:24:17] (03PS2) 10Btullis: Complete cleanup of clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) [16:24:21] (03PS2) 10Btullis: Remove all puppet code related to snapshot and dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) [16:24:26] * Lucas_WMDE scap’ing [16:24:35] (03PS1) 10Elukey: install_server: fix cacheproxy-efi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1175549 (https://phabricator.wikimedia.org/T392851) [16:24:36] (03CR) 10Btullis: Complete cleanup of clouddumps servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175530 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [16:27:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6488/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [16:30:16] (03CR) 10Btullis: Remove all puppet code related to snapshot and dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175548 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [16:31:02] !log lucaswerkmeister-wmde Deployed security patch for T401099 [16:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.49s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:32:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P80729 and previous config saved to /var/cache/conftool/dbconfig/20250804-163226-fceratto.json [16:35:13] * Lucas_WMDE done deploying [16:35:57] (03CR) 10Dzahn: [C:03+1] gerrit: replica renames as "gerrit2" application user [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [16:37:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.467s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:37:30] (03CR) 10Dzahn: [C:03+1] "ah, shell user vs gerrit database user.. yea.. oof.. +1 for this. Do we also want to rename that user in the database later?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [16:38:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P80730 and previous config saved to /var/cache/conftool/dbconfig/20250804-163803-ladsgroup.json [16:43:50] (03CR) 10CDanis: [C:03+1] "methodology LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [16:47:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399728)', diff saved to https://phabricator.wikimedia.org/P80731 and previous config saved to /var/cache/conftool/dbconfig/20250804-164736-fceratto.json [16:47:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:47:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:48:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T399728)', diff saved to https://phabricator.wikimedia.org/P80732 and previous config saved to /var/cache/conftool/dbconfig/20250804-164759-fceratto.json [16:49:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399728)', diff saved to https://phabricator.wikimedia.org/P80733 and previous config saved to /var/cache/conftool/dbconfig/20250804-164928-fceratto.json [16:49:41] (03PS1) 10Dzahn: gerrit: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) [16:53:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80734 and previous config saved to /var/cache/conftool/dbconfig/20250804-165312-ladsgroup.json [16:53:16] 
T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [16:53:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [16:53:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T400854)', diff saved to https://phabricator.wikimedia.org/P80735 and previous config saved to /var/cache/conftool/dbconfig/20250804-165335-ladsgroup.json [16:56:13] (03CR) 10Dzahn: "setting headers only inside the :443 virtual host, but accessing port 80 from outside gets a 403 Forbidden" [puppet] - 10https://gerrit.wikimedia.org/r/1175552 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [16:56:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T400854)', diff saved to https://phabricator.wikimedia.org/P80736 and previous config saved to /var/cache/conftool/dbconfig/20250804-165623-ladsgroup.json [16:59:31] (03CR) 10Dzahn: gerrit: add daemons ssh host key to known_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1700) [17:00:04] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1700). [17:01:48] (03CR) 10Dzahn: [C:03+1] "seems reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: 10Krinkle) [17:02:57] (03CR) 10Dzahn: "what was missing?" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [17:04:05] (03CR) 10Dzahn: [C:03+1] "oh, DNSSEC-related? 
woah" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [17:04:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P80737 and previous config saved to /var/cache/conftool/dbconfig/20250804-170436-fceratto.json [17:09:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.225s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:11:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P80738 and previous config saved to /var/cache/conftool/dbconfig/20250804-171130-ladsgroup.json [17:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.225s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:19:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to 
https://phabricator.wikimedia.org/P80739 and previous config saved to /var/cache/conftool/dbconfig/20250804-171945-fceratto.json [17:21:45] (03CR) 10CDobbins: [C:03+2] Revert "admin: add SD0001 to analytics-privatedata-users" [puppet] - 10https://gerrit.wikimedia.org/r/1175535 (owner: 10CDobbins) [17:22:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11058431 (10Jclark-ctr) [17:23:15] (03CR) 10Eevans: "No, it _does_ change the URL. Extensions though should be using the service listener (a 127.0.0.1 address in mediawiki-config) and not th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:23:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.712s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:23:34] (03PS1) 10Andrew Bogott: Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) [17:25:23] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:26:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P80740 and previous config saved to /var/cache/conftool/dbconfig/20250804-172637-ladsgroup.json [17:28:13] (03PS2) 10CDobbins: admin: add sd to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175536 (https://phabricator.wikimedia.org/T400405) 
[17:28:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.128s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:28:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.038s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:30:26] (03PS1) 10Andrew Bogott: Added dump capiservicek3s.yaml file [labs/private] - 10https://gerrit.wikimedia.org/r/1175557 (https://phabricator.wikimedia.org/T393782) [17:33:45] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.297s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:34:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399728)', diff saved to https://phabricator.wikimedia.org/P80741 and previous config saved to /var/cache/conftool/dbconfig/20250804-173454-fceratto.json [17:35:01] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:35:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [17:35:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T399728)', diff saved to https://phabricator.wikimedia.org/P80742 and previous config saved to /var/cache/conftool/dbconfig/20250804-173518-fceratto.json [17:37:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T399728)', diff saved to https://phabricator.wikimedia.org/P80743 and previous config saved to /var/cache/conftool/dbconfig/20250804-173745-fceratto.json [17:38:21] (03PS5) 10Scott French: httpd-fcgi: add missing image build dependency [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1173977 [17:38:41] (03PS2) 10Andrew Bogott: Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) [17:38:56] jouncebot: nowandnext [17:38:57] For the next 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T1700) [17:38:57] In 2 hour(s) and 21 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T2000) [17:39:06] (03CR) 10CI reject: [V:04-1] Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:40:02] (03PS3) 10Andrew Bogott: Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) [17:40:28] (03CR) 10CI reject: [V:04-1] Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:41:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T400854)', diff saved to Unable to send diff to phaste and previous 
config saved to /var/cache/conftool/dbconfig/20250804-174145-ladsgroup.json [17:42:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [17:42:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T400854)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250804-174212-ladsgroup.json [17:42:22] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [17:44:09] (03PS2) 10Andrew Bogott: Added dummy capiservicek3s.yaml file [labs/private] - 10https://gerrit.wikimedia.org/r/1175557 (https://phabricator.wikimedia.org/T393782) [17:44:16] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Added dummy capiservicek3s.yaml file [labs/private] - 10https://gerrit.wikimedia.org/r/1175557 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:44:36] (03CR) 10BCornwall: [C:03+1] admin: add sd to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175536 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [17:45:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T400854)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250804-174505-ladsgroup.json [17:48:00] (03PS4) 10Andrew Bogott: Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) [17:48:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:48:34] (03CR) 10CI reject: [V:04-1] Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:48:38] (03CR) 10Scott French: [V:03+2] "Build locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1173977 (owner: 10Scott French) [17:49:51] (03PS5) 10Andrew Bogott: Magnum: use capi-helm driver in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) [17:50:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:50:22] FYI, I'm going to be merging a change shortly that will require a rebuild of the httpd production image used by mediawiki, and thus a follow-on no-code-change scap deployment to verify [17:50:48] (03CR) 10Scott French: [V:03+2] "Thanks for the review, Effie!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1173977 (owner: 10Scott French) [17:50:49] FIRING: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:04] hm [17:51:14] (03CR) 10Scott French: [V:03+2 C:03+2] httpd-fcgi: add missing image build dependency [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1173977 (owner: 10Scott French) [17:51:19] (acked) [17:52:50] rzl: I can hold on my deployment to avoid excess noise [17:52:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P80746 and previous config saved to /var/cache/conftool/dbconfig/20250804-175252-fceratto.json [17:53:00] !log dancy@deploy1003 Installing scap version "4.195.0" for 2 host(s) [17:53:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:13] swfrench-wmf: ehh, I think it's fine for you to go ahead [17:53:26] swfrench-wmf: yeah, I wouldn't worry [17:53:44] cdanis: rzl: cool cool, will do then :) [17:54:02] the top ISP for phab requests in the past few hours is "Wikimedia Foundation" [17:54:04] I don't see what this was yet, moderate bump in phab rps though [17:54:16] ohhh not *them* again [17:54:21] haha [17:54:47] !log dancy@deploy1003 Installation of scap version "4.195.0" completed for 2 hosts [17:55:14] (03CR) 10Andrew Bogott: [C:03+2] Magnum: use capi-helm driver in codfw1dev [puppet] - 
10https://gerrit.wikimedia.org/r/1175555 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [17:55:41] (03CR) 10BCornwall: [C:03+2] Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [17:55:49] RESOLVED: [2x] ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:56:26] dancy: any concerns about a mediawiki deployment? (I see you just updated scap, though not sure on which hosts) [17:56:36] mutante: do you have any idea what this is? https://w.wiki/EwZM [17:57:19] swfrench-wmf: You're good to go [17:57:34] dancy: ack, thanks! [17:58:06] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up rebuilt mediawiki-httpd image [17:59:23] !log swfrench@deploy1003 swfrench: Deployment to pick up rebuilt mediawiki-httpd image synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:00:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P80747 and previous config saved to /var/cache/conftool/dbconfig/20250804-180017-ladsgroup.json [18:01:00] !log swfrench@deploy1003 swfrench: Continuing with sync [18:02:02] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:02:17] (03PS1) 10Clare Ming: Temporarily add config var back in for group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175561 (https://phabricator.wikimedia.org/T401135) [18:04:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:05:34] (03PS1) 10Ebernhardson: Revert "cirrus: Start AB test of completion suggester fuzziness" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175562 (https://phabricator.wikimedia.org/T397732) [18:06:22] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up rebuilt mediawiki-httpd image (duration: 08m 33s) [18:06:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:08:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P80748 and previous config saved to /var/cache/conftool/dbconfig/20250804-180801-fceratto.json [18:15:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P80749 and previous config 
saved to /var/cache/conftool/dbconfig/20250804-181526-ladsgroup.json [18:16:18] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1175564 [18:16:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175562 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [18:16:36] (03PS1) 10Ottomata: HaproxyKafkaDeliveryErrors - use rate instead of irate [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) [18:16:44] (03CR) 10CI reject: [V:04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/1175564 (owner: 10Herron) [18:17:38] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1175564 [18:17:41] (03CR) 10Ottomata: "I'm not certain this will work, but reading the docs on rate vs irate, I think it might? Worth a try?" [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [18:19:36] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:55] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:22:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:23:05] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bullseye [18:23:10] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T399728)', diff saved to https://phabricator.wikimedia.org/P80750 and previous config saved to /var/cache/conftool/dbconfig/20250804-182309-fceratto.json [18:23:13] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:23:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058645 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye [18:23:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:24:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance [18:24:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T399728)', diff saved to https://phabricator.wikimedia.org/P80751 and previous config saved to /var/cache/conftool/dbconfig/20250804-182420-fceratto.json [18:26:16] (03CR) 10CDobbins: [C:03+2] admin: add sd to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175536 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [18:26:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399728)', diff saved to https://phabricator.wikimedia.org/P80752 and previous config saved to /var/cache/conftool/dbconfig/20250804-182649-fceratto.json [18:29:17] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T400854)', diff saved to https://phabricator.wikimedia.org/P80753 and previous config saved to /var/cache/conftool/dbconfig/20250804-183033-ladsgroup.json [18:30:37] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [18:30:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [18:31:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance [18:32:53] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance [18:33:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T400854)', diff saved to https://phabricator.wikimedia.org/P80754 and previous config saved to /var/cache/conftool/dbconfig/20250804-183259-ladsgroup.json [18:35:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T400854)', diff saved to https://phabricator.wikimedia.org/P80755 and previous config saved to /var/cache/conftool/dbconfig/20250804-183543-ladsgroup.json [18:35:50] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [18:36:19] (03PS1) 10Ebernhardson: Clean up CirrusSearch settings on ex-wikipedia special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175566 (https://phabricator.wikimedia.org/T400062) [18:37:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175566 
(https://phabricator.wikimedia.org/T400062) (owner: 10Ebernhardson) [18:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:41:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P80756 and previous config saved to /var/cache/conftool/dbconfig/20250804-184156-fceratto.json [18:50:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P80757 and previous config saved to /var/cache/conftool/dbconfig/20250804-185052-ladsgroup.json [18:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:52:36] cdanis: While I can't point to an actual code path it comes from.. it is also not new to us. A couple of us have noticed this before. It comes from Phabricator itself, from its own IP, likely it's in Differential. The 252 is an *. I think Brennen was already looking at source code to find it before. Since it keeps coming up, I am digging some more.
[18:54:53] (03CR) 10Daimona Eaytoy: [C:03+1] Clean up CirrusSearch settings on ex-wikipedia special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175566 (https://phabricator.wikimedia.org/T400062) (owner: 10Ebernhardson) [18:55:12] (03PS1) 10Cwhite: prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 [18:55:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:57:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P80758 and previous config saved to /var/cache/conftool/dbconfig/20250804-185705-fceratto.json [18:57:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175561 (https://phabricator.wikimedia.org/T401135) (owner: 10Clare Ming) [18:58:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:59:01] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:00:59] (03CR) 10RLazarus: "Thanks! Let me know if you need any help with adding the tests. Happy to take care of deploying when that's done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: 10Krinkle) [19:06:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P80759 and previous config saved to /var/cache/conftool/dbconfig/20250804-190559-ladsgroup.json [19:07:07] (03PS1) 10Bartosz Dziewoński: SessionManager: Add $sessionWriteReason to shutdown and when saves are triggered from the destructor [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175574 (https://phabricator.wikimedia.org/T400249) [19:07:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175574 (https://phabricator.wikimedia.org/T400249) (owner: 10Bartosz Dziewoński) [19:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:09:38] (03PS1) 10DLynch: Change search teardown focus to not use an over-broad route [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175575 (https://phabricator.wikimedia.org/T401090) [19:09:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175575 (https://phabricator.wikimedia.org/T401090) (owner: 10DLynch) [19:12:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399728)', diff saved to https://phabricator.wikimedia.org/P80760 and previous config saved to /var/cache/conftool/dbconfig/20250804-191213-fceratto.json [19:12:31] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [19:12:33] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:17:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:19:27] vriley@cumin1002 reimage (PID 1229374) is awaiting input [19:19:58] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:20:01] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:20:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058865 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... 
[19:20:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:21:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T400854)', diff saved to https://phabricator.wikimedia.org/P80761 and previous config saved to /var/cache/conftool/dbconfig/20250804-192107-ladsgroup.json [19:21:15] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:21:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2218.codfw.wmnet with reason: Maintenance [19:21:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T400854)', diff saved to https://phabricator.wikimedia.org/P80762 and previous config saved to /var/cache/conftool/dbconfig/20250804-192129-ladsgroup.json [19:24:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T400854)', diff saved to https://phabricator.wikimedia.org/P80763 and previous config saved to /var/cache/conftool/dbconfig/20250804-192415-ladsgroup.json [19:25:41] (03PS1) 10Cwhite: mediawiki-global: set sre as receiver of MediaWikiElevatedUnknownLogins [alerts] - 10https://gerrit.wikimedia.org/r/1175578 (https://phabricator.wikimedia.org/T395117) [19:26:51] (03PS1) 10Andrew Bogott: Magnum: only include [heat_client] if heat is enabled [puppet] - 10https://gerrit.wikimedia.org/r/1175579 (https://phabricator.wikimedia.org/T393782) [19:27:52] (03PS2) 10RLazarus: deployment_server: Add --sal to mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1174579 (https://phabricator.wikimedia.org/T376776) [19:30:34] (03CR) 10RLazarus: [C:03+2] deployment_server: Add --sal to mwscript_k8s (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1174579 (https://phabricator.wikimedia.org/T376776) (owner: 10RLazarus) [19:31:22] (03PS1) 10MusikAnimal: beta: use CodeMirror instead of CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175581 (https://phabricator.wikimedia.org/T373711) [19:32:37] (03PS1) 10Ottomata: eventgate main and analytics - bump to v1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175582 (https://phabricator.wikimedia.org/T376026) [19:33:11] (03CR) 10Ottomata: [C:03+2] eventgate main and analytics - bump to v1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175582 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [19:34:50] (03Merged) 10jenkins-bot: eventgate main and analytics - bump to v1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175582 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [19:35:03] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [19:35:09] (03PS1) 10Daimona Eaytoy: Add exceptions to country code migration script following test [extensions/CampaignEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175583 (https://phabricator.wikimedia.org/T397270) [19:35:10] !log deploying eventgate-analytics and eventgate-main to pick up meta.dt field logic change - T376026 [19:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:16] T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026 [19:35:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/CampaignEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175583 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [19:35:32] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply 
[19:36:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:36:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058908 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... [19:36:50] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [19:37:34] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [19:38:44] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:39:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P80764 and previous config saved to /var/cache/conftool/dbconfig/20250804-193923-ladsgroup.json [19:39:44] !log rzl@deploy1003 mwscript-k8s job started: Version.php --wiki=urwiki # Testing --sal for T376776 [19:39:47] T376776: mw-scripts SAL integration - https://phabricator.wikimedia.org/T376776 [19:39:47] (03CR) 10Andrew Bogott: [C:03+2] Magnum: only include [heat_client] if heat is enabled [puppet] - 10https://gerrit.wikimedia.org/r/1175579 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [19:50:29] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:50:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058930 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye [19:54:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P80765 and 
previous config saved to /var/cache/conftool/dbconfig/20250804-195431-ladsgroup.json [19:59:41] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11058939 (10CDobbins) @Milimetric @Ahoelzl @Ottomata would any of you mind approving @HCoplin-WMF's membership in analytics-privat... [19:59:58] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T2000). [20:00:05] Krinkle, Daimona, MatmaRex, ebernhardson, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:00:12] o/ [20:00:14] hi [20:00:26] o/ [20:00:26] ^ FYI I am in the middle of eventgate deployment, almost done, helmfile apply is being slow, serviceops folks helping [20:00:33] \o [20:00:36] should not block backport [20:00:48] my three patches don't have anything to test on mwdebug, and can go out together [20:01:01] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [20:01:07] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [20:01:37] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [20:01:44] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [20:02:07] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [20:02:53] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1175587 [20:02:56] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor 
domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1175588 [20:03:01] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1175589 [20:03:02] hi - i can deploy for those who don't want to self-deploy [20:03:21] I can get mine done myself, or you can get it, either's fine. [20:03:47] ack [20:03:55] Krinkle: do you want to start? [20:04:19] OK. [20:04:52] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bullseye [20:04:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 (https://phabricator.wikimedia.org/T400586) (owner: 10Krinkle) [20:05:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11058942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... [20:05:08] Daimona: are you able to spiderpig? [20:05:27] Nope, not a deployer. 
[20:05:35] But I can run my script once the backport is done [20:05:38] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:05:40] np - cool [20:05:54] (03Merged) 10jenkins-bot: Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 (https://phabricator.wikimedia.org/T400586) (owner: 10Krinkle) [20:06:11] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1161757|Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… (T400586)]] [20:06:15] T400586: Remove mobile domain variance from $wgCentralBannerRecorder URL - https://phabricator.wikimedia.org/T400586 [20:07:34] (all done with eventgate) [20:07:44] ty! [20:07:51] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1161757|Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… (T400586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T400854)', diff saved to https://phabricator.wikimedia.org/P80767 and previous config saved to /var/cache/conftool/dbconfig/20250804-200938-ladsgroup.json [20:09:43] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:09:56] !log krinkle@deploy1003 krinkle: Continuing with sync [20:09:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2221.codfw.wmnet with reason: Maintenance [20:10:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T400854)', diff saved to https://phabricator.wikimedia.org/P80768 and previous config saved to /var/cache/conftool/dbconfig/20250804-201003-ladsgroup.json [20:12:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T400854)', diff saved to https://phabricator.wikimedia.org/P80769 and previous config saved to /var/cache/conftool/dbconfig/20250804-201246-ladsgroup.json [20:14:17] (03PS1) 10Andrew Bogott: Magnum: install python3-magnum-capi-helm when needed [puppet] - 10https://gerrit.wikimedia.org/r/1175590 (https://phabricator.wikimedia.org/T393782) [20:14:33] nearly done.. 
[20:14:41] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11058970 (10CDobbins) [20:14:42] great - thanks [20:15:16] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161757|Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… (T400586)]] (duration: 09m 05s) [20:15:21] T400586: Remove mobile domain variance from $wgCentralBannerRecorder URL - https://phabricator.wikimedia.org/T400586 [20:15:26] cjming: all yours [20:15:35] Krinkle:: ty :) [20:15:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175590 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [20:15:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175583 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [20:16:07] Daimona: doing your backport now [20:16:33] Thank you! You can sync it straight away, I will test it as part of running the script immediately afterwards. 
[20:16:44] will do [20:17:09] (03Merged) 10jenkins-bot: Add exceptions to country code migration script following test [extensions/CampaignEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175583 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [20:17:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:17:24] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175583|Add exceptions to country code migration script following test (T397270)]] [20:17:27] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [20:17:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:18:04] (03PS1) 10CDobbins: admin: give HCoplin access to SSH, Kerberos, and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175592 (https://phabricator.wikimedia.org/T400897) [20:18:13] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:18:28] (03CR) 10Andrew Bogott: [C:03+2] Magnum: install python3-magnum-capi-helm when needed [puppet] - 10https://gerrit.wikimedia.org/r/1175590 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [20:18:51] MatmaRex: do you want me to deploy your patches? 
[20:19:01] !log cjming@deploy1003 daimona, cjming: Backport for [[gerrit:1175583|Add exceptions to country code migration script following test (T397270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:19:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175581 (https://phabricator.wikimedia.org/T373711) (owner: 10MusikAnimal) [20:19:16] cjming: please do, thanks [20:19:21] ack [20:19:22] cjming: they can all be done at once [20:19:28] cool - will do [20:19:39] !log cjming@deploy1003 daimona, cjming: Continuing with sync [20:19:43] (and i don't have anything to test on mwdebug) [20:19:55] alrighty [20:24:20] (03PS1) 10Andrew Bogott: Magnum: fix typo in python3-magnum-capi-helm package name [puppet] - 10https://gerrit.wikimedia.org/r/1175593 [20:24:55] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175583|Add exceptions to country code migration script following test (T397270)]] (duration: 07m 30s) [20:24:56] (03CR) 10Andrew Bogott: [C:03+2] Magnum: fix typo in python3-magnum-capi-helm package name [puppet] - 10https://gerrit.wikimedia.org/r/1175593 (owner: 10Andrew Bogott) [20:25:01] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [20:25:06] Daimona: should be live [20:25:24] Nice, thank you. I'll run the script now, lemme find the command to log [20:25:25] (03CR) 10Gergő Tisza: "Something like `WebAuthn: Limit passkeys to roaming` would be a more descriptive title." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174048 (https://phabricator.wikimedia.org/T399665) (owner: 10Mstyles) [20:25:29] not sure if i should wait for Daimona to run script before starting in on next patches [20:25:42] I'm only going to take a couple minutes [20:26:11] cool - then i'll err on the side of waiting [20:26:27] !log Re-run CampaignEvents country migration script in dry-run mode one last time for all wikis # T397270 [20:26:30] unless someone else knows definitively that it's ok to continue [20:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:10] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [20:27:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P80771 and previous config saved to /var/cache/conftool/dbconfig/20250804-202754-ladsgroup.json [20:29:51] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [20:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:02] (03CR) 10BCornwall: [C:03+1] admin: give HCoplin access to SSH, Kerberos, and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175592 (https://phabricator.wikimedia.org/T400897) (owner: 10CDobbins) [20:30:46] (03CR) 10CDobbins: [C:03+2] admin: give HCoplin access to SSH, Kerberos, and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175592 (https://phabricator.wikimedia.org/T400897) (owner: 10CDobbins) 
[20:30:50] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [20:30:52] (03PS1) 10Ottomata: eventgate-analytics - revert to v1.11.0 to avoid log spam [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175596 (https://phabricator.wikimedia.org/T376026) [20:31:26] I'm going to take a bit longer actually, because I want to double-check everything. Won't take long anyway [20:31:32] ack [20:31:36] FYI cjming I'm going to deploy eventgate-analytics to avoid logspam I just caused. Should have no relevance to ongoing backport window. [20:31:50] ottomata: sounds good - thanks for heads up [20:31:54] (03CR) 10Ottomata: [C:03+2] eventgate-analytics - revert to v1.11.0 to avoid log spam [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175596 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [20:32:21] (Also, "refreshing the local helm cache" seems to take a while) [20:32:22] !log reprepro include php8.3_8.3.24-1+wmf11u2 in component/php83 - T398245 [20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:30] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245 [20:33:04] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.12/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki testwiki --exceptions countryExceptionMappings.csv --commit [20:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:10] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [20:33:12] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bullseye [20:33:13] (03PS1) 10Cwhite: logstash: drop logspam [puppet] - 10https://gerrit.wikimedia.org/r/1175597 
(https://phabricator.wikimedia.org/T390215) [20:33:29] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11059063 (10CDobbins) [20:33:33] (03Merged) 10jenkins-bot: eventgate-analytics - revert to v1.11.0 to avoid log spam [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175596 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [20:33:56] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:34:01] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [20:34:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye [20:34:30] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [20:34:37] 07Puppet, 10Beta-Cluster-Infrastructure, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet configures kernel.core_pattern |/usr/lib/systemd/systemd-coredump, but systemd-coredump is not installed - https://phabricator.wikimedia.org/T400247#11059070 (10bd808) a:03Lucas_Werkmeister_WMDE [20:34:43] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [20:34:47] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [20:35:19] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.12/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki test2wiki --exceptions countryExceptionMappings.csv --commit [20:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:30] (03PS1) 10Andrew Bogott: Magnum: 
install helm when needed [puppet] - 10https://gerrit.wikimedia.org/r/1175598 (https://phabricator.wikimedia.org/T393782) [20:35:33] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [20:36:13] (03CR) 10Cwhite: [C:03+2] logstash: drop logspam [puppet] - 10https://gerrit.wikimedia.org/r/1175597 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [20:36:22] (03CR) 10Andrew Bogott: [C:03+2] Magnum: install helm when needed [puppet] - 10https://gerrit.wikimedia.org/r/1175598 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [20:36:30] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [20:36:54] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [20:37:27] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.12/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki officewiki --exceptions countryExceptionMappings.csv --commit [20:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:11] Almost done, that step is really slow tonight [20:39:49] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.12/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki metawiki --exceptions countryExceptionMappings.csv --commit [20:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:53] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [20:40:56] Alright, I'm done. Thanks and apologies for the delay! [20:41:04] yay! 
continuing [20:41:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175511 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:41:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175512 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:41:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175574 (https://phabricator.wikimedia.org/T400249) (owner: 10Bartosz Dziewoński) [20:41:38] (03PS1) 10Scott French: php8.3: rebuild to pick up 8.3.24-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175599 (https://phabricator.wikimedia.org/T398246) [20:42:50] (03CR) 10Scott French: [V:03+2] "Built locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175599 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [20:43:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P80776 and previous config saved to /var/cache/conftool/dbconfig/20250804-204305-ladsgroup.json [20:45:39] !log eventgate-analytics in eqiad cannot be deployed due to stuck helm STATUS: pending-upgrade. This needs to be deployed to rollback to a version that doesn't cause logspam. cc cwhite, rzl - T376026 [20:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:43] T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026 [20:46:18] I have to run for the day, i'm sorry! i have to do kid duty. k8s is stuck for eventgate-analytics eqiad. 
any help to apply the current merged helmfile (at image v1.11.0) would be much obliged! [20:49:52] (03Merged) 10jenkins-bot: Clear edit count when unattaching local users for rename [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175511 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:49:54] (03Merged) 10jenkins-bot: fixStuckGlobalRename: Fix using actor_id from the wrong wiki [extensions/CentralAuth] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175512 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:51:02] cwhite: rzl rolled back for me, logspam should stop [20:54:42] (03CR) 10Aleksandar Mastilovic: "I'm doubtful it will work 😊 If it doesn't, you might try using the `or` operator like so: https://promlabs.com/blog/2023/09/13/dealing-wit" [alerts] - 10https://gerrit.wikimedia.org/r/1175565 (https://phabricator.wikimedia.org/T395539) (owner: 10Ottomata) [20:55:20] (03Merged) 10jenkins-bot: SessionManager: Add $sessionWriteReason to shutdown and when saves are triggered from the destructor [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175574 (https://phabricator.wikimedia.org/T400249) (owner: 10Bartosz Dziewoński) [20:55:37] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175511|Clear edit count when unattaching local users for rename (T313900)]], [[gerrit:1175512|fixStuckGlobalRename: Fix using actor_id from the wrong wiki (T398177)]], [[gerrit:1175574|SessionManager: Add $sessionWriteReason to shutdown and when saves are triggered from the destructor (T400249)]] [20:55:44] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [20:55:45] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - 
https://phabricator.wikimedia.org/T398177 [20:55:45] T400249: SessionBackend should save sessions at the end of the request (and only there) - https://phabricator.wikimedia.org/T400249 [20:56:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 147687 MB (3% inode=99%): /var/lib/hadoop/data/e 164273 MB (4% inode=99%): /var/lib/hadoop/data/f 165832 MB (4% inode=99%): /var/lib/hadoop/data/b 162040 MB (4% inode=99%): /var/lib/hadoop/data/g 170421 MB (4% inode=99%): /var/lib/hadoop/data/d 167504 MB (4% inode=99%): /var/lib/hadoop/data/j 165636 MB (4% inode=99%): /var/lib/hadoop/data [20:56:14] 0 MB (4% inode=99%): /var/lib/hadoop/data/h 169985 MB (4% inode=99%): /var/lib/hadoop/data/l 170359 MB (4% inode=99%): /var/lib/hadoop/data/k 177967 MB (4% inode=99%): /var/lib/hadoop/data/m 157765 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [20:57:14] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1175511|Clear edit count when unattaching local users for rename (T313900)]], [[gerrit:1175512|fixStuckGlobalRename: Fix using actor_id from the wrong wiki (T398177)]], [[gerrit:1175574|SessionManager: Add $sessionWriteReason to shutdown and when saves are triggered from the destructor (T400249)]] synced to the testservers (see https://wikitech.wikimedia.org/w [20:57:14] iki/Mwdebug). Changes can now be verified there. [20:57:40] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:57:51] Security team deployers: there are a few more patches in this backport window - are we ok to go over or do you need the window? 
[20:58:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T400854)', diff saved to https://phabricator.wikimedia.org/P80777 and previous config saved to /var/cache/conftool/dbconfig/20250804-205813-ladsgroup.json [20:58:18] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:58:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2222.codfw.wmnet with reason: Maintenance [20:58:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T400854)', diff saved to https://phabricator.wikimedia.org/P80778 and previous config saved to /var/cache/conftool/dbconfig/20250804-205837-ladsgroup.json [21:00:04] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T2100). [21:01:03] Reedy, sbassett, Maryum, and manfredi: see Q above about UTC late backport window going over? 
[21:01:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T400854)', diff saved to https://phabricator.wikimedia.org/P80779 and previous config saved to /var/cache/conftool/dbconfig/20250804-210119-ladsgroup.json [21:03:13] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175511|Clear edit count when unattaching local users for rename (T313900)]], [[gerrit:1175512|fixStuckGlobalRename: Fix using actor_id from the wrong wiki (T398177)]], [[gerrit:1175574|SessionManager: Add $sessionWriteReason to shutdown and when saves are triggered from the destructor (T400249)]] (duration: 07m 36s) [21:03:23] MatmaRex: should be live [21:03:25] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [21:03:26] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [21:03:28] T400249: SessionBackend should save sessions at the end of the request (and only there) - https://phabricator.wikimedia.org/T400249 [21:03:40] thanks cjming [21:03:49] yw! [21:04:25] ebernhardson, kemayo: I'm asking about extending the window -- if we're ok to, are you up for self-deploying? [21:04:50] cjming: No problem for me. 
[21:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:03] cjming: yea [21:05:50] cool - the other option is if the security folks need the window, you can probably go after them - the web team might not need their window [21:06:09] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [21:06:29] I mean, there isn't even a Web team any more. [21:06:32] cjming: i think you're probably ok to go ahead [21:06:46] kemayo: truth [21:06:51] brennen: thanks [21:06:55] ebernhardson: i'd probably go ahead since no one has piped up from the security team [21:07:17] alrighty, kicking off [21:07:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175562 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [21:07:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175566 (https://phabricator.wikimedia.org/T400062) (owner: 10Ebernhardson) [21:07:36] ebernhardson: will you pass over to kemayo when you're done? 
[21:08:07] cjming: yup, i can do that [21:08:13] (03Merged) 10jenkins-bot: Revert "cirrus: Start AB test of completion suggester fuzziness" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175562 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [21:08:15] (03Merged) 10jenkins-bot: Clean up CirrusSearch settings on ex-wikipedia special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175566 (https://phabricator.wikimedia.org/T400062) (owner: 10Ebernhardson) [21:08:25] ty - then i will sign off - thanks everybody [21:08:27] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1175562|Revert "cirrus: Start AB test of completion suggester fuzziness" (T397732)]], [[gerrit:1175566|Clean up CirrusSearch settings on ex-wikipedia special wikis (T400062)]] [21:08:29] thanks! [21:08:33] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [21:08:34] T400062: Clean up CirrusSearch settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400062 [21:10:02] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1175562|Revert "cirrus: Start AB test of completion suggester fuzziness" (T397732)]], [[gerrit:1175566|Clean up CirrusSearch settings on ex-wikipedia special wikis (T400062)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:11:13] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [21:14:12] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1043.eqiad.wmnet with OS bullseye [21:14:19] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059258 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... 
[21:16:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P80780 and previous config saved to /var/cache/conftool/dbconfig/20250804-211628-ladsgroup.json [21:16:33] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175562|Revert "cirrus: Start AB test of completion suggester fuzziness" (T397732)]], [[gerrit:1175566|Clean up CirrusSearch settings on ex-wikipedia special wikis (T400062)]] (duration: 08m 06s) [21:16:39] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [21:16:39] T400062: Clean up CirrusSearch settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400062 [21:17:03] Kemayo: I'm all done, you're up [21:17:39] ebernhardson: thanks! [21:17:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175575 (https://phabricator.wikimedia.org/T401090) (owner: 10DLynch) [21:18:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059276 (10VRiley-WMF) Still working with this issue. It seems that cloudcephosd1041 has set the disks to a non-raid group. Upon trying that, it seemed like the script went a bit further tha... 
[21:19:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:21:31] (03CR) 10RLazarus: [C:03+1] php8.3: rebuild to pick up 8.3.24-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175599 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [21:27:23] 06SRE, 10SRE-swift-storage: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127#11059301 (10Eevans) Oh good, so it's not just me. :) I (just recently) encountered the same, namely that `/dev/disk/by-path/...` was inconsistent across the hardware I was working with,... [21:30:49] (03Merged) 10jenkins-bot: Change search teardown focus to not use an over-broad route [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175575 (https://phabricator.wikimedia.org/T401090) (owner: 10DLynch) [21:31:02] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1175575|Change search teardown focus to not use an over-broad route (T401090)]] [21:31:06] T401090: Anchor links issue on mobile: focus is set on search instead of the anchor - https://phabricator.wikimedia.org/T401090 [21:31:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P80781 and previous config saved to /var/cache/conftool/dbconfig/20250804-213136-ladsgroup.json [21:32:39] !log kemayo@deploy1003 kemayo: Backport for [[gerrit:1175575|Change search teardown focus to not use an over-broad route (T401090)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:33:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059322 (10Andrew) A few things: # Ceph uses a jbod setup, so we don't want hardware raid involved at all. All drives should be set to non-raid # The partman recipe will make a mirrored rai... [21:33:49] !log kemayo@deploy1003 kemayo: Continuing with sync [21:39:11] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175575|Change search teardown focus to not use an over-broad route (T401090)]] (duration: 08m 08s) [21:39:14] T401090: Anchor links issue on mobile: focus is set on search instead of the anchor - https://phabricator.wikimedia.org/T401090 [21:39:32] Okay, I am all done. [21:39:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:40:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:41:16] (03CR) 10Santiago Faci: [C:03+1] Temporarily add config var back in for group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175561 (https://phabricator.wikimedia.org/T401135) (owner: 10Clare Ming) [21:45:23] (03CR) 10Dr0ptp4kt: Temporarily add config var back in for group2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175561 (https://phabricator.wikimedia.org/T401135) (owner: 10Clare Ming) [21:45:51] FIRING: [2x] 
CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:46:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T400854)', diff saved to https://phabricator.wikimedia.org/P80782 and previous config saved to /var/cache/conftool/dbconfig/20250804-214644-ladsgroup.json [21:46:49] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:54:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:55:57] (03CR) 10Scott French: [V:03+2] "Thanks, Reuven!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175599 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [21:57:02] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: rebuild to pick up 8.3.24-1+wmf11u2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1175599 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [22:14:07] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [22:14:29] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [22:27:44] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [22:28:39] (03CR) 10BCornwall: [C:04-2] "DNS isn't properly propagated yet, so I'll wait until then." [puppet] - 10https://gerrit.wikimedia.org/r/1175588 (owner: 10Ncmonitor) [22:29:32] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:38:12] (03CR) 10Scott French: "Thanks, Ahmon! This seems entirely reasonable / desirable to me. Looks good other than the choice of changelog version." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 (owner: 10Ahmon Dancy) [22:38:19] (03PS1) 10Dzahn: admin: remove lmata from phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/1175613 [22:42:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:43:08] (03CR) 10Dzahn: [C:03+2] admin: remove lmata from phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/1175613 (owner: 10Dzahn) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250804T2300) [23:07:22] (03PS3) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [23:09:30] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:57] (03CR) 10Dzahn: [C:04-2] zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:16:38] (03PS4) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [23:24:33] (03PS5) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [23:27:17] (03PS6) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 
(https://phabricator.wikimedia.org/T395938) [23:29:19] (03PS7) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [23:29:24] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1174566/6494/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:32:19] (03CR) 10Dzahn: [V:03+1] "Hey Elukey, I need a zookeeper installed on some new VMs for CI. It is not supposed to influence anything related to existing zookeeper in" [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:38:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175617 [23:38:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175617 (owner: 10TrainBranchBot) [23:42:17] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175617 (owner: 10TrainBranchBot) [23:42:40] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1043.eqiad.wmnet with OS bookworm [23:42:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11059441 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm