[00:08:06] RECOVERY - OpenSearch health check for shards on 9200 on logstash2035 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 769, active_shards: 1647, relocating_shards: 0, initializing_shards: 20, unassigned_shards: 78, delayed_unassigned_s [00:08:06] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.38395415472779 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1174097 [00:08:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1174097 (owner: 10TrainBranchBot) [00:29:14] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1174097 (owner: 10TrainBranchBot) [01:00:57] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:11:50] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 52s) [01:40:38] (03CR) 10Vgutierrez: "current status:" [dns] - 10https://gerrit.wikimedia.org/r/1174007 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [01:43:09] (03PS1) 10Vgutierrez: haproxy,varnish: Disable bullseye-backports on tests Dockerfile [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) [02:54:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[02:54:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [02:56:14] PROBLEM - Disk space on an-worker1122 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158237 MB (4% inode=99%): /var/lib/hadoop/data/e 152103 MB (4% inode=99%): /var/lib/hadoop/data/f 158249 MB (4% inode=99%): /var/lib/hadoop/data/b 158396 MB (4% inode=99%): /var/lib/hadoop/data/g 158467 MB (4% inode=99%): /var/lib/hadoop/data/d 158589 MB (4% inode=99%): /var/lib/hadoop/data/j 158895 MB (4% inode=99%): /var/lib/hadoop/data [02:56:14] 5 MB (3% inode=99%): /var/lib/hadoop/data/h 152414 MB (4% inode=99%): /var/lib/hadoop/data/l 151901 MB (4% inode=99%): /var/lib/hadoop/data/k 149123 MB (3% inode=99%): /var/lib/hadoop/data/m 155584 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [03:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:49:50] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:51:28] 10ops-eqiad, 06DC-Ops: Q4: eqiad: (12) PDUs for ML 
expansion - https://phabricator.wikimedia.org/T400778 (10Jclark-ctr) 03NEW [03:54:02] 10ops-eqiad, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779 (10Jclark-ctr) 03NEW [03:55:36] 10ops-eqiad, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11045677 (10Jclark-ctr) [03:56:00] 10ops-eqiad, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11045682 (10Jclark-ctr) [03:56:38] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11045684 (10Jclark-ctr) [04:05:02] 10ops-eqiad, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780 (10Jclark-ctr) 03NEW [04:05:19] 10ops-eqiad, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11045700 (10Jclark-ctr) a:03Jclark-ctr [04:11:03] 10ops-eqiad, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11045701 (10Jclark-ctr) [04:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:12:39] 10ops-eqiad, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11045702 (10Jclark-ctr) [04:12:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11045703 (10Jclark-ctr) [04:15:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-9m4xv - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [04:21:48] RESOLVED: PuppetZeroResources: Puppet has failed 
generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:22:17] FIRING: [2x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1026:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:27:17] RESOLVED: [2x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1026:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-6gtmw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [05:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:13] (03CR) 10Arnaudb: [C:03+2] gerrit: add service ip address for gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1173370 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [05:28:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400758#11045722 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr @VRiley-WMF Please make sure servers are powered off if you have not com... 
[05:29:54] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400763#11045729 (10Jclark-ctr) @VRiley-WMF Please make sure servers are powered off if you have not completed this will cause alerts and automated ti... [05:30:00] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400763#11045731 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [05:54:51] 10ops-eqiad, 06DC-Ops: Alert for device lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400782 (10phaultfinder) 03NEW [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T0600) [06:04:30] 10ops-eqiad, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783 (10ayounsi) 03NEW [06:09:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:54] 10ops-eqiad, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400784 (10phaultfinder) 03NEW [06:21:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:22:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-6gtmw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [06:37:08] !log jelto@cumin1003 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [06:40:50] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [06:42:46] ^ expected T400252 [06:42:48] T400252: Gitlab switchover (gitlab2002 → gitlab1004) - https://phabricator.wikimedia.org/T400252 [06:43:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:44:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:54:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[06:54:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T0700). [07:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:28] (03CR) 10Jelto: [C:03+2] Gitlab: switchover from gitlab2002 to gitlab1004 [puppet] - 10https://gerrit.wikimedia.org/r/1172026 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [07:01:11] o/ [07:05:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-6gtmw - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [07:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:14:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:19:04] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171239 (https://phabricator.wikimedia.org/T385286) (owner: 10Matthias Mullie) [07:19:58] (03Merged) 10jenkins-bot: Add new MediaSearch config/coefficients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171239 (https://phabricator.wikimedia.org/T385286) (owner: 10Matthias Mullie) [07:20:22] (03CR) 10Tiziano Fogli: [C:03+1] kafkatee: fix webrequest input [puppet] - 10https://gerrit.wikimedia.org/r/1173878 (https://phabricator.wikimedia.org/T371366) (owner: 10Filippo Giunchedi) [07:21:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:43] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:31:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:34:45] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11045826 (10elukey) @Jhancock.wm the host is already reimaged, you can go ahead with closing the task if everything checks out :) I was unsure about the partitionin... 
[07:35:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [07:35:16] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11045827 (10elukey) @Jhancock.wm yep this should be ready for a review, already reimaged! [07:35:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T399728)', diff saved to https://phabricator.wikimedia.org/P80271 and previous config saved to /var/cache/conftool/dbconfig/20250730-073517-fceratto.json [07:35:22] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [07:37:10] (03CR) 10Elukey: [C:03+2] redfish: expand is_uefi for Dells [software/spicerack] - 10https://gerrit.wikimedia.org/r/1173923 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:38:42] (03CR) 10Elukey: "Sorry missed the comment! I'd prefer to keep things separated so it is clear what they differ on :)" [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [07:39:22] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1173878 (https://phabricator.wikimedia.org/T371366) (owner: 10Filippo Giunchedi) [07:41:47] (03CR) 10Filippo Giunchedi: [C:03+2] kafkatee: fix webrequest input [puppet] - 10https://gerrit.wikimedia.org/r/1173878 (https://phabricator.wikimedia.org/T371366) (owner: 10Filippo Giunchedi) [07:42:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T399728)', diff saved to https://phabricator.wikimedia.org/P80272 and previous config saved to /var/cache/conftool/dbconfig/20250730-074213-fceratto.json [07:42:20] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [07:43:59] (03PS4) 10Elukey: install_server: fix hwraid-1dev-nvme and modify boss_leavelvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) [07:46:22] (03Merged) 10jenkins-bot: redfish: expand is_uefi for Dells [software/spicerack] - 10https://gerrit.wikimedia.org/r/1173923 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [07:47:19] (03CR) 10Elukey: [C:03+2] install_server: fix hwraid-1dev-nvme and modify boss_leavelvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1173876 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [07:48:58] (03CR) 10Jelto: [C:03+2] Gitlab: switchover from gitlab2002 to gitlab1004 [dns] - 10https://gerrit.wikimedia.org/r/1172029 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [07:49:28] !log jelto@dns1004 START - running authdns-update [07:49:50] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:50:02] (03CR) 10Elukey: [C:03+2] 
sre.hosts.provision: add custom settings for Supermicro (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [07:50:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [07:50:46] !log jelto@dns1004 END - running authdns-update [07:50:56] !log jelto@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:50:59] !log jelto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:51:33] !log jelto@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:51:37] !log jelto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:51:50] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [07:53:48] !log jelto@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:53:51] !log jelto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:55:59] !log mlitn@deploy1003 Started scap sync-world: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]] [07:56:04] T385286: [SPIKE] Gather labeled data to re-tune MediaSearch - https://phabricator.wikimedia.org/T385286 [07:56:10] !log jelto@cumin1003 START - Cookbook sre.dns.wipe-cache 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all 
recursors [07:56:14] !log jelto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'https://gitlab.wikimedia.org/ https://gitlab-replica-b.wikimedia.org/' on all recursors [07:56:29] jelto: the wipe-cache cookbook takes the DNS name, not an HTTPS URL [07:57:12] thank you! I already suspected that. I'll update my cookbook to use the proper DNS name. But all good, TTL of 300s is over and new DNS entry present [07:57:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P80273 and previous config saved to /var/cache/conftool/dbconfig/20250730-075720-fceratto.json [07:58:22] !log mlitn@deploy1003 mlitn: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:58:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:41] (03PS1) 10Fabfur: haproxykafka: replaced reReplaceAll regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1174386 [07:59:14] jelto@cumin1003 failover (PID 4002853) is awaiting input [08:00:09] !log mlitn@deploy1003 mlitn: Continuing with sync [08:01:01] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:03:22] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab2002.wikimedia.org to gitlab1004.wikimedia.org [08:03:54] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [08:04:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:05:41] !log mlitn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171239|Add new MediaSearch config/coefficients (T385286)]] (duration: 09m 42s) [08:05:47] T385286: [SPIKE] Gather labeled data to re-tune MediaSearch - https://phabricator.wikimedia.org/T385286 [08:08:38] (03CR) 10Tiziano Fogli: [C:03+1] haproxykafka: replaced reReplaceAll regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1174386 (owner: 10Fabfur) [08:09:09] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS bookworm [08:10:10] tnx tappof [08:10:16] (03CR) 10Fabfur: [C:03+2] haproxykafka: replaced reReplaceAll regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1174386 (owner: 10Fabfur) [08:11:27] (03Merged) 10jenkins-bot: haproxykafka: replaced reReplaceAll regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1174386 (owner: 10Fabfur) [08:12:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P80274 and previous config saved to /var/cache/conftool/dbconfig/20250730-081228-fceratto.json [08:19:42] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy revertrisk-language-agnostic latest published image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172622 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [08:21:22] (03Merged) 10jenkins-bot: ml-services: Deploy revertrisk-language-agnostic latest published image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172622 (https://phabricator.wikimedia.org/T400266) (owner: 10Gkyziridis) [08:23:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - 
https://phabricator.wikimedia.org/T393733#11045957 (10fnegri) @VRiley-WMF I //think// that my patch above (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173974) should fix t... [08:27:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T399728)', diff saved to https://phabricator.wikimedia.org/P80275 and previous config saved to /var/cache/conftool/dbconfig/20250730-082735-fceratto.json [08:27:41] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:27:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1190.eqiad.wmnet with reason: Maintenance [08:27:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T399728)', diff saved to https://phabricator.wikimedia.org/P80276 and previous config saved to /var/cache/conftool/dbconfig/20250730-082758-fceratto.json [08:28:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:28:32] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:32:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T399728)', diff saved to https://phabricator.wikimedia.org/P80278 and previous config saved to /var/cache/conftool/dbconfig/20250730-083252-fceratto.json [08:33:01] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:36:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS bookworm [08:36:07] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2183.codfw.wmnet with reason: upgrade mariadb [08:38:11] !log 
jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2184.codfw.wmnet with reason: replication will stop [08:38:25] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:40:27] (03PS23) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) [08:41:13] (03PS1) 10Jcrespo: mariadb: Upgrade db2183 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1174400 (https://phabricator.wikimedia.org/T394487) [08:42:16] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2183 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1174400 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [08:43:23] (03CR) 10Tiziano Fogli: [C:03+2] nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [08:48:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P80279 and previous config saved to /var/cache/conftool/dbconfig/20250730-084800-fceratto.json [08:51:22] (03PS1) 10Gkyziridis: ml-services: Deploy ores-legacy-model to the latest published image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174404 (https://phabricator.wikimedia.org/T400348) [08:54:54] (03PS1) 10Elukey: Add sretest2010 to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1174405 (https://phabricator.wikimedia.org/T394357) [08:55:47] (03CR) 10Elukey: [C:03+2] Add sretest2010 to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1174405 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [08:56:02] (03PS1) 10Gmodena: EventStreamConfig: remove staging page change conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166827 (https://phabricator.wikimedia.org/T394899) [08:58:54] (03PS1) 10Tiziano Fogli: nrpe wrapper: install nrpe2nodexp dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1174406 (https://phabricator.wikimedia.org/T395446) [08:59:11] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [08:59:27] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db[2183-2184].codfw.wmnet [08:59:28] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[2183-2184].codfw.wmnet [08:59:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[08:59:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [08:59:48] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe wrapper: install nrpe2nodexp dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1174406 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:01:50] (03PS1) 10Ayounsi: [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [09:01:59] (03CR) 10Tiziano Fogli: [C:03+2] nrpe wrapper: install nrpe2nodexp dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1174406 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:02:22] (03PS1) 10Jcrespo: mariadb: Upgrade db1204 & db1205 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1174408 (https://phabricator.wikimedia.org/T394487) [09:03:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P80280 and previous config saved to /var/cache/conftool/dbconfig/20250730-090309-fceratto.json [09:03:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [09:08:55] (03CR) 10CI reject: [V:04-1] [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [09:09:38] (03PS1) 10Jelto: gitlab: pause restore on gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1174409 (https://phabricator.wikimedia.org/T400252) [09:10:58] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on db[1204-1205].eqiad.wmnet with reason: upgrade mariadb [09:11:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:11:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:11:30] (03CR) 10Filippo Giunchedi: "Please run PCC too, I don't remember if we have to explicitly absent the existing rsyslog configuration and logrotate configuration or jus" [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [09:11:33] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1174409 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [09:12:25] (03CR) 10Arnaudb: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1174409 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [09:15:27] (03PS1) 10Gkyziridis: ml-services: Deploy latest images for articletopic-outlink-model on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174410 (https://phabricator.wikimedia.org/T400349) [09:15:53] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: pause restore on gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1174409 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [09:15:56] (03PS1) 10Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) [09:16:32] (03PS2) 10Vgutierrez: haproxy,varnish: Use bullseye image 20250723 for docker based tests [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) [09:16:40] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:16:46] (03PS2) 10Ayounsi: [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [09:16:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [09:16:56] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2010.codfw.wmnet with OS bookworm [09:18:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T399728)', diff saved to https://phabricator.wikimedia.org/P80281 and previous config saved to /var/cache/conftool/dbconfig/20250730-091817-fceratto.json [09:18:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [09:18:23] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:18:25] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy latest images for 
articletopic-outlink-model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174410 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [09:18:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T399728)', diff saved to https://phabricator.wikimedia.org/P80282 and previous config saved to /var/cache/conftool/dbconfig/20250730-091829-fceratto.json [09:19:51] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11046057 (10elukey) @Jhancock.wm sretest2010 successfully provisioned and reimaged. Given the amount of extra disks of ~7TB I suspect this is another hadoop-... [09:22:02] (03PS1) 10Gkyziridis: ml-services: Deploy latest image for langid on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174412 (https://phabricator.wikimedia.org/T400347) [09:23:09] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11046064 (10elukey) >>! In T392851#11044305, @Jhancock.wm wrote: > There's two things i can think of. One is converting the disks to raid capable and either setting two... [09:23:17] (03CR) 10CI reject: [V:04-1] [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [09:23:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T399728)', diff saved to https://phabricator.wikimedia.org/P80283 and previous config saved to /var/cache/conftool/dbconfig/20250730-092332-fceratto.json [09:23:38] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:26:01] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy latest images for articletopic-outlink-model on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174410 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [09:26:07] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy latest image for langid on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174412 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [09:27:37] (03Merged) 10jenkins-bot: ml-services: Deploy latest images for articletopic-outlink-model on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174410 (https://phabricator.wikimedia.org/T400349) (owner: 10Gkyziridis) [09:28:13] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy ores-legacy-model to the latest published image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174404 (https://phabricator.wikimedia.org/T400348) (owner: 10Gkyziridis) [09:29:19] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db1204 & db1205 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1174408 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [09:30:35] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:30:45] (03PS3) 10Ayounsi: [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [09:33:27] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy latest image for langid on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174412 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [09:33:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-b4glm - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [09:35:06] (03Merged) 10jenkins-bot: ml-services: Deploy latest image for langid on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174412 (https://phabricator.wikimedia.org/T400347) (owner: 10Gkyziridis) [09:36:22] (03CR) 10Fabfur: haproxy,varnish: Use bullseye image 20250723 for docker based tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) (owner: 10Vgutierrez) [09:36:33] (03CR) 10Fabfur: [C:03+1] haproxy,varnish: Use bullseye image 20250723 for docker based tests [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) (owner: 10Vgutierrez) [09:36:59] (03CR) 10Vgutierrez: haproxy,varnish: Use bullseye image 20250723 for docker based tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) (owner: 10Vgutierrez) [09:37:03] (03PS1) 10Brouberol: airflow: avoid configmap name collision when spawning multiple devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174413 [09:37:08] (03CR) 10Vgutierrez: [C:03+2] haproxy,varnish: Use bullseye image 20250723 for docker based tests [puppet] - 10https://gerrit.wikimedia.org/r/1174123 (https://phabricator.wikimedia.org/T400774) (owner: 10Vgutierrez) [09:37:43] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [09:37:44] (03PS4) 10Ayounsi: [WIP] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [09:38:33] (03PS5) 10Ayounsi: sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [09:38:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P80284 and previous config saved to /var/cache/conftool/dbconfig/20250730-093839-fceratto.json [09:38:49] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy ores-legacy-model to the latest published image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1174404 (https://phabricator.wikimedia.org/T400348) (owner: 10Gkyziridis) [09:40:35] (03Merged) 10jenkins-bot: ml-services: Deploy ores-legacy-model to the latest published image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174404 (https://phabricator.wikimedia.org/T400348) (owner: 10Gkyziridis) [09:42:47] (03CR) 10Hamish: [C:03+1] "Make sense from my end." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [09:43:07] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:43:44] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:44:15] !log gkyziridis@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:47:58] (03CR) 10Stevemunene: [C:03+1] "Looks good, Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174413 (owner: 10Brouberol) [09:53:14] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db[1204-1205].eqiad.wmnet [09:53:15] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[1204-1205].eqiad.wmnet [09:53:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P80285 and previous config saved to /var/cache/conftool/dbconfig/20250730-095346-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1000) [10:07:23] (03PS1) 10Fabfur: haproxy: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 [10:08:08] (03PS1) 10Jelto: sre.gitlab.failover: use hostname in wipe-cache [cookbooks] - 10https://gerrit.wikimedia.org/r/1174417 (https://phabricator.wikimedia.org/T400252) 
[10:08:34] (03CR) 10CI reject: [V:04-1] haproxy: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 (owner: 10Fabfur) [10:08:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-b4glm - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [10:08:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T399728)', diff saved to https://phabricator.wikimedia.org/P80286 and previous config saved to /var/cache/conftool/dbconfig/20250730-100854-fceratto.json [10:09:01] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:09:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [10:09:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:09:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80287 and previous config saved to /var/cache/conftool/dbconfig/20250730-100934-fceratto.json [10:10:20] (03PS2) 10Fabfur: haproxy: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 [10:13:26] (03CR) 10CI reject: [V:04-1] sre.gitlab.failover: use hostname in wipe-cache [cookbooks] - 10https://gerrit.wikimedia.org/r/1174417 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [10:13:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80288 and previous config saved to 
/var/cache/conftool/dbconfig/20250730-101343-fceratto.json [10:15:23] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [10:19:22] (03PS2) 10Jelto: sre.gitlab.failover: use hostname in wipe-cache [cookbooks] - 10https://gerrit.wikimedia.org/r/1174417 (https://phabricator.wikimedia.org/T400252) [10:20:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11046260 (10cmooney) @Jclark-ctr @VRiley-WMF I added the links in Netbox now. I just used dummy labels in Netbox for the cables, understand we don't have the cables or opti... [10:24:32] (03PS1) 10Vgutierrez: varnish: Expand authorization method reporting [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) [10:24:48] FIRING: PuppetFailure: Puppet has failed on maps2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:26:41] (03PS1) 10Fabfur: haproxykafka: fixed alert HaproxykafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) [10:26:48] FIRING: PuppetFailure: Puppet has failed on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:27:34] (03CR) 10Vgutierrez: haproxykafka: fixed alert HaproxykafkaNoMessages (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [10:28:43] (03PS2) 10Fabfur: team-data-engineering: fixed alert HaproxykafkaNoMessages [alerts] - 10https://gerrit.wikimedia.org/r/1174421 
(https://phabricator.wikimedia.org/T400039) [10:28:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P80289 and previous config saved to /var/cache/conftool/dbconfig/20250730-102850-fceratto.json [10:28:57] (03CR) 10Fabfur: team-data-engineering: fixed alert HaproxykafkaNoMessages (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1174421 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [10:29:48] FIRING: [2x] PuppetFailure: Puppet has failed on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:31:48] FIRING: [2x] PuppetFailure: Puppet has failed on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:33:13] (03CR) 10Vgutierrez: "text tests are happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:33:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:36:39] (03PS2) 10Vgutierrez: varnish: Expand authorization method reporting [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) [10:36:48] FIRING: [3x] PuppetFailure: Puppet has failed on puppetmaster1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:38:48] FIRING: PuppetFailure: Puppet has failed on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:39:48] FIRING: [5x] PuppetFailure: Puppet has failed on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:43:10] (03PS3) 10Jelto: sre.gitlab.failover: use hostname in wipe-cache [cookbooks] - 10https://gerrit.wikimedia.org/r/1174417 (https://phabricator.wikimedia.org/T400252) [10:43:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P80290 and previous config saved to /var/cache/conftool/dbconfig/20250730-104357-fceratto.json [10:48:12] (03CR) 10Fabfur: [C:03+1] varnish: Expand authorization method reporting [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:48:48] FIRING: [2x] PuppetFailure: Puppet has failed on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:49:39] (03CR) 10Giuseppe Lavagetto: [C:03+1] varnish: Expand authorization method reporting [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:49:48] FIRING: [9x] PuppetFailure: Puppet has failed on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:49:50] (03CR) 10Vgutierrez: [C:03+2] varnish: Expand authorization method reporting [puppet] - 10https://gerrit.wikimedia.org/r/1174420 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:49:51] (03CR) 10Brouberol: [C:03+2] airflow: avoid configmap name collision when spawning multiple devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174413 
(owner: 10Brouberol) [10:51:48] FIRING: [4x] PuppetFailure: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:53:02] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173358 (owner: 10PipelineBot) [10:54:48] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173358 (owner: 10PipelineBot) [10:54:48] FIRING: [12x] PuppetFailure: Puppet has failed on maps1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:55:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-skmkl - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [10:56:42] (03PS1) 10Mvolz: Update zotero translators repository [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174430 [10:59:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T399728)', diff saved to https://phabricator.wikimedia.org/P80291 and previous config saved to /var/cache/conftool/dbconfig/20250730-105904-fceratto.json [10:59:10] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:59:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance [10:59:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T399728)', diff saved to https://phabricator.wikimedia.org/P80292 and previous config saved to 
/var/cache/conftool/dbconfig/20250730-105926-fceratto.json [11:00:04] mvolz: #bothumor I ♥ Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1100). [11:01:14] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:01:39] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:04:58] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:05:24] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:05:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T399728)', diff saved to https://phabricator.wikimedia.org/P80293 and previous config saved to /var/cache/conftool/dbconfig/20250730-110526-fceratto.json [11:05:32] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:05:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bhnn5 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [11:07:26] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:07:53] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:15:15] (03CR) 10Mvolz: [C:03+2] Update zotero translators repository [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174430 (owner: 10Mvolz) [11:15:34] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor:
sync [11:16:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [11:16:50] (03Merged) 10jenkins-bot: Update zotero translators repository [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174430 (owner: 10Mvolz) [11:19:58] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [11:20:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P80294 and previous config saved to /var/cache/conftool/dbconfig/20250730-112034-fceratto.json [11:20:48] RESOLVED: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bhnn5 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [11:21:19] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:31:32] (03PS1) 10Brouberol: (hotfix) airflow: avoid defining empty volume or volumeMount lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174436 [11:31:46] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11046485 (10Mvolz) If citoid calls zotero, sees a 404, and it reports a 404 back, yes, we want to count that. But obviously only once. We just wan... 
[11:32:17] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:32:47] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:33:12] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:33:26] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:34:13] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:34:38] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:35:31] (03CR) 10Stevemunene: [C:03+1] (hotfix) airflow: avoid defining empty volume or volumeMount lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174436 (owner: 10Brouberol) [11:35:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P80295 and previous config saved to /var/cache/conftool/dbconfig/20250730-113541-fceratto.json [11:35:46] (03CR) 10Brouberol: [C:03+2] (hotfix) airflow: avoid defining empty volume or volumeMount lists [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174436 (owner: 10Brouberol) [11:36:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 (owner: 10Sergio Gimeno) [11:38:24] (03CR) 10TChin: [C:03+1] EventStreamConfig: remove staging page change conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166827 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [11:39:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:39:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [11:40:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:41:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:41:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:44:59] (03PS7) 10Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [11:45:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:46:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:47:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [11:47:15] (03CR) 10David Caro: [C:03+1] "❤️" [puppet] - 10https://gerrit.wikimedia.org/r/1173949 (owner: 10Majavah) [11:47:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [11:49:50] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:50:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply 
[11:50:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T399728)', diff saved to https://phabricator.wikimedia.org/P80296 and previous config saved to /var/cache/conftool/dbconfig/20250730-115049-fceratto.json [11:50:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:50:55] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:51:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [11:51:10] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Set custom runbook URL [puppet] - 10https://gerrit.wikimedia.org/r/1173949 (owner: 10Majavah) [11:51:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T399728)', diff saved to https://phabricator.wikimedia.org/P80297 and previous config saved to /var/cache/conftool/dbconfig/20250730-115112-fceratto.json [11:51:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [11:52:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [11:53:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:54:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [11:55:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:56:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:56:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T399728)', diff saved to https://phabricator.wikimedia.org/P80298 and previous config saved to 
/var/cache/conftool/dbconfig/20250730-115614-fceratto.json [11:56:20] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:56:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [11:57:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [11:57:15] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11046543 (10cmooney) [11:57:55] (03PS1) 10Ladsgroup: DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174441 (https://phabricator.wikimedia.org/T395928) [11:58:07] (03PS1) 10Ladsgroup: DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174442 (https://phabricator.wikimedia.org/T395928) [11:58:15] jouncebot: nowandnext [11:58:15] For the next 0 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1100) [11:58:15] In 1 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1300) [11:58:16] (03PS2) 10Brouberol: Provision kafka-jumbo1017 [puppet] - 10https://gerrit.wikimedia.org/r/1166833 (https://phabricator.wikimedia.org/T398826) [11:58:29] (03CR) 10Ladsgroup: [C:03+2] DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174441 (https://phabricator.wikimedia.org/T395928) (owner: 10Ladsgroup) [11:58:34] (03CR) 10Ladsgroup: [C:03+2] DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174442 (https://phabricator.wikimedia.org/T395928) (owner: 10Ladsgroup) [12:00:50] (03PS2) 10Ladsgroup: 
private_tables: Drop private tables that don't exist in production [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945) [12:01:11] (03CR) 10Ladsgroup: [V:03+2 C:03+2] private_tables: Drop private tables that don't exist in production [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945) (owner: 10Ladsgroup) [12:01:46] (03PS1) 10Effie Mouzeli: thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174444 (https://phabricator.wikimedia.org/T374350) [12:01:48] (03PS1) 10Effie Mouzeli: Add configurable --kill-after parameter for subprocess timeout [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1174445 (https://phabricator.wikimedia.org/T374350) [12:02:04] (03Merged) 10jenkins-bot: DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174441 (https://phabricator.wikimedia.org/T395928) (owner: 10Ladsgroup) [12:02:08] (03Merged) 10jenkins-bot: DropUnusedTables: add --dry-run option [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174442 (https://phabricator.wikimedia.org/T395928) (owner: 10Ladsgroup) [12:03:07] (03CR) 10Brouberol: [C:03+2] Provision kafka-jumbo1017 [puppet] - 10https://gerrit.wikimedia.org/r/1166833 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [12:03:36] elukey: headsup, I'm about to start adding kafka-jumbo1017 to the cluster [12:03:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11046558 (10cmooney) [12:03:46] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1174442|DropUnusedTables: add --dry-run option (T395928)]], [[gerrit:1174441|DropUnusedTables: add --dry-run option (T395928)]] [12:03:52] T395928: On wikis where user 
right securepoll-create-poll is missing, delete non-essential SecurePoll SQL tables - https://phabricator.wikimedia.org/T395928 [12:04:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161#11046560 (10cmooney) [12:05:56] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1174442|DropUnusedTables: add --dry-run option (T395928)]], [[gerrit:1174441|DropUnusedTables: add --dry-run option (T395928)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:06:45] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:06:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11046567 (10cmooney) [12:06:47] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11046568 (10cmooney) [12:08:57] Amir1: can you ping me when you finish your deployment? [12:09:23] sure, it'll be done really quickly [12:09:40] ty! [12:11:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P80299 and previous config saved to /var/cache/conftool/dbconfig/20250730-121122-fceratto.json [12:12:07] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174442|DropUnusedTables: add --dry-run option (T395928)]], [[gerrit:1174441|DropUnusedTables: add --dry-run option (T395928)]] (duration: 08m 20s) [12:12:12] T395928: On wikis where user right securepoll-create-poll is missing, delete non-essential SecurePoll SQL tables - https://phabricator.wikimedia.org/T395928 [12:13:05] urbanecm: I'm done! [12:13:13] ty! proceeding with my stuff then. [12:14:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[12:14:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:16:16] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:16:27] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:17:31] !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:17:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:17:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:18:18] !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:19:14] !log kafka-jumbo1017 is added to the cluster, puppet ran on all kafka/zookeeper hosts, external-services was updated on dse-k8s-eqiad, codfw and eqiad - T398826 [12:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:20] T398826: Bring kafka-jumbo101[6-8] into service - https://phabricator.wikimedia.org/T398826 [12:19:44] (03Merged) 10jenkins-bot: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:19:46] (03Merged) 10jenkins-bot: [Growth] Remove feature flags related to Surfacing Structured Tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:20:09] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163028|[Growth] Remove support code for Surfacing Structured Tasks experiment (T397515)]], [[gerrit:1163288|[Growth] Remove feature flags related to Surfacing Structured Tasks (T397515)]] [12:20:15] 
T397515: End the Surfacing Structured Tasks experiment - https://phabricator.wikimedia.org/T397515 [12:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:22:04] (03CR) 10Urbanecm: [C:03+1] "SGTM, but it has a merge conflict now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 (owner: 10Sergio Gimeno) [12:22:19] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1163028|[Growth] Remove support code for Surfacing Structured Tasks experiment (T397515)]], [[gerrit:1163288|[Growth] Remove feature flags related to Surfacing Structured Tasks (T397515)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:22:34] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [12:24:01] !log urbanecm@deploy1003 urbanecm: Continuing with sync [12:26:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P80300 and previous config saved to /var/cache/conftool/dbconfig/20250730-122629-fceratto.json [12:29:04] (03PS3) 10Phuedx: mw::maintenance: ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) [12:29:15] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163028|[Growth] Remove support code for Surfacing Structured Tasks experiment (T397515)]], [[gerrit:1163288|[Growth] Remove feature flags related to Surfacing Structured Tasks (T397515)]] (duration: 09m 06s) [12:29:21] T397515: End the Surfacing Structured Tasks experiment - https://phabricator.wikimedia.org/T397515 [12:29:21] * urbanecm done [12:31:35] (03PS4) 10Phuedx: mw::maintenance: 
ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) [12:38:38] (03PS1) 10AOkoth: site: add vrts1004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1174450 [12:39:59] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1174450 (owner: 10AOkoth) [12:40:44] (03CR) 10Ladsgroup: [C:03+1] CommonSettings: Stop setting wgDBuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174071 (owner: 10Zabe) [12:41:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T399728)', diff saved to https://phabricator.wikimedia.org/P80301 and previous config saved to /var/cache/conftool/dbconfig/20250730-124137-fceratto.json [12:41:42] (03PS2) 10AOkoth: site: add vrts1004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1174450 [12:41:43] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:41:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [12:42:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T399728)', diff saved to https://phabricator.wikimedia.org/P80302 and previous config saved to /var/cache/conftool/dbconfig/20250730-124159-fceratto.json [12:42:31] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [12:42:32] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [12:42:48] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [12:45:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T399728)', diff saved to https://phabricator.wikimedia.org/P80303 and previous config saved to /var/cache/conftool/dbconfig/20250730-124708-fceratto.json [12:47:14] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:47:38] (03CR) 10AOkoth: [C:03+2] site: add vrts1004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1174450 (owner: 10AOkoth) [12:51:49] (03CR) 10Brouberol: Allow the Airflow webserver to support long requests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174005 (https://phabricator.wikimedia.org/T400493) (owner: 10Btullis) [12:52:59] !log aokoth@cumin1003 START - Cookbook sre.ganeti.makevm for new host vrts1004.eqiad.wmnet [12:53:01] !log aokoth@cumin1003 START - Cookbook sre.dns.netbox [12:53:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [12:56:52] (03PS2) 10Sergio Gimeno: Growth: remove conditional user options for get-started-notification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 [12:57:06] !log aokoth@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1004.eqiad.wmnet - aokoth@cumin1003" [12:57:10] !log aokoth@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts1004.eqiad.wmnet - aokoth@cumin1003" [12:57:11] !log aokoth@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:11] !log aokoth@cumin1003 START - Cookbook sre.dns.wipe-cache vrts1004.eqiad.wmnet on all recursors [12:57:14] !log aokoth@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
vrts1004.eqiad.wmnet on all recursors [12:57:44] !log aokoth@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM vrts1004.eqiad.wmnet - aokoth@cumin1003" [12:57:48] !log aokoth@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM vrts1004.eqiad.wmnet - aokoth@cumin1003" [12:59:48] !log aokoth@cumin1003 START - Cookbook sre.hosts.reimage for host vrts1004.eqiad.wmnet with OS bookworm [13:00:01] (03PS1) 10Jforrester: ZObjectContentHandler::fillParserOutput: Don't try to add bad links [extensions/WikiLambda] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174455 (https://phabricator.wikimedia.org/T400521) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1300). Please do the needful. [13:00:05] sergi0 and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:39] o/ [13:00:39] o/ [13:00:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:01:50] sergi0: Are you a deployer? 
[13:02:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P80304 and previous config saved to /var/cache/conftool/dbconfig/20250730-130215-fceratto.json [13:02:16] I should be able to deploy [13:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 (owner: 10Sergio Gimeno) [13:04:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [13:04:47] (03CR) 10Vgutierrez: haproxy: copied some fixes from haproxykafka (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1174416 (owner: 10Fabfur) [13:06:01] 06SRE, 10Hiddenparma, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11046783 (10Joe) [13:06:23] (03Merged) 10jenkins-bot: Growth: remove conditional user options for get-started-notification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173920 (owner: 10Sergio Gimeno) [13:06:27] (03Merged) 10jenkins-bot: MetricsPlatform: Disable synchronous configs fetching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172279 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [13:06:52] !log sgimeno@deploy1003 Started scap sync-world: Backport for [[gerrit:1173920|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]] [13:06:58] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422 [13:07:28] !log aokoth@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts1004.eqiad.wmnet with reason: host reimage [13:08:50] o/ [13:08:59] !log sgimeno@deploy1003 sgimeno, phuedx: Backport 
for [[gerrit:1173920|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:08:59] I can also deploy ^^ [13:09:09] Checking now [13:09:13] I'm already on it, ty @Lucas_WMDE [13:09:15] ack [13:11:56] sergi0: No errors on the frontend. JS SDK is operating correctly. Logs look clean in logstash. LGTM [13:12:15] (03CR) 10Arnaudb: [C:03+1] "looks good to me!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1174417 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [13:12:32] @phuedx [13:12:36] ack [13:12:50] !log sgimeno@deploy1003 sgimeno, phuedx: Continuing with sync [13:13:36] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts1004.eqiad.wmnet with reason: host reimage [13:15:41] (03CR) 10Ssingh: "Please double-check my review as well and compare it with the pdns-rec docs where necessary." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:17:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P80305 and previous config saved to /var/cache/conftool/dbconfig/20250730-131723-fceratto.json [13:18:18] !log sgimeno@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173920|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]] (duration: 11m 26s) [13:18:23] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422 [13:18:52] sergi0: ty [13:19:13] @phuedx happy to help [13:24:01] (03CR) 10Elukey: [C:03+1] "Really nice work!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:24:46] !log UTC afternoon backport+config window done [13:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:06] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host vrts1004.eqiad.wmnet with OS bookworm [13:26:06] !log aokoth@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host vrts1004.eqiad.wmnet [13:29:10] 06SRE, 10Hiddenparma, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11046832 (10Joe) [13:30:42] !log brouberol@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [13:31:24] 06SRE, 10Hiddenparma, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11046835 (10Joe) [13:31:37] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11046837 (10Joe) [13:32:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T399728)', diff saved to https://phabricator.wikimedia.org/P80306 and previous config saved to /var/cache/conftool/dbconfig/20250730-133230-fceratto.json [13:32:37] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:32:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [13:32:48] elukey: all done, and we *did* restart the current controller (1016) last this time. 
Eventgate only experienced slight produce time bumps [13:32:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T399728)', diff saved to https://phabricator.wikimedia.org/P80307 and previous config saved to /var/cache/conftool/dbconfig/20250730-133253-fceratto.json [13:33:22] brouberol: really great work on the cookbook! [13:33:26] <3 [13:35:12] (03CR) 10Brouberol: [C:03+2] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:35:50] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.26 - 2025.08.15), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11046858 (10brouberol) 05In progress→03Resolved [13:37:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T399728)', diff saved to https://phabricator.wikimedia.org/P80308 and previous config saved to /var/cache/conftool/dbconfig/20250730-133751-fceratto.json [13:37:59] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:42:13] (03PS2) 10Brouberol: Provision kafka-jumbo1018 [puppet] - 10https://gerrit.wikimedia.org/r/1166834 (https://phabricator.wikimedia.org/T398826) [13:42:53] (03PS2) 10Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) [13:43:04] (03CR) 10Brouberol: [C:03+2] Provision kafka-jumbo1018 [puppet] - 10https://gerrit.wikimedia.org/r/1166834 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [13:44:28] (03PS3) 10Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 
(https://phabricator.wikimedia.org/T395446) [13:44:38] (03Merged) 10jenkins-bot: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:46:45] (03PS1) 10Tiziano Fogli: nrpe wrapper: install dependencies only on bullseye and newer [puppet] - 10https://gerrit.wikimedia.org/r/1174462 (https://phabricator.wikimedia.org/T395446) [13:50:19] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe wrapper: install dependencies only on bullseye and newer [puppet] - 10https://gerrit.wikimedia.org/r/1174462 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:51:12] !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [13:51:32] (03CR) 10Tiziano Fogli: [C:03+2] nrpe wrapper: install dependencies only on bullseye and newer [puppet] - 10https://gerrit.wikimedia.org/r/1174462 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [13:51:58] !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:52:17] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:52:20] (03PS2) 10Tiziano Fogli: nrpe wrapper: install dependencies only on bullseye and newer [puppet] - 10https://gerrit.wikimedia.org/r/1174462 (https://phabricator.wikimedia.org/T395446) [13:52:31] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:52:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:52:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P80309 and previous config saved to /var/cache/conftool/dbconfig/20250730-135259-fceratto.json [13:53:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:53:44] !log kafka-jumbo1018 is added to the cluster, puppet ran on all kafka/zookeeper hosts, external-services was updated on dse-k8s-eqiad, codfw and eqiad - T398826 [13:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] T398826: Bring kafka-jumbo101[6-8] into service - https://phabricator.wikimedia.org/T398826 [13:54:20] !log brouberol@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [13:55:54] RR-ing kafka-jumbo for the last time, to add kafka-jumbo1018 [13:58:25] (03PS1) 10Jelto: gitlab: change nftables rate-limiting policy to accept [puppet] - 10https://gerrit.wikimedia.org/r/1174464 (https://phabricator.wikimedia.org/T400252) [13:58:48] (03CR) 10Tiziano Fogli: [C:03+2] nrpe wrapper: install dependencies only on bullseye and newer [puppet] - 10https://gerrit.wikimedia.org/r/1174462 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1400) [14:00:46] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6455/co" [puppet] - 10https://gerrit.wikimedia.org/r/1174464 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:02:17] (03CR) 10Arnaudb: [C:03+1] "looks good to me, more details received on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1174464 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:03:19] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-07-08-183416 to 2025-07-30-130544 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174465 [14:03:19] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-07-15-225151 to 2025-07-29-155618 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174466 
(https://phabricator.wikimedia.org/T391208) [14:03:37] (03PS2) 10Jelto: gitlab: disable nftables rate-limiting temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1174464 (https://phabricator.wikimedia.org/T400252) [14:04:48] FIRING: [12x] PuppetFailure: Puppet has failed on maps1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:06:07] (03CR) 10Jelto: [C:03+2] gitlab: disable nftables rate-limiting temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1174464 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:06:31] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade evaluators from 2025-07-08-183416 to 2025-07-30-130544 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174465 (owner: 10Jforrester) [14:06:48] FIRING: [4x] PuppetFailure: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:08:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P80310 and previous config saved to /var/cache/conftool/dbconfig/20250730-140806-fceratto.json [14:08:07] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-07-08-183416 to 2025-07-30-130544 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174465 (owner: 10Jforrester) [14:09:20] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:09:47] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:09:48] FIRING: [12x] PuppetFailure: Puppet has failed on maps1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[14:10:24] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:11] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:11:48] FIRING: [4x] PuppetFailure: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:12:03] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:13:04] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-f8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400784#11047004 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:13:54] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-07-15-225151 to 2025-07-29-155618 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174466 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester) [14:14:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device lsw1-e8-eqiad.mgmt.eqiad.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T400782#11047012 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [14:15:40] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-07-15-225151 to 2025-07-29-155618 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174466 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester) [14:16:14] !log apine@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:16:43] !log apine@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:16:48] FIRING: [4x] PuppetFailure: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:17:28] !log apine@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:18:00] !log apine@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:18:06] 06SRE, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11047024 (10CDanis) [14:18:13] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:18:48] FIRING: [2x] PuppetFailure: Puppet has failed on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:19:06] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11047026 (10Jelto) [14:19:12] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:31] !log apine@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:19:36] !log apine@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:19:48] FIRING: [12x] PuppetFailure: Puppet has failed on maps1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:23:11] (03PS1) 10Jelto: gitlab: disable nftables rate-limiting monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1174476 (https://phabricator.wikimedia.org/T400252) [14:23:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T399728)', diff saved to https://phabricator.wikimedia.org/P80311 and previous config saved to /var/cache/conftool/dbconfig/20250730-142314-fceratto.json [14:23:20] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' 
on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:23:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [14:23:38] (03CR) 10Arnaudb: [C:03+1] "discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1174476 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:25:11] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6456/co" [puppet] - 10https://gerrit.wikimedia.org/r/1174476 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:25:42] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable nftables rate-limiting monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1174476 (https://phabricator.wikimedia.org/T400252) (owner: 10Jelto) [14:26:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [14:26:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T399728)', diff saved to https://phabricator.wikimedia.org/P80313 and previous config saved to /var/cache/conftool/dbconfig/20250730-142644-fceratto.json [14:28:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on mwmaint1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:29:26] (03PS1) 10Arnaudb: gitlab: binding nft throttling and its monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1174479 (https://phabricator.wikimedia.org/T400252) [14:29:48] FIRING: [8x] PuppetFailure: Puppet has failed on maps1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:30:05] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1430) [14:30:28] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047078 (10RoySmith) Would it be possible to include a link to this phab ticket and/or the policy page in the HTTP error response? [14:32:07] !log pt1979@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [14:32:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11047090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host clouddb1022.eqiad.wmnet with... [14:32:20] (03PS4) 10Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) [14:32:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T399728)', diff saved to https://phabricator.wikimedia.org/P80314 and previous config saved to /var/cache/conftool/dbconfig/20250730-143235-fceratto.json [14:32:41] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:33:32] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2034.codfw.wmnet with OS bookworm [14:34:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.move-vlan for host logstash2034 [14:34:18] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [14:34:48] FIRING: [8x] PuppetFailure: Puppet has failed on maps1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:35:34] (03PS5) 10Andrea Denisse: centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) [14:36:45] 10ops-eqiad, 06SRE, 06DC-Ops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#11047109 (10Jclark-ctr) 05Open→03Resolved Removed links updated Netbox [14:39:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11047138 (10fnegri) @Papaul the reimage of clouddb1022 will fail until my patch above is merged, I'm waiting for a review from #data-persis... [14:40:02] cwhite@cumin2002 reimage (PID 1730735) is awaiting input [14:41:05] (03PS6) 10Andrea Denisse: centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) [14:41:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:42:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11047161 (10Papaul) @fnegri thanks i was about to ping you also on that. 
[14:45:18] (03CR) 10Papaul: [C:03+1] installserver: setup new hosts clouddb102[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [14:47:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P80315 and previous config saved to /var/cache/conftool/dbconfig/20250730-144743-fceratto.json [14:48:03] (03CR) 10Andrea Denisse: "Hi Filippo, thanks for taking a look!" [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [14:48:55] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2034 - cwhite@cumin2002" [14:49:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2034 - cwhite@cumin2002" [14:49:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:52] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2034.codfw.wmnet 30.16.192.10.in-addr.arpa 0.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:53] !log 💙cdanis@cumin1003.eqiad.wmnet ~ 🕥☕ sudo cumin 'A:cp' 'disable-puppet "cdanis deploy I74ada0e T400753"' [14:49:55] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2034.codfw.wmnet 30.16.192.10.in-addr.arpa 0.3.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:49:56] !log cwhite@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2034 [14:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:07] T400753: FY25/26 WE4.3.1: edge uniques in requestctl - 
https://phabricator.wikimedia.org/T400753 [14:50:21] (03CR) 10CDanis: [C:03+2] haproxy: scrub part of x-analytics even when xwd debug [puppet] - 10https://gerrit.wikimedia.org/r/1172401 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [14:50:33] !log cwhite@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2034 [14:50:34] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047176 (10AntiCompositeNumber) > external mw-related: requests with user-agent strings set by MediaWiki (like ForeignApiRepo) or by other mw-related software like WDQS Updater Does this inclu... [14:50:34] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host logstash2034 [14:51:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:54:12] (03PS5) 10Tiziano Fogli: nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing) [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) [14:54:19] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174411 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:55:50] (03CR) 10Andrew Bogott: [C:03+1] installserver: setup new hosts clouddb102[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [14:56:29] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11047208 (10cmooney) @robh we probably need to order the optics to support these connections. I believe from talking to @Jclark-ctr that enough fibre cables were ordered in our last batch to... 
[14:59:02] !log phase1 💙cdanis@cumin1003.eqiad.wmnet ~ 🕚☕ sudo cumin 'A:cp' 'run-puppet-agent --enable "cdanis deploy I74ada0e T400753"' [14:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:07] T400753: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753 [14:59:15] !log brouberol@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [14:59:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11047222 (10cmooney) >>! In T394333#11044990, @Andrew wrote: > @Jclark-ctr are we waiting on more DACs before we can move ahead with these? We are awaitin... [14:59:48] RESOLVED: PuppetFailure: Puppet has failed on maps1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:00:45] (03PS3) 10Fabfur: traffic: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 [15:00:54] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11047227 (10Jclark-ctr) [15:01:48] RESOLVED: PuppetFailure: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:02:12] (03CR) 10CI reject: [V:04-1] traffic: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 (owner: 10Fabfur) [15:02:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P80316 and previous config saved to /var/cache/conftool/dbconfig/20250730-150250-fceratto.json [15:02:52] 10ops-eqiad, 06SRE, 06DC-Ops: 
eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11047248 (10Jclark-ctr) All switches are Racked and have mgmt and console connected closing work will. [15:02:59] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D upgrade racking task - https://phabricator.wikimedia.org/T400780#11047250 (10Jclark-ctr) 05Open→03Resolved [15:06:24] !log begin phase2 💔cdanis@cumin1003.eqiad.wmnet ~ 🕚☕ sudo cumin 'A:cp' 'disable-puppet "cdanis deploy I74ada0e T400753"' [15:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:30] T400753: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753 [15:06:38] (03CR) 10CDanis: [C:03+2] "done, ty!" [puppet] - 10https://gerrit.wikimedia.org/r/1172402 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [15:09:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2034.codfw.wmnet with reason: host reimage [15:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:38] (03PS1) 10CDanis: Revert "haproxy: scrub part of x-analytics even when xwd debug" [puppet] - 10https://gerrit.wikimedia.org/r/1174490 [15:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:46] (03CR) 10CDanis: [C:03+2] Revert "haproxy: scrub part of x-analytics even when xwd debug" [puppet] - 10https://gerrit.wikimedia.org/r/1174490 (owner: 10CDanis) [15:11:37] 06SRE, 10SRE-swift-storage: Integrity check of commons' original images container dbs - https://phabricator.wikimedia.org/T400700#11047277 
(10MatthewVernon) 05Open→03Resolved a:03MatthewVernon All containers are OK, check took 155 minutes. [15:12:41] (03CR) 10FNegri: [C:03+2] installserver: setup new hosts clouddb102[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1173974 (https://phabricator.wikimedia.org/T393733) (owner: 10FNegri) [15:14:15] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2034.codfw.wmnet with reason: host reimage [15:15:10] (03PS4) 10Fabfur: traffic: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 [15:15:55] Hey all - I’d like to quickly deploy a private security mitigation soon. Let me know if I should hold off. [15:17:34] (03CR) 10CI reject: [V:04-1] traffic: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 (owner: 10Fabfur) [15:17:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T399728)', diff saved to https://phabricator.wikimedia.org/P80317 and previous config saved to /var/cache/conftool/dbconfig/20250730-151758-fceratto.json [15:18:04] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:18:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [15:18:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T399728)', diff saved to https://phabricator.wikimedia.org/P80318 and previous config saved to /var/cache/conftool/dbconfig/20250730-151821-fceratto.json [15:18:29] !log all done 💙cdanis@cumin1003.eqiad.wmnet ~ 🕚☕ sudo cumin 'A:cp' 'run-puppet-agent --enable "cdanis deploy I74ada0e T400753"' [15:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:34] T400753: FY25/26 WE4.3.1: edge uniques in requestctl - https://phabricator.wikimedia.org/T400753 [15:18:43] 
(03PS5) 10Fabfur: traffic: copied some fixes from haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1174416 [15:19:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:20:50] (03CR) 10Fabfur: traffic: copied some fixes from haproxykafka (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1174416 (owner: 10Fabfur) [15:22:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T399728)', diff saved to https://phabricator.wikimedia.org/P80319 and previous config saved to /var/cache/conftool/dbconfig/20250730-152226-fceratto.json [15:24:06] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11047342 (10Jclark-ctr) a:03Jclark-ctr [15:25:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11047343 (10Jclark-ctr) [15:26:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (12) PDUs for ML expansion - https://phabricator.wikimedia.org/T400778#11047345 (10Jclark-ctr) [15:28:25] !log Deployed security mitigation for T400697 [15:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:01] (03PS1) 10CDobbins: admin: remove access for users jamesur and matanya [puppet] - 10https://gerrit.wikimedia.org/r/1174494 (https://phabricator.wikimedia.org/T400374) [15:37:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to 
https://phabricator.wikimedia.org/P80320 and previous config saved to /var/cache/conftool/dbconfig/20250730-153734-fceratto.json [15:39:40] (03PS5) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [15:39:52] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2034.codfw.wmnet with OS bookworm [15:40:29] (03CR) 10CI reject: [V:04-1] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [15:41:08] (03PS1) 10Ahmon Dancy: python-build: Adjust text README.md, remove typos [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174499 [15:41:47] (03PS2) 10Ahmon Dancy: python-build: Adjust text README.md, remove typos [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174499 [15:42:04] (03PS6) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [15:42:50] (03CR) 10CI reject: [V:04-1] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [15:44:58] (03CR) 10Dereckson: "> CCed users: if you think you should keep access please reach out to us" [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [15:45:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:45:55] (03CR) 10Ssingh: [C:03+1] "Looks good. Please keep the main task updated as well about the specific removals in this request." 
[puppet] - 10https://gerrit.wikimedia.org/r/1174494 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [15:47:11] (03PS7) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [15:48:43] (03CR) 10CI reject: [V:04-1] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [15:49:00] (03PS1) 10CDanis: varnish: fix nocookies / wmf-uniq interaction [puppet] - 10https://gerrit.wikimedia.org/r/1174501 (https://phabricator.wikimedia.org/T400753) [15:49:50] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:50:32] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11047466 (10cmooney) @Jclark-ctr @VRiley-WMF I added the links in Netbox now. I just used dummy labels in Netbox for the cables, understand we don't have the cables or optics yet but for when... [15:51:11] pt1979@cumin1002 reimage (PID 3751806) is awaiting input [15:52:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P80321 and previous config saved to /var/cache/conftool/dbconfig/20250730-155241-fceratto.json [15:53:29] (03PS8) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [15:55:14] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:57:09] (03PS1) 10Tiziano Fogli: alertmanager/api/ro: permit ro api calls to domain_networks [puppet] - 10https://gerrit.wikimedia.org/r/1174500 (https://phabricator.wikimedia.org/T400443) [16:01:46] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1022.eqiad.wmnet with OS bookworm [16:01:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11047505 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with... [16:03:38] (03CR) 10CDobbins: "👍 This is the change for the already opted-in users: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1174494" [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [16:06:15] (03CR) 10Brouberol: [C:03+2] Blunderbuss helm chart that works with the new Blunderbuss versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171732 (https://phabricator.wikimedia.org/T392244) (owner: 10Aleksandar Mastilovic) [16:07:01] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174500 (https://phabricator.wikimedia.org/T400443) (owner: 10Tiziano Fogli) [16:07:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T399728)', diff saved to https://phabricator.wikimedia.org/P80322 and previous config saved to /var/cache/conftool/dbconfig/20250730-160749-fceratto.json [16:07:55] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:08:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance [16:08:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T399728)', diff 
saved to https://phabricator.wikimedia.org/P80323 and previous config saved to /var/cache/conftool/dbconfig/20250730-160812-fceratto.json [16:13:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T399728)', diff saved to https://phabricator.wikimedia.org/P80324 and previous config saved to /var/cache/conftool/dbconfig/20250730-161308-fceratto.json [16:13:15] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:21:07] poor wikibugs [16:25:24] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047556 (10DavidBrooks) AutoWikiBrowser uses the MediaWiki API and User-Agent is `WikiFunctions/n.n.n.n (Microsoft Windows NT n.n.n.n; .NET CLR 4.0.n.n)`. I don't know if that is distinctive en... [16:28:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P80325 and previous config saved to /var/cache/conftool/dbconfig/20250730-162816-fceratto.json [16:29:23] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: remove staging page change conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166827 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [16:30:40] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047571 (10Joe) >>! In T400119#11047078, @RoySmith wrote: > Would it be possible to include a link to this phab ticket and/or the policy page in the HTTP error response? The error response wil... 
[16:31:48] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174500 (https://phabricator.wikimedia.org/T400443) (owner: 10Tiziano Fogli) [16:35:29] (03PS2) 10CDanis: varnish: fix nocookies / wmf-uniq interaction [puppet] - 10https://gerrit.wikimedia.org/r/1174501 (https://phabricator.wikimedia.org/T400753) [16:35:29] (03PS1) 10CDanis: varnish: tests: env var for which docker [puppet] - 10https://gerrit.wikimedia.org/r/1174505 [16:37:35] (03PS3) 10Ahmon Dancy: python-build: Adjust text README.md, remove typos [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174499 [16:37:35] (03PS1) 10Ahmon Dancy: python: Include python3-venv package in python base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 [16:37:35] (03PS1) 10Ahmon Dancy: python-build/bookworm/Dockerfile.template: Modernize [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 [16:37:48] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047583 (10Joe) >>! In T400119#11047176, @AntiCompositeNumber wrote: >> external mw-related: requests with user-agent strings set by MediaWiki (like ForeignApiRepo) or by other mw-related softw... [16:42:57] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855#11047590 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:43:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P80326 and previous config saved to /var/cache/conftool/dbconfig/20250730-164323-fceratto.json [16:44:55] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047597 (10Joe) >>! 
In T400119#11047556, @DavidBrooks wrote: > AutoWikiBrowser uses the MediaWiki API and User-Agent is `WikiFunctions/n.n.n.n (Microsoft Windows NT n.n.n.n; .NET CLR 4.0.n.n)`.... [16:58:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T399728)', diff saved to https://phabricator.wikimedia.org/P80327 and previous config saved to /var/cache/conftool/dbconfig/20250730-165831-fceratto.json [16:58:37] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:58:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1252.eqiad.wmnet with reason: Maintenance [16:58:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1252 (T399728)', diff saved to https://phabricator.wikimedia.org/P80328 and previous config saved to /var/cache/conftool/dbconfig/20250730-165853-fceratto.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1700) [17:04:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T399728)', diff saved to https://phabricator.wikimedia.org/P80329 and previous config saved to /var/cache/conftool/dbconfig/20250730-170359-fceratto.json [17:04:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:06:10] (03CR) 10Andrea Denisse: "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1174500 (https://phabricator.wikimedia.org/T400443) (owner: 10Tiziano Fogli) [17:07:35] (03CR) 10CDobbins: [C:03+2] admin: remove access for users jamesur and matanya [puppet] - 10https://gerrit.wikimedia.org/r/1174494 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [17:16:33] (03CR) 10Andrea Denisse: "I'd possibly remove the 'Hosts:' lines from the commit, other than that it LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1174500 (https://phabricator.wikimedia.org/T400443) (owner: 10Tiziano Fogli) [17:18:24] 06SRE, 06Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11047688 (10DavidBrooks) @Joe I wasn't addressing AWB used as a bot, but as an interactive Windows app. Still, the rest of your comment seems applicable. The contact information would be the use... [17:19:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P80330 and previous config saved to /var/cache/conftool/dbconfig/20250730-171906-fceratto.json [17:34:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P80331 and previous config saved to /var/cache/conftool/dbconfig/20250730-173413-fceratto.json [17:38:10] (03PS1) 10Jsn.sherman: Add experiment code to group by toggle [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174515 (https://phabricator.wikimedia.org/T397728) [17:39:31] (03CR) 10Krinkle: [C:03+1] Enable sitemaps API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: 10Tim Starling) [17:49:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T399728)', diff saved to https://phabricator.wikimedia.org/P80332 and previous config saved to /var/cache/conftool/dbconfig/20250730-174921-fceratto.json 
[17:49:27] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:49:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:50:58] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2033.codfw.wmnet with OS bookworm [17:51:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174515 (https://phabricator.wikimedia.org/T397728) (owner: 10Jsn.sherman) [17:51:28] !log cwhite@cumin2002 START - Cookbook sre.hosts.move-vlan for host logstash2033 [17:51:33] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox [17:57:45] !log cwhite@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2033 - cwhite@cumin2002" [17:57:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host logstash2033 - cwhite@cumin2002" [17:57:51] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:52] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2033.codfw.wmnet 16.0.192.10.in-addr.arpa 6.1.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:57:54] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2033.codfw.wmnet 16.0.192.10.in-addr.arpa 6.1.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:57:55] !log cwhite@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2033 [17:58:13] !log 
cwhite@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2033 [17:58:13] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host logstash2033 [18:00:04] brennen and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T1800). nyaa~ [18:04:13] (03CR) 10Krinkle: redirects: update SVN rewrite rules, do not link to Phabricator anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [18:04:27] (03CR) 10Krinkle: [C:04-1] redirects: update SVN rewrite rules, do not link to Phabricator anymore [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [18:05:00] o/ [18:06:57] !log train 1.45.0-wmf.12 status: no current blockers, rolling to group1 using spiderpig [18:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:32] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174523 (https://phabricator.wikimedia.org/T396373) [18:07:34] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174523 (https://phabricator.wikimedia.org/T396373) (owner: 10TrainBranchBot) [18:08:44] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174523 (https://phabricator.wikimedia.org/T396373) (owner: 10TrainBranchBot) [18:16:35] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2033.codfw.wmnet with reason: host reimage [18:16:37] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.12 refs T396373 [18:16:44] T396373: 1.45.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T396373 [18:23:19] !log 
cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2033.codfw.wmnet with reason: host reimage [18:25:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-conf100[1-3] - https://phabricator.wikimedia.org/T398013#11047925 (10Jclark-ctr) a:03Jclark-ctr [18:29:23] vriley@cumin1002 reimage (PID 3936284) is awaiting input [18:30:23] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1022.eqiad.wmnet with OS bookworm [18:30:23] (03CR) 10BCornwall: "The records are all updated now; I was queuing up changes first 😊" [dns] - 10https://gerrit.wikimedia.org/r/1174007 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [18:30:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11047938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm [18:31:17] (03PS9) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [18:41:51] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1022.eqiad.wmnet with reason: host reimage [18:46:30] (03PS1) 10Robertsky: wikimaniawiki: adjust down 2025 namespace protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174527 (https://phabricator.wikimedia.org/T400833) [18:47:58] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1022.eqiad.wmnet with reason: host reimage [18:49:11] (03CR) 10Chlod Alejandro: [C:03+1] wikimaniawiki: adjust down 2025 namespace protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174527 (https://phabricator.wikimedia.org/T400833) (owner: 10Robertsky) [18:50:03] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1023.eqiad.wmnet 
with OS bookworm [18:50:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm [18:50:57] (03PS10) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [18:51:11] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2033.codfw.wmnet with OS bookworm [18:52:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174527 (https://phabricator.wikimedia.org/T400833) (owner: 10Robertsky) [18:55:53] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174528 [19:00:49] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174530 [19:01:58] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1023.eqiad.wmnet with reason: host reimage [19:02:51] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:04:34] (03CR) 10Ssingh: admin: remove prod access for listed users (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:05:00] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:05:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host clouddb1022.eqiad.wmnet with OS bookworm [19:05:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048079 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm completed: - c... [19:06:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048082 (10VRiley-WMF) [19:07:24] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1023.eqiad.wmnet with reason: host reimage [19:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:09] (03PS11) 10CDobbins: admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) [19:12:58] (03PS4) 10Ahmon Dancy: python-build: README.md: Clarify some text [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174499 [19:12:59] (03PS2) 10Ahmon Dancy: python: Include python3-venv package in python base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 [19:12:59] (03PS2) 10Ahmon Dancy: python-build/bookworm/Dockerfile.template: Modernize [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174507 [19:17:41] (03CR) 10CDobbins: admin: remove prod access for listed users (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:19:08] (03CR) 10Ssingh: [C:03+1] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 
(https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:20:06] (03CR) 10CDobbins: [C:03+2] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:20:19] (03CR) 10Dzahn: [C:03+1] admin: remove prod access for listed users [puppet] - 10https://gerrit.wikimedia.org/r/1173484 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [19:22:17] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:25:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:25:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1023.eqiad.wmnet with OS bookworm [19:25:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048141 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm completed: - c... [19:26:10] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host clouddb1024 [19:26:19] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host clouddb1024 [19:27:35] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [19:33:19] (03CR) 10Ahmon Dancy: "This is the meat of what I'd like to change. 
I know there's changelog changes to make but I'm having trouble getting `docker-pkg update` " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy) [19:33:34] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1024 - vriley@cumin1002" [19:33:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1024 - vriley@cumin1002" [19:33:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:34:24] (03PS1) 10BCornwall: ncredir: Add batch of pay-for-edit domains [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) [19:34:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:39:11] (03CR) 10Pppery: ncredir: Add batch of pay-for-edit domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [19:40:17] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:41:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:42:37] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1037.eqiad.wmnet with OS bookworm [19:42:47] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:44:42] (03PS3) 10Bking: Introduce opensearch-operator-crds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) [19:49:50] FIRING: 
NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:53:08] (03PS2) 10BCornwall: ncredir: Add batch of pay-for-edit domains [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) [19:53:56] (03CR) 10BCornwall: ncredir: Add batch of pay-for-edit domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [19:55:42] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:58:55] vriley@cumin1002 provision (PID 3978406) is awaiting input [19:59:55] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T2000). [20:00:05] katherine_g and robertsky: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] o/ [20:02:54] robertsky: do you mind if I go first? [20:03:57] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1037.eqiad.wmnet with reason: host reimage [20:04:15] katherine_g: You're listed first and robertsky hasn't popped in yet, so I say go for it! [20:04:29] And I see that you have. Excellent. 
:-) [20:04:30] dancy: ok great, going for it [20:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174515 (https://phabricator.wikimedia.org/T397728) (owner: 10Jsn.sherman) [20:06:07] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:06:47] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [20:10:53] (03CR) 10Pppery: ncredir: Add batch of pay-for-edit domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [20:11:22] (03Merged) 10jenkins-bot: Add experiment code to group by toggle [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174515 (https://phabricator.wikimedia.org/T397728) (owner: 10Jsn.sherman) [20:11:47] !log kgraessle@deploy1003 Started scap sync-world: Backport for [[gerrit:1174515|Add experiment code to group by toggle (T397728)]] [20:11:56] T397728: Add experiment code to group by toggle - https://phabricator.wikimedia.org/T397728 [20:12:07] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1025 - vriley@cumin1002" [20:12:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt clouddb1025 - vriley@cumin1002" [20:12:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:56] !log kgraessle@deploy1003 kgraessle, jsn: Backport for [[gerrit:1174515|Add experiment code to group by toggle (T397728)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:16:20] !log kgraessle@deploy1003 kgraessle, jsn: Continuing with sync [20:18:56] hihi [20:19:02] I am present. [20:21:41] !log kgraessle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174515|Add experiment code to group by toggle (T397728)]] (duration: 09m 53s) [20:21:47] T397728: Add experiment code to group by toggle - https://phabricator.wikimedia.org/T397728 [20:21:55] robertsky: hi, just finished up you're good to go [20:22:32] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host clouddb1025 [20:22:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host clouddb1025 [20:23:10] okay! [20:23:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1037.eqiad.wmnet with OS bookworm [20:23:46] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host clouddb1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:25:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host clouddb1025.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:26:17] hmm... will need someone to help with the deployment? [20:26:58] robertsky: I can deploy for you [20:27:06] yay! thanks! 
:) [20:27:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174527 (https://phabricator.wikimedia.org/T400833) (owner: 10Robertsky) [20:29:36] (03Merged) 10jenkins-bot: wikimaniawiki: adjust down 2025 namespace protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174527 (https://phabricator.wikimedia.org/T400833) (owner: 10Robertsky) [20:30:01] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1174527|wikimaniawiki: adjust down 2025 namespace protection (T400833)]] [20:30:08] T400833: wikimaniawiki: adjust editing permission for 2025 namespace - https://phabricator.wikimedia.org/T400833 [20:30:51] (03PS7) 10Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [20:31:14] (03CR) 10CI reject: [V:04-1] add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [20:32:10] !log dancy@deploy1003 robertsky, dancy: Backport for [[gerrit:1174527|wikimaniawiki: adjust down 2025 namespace protection (T400833)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:33:05] robertsky: Please test [20:33:15] tested. all good. [20:33:21] !log dancy@deploy1003 robertsky, dancy: Continuing with sync [20:38:25] great! thanks for the help, dancy! 
[20:38:29] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174527|wikimaniawiki: adjust down 2025 namespace protection (T400833)]] (duration: 08m 27s) [20:38:34] T400833: wikimaniawiki: adjust editing permission for 2025 namespace - https://phabricator.wikimedia.org/T400833 [20:38:36] yw [20:45:46] (03PS8) 10Dzahn: add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) [20:47:19] (03CR) 10CI reject: [V:04-1] add zoekt from upstream and blubber builder config to build it [container/codesearch] - 10https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T2100) [21:11:18] (03PS1) 10Dreamy Jazz: ListPage: don't try to list votes for jump polls [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174556 (https://phabricator.wikimedia.org/T400831) [21:11:39] (03PS1) 10Dreamy Jazz: ListPage: don't try to list votes for jump polls [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174557 (https://phabricator.wikimedia.org/T400831) [21:11:50] jouncebot: nowandnext [21:11:50] For the next 0 hour(s) and 48 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T2100) [21:11:50] In 0 hour(s) and 48 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T2200) [21:12:06] Anyone using this window? 
[21:12:09] Like to backport [21:14:01] (03PS1) 10Ryan Kemper: wdqs: allow freiburg's osm-planet qlever endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1174558 (https://phabricator.wikimedia.org/T400594) [21:14:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174556 (https://phabricator.wikimedia.org/T400831) (owner: 10Dreamy Jazz) [21:14:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174557 (https://phabricator.wikimedia.org/T400831) (owner: 10Dreamy Jazz) [21:15:35] (03CR) 10Bking: [C:03+1] wdqs: allow freiburg's osm-planet qlever endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1174558 (https://phabricator.wikimedia.org/T400594) (owner: 10Ryan Kemper) [21:16:00] (03CR) 10Ryan Kemper: [C:03+2] wdqs: allow freiburg's osm-planet qlever endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1174558 (https://phabricator.wikimedia.org/T400594) (owner: 10Ryan Kemper) [21:17:28] (03Merged) 10jenkins-bot: ListPage: don't try to list votes for jump polls [extensions/SecurePoll] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1174556 (https://phabricator.wikimedia.org/T400831) (owner: 10Dreamy Jazz) [21:17:46] (03Merged) 10jenkins-bot: ListPage: don't try to list votes for jump polls [extensions/SecurePoll] (wmf/1.45.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1174557 (https://phabricator.wikimedia.org/T400831) (owner: 10Dreamy Jazz) [21:18:12] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1174556|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]], [[gerrit:1174557|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]] [21:18:23] T400831: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiki].securepoll_votes' doesn't existFunction: 
MediaWiki\Extension\SecurePoll\Pages\ListPage::executeQuery: SELECT DISTINCT vote_voter FROM `securepoll_votes` WHERE - https://phabricator.wikimedia.org/T400831 [21:18:24] T75915: Redirect voter list of jump wikis to control wiki for mw-remote elections - https://phabricator.wikimedia.org/T75915 [21:18:24] T398126: redirect polls should provide a redirect link on every special page - https://phabricator.wikimedia.org/T398126 [21:30:47] (03PS1) 10Ryan Kemper: wdqs: fix roll restart on non-categories hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1174560 (https://phabricator.wikimedia.org/T349011) [21:39:14] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847 (10thcipriani) 03NEW [21:41:20] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11048469 (10thcipriani) This task is to update @QChris 's NDA. @sguebo_WMF mentioned that it had expired and asked what access was still needed. After chatting with @QChris, he no lon... [21:43:43] (03CR) 10Bking: [C:03+1] wdqs: fix roll restart on non-categories hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1174560 (https://phabricator.wikimedia.org/T349011) (owner: 10Ryan Kemper) [21:45:26] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fix roll restart on non-categories hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1174560 (https://phabricator.wikimedia.org/T349011) (owner: 10Ryan Kemper) [21:46:17] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1174556|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]], [[gerrit:1174557|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:46:25] T400831: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiki].securepoll_votes' doesn't existFunction: MediaWiki\Extension\SecurePoll\Pages\ListPage::executeQuery: SELECT DISTINCT vote_voter FROM `securepoll_votes` WHERE - https://phabricator.wikimedia.org/T400831 [21:46:25] T75915: Redirect voter list of jump wikis to control wiki for mw-remote elections - https://phabricator.wikimedia.org/T75915 [21:46:26] T398126: redirect polls should provide a redirect link on every special page - https://phabricator.wikimedia.org/T398126 [21:47:45] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [21:50:29] (03PS1) 10Fabfur: traffic: added alert for haproxykafka_socket_dropped_messages [alerts] - 10https://gerrit.wikimedia.org/r/1174565 (https://phabricator.wikimedia.org/T400684) [21:51:17] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [21:51:55] (03CR) 10CI reject: [V:04-1] traffic: added alert for haproxykafka_socket_dropped_messages [alerts] - 10https://gerrit.wikimedia.org/r/1174565 (https://phabricator.wikimedia.org/T400684) (owner: 10Fabfur) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250730T2200) [22:00:52] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174556|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]], [[gerrit:1174557|ListPage: don't try to list votes for jump polls (T400831 T75915 T398126)]] (duration: 42m 40s) [22:01:01] T400831: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiki].securepoll_votes' doesn't existFunction: MediaWiki\Extension\SecurePoll\Pages\ListPage::executeQuery: SELECT DISTINCT vote_voter FROM `securepoll_votes` WHERE - https://phabricator.wikimedia.org/T400831 [22:01:01] T75915: Redirect voter list of jump wikis to control wiki for mw-remote elections - https://phabricator.wikimedia.org/T75915 [22:01:02] T398126: redirect polls should provide a redirect link on every 
special page - https://phabricator.wikimedia.org/T398126 [22:02:11] (03PS2) 10Fabfur: traffic: added alert for haproxykafka_socket_dropped_messages [alerts] - 10https://gerrit.wikimedia.org/r/1174565 (https://phabricator.wikimedia.org/T400684) [22:02:38] 10SRE-SLO, 10EditCheck, 10Lift-Wing, 06Machine-Learning-Team, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11048515 (10DLynch) @elukey The 24th was when the train reached most wikis containing the change that turned on running... [22:03:37] (03CR) 10CI reject: [V:04-1] traffic: added alert for haproxykafka_socket_dropped_messages [alerts] - 10https://gerrit.wikimedia.org/r/1174565 (https://phabricator.wikimedia.org/T400684) (owner: 10Fabfur) [22:16:23] (03PS1) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [22:19:06] (03CR) 10Dzahn: [V:04-1] "unrelated and unexpected: Could not find class ::passwords::mysql::zuul for zuul1001.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:22:28] (03PS1) 10Dzahn: add passwords::mysql::zuul with fake password [labs/private] - 10https://gerrit.wikimedia.org/r/1174567 (https://phabricator.wikimedia.org/T395938) [22:22:54] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1036.eqiad.wmnet with OS bookworm [22:23:01] (03CR) 10Dzahn: [V:03+2 C:03+2] add passwords::mysql::zuul with fake password [labs/private] - 10https://gerrit.wikimedia.org/r/1174567 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:29:13] (03PS2) 10Dzahn: zuul: add zookeeper to new-zuul main prod VMs [puppet] - 10https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) [22:40:04] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage 
[22:44:36] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1036.eqiad.wmnet with reason: host reimage [22:50:50] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:53:48] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11048605 (10Dzahn) @KFrancis qchris is in the NDA/MOU spreadsheet on row 6. The email address is misspelled though. The correct spelling of the domain is `@quelltextlich.at`. I can'... [22:55:54] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11048611 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173484 already converted the shell user to ldap-only. All we need here is the renewal. [22:59:11] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048613 (10VRiley-WMF) [23:02:34] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1036.eqiad.wmnet with OS bookworm [23:02:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11048614 (10VRiley-WMF) Was able to run though 2 of these. Running into issues with BMC password. clouddb1022 - Finished, no issues clouddb1023 - Finished, Pass... 
[23:09:28] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:11:49] (03PS1) 10Dzahn: zuul: add initial new-zuul config from template [puppet] - 10https://gerrit.wikimedia.org/r/1174570 (https://phabricator.wikimedia.org/T395938) [23:12:14] (03CR) 10CI reject: [V:04-1] zuul: add initial new-zuul config from template [puppet] - 10https://gerrit.wikimedia.org/r/1174570 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [23:24:11] RESOLVED: SystemdUnitFailed: wdqs-updater.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:27:37] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11048644 (10Novem_Linguae) Quick update. I e-signed the NDA on 2025-07-24. I guess next step is, when she gets a moment, for @KFrancis to confirm in this ticket? 
[23:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1174572 [23:38:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1174572 (owner: 10TrainBranchBot) [23:48:57] (03PS9) 10Krinkle: scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) [23:48:59] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: 10Krinkle) [23:49:03] (03PS14) 10Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [23:49:07] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [23:49:50] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:53:01] (03PS10) 10Krinkle: scap: Limit foreachwikiindblist and expanddblist in beta to beta wikis [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) [23:53:24] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1174572 (owner: 10TrainBranchBot) [23:56:00] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: 10Krinkle) [23:57:37] (03PS3) 10BCornwall: ncredir: Add batch of 
pay-for-edit domains [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) [23:57:41] (03CR) 10BCornwall: ncredir: Add batch of pay-for-edit domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1174539 (https://phabricator.wikimedia.org/T400731) (owner: 10BCornwall) [23:59:16] (03CR) 10Zabe: [C:03+2] CommonSettings: Stop setting wgDBuser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174071 (owner: 10Zabe)