[00:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150157 [00:08:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150157 (owner: 10TrainBranchBot) [00:09:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.192s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:10:33] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:18:33] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:35:28] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1150157 (owner: 10TrainBranchBot) [00:47:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.395s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:52:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.166s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:56:15] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:57:03] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:57:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:57:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:58:03] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.377s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main 
- https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.377s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:22:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.742s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:52:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:55:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.598s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:33] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:15:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:18:33] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [04:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:33] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:20:45] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:24:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDo [04:30:43] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 63967 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [04:34:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [04:39:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:02:53] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10855640 (10Stevemunene) 05Open→03Resolved [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:09] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1177.eqiad.wmnet [05:15:17] 
!log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1177.eqiad.wmnet
[05:16:54] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855642 (Stevemunene) Thanks @Jclark-ctr For the host I [x] verified the VDs [x] created the journal node [x] ran...
[05:17:01] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855643 (Stevemunene)
[05:18:35] (PS1) Stevemunene: Revert "hdfs: add an-worker1177 to in retup role" [puppet] - https://gerrit.wikimedia.org/r/1150223
[05:33:47] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10855645 (Papaul) ` UPDATE HAS BEEN ADDED: Dear Juniper Networks Customer, Your replacement part associated with RMA R200568010 Item # 100 has been successfu...
[05:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:15:26] SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855647 (Volans) The user created in Netbox has username `sdeckelmann` while the user in LDAP has UID `sdeckelmann-wmf`...
[06:28:58] !log installing intel-microcode security updates
[06:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:56:47] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1149805 (https://phabricator.wikimedia.org/T394603) (owner: Bunnypranav)
[06:58:57] !log installing Linux 6.1.140 packages
[06:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T0700).
[07:00:05] bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:26] o/
[07:01:44] (CR) Elukey: [C:+1] Default the Kerberos role to nftables [puppet] - https://gerrit.wikimedia.org/r/1149542 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[07:02:42] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1149736 (https://phabricator.wikimedia.org/T393579) (owner: Dzahn)
[07:03:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:04:01] (CR) Brouberol: [C:+1] "yes please!" [puppet] - https://gerrit.wikimedia.org/r/1150223 (owner: Stevemunene)
[07:06:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:08:10] (CR) Stevemunene: [C:+2] Revert "hdfs: add an-worker1177 to in retup role" [puppet] - https://gerrit.wikimedia.org/r/1150223 (owner: Stevemunene)
[07:17:54] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet
[07:18:19] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855685 (ops-monitoring-bot) Host an-worker1177.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[07:18:33] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:20:34] (03PS1) 10Elukey: role::ml_k8s::worker: move ml-serve1003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1150494 (https://phabricator.wikimedia.org/T387854) [07:20:53] (03CR) 10Muehlenhoff: [C:03+2] Fix auto restart for alertmanager-irc-relay [puppet] - 10https://gerrit.wikimedia.org/r/1149544 (owner: 10Muehlenhoff) [07:22:59] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1003.eqiad.wmnet [07:23:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1003.eqiad.wmnet [07:23:22] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1003.eqiad.wmnet [07:23:51] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve1003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1150494 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [07:25:42] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1177.eqiad.wmnet [07:26:46] (03CR) 10Elukey: [C:03+1] profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [07:28:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1003.eqiad.wmnet [07:32:33] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bookworm [07:37:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - 
kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:37:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:40:32] (03PS16) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) [07:40:32] (03CR) 10Arnaudb: "Following up April 30th Gerrit split brain, there are now:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [07:45:42] (03PS3) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [07:46:54] (03PS4) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [07:52:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855739 (10Stevemunene) Host has successfully rejoined the cluster {F60548076} [07:52:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855740 (10Stevemunene) [07:52:39] (03PS5) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [07:52:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard 
drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855741 (10Stevemunene) 05Open→03Resolved [07:57:05] (03PS9) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [07:57:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:58:51] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855743 (10SLyngshede-WMF) The account is correctly linked in the social_auth tabel, not sure how though: ` >>> u = Use... [07:59:57] (03PS10) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [07:59:58] (03PS5) 10Volans: git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:01:17] (03CR) 10CI reject: [V:04-1] git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [08:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:20] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855775 (10SLyngshede-WMF) @SDeckelmann-WMF can you try something, not sure if that will work, but I'll like to get a con... 
[08:05:27] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855776 (10SLyngshede-WMF) p:05Triage→03High a:03SLyngshede-WMF [08:06:28] (03CR) 10Ayounsi: [C:03+1] definitions: Add port for x3 wiki replica backend [homer/public] - 10https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [08:07:23] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10855779 (10tappof) Ok, thank you @RobH. I’ll add some Pint directives to silence alerts for missing metrics in the DCs that do... [08:08:10] (03CR) 10Majavah: [C:03+2] definitions: Add port for x3 wiki replica backend [homer/public] - 10https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [08:08:41] (03Merged) 10jenkins-bot: definitions: Add port for x3 wiki replica backend [homer/public] - 10https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [08:10:57] (03PS4) 10Tiziano Fogli: pdus: add pro4x breaker alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) [08:11:58] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [08:11:59] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [08:12:28] (03CR) 10Volans: "Just passing by and left some spicerack-specific suggestions." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [08:12:50] (03PS6) 10Volans: git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:12:58] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:13:16] (03CR) 10Fabfur: [C:03+2] haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [08:13:58] (03CR) 10Tiziano Fogli: [C:03+2] pdus: add pro4x breaker alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [08:14:56] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [08:15:02] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [08:15:57] (03Merged) 10jenkins-bot: pdus: add pro4x breaker alerts [alerts] - 10https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [08:17:29] (03CR) 10Brouberol: "I think you're right. Let's abandon it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [08:17:51] (03Abandoned) 10Brouberol: airflow: relax timeout after which DAGs are deleted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [08:18:33] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:19:03] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [08:19:09] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [08:21:10] (03PS1) 10Slyngshede: P:idp always use Wikimedia theme [puppet] - 10https://gerrit.wikimedia.org/r/1150581 [08:22:56] jouncebot: nowandnext [08:22:56] No deployments scheduled for the next 1 hour(s) and 37 minute(s) [08:22:56] In 1 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1000) [08:23:33] RESOLVED: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:23:53] anybody object if I use this window to do a citoid deploy? can't make this week's scheduled one. [08:24:43] +1 from my side, it doesn't seem to be a problem. Anything risky to deploy? 
[08:29:16] (03PS1) 10Brouberol: airflow: never start an instance with 0 DAG parsing processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998) [08:30:25] Nope [08:30:29] Nothing risky [08:31:08] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147106 (owner: 10PipelineBot) [08:31:41] (famous last words) [08:32:50] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147106 (owner: 10PipelineBot) [08:34:19] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1003.eqiad.wmnet with OS bookworm [08:34:44] (03PS1) 10Vgutierrez: hiera: Depool lvs1013 before switching to katran [puppet] - 10https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) [08:34:45] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bookworm [08:34:49] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [08:35:13] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [08:35:48] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [08:35:49] (03CR) 10Volans: "I went for the simplification option, not passing remote_name to git::clone and overriding everything in the .git/config file." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:41:08] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [08:41:35] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [08:42:43] (03PS7) 10Volans: git::clone: remote remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:43:50] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [08:44:17] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [08:45:13] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [08:46:11] (03PS8) 10Volans: git::clone: remove remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:46:35] (03CR) 10Brouberol: [C:03+2] airflow: never start an instance with 0 DAG parsing processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [08:46:36] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [08:48:26] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1135.eqiad.wmnet with reason: Investigate MegaRAID failure [08:48:29] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10855909 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b04912d5-a1f9-43e3-8127-02a5d51fd650) set by stevemunene@cumin10... 
[08:48:38] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10855911 (10Stevemunene) Hi @jcrespo apologies for the delay, this has been downtimed [08:48:52] (03CR) 10Vgutierrez: [C:03+2] systemd::timer: Allow setting FixedRandomDelay [puppet] - 10https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [08:50:39] (03CR) 10Vgutierrez: [C:03+2] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - 10https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [08:53:12] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [08:56:02] (03CR) 10Elukey: [C:03+1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:57:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage [08:59:26] (03PS1) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) [09:02:34] (03CR) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [09:03:41] (03CR) 10Vgutierrez: [C:04-1] "this opens the door for third-parties spoofing the value of X-Requestctl-ISP" [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:04:50] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - 
https://phabricator.wikimedia.org/T394632#10855999 (10jcrespo) Assuming this is a hw failure, remember to notify dc-ops ( https://phabricator.wikimedia.org/maniphest/task/edit/form/55... [09:07:01] (03PS2) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) [09:07:33] (03CR) 10Fabfur: "yeah, nice catch, fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:09:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:25] (03CR) 10Vgutierrez: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:14:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bookworm [09:14:48] (03CR) 10Vgutierrez: [C:03+2] varnish: Deploy edge uniques experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [09:17:04] (03CR) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:21:57] (03CR) 10Volans: "PCC results: https://puppet-compiler.wmflabs.org/output/1148267/4002/" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [09:25:05] 
(03PS5) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) [09:25:30] (03CR) 10Brouberol: [C:03+2] airflow: disable hardcoded networkpolicy in favor of the service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149639 (https://phabricator.wikimedia.org/T392668) (owner: 10Brouberol) [09:27:14] (03CR) 10Fabfur: external_cloud_vendors: fix Azure prefix fetch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [09:27:37] (03PS1) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) [09:27:39] (03PS1) 10Vgutierrez: varnish: Fix wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001) [09:28:46] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:29:41] (03CR) 10Vgutierrez: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:31:23] (03PS1) 10Clément Goubert: mw::maintenance::purge_securepoll: Fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) [09:33:18] (03CR) 10Hnowlan: [C:03+1] mw::maintenance::purge_securepoll: Fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:33:38] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3080 is CRITICAL: CRITICAL: Status of the systemd unit 
wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:38] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [09:33:38] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3066 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:38] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp6009 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:38] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7014 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:38] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7002 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:39] (03PS2) 10Hnowlan: mw::periodic_job: clean up migration_title parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555) [09:33:39] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [09:33:40] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:34:47] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix wmfuniq_experiment_fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [09:34:50] (03PS2) 10Brouberol: deployment_server: deploy the mediawiki-dumps-legacy scap target [puppet] - 
10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) [09:34:58] (03PS2) 10Clément Goubert: mw::maintenance::purge_securepoll: Fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) [09:35:00] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:35:06] (03CR) 10Brouberol: "Thanks for the review Scott!" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [09:37:54] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp1114 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:54] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp1105 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:54] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3077 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:56] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7003 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:56] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7015 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:56] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7009 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:41] (03CR) 10Clément Goubert: 
[C:03+2] mw::maintenance::purge_securepoll: Fix dblist [puppet] - 10https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:42:12] PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp6012 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:42:16] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: 10Clément Goubert) [09:42:37] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10856107 (10akosiaris) 1 single bucket, at least at the beginning. Reading https://distribution.github.io/distribution/about/configuration/, I don't think the softw... [09:43:06] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10856108 (10akosiaris) > If you want to do some testing, I could set you up with a test account on apus. That would be swell! 
[09:43:38] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3080 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:43:38] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3066 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:43:38] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp6009 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:43:39] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7002 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:43:39] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7014 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:45:29] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1003.eqiad.wmnet [09:45:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1003.eqiad.wmnet [09:47:41] (03PS1) 10Elukey: role::ml_k8s::worker: upgrade ml-serve1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1150604 (https://phabricator.wikimedia.org/T387854) [09:47:47] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1004.eqiad.wmnet [09:47:54] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp1114 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:54] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp1105 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:54] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3077 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:56] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7015 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:56] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7003 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:47:56] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7009 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:51:21] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: upgrade ml-serve1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1150604 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:52:12] RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp6012 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:52:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1004.eqiad.wmnet [09:53:55] (03PS1) 10Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) [09:54:38] (03CR) 10Jgiannelos: "This is the patch from last week after some more testing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [09:55:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] external_cloud_vendors: fix Azure prefix fetch [puppet] - 
10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [09:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:53] (03CR) 10Vgutierrez: [C:03+1] external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1000) [10:04:26] elukey@cumin1002 reimage (PID 2164230) is awaiting input [10:08:30] (03Abandoned) 10Hnowlan: mw::periodic_job: clean up migration_title parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555) (owner: 10Hnowlan) [10:08:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [10:08:41] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm [10:09:14] (03CR) 10Hnowlan: [C:03+1] Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:13:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:13:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:13:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 
(https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [10:22:13] (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: fix Azure prefix fetch [puppet] - 10https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: 10Fabfur) [10:22:36] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1004.eqiad.wmnet with OS bookworm [10:23:02] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm [10:32:53] (03PS1) 10Michael Große: SpecialHomepageLogger: Populate email state even with StartModule disabled [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) [10:33:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: 10Michael Große) [10:34:03] (03PS2) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) [10:39:11] (03PS3) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) [10:39:33] (03CR) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:42:26] (03CR) 10FNegri: [C:03+1] openstack: wmcs-bastionless: Fix condition [puppet] - 10https://gerrit.wikimedia.org/r/1149811 (https://phabricator.wikimedia.org/T379550) (owner: 10Majavah) [10:43:49] (03CR) 10Fabfur: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:44:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:44:33] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:46:40] (03CR) 10Majavah: [C:03+2] openstack: wmcs-bastionless: Fix condition [puppet] - 10https://gerrit.wikimedia.org/r/1149811 (https://phabricator.wikimedia.org/T379550) (owner: 10Majavah) [10:51:14] (03CR) 10Majavah: [C:03+1] Remove unused option to enable host-based auth [puppet] - 10https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [10:56:00] (03PS1) 10Hnowlan: alertmanager: adjust phab project to security-team rather than security tag [puppet] - 10https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) [10:58:29] (03CR) 10JMeybohm: [C:03+1] profile::prometheus::k8s: drop terminated pod targets [puppet] - 10https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: 10Scott French) [10:58:55] (03PS2) 10Vgutierrez: hiera: Depool lvs1013 before switching to katran [puppet] - 10https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) [10:58:55] (03PS1) 10Vgutierrez: hiera: Use katran in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) [10:59:17] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:59:39] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:04:59] (03PS11) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [11:04:59] (03PS9) 10Volans: git::clone: remove remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [11:05:16] (03CR) 10Volans: homer: make private repo support 
multiple peers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [11:05:45] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [11:08:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [11:09:04] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:09:16] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet [11:11:57] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:12:11] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:48] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:14:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:14:57] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:15:04] elukey@cumin1002 reimage (PID 2165345) is awaiting input [11:15:11] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:15:25] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet [11:15:56] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:16:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [11:16:14] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host 
cloudbackup1001-dev.eqiad.wmnet [11:18:38] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:20:10] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [11:20:13] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [11:21:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [11:21:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1150581 (owner: 10Slyngshede) [11:22:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [11:22:57] (03PS1) 10Samwilson: InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) [11:24:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [11:25:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [11:25:45] PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:13] RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [11:27:20] FIRING: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:34] (03CR) 10Slyngshede: [C:03+2] P:idp always use Wikimedia theme [puppet] - 10https://gerrit.wikimedia.org/r/1150581 (owner: 10Slyngshede) [11:28:35] (03CR) 10Cparle: [C:03+1] InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [11:31:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [11:32:20] RESOLVED: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2006.codfw.wmnet [11:34:57] (03PS1) 10Bartosz Wójtowicz: ml-services: Update multiple ML models on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865) [11:37:10] (03CR) 10Gkyziridis: [C:03+1] "LGTM! Thnx for working on this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [11:38:53] (03CR) 10Gkyziridis: [C:03+2] ml-services: Update multiple ML models on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [11:39:47] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1005.eqiad.wmnet with reason: update [11:40:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2006.codfw.wmnet [11:45:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2005.codfw.wmnet [11:45:46] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on people1004.eqiad.wmnet with reason: update [11:46:01] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:46:51] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lists2001.wikimedia.org with reason: update [11:47:51] (03CR) 10Muehlenhoff: [C:03+2] Default the Kerberos role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1149542 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [11:48:25] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit2003.wikimedia.org with reason: update [11:49:19] (03PS1) 10Majavah: openstack: drain-hypervisor: Ignore instances being deleted [puppet] - 10https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244) [11:50:05] (03CR) 10JMeybohm: [C:04-1] "As said on IRC: I don't really like the name being so generic. 
Maybe you can find something that makes it more clear that this is rule is " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [11:50:34] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad2002.codfw.wmnet with reason: update [11:52:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2005.codfw.wmnet [11:52:31] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1004.eqiad.wmnet with reason: update [11:52:55] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc2003.codfw.wmnet with reason: update [11:53:33] (03PS1) 10Clément Goubert: mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245) [11:54:00] (03PS2) 10Majavah: openstack: drain-hypervisor: Ignore instances being deleted [puppet] - 10https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244) [11:54:59] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:55:28] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1004.eqiad.wmnet with reason: update [11:56:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2004.codfw.wmnet [11:57:30] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10856437 (10Stevemunene) Checking the battery details as per https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#... 
[11:57:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:57:57] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on aphlict2001.codfw.wmnet with reason: update [11:58:02] !log installing postgresql-15 security updates [11:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:48] (03PS1) 10Majavah: openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - 10https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) [11:59:32] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:59:57] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:00:01] (03CR) 10CI reject: [V:04-1] openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - 10https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) (owner: 10Majavah) [12:00:18] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on aphlict1002.eqiad.wmnet with reason: update [12:00:25] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE, 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856445 (10KCVelaga_WMF) > The deployment group brings a lot of power with it, though. I'm not sure that all of our possible Airflow developers would... 
[12:02:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2004.codfw.wmnet [12:03:19] (03PS1) 10Cathal Mooney: Add entry for cagefive2* hosts in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) [12:03:33] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:10] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:04:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:04:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host cagefive2001.codfw.wmnet [12:05:12] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on releases2003.codfw.wmnet with reason: update [12:05:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cagefive2001.codfw.wmnet [12:06:27] (03PS1) 10Muehlenhoff: Add library hint for gcc-12 [puppet] - 10https://gerrit.wikimedia.org/r/1150643 [12:07:26] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:07:49] (03PS1) 10Hnowlan: changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132) [12:07:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2003.codfw.wmnet [12:08:20] (03CR) 10Ayounsi: "Why not name 
them sretest like the others? Unless there is a good reason I'd prefer we keep our standards" [puppet] - 10https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [12:09:51] cmooney@cumin1002 reimage (PID 2179552) is awaiting input [12:11:11] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:11:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:11:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:11:31] PROBLEM - Hadoop NodeManager on an-worker1191 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:12:03] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cagefive2001.mgmt.codfw.wmnet on all recursors [12:12:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cagefive2001.mgmt.codfw.wmnet on all recursors [12:12:14] (03CR) 10Effie Mouzeli: [C:03+1] pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [12:13:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2003.codfw.wmnet [12:15:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2002.codfw.wmnet [12:16:03] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE, 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856492 (10brouberol) > 
We could just create a group called `airflow-deployers` and reference all members of the `analytics-privatedata-users` group... [12:17:50] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for gcc-12 [puppet] - 10https://gerrit.wikimedia.org/r/1150643 (owner: 10Muehlenhoff) [12:19:31] RECOVERY - Hadoop NodeManager on an-worker1191 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:21:28] (03PS2) 10Majavah: openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - 10https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) [12:21:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2002.codfw.wmnet [12:22:24] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10856524 (10cmooney) @Jhancock.wm hey I'm having some problems reaching cagefive2001 over management. The IP it is assigned is not respo... 
[12:22:29] (03PS1) 10Brouberol: admin/data: create an airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125) [12:22:31] (03PS1) 10Brouberol: airflow-dev: make kubeconfig group-owned by the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150655 (https://phabricator.wikimedia.org/T395125) [12:23:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet [12:25:45] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856529 (10brouberol) a:03brouberol [12:25:49] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856531 (10brouberol) [12:25:49] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856532 (10brouberol) 05Open→03In progress [12:26:41] (03Abandoned) 10Brouberol: admin/data: add kcvelaga to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1149666 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [12:29:44] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet [12:30:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet [12:50:27] (03CR) 10Volans: [C:03+2] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [12:50:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [12:52:19] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:52:27] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm [12:52:41] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:52:45] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1004.eqiad.wmnet with OS bookworm [12:52:55] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:53:33] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7001.magru.wmnet [12:53:42] (03CR) 10Majavah: [C:03+2] openstack: drain-hypervisor: Ignore instances being deleted [puppet] - 10https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244) (owner: 10Majavah) [12:53:51] (03CR) 10Majavah: [C:03+2] openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - 10https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) (owner: 10Majavah) [12:54:19] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:54:37] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:55:17] (03PS1) 10Muehlenhoff: Add maps-bookworm alias [puppet] - 10https://gerrit.wikimedia.org/r/1150670 (https://phabricator.wikimedia.org/T381565) [12:55:46] !log Update Recommendation-API to 2025-05-26-081343-production (T394441, T395026, T306508, T391230) [12:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:53] T394441: Rec API tests failing intermittently - https://phabricator.wikimedia.org/T394441 [12:55:54] T395026: Rec APi not picking up new collection Wiki99/LGBT+ - https://phabricator.wikimedia.org/T395026 [12:55:54] T306508: ContentTranslation doesn't know that an article already exists in the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T306508 [12:55:54] T391230: Unified Dashboard: Support country-level filtering under custom suggestions view - https://phabricator.wikimedia.org/T391230 [12:56:54] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm [12:57:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7001.magru.wmnet [12:57:53] (03PS4) 10JMeybohm: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [12:58:47] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [12:59:04] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1300). 
[13:00:05] isaranto, tgr, MichaelG_WMF, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:13] RECOVERY - Squid on install1004 is OK: TCP OK - 3.040 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:00:15] o/ [13:01:20] o/ [13:01:53] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:39] shall I proceed with the deployment? [13:03:08] My change cannot be tested in the UI (as far as I can tell), but a *lot of* errors in logstash should go away once it has been deployed. Is it possible to only see errors from "mwdebug" on logstash? [13:03:19] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:03:24] @isaranto If you can deploy, that would be great! [13:03:25] I'll be back in the second half of the hour [13:03:49] MichaelG_WMF: yes, use the mwdebug dashboard [13:04:06] I'm going to proceed with my backport first -- I'm in a meeting with folks to QA it first. MichaelG_WMF: then I can proceed with your deployment [13:04:15] deploying! [13:04:29] or filter the normal dashboard by hostname, if you are using one of the non-k8s debug hosts [13:04:31] (03CR) 10Raymond Ndibe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: 10Raymond Ndibe) [13:04:39] isaranto: thank you, that works for me! 
[13:04:43] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:04:49] (03PS10) 10Volans: git::clone: remove remote_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [13:04:49] (03PS1) 10Volans: homer: fix private repository config [puppet] - 10https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) [13:04:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:04:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [13:05:09] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:05:09] RECOVERY - Squid on install1004 is OK: TCP OK - 0.016 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:05:27] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:06:11] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:06:26] (03PS5) 10Alexandros Kosiaris: WIP: adding mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (owner: 10Effie Mouzeli) [13:06:29] (03CR) 10Jelto: "I forgot about the already existing PTR, so yes that makes sense and we don't need another one! 
But let's wait for a clear decision in T39" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [13:06:57] tgr: isaranto: thanks, I have confirmed that I can trigger the error my change fixes on the mwdebug dashboard. That way I should be able to test that it works. [13:07:13] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [13:07:22] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [13:07:51] (03CR) 10JMeybohm: [C:03+2] Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [13:07:59] (03Merged) 10jenkins-bot: ores-extension: enable ores extention UI in idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:08:04] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [13:08:17] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1149407|ores-extension: enable ores extention UI in idwiki (T382171)]] [13:08:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [13:08:21] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [13:08:30] (03CR) 10Muehlenhoff: [C:03+2] Add maps-bookworm alias [puppet] - 10https://gerrit.wikimedia.org/r/1150670 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:08:33] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:10] (03CR) 10Volans: 
[C:03+2] homer: fix private repository config [puppet] - 10https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [13:09:30] jayme: if you got my commit too feel free to merge it ;) [13:09:56] volans: I'll merge your patch along [13:10:06] thanks a lot :D [13:12:39] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [13:13:31] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [13:13:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [13:14:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [13:14:59] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet [13:18:04] (03PS6) 10Effie Mouzeli: functions-orchestrator: add mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) [13:18:16] (03CR) 10Majavah: [C:04-1] "This needs to wait until https://phabricator.wikimedia.org/T394337. Right now there's a broken deployment in that namespace in tools and d" [puppet] - 10https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: 10Raymond Ndibe) [13:19:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [13:20:48] (03PS2) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. 
[dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) [13:21:16] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [13:21:36] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2008-dev.codfw.wmnet [13:22:03] (03PS2) 10Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 [13:22:15] !log gkyziridis@deploy1003 isaranto, gkyziridis: Backport for [[gerrit:1149407|ores-extension: enable ores extention UI in idwiki (T382171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:22:19] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [13:23:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet [13:24:37] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [13:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet [13:28:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2008-dev.codfw.wmnet [13:28:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage [13:28:22] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [13:29:12] (03PS3) 10Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 [13:29:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [13:30:33] (03PS3) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. 
[dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) [13:31:38] (03Abandoned) 10Slyngshede: Drop jackson-module-kotlin (experimental) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [13:32:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [13:32:25] !log gkyziridis@deploy1003 Sync cancelled. [13:33:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [13:33:29] (03CR) 10Vgutierrez: [C:03+1] templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [13:33:31] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [13:34:33] (03CR) 10Ssingh: [C:03+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [13:34:41] !log sukhe@dns1004 START - running authdns-update [13:34:49] we didn't end up syncing my patch because there was an issue with QA. MichaelG_WMF shall I proceed with your patch? [13:35:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [13:35:15] yes, please! 
[13:35:19] !log sukhe@dns1004 END - running authdns-update [13:35:28] (03PS1) 10Majavah: P:openstack: Migrate simple rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1150685 [13:35:28] (03PS1) 10Majavah: P:openstack: pdns: Migrate mysql_root ferm service to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150686 [13:35:29] (03PS1) 10Majavah: P:openstack: codfw1dev: Migrate Cumin ferm term to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1150687 [13:35:37] isaranto: you will need to revert your patch if you did not end up deploying it [13:35:40] let's do the rest at the same time, maybe? [13:35:53] taavi: on it! [13:35:54] (03PS4) 10Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986) [13:35:55] not much hope of fitting them in the hour, otherwise [13:35:56] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10856773 (10ssingh) [13:36:35] (03CR) 10CI reject: [V:04-1] P:openstack: Migrate simple rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1150685 (owner: 10Majavah) [13:36:38] (03PS1) 10Ilias Sarantopoulos: Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150688 [13:36:52] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1150685 (owner: 10Majavah) [13:36:57] although I guess nothing important is happening afterwards [13:37:01] (03CR) 10Gkyziridis: [C:03+2] Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150688 (owner: 10Ilias Sarantopoulos) [13:37:53] (03Merged) 10jenkins-bot: Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150688 (owner: 10Ilias 
Sarantopoulos) [13:37:57] I have an image change to merge, but it's not urgent, it can wait until y'all are done with scap deployments [13:38:02] ok. just waiting for the revert to be merged and then will deploy Michael's patch [13:38:05] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [13:38:23] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1150685 (owner: 10Majavah) [13:38:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [13:38:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:20] I can deploy the next patch(es) shall I do them all together via Spiderpig? 
[13:40:46] tgr: feel free to extend the window [13:40:51] my revert has already been merged [13:40:59] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5679/co" [puppet] - 10https://gerrit.wikimedia.org/r/1150686 (owner: 10Majavah) [13:41:17] (03CR) 10Elukey: [C:03+2] Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:41:54] I started just with MichaelG_WMF 's patch [13:42:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by isaranto@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: 10Michael Große) [13:42:08] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lists1004.wikimedia.org with reason: update [13:42:10] thanks! [13:42:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:33] (03PS1) 10Fabfur: hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) [13:43:31] jouncebot: nowandnext [13:43:31] For the next 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1300) [13:43:31] In 1 hour(s) and 46 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1530) [13:44:15] tgr: I'll ping you once we're finished with MichaelG_WMF patch. 
sorry for the delay folks - we were trying to debug some thresholds + UI things and decided to revert to be sure [13:44:22] (03Merged) 10jenkins-bot: SpecialHomepageLogger: Populate email state even with StartModule disabled [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: 10Michael Große) [13:45:15] !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:45:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bookworm [13:45:52] !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:46:14] !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [13:46:34] !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [13:47:04] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1004.eqiad.wmnet [13:47:05] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1004.eqiad.wmnet [13:47:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [13:47:41] got an error on spiderpig https://spiderpig.wikimedia.org/jobs/97 It says it found an unexpected diff which was caused by the revert [13:48:06] (03PS1) 10Ayounsi: Interfaces: also alert on frack routers and switches [alerts] - 10https://gerrit.wikimedia.org/r/1150692 (https://phabricator.wikimedia.org/T388641) [13:48:10] I mean my previous revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1150688 [13:49:13] any help? 
I guess we should either retry or revert MichaelG_WMF 's patch as well as it has already been merged [13:49:27] (03CR) 10Hashar: [C:03+1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148268/12/modules/homer/templates/private-git/config.erb indeed does the magic which" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [13:49:42] you'll need to manually pull the revert to the deployment server [13:49:56] what's Gkyziridis's IRC nick btw? [13:50:12] georgekyz: [13:50:18] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [13:50:25] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [13:52:17] (03CR) 10Cathal Mooney: "I have no idea tbh, and I agree. I guess we could just rename in netbox?" [puppet] - 10https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:52:59] taavi: I guess you mean doing a manual revert as described here ? https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Manual_revert [13:53:16] no [13:53:23] since you already did the change manually in gerrit [13:53:46] the procedure you're looking for is https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Fetching_patches [13:54:04] also file a scap bug about the error you got [13:55:34] ack and done. thank you [13:56:05] so now shall i retry through spiderpig? [13:57:05] sure [13:57:54] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:57:56] !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]] [13:58:02] T394017: '.event' should have required property 'start_email_state' - https://phabricator.wikimedia.org/T394017 [13:58:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:48] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 773, active_shards: 1826, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [13:58:48] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [13:58:49] you can just tell spiderpig to continue the deploy if the "unexpected diff" is actually something you expected [13:59:52] I ran a new deployment. in the previous one I clicked yes on the prompt to show the diff and then it failed [14:01:01] !log isaranto@deploy1003 migr, isaranto: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:01:15] * MichaelG_WMF looks [14:02:23] isaranto: I'm not seeing the error anymore. Thank you! [14:02:32] ok, proceeding! 
[14:02:36] !log isaranto@deploy1003 migr, isaranto: Continuing with sync [14:09:48] (03PS1) 10Bartosz Wójtowicz: ml-services: Update docker image tags for ML staging models. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) [14:10:09] I found a feature request related to rolling back with spiderpig https://phabricator.wikimedia.org/T394858 [14:12:29] !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]] (duration: 14m 33s) [14:12:34] T394017: '.event' should have required property 'start_email_state' - https://phabricator.wikimedia.org/T394017 [14:12:54] MichaelG_WMF: deployed. tgr you can go now [14:13:35] (03CR) 10Tiziano Fogli: [C:04-1] "The `RewriteEngine on` directive is already declared in modules/profile/templates/prometheus/httpd-public.conf.erb, which is used by the p" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah) [14:13:50] or anzx . I'll leave it up2u folks to coordinate [14:13:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10856882 (10elukey) p:05Triage→03Medium [14:13:59] taavi: thanks once again for the help! 
[14:15:31] isaranto: I don't have deployment access, someone else needed to deploy my patch [14:15:39] (03CR) 10Vgutierrez: [C:03+1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:15:49] (03CR) 10Vgutierrez: [C:03+1] hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:16:36] (03CR) 10Vgutierrez: [C:04-1] hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:16:46] (03PS1) 10Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) [14:17:07] (03CR) 10Bartosz Wójtowicz: "This patch updates staging image tags for models affected by this pre-commit change:https://gerrit.wikimedia.org/r/c/machinelearning/liftw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:17:17] (03CR) 10Vgutierrez: [C:04-1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:17:24] (03CR) 10Vgutierrez: [C:03+1] hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:17:57] (03CR) 10Bartosz Wójtowicz: "Making it unresolved" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:18:19] (03CR) 10CI reject: [V:04-1] changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [14:19:48] anzx: I will have to go in ~10min. tgr, could you deploy it along with your patches? [14:21:45] (03PS1) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 180s [dns] - 10https://gerrit.wikimedia.org/r/1150701 (https://phabricator.wikimedia.org/T394312) [14:22:06] isaranto: tgr: i can schedule it for tomorrow afternoon if not possible to deploy today [14:22:19] isaranto: I said you should file a scap bug about the crash you got, not that you should file a spiderpig feature request for a feature that'd have let you avoid that manual revert in the first place :P [14:23:53] taavi: cool, got it! I'll file that scap bug report. I just mentioned that there is already a feature request that describes why the current process fails and will fix this [14:24:06] * would fix this [14:24:19] yeah, just couldn't tell from your message whether you thought that was a duplicate or not [14:31:42] I am in a meeting, I can do it a little later if you have time [14:31:56] sure [14:32:58] Am I ok to deploy my image change in the meantime? 
[14:36:37] I'll go ahead and do it, should be relatively quick [14:38:05] !log cgoubert@deploy1003 Started scap sync-world: mediawiki-cli image update - T395245 [14:38:10] T395245: Add a flag to the mwscript wrapper to set +e when required - https://phabricator.wikimedia.org/T395245 [14:40:48] (03PS8) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [14:41:09] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150704 [14:41:38] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:41:42] (03PS1) 10Vgutierrez: hiera: Enable edge uniques in another host per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) [14:41:47] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:42:03] (03CR) 10FNegri: wikireplicas scripts: setup pytest, add first test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [14:42:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:43:45] (03PS9) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [14:46:55] (03CR) 10Ssingh: [C:03+1] hiera: Enable edge uniques in another host per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:47:08] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [14:48:19] (03PS4) 10Fabfur: haproxy: do not 
set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) [14:48:47] !log cgoubert@deploy1003 Finished scap sync-world: mediawiki-cli image update - T395245 (duration: 10m 41s) [14:48:48] (03CR) 10Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:48:53] T395245: Add a flag to the mwscript wrapper to set +e when required - https://phabricator.wikimedia.org/T395245 [14:49:10] (03PS2) 10Fabfur: hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) [14:49:28] (03PS10) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [14:52:23] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques in another host per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:54:17] (03CR) 10Vgutierrez: [C:03+1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:55:00] (03CR) 10Fabfur: [C:03+2] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - 10https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:55:38] (03CR) 10Fabfur: [C:03+2] hiera: enable maxmind isp lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [14:56:14] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:56:45] !log cgoubert@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-cron: apply [14:58:33] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:56] (03PS10) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 [14:58:56] (03PS1) 10Tiziano Fogli: monitoring::service: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 [15:00:09] (03PS11) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 [15:00:09] (03PS2) 10Tiziano Fogli: monitoring::service: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 [15:00:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.376s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:00:26] (03CR) 10Hnowlan: [C:03+1] mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245) (owner: 10Clément Goubert) [15:00:36] (03CR) 10Gkyziridis: [C:03+1] "I am not sure why all models are not having both values and values-ml-staging-codfw. I suggest to leave them as they are for now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:00:53] (03CR) 10Hnowlan: [C:03+2] changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132) (owner: 10Hnowlan) [15:01:03] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245) (owner: 10Clément Goubert) [15:01:56] (03CR) 10Bunnypranav: [C:03+1] ruwikisource: add Автор (Author) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: 10Anzx) [15:03:01] (03Merged) 10jenkins-bot: changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132) (owner: 10Hnowlan) [15:03:38] !log temporary depooling cp7001 to restart haproxy (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1150690) (T392219) [15:03:41] (03PS3) 10Tiziano Fogli: monitoring services: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 [15:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:42] T392219: Map ISPs in Maxmind db, used in turnilo/superset, to use in requestctl rule - https://phabricator.wikimedia.org/T392219 [15:04:21] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [15:05:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.376s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:06:20] (03PS1) 10Clément Goubert: mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) [15:06:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:07:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:04] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:08:30] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:08:34] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [15:08:43] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [15:08:55] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:08:58] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:09:37] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:09:38] great timing ^ :) [15:10:12] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:10:24] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:10:29] (03CR) 10Jelto: "I left some comments in-line." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [15:10:36] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:10:37] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [15:10:42] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:10:51] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:11:13] (03CR) 10Andrea Denisse: "Hi Tiziano, I was wondering if there’s a corresponding Phabricator task for it. It would help me better understand the context and the goa" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (owner: 10Tiziano Fogli) [15:11:50] (03CR) 10Jgiannelos: [C:03+2] pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [15:12:06] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:12:32] (03PS2) 10Clément Goubert: mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) [15:12:46] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:13:35] (03Merged) 10jenkins-bot: pcs: Default to use http client with service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: 10Jgiannelos) [15:15:12] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:15:23] (03CR) 10Andrea Denisse: pdb_resource_exporter: add puppetdb resource exporter to puppedb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (owner: 10Tiziano Fogli) [15:15:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply 
[15:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:12] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) (owner: 10Clément Goubert) [15:26:25] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10857144 (10andrea.denisse) 05Resolved→03Open p:05Medium→03Unbreak! Hi, this is doesn't seem to be resolved as we're still getting email notifications as of today: Degr... [15:27:17] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on cloudcephmon1004 - https://phabricator.wikimedia.org/T392458#10857149 (10andrea.denisse) [15:28:34] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2111.codfw.wmnet [15:28:45] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:29:08] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2111.codfw.wmnet [15:29:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [15:29:59] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1530). Please do the needful. 
[15:31:39] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:32:18] (03PS1) 10Fabfur: Revert "hiera: enable maxmind isp lookup on cp7001" [puppet] - 10https://gerrit.wikimedia.org/r/1150714 [15:32:45] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:39:38] (03PS2) 10Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) [15:39:49] (03PS3) 10Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) [15:40:12] (03CR) 10Fabfur: [C:03+2] Revert "hiera: enable maxmind isp lookup on cp7001" [puppet] - 10https://gerrit.wikimedia.org/r/1150714 (owner: 10Fabfur) [15:41:51] (03PS2) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T351637) [15:41:51] (03PS11) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [15:41:51] (03PS1) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 [15:42:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:43:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76427 and previous config saved to /var/cache/conftool/dbconfig/20250526-154304-fceratto.json [15:43:24] (03CR) 10CI reject: [V:04-1] wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (owner: 10FNegri) [15:44:47] !log jgiannelos@deploy1003 helmfile [staging] 
START helmfile.d/services/mobileapps: apply [15:45:09] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:45:43] !log volans@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cirrussearch2111.codfw.wmnet [15:46:21] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2111.codfw.wmnet [15:46:25] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:46:58] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:47:07] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:47:12] (03PS1) 10Elukey: profile::maps: add default privileges for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565) [15:47:13] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:47:43] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:48:20] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:48:28] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:49:01] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:49:35] 06SRE, 06Traffic-Icebox, 07Wikimedia-Performance-recommendation: Investigate using RFC 7838 Alternate Services to better optimize edge connections - https://phabricator.wikimedia.org/T208242#10857203 (10ssingh) [15:49:37] 06SRE, 06Traffic-Icebox, 07HTTPS, 07Wikimedia-Performance-recommendation: Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034#10857202 (10ssingh) [15:49:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76428 and previous 
config saved to /var/cache/conftool/dbconfig/20250526-154939-fceratto.json [15:51:14] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:53:00] (03PS1) 10Clément Goubert: mw::maintenance::wikidata: Alert wikidata [puppet] - 10https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543) [15:54:30] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543) (owner: 10Clément Goubert) [15:57:07] (03PS1) 10Fabfur: hiera: re-enable maxmind lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) [15:57:42] (03CR) 10Vgutierrez: [C:03+1] hiera: re-enable maxmind lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [15:58:15] (03CR) 10CI reject: [V:04-1] hiera: re-enable maxmind lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [15:59:30] (03PS2) 10Fabfur: hiera: re-enable maxmind lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) [15:59:40] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::wikidata: Alert wikidata [puppet] - 10https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543) (owner: 10Clément Goubert) [15:59:47] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - 10https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) (owner: 10Clément Goubert) [15:59:48] (03PS6) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 
(https://phabricator.wikimedia.org/T264670) [16:01:03] (03CR) 10Fabfur: [C:03+2] hiera: re-enable maxmind lookup on cp7001 [puppet] - 10https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [16:01:14] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:03:08] (03PS3) 10Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145828 [16:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P76429 and previous config saved to /var/cache/conftool/dbconfig/20250526-160447-fceratto.json [16:06:09] (03PS3) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T351637) [16:06:09] (03PS2) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 [16:06:10] (03PS12) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [16:07:01] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:07:29] (03CR) 10CI reject: [V:04-1] wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (owner: 10FNegri) [16:07:46] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:08:10] (03CR) 10Jgiannelos: [C:03+2] 
pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [16:09:49] (03Merged) 10jenkins-bot: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [16:09:51] (03PS1) 10Alexandros Kosiaris: Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122) [16:10:02] (03PS1) 10Fabfur: cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219) [16:10:22] (03CR) 10Clément Goubert: [C:03+1] Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122) (owner: 10Alexandros Kosiaris) [16:10:38] (03CR) 10Vgutierrez: [C:03+1] cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [16:11:29] 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10857252 (10Clement_Goubert) 05Open→03Resolved All scripts now have alerting, and log to logstash. 
[16:11:31] (03CR) 10Alexandros Kosiaris: [C:03+2] Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122) (owner: 10Alexandros Kosiaris) [16:11:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:13:12] (03CR) 10Fabfur: [C:03+2] cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [16:13:25] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [16:13:59] (03PS1) 10Volans: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) [16:14:47] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:14:53] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:15:22] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:16:13] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10857271 (10Volans) Thank Brian, I've upgraded the firmware of `cirrussearch2111` with the above patch, it's all back to you. The only th... 
[16:16:29] (03CR) 10Volans: [C:04-1] "One thing still to fix, see https://phabricator.wikimedia.org/T394543#10857271" [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans) [16:18:33] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:25] (03PS1) 10JMeybohm: sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086) [16:19:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P76430 and previous config saved to /var/cache/conftool/dbconfig/20250526-161955-fceratto.json [16:20:15] (03PS4) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266) [16:20:16] (03PS3) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) [16:20:18] (03PS13) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) [16:21:34] (03CR) 10CI reject: [V:04-1] wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [16:23:38] (03PS5) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266) [16:23:38] (03PS4) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 
10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) [16:23:38] (03PS14) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) [16:25:59] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [16:34:51] (03PS5) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) [16:34:51] (03PS15) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) [16:35:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76431 and previous config saved to /var/cache/conftool/dbconfig/20250526-163502-fceratto.json [16:35:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [16:35:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76432 and previous config saved to /var/cache/conftool/dbconfig/20250526-163530-fceratto.json [16:35:48] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5681/co" [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [16:38:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149822 
(https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [16:38:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza) [16:43:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76433 and previous config saved to /var/cache/conftool/dbconfig/20250526-164324-fceratto.json [16:47:16] (03PS1) 10Jgiannelos: pcs: Disable changeprop rule for summary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) [16:49:00] (03CR) 10Alexandros Kosiaris: [C:04-1] "Awesome, but please bump the chart version in Chart.yaml as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [16:58:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P76434 and previous config saved to /var/cache/conftool/dbconfig/20250526-165831-fceratto.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1700) [17:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1700). 
[17:10:42] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 68747 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [17:13:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P76435 and previous config saved to /var/cache/conftool/dbconfig/20250526-171338-fceratto.json [17:23:27] (03CR) 10Majavah: [C:03+1] wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [17:23:38] (03CR) 10Majavah: [C:04-1] wikireplicas: split db config from maintain-views (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [17:25:10] (03CR) 10Majavah: wikireplicas scripts: setup pytest, add first test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [17:25:10] (03PS3) 10AOkoth: doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469 [17:27:56] (03CR) 10AOkoth: doc: swap doc1003 with doc1004 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth) [17:28:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76436 and previous config saved to /var/cache/conftool/dbconfig/20250526-172844-fceratto.json [17:29:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:29:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76437 and previous config saved to 
/var/cache/conftool/dbconfig/20250526-172912-fceratto.json [17:33:34] (03CR) 10AOkoth: [C:03+2] doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth) [17:37:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76438 and previous config saved to /var/cache/conftool/dbconfig/20250526-173700-fceratto.json [17:44:58] (03CR) 10Michael Große: [C:03+1] "This change is a good idea to try out early, so that we can learn whether it impacts the stability of the overall system." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [17:52:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P76439 and previous config saved to /var/cache/conftool/dbconfig/20250526-175207-fceratto.json [18:07:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P76440 and previous config saved to /var/cache/conftool/dbconfig/20250526-180714-fceratto.json [18:22:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76441 and previous config saved to /var/cache/conftool/dbconfig/20250526-182221-fceratto.json [18:22:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:22:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T395241)', diff saved to https://phabricator.wikimedia.org/P76442 and previous config saved to /var/cache/conftool/dbconfig/20250526-182247-fceratto.json [18:28:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76443 and previous config saved to /var/cache/conftool/dbconfig/20250526-182817-fceratto.json [18:31:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:35:10] anzx: sorry, I had to leave. Please reschedule the patch. [18:43:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P76444 and previous config saved to /var/cache/conftool/dbconfig/20250526-184325-fceratto.json [18:43:53] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm [18:55:35] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on cloudcephmon1004 - https://phabricator.wikimedia.org/T392458#10857617 (10andrea.denisse) 05Open→03Resolved Thanks to Taavi for adding /dev/sdb back to software raid. https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?or... 
[18:58:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P76445 and previous config saved to /var/cache/conftool/dbconfig/20250526-185832-fceratto.json
[19:13:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T395241)', diff saved to https://phabricator.wikimedia.org/P76446 and previous config saved to /var/cache/conftool/dbconfig/20250526-191341-fceratto.json
[19:14:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance
[19:16:55] (PS1) Effie Mouzeli: validating-admission-policies: fix typo in Makefile [deployment-charts] - https://gerrit.wikimedia.org/r/1150749
[19:16:59] !log Add Grafana v12.0.1 to reprepro for bookworm - T395098
[19:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:03] T395098: Upgrade to Grafana 12.0.1 - https://phabricator.wikimedia.org/T395098
[19:17:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:19:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance
[19:19:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76447 and previous config saved to /var/cache/conftool/dbconfig/20250526-191912-fceratto.json
[19:19:28] !log Upgrading Grafana to v12.0.1 on grafana1002 - T395098
[19:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:06] (PS1) Andrea Denisse: Revert "grafana: Disable dashboard sync to ugprade Grafana version" [puppet] -
https://gerrit.wikimedia.org/r/1150750
[19:24:44] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[19:26:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76448 and previous config saved to /var/cache/conftool/dbconfig/20250526-192600-fceratto.json
[19:26:30] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[19:26:32] (PS6) Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[19:27:17] (CR) Andrea Denisse: [C:+2] Revert "grafana: Disable dashboard sync to ugprade Grafana version" [puppet] - https://gerrit.wikimedia.org/r/1150750 (owner: Andrea Denisse)
[19:27:45] (PS7) Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[19:28:33] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[19:29:14] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[19:29:49] (PS2) Effie Mouzeli: kubernetes::deployment_server: add new mw-experimental release [puppet] - https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994)
[19:30:28] !log Re-enable sync between grafana hosts - T395098
[19:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:34] T395098: Upgrade to Grafana 12.0.1 - https://phabricator.wikimedia.org/T395098
[19:33:07] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doc1004.eqiad.wmnet with reason: Bookworm
[19:37:07] (CR) Effie Mouzeli: kubernetes::deployment_server: add new mw-experimental release (1
comment) [puppet] - https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli)
[19:38:28] (PS4) AOkoth: wmnet: map os-reports to aux ingress [dns] - https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:38:42] (PS5) AOkoth: wmnet: map os-reports to aux ingress [dns] - https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:40:45] (PS6) AOkoth: wmnet: map os-reports to aux ingress [dns] - https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:41:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P76449 and previous config saved to /var/cache/conftool/dbconfig/20250526-194107-fceratto.json
[19:41:17] (PS7) AOkoth: wmnet: map os-reports to aux ingress [dns] - https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:41:30] (CR) AOkoth: "Done" [dns] - https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[19:56:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P76450 and previous config saved to /var/cache/conftool/dbconfig/20250526-195614-fceratto.json
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2000).
[20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:17] hi
[20:00:38] anyone around who could deploy for me?
[20:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76451 and previous config saved to /var/cache/conftool/dbconfig/20250526-201123-fceratto.json
[20:11:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: Maintenance
[20:11:47] i still need a deployer if anyone has a couple of minutes
[20:11:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76452 and previous config saved to /var/cache/conftool/dbconfig/20250526-201150-fceratto.json
[20:16:48] (PS1) Effie Mouzeli: mw-experimental: initial commit (vanilla) [deployment-charts] - https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994)
[20:18:33] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76453 and previous config saved to /var/cache/conftool/dbconfig/20250526-201840-fceratto.json
[20:19:14] (PS1) Effie Mouzeli: mw-experimental: create new service [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[20:22:26] (PS2) Effie Mouzeli: profile::kubernetes::deployment_server add usernames for
mw-experimental [puppet] - https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:25:04] (PS3) Effie Mouzeli: profile::kubernetes::deployment_server::services: add usernames for mw-experimental [puppet] - https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:25:11] (PS3) Effie Mouzeli: profile::kubernetes::deployment_server: add new mw-experimental release [puppet] - https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994)
[20:26:13] (CR) CI reject: [V:-1] profile::kubernetes::deployment_server::services: add usernames for mw-experimental [puppet] - https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli)
[20:28:36] (PS2) Effie Mouzeli: mw-experimental: create new service [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[20:30:47] (PS4) Effie Mouzeli: profile::kubernetes::deployment_server: add usernames for mw-experimental [puppet] - https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:33:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P76454 and previous config saved to /var/cache/conftool/dbconfig/20250526-203348-fceratto.json
[20:48:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P76455 and previous config saved to /var/cache/conftool/dbconfig/20250526-204855-fceratto.json
[20:54:41] (PS1) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[20:55:56] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769
(https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[20:59:00] (PS2) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2100).
[21:00:11] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:04:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76456 and previous config saved to /var/cache/conftool/dbconfig/20250526-210402-fceratto.json
[21:04:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2228.codfw.wmnet with reason: Maintenance
[21:04:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[21:04:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76457 and previous config saved to /var/cache/conftool/dbconfig/20250526-210445-fceratto.json
[21:07:08] (PS3) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:08:26] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:08:40] (PS4) Effie Mouzeli: mediawiki: mount mediawiki via
hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:10:03] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:10:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:11:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76458 and previous config saved to /var/cache/conftool/dbconfig/20250526-211127-fceratto.json
[21:12:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.857 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53940 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:15:14] (PS5) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:16:30] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:20:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1149822 (https://phabricator.wikimedia.org/T392251) (owner:
Gergő Tisza)
[21:20:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: Gergő Tisza)
[21:23:59] (PS6) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:25:11] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:26:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P76459 and previous config saved to /var/cache/conftool/dbconfig/20250526-212634-fceratto.json
[21:29:43] (PS7) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:30:50] (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:31:23] (PS8) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:41:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P76460 and previous config saved to /var/cache/conftool/dbconfig/20250526-214142-fceratto.json
[21:52:12] (PS2) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[21:52:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: Ahonc)
[21:56:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76461 and previous config saved to /var/cache/conftool/dbconfig/20250526-215649-fceratto.json
[22:06:16] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83557MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[22:20:34] (PS3) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[22:23:54] (PS4) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[22:45:36] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Mon 23 Jun 2025 10:10:00 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2300)
[23:17:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:30:09] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150145 (owner: TrainBranchBot)
[23:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797
[23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797 (owner: TrainBranchBot)
[23:54:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797 (owner: TrainBranchBot)