[00:03:33] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1150157
[00:08:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1150157 (owner: TrainBranchBot)
[00:09:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.192s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:10:33] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:18:33] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:35:28] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1150157 (owner: TrainBranchBot)
[00:47:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.395s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:52:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.166s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:56:15] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:57:03] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:13] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:53] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:58:03] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.377s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:20:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.377s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:22:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.742s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:52:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:55:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.598s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:58:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:33] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:15:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:18:33] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[04:03:33] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:33] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:20:45] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:24:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDo
[04:30:43] <icinga-wm>	 PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 63967 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops
[04:34:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[04:39:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Hurricane Electric (2001:7f8:1::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:02:53] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554#10855640 (Stevemunene) Open→Resolved
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:09] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1177.eqiad.wmnet
[05:15:17] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1177.eqiad.wmnet
[05:16:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855642 (Stevemunene) Thanks @Jclark-ctr   For the host I  [x] verified the VDs [x] created the journal node [x] ran...
[05:17:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855643 (Stevemunene)
[05:18:35] <wikibugs>	 (PS1) Stevemunene: Revert "hdfs: add an-worker1177 to in retup role" [puppet] - https://gerrit.wikimedia.org/r/1150223
[05:33:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10855645 (Papaul) ` UPDATE HAS BEEN ADDED:   Dear Juniper Networks Customer,  Your replacement part associated with RMA R200568010 Item # 100 has been successfu...
[05:58:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:55] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:15:26] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855647 (Volans) The user created in Netbox has username `sdeckelmann` while the user in LDAP has UID `sdeckelmann-wmf`...
[06:28:58] <moritzm>	 !log installing intel-microcode security updates
[06:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:56:47] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1149805 (https://phabricator.wikimedia.org/T394603) (owner: Bunnypranav)
[06:58:57] <moritzm>	 !log installing Linux 6.1.140 packages
[06:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T0700).
[07:00:05] <jouncebot>	 bunnypranav: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:26] <bunnypranav>	 o/
[07:01:44] <wikibugs>	 (CR) Elukey: [C:+1] Default the Kerberos role to nftables [puppet] - https://gerrit.wikimedia.org/r/1149542 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[07:02:42] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1149736 (https://phabricator.wikimedia.org/T393579) (owner: Dzahn)
[07:03:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[07:04:01] <wikibugs>	 (CR) Brouberol: [C:+1] "yes please!" [puppet] - https://gerrit.wikimedia.org/r/1150223 (owner: Stevemunene)
[07:06:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:55] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:08:10] <wikibugs>	 (CR) Stevemunene: [C:+2] Revert "hdfs: add an-worker1177 to in retup role" [puppet] - https://gerrit.wikimedia.org/r/1150223 (owner: Stevemunene)
[07:17:54] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet
[07:18:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855685 (ops-monitoring-bot) Host an-worker1177.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[07:18:33] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:20:34] <wikibugs>	 (PS1) Elukey: role::ml_k8s::worker: move ml-serve1003 to containerd [puppet] - https://gerrit.wikimedia.org/r/1150494 (https://phabricator.wikimedia.org/T387854)
[07:20:53] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix auto restart for alertmanager-irc-relay [puppet] - https://gerrit.wikimedia.org/r/1149544 (owner: Muehlenhoff)
[07:22:59] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1003.eqiad.wmnet
[07:23:00] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1003.eqiad.wmnet
[07:23:22] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1003.eqiad.wmnet
[07:23:51] <wikibugs>	 (CR) Elukey: [C:+2] role::ml_k8s::worker: move ml-serve1003 to containerd [puppet] - https://gerrit.wikimedia.org/r/1150494 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:25:42] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1177.eqiad.wmnet
[07:26:46] <wikibugs>	 (CR) Elukey: [C:+1] profile::prometheus::k8s: drop terminated pod targets [puppet] - https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: Scott French)
[07:28:26] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1003.eqiad.wmnet
[07:32:33] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bookworm
[07:37:08] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:37:10] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:32] <wikibugs>	 (PS16) Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034)
[07:40:32] <wikibugs>	 (CR) Arnaudb: "Following up April 30th Gerrit split brain, there are now:" [cookbooks] - https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: Arnaudb)
[07:45:42] <wikibugs>	 (PS3) Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[07:46:54] <wikibugs>	 (PS4) Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[07:52:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855739 (Stevemunene) Host has successfully rejoined the cluster {F60548076}
[07:52:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855740 (Stevemunene)
[07:52:39] <wikibugs>	 (PS5) Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[07:52:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10855741 (Stevemunene) Open→Resolved
[07:57:05] <wikibugs>	 (PS9) Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219)
[07:57:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:58:51] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855743 (SLyngshede-WMF) The account is correctly linked in the social_auth tabel, not sure how though:   ` >>> u = Use...
[07:59:57] <wikibugs>	 (PS10) Volans: homer: make private repo support multiple peers [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380)
[07:59:58] <wikibugs>	 (PS5) Volans: git::clone: fix support for different remote name [puppet] - https://gerrit.wikimedia.org/r/1148267
[08:01:17] <wikibugs>	 (CR) CI reject: [V:-1] git::clone: fix support for different remote name [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[08:03:33] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:20] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855775 (SLyngshede-WMF) @SDeckelmann-WMF can you try something, not sure if that will work, but I'll like to get a con...
[08:05:27] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10855776 (SLyngshede-WMF) p:Triage→High a:SLyngshede-WMF
[08:06:28] <wikibugs>	 (CR) Ayounsi: [C:+1] definitions: Add port for x3 wiki replica backend [homer/public] - https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: Majavah)
[08:07:23] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10855779 (tappof) Ok, thank you @RobH. I’ll add some Pint directives to silence alerts for missing metrics in the DCs that do...
[08:08:10] <wikibugs>	 (CR) Majavah: [C:+2] definitions: Add port for x3 wiki replica backend [homer/public] - https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: Majavah)
[08:08:41] <wikibugs>	 (Merged) jenkins-bot: definitions: Add port for x3 wiki replica backend [homer/public] - https://gerrit.wikimedia.org/r/1149606 (https://phabricator.wikimedia.org/T390954) (owner: Majavah)
[08:10:57] <wikibugs>	 (PS4) Tiziano Fogli: pdus: add pro4x breaker alerts [alerts] - https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231)
[08:11:58] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync
[08:11:59] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync
[08:12:28] <wikibugs>	 (CR) Volans: "Just passing by and left some spicerack-specific suggestions." [cookbooks] - https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: Arnaudb)
[08:12:50] <wikibugs>	 (PS6) Volans: git::clone: fix support for different remote name [puppet] - https://gerrit.wikimedia.org/r/1148267
[08:12:58] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[08:13:16] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy: use maxmind lua bindings to lookup client ISP [puppet] - https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[08:13:58] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] pdus: add pro4x breaker alerts [alerts] - https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[08:14:56] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync
[08:15:02] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync
[08:15:57] <wikibugs>	 (Merged) jenkins-bot: pdus: add pro4x breaker alerts [alerts] - https://gerrit.wikimedia.org/r/1149343 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[08:17:29] <wikibugs>	 (CR) Brouberol: "I think you're right. Let's abandon it." [deployment-charts] - https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[08:17:51] <wikibugs>	 (Abandoned) Brouberol: airflow: relax timeout after which DAGs are deleted [deployment-charts] - https://gerrit.wikimedia.org/r/1149013 (https://phabricator.wikimedia.org/T394459) (owner: Brouberol)
[08:18:33] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:19:03] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync
[08:19:09] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync
[08:21:10] <wikibugs>	 (PS1) Slyngshede: P:idp always use Wikimedia theme [puppet] - https://gerrit.wikimedia.org/r/1150581
[08:22:56] <Mvolz>	 jouncebot: nowandnext
[08:22:56] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 37 minute(s)
[08:22:56] <jouncebot>	 In 1 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1000)
[08:23:33] <jinxer-wm>	 RESOLVED: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:23:53] <Mvolz>	 anybody object if I use this window to do a citoid deploy? can't make this week's scheduled one. 
[08:24:43] <elukey>	 +1 from my side, it doesn't seem to be a problem. Anything risky to deploy? 
[08:29:16] <wikibugs>	 (PS1) Brouberol: airflow: never start an instance with 0 DAG parsing processes [deployment-charts] - https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998)
[08:30:25] <Mvolz>	 Nope
[08:30:29] <Mvolz>	 Nothing risky
[08:31:08] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147106 (owner: PipelineBot)
[08:31:41] <Mvolz>	 (famous last words)
[08:32:50] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1147106 (owner: PipelineBot)
[08:34:19] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1003.eqiad.wmnet with OS bookworm
[08:34:44] <wikibugs>	 (PS1) Vgutierrez: hiera: Depool lvs1013 before switching to katran [puppet] - https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228)
[08:34:45] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1003.eqiad.wmnet with OS bookworm
[08:34:49] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[08:35:13] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[08:35:48] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez)
[08:35:49] <wikibugs>	 (CR) Volans: "I went for the simplification option, not passing remote_name to git::clone and overriding everything in the .git/config file." [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[08:41:08] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[08:41:35] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[08:42:43] <wikibugs>	 (PS7) Volans: git::clone: remote remote_name parameter [puppet] - https://gerrit.wikimedia.org/r/1148267
[08:43:50] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[08:44:17] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[08:45:13] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998) (owner: Brouberol)
[08:46:11] <wikibugs>	 (PS8) Volans: git::clone: remove remote_name parameter [puppet] - https://gerrit.wikimedia.org/r/1148267
[08:46:35] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: never start an instance with 0 DAG parsing processes [deployment-charts] - https://gerrit.wikimedia.org/r/1150586 (https://phabricator.wikimedia.org/T393998) (owner: Brouberol)
[08:46:36] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[08:48:26] <logmsgbot>	 !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1135.eqiad.wmnet with reason: Investigate MegaRAID failure
[08:48:29] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10855909 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b04912d5-a1f9-43e3-8127-02a5d51fd650) set by stevemunene@cumin10...
[08:48:38] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10855911 (Stevemunene) Hi @jcrespo apologies for the delay, this has been downtimed
[08:48:52] <wikibugs>	 (CR) Vgutierrez: [C:+2] systemd::timer: Allow setting FixedRandomDelay [puppet] - https://gerrit.wikimedia.org/r/1149647 (https://phabricator.wikimedia.org/T395001) (owner: Vgutierrez)
[08:50:39] <wikibugs>	 (CR) Vgutierrez: [C:+2] systemd::timer::job: Allow setting accuracy and fixed_random_delay [puppet] - https://gerrit.wikimedia.org/r/1149648 (https://phabricator.wikimedia.org/T395001) (owner: Vgutierrez)
[08:53:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[08:56:02] <wikibugs>	 (CR) Elukey: [C:+1] homer: make private repo support multiple peers [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[08:57:12] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: host reimage
[08:59:26] <wikibugs>	 (PS1) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219)
[09:02:34] <wikibugs>	 (CR) Fabfur: external_cloud_vendors: fix Azure prefix fetch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: Fabfur)
[09:03:41] <wikibugs>	 (CR) Vgutierrez: [C:-1] "this opens the door for third-parties spoofing the value of X-Requestctl-ISP" [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[09:04:50] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10855999 (jcrespo) Assuming this is a hw failure, remember to notify dc-ops ( https://phabricator.wikimedia.org/maniphest/task/edit/form/55...
[09:07:01] <wikibugs>	 (PS2) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219)
[09:07:33] <wikibugs>	 (CR) Fabfur: "yeah, nice catch, fixed thanks" [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[09:09:10] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:25] <wikibugs>	 (CR) Vgutierrez: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[09:14:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1003.eqiad.wmnet with OS bookworm
[09:14:48] <wikibugs>	 (CR) Vgutierrez: [C:+2] varnish: Deploy edge uniques experiment fetcher [puppet] - https://gerrit.wikimedia.org/r/1149651 (https://phabricator.wikimedia.org/T395001) (owner: Vgutierrez)
[09:17:04] <wikibugs>	 (CR) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[09:21:57] <wikibugs>	 (CR) Volans: "PCC results: https://puppet-compiler.wmflabs.org/output/1148267/4002/" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[09:25:05] <wikibugs>	 (PS5) Fabfur: external_cloud_vendors: fix Azure prefix fetch [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127)
[09:25:30] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: disable hardcoded networkpolicy in favor of the service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1149639 (https://phabricator.wikimedia.org/T392668) (owner: Brouberol)
[09:27:14] <wikibugs>	 (CR) Fabfur: external_cloud_vendors: fix Azure prefix fetch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: Fabfur)
[09:27:37] <wikibugs>	 (PS1) Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106)
[09:27:39] <wikibugs>	 (PS1) Vgutierrez: varnish: Fix wmfuniq_experiment_fetcher [puppet] - https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001)
[09:28:46] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:29:41] <wikibugs>	 (CR) Vgutierrez: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[09:31:23] <wikibugs>	 (PS1) Clément Goubert: mw::maintenance::purge_securepoll: Fix dblist [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542)
[09:33:18] <wikibugs>	 (CR) Hnowlan: [C:+1] mw::maintenance::purge_securepoll: Fix dblist [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:33:38] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3080 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:38] <wikibugs>	 (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001) (owner: Vgutierrez)
[09:33:38] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3066 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:38] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp6009 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:38] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7014 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:38] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7002 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:33:39] <wikibugs>	 (PS2) Hnowlan: mw::periodic_job: clean up migration_title parameter [puppet] - https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555)
[09:33:39] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555) (owner: Hnowlan)
[09:33:40] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:34:47] <wikibugs>	 (CR) Vgutierrez: [C:+2] varnish: Fix wmfuniq_experiment_fetcher [puppet] - https://gerrit.wikimedia.org/r/1150596 (https://phabricator.wikimedia.org/T395001) (owner: Vgutierrez)
[09:34:50] <wikibugs>	 (PS2) Brouberol: deployment_server: deploy the mediawiki-dumps-legacy scap target [puppet] - https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786)
[09:34:58] <wikibugs>	 (PS2) Clément Goubert: mw::maintenance::purge_securepoll: Fix dblist [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542)
[09:35:00] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:35:06] <wikibugs>	 (CR) Brouberol: "Thanks for the review Scott!" [puppet] - https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: Brouberol)
[09:37:54] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp1114 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:37:54] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp1105 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:37:54] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp3077 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:37:56] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7003 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:37:56] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7015 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:37:56] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp7009 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:40:41] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::purge_securepoll: Fix dblist [puppet] - https://gerrit.wikimedia.org/r/1150598 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:42:12] <icinga-wm>	 PROBLEM - Check unit status of wmfuniq-experiment-fetcher on cp6012 is CRITICAL: CRITICAL: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:42:16] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1149629 (https://phabricator.wikimedia.org/T388542) (owner: Clément Goubert)
[09:42:37] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10856107 (akosiaris) 1 single bucket, at least at the beginning. Reading https://distribution.github.io/distribution/about/configuration/, I don't think the softw...
[09:43:06] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10856108 (akosiaris) > If you want to do some testing, I could set you up with a test account on apus.  That would be swell!
[09:43:38] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3080 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:43:38] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3066 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:43:38] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp6009 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:43:39] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7002 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:43:39] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7014 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:45:29] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1003.eqiad.wmnet
[09:45:30] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1003.eqiad.wmnet
[09:47:41] <wikibugs>	 (PS1) Elukey: role::ml_k8s::worker: upgrade ml-serve1004 to containerd [puppet] - https://gerrit.wikimedia.org/r/1150604 (https://phabricator.wikimedia.org/T387854)
[09:47:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1004.eqiad.wmnet
[09:47:54] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp1114 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:54] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp1105 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:54] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp3077 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:56] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7015 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:56] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7003 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:56] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp7009 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:51:21] <wikibugs>	 (CR) Elukey: [C:+2] role::ml_k8s::worker: upgrade ml-serve1004 to containerd [puppet] - https://gerrit.wikimedia.org/r/1150604 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[09:52:12] <icinga-wm>	 RECOVERY - Check unit status of wmfuniq-experiment-fetcher on cp6012 is OK: OK: Status of the systemd unit wmfuniq-experiment-fetcher https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:52:52] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve1004.eqiad.wmnet
[09:53:55] <wikibugs>	 (PS1) Jgiannelos: pcs: Default to use http client with service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896)
[09:54:38] <wikibugs>	 (CR) Jgiannelos: "This is the patch from last week after some more testing" [deployment-charts] - https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: Jgiannelos)
[09:55:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] external_cloud_vendors: fix Azure prefix fetch [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: Fabfur)
[09:58:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:53] <wikibugs>	 (CR) Vgutierrez: [C:+1] external_cloud_vendors: fix Azure prefix fetch [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: Fabfur)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1000)
[10:04:26] <logmsgbot>	 elukey@cumin1002 reimage (PID 2164230) is awaiting input
[10:08:30] <wikibugs>	 (Abandoned) Hnowlan: mw::periodic_job: clean up migration_title parameter [puppet] - https://gerrit.wikimedia.org/r/1150594 (https://phabricator.wikimedia.org/T341555) (owner: Hnowlan)
[10:08:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[10:08:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm
[10:09:14] <wikibugs>	 (CR) Hnowlan: [C:+1] Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[10:13:12] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:13:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:13:31] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[10:22:13] <wikibugs>	 (CR) Fabfur: [C:+2] external_cloud_vendors: fix Azure prefix fetch [puppet] - https://gerrit.wikimedia.org/r/1149693 (https://phabricator.wikimedia.org/T395127) (owner: Fabfur)
[10:22:36] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1004.eqiad.wmnet with OS bookworm
[10:23:02] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm
[10:32:53] <wikibugs>	 (PS1) Michael Große: SpecialHomepageLogger: Populate email state even with StartModule disabled [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017)
[10:33:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: Michael Große)
[10:34:03] <wikibugs>	 (PS2) Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106)
[10:39:11] <wikibugs>	 (PS3) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219)
[10:39:33] <wikibugs>	 (CR) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[10:42:26] <wikibugs>	 (CR) FNegri: [C:+1] openstack: wmcs-bastionless: Fix condition [puppet] - https://gerrit.wikimedia.org/r/1149811 (https://phabricator.wikimedia.org/T379550) (owner: Majavah)
[10:43:49] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[10:44:11] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:44:33] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:46:40] <wikibugs>	 (CR) Majavah: [C:+2] openstack: wmcs-bastionless: Fix condition [puppet] - https://gerrit.wikimedia.org/r/1149811 (https://phabricator.wikimedia.org/T379550) (owner: Majavah)
[10:51:14] <wikibugs>	 (CR) Majavah: [C:+1] Remove unused option to enable host-based auth [puppet] - https://gerrit.wikimedia.org/r/1149371 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[10:56:00] <wikibugs>	 (PS1) Hnowlan: alertmanager: adjust phab project to security-team rather than security tag [puppet] - https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531)
[10:58:29] <wikibugs>	 (CR) JMeybohm: [C:+1] profile::prometheus::k8s: drop terminated pod targets [puppet] - https://gerrit.wikimedia.org/r/1149505 (https://phabricator.wikimedia.org/T395052) (owner: Scott French)
[10:58:55] <wikibugs>	 (PS2) Vgutierrez: hiera: Depool lvs1013 before switching to katran [puppet] - https://gerrit.wikimedia.org/r/1150587 (https://phabricator.wikimedia.org/T395228)
[10:58:55] <wikibugs>	 (PS1) Vgutierrez: hiera: Use katran in lvs1013 [puppet] - https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228)
[10:59:17] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:59:39] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[11:04:59] <wikibugs>	 (PS11) Volans: homer: make private repo support multiple peers [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380)
[11:04:59] <wikibugs>	 (PS9) Volans: git::clone: remove remote_name parameter [puppet] - https://gerrit.wikimedia.org/r/1148267
[11:05:16] <wikibugs>	 (CR) Volans: homer: make private repo support multiple peers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[11:05:45] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[11:08:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[11:09:04] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:09:16] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet
[11:11:57] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:11] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:14:48] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:14:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[11:14:57] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:15:04] <logmsgbot>	 elukey@cumin1002 reimage (PID 2165345) is awaiting input
[11:15:11] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:15:25] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet
[11:15:56] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:16:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[11:16:14] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet
[11:18:38] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:20:10] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet
[11:20:13] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[11:21:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[11:21:55] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1150581 (owner: Slyngshede)
[11:22:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[11:22:57] <wikibugs>	 (PS1) Samwilson: InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975)
[11:24:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) (owner: Samwilson)
[11:25:35] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[11:25:45] <icinga-wm>	 PROBLEM - Host vrts2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:27:13] <icinga-wm>	 RECOVERY - Host vrts2002 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[11:27:20] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:27:34] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idp always use Wikimedia theme [puppet] - https://gerrit.wikimedia.org/r/1150581 (owner: Slyngshede)
[11:28:35] <wikibugs>	 (CR) Cparle: [C:+1] InitialiseSettings: wgTemplateDataEnableDiscovery on plwiki and arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150629 (https://phabricator.wikimedia.org/T377975) (owner: Samwilson)
[11:31:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[11:32:20] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service vrts2002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:33:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2006.codfw.wmnet
[11:34:57] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Update multiple ML models on experimental staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865)
[11:37:10] <wikibugs>	 (CR) Gkyziridis: [C:+1] "LGTM! Thnx for working on this." [deployment-charts] - https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[11:38:53] <wikibugs>	 (CR) Gkyziridis: [C:+2] ml-services: Update multiple ML models on experimental staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1150630 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[11:39:47] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1005.eqiad.wmnet with reason: update
[11:40:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2006.codfw.wmnet
[11:45:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2005.codfw.wmnet
[11:45:46] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on people1004.eqiad.wmnet with reason: update
[11:46:01] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:46:51] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lists2001.wikimedia.org with reason: update
[11:47:51] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Default the Kerberos role to nftables [puppet] - https://gerrit.wikimedia.org/r/1149542 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[11:48:25] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit2003.wikimedia.org with reason: update
[11:49:19] <wikibugs>	 (PS1) Majavah: openstack: drain-hypervisor: Ignore instances being deleted [puppet] - https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244)
[11:50:05] <wikibugs>	 (CR) JMeybohm: [C:-1] "As said on IRC: I don't really like the name being so generic. Maybe you can find something that makes it more clear that this is rule is " [deployment-charts] - https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: Effie Mouzeli)
[11:50:34] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad2002.codfw.wmnet with reason: update
[11:52:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2005.codfw.wmnet
[11:52:31] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1004.eqiad.wmnet with reason: update
[11:52:55] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc2003.codfw.wmnet with reason: update
[11:53:33] <wikibugs>	 (PS1) Clément Goubert: mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245)
[11:54:00] <wikibugs>	 (PS2) Majavah: openstack: drain-hypervisor: Ignore instances being deleted [puppet] - https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244)
[11:54:59] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[11:55:28] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1004.eqiad.wmnet with reason: update
[11:56:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2004.codfw.wmnet
[11:57:30] <wikibugs>	 sre-alert-triage, Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10856437 (Stevemunene) Checking the battery details as per https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#...
[11:57:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:57:57] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on aphlict2001.codfw.wmnet with reason: update
[11:58:02] <moritzm>	 !log installing postgresql-15 security updates
[11:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:48] <wikibugs>	 (PS1) Majavah: openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244)
[11:59:32] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:59:57] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[12:00:01] <wikibugs>	 (CR) CI reject: [V:-1] openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) (owner: Majavah)
[12:00:18] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on aphlict1002.eqiad.wmnet with reason: update
[12:00:25] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856445 (KCVelaga_WMF) > The deployment group brings a lot of power with it, though. I'm not sure that all of our possible Airflow developers would...
[12:02:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2004.codfw.wmnet
[12:03:19] <wikibugs>	 (PS1) Cathal Mooney: Add entry for cagefive2* hosts in site.pp [puppet] - https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021)
[12:03:33] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:10] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002"
[12:04:16] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002"
[12:04:16] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:04:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host cagefive2001.codfw.wmnet
[12:05:12] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on releases2003.codfw.wmnet with reason: update
[12:05:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cagefive2001.codfw.wmnet
[12:06:27] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for gcc-12 [puppet] - https://gerrit.wikimedia.org/r/1150643
[12:07:26] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[12:07:49] <wikibugs>	 (PS1) Hnowlan: changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132)
[12:07:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2003.codfw.wmnet
[12:08:20] <wikibugs>	 (CR) Ayounsi: "Why not name them sretest like the others? Unless there is a good reason I'd prefer we keep our standards" [puppet] - https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) (owner: Cathal Mooney)
[12:09:51] <logmsgbot>	 cmooney@cumin1002 reimage (PID 2179552) is awaiting input
[12:11:11] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002"
[12:11:17] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002"
[12:11:17] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:11:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1191 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:12:03] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache cagefive2001.mgmt.codfw.wmnet on all recursors
[12:12:06] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cagefive2001.mgmt.codfw.wmnet on all recursors
[12:12:14] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] pcs: Default to use http client with service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: Jgiannelos)
[12:13:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2003.codfw.wmnet
[12:15:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2002.codfw.wmnet
[12:16:03] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856492 (brouberol) >  We could just create a group called `airflow-deployers` and reference all members of the `analytics-privatedata-users` group...
[12:17:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hint for gcc-12 [puppet] - https://gerrit.wikimedia.org/r/1150643 (owner: Muehlenhoff)
[12:19:31] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1191 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:21:28] <wikibugs>	 (PS2) Majavah: openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244)
[12:21:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2002.codfw.wmnet
[12:22:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10856524 (cmooney) @Jhancock.wm hey I'm having some problems reaching cagefive2001 over management.  The IP it is assigned is not respo...
[12:22:29] <wikibugs>	 (PS1) Brouberol: admin/data: create an airflow-deployers group [puppet] - https://gerrit.wikimedia.org/r/1150654 (https://phabricator.wikimedia.org/T395125)
[12:22:31] <wikibugs>	 (PS1) Brouberol: airflow-dev: make kubeconfig group-owned by the airflow-deployers group [puppet] - https://gerrit.wikimedia.org/r/1150655 (https://phabricator.wikimedia.org/T395125)
[12:23:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet
[12:25:45] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856529 (brouberol) a:brouberol
[12:25:49] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856531 (brouberol)
[12:25:49] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE (2025.05.24 - 2025.06.13), Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10856532 (brouberol) Open→In progress
[12:26:41] <wikibugs>	 (Abandoned) Brouberol: admin/data: add kcvelaga to the deployment group [puppet] - https://gerrit.wikimedia.org/r/1149666 (https://phabricator.wikimedia.org/T393998) (owner: Brouberol)
[12:29:44] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
[12:30:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2001.codfw.wmnet
[12:50:27] <wikibugs>	 (CR) Volans: [C:+2] homer: make private repo support multiple peers [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[12:50:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: Anzx)
[12:52:19] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[12:52:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm
[12:52:41] <logmsgbot>	 !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:52:45] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1004.eqiad.wmnet with OS bookworm
[12:52:55] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:53:33] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7001.magru.wmnet
[12:53:42] <wikibugs>	 (CR) Majavah: [C:+2] openstack: drain-hypervisor: Ignore instances being deleted [puppet] - https://gerrit.wikimedia.org/r/1150636 (https://phabricator.wikimedia.org/T395244) (owner: Majavah)
[12:53:51] <wikibugs>	 (CR) Majavah: [C:+2] openstack: drain-hypervisor: Catch and retry 409 Conflict errors [puppet] - https://gerrit.wikimedia.org/r/1150638 (https://phabricator.wikimedia.org/T395244) (owner: Majavah)
[12:54:19] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:54:37] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:55:17] <wikibugs>	 (PS1) Muehlenhoff: Add maps-bookworm alias [puppet] - https://gerrit.wikimedia.org/r/1150670 (https://phabricator.wikimedia.org/T381565)
[12:55:46] <kart_>	 !log Update Recommendation-API to 2025-05-26-081343-production (T394441, T395026, T306508, T391230)
[12:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:53] <stashbot>	 T394441: Rec API tests failing intermittently - https://phabricator.wikimedia.org/T394441
[12:55:54] <stashbot>	 T395026: Rec APi not picking up new collection Wiki99/LGBT+ - https://phabricator.wikimedia.org/T395026
[12:55:54] <stashbot>	 T306508: ContentTranslation doesn't know that an article already exists in the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T306508
[12:55:54] <stashbot>	 T391230: Unified Dashboard: Support country-level filtering under custom suggestions view - https://phabricator.wikimedia.org/T391230
[12:56:54] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1004.eqiad.wmnet with OS bookworm
[12:57:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7001.magru.wmnet
[12:57:53] <wikibugs>	 (PS4) JMeybohm: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - https://gerrit.wikimedia.org/r/1135046 (owner: Kamila Součková)
[12:58:47] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:58:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet
[12:59:04] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135046 (owner: Kamila Součková)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1300).
[13:00:05] <jouncebot>	 isaranto, tgr, MichaelG_WMF, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <isaranto>	 o/
[13:00:13] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 3.040 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:00:15] <MichaelG_WMF>	 o/
[13:01:20] <anzx>	 o/
[13:01:53] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:02:39] <isaranto>	 shall I proceed with the deployment?
[13:03:08] <MichaelG_WMF>	 My change cannot be tested in the UI (as far as I can tell), but a *lot of* errors in logstash should go away once it has been deployed. Is it possible to only see errors from "mwdebug" on logstash?
[13:03:19] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:03:24] <MichaelG_WMF>	 @isaranto If you can deploy, that would be great!
[13:03:25] <tgr>	 I'll be back in the second half of the hour
[13:03:49] <tgr>	 MichaelG_WMF: yes, use the mwdebug dashboard
[13:04:06] <isaranto>	 I'm going to proceed with my backport first -- I'm in a meeting with folks to QA it. MichaelG_WMF: then I can proceed with your deployment
[13:04:15] <isaranto>	 deploying!
[13:04:29] <tgr>	 or filter the normal dashboard by hostname, if you are using one of the non-k8s debug hosts
[13:04:31] <wikibugs>	 (CR) Raymond Ndibe: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[13:04:39] <MichaelG_WMF>	 isaranto: thank you, that works for me!
[13:04:43] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:04:49] <wikibugs>	 (PS10) Volans: git::clone: remove remote_name parameter [puppet] - https://gerrit.wikimedia.org/r/1148267
[13:04:49] <wikibugs>	 (PS1) Volans: homer: fix private repository config [puppet] - https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380)
[13:04:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: Ilias Sarantopoulos)
[13:04:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet
[13:05:09] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:05:09] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.016 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:05:27] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:06:11] <icinga-wm>	 PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:06:26] <wikibugs>	 (PS5) Alexandros Kosiaris: WIP: adding mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/1149633 (owner: Effie Mouzeli)
[13:06:29] <wikibugs>	 (CR) Jelto: "I forgot about the already existing PTR, so yes that makes sense and we don't need another one! But let's wait for a clear decision in T39" [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[13:06:57] <MichaelG_WMF>	 tgr: isaranto: thanks, I have confirmed that I can trigger the error my change fixes on the mwdebug dashboard. That way I should be able to test that it works.
[13:07:13] <icinga-wm>	 RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[13:07:22] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[13:07:51] <wikibugs>	 (CR) JMeybohm: [C:+2] Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - https://gerrit.wikimedia.org/r/1135046 (owner: Kamila Součková)
[13:07:59] <wikibugs>	 (Merged) jenkins-bot: ores-extension: enable ores extention UI in idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1149407 (https://phabricator.wikimedia.org/T382171) (owner: Ilias Sarantopoulos)
[13:08:04] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[13:08:17] <logmsgbot>	 !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1149407|ores-extension: enable ores extention UI in idwiki (T382171)]]
[13:08:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet
[13:08:21] <stashbot>	 T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171
[13:08:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add maps-bookworm alias [puppet] - https://gerrit.wikimedia.org/r/1150670 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:08:33] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:10] <wikibugs>	 (CR) Volans: [C:+2] homer: fix private repository config [puppet] - https://gerrit.wikimedia.org/r/1150675 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[13:09:30] <volans>	 jayme: if you got my commit too feel free to merge it ;)
[13:09:56] <moritzm>	 volans: I'll merge your patch along
[13:10:06] <volans>	 thanks a lot :D
[13:12:39] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:31] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms
[13:13:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet
[13:14:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet
[13:14:59] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2007-dev.codfw.wmnet
[13:18:04] <wikibugs>	 (PS6) Effie Mouzeli: functions-orchestrator: add mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986)
[13:18:16] <wikibugs>	 (CR) Majavah: [C:-1] "This needs to wait until https://phabricator.wikimedia.org/T394337. Right now there's a broken deployment in that namespace in tools and d" [puppet] - https://gerrit.wikimedia.org/r/1149533 (https://phabricator.wikimedia.org/T394276) (owner: Raymond Ndibe)
[13:19:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet
[13:20:48] <wikibugs>	 (PS2) Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312)
[13:21:16] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet
[13:21:36] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudnet2008-dev.codfw.wmnet
[13:22:03] <wikibugs>	 (PS2) Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1150666
[13:22:15] <logmsgbot>	 !log gkyziridis@deploy1003 isaranto, gkyziridis: Backport for [[gerrit:1149407|ores-extension: enable ores extention UI in idwiki (T382171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:22:19] <stashbot>	 T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171
[13:23:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet
[13:24:37] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[13:27:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet
[13:28:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2008-dev.codfw.wmnet
[13:28:16] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: host reimage
[13:28:22] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet
[13:29:12] <wikibugs>	 (PS3) Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1150666
[13:29:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet
[13:30:33] <wikibugs>	 (PS3) Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312)
[13:31:38] <wikibugs>	 (Abandoned) Slyngshede: Drop jackson-module-kotlin (experimental) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/809132 (https://phabricator.wikimedia.org/T311235) (owner: Muehlenhoff)
[13:32:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet
[13:32:25] <logmsgbot>	 !log gkyziridis@deploy1003 Sync cancelled.
[13:33:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet
[13:33:29] <wikibugs>	 (CR) Vgutierrez: [C:+1] templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) (owner: Ssingh)
[13:33:31] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet
[13:34:33] <wikibugs>	 (CR) Ssingh: [C:+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) (owner: Ssingh)
[13:34:41] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:34:49] <isaranto>	 we didn't end up syncing my patch because there was an issue with QA. MichaelG_WMF shall I proceed with your patch?
[13:35:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet
[13:35:15] <MichaelG_WMF>	 yes, please!
[13:35:19] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:35:28] <wikibugs>	 (PS1) Majavah: P:openstack: Migrate simple rules to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1150685
[13:35:28] <wikibugs>	 (PS1) Majavah: P:openstack: pdns: Migrate mysql_root ferm service to firewall [puppet] - https://gerrit.wikimedia.org/r/1150686
[13:35:29] <wikibugs>	 (PS1) Majavah: P:openstack: codfw1dev: Migrate Cumin ferm term to firewall [puppet] - https://gerrit.wikimedia.org/r/1150687
[13:35:37] <taavi>	 isaranto: you will need to revert your patch if you did not end up deploying it
[13:35:40] <tgr>	 let's do the rest at the same time, maybe?
[13:35:53] <isaranto>	 taavi: on it!
[13:35:54] <wikibugs>	 (PS4) Effie Mouzeli: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1150666 (https://phabricator.wikimedia.org/T391986)
[13:35:55] <tgr>	 not much hope of fitting them in the hour, otherwise
[13:35:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10856773 (ssingh)
[13:36:35] <wikibugs>	 (CR) CI reject: [V:-1] P:openstack: Migrate simple rules to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1150685 (owner: Majavah)
[13:36:38] <wikibugs>	 (PS1) Ilias Sarantopoulos: Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150688
[13:36:52] <wikibugs>	 (CR) Majavah: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1150685 (owner: Majavah)
[13:36:57] <tgr>	 although I guess nothing important is happening afterwards
[13:37:01] <wikibugs>	 (CR) Gkyziridis: [C:+2] Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150688 (owner: Ilias Sarantopoulos)
[13:37:53] <wikibugs>	 (Merged) jenkins-bot: Revert "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150688 (owner: Ilias Sarantopoulos)
[13:37:57] <claime>	 I have an image change to merge, but it's not urgent, it can wait until y'all are done with scap deployments
[13:38:02] <isaranto>	 ok. just waiting for the revert to be merged and then will deploy Michael's patch
[13:38:05] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet
[13:38:23] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1150685 (owner: Majavah)
[13:38:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet
[13:38:58] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:40:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:40:16] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:40:20] <isaranto>	 I can deploy the next patch(es). Shall I do them all together via Spiderpig?
[13:40:46] <hashar>	 tgr: feel free to extend the window
[13:40:51] <isaranto>	 my revert has already been merged
[13:40:59] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5679/co" [puppet] - https://gerrit.wikimedia.org/r/1150686 (owner: Majavah)
[13:41:17] <wikibugs>	 (CR) Elukey: [C:+2] Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:41:54] <isaranto>	 I started just with MichaelG_WMF 's patch
[13:42:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by isaranto@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: Michael Große)
[13:42:08] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lists1004.wikimedia.org with reason: update
[13:42:10] <MichaelG_WMF>	 thanks!
[13:42:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:33] <wikibugs>	 (PS1) Fabfur: hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219)
[13:43:31] <isaranto>	 jouncebot: nowandnext
[13:43:31] <jouncebot>	 For the next 0 hour(s) and 16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1300)
[13:43:31] <jouncebot>	 In 1 hour(s) and 46 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1530)
[13:44:15] <isaranto>	 tgr: I'll ping you once we're finished with MichaelG_WMF's patch. Sorry for the delay, folks - we were trying to debug some thresholds + UI things and decided to revert to be sure
[13:44:22] <wikibugs>	 (Merged) jenkins-bot: SpecialHomepageLogger: Populate email state even with StartModule disabled [extensions/GrowthExperiments] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1150619 (https://phabricator.wikimedia.org/T394017) (owner: Michael Große)
[13:45:15] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:45:30] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1004.eqiad.wmnet with OS bookworm
[13:45:52] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:46:14] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[13:46:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[13:47:04] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1004.eqiad.wmnet
[13:47:05] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1004.eqiad.wmnet
[13:47:07] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[13:47:41] <isaranto>	 got an error on spiderpig https://spiderpig.wikimedia.org/jobs/97 It says it found an unexpected diff which was caused by the revert
[13:48:06] <wikibugs>	 (PS1) Ayounsi: Interfaces: also alert on frack routers and switches [alerts] - https://gerrit.wikimedia.org/r/1150692 (https://phabricator.wikimedia.org/T388641)
[13:48:10] <isaranto>	 I mean my previous revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1150688
[13:49:13] <isaranto>	 any help? I guess we should either retry or revert MichaelG_WMF's patch as well, as it has already been merged
[13:49:27] <wikibugs>	 (CR) Hashar: [C:+1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148268/12/modules/homer/templates/private-git/config.erb indeed does the magic which" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[13:49:42] <taavi>	 you'll need to manually pull the revert to the deployment server
[13:49:56] <taavi>	 what's Gkyziridis's IRC nick btw?
[13:50:12] <isaranto>	 georgekyz: 
[13:50:18] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync
[13:50:25] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[13:52:17] <wikibugs>	 (CR) Cathal Mooney: "I have no idea tbh, and I agree. I guess we could just rename in netbox?" [puppet] - https://gerrit.wikimedia.org/r/1150642 (https://phabricator.wikimedia.org/T394021) (owner: Cathal Mooney)
[13:52:59] <isaranto>	 taavi: I guess you mean doing a manual revert as described here ? https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Manual_revert
[13:53:16] <taavi>	 no
[13:53:23] <taavi>	 since you already did the change manually in gerrit
[13:53:46] <taavi>	 the procedure you're looking for is https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Fetching_patches
[13:54:04] <taavi>	 also file a scap bug about the error you got
[13:55:34] <isaranto>	 ack and done. thank you
[13:56:05] <isaranto>	 so now shall i retry through spiderpig? 
[13:57:05] <taavi>	 sure
[13:57:54] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:57:56] <logmsgbot>	 !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]]
[13:58:02] <stashbot>	 T394017: '.event' should have required property 'start_email_state' - https://phabricator.wikimedia.org/T394017
[13:58:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:58:48] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 773, active_shards: 1826, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[13:58:48] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:58:49] <tgr>	 you can just tell spiderpig to continue the deploy if the "unexpected diff" is actually something you expected
[13:59:52] <isaranto>	 I ran a new deployment. in the previous one I clicked yes on the prompt to show the diff and then it failed
[14:01:01] <logmsgbot>	 !log isaranto@deploy1003 migr, isaranto: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:01:15] * MichaelG_WMF looks
[14:02:23] <MichaelG_WMF>	 isaranto: I'm not seeing the error anymore. Thank you!
[14:02:32] <isaranto>	 ok, proceeding!
[14:02:36] <logmsgbot>	 !log isaranto@deploy1003 migr, isaranto: Continuing with sync
[14:09:48] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Update docker image tags for ML staging models. [deployment-charts] - https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865)
[14:10:09] <isaranto>	 I found a feature request related to rolling back with spiderpig https://phabricator.wikimedia.org/T394858
[14:12:29] <logmsgbot>	 !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1150619|SpecialHomepageLogger: Populate email state even with StartModule disabled (T394017)]] (duration: 14m 33s)
[14:12:34] <stashbot>	 T394017: '.event' should have required property 'start_email_state' - https://phabricator.wikimedia.org/T394017
[14:12:54] <isaranto>	 MichaelG_WMF: deployed. tgr you can go now
[14:13:35] <wikibugs>	 (CR) Tiziano Fogli: [C:-1] "The `RewriteEngine on` directive is already declared in modules/profile/templates/prometheus/httpd-public.conf.erb, which is used by the p" [puppet] - https://gerrit.wikimedia.org/r/1146973 (owner: Majavah)
[14:13:50] <isaranto>	 or anzx . I'll leave it up2u folks to coordinate
[14:13:54] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10856882 (elukey) p:Triage→Medium
[14:13:59] <isaranto>	 taavi: thanks once again for the help!
[14:15:31] <anzx>	 isaranto: I don't have deployment access, someone else needed to deploy my patch
[14:15:39] <wikibugs>	 (CR) Vgutierrez: [C:+1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:15:49] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:16:36] <wikibugs>	 (CR) Vgutierrez: [C:-1] hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:16:46] <wikibugs>	 (PS1) Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958)
[14:17:07] <wikibugs>	 (CR) Bartosz Wójtowicz: "This patch updates staging image tags for models affected by this pre-commit change:https://gerrit.wikimedia.org/r/c/machinelearning/liftw" [deployment-charts] - https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[14:17:17] <wikibugs>	 (CR) Vgutierrez: [C:-1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:17:24] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:17:57] <wikibugs>	 (CR) Bartosz Wójtowicz: "Making it unresolved" [deployment-charts] - https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[14:18:19] <wikibugs>	 (CR) CI reject: [V:-1] changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: Urbanecm)
[14:19:48] <isaranto>	 anzx: I will have to go in ~10min. tgr, could you deploy it along with your patches?
[14:21:45] <wikibugs>	 (PS1) Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 180s [dns] - https://gerrit.wikimedia.org/r/1150701 (https://phabricator.wikimedia.org/T394312)
[14:22:06] <anzx>	 isaranto: tgr: i can schedule it for tomorrow afternoon if not possible to deploy today 
[14:22:19] <taavi>	 isaranto: I said you should file a scap bug about the crash you got, not that you should file a spiderpig feature request for a feature that'd let you avoid that manual revert in the first place :P
[14:23:53] <isaranto>	 taavi: cool, got it! I'll file that scap bug report. I just mentioned that there is already a feature request that describes why the current process fails and will fix this
[14:24:06] <isaranto>	 * would fix this
[14:24:19] <taavi>	 yeah, just couldn't tell from your message whether you thought that was a duplicate or not
[14:31:42] <tgr>	 I am in a meeting, I can do it a little later if you have time
[14:31:56] <anzx>	 sure
[14:32:58] <claime>	 Am I ok to deploy my image change in the meantime?
[14:36:37] <claime>	 I'll go ahead and do it, should be relatively quick
[14:38:05] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: mediawiki-cli image update - T395245
[14:38:10] <stashbot>	 T395245: Add a flag to the mwscript wrapper to set +e when required - https://phabricator.wikimedia.org/T395245
[14:40:48] <wikibugs>	 (PS8) FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[14:41:09] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1150704
[14:41:38] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:41:42] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable edge uniques in another host per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411)
[14:41:47] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:42:03] <wikibugs>	 (CR) FNegri: wikireplicas scripts: setup pytest, add first test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: FNegri)
[14:42:19] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[14:43:45] <wikibugs>	 (PS9) FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[14:46:55] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: Enable edge uniques in another host per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[14:47:08] <wikibugs>	 (CR) CI reject: [V:-1] wikireplicas scripts: setup pytest, add first test [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: FNegri)
[14:48:19] <wikibugs>	 (PS4) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219)
[14:48:47] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: mediawiki-cli image update - T395245 (duration: 10m 41s)
[14:48:48] <wikibugs>	 (CR) Fabfur: haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:48:53] <stashbot>	 T395245: Add a flag to the mwscript wrapper to set +e when required - https://phabricator.wikimedia.org/T395245
[14:49:10] <wikibugs>	 (PS2) Fabfur: hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219)
[14:49:28] <wikibugs>	 (PS10) FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[14:52:23] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Enable edge uniques in another host per DC and cluster [puppet] - https://gerrit.wikimedia.org/r/1150705 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[14:54:17] <wikibugs>	 (CR) Vgutierrez: [C:+1] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:55:00] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy: do not set X-Requestctl-ISP if maxmind doesn't return value [puppet] - https://gerrit.wikimedia.org/r/1150591 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:55:38] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: enable maxmind isp lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[14:56:14] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:56:45] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:58:33] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:56] <wikibugs>	 (PS10) Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - https://gerrit.wikimedia.org/r/1143600
[14:58:56] <wikibugs>	 (PS1) Tiziano Fogli: monitoring::service: add migration task as parameter [puppet] - https://gerrit.wikimedia.org/r/1150709
[15:00:09] <wikibugs>	 (PS11) Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - https://gerrit.wikimedia.org/r/1143600
[15:00:09] <wikibugs>	 (PS2) Tiziano Fogli: monitoring::service: add migration task as parameter [puppet] - https://gerrit.wikimedia.org/r/1150709
[15:00:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.376s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:00:26] <wikibugs>	 (CR) Hnowlan: [C:+1] mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245) (owner: Clément Goubert)
[15:00:36] <wikibugs>	 (CR) Gkyziridis: [C:+1] "I am not sure why all models are not having both values and values-ml-staging-codfw. I suggest to leave them as they are for now." [deployment-charts] - https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[15:00:53] <wikibugs>	 (CR) Hnowlan: [C:+2] changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132) (owner: Hnowlan)
[15:01:03] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::purge_securepoll: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150637 (https://phabricator.wikimedia.org/T395245) (owner: Clément Goubert)
[15:01:56] <wikibugs>	 (CR) Bunnypranav: [C:+1] ruwikisource: add Автор (Author) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1150665 (https://phabricator.wikimedia.org/T395193) (owner: Anzx)
[15:03:01] <wikibugs>	 (Merged) jenkins-bot: changeprop(-jobqueue): don't log 404s at ERROR level [deployment-charts] - https://gerrit.wikimedia.org/r/1150645 (https://phabricator.wikimedia.org/T395132) (owner: Hnowlan)
[15:03:38] <fabfur>	 !log temporary depooling cp7001 to restart haproxy (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1150690) (T392219) 
[15:03:41] <wikibugs>	 (PS3) Tiziano Fogli: monitoring services: add migration task as parameter [puppet] - https://gerrit.wikimedia.org/r/1150709
[15:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:42] <stashbot>	 T392219: Map ISPs in Maxmind db, used in turnilo/superset, to use in requestctl rule - https://phabricator.wikimedia.org/T392219
[15:04:21] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[15:05:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.376s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[15:06:20] <wikibugs>	 (PS1) Clément Goubert: mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247)
[15:06:41] <jinxer-wm>	 FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:07:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:08:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[15:08:34] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync
[15:08:43] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[15:08:55] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:08:58] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:09:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[15:09:38] <nemo-yiannis>	 great timing ^ :)
[15:10:12] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:10:24] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[15:10:29] <wikibugs>	 (CR) Jelto: "I left some comments in-line." [cookbooks] - https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: Arnaudb)
[15:10:36] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:10:37] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet
[15:10:42] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:10:51] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:11:13] <wikibugs>	 (CR) Andrea Denisse: "Hi Tiziano, I was wondering if there’s a corresponding Phabricator task for it. It would help me better understand the context and the goa" [puppet] - https://gerrit.wikimedia.org/r/1143600 (owner: Tiziano Fogli)
[15:11:50] <wikibugs>	 (CR) Jgiannelos: [C:+2] pcs: Default to use http client with service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: Jgiannelos)
[15:12:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[15:12:32] <wikibugs>	 (PS2) Clément Goubert: mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247)
[15:12:46] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[15:13:35] <wikibugs>	 (Merged) jenkins-bot: pcs: Default to use http client with service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/1150607 (https://phabricator.wikimedia.org/T394896) (owner: Jgiannelos)
[15:15:12] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[15:15:23] <wikibugs>	 (CR) Andrea Denisse: pdb_resource_exporter: add puppetdb resource exporter to puppedb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1143600 (owner: Tiziano Fogli)
[15:15:33] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:17:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:22:12] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) (owner: Clément Goubert)
[15:26:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10857144 (andrea.denisse) Resolved→Open p:Medium→Unbreak! Hi, this is doesn't seem to be resolved as we're still getting email notifications as of today: Degr...
[15:27:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: disk failure (sdb) on cloudcephmon1004 - https://phabricator.wikimedia.org/T392458#10857149 (andrea.denisse)
[15:28:34] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2111.codfw.wmnet
[15:28:45] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:29:08] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2111.codfw.wmnet
[15:29:42] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150690 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[15:29:59] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:30:04] <jouncebot>	 jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1530). Please do the needful.
[15:31:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:32:18] <wikibugs>	 (PS1) Fabfur: Revert "hiera: enable maxmind isp lookup on cp7001" [puppet] - https://gerrit.wikimedia.org/r/1150714
[15:32:45] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:39:38] <wikibugs>	 (PS2) Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958)
[15:39:49] <wikibugs>	 (PS3) Urbanecm: changeprop: Decrease reenqueue_delay for Getting Started notif job [deployment-charts] - https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958)
[15:40:12] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "hiera: enable maxmind isp lookup on cp7001" [puppet] - https://gerrit.wikimedia.org/r/1150714 (owner: Fabfur)
[15:41:51] <wikibugs>	 (PS2) FNegri: wikireplicas: remove dashes from script names [puppet] - https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T351637)
[15:41:51] <wikibugs>	 (PS11) FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[15:41:51] <wikibugs>	 (PS1) FNegri: wikireplicas: split db config from maintain-views [puppet] - https://gerrit.wikimedia.org/r/1150715
[15:42:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:43:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76427 and previous config saved to /var/cache/conftool/dbconfig/20250526-154304-fceratto.json
[15:43:24] <wikibugs>	 (CR) CI reject: [V:-1] wikireplicas: split db config from maintain-views [puppet] - https://gerrit.wikimedia.org/r/1150715 (owner: FNegri)
[15:44:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:45:09] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:45:43] <logmsgbot>	 !log volans@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cirrussearch2111.codfw.wmnet
[15:46:21] <logmsgbot>	 !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2111.codfw.wmnet
[15:46:25] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:46:58] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:47:07] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:47:12] <wikibugs>	 (PS1) Elukey: profile::maps: add default privileges for kartotherian [puppet] - https://gerrit.wikimedia.org/r/1150718 (https://phabricator.wikimedia.org/T381565)
[15:47:13] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:47:43] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:48:20] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:48:28] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:49:01] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:49:35] <wikibugs>	 SRE, Traffic-Icebox, Wikimedia-Performance-recommendation: Investigate using RFC 7838 Alternate Services to better optimize edge connections - https://phabricator.wikimedia.org/T208242#10857203 (ssingh)
[15:49:37] <wikibugs>	 SRE, Traffic-Icebox, HTTPS, Wikimedia-Performance-recommendation: Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034#10857202 (ssingh)
[15:49:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76428 and previous config saved to /var/cache/conftool/dbconfig/20250526-154939-fceratto.json
[15:51:14] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:53:00] <wikibugs>	 (PS1) Clément Goubert: mw::maintenance::wikidata: Alert wikidata [puppet] - https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543)
[15:54:30] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543) (owner: Clément Goubert)
[15:57:07] <wikibugs>	 (PS1) Fabfur: hiera: re-enable maxmind lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219)
[15:57:42] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: re-enable maxmind lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[15:58:15] <wikibugs>	 (CR) CI reject: [V:-1] hiera: re-enable maxmind lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[15:59:30] <wikibugs>	 (PS2) Fabfur: hiera: re-enable maxmind lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219)
[15:59:40] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::wikidata: Alert wikidata [puppet] - https://gerrit.wikimedia.org/r/1150721 (https://phabricator.wikimedia.org/T388543) (owner: Clément Goubert)
[15:59:47] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw::maintenance::growthexperiment: Ignore foreachwiki errors [puppet] - https://gerrit.wikimedia.org/r/1150711 (https://phabricator.wikimedia.org/T395247) (owner: Clément Goubert)
[15:59:48] <wikibugs>	 (PS6) Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670)
[16:01:03] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: re-enable maxmind lookup on cp7001 [puppet] - https://gerrit.wikimedia.org/r/1150724 (https://phabricator.wikimedia.org/T392219) (owner: Fabfur)
[16:01:14] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch2111 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:03:08] <wikibugs>	 (PS3) Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - https://gerrit.wikimedia.org/r/1145828
[16:03:34] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P76429 and previous config saved to /var/cache/conftool/dbconfig/20250526-160447-fceratto.json
[16:06:09] <wikibugs>	 (03PS3) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T351637)
[16:06:09] <wikibugs>	 (03PS2) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715
[16:06:10] <wikibugs>	 (03PS12) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[16:07:01] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[16:07:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (owner: 10FNegri)
[16:07:46] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[16:08:10] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos)
[16:09:49] <wikibugs>	 (03Merged) 10jenkins-bot: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos)
[16:09:51] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122)
[16:10:02] <wikibugs>	 (03PS1) 10Fabfur: cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219)
[16:10:22] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122) (owner: 10Alexandros Kosiaris)
[16:10:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur)
[16:11:29] <wikibugs>	 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10857252 (10Clement_Goubert) 05Open→03Resolved All scripts now have alerting, and log to logstash.
[16:11:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Exlude linkrecommendation from KubernetesContainerReachingMemoryLimit [alerts] - 10https://gerrit.wikimedia.org/r/1150726 (https://phabricator.wikimedia.org/T357122) (owner: 10Alexandros Kosiaris)
[16:11:41] <jinxer-wm>	 RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:13:12] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache: fixed maxmind lua fetcher script [puppet] - 10https://gerrit.wikimedia.org/r/1150727 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur)
[16:13:25] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:13:59] <wikibugs>	 (03PS1) 10Volans: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543)
[16:14:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:14:53] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:15:22] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:16:13] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10857271 (10Volans) Thank Brian, I've upgraded the firmware of `cirrussearch2111` with the above patch, it's all back to you. The only th...
[16:16:29] <wikibugs>	 (03CR) 10Volans: [C:04-1] "One thing still to fix, see https://phabricator.wikimedia.org/T394543#10857271" [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans)
[16:18:33] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:25] <wikibugs>	 (03PS1) 10JMeybohm: sre.k8s.wipe-cluster: Verify that k8s service are up after puppet ran [cookbooks] - 10https://gerrit.wikimedia.org/r/1150729 (https://phabricator.wikimedia.org/T389086)
[16:19:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P76430 and previous config saved to /var/cache/conftool/dbconfig/20250526-161955-fceratto.json
[16:20:15] <wikibugs>	 (03PS4) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266)
[16:20:16] <wikibugs>	 (03PS3) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266)
[16:20:18] <wikibugs>	 (03PS13) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266)
[16:21:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[16:23:38] <wikibugs>	 (03PS5) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266)
[16:23:38] <wikibugs>	 (03PS4) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266)
[16:23:38] <wikibugs>	 (03PS14) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266)
[16:25:59] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[16:34:51] <wikibugs>	 (03PS5) 10FNegri: wikireplicas: split db config from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266)
[16:34:51] <wikibugs>	 (03PS15) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266)
[16:35:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76431 and previous config saved to /var/cache/conftool/dbconfig/20250526-163502-fceratto.json
[16:35:23] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[16:35:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76432 and previous config saved to /var/cache/conftool/dbconfig/20250526-163530-fceratto.json
[16:35:48] <wikibugs>	 (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5681/co" [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[16:38:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149822 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza)
[16:38:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, May 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-2" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza)
[16:43:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76433 and previous config saved to /var/cache/conftool/dbconfig/20250526-164324-fceratto.json
[16:47:16] <wikibugs>	 (03PS1) 10Jgiannelos: pcs: Disable changeprop rule for summary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670)
[16:49:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "Awesome, but please bump the chart version in Chart.yaml as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150731 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos)
[16:58:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P76434 and previous config saved to /var/cache/conftool/dbconfig/20250526-165831-fceratto.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1700)
[17:00:05] <jouncebot>	 ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T1700).
[17:10:42] <icinga-wm>	 PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 68747 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops
[17:13:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P76435 and previous config saved to /var/cache/conftool/dbconfig/20250526-171338-fceratto.json
[17:23:27] <wikibugs>	 (03CR) 10Majavah: [C:03+1] wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[17:23:38] <wikibugs>	 (03CR) 10Majavah: [C:04-1] wikireplicas: split db config from maintain-views (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[17:25:10] <wikibugs>	 (03CR) 10Majavah: wikireplicas scripts: setup pytest, add first test (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[17:25:10] <wikibugs>	 (03PS3) 10AOkoth: doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469
[17:27:56] <wikibugs>	 (03CR) 10AOkoth: doc: swap doc1003 with doc1004 (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth)
[17:28:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T395241)', diff saved to https://phabricator.wikimedia.org/P76436 and previous config saved to /var/cache/conftool/dbconfig/20250526-172844-fceratto.json
[17:29:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[17:29:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76437 and previous config saved to /var/cache/conftool/dbconfig/20250526-172912-fceratto.json
[17:33:34] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] doc: swap doc1003 with doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1149469 (owner: 10AOkoth)
[17:37:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76438 and previous config saved to /var/cache/conftool/dbconfig/20250526-173700-fceratto.json
[17:44:58] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "This change is a good idea to try out early, so that we can learn whether it impacts the stability of the overall system." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150699 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm)
[17:52:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P76439 and previous config saved to /var/cache/conftool/dbconfig/20250526-175207-fceratto.json
[18:07:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P76440 and previous config saved to /var/cache/conftool/dbconfig/20250526-180714-fceratto.json
[18:22:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T395241)', diff saved to https://phabricator.wikimedia.org/P76441 and previous config saved to /var/cache/conftool/dbconfig/20250526-182221-fceratto.json
[18:22:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance
[18:22:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T395241)', diff saved to https://phabricator.wikimedia.org/P76442 and previous config saved to /var/cache/conftool/dbconfig/20250526-182247-fceratto.json
[18:28:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T395241)', diff saved to https://phabricator.wikimedia.org/P76443 and previous config saved to /var/cache/conftool/dbconfig/20250526-182817-fceratto.json
[18:31:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:35:10] <tgr>	 anzx: sorry, I had to leave. Please reschedule the patch.
[18:43:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P76444 and previous config saved to /var/cache/conftool/dbconfig/20250526-184325-fceratto.json
[18:43:53] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm
[18:55:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on cloudcephmon1004 - https://phabricator.wikimedia.org/T392458#10857617 (10andrea.denisse) 05Open→03Resolved Thanks to Taavi for adding /dev/sdb back to software raid. https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?or...
[18:58:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P76445 and previous config saved to /var/cache/conftool/dbconfig/20250526-185832-fceratto.json
[19:13:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T395241)', diff saved to https://phabricator.wikimedia.org/P76446 and previous config saved to /var/cache/conftool/dbconfig/20250526-191341-fceratto.json
[19:14:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance
[19:16:55] <wikibugs>	 (03PS1) 10Effie Mouzeli: validating-admission-policies: fix typo in Makefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150749
[19:16:59] <denisse>	 !log Add Grafana v12.0.1 to reprepro for bookworm - T395098
[19:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:03] <stashbot>	 T395098: Upgrade to Grafana 12.0.1 - https://phabricator.wikimedia.org/T395098
[19:17:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:19:06] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance
[19:19:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76447 and previous config saved to /var/cache/conftool/dbconfig/20250526-191912-fceratto.json
[19:19:28] <denisse>	 !log Upgrading Grafana to v12.0.1 on grafana1002 - T395098
[19:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:06] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "grafana: Disable dashboard sync to ugprade Grafana version" [puppet] - 10https://gerrit.wikimedia.org/r/1150750
[19:24:44] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[19:26:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76448 and previous config saved to /var/cache/conftool/dbconfig/20250526-192600-fceratto.json
[19:26:30] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[19:26:32] <wikibugs>	 (03PS6) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[19:27:17] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] Revert "grafana: Disable dashboard sync to ugprade Grafana version" [puppet] - 10https://gerrit.wikimedia.org/r/1150750 (owner: 10Andrea Denisse)
[19:27:45] <wikibugs>	 (03PS7) 10Effie Mouzeli: admin_ng: add ValidatingAdmissionPolicy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225)
[19:28:33] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[19:29:14] <logmsgbot>	 !log aokoth@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[19:29:49] <wikibugs>	 (03PS2) 10Effie Mouzeli: kubernetes::deployment_server: add new mw-experimental release [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994)
[19:30:28] <denisse>	 !log Re-enable sync between grafana hosts - T395098
[19:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:34] <stashbot>	 T395098: Upgrade to Grafana 12.0.1 - https://phabricator.wikimedia.org/T395098
[19:33:07] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doc1004.eqiad.wmnet with reason: Bookworm
[19:37:07] <wikibugs>	 (03CR) 10Effie Mouzeli: kubernetes::deployment_server: add new mw-experimental release (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[19:38:28] <wikibugs>	 (03PS4) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:38:42] <wikibugs>	 (03PS5) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:40:45] <wikibugs>	 (03PS6) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:41:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P76449 and previous config saved to /var/cache/conftool/dbconfig/20250526-194107-fceratto.json
[19:41:17] <wikibugs>	 (03PS7) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794)
[19:41:30] <wikibugs>	 (03CR) 10AOkoth: "Done" [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[19:56:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P76450 and previous config saved to /var/cache/conftool/dbconfig/20250526-195614-fceratto.json
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2000).
[20:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:17] <MatmaRex>	 hi
[20:00:38] <MatmaRex>	 anyone around who could deploy for me?
[20:03:34] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:11:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T395241)', diff saved to https://phabricator.wikimedia.org/P76451 and previous config saved to /var/cache/conftool/dbconfig/20250526-201123-fceratto.json
[20:11:43] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2223.codfw.wmnet with reason: Maintenance
[20:11:47] <MatmaRex>	 i still need a deployer if anyone has a couple of minutes
[20:11:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76452 and previous config saved to /var/cache/conftool/dbconfig/20250526-201150-fceratto.json
[20:16:48] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-experimental: initial commit (vanilla) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150760 (https://phabricator.wikimedia.org/T276994)
[20:18:33] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76453 and previous config saved to /var/cache/conftool/dbconfig/20250526-201840-fceratto.json
[20:19:14] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-experimental: create new service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[20:22:26] <wikibugs>	 (03PS2) 10Effie Mouzeli: profile::kubernetes::deployment_server add usernames for mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:25:04] <wikibugs>	 (03PS3) 10Effie Mouzeli: profile::kubernetes::deployment_server::services: add usernames for mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:25:11] <wikibugs>	 (03PS3) 10Effie Mouzeli: profile::kubernetes::deployment_server: add new mw-experimental release [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994)
[20:26:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::kubernetes::deployment_server::services: add usernames for mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[20:28:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-experimental: create new service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994)
[20:30:47] <wikibugs>	 (03PS4) 10Effie Mouzeli: profile::kubernetes::deployment_server: add usernames for mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994)
[20:33:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P76454 and previous config saved to /var/cache/conftool/dbconfig/20250526-203348-fceratto.json
[20:48:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P76455 and previous config saved to /var/cache/conftool/dbconfig/20250526-204855-fceratto.json
[20:54:41] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[20:55:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[20:59:00] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2100).
[21:00:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[21:04:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T395241)', diff saved to https://phabricator.wikimedia.org/P76456 and previous config saved to /var/cache/conftool/dbconfig/20250526-210402-fceratto.json
[21:04:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2228.codfw.wmnet with reason: Maintenance
[21:04:39] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[21:04:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76457 and previous config saved to /var/cache/conftool/dbconfig/20250526-210445-fceratto.json
[21:07:08] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:08:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[21:08:40] <wikibugs>	 (03PS4) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:10:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[21:10:28] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:36] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:11:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76458 and previous config saved to /var/cache/conftool/dbconfig/20250526-211127-fceratto.json
[21:12:30] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.857 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53940 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:15:14] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:16:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[21:20:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149822 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza)
[21:20:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1149823 (https://phabricator.wikimedia.org/T392251) (owner: 10Gergő Tisza)
[21:23:59] <wikibugs>	 (03PS6) 10Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:25:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli)
[21:26:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P76459 and previous config saved to /var/cache/conftool/dbconfig/20250526-212634-fceratto.json
[21:29:43] <wikibugs>	 (PS7) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:30:50] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: Effie Mouzeli)
[21:31:23] <wikibugs>	 (PS8) Effie Mouzeli: mediawiki: mount mediawiki via hostPath feature [deployment-charts] - https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284)
[21:41:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P76460 and previous config saved to /var/cache/conftool/dbconfig/20250526-214142-fceratto.json
[21:52:12] <wikibugs>	 (PS2) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[21:52:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285) (owner: Ahonc)
[21:56:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T395241)', diff saved to https://phabricator.wikimedia.org/P76461 and previous config saved to /var/cache/conftool/dbconfig/20250526-215649-fceratto.json
[22:06:16] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 83557MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[22:20:34] <wikibugs>	 (PS3) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[22:23:54] <wikibugs>	 (PS4) Ahonc: Add user group extendedmover to ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1150778 (https://phabricator.wikimedia.org/T395285)
[22:45:36] <icinga-wm>	 RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Mon 23 Jun 2025 10:10:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250526T2300)
[23:17:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:30:09] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150145 (owner: TrainBranchBot)
[23:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797
[23:38:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797 (owner: TrainBranchBot)
[23:54:41] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1150797 (owner: TrainBranchBot)