[00:38:32] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107977
[00:38:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107977 (owner: 10TrainBranchBot)
[00:39:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428545 (10phaultfinder)
[00:58:12] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1107977 (owner: 10TrainBranchBot)
[01:00:33] <wikibugs>	 (03PS1) 10RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1107978
[01:01:43] <wikibugs>	 (03CR) 10RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1107978 (owner: 10RLazarus)
[01:02:30] <wikibugs>	 (03PS2) 10RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1107978
[01:04:25] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[01:06:25] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[01:08:16] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107979
[01:08:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107979 (owner: 10TrainBranchBot)
[01:11:44] <wikibugs>	 (03PS1) 10Tim Starling: beta: Use update.php --doshared [puppet] - 10https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389)
[01:17:43] <jinxer-wm>	 RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:17:44] <jinxer-wm>	 RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:26:35] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1107979 (owner: 10TrainBranchBot)
[01:32:43] <jinxer-wm>	 FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:32:43] <jinxer-wm>	 FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:34:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428595 (10phaultfinder)
[01:37:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:37:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:49:29] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9eae1c6b0620bf2717712e6e09f6cb7c9e8e498e8a653dc81587b095933cac48/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:09:29] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:24:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428609 (10phaultfinder)
[02:26:56] <wikibugs>	 (03PS3) 10Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:26:56] <wikibugs>	 (03PS4) 10Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:27:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pdns recursor: support injecting extra hostnames into recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[02:27:36] <wikibugs>	 (03CR) 10Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[02:28:31] <wikibugs>	 (03PS4) 10Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:28:31] <wikibugs>	 (03PS5) 10Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:29:18] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[02:33:56] <wikibugs>	 (03PS6) 10Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:35:03] <wikibugs>	 (03PS5) 10Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:35:03] <wikibugs>	 (03PS7) 10Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:35:17] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:57] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[05:37:43] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[06:09:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: 10Hubaishan)
[06:23:38] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2116 [puppet] - 10https://gerrit.wikimedia.org/r/1107982 (https://phabricator.wikimedia.org/T362950)
[06:23:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2116.codfw.wmnet
[06:24:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db2116 [puppet] - 10https://gerrit.wikimedia.org/r/1107982 (https://phabricator.wikimedia.org/T362950) (owner: 10Marostegui)
[06:28:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:32:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2116.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:32:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2116.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:32:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:32:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2116.codfw.wmnet
[06:33:08] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10428671 (10Marostegui) a:05Marostegui→03None
[06:33:21] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10428676 (10Marostegui) This is ready for #dc-ops
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0700)
[07:21:14] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2115 [puppet] - 10https://gerrit.wikimedia.org/r/1107984 (https://phabricator.wikimedia.org/T362949)
[07:21:53] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2115 [puppet] - 10https://gerrit.wikimedia.org/r/1107984 (https://phabricator.wikimedia.org/T362949) (owner: 10Marostegui)
[07:23:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2115 from dbctl T362949', diff saved to https://phabricator.wikimedia.org/P71770 and previous config saved to /var/cache/conftool/dbconfig/20250103-072349-marostegui.json
[07:23:53] <stashbot>	 T362949: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949
[07:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:05] <wikibugs>	 (03PS1) 10Marostegui: db2115: No candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1107985 (https://phabricator.wikimedia.org/T362949)
[07:26:35] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2115: No candidate master [puppet] - 10https://gerrit.wikimedia.org/r/1107985 (https://phabricator.wikimedia.org/T362949) (owner: 10Marostegui)
[07:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance
[07:36:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0800)
[08:16:57] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Offboarding mhernandez [puppet] - 10https://gerrit.wikimedia.org/r/1108034
[08:24:05] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove db2115 [puppet] - 10https://gerrit.wikimedia.org/r/1108035 (https://phabricator.wikimedia.org/T362949)
[08:24:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2115.codfw.wmnet
[08:29:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:33:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2115.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:35:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2115.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:35:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:35:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2115.codfw.wmnet
[08:46:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[08:46:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Also add the cluster SSH key in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[08:47:49] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove db2115 [puppet] - 10https://gerrit.wikimedia.org/r/1108035 (https://phabricator.wikimedia.org/T362949) (owner: 10Marostegui)
[08:49:44] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428731 (10Marostegui)
[08:49:47] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428735 (10Marostegui) Ready for #dc-ops
[08:50:00] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428736 (10Marostegui) a:05Marostegui→03None
[08:51:11] <wikibugs>	 (03CR) 10Tacsipacsi: "No problem." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi)
[09:00:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:02:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2236 to upgrade to 10.11.10 T378940', diff saved to https://phabricator.wikimedia.org/P71771 and previous config saved to /var/cache/conftool/dbconfig/20250103-090215-marostegui.json
[09:02:18] <stashbot>	 T378940: Compile and package MariaDB 10.11.10 and 10.6.20 - https://phabricator.wikimedia.org/T378940
[09:02:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2236.codfw.wmnet with reason: upgrade
[09:02:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2236.codfw.wmnet with reason: upgrade
[09:03:05] <marostegui>	 !log Upgrade db2236 to 10.11.10 s4 codfw dbmaint T378940
[09:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71772 and previous config saved to /var/cache/conftool/dbconfig/20250103-090541-root.json
[09:08:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[09:16:28] <wikibugs>	 (03PS3) 10Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:20:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71773 and previous config saved to /var/cache/conftool/dbconfig/20250103-092046-root.json
[09:23:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[09:30:40] <wikibugs>	 (03PS4) 10Muehlenhoff: Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:31:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[09:32:49] <wikibugs>	 (03PS5) 10Muehlenhoff: Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:33:45] <wikibugs>	 (03PS1) 10Marostegui: production-m3.sql.erb: Add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643)
[09:34:50] <wikibugs>	 (03CR) 10Marostegui: "This is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643) (owner: 10Marostegui)
[09:35:26] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[09:35:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71774 and previous config saved to /var/cache/conftool/dbconfig/20250103-093552-root.json
[09:37:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[09:37:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[09:40:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] production-m3.sql.erb: Add missing grants [puppet] - 10https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643) (owner: 10Marostegui)
[09:50:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71776 and previous config saved to /var/cache/conftool/dbconfig/20250103-095057-root.json
[10:06:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71777 and previous config saved to /var/cache/conftool/dbconfig/20250103-100603-root.json
[10:18:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] "Thanks, merging" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1107978 (owner: 10RLazarus)
[10:18:48] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1107978 (owner: 10RLazarus)
[10:30:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] "This should work. Let me know if you need help building the image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[10:35:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2021 T381848', diff saved to https://phabricator.wikimedia.org/P71778 and previous config saved to /var/cache/conftool/dbconfig/20250103-103513-marostegui.json
[10:35:16] <stashbot>	 T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[10:36:04] <wikibugs>	 (03PS1) 10Marostegui: es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108039 (https://phabricator.wikimedia.org/T381848)
[10:37:17] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108039 (https://phabricator.wikimedia.org/T381848) (owner: 10Marostegui)
[11:04:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es4 codfw master', diff saved to https://phabricator.wikimedia.org/P71779 and previous config saved to /var/cache/conftool/dbconfig/20250103-110440-marostegui.json
[11:05:01] <marostegui>	 !log Switchover es4 codfw master to es2043 dbmaint T381848
[11:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:07] <stashbot>	 T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[11:09:48] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update es4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1108040 (https://phabricator.wikimedia.org/T381848)
[11:11:48] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Update es4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1108040 (https://phabricator.wikimedia.org/T381848) (owner: 10Marostegui)
[11:12:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2020 T382945', diff saved to https://phabricator.wikimedia.org/P71780 and previous config saved to /var/cache/conftool/dbconfig/20250103-111255-marostegui.json
[11:12:59] <stashbot>	 T382945: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945
[11:17:39] <wikibugs>	 (03PS2) 10Tim Starling: beta: Use update.php --doshared [puppet] - 10https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389)
[11:17:42] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] beta: Use update.php --doshared [puppet] - 10https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389) (owner: 10Tim Starling)
[11:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1108041
[11:39:43] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108041 (owner: 10Muehlenhoff)
[11:42:49] <wikibugs>	 (03PS2) 10Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1108041
[11:44:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428944 (10phaultfinder)
[11:48:04] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108041 (owner: 10Muehlenhoff)
[11:54:07] <wikibugs>	 (03PS3) 10Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1108041
[12:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0800)
[12:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T1200).
[12:00:20] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108041 (owner: 10Muehlenhoff)
[12:01:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2022 T381848', diff saved to https://phabricator.wikimedia.org/P71781 and previous config saved to /var/cache/conftool/dbconfig/20250103-120132-marostegui.json
[12:01:36] <stashbot>	 T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[12:02:56] <wikibugs>	 (03PS1) 10Marostegui: es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108054 (https://phabricator.wikimedia.org/T382946)
[12:03:23] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108054 (https://phabricator.wikimedia.org/T382946) (owner: 10Marostegui)
[12:10:25] <moritzm>	 !log renewed internal Ganeti certs in eqsin (would have expired in two days) T382873
[12:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] <stashbot>	 T382873: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873
[12:30:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10428993 (10MoritzMuehlenhoff)
[12:35:08] <wikibugs>	 (03PS1) 10Btullis: Add conftool-data for dbstore hosts to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947)
[12:36:18] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "You'd also need to fill them up in dbctl once they are up. I'd suggest not to merge this on a Friday afternoon." [puppet] - 10https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[12:39:03] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "There is still quite a chance to accidentally flow traffic to dbstore (and vice versa, the dumps reading from other prod replicas). I sugg" [puppet] - 10https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[12:39:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Let's discuss that on the task." [puppet] - 10https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[12:46:19] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "Setting to -1 for now, while we discuss. We will also need to configure the firewalls on the dbstore100[7-9] servers, accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis)
[13:37:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[13:37:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[13:39:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429075 (10phaultfinder)
[14:35:24] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10429165 (10MoritzMuehlenhoff)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:45] <wikibugs>	 (03PS1) 10CDanis: otelcol: drop service-runner healthchecks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750)
[14:45:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[14:47:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "(dns1004 NOOP looks good that is and therefore the prod DNS hosts)" [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[14:53:39] <wikibugs>	 (03PS1) 10Muehlenhoff: crm: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1108088
[15:02:30] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:06] <wikibugs>	 (03PS1) 10Btullis: Add caps to allow ceph-csi-cephfs to work with the dumps filesystem [puppet] - 10https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490)
[15:03:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] pdns recursor: support injecting extra hostnames into recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: 10Andrew Bogott)
[15:03:58] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4730/co" [puppet] - 10https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:59] <wikibugs>	 (03PS6) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:07:30] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:08:17] <wikibugs>	 (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri)
[15:09:46] <wikibugs>	 (03PS7) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:10:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri)
[15:12:34] <wikibugs>	 (03CR) 10Ssingh: "Nice work, looking good! I think we are on the right path but I have some questions from our last discussion (see in-line):" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[15:13:14] <wikibugs>	 (03PS1) 10Btullis: Add a storageclass for the dumps file system [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490)
[15:18:28] <wikibugs>	 (03PS8) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:18:29] <wikibugs>	 (03PS1) 10FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091
[15:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:27:54] <wikibugs>	 (03PS9) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:27:55] <wikibugs>	 (03PS2) 10FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091
[15:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:39] <wikibugs>	 (03PS10) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:28:39] <wikibugs>	 (03PS3) 10FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091
[15:29:16] <wikibugs>	 (03PS4) 10FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T379378)
[15:30:24] <wikibugs>	 (03PS5) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T379378)
[15:38:45] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108088 (owner: 10Muehlenhoff)
[15:40:36] <wikibugs>	 (03CR) 10Andrew Bogott: "I am not 100% sure that this change is meaningful, but here's what I'm thinking. The status quo is:" [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott)
[15:43:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870)
[15:43:33] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870)
[15:47:05] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870) (owner: 10Muehlenhoff)
[16:02:20] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "That's fair! I guess we could try to replicate the behaviour at least on the pdns-rec side to see what works out the best for us but you a" [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott)
[16:04:35] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097
[16:08:44] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:09:44] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:17:29] <wikibugs>	 (03CR) 10Pppery: [C:03+1] "Yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi)
[16:25:38] <wikibugs>	 10SRE-swift-storage, 07Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802#10429423 (10Koavf) Note that there is now a local list at Commons for very large files: https://commons.wikimedia.org/wiki/Commons:Very_large_files_to_upload. This can be use...
[16:38:13] <wikibugs>	 (03CR) 10Thcipriani: "check experimental" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani)
[16:39:54] <wikibugs>	 (03CR) 10Thcipriani: [C:03+2] Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani)
[16:40:30] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani)
[16:55:20] <logmsgbot>	 !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2003.wikimedia.org only)
[16:55:28] <logmsgbot>	 !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2003.wikimedia.org only) (duration: 00m 08s)
[16:56:08] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052)
[16:56:20] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos)
[16:57:50] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos)
[17:00:15] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[17:04:37] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10429460 (10VRiley-WMF) Hey @MatthewVernon I have opened a ticket with supermicro and will work with them on getting a new drive ASAP (supermicro ticket is #ANT-782-48642)
[17:06:38] <logmsgbot>	 !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2002.wikimedia.org only)
[17:06:46] <logmsgbot>	 !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2002.wikimedia.org only) (duration: 00m 08s)
[17:14:11] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] kubectl: image with kubectl installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[17:14:16] <wikibugs>	 (03CR) 10AOkoth: [V:03+2 C:03+2] kubectl: image with kubectl installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[17:14:36] <wikibugs>	 (03PS1) 10CDanis: draft: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750)
[17:15:19] <logmsgbot>	 !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit1003.wikimedia.org only)
[17:15:29] <logmsgbot>	 !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit1003.wikimedia.org only) (duration: 00m 10s)
[17:25:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429488 (10phaultfinder)
[17:27:17] <wikibugs>	 (03PS1) 10AOkoth: kubectl: add changelog for kubectl image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1108107
[17:29:02] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] kubectl: add changelog for kubectl image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1108107 (owner: 10AOkoth)
[17:29:12] <wikibugs>	 (03CR) 10AOkoth: [V:03+2 C:03+2] kubectl: add changelog for kubectl image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1108107 (owner: 10AOkoth)
[17:37:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[17:37:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[17:55:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429561 (10phaultfinder)
[18:13:03] <wikibugs>	 (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri)
[18:16:55] <wikibugs>	 (03PS11) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961)
[18:16:57] <wikibugs>	 (03PS6) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961)
[18:32:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "if everything just changes from existing "parking" then not much can go wrong here. lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall)
[18:38:20] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] devtools: fix hiera after host renaming [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar)
[18:43:48] <icinga-wm>	 PROBLEM - Host doc2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:44:39] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "I assume the puppet run _would_ be fixed now but there is another unrelated issue which I just filed for other instances as https://phabri" [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar)
[18:45:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:46:50] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doc2002.codfw.wmnet with reason: Disk Change
[18:47:05] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc2002.codfw.wmnet with reason: Disk Change
[18:54:04] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:07:19] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724#10429718 (10Arnoldokoth) o/ I ran into this issue trying to access the console for doc2002 and I think it's...
[19:09:19] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on doc2002.codfw.wmnet with reason: Disk Change
[19:09:22] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on doc2002.codfw.wmnet with reason: Disk Change
[19:17:39] <wikibugs>	 (03PS1) 10AOkoth: docs: alert only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964)
[19:19:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429741 (10phaultfinder)
[19:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429744 (10phaultfinder)
[19:37:37] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+1] "This looks fine to me. The CDN should be the only entry point of access currently." [puppet] - 10https://gerrit.wikimedia.org/r/1108088 (owner: 10Muehlenhoff)
[19:49:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381635#10429776 (10VRiley-WMF) Hi, is there a specific time that would be preferred for us to take a look at this and swap the module if needed?
[20:16:56] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s7 on db1171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:19:24] <marostegui>	 I'll fix that
[20:21:56] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s7 on db1171 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429842 (10phaultfinder)
[20:40:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429907 (10phaultfinder)
[21:27:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429991 (10phaultfinder)
[21:32:30] <wikibugs>	 (03CR) 10Catrope: [C:03+1] draft: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis)
[21:37:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[21:37:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[21:37:49] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall)
[21:38:08] <wikibugs>	 (03PS2) 10BCornwall: Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667)
[21:41:12] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall)
[21:54:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10430019 (10phaultfinder)
[22:10:00] <wikibugs>	 (03PS5) 10Krinkle: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite)
[22:10:03] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite)
[22:11:56] <wikibugs>	 (03PS1) 10Jdlrobson: Move logic for type infering to server [extensions/Chart] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1108130 (https://phabricator.wikimedia.org/T382042)
[22:13:20] <wikibugs>	 (03PS3) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385)
[22:14:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10430065 (10phaultfinder)
[22:38:11] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Remove `wgVectorStickyHeader` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728)
[22:45:17] <jinxer-wm>	 FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:01:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:11:42] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:27] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed