[00:38:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107977
[00:38:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107977 (owner: TrainBranchBot)
[00:39:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428545 (phaultfinder)
[00:58:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1107977 (owner: TrainBranchBot)
[01:00:33] (PS1) RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1107978
[01:01:43] (CR) RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1107978 (owner: RLazarus)
[01:02:30] (PS2) RLazarus: Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1107978
[01:04:25] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[01:06:25] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[01:08:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107979
[01:08:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107979 (owner: TrainBranchBot)
[01:11:44] (PS1) Tim Starling: beta: Use update.php --doshared [puppet] - https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389)
[01:17:43] RESOLVED: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:17:44] RESOLVED: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:26:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1107979 (owner: TrainBranchBot)
[01:32:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:32:43] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:34:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428595 (phaultfinder)
[01:37:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[01:37:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[01:49:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/9eae1c6b0620bf2717712e6e09f6cb7c9e8e498e8a653dc81587b095933cac48/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:09:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:24:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428609 (phaultfinder)
[02:26:56] (PS3) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:26:56] (PS4) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:27:16] (CR) CI reject: [V:-1] pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[02:27:36] (CR) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[02:28:31] (PS4) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:28:31] (PS5) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:29:18] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[02:33:56] (PS6) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:35:03] (PS5) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[02:35:03] (PS7) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[02:35:17] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:57] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[05:37:43] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[06:09:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: Hubaishan)
[06:23:38] (PS1) Marostegui: mariadb: Decommission db2116 [puppet] - https://gerrit.wikimedia.org/r/1107982 (https://phabricator.wikimedia.org/T362950)
[06:23:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2116.codfw.wmnet
[06:24:32] (CR) Marostegui: [C:+2] mariadb: Decommission db2116 [puppet] - https://gerrit.wikimedia.org/r/1107982 (https://phabricator.wikimedia.org/T362950) (owner: Marostegui)
[06:28:40] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:32:12] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2116.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:32:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2116.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:32:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:32:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2116.codfw.wmnet
[06:33:08] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10428671 (Marostegui) a:Marostegui→None
[06:33:21] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10428676 (Marostegui) This is ready for #dc-ops
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0700)
[07:21:14] (PS1) Marostegui: instances.yaml: Remove db2115 [puppet] - https://gerrit.wikimedia.org/r/1107984 (https://phabricator.wikimedia.org/T362949)
[07:21:53] (CR) Marostegui: [C:+2] instances.yaml: Remove db2115 [puppet] - https://gerrit.wikimedia.org/r/1107984 (https://phabricator.wikimedia.org/T362949) (owner: Marostegui)
[07:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2115 from dbctl T362949', diff saved to https://phabricator.wikimedia.org/P71770 and previous config saved to /var/cache/conftool/dbconfig/20250103-072349-marostegui.json
[07:23:53] T362949: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949
[07:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:26:05] (PS1) Marostegui: db2115: No candidate master [puppet] - https://gerrit.wikimedia.org/r/1107985 (https://phabricator.wikimedia.org/T362949)
[07:26:35] (CR) Marostegui: [C:+2] db2115: No candidate master [puppet] - https://gerrit.wikimedia.org/r/1107985 (https://phabricator.wikimedia.org/T362949) (owner: Marostegui)
[07:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance
[07:36:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1193.eqiad.wmnet with reason: maintenance
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0800)
[08:16:57] (PS1) Slyngshede: data.yaml: Offboarding mhernandez [puppet] - https://gerrit.wikimedia.org/r/1108034
[08:24:05] (PS1) Marostegui: mariadb: Remove db2115 [puppet] - https://gerrit.wikimedia.org/r/1108035 (https://phabricator.wikimedia.org/T362949)
[08:24:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2115.codfw.wmnet
[08:29:25] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:33:23] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2115.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:35:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2115.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:35:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:35:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2115.codfw.wmnet
[08:46:15] (PS1) Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[08:46:36] (CR) CI reject: [V:-1] Also add the cluster SSH key in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[08:47:49] (CR) Marostegui: [C:+2] mariadb: Remove db2115 [puppet] - https://gerrit.wikimedia.org/r/1108035 (https://phabricator.wikimedia.org/T362949) (owner: Marostegui)
[08:49:44] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428731 (Marostegui)
[08:49:47] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428735 (Marostegui) Ready for #dc-ops
[08:50:00] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10428736 (Marostegui) a:Marostegui→None
[08:51:11] (CR) Tacsipacsi: "No problem." [mediawiki-config] - https://gerrit.wikimedia.org/r/1106739 (owner: Tacsipacsi)
[09:00:37] (PS2) Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:02:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2236 to upgrade to 10.11.10 T378940', diff saved to https://phabricator.wikimedia.org/P71771 and previous config saved to /var/cache/conftool/dbconfig/20250103-090215-marostegui.json
[09:02:18] T378940: Compile and package MariaDB 10.11.10 and 10.6.20 - https://phabricator.wikimedia.org/T378940
[09:02:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2236.codfw.wmnet with reason: upgrade
[09:02:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2236.codfw.wmnet with reason: upgrade
[09:03:05] !log Upgrade db2236 to 10.11.10 s4 codfw dbmaint T378940
[09:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71772 and previous config saved to /var/cache/conftool/dbconfig/20250103-090541-root.json
[09:08:38] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[09:16:28] (PS3) Muehlenhoff: Also add the cluster SSH key in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:20:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71773 and previous config saved to /var/cache/conftool/dbconfig/20250103-092046-root.json
[09:23:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[09:30:40] (PS4) Muehlenhoff: Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:31:00] (CR) CI reject: [V:-1] Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[09:32:49] (PS5) Muehlenhoff: Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724)
[09:33:45] (PS1) Marostegui: production-m3.sql.erb: Add missing grants [puppet] - https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643)
[09:34:50] (CR) Marostegui: "This is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643) (owner: Marostegui)
[09:35:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[09:35:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71774 and previous config saved to /var/cache/conftool/dbconfig/20250103-093552-root.json
[09:37:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[09:37:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[09:40:33] (CR) Marostegui: [C:+2] production-m3.sql.erb: Add missing grants [puppet] - https://gerrit.wikimedia.org/r/1108037 (https://phabricator.wikimedia.org/T377643) (owner: Marostegui)
[09:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71776 and previous config saved to /var/cache/conftool/dbconfig/20250103-095057-root.json
[10:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P71777 and previous config saved to /var/cache/conftool/dbconfig/20250103-100603-root.json
[10:18:41] (CR) Muehlenhoff: [C:+2] "Thanks, merging" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1107978 (owner: RLazarus)
[10:18:48] (CR) Muehlenhoff: [V:+2 C:+2] Update file paths for wmf-laptop-sre -> wmf-laptop rename [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1107978 (owner: RLazarus)
[10:30:36] (CR) Alexandros Kosiaris: [C:+1] "This should work. Let me know if you need help building the image" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[10:35:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2021 T381848', diff saved to https://phabricator.wikimedia.org/P71778 and previous config saved to /var/cache/conftool/dbconfig/20250103-103513-marostegui.json
[10:35:16] T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[10:36:04] (PS1) Marostegui: es2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108039 (https://phabricator.wikimedia.org/T381848)
[10:37:17] (CR) Marostegui: [C:+2] es2021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108039 (https://phabricator.wikimedia.org/T381848) (owner: Marostegui)
[11:04:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es4 codfw master', diff saved to https://phabricator.wikimedia.org/P71779 and previous config saved to /var/cache/conftool/dbconfig/20250103-110440-marostegui.json
[11:05:01] !log Switchover es4 codfw master to es2043 dbmaint T381848
[11:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:07] T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[11:09:48] (PS1) Marostegui: wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108040 (https://phabricator.wikimedia.org/T381848)
[11:11:48] (CR) Marostegui: [C:+2] wmnet: Update es4-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108040 (https://phabricator.wikimedia.org/T381848) (owner: Marostegui)
[11:12:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2020 T382945', diff saved to https://phabricator.wikimedia.org/P71780 and previous config saved to /var/cache/conftool/dbconfig/20250103-111255-marostegui.json
[11:12:59] T382945: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945
[11:17:39] (PS2) Tim Starling: beta: Use update.php --doshared [puppet] - https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389)
[11:17:42] (CR) Ladsgroup: [V:+2 C:+2] beta: Use update.php --doshared [puppet] - https://gerrit.wikimedia.org/r/1107980 (https://phabricator.wikimedia.org/T382389) (owner: Tim Starling)
[11:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:32] (PS1) Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - https://gerrit.wikimedia.org/r/1108041
[11:39:43] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108041 (owner: Muehlenhoff)
[11:42:49] (PS2) Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - https://gerrit.wikimedia.org/r/1108041
[11:44:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10428944 (phaultfinder)
[11:48:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108041 (owner: Muehlenhoff)
[11:54:07] (PS3) Muehlenhoff: Add an option to pass the Presto firewall settings compatible with nftables [puppet] - https://gerrit.wikimedia.org/r/1108041
[12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T0800)
[12:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250103T1200).
[12:00:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108041 (owner: Muehlenhoff)
[12:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2022 T381848', diff saved to https://phabricator.wikimedia.org/P71781 and previous config saved to /var/cache/conftool/dbconfig/20250103-120132-marostegui.json
[12:01:36] T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[12:02:56] (PS1) Marostegui: es2022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108054 (https://phabricator.wikimedia.org/T382946)
[12:03:23] (CR) Marostegui: [C:+2] es2022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108054 (https://phabricator.wikimedia.org/T382946) (owner: Marostegui)
[12:10:25] !log renewed internal Ganeti certs in eqsin (would have expired in two days) T382873
[12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] T382873: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873
[12:30:02] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10428993 (MoritzMuehlenhoff)
[12:35:08] (PS1) Btullis: Add conftool-data for dbstore hosts to dbctl [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947)
[12:36:18] (CR) Marostegui: [C:+1] "You'd also need to fill them up in dbctl once they are up. I'd suggest not to merge this on a Friday afternoon." [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:39:03] (CR) Ladsgroup: [C:-1] "There is still quite a chance to accidentally flow traffic to dbstore (and vice versa, the dumps reading from other prod replicas). I sugg" [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:39:34] (CR) Marostegui: [C:+1] "Let's discuss that on the task." [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[12:46:19] (CR) Btullis: [C:-1] "Setting to -1 for now, whie we discuss. We will also need to configure the firewalls on the dbstore100[7-9] servers, accordingly." [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[13:37:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[13:37:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[13:39:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429075 (phaultfinder)
[14:35:24] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10429165 (MoritzMuehlenhoff)
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:45] (PS1) CDanis: otelcol: drop service-runner healthchecks [deployment-charts] - https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750)
[14:45:38] (CR) Ssingh: [C:+1] "Looks good thanks!" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[14:47:05] (CR) Ssingh: [C:+1] "(dns1004 NOOP looks good that is and therefore the prod DNS hosts)" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[14:53:39] (PS1) Muehlenhoff: crm: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1108088
[15:02:30] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:03:06] (PS1) Btullis: Add caps to allow ceph-csi-cephfs to work with the dumps filesystem [puppet] - https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490)
[15:03:32] (CR) Andrew Bogott: [C:+2] pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[15:03:58] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4730/co" [puppet] - https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:59] (PS6) FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:07:30] RESOLVED: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:08:17] (CR) FNegri: prometheus-node-kernel-panic: use prom labels (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: FNegri)
[15:09:46] (PS7) FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:10:09] (CR) CI reject: [V:-1] prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: FNegri)
[15:12:34] (CR) Ssingh: "Nice work, looking good! I think we are on the right path but I have some questions from our last discussion (see in-line):" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[15:13:14] (PS1) Btullis: Add a storageclass for the dumps file system [deployment-charts] - https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490)
[15:18:28] (PS8) FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:18:29] (PS1) FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - https://gerrit.wikimedia.org/r/1108091
[15:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:27:54] (PS9) FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:27:55] (PS2) FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - https://gerrit.wikimedia.org/r/1108091
[15:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:39] (PS10) FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378)
[15:28:39] (PS3) FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - https://gerrit.wikimedia.org/r/1108091
[15:29:16] (PS4) FNegri: promtheus-node-kernel-panic: rename to "messages" [puppet] - https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T379378)
[15:30:24] (PS5) FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T379378)
[15:38:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108088 (owner: Muehlenhoff)
[15:40:36] (CR) Andrew Bogott: "I am not 100% sure that this change is meaningful, but here's what I'm thinking. The status quo is:" [puppet] - https://gerrit.wikimedia.org/r/1105944 (owner: Andrew Bogott)
[15:43:19] (PS1) Muehlenhoff: Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870)
[15:43:33] (PS2) Muehlenhoff: Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870)
[15:47:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870) (owner: Muehlenhoff)
[16:02:20] (CR) Ssingh: [C:+1] "That's fair!
I guess we could try to replicate the behaviour at least on the pdns-rec side to see what works out the best for us but you a" [puppet] - 10https://gerrit.wikimedia.org/r/1105944 (owner: 10Andrew Bogott) [16:04:35] (03PS1) 10Thcipriani: Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 [16:08:44] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:09:44] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [16:17:29] (03CR) 10Pppery: [C:03+1] "Yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi) [16:25:38] 10SRE-swift-storage, 07Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802#10429423 (10Koavf) Note that there is now a local list at Commons for very large files: https://commons.wikimedia.org/wiki/Commons:Very_large_files_to_upload. This can be use... 
[16:38:13] (03CR) 10Thcipriani: "check experimental" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani) [16:39:54] (03CR) 10Thcipriani: [C:03+2] Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani) [16:40:30] (03Merged) 10jenkins-bot: Revert "Reinstate the banner for the developer survey" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1108097 (owner: 10Thcipriani) [16:55:20] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2003.wikimedia.org only) [16:55:28] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2003.wikimedia.org only) (duration: 00m 08s) [16:56:08] (03PS4) 10Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) [16:56:20] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos) [16:57:50] (03Merged) 10jenkins-bot: ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) (owner: 10Ilias Sarantopoulos) [17:00:15] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[17:04:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10429460 (10VRiley-WMF) Hey @MatthewVernon I have opened a ticket with supermicro and will work with them on getting a new drive ASAP (supermicro ticket is #ANT-782-48642) [17:06:38] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2002.wikimedia.org only) [17:06:46] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit2002.wikimedia.org only) (duration: 00m 08s) [17:14:11] (03CR) 10AOkoth: [C:03+2] kubectl: image with kubectl installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:14:16] (03CR) 10AOkoth: [V:03+2 C:03+2] kubectl: image with kubectl installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1107952 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:14:36] (03PS1) 10CDanis: draft: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) [17:15:19] !log thcipriani@deploy2002 Started deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit1003.wikimedia.org only) [17:15:29] !log thcipriani@deploy2002 Finished deploy [gerrit/gerrit@44854d4]: remove developer satisfaction survey banner (gerrit1003.wikimedia.org only) (duration: 00m 10s) [17:25:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429488 (10phaultfinder) [17:27:17] (03PS1) 10AOkoth: kubectl: add changelog for kubectl image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1108107 [17:29:02] (03CR) 10AOkoth: [C:03+2] kubectl: add changelog for kubectl image [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1108107 (owner: 10AOkoth) [17:29:12] (03CR) 10AOkoth: [V:03+2 C:03+2] kubectl: add changelog for kubectl image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1108107 (owner: 10AOkoth) [17:37:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:37:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [17:55:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429561 (10phaultfinder) [18:13:03] (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [18:16:55] (03PS11) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) [18:16:57] (03PS6) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) [18:32:38] (03CR) 10Dzahn: [C:03+1] "if everything just changes from existing "parking" then not much can go wrong here. 
lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall) [18:38:20] (03CR) 10Dzahn: [C:03+2] devtools: fix hiera after host renaming [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar) [18:43:48] PROBLEM - Host doc2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:39] (03CR) 10Dzahn: [C:03+2] "I assume the puppet run _would_ be fixed now but there is another unrelated issue which I just filed for other instances as https://phabri" [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar) [18:45:17] FIRING: [2x] ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:46:50] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doc2002.codfw.wmnet with reason: Disk Change [18:47:05] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc2002.codfw.wmnet with reason: Disk Change [18:54:04] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:07:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724#10429718 (10Arnoldokoth) o/ I ran into this issue trying to access the console for doc2002 and I think it's... 
[19:09:19] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on doc2002.codfw.wmnet with reason: Disk Change [19:09:22] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on doc2002.codfw.wmnet with reason: Disk Change [19:17:39] (03PS1) 10AOkoth: docs: alert only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) [19:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429741 (10phaultfinder) [19:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429744 (10phaultfinder) [19:37:37] (03CR) 10Dwisehaupt: [C:03+1] "This looks fine to me. The CDN should be the only entry point of access currently." [puppet] - 10https://gerrit.wikimedia.org/r/1108088 (owner: 10Muehlenhoff) [19:49:31] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381635#10429776 (10VRiley-WMF) Hi, is there a specific time that would be preferred for us to take a look at this and swap the module if needed? [20:16:56] PROBLEM - MariaDB Replica SQL: s7 on db1171 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cawiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:19:24] I'll fix that [20:21:56] RECOVERY - MariaDB Replica SQL: s7 on db1171 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429842 (10phaultfinder) [20:40:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429907 (10phaultfinder) [21:27:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10429991 (10phaultfinder) [21:32:30] (03CR) 10Catrope: [C:03+1] draft: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [21:37:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [21:37:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [21:37:49] (03CR) 10BCornwall: [C:03+2] Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall) [21:38:08] (03PS2) 10BCornwall: Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) [21:41:12] (03CR) 10BCornwall: [V:03+2 C:03+2] Point various parking domains to ncredir-parking 
[dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall) [21:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10430019 (10phaultfinder) [22:10:00] (03PS5) 10Krinkle: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [22:10:03] (03CR) 10Krinkle: [C:03+1] Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 (owner: 10Cwhite) [22:11:56] (03PS1) 10Jdlrobson: Move logic for type infering to server [extensions/Chart] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1108130 (https://phabricator.wikimedia.org/T382042) [22:13:20] (03PS3) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [22:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10430065 (10phaultfinder) [22:38:11] (03PS1) 10Kimberly Sarabia: Remove `wgVectorStickyHeader` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) [22:45:17] FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:01:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:11:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query 
in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:27] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed