[00:01:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:04:32] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1161692 [00:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1161692 (owner: 10TrainBranchBot) [00:29:27] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1161692 (owner: 10TrainBranchBot) [00:36:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:01:05] RESOLVED: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:09:25] PROBLEM - mysqld processes on es2045 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [01:09:43] PROBLEM - MariaDB read only es5 on es2045 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [02:04:42] (03PS1) 10Tim Starling: Suppress mobile redirect for Googlebot Smartphone on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) [02:28:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:43:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:48:03] (03PS2) 10Tim Starling: Suppress mobile redirect for Googlebot Smartphone on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) [02:57:03] (03CR) 10Tim Starling: "PS1" [puppet] - 10https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) (owner: 10Tim Starling) [03:23:49] (03PS1) 10Krinkle: Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 [03:24:03] (03PS2) 10Krinkle: Set wgCentralBannerRecorder to /beacon/… instead of //example.org/beacon/… [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161757 (https://phabricator.wikimedia.org/T390929) [03:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:58:21] (03CR) 10Krinkle: [C:03+1] Suppress mobile redirect for Googlebot Smartphone on Commons [puppet] - 10https://gerrit.wikimedia.org/r/1161727 (https://phabricator.wikimedia.org/T397267) (owner: 10Tim Starling) [04:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on 
puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:39:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:44:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:03] PROBLEM - Host pc2012 #page is DOWN: PING CRITICAL - Packet loss = 100% [05:09:47] I guess this is because of the transition to 10G which wasn't finished yesterday and the downtime expired [05:09:49] <_joe_> I am afk at the moment [05:09:59] <_joe_> ah ok [05:10:18] <_joe_> should we ack it? 
[05:10:22] <_joe_> !incidents [05:10:22] 6377 (UNACKED) Host pc2012 (paged) [05:10:22] 6375 (RESOLVED) Host es2045 (paged) [05:10:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on pc1012.eqiad.wmnet with reason: Maintenance [05:10:48] <_joe_> !ack 6377 [05:10:48] 6377 (ACKED) Host pc2012 (paged) [05:14:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:18:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [05:21:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Upgrade [05:22:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance [05:23:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T396130)', diff saved to https://phabricator.wikimedia.org/P78499 and previous config saved to /var/cache/conftool/dbconfig/20250620-052300-marostegui.json [05:23:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:31:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T396130)', diff saved to https://phabricator.wikimedia.org/P78500 and previous config saved to /var/cache/conftool/dbconfig/20250620-053137-marostegui.json [05:31:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [05:39:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:46:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2147', diff saved to https://phabricator.wikimedia.org/P78501 and previous config saved to /var/cache/conftool/dbconfig/20250620-054644-marostegui.json [05:51:02] (03CR) 10Effie Mouzeli: [C:03+1] memcached::instance: Remove support for Ferm-only syntax [puppet] - 10https://gerrit.wikimedia.org/r/1161511 (owner: 10Muehlenhoff) [05:51:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ... [05:51:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [05:55:36] (03PS2) 10Effie Mouzeli: site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994) [05:56:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ... 
[05:56:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T0600) [06:01:50] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#10933510 (10DSantamaria) [06:01:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P78502 and previous config saved to /var/cache/conftool/dbconfig/20250620-060151-marostegui.json [06:02:08] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-staging-etcd2002.codfw.wmnet to plain [06:02:09] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10933511 (10jijiki) >>! In T396584#10928125, @elukey wrote: >>>! In T396584#10927316, @MatthewVernon wrote: >> Silly qu... [06:02:50] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-staging-etcd2002.codfw.wmnet to plain [06:05:51] FIRING: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ... 
[06:05:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:11:11] !incidents [06:11:12] 6377 (ACKED) Host pc2012 (paged) [06:11:12] 6379 (UNACKED) TransitPeeringOutboundSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [06:11:12] 6378 (RESOLVED) TransitPeeringOutboundSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [06:11:12] 6375 (RESOLVED) Host es2045 (paged) [06:11:21] !ack 6379 [06:11:21] 6379 (ACKED) TransitPeeringOutboundSaturation network sre (cr4-ulsfo:9804 Peering: Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749} xe-0/1/4 gnmi ulsfo) [06:14:23] <_joe_> sigh something is wrong still in the bwlimits [06:16:25] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161802 (https://phabricator.wikimedia.org/T397425) [06:17:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T396130)', diff saved to https://phabricator.wikimedia.org/P78503 and previous config saved to /var/cache/conftool/dbconfig/20250620-061659-marostegui.json [06:17:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:17:10] (03CR) 10Marostegui: [C:03+2] 
control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161802 (https://phabricator.wikimedia.org/T397425) (owner: 10Marostegui) [06:17:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance [06:19:08] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: New version [software] - 10https://gerrit.wikimedia.org/r/1161802 (https://phabricator.wikimedia.org/T397425) (owner: 10Marostegui) [06:19:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:20:51] RESOLVED: TransitPeeringOutboundSaturation: Transit or peering outbound traffic above 90% capacity - cr4-ulsfo:xe-0/1/4 (Peering: ... [06:20:51] Equinix (111916-SV1-IX-01 MAC filter) {#DLRMXC791749}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringOutboundSaturation [06:23:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance [06:23:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T396130)', diff saved to https://phabricator.wikimedia.org/P78504 and previous config saved to /var/cache/conftool/dbconfig/20250620-062307-marostegui.json [06:23:12] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:24:24] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - 
https://phabricator.wikimedia.org/T396584#10933529 (10Jgiannelos) Historically there were many cases where maps had issues which led to stale caches and or needi... [06:24:41] FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:31:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T396130)', diff saved to https://phabricator.wikimedia.org/P78506 and previous config saved to /var/cache/conftool/dbconfig/20250620-063145-marostegui.json [06:31:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [06:39:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:40:03] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2023.codfw.wmnet [06:44:35] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [06:46:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P78507 and previous config saved to /var/cache/conftool/dbconfig/20250620-064652-marostegui.json [06:48:54] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [06:50:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2023.codfw.wmnet decommissioned, removing all IPs except the asset tag 
one - jmm@cumin1003" [06:50:32] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:50:33] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2023.codfw.wmnet [06:51:50] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts ganeti2024.codfw.wmnet [06:56:39] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T0700) [07:02:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P78508 and previous config saved to /var/cache/conftool/dbconfig/20250620-070200-marostegui.json [07:02:13] jmm@cumin1003 decommission (PID 2370173) is awaiting input [07:11:19] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [07:12:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2024.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" [07:12:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:12:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2024.codfw.wmnet [07:13:49] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission ganeti2023 / ganeti2024 - https://phabricator.wikimedia.org/T397311#10933566 (10MoritzMuehlenhoff) [07:15:56] (03CR) 10Stevemunene: [C:03+2] blunderbuss: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161573 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [07:17:08] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T396130)', diff saved to https://phabricator.wikimedia.org/P78509 and previous config saved to /var/cache/conftool/dbconfig/20250620-071707-marostegui.json [07:17:13] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:17:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [07:17:28] (03Merged) 10jenkins-bot: blunderbuss: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161573 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [07:17:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T396130)', diff saved to https://phabricator.wikimedia.org/P78510 and previous config saved to /var/cache/conftool/dbconfig/20250620-071730-marostegui.json [07:20:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2045.codfw.wmnet with reason: Firmware downgrade pending [07:26:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T396130)', diff saved to https://phabricator.wikimedia.org/P78511 and previous config saved to /var/cache/conftool/dbconfig/20250620-072605-marostegui.json [07:26:10] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:29:35] (03PS1) 10Muehlenhoff: Remove ganeti role from ganeti2021 [puppet] - 10https://gerrit.wikimedia.org/r/1161822 (https://phabricator.wikimedia.org/T396590) [07:40:25] RECOVERY - Host pc2012 #page is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [07:40:38] PROBLEM - mysqld processes on pc2012 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:40:38] 
PROBLEM - MariaDB Replica IO: pc2 on pc2012 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:40:38] PROBLEM - MariaDB Replica Lag: pc2 on pc2012 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:40:38] PROBLEM - MariaDB Event Scheduler pc2 on pc2012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [07:40:54] PROBLEM - MariaDB Replica SQL: pc2 on pc2012 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P78512 and previous config saved to /var/cache/conftool/dbconfig/20250620-074112-marostegui.json [07:41:32] Do I extend the downtime, marostegui ? 
[07:41:38] RECOVERY - mysqld processes on pc2012 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:41:38] RECOVERY - MariaDB Replica Lag: pc2 on pc2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:38] RECOVERY - MariaDB Replica IO: pc2 on pc2012 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:41:40] RECOVERY - MariaDB Event Scheduler pc2 on pc2012 is OK: Version 10.11.13-MariaDB-log, Uptime 43s, read_only: False, event_scheduler: True, 22.86 QPS, connection latency: 0.024669s, query latency: 0.000585s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [07:41:48] oh, you seem to be working on it [07:41:54] RECOVERY - MariaDB Replica SQL: pc2 on pc2012 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:42:17] jynus: earlier this morning he wrote: "I guess this is because of the transition to 10G which wasn't finished yesterday and the downtime expired" [07:42:26] yes, I read it [07:42:30] ok [07:42:57] but mariadb doesn't start on its own, so he must be working right now on it [07:45:02] I am working on it [07:45:08] replication was badly broken [07:45:11] I am fixing it [07:45:21] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti2021 [puppet] - 10https://gerrit.wikimedia.org/r/1161822 (https://phabricator.wikimedia.org/T396590) (owner: 10Muehlenhoff) [07:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:52:35] <_joe_> marostegui: I'm not a dba, but I wouldn't put the server back in rotation unless replication is fixed [07:53:32] _joe_: I will in a bit, I need to
remove downtimes and check a few more things [07:53:42] <_joe_> sure do your thing [07:56:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P78513 and previous config saved to /var/cache/conftool/dbconfig/20250620-075619-marostegui.json [07:58:46] (03PS1) 10Muehlenhoff: Fix firewall config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1161827 [07:58:46] (03PS1) 10Muehlenhoff: Update server entry for idp-test in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1161828 [07:58:46] (03PS1) 10Muehlenhoff: acmechief: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161829 [07:58:47] (03PS1) 10Muehlenhoff: site.pp: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161830 [07:59:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc2 T378715', diff saved to https://phabricator.wikimedia.org/P78514 and previous config saved to /var/cache/conftool/dbconfig/20250620-075944-root.json [07:59:49] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715 [08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T396130)', diff saved to https://phabricator.wikimedia.org/P78515 and previous config saved to /var/cache/conftool/dbconfig/20250620-081127-marostegui.json [08:11:32] !log 
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance [08:11:32] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:12:01] (03PS9) 10Brouberol: Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:12:27] (03CR) 10Brouberol: [C:03+1] Airflow analytics-test: Optimization for LocalExecutors [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161047 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:16:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance [08:16:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78516 and previous config saved to /var/cache/conftool/dbconfig/20250620-081638-marostegui.json [08:16:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:16:49] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10933677 (10Gehel) [08:20:10] (03CR) 10Vgutierrez: [C:03+1] acmechief: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161829 (owner: 10Muehlenhoff) [08:20:13] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10933693 (10Gehel) [08:22:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10933703 (10Gehel) [08:22:22] 10ops-eqiad, 06SRE, 
06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10933704 (10Gehel) [08:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78517 and previous config saved to /var/cache/conftool/dbconfig/20250620-082420-marostegui.json [08:24:22] 10SRE-SLO, 10observability, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): WDQS Update Lag SLO looks wrong - https://phabricator.wikimedia.org/T395987#10933717 (10Gehel) 05In progress→03Resolved a:03Gehel [08:24:26] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:26:09] (03PS1) 10Giuseppe Lavagetto: fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 [08:26:16] (03CR) 10Slyngshede: [C:03+1] Fix firewall config for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1161827 (owner: 10Muehlenhoff) [08:26:40] (03CR) 10Slyngshede: [C:03+1] Update server entry for idp-test in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1161828 (owner: 10Muehlenhoff) [08:28:24] (03CR) 10CI reject: [V:04-1] fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 (owner: 10Giuseppe Lavagetto) [08:29:28] (03CR) 10Slyngshede: [C:03+1] acmechief: Remove idp-test2004 [puppet] - 10https://gerrit.wikimedia.org/r/1161829 (owner: 10Muehlenhoff) [08:34:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WikiLambda] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1161622 (https://phabricator.wikimedia.org/T396978) (owner: 10Jforrester) [08:35:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC morning 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154121 (owner: 10Jforrester) [08:35:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156351 (owner: 10Jforrester) [08:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P78518 and previous config saved to /var/cache/conftool/dbconfig/20250620-083928-marostegui.json [08:40:01] (03PS1) 10Hashar: wikitech: remove logging configuration for hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161871 (https://phabricator.wikimedia.org/T371592) [08:48:45] (03CR) 10JMeybohm: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161533 (https://phabricator.wikimedia.org/T397341) (owner: 10Alexandros Kosiaris) [08:51:46] (03CR) 10JMeybohm: [C:03+1] calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris) [08:52:09] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161830 (owner: 10Muehlenhoff) [08:52:23] (03CR) 10Ladsgroup: tables-catalog: add PageAssessments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [08:52:26] (03PS1) 10Federico Ceratto: CAS: Add wmf group for Zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) [08:52:26] (03CR) 10Federico Ceratto: "1-line change to add "wmf" access to Zarcillo" [puppet] - 10https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [08:53:57] (03CR) 10JMeybohm: [C:03+2] kind.sh can 
bootstrap a wikikube like cluster with kind [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [08:54:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P78519 and previous config saved to /var/cache/conftool/dbconfig/20250620-085435-marostegui.json [08:55:09] (03Merged) 10jenkins-bot: kind.sh can bootstrap a wikikube like cluster with kind [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154293 (https://phabricator.wikimedia.org/T396107) (owner: 10JMeybohm) [08:56:33] (03CR) 10Majavah: "ooc, is there a specific reason to include `wmf` but not `nda`?" [puppet] - 10https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [08:57:10] (03CR) 10JMeybohm: [C:03+1] make kubectl-completion alternative entry dependent on kubectl [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [08:57:56] (03CR) 10JMeybohm: [C:03+1] make kubectl-completion alternative entry dependent on kubectl (v1.31) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161526 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:09:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T396130)', diff saved to https://phabricator.wikimedia.org/P78520 and previous config saved to /var/cache/conftool/dbconfig/20250620-090943-marostegui.json [09:09:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:09:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2210.codfw.wmnet with reason: Maintenance [09:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P78521 and previous config saved to /var/cache/conftool/dbconfig/20250620-091005-marostegui.json [09:10:17] (03CR) 10Muehlenhoff: CAS: Add wmf group for Zarcillo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) (owner: 10Federico Ceratto) [09:11:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:12:42] (03PS2) 10Federico Ceratto: CAS: Add wmf group for Zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304) [09:18:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:18:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T396130)', diff saved to https://phabricator.wikimedia.org/P78522 and previous config saved to /var/cache/conftool/dbconfig/20250620-091847-marostegui.json [09:18:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:18:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:19:26] effie: I looked at mw-experimental (exciting!) and already have a few questions :) [09:19:37] does /srv/mediawiki update every 30 minutes or every hour? https://wikitech.wikimedia.org/wiki/Mw-experimental says both :P [09:20:08] what does the first step (helmfile apply) do? right now I can SSH into experimental.eqiad without having run it (I guess it’s already running) [09:20:23] does it get automatically undeployed at some point? am I supposed to do any “cleanup” after I’m done with it? 
[09:21:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:23:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:23:56] (03CR) 10Jelto: [C:03+2] make kubectl-completion alternative entry dependent on kubectl (v1.31) [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161526 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:24:54] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2009.codfw.wmnet with reason: Maintenance and reboot [09:24:55] (03PS1) 10Btullis: airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) [09:25:41] (03PS3) 10Cparle: Add UploadWizard tables [puppet] - 10https://gerrit.wikimedia.org/r/1161562 (https://phabricator.wikimedia.org/T393793) [09:31:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:32:30] (03PS2) 10Btullis: airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) [09:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P78523 and previous config saved to /var/cache/conftool/dbconfig/20250620-093354-marostegui.json [09:36:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:38:48] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161883 (https://phabricator.wikimedia.org/T392420) [09:41:08] 
(03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161883 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [09:41:17] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161883 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [09:41:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:43:05] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161883 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [09:44:19] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:44:31] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:47:27] (03PS1) 10Jelto: fix newline in postinst script [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161884 (https://phabricator.wikimedia.org/T387548) [09:49:01] (03PS1) 10Jakob: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161885 [09:49:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P78524 and previous config saved to /var/cache/conftool/dbconfig/20250620-094901-marostegui.json [09:49:17] (03CR) 10JMeybohm: [C:03+1] fix newline in postinst script [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161884 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:51:27] (03CR) 10Dima koushha: [C:03+1] Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161885 (owner: 10Jakob) [09:51:35] (03CR) 10Jakob: [C:03+2] Revert 
"wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161885 (owner: 10Jakob) [09:52:01] (03CR) 10Jelto: [C:03+2] fix newline in postinst script [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161884 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:55:09] (03Merged) 10jenkins-bot: Revert "wikidata-query-gui: Bump query-gui image version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161885 (owner: 10Jakob) [09:55:32] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2009.codfw.wmnet: Renew puppet certificate - root@cumin1002 [09:56:14] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:56:24] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:57:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:58:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:00:05] (03PS1) 10Jelto: remove priority from slave [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161888 (https://phabricator.wikimedia.org/T387548) [10:01:23] (03CR) 10JMeybohm: [C:03+1] remove priority from slave [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161888 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [10:01:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:03:52] (03PS1) 10Giuseppe Lavagetto: Bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1161889 [10:04:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T396130)', diff saved to https://phabricator.wikimedia.org/P78525 and previous config saved to 
/var/cache/conftool/dbconfig/20250620-100409-marostegui.json [10:04:12] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Bugfixes [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1161889 (owner: 10Giuseppe Lavagetto) [10:04:15] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:04:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2219.codfw.wmnet with reason: Maintenance [10:04:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78526 and previous config saved to /var/cache/conftool/dbconfig/20250620-100431-marostegui.json [10:04:51] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003" [10:04:52] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003 [10:05:22] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfixes - oblivian@cumin1003 [10:05:24] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfixes - oblivian@cumin1003" [10:09:40] (03PS1) 10Jcrespo: mariadb: Upgrade db2184 (backup1-codfw replica) to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161890 (https://phabricator.wikimedia.org/T394487) [10:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78527 and previous config saved to /var/cache/conftool/dbconfig/20250620-101116-marostegui.json [10:11:19] !log jynus@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2184.codfw.wmnet with reason: mariadb upgrade [10:11:21] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:12:41] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2184 (backup1-codfw replica) to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1161890 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [10:13:07] (03PS2) 10Giuseppe Lavagetto: fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 [10:15:29] (03CR) 10CI reject: [V:04-1] fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 (owner: 10Giuseppe Lavagetto) [10:17:23] (03PS3) 10Giuseppe Lavagetto: fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 [10:18:41] (03CR) 10JMeybohm: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [10:19:43] (03CR) 10CI reject: [V:04-1] fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 (owner: 10Giuseppe Lavagetto) [10:22:32] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161892 (https://phabricator.wikimedia.org/T397452) [10:23:33] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161892 (https://phabricator.wikimedia.org/T397452) (owner: 10Jakob) [10:24:03] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161892 (https://phabricator.wikimedia.org/T397452) (owner: 10Jakob) [10:25:43] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui 
image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161892 (https://phabricator.wikimedia.org/T397452) (owner: 10Jakob) [10:26:16] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [10:26:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P78528 and previous config saved to /var/cache/conftool/dbconfig/20250620-102623-marostegui.json [10:26:27] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [10:27:50] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [10:27:59] (03PS1) 10Hnowlan: changeprop: emit abandoned events metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161893 (https://phabricator.wikimedia.org/T397072) [10:28:07] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [10:29:07] (03CR) 10Jelto: [C:03+2] remove priority from slave [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161888 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [10:29:10] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [10:29:24] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [10:33:56] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, and 2 others: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10934155 (10MoritzMuehlenhoff) >>! In T396584#10933529, @Jgiannelos wrote: > Historically there were many cases where m... 
[10:36:14] (03CR) 10Clément Goubert: sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1160817 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [10:41:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P78529 and previous config saved to /var/cache/conftool/dbconfig/20250620-104131-marostegui.json [10:41:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:42:26] (03PS1) 10Jelto: remove priority from update-alternatives --remove [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161894 (https://phabricator.wikimedia.org/T387548) [10:46:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:48:04] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:53:38] (03PS1) 10Muehlenhoff: Add Joanna to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/1161896 [10:55:53] 07sre-alert-triage, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10934195 (10Stevemunene) 05Open→03Resolved Checked the status of these hosts and they all seem to be ok an-pres... [10:56:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T396130)', diff saved to https://phabricator.wikimedia.org/P78530 and previous config saved to /var/cache/conftool/dbconfig/20250620-105638-marostegui.json [10:56:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:56:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2236.codfw.wmnet with reason: Maintenance [10:57:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T396130)', diff saved to https://phabricator.wikimedia.org/P78531 and previous config saved to /var/cache/conftool/dbconfig/20250620-105701-marostegui.json [10:58:36] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161896 (owner: 10Muehlenhoff) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T0700) [11:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T1100). 
[11:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T396130)', diff saved to https://phabricator.wikimedia.org/P78532 and previous config saved to /var/cache/conftool/dbconfig/20250620-110354-marostegui.json [11:04:00] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:07:39] (03PS1) 10Slyngshede: P:idm update manager config for Joannas account [puppet] - 10https://gerrit.wikimedia.org/r/1161917 [11:19:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P78533 and previous config saved to /var/cache/conftool/dbconfig/20250620-111901-marostegui.json [11:20:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1161917 (owner: 10Slyngshede) [11:23:13] (03CR) 10Slyngshede: [C:03+2] P:idm update manager config for Joannas account [puppet] - 10https://gerrit.wikimedia.org/r/1161917 (owner: 10Slyngshede) [11:25:28] (03PS1) 10Kamila Součková: Update codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1161929 (https://phabricator.wikimedia.org/T397148) [11:25:30] (03PS1) 10Kamila Součková: Update codfw eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) [11:26:08] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [11:26:11] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161929 (https://phabricator.wikimedia.org/T397148) (owner: 10Kamila Součková) [11:28:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:28:56] (03PS1) 10Btullis: Drop af_hit_count from abuse_filter 
views [puppet] - 10https://gerrit.wikimedia.org/r/1161931 (https://phabricator.wikimedia.org/T397508) [11:30:59] (03CR) 10JMeybohm: [C:03+1] remove priority from update-alternatives --remove [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161894 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:32:21] (03CR) 10Dreamy Jazz: [C:03+1] "No rights to merge, but change looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1161931 (https://phabricator.wikimedia.org/T397508) (owner: 10Btullis) [11:32:28] (03CR) 10Jelto: [C:03+2] remove priority from update-alternatives --remove [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1161894 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:34:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P78534 and previous config saved to /var/cache/conftool/dbconfig/20250620-113410-marostegui.json [11:40:36] (03PS2) 10Jelto: make kubectl-completion alternative entry dependent on kubectl (v1.23) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) [11:45:39] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:46:59] (03CR) 10JMeybohm: [C:03+1] make kubectl-completion alternative entry dependent on kubectl (v1.23) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:48:04] (03PS1) 10Esanders: Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) [11:49:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T396130)', diff saved to https://phabricator.wikimedia.org/P78535 and previous config saved 
to /var/cache/conftool/dbconfig/20250620-114917-marostegui.json [11:49:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:49:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2237.codfw.wmnet with reason: Maintenance [11:49:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T396130)', diff saved to https://phabricator.wikimedia.org/P78536 and previous config saved to /var/cache/conftool/dbconfig/20250620-114941-marostegui.json [11:51:52] (03CR) 10Jelto: [C:03+2] make kubectl-completion alternative entry dependent on kubectl (v1.23) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1161513 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T396130)', diff saved to https://phabricator.wikimedia.org/P78537 and previous config saved to /var/cache/conftool/dbconfig/20250620-115629-marostegui.json [11:56:36] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:59:04] !log import kubernetes 1.23.14-6 and 1.31.4-5 to apt host - T387548 [11:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:09] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [12:04:40] (03PS2) 10Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) [12:05:18] (03CR) 10Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on 
puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:31] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161895 (https://phabricator.wikimedia.org/T392420) [12:11:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P78538 and previous config saved to /var/cache/conftool/dbconfig/20250620-121136-marostegui.json [12:18:53] !log slyngshede@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:19:14] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary [12:20:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, only nit is description should not have 'eqiad' in it as it just changes the codfw range." [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [12:23:43] (03PS2) 10Kamila Součková: Update codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) [12:24:35] (03CR) 10Kamila Součková: "Oops, copypasta consequences, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [12:26:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P78539 and previous config saved to /var/cache/conftool/dbconfig/20250620-122644-marostegui.json [12:27:03] (03CR) 10Urbanecm: "set of pilot wikis is confirmed, until the variant is assigned to some users, this doesn't do anything" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [12:27:04] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161895 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:27:45] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161895 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:28:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:28:23] (03PS3) 10Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) [12:29:28] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161895 (https://phabricator.wikimedia.org/T392420) (owner: 10Jakob) [12:31:02] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:31:13] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:31:45] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [12:32:02] !log jakob@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/wikidata-query-gui: apply [12:32:21] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [12:32:35] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [12:35:06] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T395518#10934376 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:35:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397386#10934378 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:38:10] 10ops-eqiad, 06SRE, 06DC-Ops: Rack and cable a single mgmt switch in one of the future machine learning racks - https://phabricator.wikimedia.org/T395941#10934384 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Racked and cabled newer msw netgear in rack e11 waiting on power will setup at later da... [12:38:31] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10934387 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:41:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T396130)', diff saved to https://phabricator.wikimedia.org/P78540 and previous config saved to /var/cache/conftool/dbconfig/20250620-124151-marostegui.json [12:41:57] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:42:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance [12:45:05] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [12:53:49] (03CR) 10Joal: [C:03+1] "Change looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1161931 (https://phabricator.wikimedia.org/T397508) (owner: 10Btullis) [12:54:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2240.codfw.wmnet with reason: Maintenance [12:58:03] !log upload liberica 0.21 to apt.wm.o (bookworm-wikimedia) [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs1013.eqiad.wmnet} and A:liberica [12:59:34] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs1013.eqiad.wmnet} and A:liberica [13:03:59] (03CR) 10Bking: [C:03+2] services: mw-page-content-change : raise JobManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161554 (https://phabricator.wikimedia.org/T397336) (owner: 10Gmodena) [13:04:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Pool in API for db1252 - see T385141', diff saved to https://phabricator.wikimedia.org/P78541 and previous config saved to /var/cache/conftool/dbconfig/20250620-130423-fceratto.json [13:04:30] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [13:07:14] (03PS1) 10Kamila Součková: Update codfw to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161945 (https://phabricator.wikimedia.org/T397148) [13:07:18] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7002.magru.wmnet} and A:liberica [13:08:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7002.magru.wmnet} and A:liberica [13:15:20] jouncebot: nowandnext [13:15:21] For the next 17 hour(s) and 44 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T0700) [13:15:21] In 17 hour(s) and 44 minute(s): No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250621T0700) [13:15:29] Want to deploy a security fix. [13:16:09] Should be low risk to cause broken functionality but is more risky if we leave unfixed. [13:22:05] (03PS1) 10Kamila Součková: admin_ng: Change codfw pod ip range to 10.194.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161948 (https://phabricator.wikimedia.org/T375845) [13:23:18] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7001.magru.wmnet} and A:liberica [13:23:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:23:55] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7001.magru.wmnet} and A:liberica [13:24:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [13:24:52] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[4008-4009].ulsfo.wmnet} and A:liberica [13:26:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[4008-4009].ulsfo.wmnet} and A:liberica [13:27:16] 10ops-esams, 06SRE, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213#10934479 (10cmooney) @robh these are ongoing and while not increasing badly we probably should do something before the situation gets worse. https:... 
[13:27:48] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:27:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:31:46] (CR) Kosta Harlan: [C:-1] "Per https://phabricator.wikimedia.org/T364705#10904025, I think we are not ready to use RRML without having recalculated the thresholds fo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152770 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[13:32:11] (PS1) Jhancock.wm: Adding sretest2009 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1161949 (https://phabricator.wikimedia.org/T396365)
[13:32:37] (PS1) Gergő Tisza: Fix password handling for non-existent users [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372)
[13:33:13] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs[5004-5005].eqsin.wmnet} and A:liberica
[13:35:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:36:16] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs[5004-5005].eqsin.wmnet} and A:liberica
[13:38:01] (PS7) Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[13:38:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[13:38:15] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934502 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by
jclark@cumin1002 for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[13:39:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:39:15] (PS1) Hnowlan: changeprop: implement batch_size parameter for pcs job [deployment-charts] - https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072)
[13:39:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:39:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm
[13:39:30] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm
[13:40:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372) (owner: Gergő Tisza)
[13:40:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm
[13:41:01] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934509 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm
[13:47:27] (PS1) Bking: cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146)
[13:47:38] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[13:47:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[13:49:48] (CR) CI reject: [V:-1] cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[13:49:50] (CR) Xcollazo: [C:+1] Cleanup htmldumps role ready for decommisioning htmldumper1001 [puppet] - https://gerrit.wikimedia.org/r/1161480 (https://phabricator.wikimedia.org/T397434) (owner: Btullis)
[13:50:21] (PS2) Bking: cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146)
[13:51:44] (PS3) Bking: cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146)
[13:52:20] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[13:53:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:53:16] (CR) Michael Große: [C:+1] [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: Urbanecm)
[13:56:08] (CR) Btullis: [C:+2] Drop af_hit_count from abuse_filter views [puppet] - https://gerrit.wikimedia.org/r/1161931 (https://phabricator.wikimedia.org/T397508) (owner: Btullis)
[14:00:20] (CR) Jclark-ctr: [C:+2] Adding sretest2009 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1161949
(https://phabricator.wikimedia.org/T396365) (owner: Jhancock.wm)
[14:04:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:05:16] jouncebot: nowandnext
[14:05:17] For the next 16 hour(s) and 54 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250620T0700)
[14:05:17] In 16 hour(s) and 54 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250621T0700)
[14:05:26] Will proceed with deploying the security patch
[14:05:43] (PS4) Bking: cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146)
[14:06:02] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[14:09:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:14:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:14:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:15:02] (CR) Btullis: [C:+1] cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[14:16:15] !log dreamyjazz Deployed security patch for T397221
[14:16:44] (PS1) BBlack: Format interface-rps with black [puppet] - https://gerrit.wikimedia.org/r/1161976
[14:16:44] (PS1) BBlack: interface-rps: allow using ht siblings [puppet] - https://gerrit.wikimedia.org/r/1161977
[14:17:28] (CR) CI reject: [V:-1] interface-rps: allow using ht siblings [puppet] - https://gerrit.wikimedia.org/r/1161977 (owner: BBlack)
[14:21:43] SRE, cloud-services-team, Cloud-VPS: [cloudsw] enable 25G network - https://phabricator.wikimedia.org/T393676#10934635 (cmooney) Open→Resolved a:cmooney Closing this one @dcaro I believe all understand the current situation, and we can connect at 25G where it is possible. Please re-o...
[14:21:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:21:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:21:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:22:19] (CR) Bking: [C:+2] cirrussearch: stop monitoring snapshot repository [puppet] - https://gerrit.wikimedia.org/r/1161960 (https://phabricator.wikimedia.org/T357146) (owner: Bking)
[14:23:16] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:23:39] (PS2) BBlack: interface-rps: allow using ht siblings [puppet] - https://gerrit.wikimedia.org/r/1161977
[14:25:20] !log bking@cumin2002:~$ sudo cumin prometheus1007* 'run-puppet-agent' T357146
[14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:25] T357146: Monitor Elastic S3 repository status - https://phabricator.wikimedia.org/T357146
[14:25:52] !log dreamyjazz Deployed security patch for T397221
[14:26:29] (CR) Cathal Mooney: [C:+1] Adding sretest2009 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1161949 (https://phabricator.wikimedia.org/T396365) (owner: Jhancock.wm)
[14:26:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:28:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[14:29:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for
host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:29:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:33:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:33:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:33:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:36:10] SRE, Infrastructure-Foundations, Traffic: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10934684 (cmooney) Open→Resolved a:cmooney Gonna close this one, I trust you guys have the heads up on what can't go there just yet.
[14:36:20] ops-codfw, SRE, DC-Ops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10934688 (Jhancock.wm) Open→Resolved a:Jhancock.wm These have been shipped back to nokia. including the 10 transceivers and the extra rails they sent.
[14:38:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:41:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[14:41:40] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934705 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[14:41:41] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm
[14:41:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm
[14:41:47] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934706 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm
[14:41:49] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934707 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm
[14:43:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater -
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:48:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:53:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage
[14:53:39] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage
[14:53:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage
[14:53:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:57:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage
[14:58:58] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:01:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage
[15:04:20] SRE: [Reading Lists]
Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526 (Jdrewniak) NEW
[15:05:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:48] (PS4) Giuseppe Lavagetto: fetch_external_cloud: stop depending on requestctl libraries [puppet] - https://gerrit.wikimedia.org/r/1161869
[15:08:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:09:07] (CR) CI reject: [V:-1] fetch_external_cloud: stop depending on requestctl libraries [puppet] - https://gerrit.wikimedia.org/r/1161869 (owner: Giuseppe Lavagetto)
[15:13:11] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:13:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:13:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm
[15:13:39] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934826
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm complete...
[15:15:53] (PS1) Giuseppe Lavagetto: HIDDENPARMA: Add root stub api token [labs/private] - https://gerrit.wikimedia.org/r/1162016
[15:16:11] (CR) Giuseppe Lavagetto: [V:+2 C:+2] HIDDENPARMA: Add root stub api token [labs/private] - https://gerrit.wikimedia.org/r/1162016 (owner: Giuseppe Lavagetto)
[15:16:20] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:16:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:16:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:45] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1161869 (owner: Giuseppe Lavagetto)
[15:16:48] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934844 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm complete...
[15:17:52] (CR) Giuseppe Lavagetto: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1161869 (owner: Giuseppe Lavagetto)
[15:20:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:21:27] RECOVERY - mysqld processes on es2045 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[15:21:35] (CR) Bking: [C:-1] "These changes target elasticsearch, which we no longer use. We'll need to target the new role (cirrus::opensearch)" [puppet] - https://gerrit.wikimedia.org/r/1123652 (https://phabricator.wikimedia.org/T387309) (owner: Vgutierrez)
[15:21:43] RECOVERY - MariaDB read only es5 on es2045 is OK: Version 10.11.11-MariaDB-log, Uptime 43s, read_only: True, event_scheduler: True, 2.29 QPS, connection latency: 0.031214s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[15:22:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm
[15:22:48] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934858 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm complete...
[15:23:01] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es2045.codfw.wmnet
[15:23:01] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2045.codfw.wmnet
[15:24:03] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934863 (Jclark-ctr)
[15:24:42] (CR) Jforrester: [C:+1] "Thanks!" [extensions/CentralAuth] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1161950 (https://phabricator.wikimedia.org/T395372) (owner: Gergő Tisza)
[15:27:17] (CR) BryanDavis: "Cause of T397424." [puppet] - https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[15:28:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[15:28:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:31:07] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2045* slowly with 10 steps - Pooling in slowly
[15:32:38] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1162022
[15:36:52] (CR) DLynch: [C:+1] Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: Esanders)
[15:38:49] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] -
https://gerrit.wikimedia.org/r/1162023
[15:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:43:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:47:44] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:48:20] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:48:54] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:49:03] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:49:57] !log dancy@deploy1003 Installing scap version "4.181.0" for 2 host(s)
[15:51:46] !log dancy@deploy1003 Installation of scap version "4.181.0" completed for 2 hosts
[15:53:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:54:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:55:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:56:54] (PS1) Bking: cirrussearch: set host to correct lb pool [puppet] - https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610)
[15:58:25] (CR) Bking: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:59:02] (PS3) Kamila Součková: Update codfw pod ip range [puppet] - https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845)
[15:59:15] (CR) Kamila Součková: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1161930 (https://phabricator.wikimedia.org/T375845) (owner: Kamila Součková)
[16:01:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:02:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm
[16:02:22] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#10934977 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm
[16:03:18] (PS2) Bking: cirrussearch: set host to correct lb pool [puppet] - https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610)
[16:05:05] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[16:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:07:39] (PS3) Bking: cirrussearch: set host to correct lb pool [puppet] - https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610)
[16:08:02] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[16:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:54] (PS1) Bking: WIP: Rip out unused elasticsearch code [puppet] - https://gerrit.wikimedia.org/r/1162035 (https://phabricator.wikimedia.org/T388607)
[16:16:44] (PS2) Bking: WIP: Rip out unused elasticsearch code [puppet] - https://gerrit.wikimedia.org/r/1162035 (https://phabricator.wikimedia.org/T388607)
[16:17:01] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162035 (https://phabricator.wikimedia.org/T388607) (owner: Bking)
[16:17:09] (CR) CI reject: [V:-1] WIP: Rip out unused elasticsearch code [puppet] - https://gerrit.wikimedia.org/r/1162035 (https://phabricator.wikimedia.org/T388607) (owner: Bking)
[16:25:20] SRE, WE 3.3.4 Reading Lists on web: [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#10935084 (Jdrewniak)
[16:25:23] (CR) Andrew Bogott: [C:+2] cinder: use 'cinder' service user rather than 'novaadmin' [puppet] - https://gerrit.wikimedia.org/r/1161115 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott)
[16:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:31:19] (PS1) Andrew Bogott: cinder.pp: fix copy/paste error for service user password [puppet] - https://gerrit.wikimedia.org/r/1162038
[16:31:19] (PS1) Andrew Bogott: rabbitmq.pp: fix copy/paste error for hiera setting [puppet] - https://gerrit.wikimedia.org/r/1162039
[16:37:34] (CR) Andrew Bogott: [C:+2] cinder.pp:
fix copy/paste error for service user password [puppet] - https://gerrit.wikimedia.org/r/1162038 (owner: Andrew Bogott)
[16:37:40] (PS2) Andrew Bogott: cinder.pp: fix copy/paste error for service user password [puppet] - https://gerrit.wikimedia.org/r/1162038
[16:38:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.433s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:40:36] (CR) Andrew Bogott: [C:+2] cinder.pp: fix copy/paste error for service user password [puppet] - https://gerrit.wikimedia.org/r/1162038 (owner: Andrew Bogott)
[16:41:11] (PS2) Andrew Bogott: rabbitmq.pp: fix copy/paste error for hiera setting [puppet] - https://gerrit.wikimedia.org/r/1162039
[16:41:15] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1162039 (owner: Andrew Bogott)
[16:43:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.606s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:46:48] (CR) Andrew Bogott: [C:+2] rabbitmq.pp: fix copy/paste error for hiera setting [puppet] - https://gerrit.wikimedia.org/r/1162039 (owner: Andrew Bogott)
[16:52:40] (PS3) Federico Ceratto: CAS: Add wmf group for Zarcillo, remove ops [puppet] - https://gerrit.wikimedia.org/r/1161873 (https://phabricator.wikimedia.org/T395304)
[16:58:05] FIRING: [2x]
ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:01:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:05:11] jclark@cumin1002 provision (PID 2162360) is awaiting input [17:05:25] FIRING: [2x] SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:08:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:18:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:19:22] (03CR) 10AOkoth: [C:03+2] miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:21:12] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [17:21:23] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [17:23:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:25:37] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START 
helmfile.d/services/miscweb: apply [17:30:25] FIRING: [2x] SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:35:47] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [17:38:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:39:45] (03PS1) 10AOkoth: miscweb: fix os-reports sidecar entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162052 (https://phabricator.wikimedia.org/T350794) [17:44:25] (03CR) 10AOkoth: "https://integration.wikimedia.org/ci/job/helm-lint/25844/console" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162052 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:44:34] (03CR) 10AOkoth: [C:03+2] miscweb: fix os-reports sidecar entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162052 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:46:32] (03Merged) 10jenkins-bot: miscweb: fix os-reports sidecar entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162052 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [17:47:08] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2045* slowly with 10 steps - Pooling in slowly [17:47:36] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [17:49:41] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:49:44] !log gmodena@deploy1003 helmfile [staging] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [17:52:06] (03PS1) 10AOkoth: miscweb: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162053 [17:53:30] !log gmodena@deploy1003 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [17:53:35] !log gmodena@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:55:49] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:55:52] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:58:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:00:24] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [18:00:37] PROBLEM - puppetboard.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:01:07] (03CR) 10AOkoth: [C:03+2] miscweb: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162053 (owner: 10AOkoth) [18:01:27] RECOVERY - puppetboard.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:03:03] (03Merged) 10jenkins-bot: miscweb: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162053 (owner: 10AOkoth) [18:05:10] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [18:05:25] FIRING: [2x] SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:05:45] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [18:08:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:08:22] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [18:18:13] (03CR) 10Ssingh: "Thanks for letting us know. We will factor this in for the discussion under T358887." [puppet] - 10https://gerrit.wikimedia.org/r/1161397 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [18:19:53] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [18:21:56] (03PS1) 10AOkoth: miscweb: update os-reports args format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162058 [18:26:40] (03CR) 10Btullis: [C:03+1] cirrussearch: set host to correct lb pool [puppet] - 10https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:27:34] (03CR) 10AOkoth: [C:03+2] miscweb: update os-reports args format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162058 (owner: 10AOkoth) [18:29:01] (03PS3) 10Bking: WIP: Rip out unused elasticsearch code [puppet] - 10https://gerrit.wikimedia.org/r/1162035 (https://phabricator.wikimedia.org/T388607) [18:29:33] (03Merged) 10jenkins-bot: miscweb: update os-reports args format [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162058 (owner: 10AOkoth) [18:30:25] FIRING: [2x] SystemdUnitFailed: user@499.service on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:57] !log aokoth@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [18:32:23] !log aokoth@deploy1003 helmfile 
[aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [18:33:05] RESOLVED: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:33:49] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [18:34:21] !log aokoth@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [18:36:14] !log bking@cumin2002 conftool action : set/weight=10:pooled=no; selector: name=cirrussearch2113\.codfw\.wmnet [18:36:19] (03CR) 10Bking: [C:03+2] cirrussearch: set host to correct lb pool [puppet] - 10https://gerrit.wikimedia.org/r/1162029 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:51:30] (03PS1) 10Andrew Bogott: Added stand-in passwords for nova service user [labs/private] - 10https://gerrit.wikimedia.org/r/1162060 (https://phabricator.wikimedia.org/T330759) [19:01:01] (03PS1) 10Andrew Bogott: profile::openstack::base::cinder: fix ldap password lookup [puppet] - 10https://gerrit.wikimedia.org/r/1162062 [19:01:01] (03PS1) 10Andrew Bogott: Openstack Nova: use 'novaservice' service user rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1162063 (https://phabricator.wikimedia.org/T330759) [19:02:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162062 (owner: 10Andrew Bogott) [19:02:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162063 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:06:48] (03CR) 10Andrew Bogott: [C:03+2] profile::openstack::base::cinder: fix ldap password lookup [puppet] - 10https://gerrit.wikimedia.org/r/1162062 (owner: 10Andrew Bogott) [19:08:05] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Added stand-in passwords for nova service user [labs/private] - 10https://gerrit.wikimedia.org/r/1162060 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) 
[19:08:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162063 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:10:03] !log sudo cumin 'A:cp' "disable-puppet 'merging CR 1160381'": T390924 [19:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:09] T390924: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924 [19:12:21] (03CR) 10Andrew Bogott: [C:03+2] Openstack Nova: use 'novaservice' service user rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1162063 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:12:39] (03CR) 10Ssingh: [C:03+2] varnish: Set X-Analytics `ismobile=1` for mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/1160381 (https://phabricator.wikimedia.org/T390924) (owner: 10Krinkle) [19:16:08] !log enabling puppet on cp4037 to merge CR 1160381: add `ismobile=1' for mobile requests: T390924 [19:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:14] T390924: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924 [19:19:05] !log sudo cumin -b11 'A:cp' "run-puppet-agent 'merging CR 1160381'": T390924 [19:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:22] !log sudo cumin -b11 'A:cp' "run-puppet-agent --enable 'merging CR 1160381'": T390924 [19:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:05] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:25:05] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:26:05] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:26:31] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:26:43] (03PS1) 10Andrew Bogott: nova.conf: use novaservice username in more places [puppet] - 10https://gerrit.wikimedia.org/r/1162069 [19:27:05] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:27:29] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:27:31] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:28:03] (03CR) 10Andrew Bogott: [C:03+2] nova.conf: use novaservice username in more places [puppet] - 10https://gerrit.wikimedia.org/r/1162069 (owner: 10Andrew Bogott) [19:28:09] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:28:29] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:28:31] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:28:57] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:29:09] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:29:31] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:29:57] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:30:51] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:31:05] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:31:27] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:31:51] RECOVERY - nova-compute proc minimum on 
cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:32:05] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:32:27] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:35:27] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:36:09] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:36:27] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:36:32] (03PS1) 10Andrew Bogott: Revert "Openstack Nova: use 'novaservice' service user rather than novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1162071 [19:37:09] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:39:43] (03CR) 10Andrew Bogott: [C:03+2] Revert "Openstack Nova: use 'novaservice' service user rather than novaadmin" [puppet] - 10https://gerrit.wikimedia.org/r/1162071 (owner: 10Andrew Bogott) [19:40:09] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: 
PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:41:09] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:41:51] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:42:51] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:04:52] (03PS1) 10Andrew Bogott: Openstack Nova: use 'novaservice' service user rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1162077 (https://phabricator.wikimedia.org/T330759) [20:07:05] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:07:11] (03PS2) 10Andrew Bogott: Openstack Nova: use 'novaservice' service user rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1162077 (https://phabricator.wikimedia.org/T330759) [20:08:29] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:39] (03PS1) 10CDobbins: geo-maps: update default for South America [dns] - 
10https://gerrit.wikimedia.org/r/1162078 [20:09:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162077 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:10:19] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:15:23] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162077 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:17:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:17:17] (03CR) 10Andrew Bogott: [C:03+2] Openstack Nova: use 'novaservice' service user rather than novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/1162077 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [20:22:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:25:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:26:47] anyone around? 
[20:32:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:33:29] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:47:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:48:50] (03PS19) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [20:48:50] (03CR) 10Andrea Denisse: "Hi folks, I tested this with Pontoon on phi-syslog-01, logs are being logged like this:" [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [20:49:39] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [20:52:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:57:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:58:29] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:04:41] (03PS1) 10Clare Ming: xLab: Deploy v0.7.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162089 (https://phabricator.wikimedia.org/T396045) [21:06:20] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.7.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162089 (https://phabricator.wikimedia.org/T396045) (owner: 10Clare Ming) [21:07:55] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.2 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162089 (https://phabricator.wikimedia.org/T396045) (owner: 10Clare Ming) [21:12:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:14:16] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [21:14:42] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [21:17:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:22:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:22:35] (03PS1) 10Andrew Bogott: Cinder: reduce rpc_response_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1162094 (https://phabricator.wikimedia.org/T397517) [21:24:45] (03CR) 10Andrew Bogott: [C:03+2] Cinder: reduce rpc_response_timeout [puppet] - 
10https://gerrit.wikimedia.org/r/1162094 (https://phabricator.wikimedia.org/T397517) (owner: 10Andrew Bogott) [21:27:04] (03PS5) 10Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) [21:27:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:30:01] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [21:30:13] dduvall: that'll do it! [21:30:55] Oops wrong channel [21:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:47:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:47:47] PROBLEM - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [21:48:37] RECOVERY - debmonitor.discovery.wmnet:443 internal on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 200 - 679 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [21:52:05] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:00:19] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate 
resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:48] FIRING: PuppetFailure: Puppet has failed on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:22:05] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:26:35] FIRING: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:31:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:38:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1162114 [23:38:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1162114 (owner: 10TrainBranchBot) [23:43:35] FIRING: [2x] ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:43:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure 
[23:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:51:04] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1162114 (owner: 10TrainBranchBot) [23:58:35] RESOLVED: ErrorBudgetBurn: citoid-requests codfw - https://slo.wikimedia.org/?search=citoid-requests - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn