[00:00:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P75368 and previous config saved to /var/cache/conftool/dbconfig/20250424-000028-fceratto.json
[00:00:42] (PS2) RLazarus: admin_ng: Read RoleBinding usernames from services hiera [deployment-charts] - https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429)
[00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:06:58] (PS3) RLazarus: admin_ng: Read RoleBinding usernames from services hiera [deployment-charts] - https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429)
[00:10:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138501
[00:10:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138501 (owner: TrainBranchBot)
[00:10:46] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:12:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:15:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to
https://phabricator.wikimedia.org/P75369 and previous config saved to /var/cache/conftool/dbconfig/20250424-001535-fceratto.json
[00:16:46] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:17:04] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e1-codfw - pt1979@cumin2002"
[00:17:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e1-codfw - pt1979@cumin2002"
[00:17:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:17:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-e1-codfw.mgmt.codfw.wmnet
[00:17:32] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-e1-codfw.mgmt.codfw.wmnet
[00:17:34] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:17:42] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:21:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e1-codfw - pt1979@cumin2002"
[00:21:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e1-codfw - pt1979@cumin2002"
[00:21:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:29:03] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138501 (owner: TrainBranchBot)
[00:30:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T391056)', diff saved to https://phabricator.wikimedia.org/P75370 and previous config saved to /var/cache/conftool/dbconfig/20250424-003043-fceratto.json
[00:30:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[00:31:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[00:32:31] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10763369 (Dzahn)
[00:32:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-e1-codfw.mgmt.codfw.wmnet
[00:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:11] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10763372 (Dzahn)
[00:36:44] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10763374 (Dzahn)
[00:37:04] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10763376 (Dzahn)
[00:39:46] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#10763379 (Dzahn)
[00:43:45] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device ssw1-e1-codfw
[00:43:51] !log pt1979@cumin2002
END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-codfw
[00:44:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[00:44:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:44:48] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:45:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:45:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:45:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:45:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:45:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:46:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:47:42] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:52:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 -
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:52:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-e1-codfw.mgmt.codfw.wmnet
[00:53:27] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device lsw1-e1-codfw
[00:53:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e1-codfw
[00:54:28] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-e3-codfw.mgmt.codfw.wmnet
[00:54:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:54:31] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[00:54:45] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:55:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:55:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:55:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:55:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:56:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:56:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/25d49b0394643683cc29be506bf489ef18ed93ed14947155d239ad5d049934f3/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:59:51] !log pt1979@cumin2002
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e3-codfw - pt1979@cumin2002"
[01:00:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm
[01:01:01] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763390 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm
[01:01:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2046.codfw.wmnet with OS bookworm
[01:01:39] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm
[01:01:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[01:01:45] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763392 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[01:01:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2048.codfw.wmnet with OS bookworm
[01:02:00] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763393 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm
[01:02:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2049.codfw.wmnet with OS bookworm
[01:02:10]
!log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1251.eqiad.wmnet with reason: Maintenance
[01:02:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e3-codfw - pt1979@cumin2002"
[01:02:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:02:16] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763394 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2049.codfw.wmnet with OS bookworm
[01:02:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1251 (T391056)', diff saved to https://phabricator.wikimedia.org/P75371 and previous config saved to /var/cache/conftool/dbconfig/20250424-010217-fceratto.json
[01:02:24] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:07:08] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:08:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:08:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[01:13:53] jhancock@cumin2002 reimage (PID 2662713) is awaiting input
[01:16:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2049.codfw.wmnet with reason: host reimage
[01:18:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T391056)', diff saved to https://phabricator.wikimedia.org/P75372 and previous config saved to /var/cache/conftool/dbconfig/20250424-011807-fceratto.json
[01:18:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:19:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2049.codfw.wmnet with reason: host reimage
[01:28:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm
[01:28:45] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763412 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err...
[01:33:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P75373 and previous config saved to /var/cache/conftool/dbconfig/20250424-013313-fceratto.json
[01:33:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-e3-codfw.mgmt.codfw.wmnet
[01:36:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:37:05] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:39:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:39:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2049.codfw.wmnet with OS bookworm
[01:39:36] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763420 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2049.codfw.wmnet with OS bookworm completed: - gane...
[01:40:21] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10763423 (Jhancock.wm)
[01:40:25] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device lsw1-e3-codfw
[01:40:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e3-codfw
[01:40:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:48:06] jhancock@cumin2002 reimage (PID 2663581) is awaiting input
[01:48:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P75374 and previous config saved to /var/cache/conftool/dbconfig/20250424-014821-fceratto.json
[01:48:22] jhancock@cumin2002 reimage (PID 2663836) is awaiting input
[01:50:34] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10763427 (Papaul)
[01:51:27] (PS1) Pppery: Check for shared domain in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994)
[01:52:15] (CR) CI reject: [V:-1] Check for shared domain in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[01:52:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:00:25] (PS2) Pppery: Check for shared domain in
missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994)
[02:03:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T391056)', diff saved to https://phabricator.wikimedia.org/P75375 and previous config saved to /var/cache/conftool/dbconfig/20250424-020328-fceratto.json
[02:03:32] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10763437 (Papaul)
[02:03:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:03:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[02:05:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:48] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:17:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: Maintenance
[02:24:32] jhancock@cumin2002 reimage (PID 2663395) is awaiting input
[02:32:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: Maintenance
[02:32:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P75376 and previous config saved to /var/cache/conftool/dbconfig/20250424-023220-fceratto.json
[02:32:26] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:51:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P75377 and previous config saved to /var/cache/conftool/dbconfig/20250424-025140-fceratto.json
[02:51:45] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[03:06:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P75378 and previous config saved to /var/cache/conftool/dbconfig/20250424-030647-fceratto.json
[03:21:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P75379 and previous config saved to /var/cache/conftool/dbconfig/20250424-032154-fceratto.json
[03:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P75380 and previous config saved to /var/cache/conftool/dbconfig/20250424-033701-fceratto.json
[03:37:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[03:37:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2146.codfw.wmnet with reason: Maintenance
[03:37:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P75381 and previous config saved to /var/cache/conftool/dbconfig/20250424-033724-fceratto.json
[03:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on
apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P75382 and previous config saved to /var/cache/conftool/dbconfig/20250424-035609-fceratto.json
[03:56:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[03:58:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:11:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P75383 and previous config saved to /var/cache/conftool/dbconfig/20250424-041116-fceratto.json
[04:26:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P75384 and previous config saved to /var/cache/conftool/dbconfig/20250424-042623-fceratto.json
[04:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:48] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2115 MB (3%
inode=95%): /tmp 2115 MB (3% inode=95%): /var/tmp 2115 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops
[04:41:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P75385 and previous config saved to /var/cache/conftool/dbconfig/20250424-044130-fceratto.json
[04:41:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[04:41:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: Maintenance
[04:41:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P75386 and previous config saved to /var/cache/conftool/dbconfig/20250424-044153-fceratto.json
[05:02:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P75387 and previous config saved to /var/cache/conftool/dbconfig/20250424-050247-fceratto.json
[05:02:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[05:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[05:17:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P75388 and previous config saved to /var/cache/conftool/dbconfig/20250424-051753-fceratto.json
[05:18:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:23:02] (PS1) Marostegui: installserver: Allow reimage of pc1018 [puppet] - https://gerrit.wikimedia.org/r/1138520 (https://phabricator.wikimedia.org/T392492)
[05:25:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:25:37] (CR) Marostegui: [C:+2] installserver: Allow reimage of pc1018 [puppet] - https://gerrit.wikimedia.org/r/1138520 (https://phabricator.wikimedia.org/T392492) (owner: Marostegui)
[05:25:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:44] (PS1) Marostegui: mariadb: Add pc1018 to insetup [puppet] - https://gerrit.wikimedia.org/r/1138521 (https://phabricator.wikimedia.org/T392492)
[05:31:23] (CR) Marostegui: [C:+2] mariadb: Add pc1018 to insetup [puppet] -
https://gerrit.wikimedia.org/r/1138521 (https://phabricator.wikimedia.org/T392492) (owner: Marostegui)
[05:32:19] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10763639 (Marostegui)
[05:32:35] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10763640 (Marostegui) Patches are merged.
[05:33:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P75389 and previous config saved to /var/cache/conftool/dbconfig/20250424-053301-fceratto.json
[05:40:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:46:27] (CR) Arnaudb: gerrit: switchover to gerrit1003 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:48:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P75390 and previous config saved to /var/cache/conftool/dbconfig/20250424-054808-fceratto.json
[05:48:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[05:48:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: Maintenance
[05:48:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75391 and previous config saved to /var/cache/conftool/dbconfig/20250424-054831-fceratto.json
[06:00:06] Deploy window MediaWiki
infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T0600)
[06:00:06] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T0600).
[06:05:26] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75392 and previous config saved to /var/cache/conftool/dbconfig/20250424-060628-fceratto.json
[06:06:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[06:21:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs.
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:21:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P75393 and previous config saved to /var/cache/conftool/dbconfig/20250424-062135-fceratto.json
[06:23:10] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:28:41] FIRING: [7x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1027 T391921', diff saved to https://phabricator.wikimedia.org/P75394 and previous config saved to /var/cache/conftool/dbconfig/20250424-063345-marostegui.json
[06:33:49] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:34:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1027.eqiad.wmnet with reason: Maintenance
[06:34:40] (PS1) Marostegui: es1027: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138617 (https://phabricator.wikimedia.org/T391921)
[06:35:20] (CR) Marostegui: [C:+2] es1027: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138617 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to
https://phabricator.wikimedia.org/P75395 and previous config saved to /var/cache/conftool/dbconfig/20250424-063643-fceratto.json [06:39:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75396 and previous config saved to /var/cache/conftool/dbconfig/20250424-063922-root.json [06:41:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:42:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:42:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1255.eqiad.wmnet with reason: Maintenance [06:42:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1256.eqiad.wmnet with reason: Maintenance [06:45:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1257.eqiad.wmnet with reason: Maintenance [06:51:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75397 and previous config saved to /var/cache/conftool/dbconfig/20250424-065149-fceratto.json [06:51:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [06:52:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: Maintenance [06:52:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [06:52:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[06:52:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P75398 and previous config saved to /var/cache/conftool/dbconfig/20250424-065227-fceratto.json
[06:52:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[06:53:04] (CR) Brouberol: [C:+2] data-platform: monitor namespace resource quota usage over time [alerts] - https://gerrit.wikimedia.org/r/1138346 (https://phabricator.wikimedia.org/T389777) (owner: Brouberol)
[06:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75399 and previous config saved to /var/cache/conftool/dbconfig/20250424-065428-root.json
[07:00:05] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:09] (03CR) 10Elukey: [C:03+1] Add a Cumin alias to select UEFI-enabled servers [puppet] - 10https://gerrit.wikimedia.org/r/1138397 (https://phabricator.wikimedia.org/T389217) (owner: 10Muehlenhoff) [07:09:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75400 and previous config saved to /var/cache/conftool/dbconfig/20250424-070933-root.json [07:10:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P75401 and previous config saved to /var/cache/conftool/dbconfig/20250424-071001-fceratto.json [07:10:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [07:24:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75402 and previous config saved to /var/cache/conftool/dbconfig/20250424-072439-root.json [07:25:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P75403 and previous config saved to /var/cache/conftool/dbconfig/20250424-072508-fceratto.json [07:25:26] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:43] (03PS1) 10Elukey: fetch_external_cloud_vendors_nets: fix corner case [puppet] - 10https://gerrit.wikimedia.org/r/1138634 [07:33:04] (03CR) 10Filippo Giunchedi: "Thank you for the review! 
I'll be merging this early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333) (owner: 10Filippo Giunchedi) [07:39:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75404 and previous config saved to /var/cache/conftool/dbconfig/20250424-073943-root.json [07:40:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P75405 and previous config saved to /var/cache/conftool/dbconfig/20250424-074016-fceratto.json [07:42:16] (03PS1) 10Marostegui: db1255: Make note about its future role [puppet] - 10https://gerrit.wikimedia.org/r/1138673 (https://phabricator.wikimedia.org/T390530) [07:42:31] (03CR) 10Vgutierrez: varnish: Add basic edge uniques handling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [07:42:53] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1138673 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [07:43:21] (03CR) 10Marostegui: [C:03+2] db1255: Make note about its future role [puppet] - 10https://gerrit.wikimedia.org/r/1138673 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [07:45:42] (03PS4) 10Elukey: profile::prometheus::k8s: drop istio gateway labels for ML [puppet] - 10https://gerrit.wikimedia.org/r/1138313 (https://phabricator.wikimedia.org/T387350) [07:45:42] (03PS1) 10Elukey: profile::pyrra: avoid Istio recording rules for SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) [07:46:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:47:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 
bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:59] (03CR) 10MVernon: "Daft question - is that "None" coming from a NoneType elsewhere being coerced into a String (and should maybe have been discarded there ra" [puppet] - 10https://gerrit.wikimedia.org/r/1138634 (owner: 10Elukey) [07:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75407 and previous config saved to /var/cache/conftool/dbconfig/20250424-075448-root.json [07:55:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P75408 and previous config saved to /var/cache/conftool/dbconfig/20250424-075524-fceratto.json [07:55:28] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [07:55:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: Maintenance [07:55:43] (03CR) 10Elukey: "If you check in https://geoip.linode.com/, at some point there is this line:" [puppet] - 10https://gerrit.wikimedia.org/r/1138634 (owner: 10Elukey) [07:55:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75409 and previous 
config saved to /var/cache/conftool/dbconfig/20250424-075547-fceratto.json [08:04:15] (03PS1) 10Marostegui: db2241: Make note about its future role [puppet] - 10https://gerrit.wikimedia.org/r/1138676 (https://phabricator.wikimedia.org/T390530) [08:04:47] (03CR) 10Marostegui: [C:03+2] db2241: Make note about its future role [puppet] - 10https://gerrit.wikimedia.org/r/1138676 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [08:04:59] (03CR) 10Marostegui: [C:03+2] "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1138676 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [08:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75410 and previous config saved to /var/cache/conftool/dbconfig/20250424-080953-root.json [08:10:21] (03CR) 10Muehlenhoff: [C:03+2] Add a Cumin alias to select UEFI-enabled servers [puppet] - 10https://gerrit.wikimedia.org/r/1138397 (https://phabricator.wikimedia.org/T389217) (owner: 10Muehlenhoff) [08:13:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75411 and previous config saved to /var/cache/conftool/dbconfig/20250424-081350-fceratto.json [08:13:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:18:32] (03CR) 10MVernon: [C:03+1] "Ah, yes. Sadness." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138634 (owner: 10Elukey) [08:19:25] (03CR) 10MVernon: [C:03+1] restbase: add/remove new/old hosts to/from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1138480 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [08:21:34] (03CR) 10Jaime Nuche: "Another idea (potentially even less strict?):" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [08:22:58] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:23:54] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:24:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75412 and previous config saved to /var/cache/conftool/dbconfig/20250424-082458-root.json [08:28:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75413 and previous config saved to /var/cache/conftool/dbconfig/20250424-082857-fceratto.json [08:32:47] (03CR) 10Muehlenhoff: "PCC looks fine as well: https://puppet-compiler.wmflabs.org/output/1138377/5338/" [puppet] - 10https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [08:32:49] (03CR) 10Muehlenhoff: [C:03+2] Make krb1002 a KDC [puppet] - 10https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [08:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:13] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10763904 (10MoritzMuehlenhoff) [08:38:50] (03PS1) 10Majavah: P:toolforge: legacy_redirector: Handle *.www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1138681 [08:39:22] !log installing reprepro bugfix updates from Bookworm point release [08:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75414 and previous config saved to /var/cache/conftool/dbconfig/20250424-084004-root.json [08:41:46] hi urbanecm, can you revisit and review https://gerrit.wikimedia.org/r/1100228? blockers of this task got removed [08:44:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P75415 and previous config saved to /var/cache/conftool/dbconfig/20250424-084404-fceratto.json [08:44:25] FIRING: SystemdUnitFailed: krb5-kdc.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:48:01] (03CR) 10Elukey: [C:03+2] fetch_external_cloud_vendors_nets: fix corner case [puppet] - 10https://gerrit.wikimedia.org/r/1138634 (owner: 10Elukey) [08:49:25] FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:45] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:54:25] FIRING: [2x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:56:04] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetFailure (instance an-worker1068:9100) - https://phabricator.wikimedia.org/T392554 (10LSobanski) 03NEW [08:56:20] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: DiskSpace (instance analytics1071:9100) - https://phabricator.wikimedia.org/T392555 (10LSobanski) 03NEW [08:59:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75416 and previous config saved to /var/cache/conftool/dbconfig/20250424-085911-fceratto.json [08:59:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:59:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: Maintenance [08:59:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P75417 and previous config saved to /var/cache/conftool/dbconfig/20250424-085933-fceratto.json [08:59:45] (03CR) 10DCausse: [C:03+2] search: Update envoy alerts for discovery dns names [alerts] - 10https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [09:01:33] (03PS1) 10Muehlenhoff: Add krb1002 to kerberos_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/1138684 (https://phabricator.wikimedia.org/T390863) [09:01:37] (03Merged) 10jenkins-bot: search: Update envoy alerts for discovery dns names [alerts] - 10https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [09:01:39] (03CR) 
10Majavah: [C:03+2] P:toolforge: legacy_redirector: Handle *.www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/1138681 (owner: 10Majavah) [09:06:24] (03PS2) 10Muehlenhoff: Add krb1002 to kerberos_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/1138684 (https://phabricator.wikimedia.org/T390863) [09:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [09:08:48] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5339/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [09:10:00] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:10:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138684 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [09:10:56] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:11] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [09:16:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P75418 and previous config saved to /var/cache/conftool/dbconfig/20250424-091622-fceratto.json [09:16:26] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [09:22:04] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit/nftables_throttling: add tracking_duration parameter [puppet] - 10https://gerrit.wikimedia.org/r/1138308 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [09:23:36] (03PS1) 10Vgutierrez: conftool: Remove mentions to elastic2064 [puppet] - 10https://gerrit.wikimedia.org/r/1138687 (https://phabricator.wikimedia.org/T388610) [09:25:07] (03PS1) 10Federico Ceratto: zarcillo: values.yaml: Update container path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138688 (https://phabricator.wikimedia.org/T384212) [09:25:27] (03CR) 10Elukey: [C:03+1] conftool: Remove mentions to elastic2064 [puppet] - 10https://gerrit.wikimedia.org/r/1138687 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [09:25:47] (03CR) 10Alexandros Kosiaris: [C:03+1] zarcillo: values.yaml: Update container path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138688 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:25:53] (03CR) 10Vgutierrez: [C:03+2] conftool: Remove mentions to elastic2064 [puppet] - 10https://gerrit.wikimedia.org/r/1138687 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [09:26:59] (03PS1) 10Hnowlan: mediawiki: migrate startupregistrystats-mediawikiwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) [09:27:18] !log depool thanos-fe200[1-3] pending decommissioning T391352 [09:27:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:23] (03CR) 10CI reject: [V:04-1] mediawiki: migrate startupregistrystats-mediawikiwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) (owner: 10Hnowlan) [09:27:53] (03PS1) 10Jelto: Revert "gerrit/nftables_throttling: add tracking_duration parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1138690 (https://phabricator.wikimedia.org/T392467) [09:28:48] (03PS2) 10Hnowlan: mediawiki: migrate startupregistrystats-mediawikiwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) [09:30:16] (03CR) 10Jelto: [C:03+2] Revert "gerrit/nftables_throttling: add tracking_duration parameter" [puppet] - 10https://gerrit.wikimedia.org/r/1138690 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [09:30:46] (03PS1) 10Muehlenhoff: Record LDAP acess for owresch [puppet] - 10https://gerrit.wikimedia.org/r/1138691 [09:31:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75419 and previous config saved to /var/cache/conftool/dbconfig/20250424-093128-fceratto.json [09:32:10] (03CR) 10Federico Ceratto: [C:03+1] zarcillo: values.yaml: Update container path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138688 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:32:12] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: values.yaml: Update container path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138688 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:33:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:33:25] (PS1) Hnowlan: trafficserver: route all PCS routes via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724)
[09:33:39] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) (owner: Hnowlan)
[09:34:44] (PS1) Vgutierrez: conftool: Remove mentions to elastic2094 [puppet] - https://gerrit.wikimedia.org/r/1138693 (https://phabricator.wikimedia.org/T388610)
[09:35:23] ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10764078 (akosiaris)
[09:35:58] (CR) Elukey: [C:+1] conftool: Remove mentions to elastic2094 [puppet] - https://gerrit.wikimedia.org/r/1138693 (https://phabricator.wikimedia.org/T388610) (owner: Vgutierrez)
[09:36:24] (CR) Vgutierrez: [C:+2] conftool: Remove mentions to elastic2094 [puppet] - https://gerrit.wikimedia.org/r/1138693 (https://phabricator.wikimedia.org/T388610) (owner: Vgutierrez)
[09:37:17] ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10764083 (akosiaris) @VRiley-WMF , thanks for following up on this. Yes, they are ready for dc ops to take over and finish decom. I 've updated the tas...
[09:40:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:40:59] (03PS1) 10Muehlenhoff: Add d-i config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138694 [09:41:29] (03CR) 10CI reject: [V:04-1] Add d-i config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138694 (owner: 10Muehlenhoff) [09:41:29] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:42:34] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10764092 (10MoritzMuehlenhoff) [09:44:57] (03PS2) 10Muehlenhoff: Add d-i config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138694 (https://phabricator.wikimedia.org/T391083) [09:46:30] (03PS1) 10Vgutierrez: conftool: Remove no longer existent elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138695 (https://phabricator.wikimedia.org/T388610) [09:46:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P75420 and previous config saved to /var/cache/conftool/dbconfig/20250424-094635-fceratto.json [09:48:11] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:51:14] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127851 (owner: 10Ilias Sarantopoulos) [09:51:33] FIRING: 
KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:40] (03Merged) 10jenkins-bot: ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127851 (owner: 10Ilias Sarantopoulos) [09:53:11] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:55:53] (03CR) 10Elukey: [C:03+1] conftool: Remove no longer existent elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138695 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:47] (03CR) 10Vgutierrez: [C:03+2] conftool: Remove no longer existent elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138695 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [09:58:15] (03PS1) 10Muehlenhoff: Add pxelinux config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138697 (https://phabricator.wikimedia.org/T391083) [09:59:34] (03PS1) 10Muehlenhoff: Add trixie to the list of supported OSes [cookbooks] - 
10https://gerrit.wikimedia.org/r/1138698 (https://phabricator.wikimedia.org/T391083) [09:59:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1000) [10:01:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P75421 and previous config saved to /var/cache/conftool/dbconfig/20250424-100143-fceratto.json [10:01:48] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:01:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: Maintenance [10:02:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75422 and previous config saved to /var/cache/conftool/dbconfig/20250424-100206-fceratto.json [10:02:56] (03PS1) 10Vgutierrez: conftool: Remove mentions to elastic2095 [puppet] - 10https://gerrit.wikimedia.org/r/1138704 (https://phabricator.wikimedia.org/T388610) [10:04:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:06:54] (03CR) 10Vgutierrez: [C:03+2] conftool: Remove mentions to elastic2095 [puppet] - 10https://gerrit.wikimedia.org/r/1138704 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [10:09:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10764187 (10phaultfinder) [10:10:41] RESOLVED: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:13:11] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:14:36] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP acess for owresch [puppet] - 10https://gerrit.wikimedia.org/r/1138691 (owner: 10Muehlenhoff) [10:16:24] (03CR) 10FNegri: P:toolforge::mailrelay: Pull WMCS IP space from network module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1138411 (owner: 10Majavah) [10:17:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75423 and previous config saved to /var/cache/conftool/dbconfig/20250424-101710-fceratto.json [10:17:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:18:11] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:21:28] (03CR) 10Majavah: [V:03+1] P:toolforge::mailrelay: Pull WMCS IP space from network module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1138411 (owner: 10Majavah) [10:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:22:16] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [10:26:36] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: neutron updates - aborrero@cumin1002" [10:26:57] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: neutron updates - aborrero@cumin1002" [10:26:57] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:27:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:17] !log aborrero@cumin1002 START - Cookbook sre.dns.wipe-cache cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org on all recursors [10:29:20] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org on all recursors [10:30:04] (03PS1) 10Marostegui: sections.yaml: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1138711 (https://phabricator.wikimedia.org/T390530) [10:32:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75424 and previous config saved to /var/cache/conftool/dbconfig/20250424-103217-fceratto.json [10:33:24] (03CR) 10FNegri: [C:03+1] P:toolforge::mailrelay: Pull WMCS IP space from network module [puppet] - 10https://gerrit.wikimedia.org/r/1138411 (owner: 10Majavah) [10:33:54] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [10:34:40] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::mailrelay: Pull WMCS IP space from network module [puppet] - 10https://gerrit.wikimedia.org/r/1138411 (owner: 10Majavah) [10:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:25] FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:34] aborrero@cumin1002 netbox (PID 2239214) is 
awaiting input [10:40:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:54] (03CR) 10Jgiannelos: "Can we also skip `zhwiki`? It is known to cause complications so maybe its prone to errors if we roll it out with all the rest of the host" [puppet] - 10https://gerrit.wikimedia.org/r/1138692 (https://phabricator.wikimedia.org/T390724) (owner: 10Hnowlan) [10:45:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:11] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [10:47:17] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [10:47:17] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:47:23] !log aborrero@cumin1002 START - Cookbook sre.dns.wipe-cache virt.cloudgw.eqiad1.wikimediacloud.org on all recursors [10:47:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P75425 and previous config saved to /var/cache/conftool/dbconfig/20250424-104723-fceratto.json [10:47:27] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache 
(exit_code=0) virt.cloudgw.eqiad1.wikimediacloud.org on all recursors [10:56:11] (03PS1) 10Muehlenhoff: Remove obsolete videoscaler cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1138713 (https://phabricator.wikimedia.org/T360636) [10:56:45] (03PS2) 10Muehlenhoff: Remove obsolete videoscaler cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1138713 (https://phabricator.wikimedia.org/T360636) [10:57:10] (03CR) 10Ladsgroup: "let me first add it to mediawiki config to ignore it so it doesn't explode" [puppet] - 10https://gerrit.wikimedia.org/r/1138711 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [10:59:20] (03PS1) 10Ladsgroup: Add support for x3 db cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138714 (https://phabricator.wikimedia.org/T351820) [10:59:41] (03CR) 10Ladsgroup: "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138714" [puppet] - 10https://gerrit.wikimedia.org/r/1138711 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [11:02:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75426 and previous config saved to /var/cache/conftool/dbconfig/20250424-110230-fceratto.json [11:02:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:02:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:02:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: Maintenance [11:02:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:03:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response 
time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:03:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.251 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:04:02] (03PS2) 10Majavah: Add WMCS ranges to wgAutoblockExemptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137732 (https://phabricator.wikimedia.org/T386689) [11:09:59] (03CR) 10Elukey: [C:03+1] conftool: Remove mentions to elastic2095 [puppet] - 10https://gerrit.wikimedia.org/r/1138704 (https://phabricator.wikimedia.org/T388610) (owner: 10Vgutierrez) [11:11:17] !log installing python-urllib3 security updates [11:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2212.codfw.wmnet with reason: Maintenance [11:16:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T391056)', diff saved to https://phabricator.wikimedia.org/P75427 and previous config saved to /var/cache/conftool/dbconfig/20250424-111625-fceratto.json [11:16:30] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:18:06] (03PS1) 10Muehlenhoff: Record LDAP access for jmerino-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1138715 [11:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:25:29] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for jmerino-wmf [puppet] - 10https://gerrit.wikimedia.org/r/1138715 (owner: 10Muehlenhoff) [11:29:11] 10SRE-swift-storage, 06Commons, 10Thumbor: Error: 429, Too Many Requests - 
https://phabricator.wikimedia.org/T392348#10764418 (10AntiCompositeNumber) Hm.... Trying to load the original https://upload.wikimedia.org/wikipedia/commons/c/c5/Himalaya%2C_Indian_Atlas%2C_sheet_66_%2815219000%29.jpg in Firefox ends... [11:29:41] 10SRE-swift-storage, 06Commons, 10Thumbor: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T392348#10764420 (10AntiCompositeNumber) [11:31:45] (03CR) 10Lucas Werkmeister (WMDE): "Announced two weeks ago, to be deployed today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [11:32:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T391056)', diff saved to https://phabricator.wikimedia.org/P75428 and previous config saved to /var/cache/conftool/dbconfig/20250424-113234-fceratto.json [11:32:40] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:34:04] (03PS2) 10Lucas Werkmeister (WMDE): Fix EntitySchema propertyType on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) [11:34:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [11:34:31] FIRING: ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:20] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: cleanup and simplify [puppet] - 10https://gerrit.wikimedia.org/r/1138718 
(https://phabricator.wikimedia.org/T380174) [11:39:31] RESOLVED: ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:14] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:44:42] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:44:42] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:44:52] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:44:52] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:45:44] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:45:44] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 3.429 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:46:40] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:46:40] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:47:28] (03PS1) 10Jelto: gerrit: add tencent IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138720 (https://phabricator.wikimedia.org/T392467) [11:47:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75429 and previous config saved to /var/cache/conftool/dbconfig/20250424-114742-fceratto.json [11:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:48:52] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:49:42] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:49:42] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:51:03] (03CR) 10Jelto: [C:03+2] gerrit: add tencent IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138720 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [11:51:08] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: cleanup and simplify [puppet] - 10https://gerrit.wikimedia.org/r/1138718 (https://phabricator.wikimedia.org/T380174) (owner: 
10Arturo Borrero Gonzalez) [11:51:35] (03CR) 10Arnaudb: [C:03+1] gerrit: add tencent IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138720 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [11:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:53] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:59] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:54:48] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 7.187 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:55:04] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:55:32] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:55:32] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Thu 22 May 2025 06:12:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1200) [12:00:10] (03CR) 10Marostegui: [C:03+1] Add support for x3 db cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138714 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [12:02:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P75430 and previous config saved to /var/cache/conftool/dbconfig/20250424-120249-fceratto.json [12:05:00] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:05:58] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:51] (03Abandoned) 10Muehlenhoff: site: Add drmrs ganeti instances [puppet] - 10https://gerrit.wikimedia.org/r/736820 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2032 T391921', diff saved to https://phabricator.wikimedia.org/P75431 and previous config saved to /var/cache/conftool/dbconfig/20250424-121152-marostegui.json [12:11:57] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [12:12:58] (03CR) 10Muehlenhoff: [C:03+2] Add krb1002 to kerberos_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/1138684 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:13:13] !log marostegui@cumin1002 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2032.codfw.wmnet with reason: Maintenance [12:14:00] (03PS2) 10Marostegui: sections.yaml: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1138711 (https://phabricator.wikimedia.org/T390530) [12:14:00] (03PS1) 10Marostegui: es2032: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1138733 (https://phabricator.wikimedia.org/T391921) [12:15:20] (03CR) 10Marostegui: [C:03+2] es2032: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1138733 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [12:15:55] (03CR) 10Majavah: [C:03+1] invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy [puppet] - 10https://gerrit.wikimedia.org/r/1137803 (owner: 10Andrew Bogott) [12:16:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2032.codfw.wmnet with reason: Maintenance [12:16:26] jouncebot: nowandnext [12:16:26] For the next 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1200) [12:16:26] In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1300) [12:17:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2048.codfw.wmnet with OS bookworm [12:17:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10764515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with err... 
[12:17:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2047.codfw.wmnet with OS bookworm [12:17:29] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10764516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with err... [12:17:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2046.codfw.wmnet with OS bookworm [12:17:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10764517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm executed with err... [12:17:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T391056)', diff saved to https://phabricator.wikimedia.org/P75432 and previous config saved to /var/cache/conftool/dbconfig/20250424-121756-fceratto.json [12:18:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:18:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: Maintenance [12:18:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P75433 and previous config saved to /var/cache/conftool/dbconfig/20250424-121819-fceratto.json [12:18:33] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1138697 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:18:45] (03CR) 10Filippo Giunchedi: [C:03+1] "Neat!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138694 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:20:13] (03CR) 10Ladsgroup: [C:03+2] Add support for x3 db cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138714 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [12:20:49] (03CR) 10Muehlenhoff: [C:03+2] Add d-i config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138694 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:20:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138714 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [12:21:00] (03Merged) 10jenkins-bot: Add support for x3 db cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138714 (https://phabricator.wikimedia.org/T351820) (owner: 10Ladsgroup) [12:21:36] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]] [12:21:40] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [12:26:34] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:28:55] (03PS1) 10Jgreen: Remove deprecated host civi2001 from nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1138738 (https://phabricator.wikimedia.org/T375038) [12:29:10] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:33:00] (03PS1) 10Brouberol: airflow-test-k8s: allow 32 pods to be created in a single executor batch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138739 (https://phabricator.wikimedia.org/T391669) [12:33:16] (03Abandoned) 10Majavah: [DON'T MERGE] Allow Cloud VPS NAT address for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [12:33:29] (03Abandoned) 10Majavah: Add WMCS to the exception of ratelimit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658890 (https://phabricator.wikimedia.org/T209011) (owner: 10Ladsgroup) [12:34:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P75434 and previous config saved to /var/cache/conftool/dbconfig/20250424-123407-fceratto.json [12:34:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:03] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]] (duration: 14m 28s) [12:36:08] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [12:36:20] (03CR) 10Marostegui: [C:03+2] sections.yaml: Add x3 [puppet] - 10https://gerrit.wikimedia.org/r/1138711 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:36:29] (03PS1) 10Jelto: miscweb: change os-reports runtime owner [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:36:29] (03CR) 10Jelto: [C:04-1] "see my comment in T350794#10764564" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [12:39:59] (03Abandoned) 10Majavah: [DONT MERGE] cloud: NAT egress connections to WMF wikis [puppet] - 10https://gerrit.wikimedia.org/r/656883 
(https://phabricator.wikimedia.org/T209011) (owner: 10Arturo Borrero Gonzalez) [12:41:57] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138739 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol) [12:42:09] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: allow 32 pods to be created in a single executor batch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138739 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol) [12:44:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10764666 (10MatthewVernon) @VRiley-WMF I had a quick look at the console of ms-fe1015 and it looks like there's some problem with its network setup? {F59377760} [12:48:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75435 and previous config saved to /var/cache/conftool/dbconfig/20250424-124838-root.json [12:49:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1016.eqiad.wmnet with OS bullseye [12:49:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75436 and previous config saved to /var/cache/conftool/dbconfig/20250424-124914-fceratto.json [12:49:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10764673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1016.eqiad.wmnet with OS bullseye [12:53:37] Deploying cxserver. 
[12:53:40] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10764698 (10MatthewVernon) I see the same failure mode on ms-fe1016: ` Booting from BRCM MBA Slot 0400 v21.6.4 Broadcom UNDI PXE-2.1 v21.6.4 Copyright (C) 2000-2024 Broadcom Li... [12:53:50] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-fe1016.eqiad.wmnet with OS bullseye [12:53:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10764700 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1016.eqiad.wmnet with OS bullseye executed with errors: - ms-fe10... [12:54:57] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:55:34] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:56:34] (03CR) 10Muehlenhoff: [C:03+2] Add pxelinux config for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1138697 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:58:14] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:58:50] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:37] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: cleanup pre-IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1138743 [13:00:51] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: cleanup pre-IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1138743 (https://phabricator.wikimedia.org/T380174) [13:00:57] I’m in a meeting, hopefully I’ll be able to deploy near the end of the window [13:01:01] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138743 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [13:01:50] 10ops-eqiad, 06DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585 (10phaultfinder) 03NEW [13:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:03:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75439 and previous config saved to /var/cache/conftool/dbconfig/20250424-130344-root.json [13:03:57] !log Updated cxserver to 2025-04-15-070132-production (T391289) [13:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:01] T391289: Consistent error message when trying to resume an old translation - https://phabricator.wikimedia.org/T391289 [13:04:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P75440 and previous config saved to /var/cache/conftool/dbconfig/20250424-130421-fceratto.json [13:05:30] PROBLEM - Check unit status of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit 
replicate-krb-database https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [13:09:25] FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:34] (03PS3) 10Vgutierrez: wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) [13:12:54] (03CR) 10Vgutierrez: "comments addressed, thanks for the review Jesse!" [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:18:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75441 and previous config saved to /var/cache/conftool/dbconfig/20250424-131850-root.json [13:19:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P75442 and previous config saved to /var/cache/conftool/dbconfig/20250424-131928-fceratto.json [13:19:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:23:30] (03CR) 10Ssingh: [C:03+1] "Looks good, verified the ranges in Netbox. You should update geo-maps as well in the DNS repository so that the changes are uniform." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138342 (https://phabricator.wikimedia.org/T380174) (owner: 10Majavah) [13:25:04] (03PS1) 10Filippo Giunchedi: logstash: bump shards for logstash-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) [13:27:20] (03CR) 10Ssingh: "I am assuming Luca's +1 is for hiddenparma. Because it's really not under Traffic's namespace I think and I don't want to step on anyone's" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [13:28:01] (03PS1) 10Jelto: gerrit: add openai-gptbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138755 (https://phabricator.wikimedia.org/T392467) [13:28:12] (03PS2) 10Filippo Giunchedi: logstash: bump shards for logstash-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) [13:28:12] (03PS1) 10Filippo Giunchedi: role: remove logstash role files [puppet] - 10https://gerrit.wikimedia.org/r/1138756 [13:29:43] (03CR) 10Arnaudb: [C:03+1] gerrit: add openai-gptbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138755 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [13:30:09] (03CR) 10Jelto: [C:03+2] gerrit: add openai-gptbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1138755 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [13:31:10] (03PS1) 10Majavah: Update GeoIP maps for new WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/1138758 (https://phabricator.wikimedia.org/T37947) [13:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75443 and previous config saved to /var/cache/conftool/dbconfig/20250424-133355-root.json [13:35:53] (03CR) 10Majavah: [C:03+2] P:dns: Update discovery-map for new WMCS addresses [puppet] - 10https://gerrit.wikimedia.org/r/1138342 (https://phabricator.wikimedia.org/T380174) (owner: 10Majavah) [13:36:50] (03CR) 10Ssingh: [C:03+1] 
"[nit]: probably a good idea to refer to I7584a33f4e9a67b599a597d2bc2e83bd198de4f0" [dns] - 10https://gerrit.wikimedia.org/r/1138758 (https://phabricator.wikimedia.org/T37947) (owner: 10Majavah) [13:37:44] (03PS2) 10Majavah: Update GeoIP maps for new WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/1138758 (https://phabricator.wikimedia.org/T37947) [13:38:10] (03PS1) 10Marostegui: check_eventlogging_lag.sh: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1138760 [13:38:46] (03CR) 10Ssingh: Update GeoIP maps for new WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/1138758 (https://phabricator.wikimedia.org/T37947) (owner: 10Majavah) [13:40:22] (03CR) 10Marostegui: [C:03+2] check_eventlogging_lag.sh: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1138760 (owner: 10Marostegui) [13:40:30] (03CR) 10Majavah: [C:03+2] Update GeoIP maps for new WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/1138758 (https://phabricator.wikimedia.org/T37947) (owner: 10Majavah) [13:40:39] !log taavi@dns3004 START - running authdns-update [13:43:40] !log taavi@dns3004 END - running authdns-update [13:46:35] Lucas_WMDE: Shall I take over the deploy= [13:46:52] hoo: I wouldn’t mind trying out spiderpig ^^ [13:46:56] I should be free in 15 minutes [13:47:03] and the deployment calendar looks free then [13:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:49:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75444 and previous config saved to /var/cache/conftool/dbconfig/20250424-134900-root.json [13:49:19] Ok, good to see it taken care of :) [13:49:36] (03CR) 10JHathaway: 
[C:03+1] wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:50:05] (03CR) 10Vgutierrez: [C:03+2] wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:51:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005'] [13:52:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005'] [13:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:54:56] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:55:42] PROBLEM - Hadoop NodeManager on an-worker1205 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:59:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging2005'] [13:59:50] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005'] [14:03:18] (03PS1) 10Marostegui: site: Reorganize x3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138774 (https://phabricator.wikimedia.org/T390530) [14:03:40] PROBLEM - Hadoop 
NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:04:01] (03PS2) 10Marostegui: site: Reorganize x3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138774 (https://phabricator.wikimedia.org/T390530) [14:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75445 and previous config saved to /var/cache/conftool/dbconfig/20250424-140406-root.json [14:04:26] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1138774 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [14:05:42] RECOVERY - Hadoop NodeManager on an-worker1205 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:05:51] (03CR) 10Ladsgroup: [C:03+1] site: Reorganize x3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138774 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [14:05:58] (03CR) 10Marostegui: [C:03+2] site: Reorganize x3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138774 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [14:07:31] jouncebot: now [14:07:31] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [14:07:39] I’ll go ahead and deploy my config change now (cc hoo) [14:10:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging2005'] [14:11:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005'] [14:11:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=1) upgrade firmware for hosts ['kafka-logging2005'] [14:11:41] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging2005'] [14:11:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [14:12:55] (03Merged) 10jenkins-bot: Fix EntitySchema propertyType on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [14:12:56] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:13:06] (03PS1) 10Majavah: dynamicproxy: Add description to managed DNS records [puppet] - 10https://gerrit.wikimedia.org/r/1138799 [14:13:07] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134692|Fix EntitySchema propertyType on Wikidata (T371196)]] [14:13:11] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [14:13:56] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138799 (owner: 10Majavah) [14:14:25] FIRING: [5x] SystemdUnitFailed: replicate-krb-database.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:38] (03CR) 10Majavah: [C:03+2] dynamicproxy: Add description to managed DNS records [puppet] - 10https://gerrit.wikimedia.org/r/1138799 (owner: 10Majavah) [14:14:40] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:15:30] RECOVERY - Check unit status of replicate-krb-database on krb1001 is OK: OK: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:17:31] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1134692|Fix EntitySchema propertyType on Wikidata (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging2005'] [14:17:59] (03PS3) 10Majavah: dynamicproxy: Return 404 if caller tries to access a nonexistent proxy [puppet] - 10https://gerrit.wikimedia.org/r/1137803 (owner: 10Andrew Bogott) [14:17:59] (03PS2) 10Majavah: dynamicproxy: Delete dns entries before removing proxy records [puppet] - 10https://gerrit.wikimedia.org/r/1137483 (https://phabricator.wikimedia.org/T391718) (owner: 10Andrew Bogott) [14:18:31] lgtm [14:18:37] (tested at https://www.wikidata.org/wiki/Special:EntityData/P12861.ttl) [14:18:40] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [14:19:12] (03CR) 10Majavah: [C:03+2] 
dynamicproxy: Return 404 if caller tries to access a nonexistent proxy [puppet] - 10https://gerrit.wikimedia.org/r/1137803 (owner: 10Andrew Bogott) [14:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75446 and previous config saved to /var/cache/conftool/dbconfig/20250424-141911-root.json [14:19:16] (03CR) 10Majavah: [C:03+2] dynamicproxy: Delete dns entries before removing proxy records [puppet] - 10https://gerrit.wikimedia.org/r/1137483 (https://phabricator.wikimedia.org/T391718) (owner: 10Andrew Bogott) [14:20:45] (03PS1) 10Bking: cirrussearch: remove remaining elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138804 (https://phabricator.wikimedia.org/T388610) [14:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:24:02] (03PS2) 10Majavah: bird: Only specify interface for link-local peerings [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) [14:25:15] I wonder if it’s worth teaching spiderpig not to hyphenate my name like [14:25:16] lu- [14:25:17] (03PS12) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [14:25:17] caswerk- [14:25:19] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134692|Fix EntitySchema propertyType on Wikidata (T371196)]] (duration: 12m 11s) [14:25:19] meister-wmde [14:25:23] ./hj [14:25:23] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [14:25:40] :D [14:25:40] just set my unix user name to {{DISPLAYTITLE:lucas­werkmeister-wmde}} [14:26:18] (03CR) 10Elukey: [C:03+1] "If you want to me more on the cautious side, 
you could set those as inactive first and monitor the cluster, then merge etc.. Otherwise go " [puppet] - 10https://gerrit.wikimedia.org/r/1138804 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:27:23] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5341/co" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:39] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10765010 (10Jhancock.wm) @Papaul i might need your help with this one. Reseated the network card. no pings. Replaced the network card. no pings. Updated... 
[14:30:07] (03PS1) 10Brouberol: deployment_server: assign group onwership of airflow-wmde configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 [14:30:27] (03CR) 10Bking: [C:03+2] cirrussearch: remove remaining elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1138804 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:30:31] (03CR) 10CI reject: [V:04-1] deployment_server: assign group onwership of airflow-wmde configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 (owner: 10Brouberol) [14:31:15] (03PS2) 10Brouberol: deployment_server: assign airflow-wmde group onwership configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 [14:31:40] (03CR) 10CI reject: [V:04-1] deployment_server: assign airflow-wmde group onwership configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 (owner: 10Brouberol) [14:32:13] (03PS3) 10Brouberol: Assign airflow-wmde group onwership configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 [14:32:30] (03PS1) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [14:33:37] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:33:59] (03PS3) 10Majavah: bird: Only specify interface for link-local peerings [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) [14:34:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75447 and previous config saved to /var/cache/conftool/dbconfig/20250424-143417-root.json [14:34:48] 
(03CR) 10Andrew McAllister (WMDE): [C:03+1] "Thanks for the help with this, @brouberol@wikimedia.org!" [puppet] - 10https://gerrit.wikimedia.org/r/1138809 (owner: 10Brouberol) [14:34:55] (03CR) 10Brouberol: [C:03+2] Assign airflow-wmde group onwership configs to airflow-wmde-admins [puppet] - 10https://gerrit.wikimedia.org/r/1138809 (owner: 10Brouberol) [14:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:35] (03CR) 10Hnowlan: "I am _fairly_ sure that this crontab correctly mimics the inscrutable runes that are systemd timers, but I can't be sure. Explainer here h" [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [14:35:44] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [14:37:50] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10765082 (10HCoplin-WMF) p:05High→03Low Thanks for t... 
[14:38:10] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:38:27] (03PS3) 10Brouberol: deployment_server: provision separate kubeconfig files for the airflow PG DBs [puppet] - 10https://gerrit.wikimedia.org/r/1138748 (https://phabricator.wikimedia.org/T391348) [14:39:41] (03CR) 10Kamila Součková: [C:03+1] mediawiki: migrate startupregistrystats-mediawikiwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) (owner: 10Hnowlan) [14:41:42] (03CR) 10Majavah: [V:03+1] "The latest PCC is against everything that runs Bird over IPv6 (I did a Cumin query for `P:bird::anycast%do_ipv6=true` and the filtered it " [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:43:44] (03PS1) 10Hnowlan: mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) [14:45:52] (03CR) 10CI reject: [V:04-1] mediawiki: migrate all unsubscribeinactiveusers jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138815 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [14:47:55] (03CR) 10Ssingh: [C:03+1] "Yeah that makes sense re: PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:49:09] (03PS1) 10Hnowlan: mw:periodic_job:kubernetes: use correct function to error out [puppet] - 10https://gerrit.wikimedia.org/r/1138818 [14:49:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75449 and previous config saved to /var/cache/conftool/dbconfig/20250424-144923-root.json [14:51:54] (03PS2) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) [14:51:57] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:52:08] (03CR) 10Hnowlan: [C:03+2] mediawiki: migrate startupregistrystats-mediawikiwiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138689 (https://phabricator.wikimedia.org/T388540) (owner: 10Hnowlan) [14:53:15] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138821 (https://phabricator.wikimedia.org/T391532) [14:59:16] (03PS1) 10Ssingh: Revert^4 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1138823 [14:59:30] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:59:35] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:59:40] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [14:59:43] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [15:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:05:03] can anyone give a +1 to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1138821 for me? should be pretty harmless… [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:02] (03CR) 10Kamila Součková: [C:03+1] admin_ng: Read RoleBinding usernames from services hiera [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [15:08:27] (03PS4) 10Brouberol: deployment_server: provision separate kubeconfig files for the airflow PG DBs [puppet] - 10https://gerrit.wikimedia.org/r/1138748 (https://phabricator.wikimedia.org/T391348) [15:09:35] FIRING: NetworkDeviceAlarmActive: Alarm active on lsw1-e3-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-e3-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:10:20] (03CR) 10Stevemunene: [C:03+1] deployment_server: provision separate kubeconfig files for the airflow PG DBs [puppet] - 10https://gerrit.wikimedia.org/r/1138748 (https://phabricator.wikimedia.org/T391348) (owner: 10Brouberol) [15:11:08] (03CR) 10Brouberol: [C:03+2] deployment_server: provision separate kubeconfig files for the airflow PG DBs [puppet] - 10https://gerrit.wikimedia.org/r/1138748 (https://phabricator.wikimedia.org/T391348) (owner: 10Brouberol) [15:14:35] RESOLVED: NetworkDeviceAlarmActive: Alarm active on lsw1-e3-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=lsw1-e3-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:16:16] (03CR) 10WMDE-leszek: [C:03+1] wikidata-query-gui: bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138821 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [15:17:34] (03PS5) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [15:21:52] (03CR) 10RLazarus: [C:03+2] "Thanks Raine!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [15:26:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "I’m not really sure yet how this can be tested, but let’s try it anyway. Worst case I ’ll just revert." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138821 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [15:26:35] ^ deploying that in a moment [15:26:43] jouncebot: now [15:26:43] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [15:28:05] does anyone happen to know how I could test a k8s-ized web service in a browser during deployment? 
[15:28:19] so far I have this, which should give me the staging version of the wikidata-query-gui service: [15:28:19] curl -I --resolve query.wikidata.org:30443:$(dig +short k8s-ingress-staging.discovery.wmnet) https://query.wikidata.org:30443/ [15:28:21] (03Merged) 10jenkins-bot: admin_ng: Read RoleBinding usernames from services hiera [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [15:28:36] but that doesn’t look like I could plug it into my browser to test the new version of the JavaScript code :/ [15:29:36] I’m half tempted to wget --mirror the staging version to localhost, but wget doesn’t seem to have a --resolve option like curl does [15:30:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10765458 (10Jgreen) 05Invalid→03Resolved [15:30:51] (03CR) 10Hnowlan: [C:03+1] Remove obsolete videoscaler cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1138713 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [15:34:11] (03Merged) 10jenkins-bot: wikidata-query-gui: bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138821 (https://phabricator.wikimedia.org/T391532) (owner: 10Lucas Werkmeister (WMDE)) [15:36:12] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:36:19] !log remove varnish libvmod-netmapper libvmod-querysort libvmod-re2 varnish-modules libvarnishapi2 varnishkafka from buster-wikimedia [15:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:05] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [15:37:05] (03CR) 10Majavah: [V:03+1 C:03+2] bird: Only specify interface for link-local peerings [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:37:54] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [15:38:13] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [15:38:17] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:38:35] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:39:31] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-f1-codfw.mgmt.codfw.wmnet [15:39:33] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:40:17] !log remove libvarnishapi2 from bullseye-wikimedia main [15:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:39] (03PS2) 10Ssingh: P:durum: add conditional to enable ECH (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) [15:41:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5347/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:42:59] (03Abandoned) 10Ssingh: [DO NOT MERGE] set MX records for dyna [dns] - 10https://gerrit.wikimedia.org/r/1133974 (owner: 10Ssingh) [15:44:03] (03PS6) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [15:44:06] !log 
pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-f1-codfw - pt1979@cumin2002" [15:44:36] (03CR) 10Ssingh: [V:03+1] "The commit message specifies changes since last reviewed:" [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:45:00] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:45:06] (03CR) 10Vgutierrez: varnish: Add basic edge uniques handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:45:51] !log installing twitter-bootstrap3 security updates [15:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:13] (03PS1) 10Brouberol: airflow-analytics-test: split the airflow and postgresql deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138827 (https://phabricator.wikimedia.org/T391348) [15:46:19] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10765520 (10MatthewVernon) [15:47:12] pt1979@cumin2002 provision (PID 3559262) is awaiting input [15:47:17] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10765527 (10MatthewVernon) [15:47:28] (03PS3) 10Ssingh: P:durum: add conditional to enable ECH (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1138823 (https://phabricator.wikimedia.org/T205378) [15:47:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:28] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5349/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138823 
(https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:48:31] (03PS1) 10Krinkle: admin: Remove unused 'platform-engineering' group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 [15:48:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-f1-codfw - pt1979@cumin2002" [15:48:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:49:02] (03CR) 10CI reject: [V:04-1] admin: Remove unused 'platform-engineering' group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [15:49:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:50:32] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:50:56] (03PS9) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [15:51:29] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:51:45] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:52:10] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:52:28] (03CR) 10Vgutierrez: "varnish tests are happy. Text tests:" [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:52:37] (03PS1) 10MVernon: Swift: drain ms-be2080 (prep for VLAN move) [puppet] - 10https://gerrit.wikimedia.org/r/1138830 (https://phabricator.wikimedia.org/T354872) [15:52:39] (03PS1) 10MVernon: swift: remove ms-be2080 entirely from rings prior to reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138831 (https://phabricator.wikimedia.org/T354872) [15:52:40] (03PS1) 10MVernon: swift: restore ms-be2080 to the rings post-reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138832 (https://phabricator.wikimedia.org/T354872) [15:53:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:20] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10765553 (10Jhancock.wm) [15:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:53] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:58] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:54:20] (03CR) 10Eevans: [C:03+2] restbase: add/remove new/old hosts to/from conftool [puppet] - 10https://gerrit.wikimedia.org/r/1138480 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [15:56:42] (03PS2) 10Krinkle: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 [15:57:58] (03CR) 10BCornwall: [C:03+1] varnish: Add basic edge uniques handling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:58:36] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1043.eqiad.wmnet [15:58:47] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1044.eqiad.wmnet [15:58:55] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1045.eqiad.wmnet [15:59:18] (03PS1) 10Vgutierrez: varnish: Fix docker tests [puppet] - 10https://gerrit.wikimedia.org/r/1138834 [16:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1600) [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:40] (03PS7) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [16:01:31] (03CR) 10Ssingh: varnish: Fix docker tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (owner: 10Vgutierrez) [16:02:15] (03CR) 10BCornwall: [C:03+1] varnish: Fix docker tests [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (owner: 10Vgutierrez) [16:02:51] (03PS2) 10Vgutierrez: varnish: Fix docker tests [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (https://phabricator.wikimedia.org/T378737) [16:02:52] (03PS8) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [16:03:08] (03CR) 10Vgutierrez: varnish: Fix docker tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (https://phabricator.wikimedia.org/T378737) (owner: 10Vgutierrez) [16:03:17] (03CR) 10Ssingh: [C:03+1] "🚢" [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (https://phabricator.wikimedia.org/T378737) (owner: 10Vgutierrez) [16:03:31] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix docker tests [puppet] - 10https://gerrit.wikimedia.org/r/1138834 (https://phabricator.wikimedia.org/T378737) (owner: 10Vgutierrez) [16:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:04:35] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10765593 (10dcaro) Did some tests, and we are on the clear, the new hard drives are performant enough (at low level) to handle the current load we have in... 
[16:05:28] (03PS1) 10CDanis: NetworkProbeLimit: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) [16:08:11] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1043.eqiad.wmnet [16:08:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1043.eqiad.wmnet [16:08:17] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1044.eqiad.wmnet [16:08:18] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1044.eqiad.wmnet [16:08:22] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1045.eqiad.wmnet [16:08:22] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1045.eqiad.wmnet [16:13:11] (03PS1) 10Majavah: P:wmcs::instance: Permit DHCPv6 response traffic on host firewall [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) [16:14:21] (03PS1) 10Eevans: restbase: configure restbase104[3-5] for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1138838 (https://phabricator.wikimedia.org/T389423) [16:15:19] (03PS1) 10Kamila Součková: Rakefile: remove semver-cli requirement [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 [16:15:39] (03PS2) 10Majavah: P:wmcs::instance: Permit DHCPv6 response traffic on host firewall [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) [16:16:51] !log Delete source packages for varnish in bullseye-wikimedia [16:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:55] (03PS3) 10Majavah: P:wmcs::instance: Permit DHCPv6 response traffic on host firewall [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) [16:18:50] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) (owner: 10Majavah) [16:19:08] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, I don't think it's much deal to only filter on the dst port." [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) (owner: 10Majavah) [16:20:10] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::instance: Permit DHCPv6 response traffic on host firewall [puppet] - 10https://gerrit.wikimedia.org/r/1138837 (https://phabricator.wikimedia.org/T392611) (owner: 10Majavah) [16:21:00] (03CR) 10BCornwall: [C:03+1] varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:21:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-f1-codfw.mgmt.codfw.wmnet [16:21:38] (03CR) 10Muehlenhoff: admin: Remove platform-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [16:23:43] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10765661 (10dcaro) [16:24:26] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device ssw1-f1-codfw [16:24:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-f1-codfw [16:26:29] (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [16:27:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:28:13] (03PS1) 10Vgutierrez: puppetserver: Deploy wmfuniq-keygen package [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) [16:28:58] (03PS2) 10Vgutierrez: puppetserver: Deploy wmfuniq-keygen package [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) [16:29:02] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:29:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:29:54] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:30:02] wg 3 [16:31:02] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:31:16] (03PS3) 10Vgutierrez: puppetserver: Deploy wmfuniq-keygen package [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) [16:32:21] (03CR) 10CI reject: [V:04-1] Rakefile: remove semver-cli requirement [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [16:33:11] (03PS1) 10Cathal Mooney: WMCS: Add policy for IPv6 ranges assigned for server BGP announcement [homer/public] - 10https://gerrit.wikimedia.org/r/1138850 (https://phabricator.wikimedia.org/T379282) [16:34:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:26] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10765696 (10Jhancock.wm) [16:36:44] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10765717 (10Jhancock.wm) [16:37:08] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-f3-codfw.mgmt.codfw.wmnet [16:37:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:39:04] (03CR) 10Ssingh: [C:03+1] puppetserver: Deploy wmfuniq-keygen package [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:39:19] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [16:41:54] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f3-codfw - pt1979@cumin2002" [16:42:15] (03CR) 10Ssingh: [C:03+1] "Looks good! 
If you rebase this against master, we can run the Docker tests again as we fixed a bug in I2f2948501f7a04cd9e79f78d32919185180" [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) (owner: 10CDanis) [16:42:42] (03PS2) 10CDanis: NetworkProbeLimit: use SameSite=None [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) [16:44:00] (03CR) 10BCornwall: [C:03+1] puppetserver: Deploy wmfuniq-keygen package [puppet] - 10https://gerrit.wikimedia.org/r/1138845 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:45:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f3-codfw - pt1979@cumin2002" [16:45:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:36] (03CR) 10Kamila Součková: [C:03+1] mw:periodic_job:kubernetes: use correct function to error out [puppet] - 10https://gerrit.wikimedia.org/r/1138818 (owner: 10Hnowlan) [16:55:02] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:55:05] (03CR) 10Majavah: [C:03+1] WMCS: Add policy for IPv6 ranges assigned for server BGP announcement [homer/public] - 10https://gerrit.wikimedia.org/r/1138850 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [16:55:16] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Add policy for IPv6 ranges assigned for server BGP announcement [homer/public] - 10https://gerrit.wikimedia.org/r/1138850 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [16:55:40] (03CR) 10Bking: [C:03+2] cirrus: add to-be-renamed masters [puppet] - 10https://gerrit.wikimedia.org/r/1138489 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [16:55:51] (03Merged) 10jenkins-bot: WMCS: Add policy for IPv6 ranges assigned for server BGP announcement [homer/public] - 10https://gerrit.wikimedia.org/r/1138850 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [16:55:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:42] (03CR) 10Hnowlan: [C:03+2] mw:periodic_job:kubernetes: use correct function to error out [puppet] - 10https://gerrit.wikimedia.org/r/1138818 (owner: 10Hnowlan) [16:56:54] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:58] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:58:35] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2073 to cirrussearch2073 [16:58:58] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:59:40] (03PS2) 10Hnowlan: mediawiki: migrate unsubscribeinactiveusers-metawiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) [17:00:04] bd808: Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1700). Please do the needful. [17:00:05] rzl: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T1700). 
[17:00:36] (03PS1) 10Eevans: adjust hosts lists to reflect changes in restbase cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138854 (https://phabricator.wikimedia.org/T389423) [17:01:11] (03PS1) 10Kamila Součková: CampaignEvents: Shorten aggregateparticipantanswers name [puppet] - 10https://gerrit.wikimedia.org/r/1138855 (https://phabricator.wikimedia.org/T385867) [17:02:37] (03CR) 10Kamila Součková: "Please correct me if I'm wrong :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [17:03:15] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2073 to cirrussearch2073 - bking@cumin2002" [17:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:04:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2073 to cirrussearch2073 - bking@cumin2002" [17:04:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2073 [17:04:24] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138810 (https://phabricator.wikimedia.org/T388539) (owner: 10Hnowlan) [17:04:48] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2073 [17:05:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2073 to cirrussearch2073 [17:08:41] FIRING: [3x] 
CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:08:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2073.codfw.wmnet with OS bullseye [17:08:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2073 [17:09:08] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:09:16] !log rzl@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:11:32] applying some latent admin_ng diffs for cassandra-restbase-a-eqiad, cassandra-restbase-b-eqiad, cassandra-restbase-c-eqiad, hadoop-worker-analytics, kerberos-kdc (all Endpoint IP address changes), kube-state-metrics ClusterRole and command-line args, and mw-wikifunctions Certificate dnsNames [17:11:47] no idea how long they've been sitting there [17:13:04] for posterity, https://www.irccloud.com/pastebin/WZpEyhx5/ [17:13:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2073 - bking@cumin2002" [17:13:35] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2073 - bking@cumin2002" [17:13:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:35] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2073.codfw.wmnet 28.0.192.10.in-addr.arpa 8.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:13:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2073.codfw.wmnet 28.0.192.10.in-addr.arpa 
8.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:13:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2073 [17:14:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2073 [17:14:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2073 [17:14:27] !log rzl@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:16:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f3-codfw.mgmt.codfw.wmnet [17:17:45] !log rzl@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:18:41] !log rzl@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:25:22] (03CR) 10BCornwall: [V:03+1] "0 tests failed, 0 tests skipped, 39 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1138836 (https://phabricator.wikimedia.org/T342624) (owner: 10CDanis) [17:26:48] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10765911 (10dcaro) [17:28:46] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2073.codfw.wmnet with reason: host reimage [17:32:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2073.codfw.wmnet with reason: host reimage [17:33:49] (03PS1) 10Bernard Wang: Remove Search AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) [17:35:28] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [17:36:52] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:38:34] (03PS10) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [17:38:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:39:10] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f3-codfw [17:39:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f3-codfw [17:39:58] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:41:04] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:43:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:45:24] (03CR) 10Jdlrobson: Remove Search AB test config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138859 (https://phabricator.wikimedia.org/T388719) (owner: 10Bernard Wang) [17:51:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2073.codfw.wmnet with OS bullseye [17:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:56:26] FIRING: [2x] ProbeDown: Service 
wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:59:34] (03CR) 10JHathaway: Rakefile: remove semver-cli requirement (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138839 (owner: 10Kamila Součková) [18:01:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:56] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [18:11:58] (03CR) 10Herron: [C:03+1] logstash: bump shards for logstash-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [18:12:08] (03CR) 10Herron: [C:03+1] profile::prometheus::k8s: drop istio gateway labels for ML [puppet] - 10https://gerrit.wikimedia.org/r/1138313 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [18:12:28] (03CR) 10Herron: [C:03+1] profile::pyrra: avoid Istio recording rules for SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1138674 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [18:14:08] (03CR) 10Herron: [C:03+1] role: remove logstash role files [puppet] - 10https://gerrit.wikimedia.org/r/1138756 (owner: 10Filippo Giunchedi) [18:16:48] (03CR) 10Dzahn: [C:03+2] gerrit: add nftables rule to allow Istanbul Hackathon hotel network [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [18:19:04] 10ops-magru: Solicit Dell 
to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10766070 (10BCornwall) It looks like iDRAC only supports outputting historical data on inlet temperatures, not CPUs.... Currently this the best info we've got from iDRAC itself. ` racadm>>getsensorinfo S... [18:21:36] (03CR) 10Dzahn: [C:03+2] "tested on gerrit2003/2002 before gerrit1003. looks good in 'nft list ruleset' and restarting nftables had no issues" [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [18:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:28:38] (03CR) 10Dzahn: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2122 MB (3% inode=95%): /tmp 2122 MB (3% inode=95%): /var/tmp 2122 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [19:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over 
threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:19:31] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2076 to cirrussearch2076 [19:19:54] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:23:12] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627 (10jhathaway) 03NEW [19:23:57] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766249 (10jhathaway) p:05Triage→03Medium a:03jhathaway [19:24:43] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766252 (10jhathaway) Only occurs on puppetserver1001.eqiad.wmnet, cert was revoked on April 14th: ` puppetserver-2025-04-14.0.log.gz:2025-04-14T07:26:35.169Z INFO [qtp1905171892-12616218] [p.p.certificate-authority] Rev... 
[19:25:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:25:44] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2076 to cirrussearch2076 - bking@cumin2002" [19:26:06] 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628 (10jhathaway) 03NEW [19:26:18] 07Puppet: sync-puppet-ca timer broken - https://phabricator.wikimedia.org/T392628#10766268 (10jhathaway) p:05Triage→03Medium [19:27:34] (03PS1) 10JHathaway: puppetserver: fix sync-puppet-ca timer [puppet] - 10https://gerrit.wikimedia.org/r/1138904 (https://phabricator.wikimedia.org/T392628) [19:28:00] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2076 to cirrussearch2076 - bking@cumin2002" [19:28:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:01] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2076 [19:28:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2076 [19:29:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2076 to cirrussearch2076 [19:29:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating pdus in codfw - jhancock@cumin2002" [19:29:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2076.codfw.wmnet with OS bullseye [19:29:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating pdus in codfw - jhancock@cumin2002" [19:29:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:29:40] !log 
bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2076 [19:30:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:30:19] 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629 (10jhathaway) 03NEW [19:30:27] 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10766295 (10jhathaway) p:05Triage→03Medium [19:31:12] (03CR) 10Dwisehaupt: [C:03+2] "Approved!" [puppet] - 10https://gerrit.wikimedia.org/r/1138738 (https://phabricator.wikimedia.org/T375038) (owner: 10Jgreen) [19:33:06] (03PS1) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [19:33:25] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138904 (https://phabricator.wikimedia.org/T392628) (owner: 10JHathaway) [19:35:13] (03CR) 10CI reject: [V:04-1] systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [19:35:15] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2076 - bking@cumin2002" [19:35:21] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2076 - bking@cumin2002" [19:35:21] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:35:22] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2076.codfw.wmnet 206.0.192.10.in-addr.arpa 6.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:35:25] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2076.codfw.wmnet 206.0.192.10.in-addr.arpa 6.0.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all 
recursors [19:35:26] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2076 [19:36:13] (03PS2) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [19:36:25] (03CR) 10JHathaway: [C:03+2] puppetserver: fix sync-puppet-ca timer [puppet] - 10https://gerrit.wikimedia.org/r/1138904 (https://phabricator.wikimedia.org/T392628) (owner: 10JHathaway) [19:38:13] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2076 [19:38:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2076 [19:38:48] (03CR) 10CI reject: [V:04-1] systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [19:41:03] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:59] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:59] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:42:02] (03PS3) 10Gergő Tisza: Check for shared domain in missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [19:42:10] (03CR) 10Gergő Tisza: [C:03+1] Check for shared domain in missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [19:48:12] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [19:48:16] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [19:48:16] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [19:48:18] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766350 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2078.codfw.w... [19:50:01] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10766354 (10Sreejithk2000) I am having trouble restoring https://commons.wikimedia.org/w/index.php?title=File:Hawkmoth_(Meganoton_nyctiphanes)_(86... 
[19:52:35] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2076.codfw.wmnet with reason: host reimage [19:52:58] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [19:53:04] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766359 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2078.codfw.wmnet... [19:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:45] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:52] (03PS3) 10JHathaway: systemd: validate units [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) [19:53:54] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:55:41] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [19:55:45] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [19:55:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host 
cirrussearch2078 [19:55:46] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2078.codfw.w... [19:55:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2076.codfw.wmnet with reason: host reimage [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:52] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: 10JHathaway) [20:13:25] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [20:13:33] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2078.codfw.wmnet... 
[20:13:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:14:54] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10766416 (10Pppery) That's a different issue, as the file was deleted/undeleted in 2021. Presumably some SRE needs to manually delete the lingerin... [20:15:17] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "Error undeleting file: A non-identical file already exists at "mwstore://local-swift-eqiad/local-public/..." while restoring a file on Commons - https://phabricator.wikimedia.org/T258938#10766429 (10Pppery) →14Duplicate dup:03T387340 [20:15:23] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10766431 (10Pppery) [20:16:20] 06SRE, 10SRE-swift-storage: Incorrect "non-identical file already exists" error when undeleting file on Commons - https://phabricator.wikimedia.org/T45952#10766436 (10Pppery) →14Duplicate dup:03T387340 [20:16:21] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10766438 (10Pppery) [20:17:07] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Rapid delete/move/undelete operations on files can result in the MediaWiki DB getting out of sync with Swift, resulting in "A non-identical file already exists at
errors" on undelete - https://phabricator.wikimedia.org/T387340#10766440 (10... [20:20:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2076.codfw.wmnet with OS bullseye [20:22:14] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [20:22:18] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [20:22:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [20:22:26] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2078.codfw.w... [20:22:27] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Rapid delete/move/undelete operations on files can result in the MediaWiki DB getting out of sync with Swift, resulting in "A non-identical file already exists at
errors" on undelete - https://phabricator.wikimedia.org/T387340#10766460 (10... [20:29:12] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [20:29:24] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2078.codfw.wmnet... [20:29:44] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2078'] [20:32:19] (03PS1) 10Krinkle: missing.php: Simplify code to reduce abstraction and duplication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138921 (https://phabricator.wikimedia.org/T113114) [20:32:20] (03PS1) 10Krinkle: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) [20:34:25] FIRING: [3x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2078'] [20:43:54] (03CR) 10Krinkle: "Screenshots at https://phabricator.wikimedia.org/T113114#10766520" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) 
[20:44:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:46:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:49:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:25] (03PS4) 10Krinkle: missing: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [20:49:33] (03PS5) 10Krinkle: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [20:50:51] (03PS6) 10Krinkle: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [20:51:21] (03CR) 10Krinkle: [C:03+1] missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [20:51:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [20:51:42] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [20:51:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host 
cirrussearch2078 [20:51:43] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766553 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2078.codfw.w... [20:52:17] (03CR) 10VolkerE: [C:04-1] missing.php: Redesign to match current error pages (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [20:54:25] FIRING: [4x] SystemdUnitFailed: krb5-admin-server.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:00] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10766560 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2078.codfw.wmnet... 
[20:57:11] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [20:58:15] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [20:58:19] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [20:58:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250424T2100) [21:00:14] (03CR) 10Zabe: [C:03+1] Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [21:00:57] (03CR) 10Krinkle: missing.php: Redesign to match current error pages (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [21:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:07:26] (03CR) 10Cwhite: [C:03+1] "These don't look used to me. Thanks for the cleanup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138756 (owner: 10Filippo Giunchedi) [21:08:46] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [21:11:41] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [21:13:32] !log restarting puppetserver1002 to test crl [21:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:56] (03PS3) 10Ryan Kemper: sre.wdqs.data-transfer: improve graph type checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) [21:17:03] (03PS3) 10Ryan Kemper: fix inconsequential typos [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137356 [21:18:29] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2080 to cirrussearch2080 [21:18:51] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:20:49] 07Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637 (10jhathaway) 03NEW [21:21:23] 07Puppet: non-ca puppetservers do not check the ca certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766678 (10jhathaway) p:05Triage→03Medium [21:23:13] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2080 to cirrussearch2080 - bking@cumin2002" [21:23:29] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2080 to cirrussearch2080 - bking@cumin2002" [21:23:29] !log bking@cumin2002 END (PASS) 
- Cookbook sre.dns.netbox (exit_code=0) [21:23:30] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2080 [21:24:32] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-f1-codfw.mgmt.codfw.wmnet [21:24:34] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:26:18] !log jhathaway@cumin1002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for db1178.eqiad.wmnet: Renew puppet certificate - jhathaway@cumin1002 [21:26:33] bking@cumin2002 rename (PID 3906272) is awaiting input [21:28:46] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f1-codfw - pt1979@cumin2002" [21:28:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f1-codfw - pt1979@cumin2002" [21:28:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:29:56] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for db1178.eqiad.wmnet: Renew puppet certificate - jhathaway@cumin1002 [21:33:30] FIRING: Emergency syslog message: Alert for device lsw1-d8-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [21:33:33] (03CR) 10CI reject: [V:04-1] sre.wdqs.data-transfer: improve graph type checks [cookbooks] - 10https://gerrit.wikimedia.org/r/1097552 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [21:34:33] (03PS1) 10Bking: cirrussearch: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1138925 (https://phabricator.wikimedia.org/T388610) [21:35:03] (03PS1) 10Zabe: Prepare nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138926 (https://phabricator.wikimedia.org/T390384) [21:35:12] !log 
bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2080 [21:35:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2080 to cirrussearch2080 [21:36:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:38:18] (03PS2) 10Bking: cirrussearch: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1138925 (https://phabricator.wikimedia.org/T388610) [21:38:30] RESOLVED: Emergency syslog message: Device lsw1-d8-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [21:39:04] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2080.codfw.wmnet with OS bullseye [21:39:05] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:39:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2080.codfw.wmnet with OS bullseye [21:40:35] (03CR) 10Zabe: [C:03+2] Prepare nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138926 (https://phabricator.wikimedia.org/T390384) (owner: 10Zabe) [21:41:01] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:42:28] 07Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10766729 (10jhathaway) 05Open→03Resolved I opened subtasks for the issues discovered when looking at this issue, the server certificate itself has been regenerated, however why the cert was revoked in the first plac... [21:42:54] 07Puppet: Non-ca puppetservers do not check the CA certificate revocation list or CRL - https://phabricator.wikimedia.org/T392637#10766732 (10jhathaway) [21:44:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2080.codfw.wmnet with OS bullseye [21:45:10] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2080 [21:45:16] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:45:48] (03Merged) 10jenkins-bot: Prepare nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138926 (https://phabricator.wikimedia.org/T390384) (owner: 10Zabe) [21:46:29] (03CR) 10BryanDavis: [C:03+1] "We had the Puppet log ingestion setup in Beta Cluster in the past, but those instances are long gone at this point." 
[puppet] - 10https://gerrit.wikimedia.org/r/1138756 (owner: 10Filippo Giunchedi) [21:46:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138925 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:47:21] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1138926|Prepare nupwiki (T390384)]] [21:47:26] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [21:48:47] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1138925 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:49:01] (03CR) 10Bking: [C:03+2] cirrussearch: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1138925 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:49:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2080 - bking@cumin2002" [21:49:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2080 - bking@cumin2002" [21:49:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:49:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2080.codfw.wmnet 127.16.192.10.in-addr.arpa 7.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:49:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2080.codfw.wmnet 127.16.192.10.in-addr.arpa 7.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:49:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2080 [21:49:51] !log bking@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2080 [21:49:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2080 [21:52:10] !log zabe@deploy1003 zabe: Backport for [[gerrit:1138926|Prepare nupwiki (T390384)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:53:54] !log zabe@deploy1003 zabe: Continuing with sync [21:54:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f1-codfw.mgmt.codfw.wmnet [21:56:13] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:56:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2076-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:58:39] (03CR) 10Cwhite: [C:04-2] "I think it's worth detailing the rationale and method before making such a large change." [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [21:59:18] (03CR) 10Cwhite: [C:04-2] "_unresolve_" [puppet] - 10https://gerrit.wikimedia.org/r/1138754 (https://phabricator.wikimedia.org/T391687) (owner: 10Filippo Giunchedi) [21:59:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [21:59:59] (03PS1) 10Zabe: Activate nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138929 (https://phabricator.wikimedia.org/T390384) [22:00:37] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138926|Prepare nupwiki (T390384)]] (duration: 13m 15s) [22:00:41] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [22:01:44] (03CR) 10Zabe: [C:03+2] Activate nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138929 (https://phabricator.wikimedia.org/T390384) (owner: 10Zabe) [22:02:33] (03Merged) 10jenkins-bot: Activate nupwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138929 (https://phabricator.wikimedia.org/T390384) (owner: 10Zabe) [22:02:42] !log bking@cumin2002 START - Cookbook 
sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [22:02:46] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [22:02:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [22:03:27] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1138929|Activate nupwiki (T390384)]] [22:04:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2080.codfw.wmnet with reason: host reimage [22:07:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2080.codfw.wmnet with reason: host reimage [22:07:57] !log zabe@deploy1003 zabe: Backport for [[gerrit:1138929|Activate nupwiki (T390384)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:08:01] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [22:08:40] !log zabe@deploy1003 zabe: Continuing with sync [22:09:19] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2078.codfw.wmnet with OS bullseye [22:11:48] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye [22:11:52] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2078 [22:11:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2078 [22:15:21] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138929|Activate nupwiki (T390384)]] (duration: 11m 54s) [22:15:26] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [22:21:48] FIRING: PuppetFailure: Puppet has failed on krb1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:22:12] (03PS1) 10Zabe: Update interwiki cache 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138933 [22:22:12] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138933 (owner: 10Zabe) [22:23:05] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138933 (owner: 10Zabe) [22:23:26] !log zabe@deploy1003 Started scap sync-world: T390384 [22:23:30] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [22:27:36] (03PS1) 10Ryan Kemper: [wip] wdqs: point query.wikidata.org to main graph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138935 (https://phabricator.wikimedia.org/T388134) [22:28:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:34:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2080.codfw.wmnet with OS bullseye [22:34:34] !log zabe@deploy1003 Finished scap sync-world: T390384 (duration: 11m 08s) [22:34:38] T390384: Create Wikipedia Nupe - https://phabricator.wikimedia.org/T390384 [22:34:54] (03PS1) 10RLazarus: admin_ng: Read RoleBinding usernames from services hiera, correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) [22:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:36:27] (03PS2) 10RLazarus: admin_ng: Read RoleBinding usernames from services hiera, correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) [22:43:00] (03CR) 10RLazarus: admin_ng: 
Read RoleBinding usernames from services hiera, correctly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [22:52:58] (03CR) 10Kamila Součková: [C:03+1] "Oops, I didn't catch that on the previous change '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [22:55:23] (03PS3) 10RLazarus: admin_ng: Read RoleBinding usernames from services hiera, correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) [22:59:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:04:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:05:34] (03CR) 10RLazarus: [C:03+2] admin_ng: Read RoleBinding usernames from services hiera, correctly (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [23:11:20] (03Merged) 10jenkins-bot: admin_ng: Read RoleBinding usernames from services hiera, correctly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138938 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) 
[23:21:31] jouncebot: nowandnext [23:21:31] No deployments scheduled for the next 6 hour(s) and 38 minute(s) [23:21:31] In 6 hour(s) and 38 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250425T0600) [23:21:45] sneaking out an infrastructure patch a mere 6 hours 38 minutes before the window starts [23:22:18] !log rzl@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [23:24:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:25:43] !log rzl@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [23:27:21] !log rzl@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [23:28:10] !log rzl@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [23:28:21] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [23:29:41] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [23:29:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:30:08] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [23:31:18] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[23:32:11] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2078.codfw.wmnet with OS bullseye [23:33:40] (done) [23:40:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:40:11] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138941 [23:40:12] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138941 (owner: TrainBranchBot) [23:45:00] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:47:46] !log pt1979@cumin2002 START - Cookbook sre.network.tls for network device lsw1-f1-codfw [23:47:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw [23:50:00] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:53:31] (CR) Dzahn: "Alright, this sounds mostly good to me, just want to add a little detail right now. 
With the current setup a gerrit server has 2 IPs and d" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn) [23:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:46] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:54] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:54:25] (CR) Dzahn: "Thank you, Jaime. I will get back to this shortly. appreciate the response and ideas." [puppet] - https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn) [23:55:52] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138941 (owner: TrainBranchBot)