[00:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1103017
[00:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1103017 (owner: TrainBranchBot)
[00:38:40] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10402034 (Eevans) >>! In T381742#10399231, @Eevans wrote: > Status: Rebuilding... ...Done (New )status: Observing...
[00:39:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10402035 (phaultfinder)
[01:01:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1103017 (owner: TrainBranchBot)
[01:08:16] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1103027
[01:08:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1103027 (owner: TrainBranchBot)
[01:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:23:33] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:30:50] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1103027 (owner: TrainBranchBot)
[02:03:20] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101867
(https://phabricator.wikimedia.org/T381421) (owner: أنون)
[02:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:08] (CR) Scott French: [C:+1] "Thanks, Tim! I think this should be good to proceed from the shellbox side of things:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: Tim Starling)
[02:23:48] ops-codfw, SRE, DC-Ops: Remove defunct lvs cross-dc links in Netbox (lvs2011 & lvs2013) - https://phabricator.wikimedia.org/T381533#10402096 (Papaul) Open→Resolved @cmooney complete
[02:28:08] SRE, Commons, MediaWiki-File-management, serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#10402111 (AntiCompositeNumber)
[03:09:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10402114 (phaultfinder)
[03:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[03:24:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10402128 (phaultfinder)
[03:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[04:23:37] PROBLEM -
Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 17 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[04:26:37] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[05:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:38:09] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[05:39:03] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 4.092 second response time https://wikitech.wikimedia.org/wiki/Docker
[06:07:17] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:07:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:45:37] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241213T0700)
[07:11:05] (PS1) Elukey: charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198
[07:11:56] (PS2) Elukey: charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198
[07:13:57] (PS3) Elukey: charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198
[07:18:29] (PS4) Elukey: charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198
[07:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[07:21:58] (PS5) Elukey: charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198
[07:24:17] (CR) Elukey: [C:+2] charts: add /etc/ config volume for Kartotherian (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1103198 (owner: Elukey)
[07:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[07:25:47] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[07:26:37] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 501
bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:30:57] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[07:33:21] (CR) Slyngshede: [C:+2] IDM update to Bitu 0.1.4 [dns] - https://gerrit.wikimedia.org/r/1102762 (owner: Slyngshede)
[07:33:36] (PS1) Elukey: charts: fix kartotherian's config map [deployment-charts] - https://gerrit.wikimedia.org/r/1103203
[07:34:59] (CR) Elukey: [C:+2] charts: fix kartotherian's config map [deployment-charts] - https://gerrit.wikimedia.org/r/1103203 (owner: Elukey)
[07:41:01] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[07:49:06] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[07:57:41] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2126.codfw.wmnet
[07:58:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2126.codfw.wmnet
[07:58:22] (CR) Muehlenhoff: [C:+2] Configure new maps nodes with nftables [puppet] - https://gerrit.wikimedia.org/r/1101864 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:58:37] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2125.codfw.wmnet
[07:59:10] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[07:59:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2125.codfw.wmnet
[08:00:08] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241213T0800)
[08:00:24] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[2001-2004].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw)
[08:01:56] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2126.codfw.wmnet with OS bookworm
[08:02:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2125.codfw.wmnet with OS bookworm
[08:02:08] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bookworm
[08:02:15] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2126
[08:02:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2126
[08:02:16] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2001
[08:02:16] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2001
[08:02:36] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2125
[08:02:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2125
[08:05:53] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:06:07] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:08:58] (CR) JMeybohm: [WIP, DNM] create sre.k8s.roll-reimage-nodes (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1094494
(https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[08:09:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2001.codfw.wmnet
[08:10:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2002.codfw.wmnet
[08:13:46] (CR) JMeybohm: [WIP, DNM] create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[08:19:54] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage
[08:21:23] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage
[08:21:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host maps-test2001.codfw.wmnet
[08:21:38] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage
[08:22:16] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host maps-test2002.codfw.wmnet
[08:23:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2003.codfw.wmnet
[08:23:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2004.codfw.wmnet
[08:23:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage
[08:26:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2125.codfw.wmnet with reason: host reimage
[08:29:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2003.codfw.wmnet
[08:30:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2004.codfw.wmnet
[08:30:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2126.codfw.wmnet with reason: host reimage
[08:31:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2005.codfw.wmnet
[08:31:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host maps-test2006.codfw.wmnet
[08:33:43] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10402277 (MoritzMuehlenhoff)
[08:36:46] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2017.codfw.wmnet
[08:36:54] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:39:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2005.codfw.wmnet
[08:39:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps-test2006.codfw.wmnet
[08:39:54] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:40:54] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:41:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:43:34] !log jayme@cumin1002 END
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2001.codfw.wmnet with OS bookworm
[08:44:08] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:29] (PS1) Muehlenhoff: Remove obsolete acmechief entries [puppet] - https://gerrit.wikimedia.org/r/1103253
[08:45:19] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2002.codfw.wmnet with OS bookworm
[08:45:20] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 70534104 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:45:27] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2002
[08:45:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2002
[08:45:33] (PS1) Muehlenhoff: Remove ganeti2018 from cluster list [puppet] - https://gerrit.wikimedia.org/r/1103255 (https://phabricator.wikimedia.org/T376594)
[08:46:20] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 98424 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:46:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2125.codfw.wmnet with OS bookworm
[08:46:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2017.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:46:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2017.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:46:57] !log
jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:46:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2017.codfw.wmnet
[08:47:07] (CR) Muehlenhoff: [C:+2] Remove ganeti2018 from cluster list [puppet] - https://gerrit.wikimedia.org/r/1103255 (https://phabricator.wikimedia.org/T376594) (owner: Muehlenhoff)
[08:47:08] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:08] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:56] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:50:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2126.codfw.wmnet with OS bookworm
[08:51:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2125.codfw.wmnet
[08:51:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2125.codfw.wmnet
[08:51:46] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2126.codfw.wmnet
[08:51:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2126.codfw.wmnet
[08:58:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2123.codfw.wmnet
[08:59:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2123.codfw.wmnet
[08:59:41] !log jelto@cumin1002 START - Cookbook
sre.k8s.pool-depool-node depool for host wikikube-worker2122.codfw.wmnet
[09:01:16] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Update
[09:02:12] !log T382078 Ran mwscript-k8s --comment="T382078" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=wikidatawiki --logwiki=metawiki 'Norberto Luis Amoroso Jacquet' 'Renamed user fe0fd27068061604303a2a5ab7390149'
[09:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:16] T382078: Unblock stuck global rename of Norberto Luis Amoroso Jacquet and Roggenwolf - https://phabricator.wikimedia.org/T382078
[09:02:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2122.codfw.wmnet
[09:05:10] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2002.codfw.wmnet with reason: host reimage
[09:05:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2122.codfw.wmnet with OS bookworm
[09:05:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2123.codfw.wmnet with OS bookworm
[09:05:33] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2122
[09:05:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2122
[09:05:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2123
[09:05:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2123
[09:08:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2002.codfw.wmnet with reason: host reimage
[09:08:54] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect -
kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:25] !log T382078 Ran mwscript-k8s --comment="T382078" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=trwikiquote --logwiki=metawiki 'Roggenwolf' 'ChopinAficionado'
[09:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:28] T382078: Unblock stuck global rename of Norberto Luis Amoroso Jacquet and Roggenwolf - https://phabricator.wikimedia.org/T382078
[09:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:22:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2122.codfw.wmnet with reason: host reimage
[09:23:39] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2123.codfw.wmnet with reason: host reimage
[09:26:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2122.codfw.wmnet with reason: host reimage
[09:27:02] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:27:32] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2018.codfw.wmnet
[09:27:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2002.codfw.wmnet with OS bookworm
[09:29:32] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2003.codfw.wmnet with OS bookworm
[09:29:40] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host
wikikube-worker2003
[09:29:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2003
[09:29:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2123.codfw.wmnet with reason: host reimage
[09:32:01] (PS1) Lucas Werkmeister (WMDE): Remove lucaswerkmeister-wmde SSH key [puppet] - https://gerrit.wikimedia.org/r/1103281
[09:32:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:36:25] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2018.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:36:33] (CR) Fabfur: [C:+2] Remove lucaswerkmeister-wmde SSH key [puppet] - https://gerrit.wikimedia.org/r/1103281 (owner: Lucas Werkmeister (WMDE))
[09:36:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti2018.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:36:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:36:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2018.codfw.wmnet
[09:37:31] (PS1) Muehlenhoff: Remove decommed Ganeti nodes from site.pp [puppet] - https://gerrit.wikimedia.org/r/1103283 (https://phabricator.wikimedia.org/T382114)
[09:39:34] (CR) Muehlenhoff: [C:+2] Remove decommed Ganeti nodes from site.pp [puppet] - https://gerrit.wikimedia.org/r/1103283 (https://phabricator.wikimedia.org/T382114) (owner: Muehlenhoff)
[09:41:45] ops-codfw, DC-Ops, decommission-hardware: decommission ganeti2017 / ganeti2018 - https://phabricator.wikimedia.org/T382114#10402378 (MoritzMuehlenhoff)
[09:42:03] (CR) Fabfur: [C:+1] "lgtm!"
[puppet] - https://gerrit.wikimedia.org/r/1103253 (owner: Muehlenhoff)
[09:42:50] !log depool/restart swift/repool ms-fe1014
[09:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:11] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10402385 (MoritzMuehlenhoff) Open→Resolved All new servers added, all old server decommissioned and clusters rebalanced.
[09:44:06] (CR) Muehlenhoff: [C:+2] Remove obsolete acmechief entries [puppet] - https://gerrit.wikimedia.org/r/1103253 (owner: Muehlenhoff)
[09:44:07] SRE, Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load - https://phabricator.wikimedia.org/T381980#10402388 (Aklapper) Works for me
[09:45:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2122.codfw.wmnet with OS bookworm
[09:47:12] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage
[09:50:13] SRE, Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load - https://phabricator.wikimedia.org/T381980#10402401 (Lucas_Werkmeister_WMDE) Still not working for me when logged in (but works in a private window). @Aklapper are you logged into...
[09:50:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2123.codfw.wmnet with OS bookworm
[09:50:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2003.codfw.wmnet with reason: host reimage
[09:55:46] (CR) Lucas Werkmeister (WMDE): [C:+1] "Hm, no, I would’ve expected that to serve the not-ready-for-deployment latest version of the query builder from kubernetes (as opposed to " [puppet] - https://gerrit.wikimedia.org/r/1102320 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[09:56:06] (CR) JMeybohm: "Another thing I've noticed is that, like with the reimage-stacked-control-plane cookbook, roll-reimage-nodes will repeat itself over and o" [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[09:56:56] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2123.codfw.wmnet
[09:56:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2123.codfw.wmnet
[09:57:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2122.codfw.wmnet
[09:57:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2122.codfw.wmnet
[09:57:26] SRE, Infrastructure-Foundations, Traffic: NetworkProbeLimit cookie rejected due to missing SameSite attribute - https://phabricator.wikimedia.org/T342624#10402439 (Krinkle)
[09:58:42] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox, and 3 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10402444 (tstarling) For a baseline performance benchmark, I reset the WebM 360P transcode of [[https://commons.wikimedia.org/wiki/File:View_of_the_Earth_f...
[09:59:45] SRE, Infrastructure-Foundations, Traffic: NetworkProbeLimit cookie rejected due to missing SameSite attribute - https://phabricator.wikimedia.org/T342624#10402445 (Krinkle)
[10:07:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2121.codfw.wmnet
[10:08:22] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[10:08:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2121.codfw.wmnet
[10:09:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:04] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2120.codfw.wmnet
[10:10:12] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:10:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2120.codfw.wmnet
[10:11:10] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:11:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2003.codfw.wmnet with OS bookworm
[10:12:29] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2121.codfw.wmnet with OS bookworm
[10:12:31] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2120.codfw.wmnet with OS bookworm
[10:12:49] !log jelto@cumin1002 START -
Cookbook sre.hosts.move-vlan for host wikikube-worker2121
[10:12:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2121
[10:12:50] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2120
[10:12:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2120
[10:13:09] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2004.codfw.wmnet with OS bookworm
[10:13:18] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2004
[10:13:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2004
[10:14:38] (PS1) Marostegui: installserver: Do not reimage es2044 [puppet] - https://gerrit.wikimedia.org/r/1103286
[10:16:10] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:16:20] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:17:05] (CR) Marostegui: [C:+2] installserver: Do not reimage es2044 [puppet] - https://gerrit.wikimedia.org/r/1103286 (owner: Marostegui)
[10:17:57] (Abandoned) Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: Fabfur)
[10:24:13] SRE, Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load when logged out - https://phabricator.wikimedia.org/T381980#10402510 (Aklapper)
[10:24:22] (PS1)
10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [10:25:45] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load when logged out - https://phabricator.wikimedia.org/T381980#10402512 (10Aklapper) Oh true, thanks. https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ works f... [10:26:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:26:52] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load when logged in - https://phabricator.wikimedia.org/T381980#10402515 (10Aklapper) [10:30:14] (03CR) 10JMeybohm: [WIP, DNM] create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [10:30:41] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2121.codfw.wmnet with reason: host reimage [10:30:42] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2120.codfw.wmnet with reason: host reimage [10:30:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [10:34:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2121.codfw.wmnet with reason: host reimage [10:37:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2120.codfw.wmnet with reason: host reimage [10:39:33] (03PS2) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [10:40:10] 
(03CR) 10CI reject: [V:04-1] benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:40:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2004.codfw.wmnet with reason: host reimage [10:47:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:28] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:11] (03PS3) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [10:51:11] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:53:28] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2121.codfw.wmnet with OS bookworm [10:54:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:54:28] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:22] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker2120.codfw.wmnet with OS bookworm [10:59:38] (03PS4) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [11:00:17] (03CR) 10CI reject: [V:04-1] benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:00:33] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2004.codfw.wmnet with OS bookworm [11:00:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[2001-2004].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:04:13] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[2007-2010].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [11:05:58] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2007.codfw.wmnet with OS bookworm [11:06:06] (03PS5) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [11:06:06] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2007 [11:06:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2007 [11:07:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:10:22] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [11:23:51] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [11:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:26:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2007.codfw.wmnet with reason: host reimage [11:30:29] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2119.codfw.wmnet [11:31:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2119.codfw.wmnet [11:31:42] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2120.codfw.wmnet [11:31:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2120.codfw.wmnet [11:31:55] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2121.codfw.wmnet [11:31:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2121.codfw.wmnet [11:32:22] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2118.codfw.wmnet [11:32:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2118.codfw.wmnet [11:35:10] (03PS1) 10Muehlenhoff: Finetune request dialogue [software/bitu] - 10https://gerrit.wikimedia.org/r/1103300 [11:36:15] !log 
jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2119.codfw.wmnet with OS bookworm [11:36:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2118.codfw.wmnet with OS bookworm [11:36:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2119 [11:36:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2119 [11:36:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2118 [11:36:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2118 [11:39:52] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 273951136 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:40:30] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:40:52] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 57504 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:41:47] (03PS1) 10Muehlenhoff: Allow the comment to be left empty in permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1103303 [11:44:32] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [11:45:22] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Docker [11:46:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host wikikube-worker2007.codfw.wmnet with OS bookworm [11:46:22] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:47:59] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2008.codfw.wmnet with OS bookworm [11:48:09] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2008 [11:48:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2008 [11:52:22] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:29] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2119.codfw.wmnet with reason: host reimage [11:54:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2118.codfw.wmnet with reason: host reimage [11:55:29] aokoth@cumin1002 aokoth: The backup on gitlab2002 is complete, ready to proceed with upgrade. [11:57:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2119.codfw.wmnet with reason: host reimage [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241213T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241213T1200). 
[12:01:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2118.codfw.wmnet with reason: host reimage [12:05:34] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [12:06:18] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Update [12:06:56] FIRING: ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:08:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2008.codfw.wmnet with reason: host reimage [12:09:44] !log bump build2002 to 400G T379343 [12:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:48] T379343: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343 [12:09:59] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 77, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:11:56] RESOLVED: ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2119.codfw.wmnet with OS bookworm [12:18:27] (03PS1) 10Muehlenhoff: matomo: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1103313 [12:20:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1103313 (owner: 10Muehlenhoff) [12:22:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2118.codfw.wmnet with OS bookworm [12:22:31] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:19] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2118.codfw.wmnet [12:24:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2118.codfw.wmnet [12:24:30] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:32] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2119.codfw.wmnet [12:24:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2119.codfw.wmnet [12:27:03] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:27:09] (03CR) 10Slyngshede: [C:03+1] "Seems reasonable." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1103303 (owner: 10Muehlenhoff) [12:27:11] (03CR) 10Slyngshede: [C:03+2] Allow the comment to be left empty in permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1103303 (owner: 10Muehlenhoff) [12:27:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2117.codfw.wmnet [12:28:27] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:28:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2008.codfw.wmnet with OS bookworm [12:30:41] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2009.codfw.wmnet with OS bookworm [12:30:50] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2009 [12:30:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2009 [12:31:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2117.codfw.wmnet [12:32:09] (03PS1) 10Muehlenhoff: webperf: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1103318 [12:33:47] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2116.codfw.wmnet [12:34:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2116.codfw.wmnet [12:34:29] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:01] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:27] RECOVERY - BGP status on 
lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1103318 (owner: 10Muehlenhoff) [12:43:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2116.codfw.wmnet with OS bookworm [12:43:04] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2117.codfw.wmnet with OS bookworm [12:43:07] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:23] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2117 [12:43:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2117 [12:43:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2116 [12:43:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2116 [12:46:35] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:59] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2117.codfw.wmnet with reason: host reimage [13:01:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2116.codfw.wmnet with reason: host reimage [13:03:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2117.codfw.wmnet with reason: host reimage [13:06:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker2116.codfw.wmnet with reason: host reimage [13:10:11] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:19:30] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2117.codfw.wmnet with OS bookworm [13:23:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:25:45] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2116.codfw.wmnet with OS bookworm [13:28:10] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [13:35:37] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:38:14] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: 
sync [13:44:12] (03Abandoned) 10JMeybohm: Revert: Ratelimit a hotlink saturation case [puppet] - 10https://gerrit.wikimedia.org/r/924550 (owner: 10JMeybohm) [13:45:51] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [13:45:57] (03CR) 10Muehlenhoff: "Debian also provides a second Rust build in parallel to the default 1.63 one. It used to build Chromium and Firefox, which are updated in " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [13:48:45] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [13:48:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM build2002.codfw.wmnet [13:49:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [13:52:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM build2002.codfw.wmnet [13:56:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host build2002.codfw.wmnet with OS bookworm [13:56:23] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10403077 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm [13:58:04] (03CR) 10Filippo Giunchedi: [C:03+1] webperf: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1103318 (owner: 10Muehlenhoff) [13:58:49] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:03:29] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:09:16] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 477729920 and 25 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:10:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 131464 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:12:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2009.codfw.wmnet with OS bookworm [14:12:38] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [14:13:33] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:13:53] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2010.codfw.wmnet with OS bookworm [14:14:02] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2010 [14:14:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2010 [14:14:16] (03PS2) 10Jforrester: Provide a base image for Rust 1.78, based on Bookworm using 'rustc-web' [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) [14:14:17] (03CR) 10Jforrester: Provide a base image for Rust 1.78, based on Bookworm using 'rustc-web' (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [14:15:09] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:15:41] (03CR) 10Muehlenhoff: Provide a base image for Rust 1.78, based on Bookworm using 'rustc-web' (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) 
[14:16:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [14:17:38] PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:21:09] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2117.codfw.wmnet [14:21:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2117.codfw.wmnet [14:21:24] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2116.codfw.wmnet [14:21:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2116.codfw.wmnet [14:24:15] (03CR) 10Mforns: [C:03+1] analytics/html: update readme for MW history dump [puppet] - 10https://gerrit.wikimedia.org/r/1102848 (https://phabricator.wikimedia.org/T381390) (owner: 10Milimetric) [14:24:33] (03PS3) 10Jforrester: Provide a base image for Rust, based on Bookworm using 'rustc-web' now at 1.78 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) [14:25:13] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:25:40] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2115.codfw.wmnet [14:26:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2115.codfw.wmnet [14:26:21] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:26:53] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2114.codfw.wmnet [14:27:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node 
(exit_code=0) depool for host wikikube-worker2114.codfw.wmnet [14:28:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:29:26] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [14:29:35] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2114.codfw.wmnet with OS bookworm [14:29:42] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2115.codfw.wmnet with OS bookworm [14:29:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2114 [14:29:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2114 [14:30:00] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2115 [14:30:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2115 [14:31:30] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [14:31:31] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [14:31:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [14:33:47] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:21] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 383507440 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:34:45] (03CR) 10Jforrester: Provide a base image for Rust, based on Bookworm using 
'rustc-web' now at 1.78 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [14:35:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2010.codfw.wmnet with reason: host reimage [14:35:21] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 95640 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:35:21] (03PS2) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [14:36:28] (03CR) 10Jforrester: Provide a base image for Rust, based on Bookworm using 'rustc-web' now at 1.78 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1102983 (https://phabricator.wikimedia.org/T380807) (owner: 10Jforrester) [14:48:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2114.codfw.wmnet with reason: host reimage [14:49:24] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2115.codfw.wmnet with reason: host reimage [14:52:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2114.codfw.wmnet with reason: host reimage [14:54:40] RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2010.codfw.wmnet with OS bookworm [14:54:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[2007-2010].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [14:55:00] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2002.codfw.wmnet with OS bookworm [14:55:10] 06SRE, 06Infrastructure-Foundations: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10403233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm completed: - build2002 (**PASS**) - Downtimed on Icin... [14:56:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2115.codfw.wmnet with reason: host reimage [14:58:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 195437328 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:58:46] (03CR) 10Clément Goubert: [C:03+1] "https://phabricator.wikimedia.org/T379788 is Resolved and hosts have been completely decommissioned, I think you can go ahead." [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [14:59:22] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 91056 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:08:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:12:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2114.codfw.wmnet with OS bookworm [15:15:12] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [15:15:56] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:16:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2115.codfw.wmnet with OS bookworm [15:17:09] FIRING: [2x] ProbeDown: Service wdqs1025:443 has failed probes (http_wdqs_internal_main_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:17:43] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2115.codfw.wmnet [15:17:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2115.codfw.wmnet [15:18:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2114.codfw.wmnet [15:18:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2114.codfw.wmnet [15:18:49] (03PS1) 10Herron: thanos: remove max_item_size cache setting [puppet] - 10https://gerrit.wikimedia.org/r/1103352 [15:19:02] (03PS3) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [15:19:18] (03PS2) 10Herron: thanos: query-frontend: remove max_item_size cache setting [puppet] - 10https://gerrit.wikimedia.org/r/1103352 [15:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:20:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:21:13] (03PS1) 10Ladsgroup: mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/1103353 [15:21:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:24:04] (03PS2) 10Ladsgroup: mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/1103353 [15:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:26:00] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2113.codfw.wmnet [15:26:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2113.codfw.wmnet [15:26:53] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2112.codfw.wmnet [15:27:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2112.codfw.wmnet [15:28:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2112.codfw.wmnet with OS bookworm [15:28:20] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2113.codfw.wmnet with OS bookworm [15:28:39] !log jelto@cumin1002 START - 
Cookbook sre.hosts.move-vlan for host wikikube-worker2112 [15:28:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2112 [15:28:39] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2113 [15:28:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2113 [15:31:56] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:12] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [15:41:08] (03CR) 10CDanis: [C:03+1] haproxy:benthos: produce msg compatible with our schema guidelines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:45:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage [15:45:50] (03PS1) 10Herron: thanos: query-frontent: set labels.response-cache-config in systemd [puppet] - 10https://gerrit.wikimedia.org/r/1103364 [15:45:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage [15:46:03] (03CR) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:46:04] (03PS2) 10Herron: thanos: query-frontend: set labels.response-cache-config in systemd [puppet] - 
10https://gerrit.wikimedia.org/r/1103364 [15:48:16] (03PS1) 10Herron: thanos: query-frontend: enable query-range.align-range-with-step [puppet] - 10https://gerrit.wikimedia.org/r/1103365 [15:49:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2113.codfw.wmnet with reason: host reimage [15:50:15] (03PS1) 10Ottomata: httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) [15:52:15] (03CR) 10Ottomata: "Please advise as to the correct file to put this test in. I put it in appserver/test_main, but perhaps it should be its own standalone fi" [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:52:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2112.codfw.wmnet with reason: host reimage [15:54:35] (03CR) 10Herron: "according to https://github.com/thanos-io/thanos/blob/main/docs/components/query-frontend.md "if only max_size is set, then max_size_items" [puppet] - 10https://gerrit.wikimedia.org/r/1103352 (owner: 10Herron) [15:55:19] (03CR) 10Marostegui: [C:03+1] mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/1103353 (owner: 10Ladsgroup) [15:56:00] (03PS1) 10Bking: WIP: Add partman recipe for raid 0 with EFI [puppet] - 10https://gerrit.wikimedia.org/r/1103367 (https://phabricator.wikimedia.org/T378368) [15:57:01] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [15:59:05] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:03:21] (03PS3) 10Ladsgroup: mariadb: Add a link to wikitech doc in check_private_data_report [puppet] - 10https://gerrit.wikimedia.org/r/1103353 [16:03:44] (03CR) 10Ladsgroup: "I had to add -e to echo to make the 
backslash work" [puppet] - 10https://gerrit.wikimedia.org/r/1103353 (owner: 10Ladsgroup) [16:03:45] (03Abandoned) 10Dzahn: admin: add a yubikey SSH key to user dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1087509 (owner: 10Dzahn) [16:07:35] (03CR) 10Marostegui: "so the test works?" [puppet] - 10https://gerrit.wikimedia.org/r/1103353 (owner: 10Ladsgroup) [16:08:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2113.codfw.wmnet with OS bookworm [16:09:04] (03CR) 10Ladsgroup: "define test :D I tried the echo locally and noticed the bug and see that -e fixes it. I'm not sure I can test it better unless I create a " [puppet] - 10https://gerrit.wikimedia.org/r/1103353 (owner: 10Ladsgroup) [16:11:34] (03PS1) 10Elukey: charts: improve kartotherian's config.yaml configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103379 [16:12:02] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2017 / ganeti2018 - https://phabricator.wikimedia.org/T382114#10403394 (10Jhancock.wm) 05Open→03Resolved [16:12:12] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti2017 / ganeti2018 - https://phabricator.wikimedia.org/T382114#10403396 (10Jhancock.wm) a:03Jhancock.wm [16:13:58] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:14:19] (03PS2) 10Elukey: charts: improve kartotherian's config.yaml configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103379 [16:14:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2112.codfw.wmnet with OS bookworm [16:15:15] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893#10403401 (10Ladsgroup) I suggest asking a community member, ideally the original uploader to do 
this. @Multichill hiii, your bot uploaded these file... [16:15:33] (03PS3) 10Elukey: charts: improve kartotherian's config.yaml configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103379 [16:17:02] (03CR) 10Elukey: [C:03+2] charts: improve kartotherian's config.yaml configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103379 (owner: 10Elukey) [16:17:13] (03PS1) 10Dreamy Jazz: Exclude autopromotion of temp IP viewer for users with specific global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) [16:18:05] (03PS1) 10Bking: cloudelastic10[12]: add hosts to new efi-based partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1103381 (https://phabricator.wikimedia.org/T378368) [16:19:26] (03CR) 10CI reject: [V:04-1] cloudelastic10[12]: add hosts to new efi-based partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1103381 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [16:19:36] (03PS3) 10Bking: Add partman recipe for raid 0 with EFI [puppet] - 10https://gerrit.wikimedia.org/r/1103367 (https://phabricator.wikimedia.org/T378368) [16:19:51] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:25:41] (03PS1) 10CDanis: benthos: webrequest_live: fix unittest failure [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) [16:25:42] (03PS1) 10CDanis: WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [16:29:55] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:30:19] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:30:59] (03PS2) 10CDanis: WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [16:31:23] !log elukey@deploy2002 helmfile [staging] DONE 
helmfile.d/services/kartotherian: sync [16:31:40] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:32:44] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:33:09] FIRING: HelmReleaseBadStatus: Helm release kartotherian/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kartotherian - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:33:44] (03CR) 10CI reject: [V:04-1] WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [16:33:53] (03CR) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:34:02] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:34:08] (03PS4) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [16:35:06] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:35:30] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:36:34] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:38:09] RESOLVED: HelmReleaseBadStatus: Helm release kartotherian/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kartotherian - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus 
[16:38:12] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 29 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [16:39:53] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:40:08] (03CR) 10Ilias Sarantopoulos: [C:03+1] sre/ores: remove obsolete ORES cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: 10Klausman) [16:40:20] (03PS3) 10CDanis: WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [16:41:32] (03CR) 10Jforrester: "recheck; new version of puppet image in CI" [puppet] - 10https://gerrit.wikimedia.org/r/1103286 (owner: 10Marostegui) [16:41:57] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:42:29] (03CR) 10JHathaway: [C:03+1] Add partman recipe for raid 0 with EFI [puppet] - 10https://gerrit.wikimedia.org/r/1103367 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [16:42:39] (03CR) 10JHathaway: [C:03+1] cloudelastic10[12]: add hosts to new efi-based partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1103381 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [16:43:03] (03CR) 10Klausman: [C:03+2] sre/ores: remove obsolete ORES cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: 10Klausman) [16:43:04] (03CR) 10CI reject: [V:04-1] WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [16:45:16] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:45:35] (03CR) 10JJMC89: "missing `u4c-member`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [16:46:06] (03CR) 10Jforrester: "recheck" [puppet] - 
10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [16:47:20] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:48:15] (03PS5) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [16:48:40] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:49:16] (03CR) 10Scott French: "Thanks for adding this!" [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [16:49:34] (03Merged) 10jenkins-bot: sre/ores: remove obsolete ORES cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1102886 (https://phabricator.wikimedia.org/T379259) (owner: 10Klausman) [16:50:44] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:51:50] (03CR) 10Bking: [C:03+2] Add partman recipe for raid 0 with EFI [puppet] - 10https://gerrit.wikimedia.org/r/1103367 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [16:54:55] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:56:24] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2111.codfw.wmnet [16:56:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2111.codfw.wmnet [16:56:59] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:57:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2110.codfw.wmnet [16:58:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2110.codfw.wmnet [16:58:40] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:59:16] !log jelto@cumin1002 
START - Cookbook sre.hosts.reimage for host wikikube-worker2111.codfw.wmnet with OS bookworm [16:59:17] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2110.codfw.wmnet with OS bookworm [16:59:36] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2111 [16:59:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2111 [16:59:37] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2110 [16:59:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2110 [17:00:44] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:02:07] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:02:58] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:04:11] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:05:46] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:09:18] (03PS2) 10Bking: cloudelastic10[12]: add hosts to new efi-based partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1103381 (https://phabricator.wikimedia.org/T378368) [17:10:50] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:11:20] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:11:40] (03CR) 10Bking: [C:03+2] cloudelastic10[12]: add hosts to new efi-based partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1103381 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [17:12:54] (03CR) 10Dreamy Jazz: "I'm leaving this 
one out for now because it's not technically within the policy for this group to have the rights globally." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [17:15:09] (03CR) 10JJMC89: "I've had a ticket open with privacy@ about the same for two months." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1103380 (https://phabricator.wikimedia.org/T377929) (owner: 10Dreamy Jazz) [17:16:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:16:51] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2111.codfw.wmnet with reason: host reimage [17:17:12] (03PS1) 10Elukey: charts: fix and improve Kartotherian's configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103391 [17:17:16] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage [17:18:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [17:18:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10403619 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS bullseye [17:18:52] (03CR) 10Elukey: [C:03+2] charts: fix and improve Kartotherian's configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1103391 (owner: 10Elukey) [17:19:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on wikikube-worker2111.codfw.wmnet with reason: host reimage [17:21:09] (03PS4) 10CDanis: WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [17:21:33] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [17:24:17] (03CR) 10Jforrester: "check experimental" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/1100798 (owner: 10L10n-bot) [17:24:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2110.codfw.wmnet with reason: host reimage [17:26:37] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [17:30:08] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 8013MiB (2% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [17:36:40] (03CR) 10Jforrester: "check experimental" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [17:41:00] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2111.codfw.wmnet with OS bookworm [17:43:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2110.codfw.wmnet with OS bookworm [17:49:02] (03PS5) 10CDanis: WIP: Benthos tests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [17:49:03] (03PS2) 10CDanis: benthos: webrequest_live: fix unittest failure [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) [17:49:58] (03PS6) 10CDanis: Run profile::benthos::instance unittests in 
CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) [17:49:58] (03PS3) 10CDanis: benthos: webrequest_live: fix unittest failure [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) [17:50:20] (03PS2) 10Ottomata: httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) [17:50:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 179581096 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:50:28] (03CR) 10Ottomata: httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:51:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 10264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:52:05] (03PS3) 10Ottomata: httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) [17:53:10] (03CR) 10Ottomata: httpbb - add mediawiki.org/beacon/event test for legacy EventLogging beacon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:53:42] (03CR) 10CDanis: "12:50:19 Test 'modules/profile/files/benthos/instances/webrequest_live.yaml' succeeded" [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [17:54:19] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2110.codfw.wmnet [17:54:22] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2110.codfw.wmnet [17:54:29] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2111.codfw.wmnet [17:54:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2111.codfw.wmnet [17:56:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2112.codfw.wmnet [17:56:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2112.codfw.wmnet [17:56:22] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2113.codfw.wmnet [17:56:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2113.codfw.wmnet [18:01:03] (03PS4) 10CDanis: benthos: webrequest_live: fix unittest failure [puppet] - 10https://gerrit.wikimedia.org/r/1103382 (https://phabricator.wikimedia.org/T382156) [18:01:04] (03PS1) 10CDanis: run_ci_locally.sh: add user override of oci_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1103406 [18:04:20] (03CR) 10Giuseppe Lavagetto: [C:03+1] Run profile::benthos::instance unittests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [18:04:29] (03CR) 10Giuseppe Lavagetto: [C:03+1] run_ci_locally.sh: add user override of oci_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1103406 (owner: 10CDanis) [18:07:39] (03CR) 10CDanis: [C:03+2] Run profile::benthos::instance unittests in CI [puppet] - 10https://gerrit.wikimedia.org/r/1103383 (https://phabricator.wikimedia.org/T382156) (owner: 10CDanis) [18:07:41] (03CR) 10CDanis: [C:03+2] run_ci_locally.sh: add user override of oci_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1103406 (owner: 10CDanis) [18:08:44] (03CR) 10Scott French: [C:03+1] httpbb - add mediawiki.org/beacon/event test for 
legacy EventLogging beacon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1103366 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:08:47] (03PS6) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [18:09:05] (03CR) 10CI reject: [V:04-1] haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:15:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 888881264 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:16:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 106784 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:18:55] (03PS7) 10Fabfur: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) [18:25:35] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1011.eqiad.wmnet with OS bullseye [18:25:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10403717 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.... 
[18:26:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [18:26:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10403719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1... [18:55:43] (03PS8) 10CDanis: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [18:56:37] (03PS9) 10CDanis: haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [19:02:13] (03CR) 10SBassett: [C:03+2] "Verified via https://docker-registry.wikimedia.org/repos/sre/miscweb/security-landing-page/tags/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102989 (https://phabricator.wikimedia.org/T381430) (owner: 10Mstyles) [19:03:32] (03Merged) 10jenkins-bot: security-landing-page: deploying updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102989 (https://phabricator.wikimedia.org/T381430) (owner: 10Mstyles) [19:07:57] !log mstyles@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [19:08:24] !log mstyles@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:08:36] !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:08:59] !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:09:10] !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:09:32] !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:09:44] !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:09:46] !log 
mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:09:58] !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:10:01] !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:17:09] FIRING: [2x] ProbeDown: Service wdqs1025:443 has failed probes (http_wdqs_internal_main_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1025:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [19:20:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1025:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:21:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [19:21:17] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [19:21:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [19:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [19:27:24] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer 
(T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [19:27:28] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [19:27:49] (03PS6) 10CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 [19:28:31] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1011.eqiad.wmnet with OS bullseye [19:28:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10403829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.... [19:28:41] (03PS7) 10CDanis: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [19:28:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [19:29:58] (03PS8) 10CDanis: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [19:30:00] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins) [19:30:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 379186256 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:31:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 21512 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:34:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - 
https://phabricator.wikimedia.org/T381742#10403867 (10VRiley-WMF) Closing ticket for now. Feel free to re-open if it comes back [19:35:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10403869 (10VRiley-WMF) 05Open→03Resolved [19:36:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10403873 (10VRiley-WMF) 05Resolved→03Open [19:37:41] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10403893 (10VRiley-WMF) [19:37:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T382033#10403896 (10VRiley-WMF) →14Duplicate dup:03T381742 [19:41:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T381742#10403897 (10VRiley-WMF) Re-opened. Just merging these tickets. Would like to see if this issue is still present [20:19:11] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1025.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [20:19:15] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [20:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10403954 (10phaultfinder) [20:22:09] RESOLVED: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:31] 06SRE, 06Infrastructure-Foundations: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10403966 (10NBaca-WMF) Hi 
@Scott_French - thanks for looking into this and the summary! > Alright, it seems like there are two different issues intertwined here: Yes! thanks for separati... [21:19:30] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:35] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10404001 (10VRiley-WMF) 05Open→03Resolved [22:02:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10404029 (10VRiley-WMF) 05Open→03Resolved [22:46:12] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [22:54:48] (03PS1) 10Dzahn: wikistats: add a second timer to pull extended info [puppet] - 10https://gerrit.wikimedia.org/r/1103532 (https://phabricator.wikimedia.org/T381623) [22:55:38] (03CR) 10Dzahn: [C:03+2] wikistats: add a second timer to pull extended info [puppet] - 10https://gerrit.wikimedia.org/r/1103532 (https://phabricator.wikimedia.org/T381623) (owner: 10Dzahn) [23:20:14] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [23:25:14] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [23:56:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown