[06:45:15] It got rolled to esams as it was the DC getting the most ICMP traffic, and eqiad/codfw as they were the core sites (and esams fallback), then drmrs probably just forgotten. "not needed for now" was for the esams migration, to not have too many moving parts during the deployment as it's not critical infra. And indeed it would be nice to have a "is it still needed?" chat, knowing that it will probably go away with the future l4lb
[07:02:05] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) Please open a new task for that. There is already a [[ https://github.com/wikimedia/operations-cookbooks/blob/ma...
[07:46:35] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (ayounsi) Thanks, we had a quick chat on IRC about that and indeed that's the current conclusion. The extra details your provided (and fix suggestions...
[08:05:09] Traffic, SRE: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (fgiunchedi) Followup from IRC: it isn't clear whether ping offload should be fully rolled out everywhere (some PoPs are missing) or retired entirely, cc @cmooney @ayounsi
[08:21:48] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey) @cmooney @ayounsi thanks a lot! On the host side, I'd try two things (not sure if they could help or not): 1) Do a simple reboot. The hosts...
[08:27:07] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney) p:Triage→Low
[08:40:06] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) Hi # TL;DR cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bum...
[08:42:30] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel ver...
[08:45:31] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) There's a few more actionables here: 1. Re-evaluate our SLO target for conf hosts etcd service. Despite having exhausted the error budget...
[08:46:43] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney)
[08:53:13] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (fgiunchedi) >>! In T345738#9148513, @akosiaris wrote: > Hi > > # TL;DR > > cadvisor is to blame. Adding @fgiunchedi for his information and a thumb...
[08:59:32] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (JMeybohm) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> cad...
[08:59:54] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) >>! In T345738#9148572, @fgiunchedi wrote: >>>! In T345738#9148513, @akosiaris wrote: >> Hi >> >> # TL;DR >> >> ca...
[09:10:56] Traffic, netops, Infrastructure-Foundations, SRE: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (cmooney) p:Triage→Low
[10:27:43] Hello! I just learned about this from Hacker News.
[10:30:04] hello!
[11:18:48] netops, Infrastructure-Foundations, SRE, serviceops, Patch-For-Review: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) p:Triage→Medium Given this isn't urgent and we have multiple ways of dealing with this, I 've re-enabled pup...
[12:04:53] Traffic, SRE: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (fgiunchedi) Related: {T345743}
[12:11:33] txtsd: welcome! out of curiosity from which post ?
[12:28:24] I got curious as well and ran into https://news.ycombinator.com/item?id=37412915 a few minutes ago :-)
[12:29:38] cheers moritzm
[16:00:45] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (Papaul) We will be using to test the new codfw spine/leaf new design contint2001 and thumbor2004. contint2001 will be rename to sretest...
[16:41:50] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney) Thanks @Papaul !
[17:49:54] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ssingh)
[17:51:39] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[17:52:05] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[18:01:37] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (BCornwall)
[18:04:52] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (ssingh) As per @BCornwall's comment above, it does seem like we have more issue with 9.2.1 that we should look into: ` sukhe@cp4052:~$ /usr/bin/traffic_server --version Traffic Server 9.2.1 Jun 14 2023 18:20:20 localhost ` `...