[02:14:09] Traffic, SRE, Patch-For-Review: Package libvmod-re2 for Debian 12/Bookworm - https://phabricator.wikimedia.org/T345663 (ssingh) The above patch fixes the issue with CI passing. Once reviewed, we can merge and close this task.
[07:20:28] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[08:49:50] netops, Infrastructure-Foundations, SRE: Automate Netbox additions for new spine/leaf L3 networks. - https://phabricator.wikimedia.org/T333441 (cmooney) Open→Resolved I'm gonna close this for now. I used the following tooling to create the necessary in eqiad/codfw for recent expansion. An i...
[08:49:56] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[08:52:01] netops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (cmooney) Indeed blocked on the optics arriving, but to clarify the cable runs have been done we just need the optics to slot in and connect. @Jclark-ctr correct...
[09:41:13] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (cmooney) p:Triage→Low
[09:45:05] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (Volans) SGTM, I would even consider a shorter time span :)
[10:40:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[12:37:08] netops, Infrastructure-Foundations, SRE: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (cmooney) Open→Resolved
[13:27:26] netops, Infrastructure-Foundations, SRE: Set idle-timeout for Juniper logins - https://phabricator.wikimedia.org/T345710 (ayounsi) I thought that was not possible but it got introduced recently (in 16.1). +1
[13:37:20] netops, Infrastructure-Foundations, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey)
[13:39:03] netops, Infrastructure-Foundations, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (elukey) Surely not related but I noticed that the conf2xxx nodes hold a ton (8/9k) sockets in TIME_WAIT, most of them related to nginx -> etcd local traffic....
[14:23:37] hi, I noticed the ping hosts in esams haven't been (re)provisioned yet, is there a plan/task I could follow ? I'm triaging alert linting problems https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=name%3DPingOffloadMissingIP
[14:28:29] probably just overlooked
[14:29:54] fair enough, I'll file a task
[14:31:02] Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (fgiunchedi)
[14:31:09] {{done}}
[14:31:45] thanks!
[14:32:08] Traffic: reprovision ping VM in esams - https://phabricator.wikimedia.org/T345743 (fgiunchedi)
[14:32:21] sure np
[14:33:59] yeah thanks for filing, we had some discussion about this not sure why it got list
[14:34:03] *lost
[14:40:50] I think it was intentional, the tracking task we used to recreate the VMs states "Not needed for now" https://phabricator.wikimedia.org/T344355
[14:41:30] aah so this was it
[14:41:38] I had some recollection of where I read it but wasn't sure
[14:42:38] was "not needed for now" just sort of an "eh the site works fine without it, we don't need it during the initial rush of work?"
[14:42:46] or like, we don't think we want to keep doing ping offload at all?
[15:28:17] I'm not sure, XioNoX or topranks are best to comment on that
[15:29:44] bblack: I'll let XioNoX comment as he's a better sense of it than me
[15:29:48] it's something of a judgement call
[15:30:00] the other POPs, drmrs, ulsfo, eqsin, don't have ping VMs
[15:30:18] so from a "keep things standard" point of view not having it in esams makes sense
[15:30:36] But there is the wider question as to whether we should have ping offload at POPs
[15:31:01] well, we should probably have it either everywhere or nowhere, IMHO
[15:31:02] we've not had any issues to my knowledge at the other ones without it
[15:31:11] that seems reasonable to me
[15:31:52] IIRC, we did it because so many people were using "ping wikipedia.org" in various automations and scripts to check "is the internet working", that it could overwhelm reasonable ICMP ratelimiting on our edge hosts, and thus interfere with delivery of more-useful ICMPs.
[15:31:53] I wasn't around when we first introduced them, not sure if circumstances have changed in terms of traffic patterns to make us reconsider now
[15:32:13] bblack: yeah that's as I understood it
[15:32:16] so this was to move all that icmp-echo traffic elsewhere and let the edge hosts focus on dealing with cases like PTB
[15:32:30] but for instance we don't do it for v6 anywhere, so a bunch of "ping wikipedia.org" is gonna not get offloaded already
[15:32:37] yeah
[15:33:16] the determinant should probably be based on the ratelimiter stuff (since way back when we first saw this - does the kernel icmp ratelimiting stuff even still work the same? are we seeing enough ping volume that it will cause a problem there at any sites?)
[15:33:54] but yeah with geodns and site failovers and such, I would just make one global decision on whether we should keep doing it at all.
[15:35:24] That makes sense to me
[15:35:54] Looking at the original task (T190090) it was missing at the POPs as they didn't have ganeti clusters originally
[15:35:54] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090
[15:36:00] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (akosiaris) So, `ethtool -G eno1 rx 1000` apparently did the [trick](https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=con...
[15:36:13] but only ever got rolled out to esams afterwards
[16:24:22] netops, Infrastructure-Foundations, SRE, serviceops: etcd in codfw burned all latency SLO error budget - https://phabricator.wikimedia.org/T345738 (cmooney) We removed switch asw-b1-codfw as it no longer had any servers connected (they were moved to cloudsw1-b1-codfw). The correlation between th...
[17:06:57] Traffic, SRE: Package libvmod-re2 for Debian 12/Bookworm - https://phabricator.wikimedia.org/T345663 (BCornwall) In progress→Resolved Thanks, sukhe!
[17:07:06] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[17:09:10] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[17:10:59] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[22:37:54] netops, Infrastructure-Foundations, SRE, SRE-tools, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Papaul) Thank you for putting the summary together. Another scenario I was thinking about while reading the document is up...