[04:47:12] (LVSHighRX) firing: Excessive RX traffic on lvs5004:9100 (enp94s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[04:52:12] (LVSHighRX) resolved: Excessive RX traffic on lvs5004:9100 (enp94s0f0np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[06:23:21] Domains, Traffic, DNS, SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (Mschon) >>! In T337586#8929595, @BCornwall wrote: > Yeah, the document is pretty barren. It sounds like there needs to be a little bit more planning! it looks like the planning...
[06:53:23] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (ayounsi)
[07:27:53] XioNoX: ^^ why is it a hard requirement?
[07:28:04] vgutierrez: ?
[07:28:15] upgrading our keys to ed25519
[07:28:53] I currently store my prod key on a yubikey, a rsa 4096 one, if I'm forced to move to ed25519 that will increase my attack surface
[07:29:44] at first it was because I thought we only tolerated RSA keys because of network devices, then we agreed it was better to only support modern keys
[07:30:01] vgutierrez: I didn't know yubikeys didn't support ED25519
[07:30:18] on a limited fashion
[07:30:25] my yubikey 4 doesn't support it at all
[07:30:38] yubikey 5 has limited support, sign only IIRC
[07:33:37] https://docs.yubico.com/hardware/yubikey/yk-5/tech-manual/yk5-apps.html#elliptic-curve-cryptographic-ecc-algorithms
[07:34:25] ed25519 (sign / auth only), time to upgrade? :)
[07:34:58] XioNoX: it's more complicated than that
[07:35:18] there's the native ed25519-sk key type which uses that, but it's only supported in openssh versions in bullseye and newer
[07:36:10] so until those are more widely supported, people have been using keys backed by their gpg keys in the yubikeys. they look like normal ssh-rsa keys to the servers so no newer openssh versions required, but iirc they don't support non-rsa keys
[07:36:22] interesting!
[07:36:23] taavi: hmmm that's part of the opengpg application, no need to update openssh AFAIK
[07:36:33] *OpenPGP
[07:37:07] vgutierrez: the current mechanism of using PGP for the keys are a part of the OpenPGP application, yes. but the new and fancy -sk key types are not
[07:37:20] taavi: right, I'm talking about the regular ones
[07:37:45] but I'm not sure if you can get a non-rsa ssh key from the PGP application
[07:38:16] taavi: according to https://docs.yubico.com/hardware/yubikey/yk-5/tech-manual/yk5-apps.html#elliptic-curve-cryptographic-ecc-algorithms you can
[07:38:38] assuming you got a yubikey 5 with FW >= 5.2.3
[07:38:41] interesting
[07:38:51] vgutierrez: anyway, if it can be solved with a more modern yubikey, it could be worth buying/expensing one, but anyway for now just mention it in the task, it's no big deal to keep some keys as RSA
[07:38:52] maybe you would need to generate a non-RSA PGP key then?
[07:40:02] taavi: it would be super weird cause I couldn't use ed25519 as my encryption key
[07:40:21] dunno if I can use a rsa 4096 as my encryption key and ed25519 for signature and auth purposes
[07:42:14] from: https://musigma.blog/2021/05/09/gpg-ssh-ed25519.html
[07:42:47] ed25519 for auth/sign and cv25519 for encryption
[07:43:57] XioNoX: I guess it's time to ask ITS for a yubikey 5 :)
[07:44:09] WMF never issued one for me AFAIK
[08:03:19] actually it did.. so it's just a matter of renewing my personal / backup yubikey
[08:46:35] hey folks, one qs - I ran pcc for role::cache::text ulsfo nodes and I got https://puppet-compiler.wmflabs.org/output/929963/41716/cp4037.ulsfo.wmnet/change.cp4037.ulsfo.wmnet.err
[08:46:51] seems related to a profile::base snippet, that checks facts
[08:47:25] in theory we should update facts regularly from puppetmasters nowadays (IIRC), so I am wondering if there is any gotcha with pcc and cp nodes
[08:48:00] Traffic, Community-Tech, MediaWiki-Parser, SRE, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (Winston_Sung)
[08:48:04] or maybe $facts['wmflib'] is not there on cp nodes
[08:49:47] change seems to be https://gerrit.wikimedia.org/r/c/operations/puppet/+/928657
[08:53:43] jbond: o/
[08:53:51] (if you have a min, you probably know)
[09:11:30] ah wait I see https://phabricator.wikimedia.org/T338961
[09:14:06] "The DB host processes facts on a daily basis using the pcc_facts_processor systemd timer.", so maybe it still needs to get updates
[09:15:13] trying to re-run it
[09:22:18] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8928451, @cmooney wrote: > @aborrero I discussed the idea of a [[ https://wikitech.wikimedia.or...
[09:28:57] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8930600, @aborrero wrote: > I think that's the `query-local-address`option. Upstream docs: Tha...
[09:29:35] looks better now, but pcc still fails with cp4037 (not with other cp nodes though)
[09:46:10] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[09:46:35] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (ayounsi) In progress→Resolved a:ayounsi With row D upgraded, I couldn...
[09:46:53] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8930604, @cmooney wrote: > >> Could you describe the setup you have in mind? Would it be a sta...
[09:47:05] elukey: could be related to cp4037.yaml?
[09:49:34] vgutierrez: trying, but in theory I'd expect the same failure
[09:50:40] yeah same
[09:50:47] I think that the pcc situation is not fixed yet
[09:53:02] ack
[09:53:44] XioNoX: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/929998/ :) that should do it
[09:53:52] awesome!
[09:54:23] I might wait to have a few more to deploy them in batches as it's a bit time consuming
[09:54:46] no rush
[09:56:39] elukey: sorry missed the ping earlier looking now
[09:57:54] <3
[09:58:09] jbond: there seems to be a task already opened, I linked it above
[09:59:45] elukey: i dont think that should be affecting this host
[10:02:02] jbond: yeah but I get the error for profile::base:56, that's where the line is
[10:02:23] yes im just looking, i see the fact on the file system, just checking puppetdb now
[10:34:57] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi) cr1<->row D is now operational on the new 40G link @Jclark-ctr Those 4 SMF cables can now be remo...
[10:41:03] Traffic, Maps, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (MSantos) @Galessandroni and @Elitre per https://wikitech.wikimedia.org/wiki/Maps/External_usage I filled T339102 wh...
[11:11:37] elukey: should be fixed now. it did just need another facts upload but i had to go down a few rabbit holes before hitting the clue stick and working that out
[11:11:41] https://puppet-compiler.wmflabs.org/output/929963/41725/cp4037.ulsfo.wmnet/index.html
[11:11:46] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) >>! In T307357#8930653, @aborrero wrote: > > I think I'm proposing this: > Talking with @taavi on IRC, he p...
[11:12:11] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) >>! In T307357#8930653, @aborrero wrote: > My point is that we could go with the 2 public IPv4 addresses for bo...
[11:13:53] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Ok, I think we are in the same page!
[11:17:18] netops, Infrastructure-Foundations, SRE-Sprint-Week-Sustainability-March2023, ops-eqiad: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (ayounsi)
[12:51:57] jbond: you rock thanks!
[12:53:39] no probs
[13:34:35] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (Jhancock.wm) @aborrero the patch changes have been made and the server is currently connected from eno1 to cloudsw ge-0/0/11
[14:14:45] netops, Infrastructure-Foundations, SRE: test_matching_vlan() function crashig in Netbox network report - https://phabricator.wikimedia.org/T339133 (cmooney) p:Triage→Low
[14:20:23] Traffic, SRE: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (ssingh)
[14:36:53] Traffic, SRE, ops-codfw, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (Jhancock.wm)
[14:37:55] Traffic, SRE, ops-codfw, Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (Jhancock.wm) In progress→Resolved servers have been removed from the racks but left in the hot aisle of row D. they will be moved to storage after the recy...
[14:50:04] Domains, Traffic, DNS, SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (BCornwall) In progress→Invalid Thanks for that extra bit of information, @Mschon. I'm going to close this as INVALID until there's something here to service.
[16:21:02] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:Papaul→aborrero
[16:31:48] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 3 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:aborrero→Jhancock.wm Hey @Jhancock.wm and @Papaul : https://netbox.wikimedia.org/dcim/devices/4143/interfaces/...
[16:39:55] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (aborrero) a:Jhancock.wm→aborrero >>! In T338778#8932260, @aborrero wrote: > Hey @Jhancock.wm and @Papaul : > > https://netb...
[17:11:49] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 4 others: cloudservices2004-dev: reimage into new network setup - https://phabricator.wikimedia.org/T338778 (cmooney) >>! In T338778#8932283, @aborrero wrote: >>>! In T338778#8932260, @aborrero wrote: >> Hey @Jhancock.wm and @Papaul : >> >>...
[17:16:14] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney)
[17:16:38] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney) Open→Resolved a:cmooney Change is now live on all relevant Juniper devices.
[18:41:46] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (BCornwall) Is this to say that the existing cookbook already suffices? If so, it sounds like the actionable here is to update the documentation to reflect Clement'...
[19:34:41] Traffic, Patch-For-Review: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (KOfori)
[19:38:12] Domains, Traffic, DNS, SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (NMariano-WMF) Invalid→Open
[19:46:13] Domains, Traffic, DNS, SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (NMariano-WMF) Adding @Abit @Bmueller & @AAlikhan for visibility into this and to answer any questions or concerns that may come up. The Comms team is requesting a subdomain of...
[19:50:52] Domains, Traffic, DNS, SRE: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (Dzahn) Per above, I suggest to rename this ticket to buy a domain like wikimedia.social. That would be via the Legal team. Then once you have that you can let us know and we (SRE...
[20:47:30] Traffic, Continuous-Integration-Infrastructure: CI failing with "No space left on device" (debian-gule) - https://phabricator.wikimedia.org/T339171 (ssingh)
[20:47:58] Traffic, Continuous-Integration-Infrastructure: CI failing with "No space left on device" (debian-gule) - https://phabricator.wikimedia.org/T339171 (ssingh) p:Triage→Medium
[21:52:19] Traffic, Infrastructure-Foundations, SRE, SRE-tools: Write a cookbook to roll reboot cache hosts - https://phabricator.wikimedia.org/T338783 (BCornwall)
[21:52:25] Traffic, SRE: Create a cookbook to reboot CDN hosts - https://phabricator.wikimedia.org/T338813 (BCornwall)
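The SSH key discussion in this log (T336769, moving network device keys to ssh-ed25519 while allowing some RSA keys to remain) comes down to which key type each public key declares. As a hypothetical sketch, not part of homer or any existing Wikimedia tooling, the helper below reads public keys in plain `.pub` format (`<type> <base64> [comment]`, assumed to carry no `authorized_keys` option prefix) and lists every key that is not `ssh-ed25519` or the FIDO-backed `sk-ssh-ed25519@openssh.com` type. The set of accepted types is an assumption for illustration, not a stated policy.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical allow-list for this sketch; not an official WMF policy list.
ED25519_TYPES = {
    "ssh-ed25519",                 # plain Ed25519 key
    "sk-ssh-ed25519@openssh.com",  # FIDO/U2F-backed Ed25519 (needs OpenSSH 8.2 or newer)
}


def key_type(line: str) -> Optional[str]:
    """Return the key-type field of a .pub line, or None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    return line.split()[0]


def non_ed25519_keys(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (key_type, comment) for each key that is not an Ed25519 variant.

    Assumes plain .pub format: "<type> <base64> [comment]" with no
    authorized_keys options prefix (a from="..." prefix would be misread).
    """
    flagged = []
    for line in lines:
        ktype = key_type(line)
        if ktype is None or ktype in ED25519_TYPES:
            continue
        # maxsplit=2 keeps a comment that contains spaces intact.
        fields = line.strip().split(maxsplit=2)
        comment = fields[2] if len(fields) > 2 else ""
        flagged.append((ktype, comment))
    return flagged


if __name__ == "__main__":
    # Placeholder key material; only the type and comment fields matter here.
    keys = [
        "ssh-ed25519 AAAAplaceholder alice@laptop",
        "ssh-rsa AAAAplaceholder bob@yubikey-openpgp",
    ]
    print(non_ed25519_keys(keys))  # [('ssh-rsa', 'bob@yubikey-openpgp')]
```

As noted in the log, SSH keys derived from a YubiKey's OpenPGP applet with an RSA key present themselves to servers as ordinary `ssh-rsa`, so this check would flag them; per the discussion, such keys could simply be recorded in the task as accepted exceptions.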