[05:28:37] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9943050 (Marostegui)
[08:00:33] New fastnetmon release, I don't think there is a need to upgrade - https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.2.7
[09:23:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9943538 (cmooney) So the change to the timeout has made a big difference, but there are still some small gaps: {F56165130} {F5616524...
[09:28:45] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9943566 (elukey)
[09:36:17] btw I only did the bullseye point release d-i update yesterday, if someone from I/F can do bookworm
[09:44:14] claime: definitely, just to be clear you ran update-netboot-image on puppetserver1001 right?
[09:44:59] elukey: By lack of knowledge, I did it on puppetmaster1001 first, then did it on puppetserver1001
[09:45:31] claime: no problem, I am asking since it is new for me too and I am reading https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_Foundations/Debian-installer#Updating_the_netboot_image
[09:46:01] I think it makes sense to remove update-netboot-image from puppetmaster hosts
[09:46:20] Probably yeah
[09:46:29] anyway, doing bookworm :)
[09:46:42] I explicited the need for a puppet run on the installservers afterwards as well
[09:52:27] claime: done!
[09:52:43] elukey: <3
[09:54:00] slyngs: the daily account consistency check reported "$name has shell access despite being disabled in LDAP" is a red herring, something clinic duty should act on or something you/mor.itz usually takes care of?
[10:04:41] Let me just check, I think those three are "special"
[10:05:01] thx
[10:08:15] Someone (I know who) in data engineering is checking up on those three accounts. They all left, but there was reason to believe that they might want to retain their account, but for that they'd need to provide updated email addresses.
[10:08:30] I'll ping for an answer by the end of the week.
[10:08:39] great, thanks a lot
[10:23:19] claime: lemme know when you have time if https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051317 would work
[10:35:43] elukey: I don't know if we have a way in puppet to know which puppetserver is the volatile master, so the script could just exit early and tell you to run it on the right server right away
[10:37:15] claime: that's adding a nerdsnipe on top of another one, not fair :D
[10:37:24] elukey: lol sorry
[10:37:29] joking, not sure if possible, but I can try later on :D
[10:38:16] but in any case, yeah, looks like it should work
[10:48:37] <3
[10:48:38] ----
[10:48:58] Generic question about the future of puppet-merge, I'll write some stuff as brainbounce
[10:49:19] Moritz opened https://phabricator.wikimedia.org/T366355, since the long term idea is to move away from puppet-merge and use a cookbook
[10:50:05] the main idea, IIUC, is that we want something more structured/integrated/stable than puppet-merge, but also that the script has never been tested on puppetserver nodes
[10:50:20] (running the puppet 7 stack)
[10:51:00] The other bit is that we'd need to move the puppet private repo commits from puppetmaster1001 to a puppetserver, as part of https://phabricator.wikimedia.org/T368023
[10:51:27] so overall the main idea is to move away from the puppet 5 infra, and start migrating people to puppetserver1001
[10:51:43] I have some doubts/concerns:
[10:51:49] how many things you want to couple together? the move to puppetserver1001 might be already enough :D
[10:52:01] lemme write :D
[10:52:18] lol
[10:52:48] * volans has to go for errand+lunch in 5m so will reply later, sorry
[10:53:25] 1) if we move puppet private to puppetserver1001 only, I can already see the use case of SREs trying to also puppet-merge on it. I'd do it, maybe because of a quick read of the long email sent to explain, or just because I am used to commit private and public in the same place
[10:54:46] 2) In theory puppet-merge could work on puppetserver, but it was never tested. IIUC John suggested against it, but I don't have any more context if there were big blockers or if it was more to prefer a cookbook. To be clear, I like the cookbook idea, but it may take a while so having puppet-merge usable on puppetserver would (imho) make people less confused.
[10:55:20] (so moving both private and public merges away from puppetmaster1001)
[10:56:43] To answer to v*olans, I think that we could do private only first, then move puppet-merge to a cookbook, and deprecate puppet-merge (script) completely
[10:57:23] if we do it, SREs will have to deal with the fact that we puppet-merge on puppetmaster1001 (old Puppet 5 infra) and private-commit on puppetserver1001 (new infra) for a while
[10:57:38] at least, this is my understanding :)
[10:57:53] Lemme know if I misunderstood something, and if you have preferences :)
[10:59:28] ah I forgot something - the above plan would include https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050607, so we remove puppet-merge from the new puppet 7 infra :D
[10:59:45] no accidental mistakes
[11:00:06] (or we could replace the script with an echo "please use puppetmaster instead")
[12:44:16] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:41] slyngs: yay ! https://github.com/TheDJVG/netbox-more-metrics/pull/24#issuecomment-2203119287
[13:12:10] Success :-)
[13:39:16] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:58:46] netops, Infrastructure-Foundations, SRE: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9944795 (cmooney)
[14:00:01] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9944811 (cmooney)
[14:15:56] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9944984 (cmooney)
[14:30:04] netops, cloud-services-team, Infrastructure-Foundations, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9945048 (aborrero) Open→Stalled marking as stalled, because the work on ceph nodes won't be progressing for a while.
[14:46:47] if anyone fancies a quick data.yaml review for clinic duty... :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051385
[14:50:01] volans: lgtm
[14:51:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945244 (Jhancock.wm) @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card.
[14:52:16] thx
[14:58:29] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945280 (cmooney) >>! In T367512#9945244, @Jhancock.wm wrote: > @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card. Awesome thank...
[15:12:22] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9945320 (cmooney) All seems ok following the increase: {F56173453 width=500} FWIW the scraping is now taking longer, indicating that...
[18:28:38] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9946538 (cmooney) @Jhancock.wm can you confirm what position in the rack the server is in? I assumed based on the first port it's in U45 so I...
[19:09:02] netops, Infrastructure-Foundations, SRE: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106 (cmooney) NEW p:Triage→Medium
[19:10:45] netops, Infrastructure-Foundations, SRE: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9946735 (cmooney)
[19:10:46] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9946736 (cmooney)
[20:16:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9947129 (cmooney)
[20:48:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9947208 (cmooney) Also @Jhancock.wm when next on site can you check the mgmt / idrac connection for this one? It doesn't seem to be trying to...
[21:00:39] Mail, Infrastructure-Foundations, SRE: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9947237 (jhathaway)
[21:01:13] Mail, collaboration-services, Infrastructure-Foundations, SRE: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9947239 (jhathaway)
[21:31:34] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9947342 (Jhancock.wm) a:VRiley-WMF