[01:54:16] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:34] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:04:16] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:34] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:55:54] Who can replace Moritz in reviewing this change? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051366 (+parent), first change has been tested manually. Only impacts routing ganeti stuff, so no impact to prod
[06:15:48] haha also don't get too used to the Netbox 4.0 UI changes as it might change again with 4.1 :) https://github.com/netbox-community/netbox/discussions/16777
[07:34:00] arnaudb: thx!
[07:34:28] np!
[07:37:01] Open for reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051458/ as well if anyone feels like it :)
[07:59:04] XioNoX, topranks: FYI on July 16th there will be 2 transport maintenances at the same time, Lumen's eqiad-codfw and Arelion eqiad-eqord.
[08:04:15] volans: thanks! shouldn't be an issue, but good to know. How long do they overlap?
[08:05:03] pretty much the whole time, 4-7am one and 4-8am the other
[08:05:38] noted, thx
[08:05:55] they are in the calendar ofc, I just added the second one, hence the overlap detection ;)
[08:06:33] who needs AI when we have volans?
[08:07:17] lol
[08:08:44] a much better alternative agreed
[08:08:52] until volans starts hallucinating :P
[08:16:48] XioNoX: in terms of the ganeti patch what's the reason to introduce a public /32 ?
[08:17:38] topranks: waste an IP to make DHCP happy :( https://phabricator.wikimedia.org/T362330
[08:19:26] hmm ok
[08:19:43] I guess I don't get why the dhcp server likes the public IP in the BOOTPREQUEST, but doesn't work with the private one?
[08:20:12] I guess we have two subnets configured, one covering the private /32 range, one the public?
[08:20:39] And the BOOTPREQUEST coming in needs to be part of the right one, regardless of our snippet?
[08:21:37] All good anyway, +1 from me, was just curious
[08:24:02] the DHCP relay adds the source IP of the request as bootprequest, to "help" the DHCP server know which network the client is in (usually it's the gateway IP on the client's network). But here, because of our setup, that info is counter-productive as it's just a virtual IP.
[08:26:15] ok...
but basically the first match is comparing the IP in bootprequest to the configured subnets in dhcpd.conf
[08:26:58] yeah exactly
[08:27:03] I wonder if we couldn't just have a single subnet in dhcpd.conf, for 0.0.0.0/0, and put all the config in the snippet we add when we are bringing something up
[08:27:24] not a big deal anyway
[08:27:38] possibly
[08:27:45] and thankfully now automated - I went to add the 16 new codfw vlans yesterday and was pleasantly surprised :)
[08:28:45] all kudos to brouberol
[08:28:58] same for the preseed config
[08:29:10] indeed yep I sent him a msg this morning :)
[08:31:04] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9948184 (elukey) Reporting some thoughts from IRC: ` 10:48 Generic question about the future of puppet-merge, I'll writ...
[08:31:58] elukey: sorry I didn't manage to reply yesterday, I'll get back to reply for puppet merge after the current meeting
[08:32:36] volans: you are not supposed to answer! Don't worry :)
[08:32:57] it was more to kick off the conversation and start thinking about it
[08:33:55] XioNoX topranks: <3 back
[08:34:13] what's the timeline to get rid of puppetmaster1001?
[08:35:37] if there are any Debian packaging experts around: https://phabricator.wikimedia.org/T369136
[09:03:12] topranks: thx for the review btw, looks like it's working well, testvm2008.wikimedia.org is at the first puppet run stage
[09:03:26] nice!!
[09:06:21] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9948455 (Aklapper)
[09:39:51] volans: re timeline for puppetmaster1001 - I think we need to get rid of the last Puppet 5 nodes first (like maps etc..), so not immediately.
Moritz wanted to start using puppetserver1001 for some things, so people get used to it basically
[09:40:01] and the first use case would be the private repo
[09:42:23] as long as we force people to run the private commits in the correct place it could be doable, *but* keep in mind also all the current use cases for the private repo
[09:43:13] such as cloud IPs auto-updater, cergen, swift ring manager (IIRC)
[09:44:10] also the usage of PUPPET_PRIVATE_REPO_PATH in the decommission and move-vlan cookbooks
[09:44:57] XioNoX: there is the option of using dgit https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI, maybe we could just create a mirror in gitlab of the ipxe repo and commit the debian stuff from the upstream package to it
[09:46:01] it shouldn't be too difficult to do, I don't have a lot of experience with dgit but this use case seems optimal
[09:47:57] volans: don't know much about the auto-updater, for cergen the idea is to generate the files on puppetmaster1001 and copy over if needed (in theory only a few services, if any, should still use cergen). Agree that some review of the use cases is needed, I was more concerned about the confusion of using puppetserver1001 for one thing and puppetmaster1001 for another one
[09:48:47] yeah that too, but as long as the private repo is not anymore on puppetmaster1001 or trying to commit fails hard with a clear message
[09:48:50] netops, Infrastructure-Foundations, SRE: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358#9948690 (cmooney) Open→Resolved Gonna close this one as the design is finalised, see detail on wikitech here: https://wikitech.wikimedia.org/wik...
[09:49:16] it might be ok (although not ideal because for sure is confusing)
[09:51:30] I have no idea how to not have the private repo on puppetmaster1001, but in theory it shouldn't be difficult
[09:52:17] elukey: how about sending a PR to ipxe to add the debian repo that's in the source archive? but does the license permit it? is it the proper way of doing things?
[09:54:10] XioNoX: could work as well, but we wouldn't have our own versioning/changelog etc.. in theory having our own repo should be very quick and easy, no licensing issues etc..
[09:54:37] (don't think we'd have any with a PR to ipxe too)
[09:55:16] do we need our own versioning/changelog?
[09:55:30] yep, like using wikimedia apt repo names etc..
[09:55:42] and also versions, we'd be free to quickly change those
[09:55:56] of course waiting for debian to finally release the package itself
[09:56:10] if it is not urgent I can try to work on it later on this week or the next
[09:58:08] elukey: thanks, not urgent :)
[10:14:44] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9948760 (elukey)
[10:15:59] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9948763 (elukey) Keeping archives happy - on IRC we discussed the use case and [[ https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI | dgit ]] could be a good fit in my opinion (until D...
[13:09:04] SRE-tools, Infrastructure-Foundations, SRE: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989#9949514 (elukey) To keep archives happy: T360356#9949479 We filed a proposal to basically implement sudo_pair "socially", as starting experiment. While at it...
[13:23:54] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, Spicerack, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9949559 (elukey) Proposed plan: * In T368023 we move the private repo to puppetserver1001, and we add a git pre-commit hook confi...
[13:28:18] hello! I come to be a bother and ask if I could build and package `gping` to install on our cumin nodes as I find it quite useful
[13:28:59] huh gping TIL
[13:29:02] looks pretty cool
[13:29:18] ping is unreliable anyway :D
[13:29:23] indeed it is :)
[13:29:35] arnaudb: what problem are you trying to solve?
[13:29:39] also we should get 'mtr'
[13:30:30] I have a habit, for instance like in T365994 where I'll have 3 nodes down to send a gping with a string of nodes that I want to watch and I'll have a ncurses graph rendered on the fly for this
[13:30:30] T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994
[13:30:39] it's handy to keep an eye on stuff from peripheral vision
[13:30:46] topranks: we have mtr on bast nodes but not elsewhere
[13:31:01] cdanis: indeed yes I'm aware
[13:31:08] ah ok :)
[13:31:11] arnaudb: I also use gping and quite like it
[13:31:12] I've long since resigned to use shitty old traceroute on the others :P
[13:31:23] hahaha just add it to base packages
[13:31:25] which works just fine for what I need but isn't as pretty
[13:31:45] I usually have several tmux panes with pings but it's less visual than a graph x)
[13:31:55] topranks: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/base/manifests/standard_packages.pp
[13:31:56] tbh never delved into what the policy was, I can appreciate we might want to avoid putting every tool in the history of the universe on our hosts
[13:32:22] topranks: that specific gate keeper is on sabbatical...
[13:32:35] when the cat's away ...
:p
[13:32:46] * arnaudb installs ALL the things
[13:32:47] * topranks starts writing his list hahahaha
[13:33:21] and the eternal struggle of inclusionist vs deletionist continues ;P
[13:33:46] Often when doing network maintenance I have multiple windows open with ping running to keep an eye on several systems, gping on one pane does look useful
[13:34:21] I was using it to test network while tinkering with my quagga configs in a past life x)
[13:34:31] btw arnaudb if you like pretty graphs you will like `btop`
[13:34:43] is it anything like bpytop?
[13:34:49] it's a faster C++ port of that
[13:34:53] and we autoinstall it on bookworm+ now
[13:35:02] 😍
[13:35:10] omg btop
[13:35:18] I feel like I'm a hacker in a mission impossible movie :)
[13:35:21] you made my day cdanis
[13:35:32] * arnaudb types very fast random strings with shades on
[13:36:46] +1 to that
[13:38:06] on my laptop (bookworm) gping isn't in repos :(
[13:38:22] yeah it's not packaged for debian yet
[13:38:30] topranks: you have to cargo install
[13:38:30] is uh
[13:38:40] is rust packaging in debian as annoying as golang is
[13:38:49] I also recommend trying zellij while you're DoSing your CPU :D
[13:42:43] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949632 (JMeybohm)
[13:44:08] viddy is a pretty cool take on 'watch' https://github.com/sachaos/viddy/blob/master/images/demo.gif
[13:45:01] 👀
[13:45:02] that is pretty neat
[13:45:14] go make debian packages ;)
[13:45:22] ↑
[13:47:23] do we have a wikitech page for "cool CLI tools" or some such? That might be a fun topic for SRE mtg as well
[13:49:08] feel free to start one. if you're unsure about where to put it you can always use your userspace
[13:49:26] what about a git repo inflatador ?
[13:49:44] like the awesome lists on gh
[13:50:56] https://gitlab.wikimedia.org/arnaudb/awesome-wmf-tools → if needed
[13:51:12] ooh, ALF!
[13:51:48] I never went back on melmac
[13:52:26] I'll get started on a wikitech page, but I think it should link back to a repo...easier to keep those updated
[13:52:35] lgtm :)
[13:52:46] feel free to move the repo around as I don't know where to store it 😬
[13:52:51] FWiW I hosted a "favorite tools" presentation for DPE https://etherpad.wikimedia.org/p/dpe-favorite-tools
[13:53:17] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949642 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c8dbb89d-640c-4078-bc10-bbbe9c30f3ef) se...
[13:56:12] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949650 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=753739a5-e1fb-44b6-9174-f7b3a8c4b73b) se...
[13:58:55] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949656 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=185956f6-b0e6-4a89-9e32-6a8223f5678e) se...
[14:00:06] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949655 (JMeybohm) !log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad...
[14:01:25] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949662 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=11036a9f-0b48-4b07-9e63-571b4f67c201) se...
[14:09:55] updated https://gitlab.wikimedia.org/arnaudb/awesome-wmf-tools#awesome-wmf-tools
[14:22:09] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949750 (cmooney) Switch is back up, all looks good at first glance from the network side.
[14:25:11] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949772 (ABran-WMF) db hosts as well, repooling
[14:33:14] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949834 (JMeybohm) >>! In T365994#9949655, @JMeybohm wrote: > !log jayme@cumin1002 conftool action : set/pooled=no...
[14:38:48] cdanis: https://phabricator.wikimedia.org/T220836 :)
[14:45:23] arnaudb awesome! The wikitech page exists now (more or less) https://wikitech.wikimedia.org/wiki/User:Bking/cli-tools
[14:46:33] "Awesome WMF tools" is a better title, will adjust WT page
[14:46:51] I've updated the page w/ the repo link, thanks inflatador
[14:47:33] * inflatador wonders if we can include gitlab README.md in the WT page
[14:49:46] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9949915 (Eevans) >>!
In T365994#9949750, @cmooney wrote: > Switch is back up, all looks good at first glance from...
[15:08:21] created a plan for the puppet private migration, hope that it makes sense https://phabricator.wikimedia.org/T368023#9949992
[15:48:54] XioNoX: was looking at anycast-hc for other reasons and found https://github.com/unixsurfer/anycast_healthchecker?tab=readme-ov-file#prometheus-exporter
[15:49:20] the new version has support for exporting prometheus metrics
[15:49:36] can be a nice pairing with bird_exporter
[15:52:25] sukhe: for sure yeah!
[15:52:46] sukhe: while I think about it, there is also https://phabricator.wikimedia.org/T311618
[15:55:19] yep, let's talk about this tomorrow (this was the main item under improving anycast monitoring :)
[15:56:20] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9950212 (ayounsi) Open→Resolved All is done here.
[17:05:47] Mail, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9950703 (bcampbell) I see the new MX records in Google Workspace Admin now @jhathaway. {F56203753}
[17:14:46] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9950763 (cmooney) Open→Resolved
[18:11:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9951125 (cmooney) So one thing I noticed is that we are not getting the stats for LAG/ae interfaces with the current setup, nor routed...
[18:20:08] CAS-SSO, Infrastructure-Foundations: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205 (Volans) NEW p:Triage→Medium
[18:21:32] CAS-SSO, Infrastructure-Foundations: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#9951204 (Volans)
[18:54:53] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951390 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo...
[19:19:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951466 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo...
[19:24:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951473 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookwo...
[19:25:26] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951474 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo...
[19:55:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9951547 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bo...
[21:25:43] Mail, collaboration-services, Infrastructure-Foundations, SRE, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9951919 (jhathaway)
[22:37:41] netops, Infrastructure-Foundations, SRE: Should we add links between our spine switches aggregating each row of two? - https://phabricator.wikimedia.org/T369238 (cmooney) NEW p:Triage→Low