[02:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:04] TIL: "Newer NetBox versions do not use the scripts directory like older versions. Scripts are stored in the DB. Using either a data source or uploading it directly in the UI is the way to go."
[06:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:14] ugh... storing the scripts in the db doesn't sound very convenient for anyone
[08:16:42] topranks: so after investigation, it still stores the script on disk, but needs to update the DB to know that the script exists
[08:17:22] I'm about to send a CR for a possible fix
[08:17:23] oh ok... so we can still manage it from a repo and put it somewhere on the disk?
[08:17:33] what happens if it changes on disk?
[08:18:12] the way it works is that you configure a "Data Source" which can be a local directory or a git repo
[08:18:27] ah ok
[08:18:39] and then in the script page, you "import" a script from that data source
[08:18:50] "git repo" also sounds workable
[08:18:59] yeah that's the way I'm going toward
[08:19:05] points to netbox-extra on gerrit
[08:19:12] cool
[08:19:14] be good if we can do that "import" programmatically somehow
[08:19:20] yeah it can
[08:19:47] nice, so not a disaster just loads more work again :)
[08:19:48] and then we will use the local netbox-extra checkout only for the validators
[08:20:10] haha yeah, never ending :
[08:20:11] :)
[08:31:00] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1048368 the approach I'm suggesting
[08:34:47] looks good to me
[08:35:10] worth getting v.olans input when he's back I guess
[08:35:19] out of interest what is the leading '2' in the mode for?
[08:35:47] topranks: sticky guid, files created in the directory will have the same group as the directory
[08:36:04] ah ok, TIL
[08:39:36] volans is out next week iirc, maybe elukey, or slyngs, can be the extra pairs of eyes. Also it wouldn't be too difficult to change if a different approach is preferred before we actually upgrade prod
[08:41:34] Sure, I also have the OIDC for Netbox 4, so I'll be in a Netboxy mood anyway :-)
[08:41:54] the best kind of mood!
[08:49:52] topranks: fyi, the wikikube-ctrl2002 mystery host is happily pxe booting now, as predicted
[08:51:14] what the actual....
[08:51:22] yep!
[08:52:10] when you first mentioned this I was sure it was some issue with a DHCP lease time and how the Junipers store that state
[08:52:26] yeah, I wondered about that too
[08:52:29] as in... subsequent DHCP failing cos of some state entry one of the routers added when they seen the first
[08:52:34] yeah
[08:52:41] but, nope, it literally is not trying DHCP the second/failed times
[08:52:48] excellent
[08:53:05] :D
[08:53:16] so it is state on the host that seems to be impossible to wipe afaict (yes I tried a lot of things)
[08:53:49] yeah like media failure is such a low-level thing, a logical or configuration error you'd never expect to cause that
[08:53:58] yeah
[08:54:09] it's probably in its contract of employment
[08:54:16] "no more than 1 PXEboot a day"
[08:54:23] :D
[08:54:39] best explanation I can think of :D
[08:55:04] (now imagine what it took out of me to figure out that pattern '^^)
[08:55:11] yeah nuts
[08:55:20] (I want dell to pay for the psychiatrist XD)
[08:55:22] esp. as I don't think anything has changed or there are any unusual components here
[08:55:35] it'll take years of therapy no doubt :D
[08:58:02] mhm
[08:59:21] I'm a tad worried about using those hosts and getting bitten by it in an emergency but it'd take a somewhat contrived scenario to be an actual problem, so it's probably fine?
[08:59:48] in any case, thank you for your help <3
[09:03:53] XioNoX: checking the patch..
[09:08:47] the process is nice, I merged https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1048260 on the dev branch, clicked the "sync" button on Netbox 4 UI, and the script was working at the next run
[09:10:40] https://netboxlabs.com/docs/netbox/en/stable/models/core/datasource/
[09:15:45] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9912805 (ayounsi)
[09:19:13] kamila_: it's no problem as long as everything else works and you get the partman and puppet bits right first time every time
[09:19:17] how hard could that be?
[09:19:41] XioNoX: doesn't sound too bad :)
[09:27:35] XioNoX: ok for me to proceed! But please consider using a netbox4 flag if the number of settings to vary will increase
[09:34:23] topranks: yes that's exactly my worry :-D
[09:35:08] elukey: cool, thx, yeah I was going to but then realized that using $deploy_project == 'netbox-dev' was kind of the same without having to add a new variable
[09:36:58] sure sure for now it is ok, but it can get messy very soon :)
[09:38:54] true, let's see what other surprises Netbox 4 has for us
[09:44:22] yep yep :)
[09:55:35] topranks: I'm kinda spent on the pxe boot mystery and I want to actually finish reimaging the host and be done with it, but an idea that pops up is speed negotiation fail maybe? but I don't have a test host anymore and I don't know if it's worth digging into it any further
[09:55:57] it seems unlikely
[09:56:07] ok
[09:56:32] because they are SFP-based ports, so the module in the port forces it to one speed, there is no negotiation of speed (even if autoneg starts only 1 speed and full-duplex is announced as capability either side)
[09:56:48] plus that kind of thing you'd expect to happen *all* the time, or at least not with this crazy pattern
[09:56:54] yeah, fair point
[09:57:11] that said I wouldn't rule anything completely out we're past that stage
[09:57:26] yeah '^^
[09:57:38] * kamila_ actually has no idea about our network setup and needs to stop randomly guessing
[09:58:09] also the switch side shows the port as "up", you'd expect that to be down if autoneg had run and failed
[09:58:43] it's more likely the host is somehow trying the wrong port
[09:58:55] or at least with that error, traditionally, that's usually been the cause
[09:59:10] somehow it was configured to pxeboot on wrong port, and we get that error every time
[10:07:35] good point
[10:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:24] kamila_: Despite my best efforts to convince topranks to switch to wireless, our network setup is based on fibers and ethernet cable, if that helps :-)
[12:09:40] slyngs: actually I'm fine with cables, at least they usually don't interfere with each other (let's ignore 10G copper for a while) :D
[12:10:32] exactly!
[12:10:40] wireless is for things that move around only :)
[12:10:54] the funky host is very unlikely to be a bad cable given there've been 3 of them and I stress-tested one of the links enough to cause a page XD
[12:11:37] yeah, although I think there is probably a class of errors that might make it "not initialise", but still work ok when it does (i.e. in your speed test)
[12:11:54] but the fact we've had the same thing on multiple hosts, bought at different times and with different cables, seems to rule it out
[12:12:32] yeah, it smells like firmware, and maybe the OS has workarounds for some weirdness that the firmware doesn't? who knows
[12:12:55] I'm going with a poltergeist that likes their routine (hence the deterministic behaviour)
[12:13:46] haha I was ruling out the spirit world because it was so predictable, but yes perhaps it's just one with a good daily routine and lots of discipline
[12:13:58] the ghost of what I'll never be
[12:14:23] nor me :)
[12:16:54] is it time for that joke?
[12:16:57] it's time for that joke
[12:17:07] what is the medium for wifi?
[12:17:18] gasfaser or twisted air
[12:20:48] google tells me this is some pun on "Glasfaser" that I'm not getting at all :P
[12:20:58] german word for fiber
[12:21:14] so I've just learnt
[12:21:27] I'll go with twisted air then :-)
[12:22:54] (true, my german is meh but I have to pay for my fiber, so I suppose it sounds funnier in my head XD)
[13:11:19] netops, Infrastructure-Foundations: Move a server within the same row script not working - https://phabricator.wikimedia.org/T368148 (Papaul) NEW
[13:11:35] netops, Infrastructure-Foundations: Move a server within the same row script not working - https://phabricator.wikimedia.org/T368148#9913358 (Papaul) p:Triage→Medium
[13:21:24] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275#9913397 (ayounsi) It's not possible to do the DB migration directly from 3.2.9 to 4.x. We need to do a pit-stop on 3.7.x. This was tested successfully by: # running a PostgreSQL l...
[13:30:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9913434 (cmooney) Gonna copy some of the discussion from the patch here as I think it's easier for discussion and a record of what we decide:...
[14:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:05] XioNoX: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1048486 - no idea if totally off or not, took a stab in exploring netbox-extras..
[15:07:13] elukey: nice!
[15:07:28] elukey: you can test it on netbox-next if you want
[15:09:55] XioNoX: ah nice, no idea how though :D
[15:10:02] * elukey goes on wikitech
[15:10:30] elukey: netbox-dev2002:/srv/deployment/netbox-extras
[15:11:30] sudo -i, then cherry pick the commit, maybe restart uwsgi-netbox, and then try https://netbox-next.wikimedia.org/extras/scripts/move_server.MoveServer/
[15:11:37] maybe try it before to be able to compare
[15:12:33] ah nice I thought I needed to deploy it somehow
[15:12:47] okok, I can ask Papaul to run the script in netbox-next
[15:34:02] XioNoX: the fix works :)
[15:34:09] nice!
[15:36:56] XioNoX: and now I guess that I'd need to run sre.netbox.update-extras right?
[15:37:04] elukey: yep :)
[15:40:25] I see failures when deploying to canary, more specifically
[15:40:27] 100.0% (1/1) of nodes failed to execute command 'git -C /srv/depl...s pull --ff-only': netbox-dev2003.codfw.wmnet
[15:40:36] I guess it is ok, new node with netbox 4
[15:40:46] I'll proceed with prod
[15:40:51] elukey: yeah
[17:26:24] Friday afternoon thought: what would a good devcontainer setup for the Puppet repo look like?
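Note on the 08:19:14 question about doing the data source "import"/sync programmatically: a minimal sketch, assuming NetBox exposes a sync action for data sources at `/api/core/data-sources/<id>/sync/` and accepts `Authorization: Token ...` headers. The base URL, data source ID and token below are hypothetical placeholders, and the helper only builds the request; nothing here is taken from the actual Puppet patch.

```python
import urllib.request


def build_sync_request(base_url: str, datasource_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking NetBox to sync one data source.

    The endpoint path and token header format are assumptions; check them
    against the REST API docs of the NetBox version in use.
    """
    url = f"{base_url.rstrip('/')}/api/core/data-sources/{datasource_id}/sync/"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the sync action takes no parameters in this sketch
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )


# Hypothetical values, for illustration only.
req = build_sync_request("https://netbox.example.org/", 1, "REDACTED")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, ideally from whatever automation already runs after a merge to the scripts repo.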
[17:49:03] netops, Data-Persistence, Data-Platform-SRE, DBA, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9914102 (Ottomata)
[18:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:38:50] Packaging, Infrastructure-Foundations: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9914512 (Aklapper)
[22:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:41:47] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9914610 (cmooney) Open→Resolved
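Note on the 08:35:19 question about the leading '2' in the mode: in a four-digit octal mode the first digit carries the special bits, and 2 is the setgid bit, which on a directory makes new files inherit the directory's group. A small sketch that decodes this with Python's `stat` module; the mode `0o2775` is a hypothetical example, not necessarily the value used in the patch.

```python
import stat


def special_bits(mode: int) -> list[str]:
    """Return the names of the special permission bits set in an octal mode."""
    names = []
    if mode & stat.S_ISUID:  # leading 4
        names.append("setuid")
    if mode & stat.S_ISGID:  # leading 2
        names.append("setgid")
    if mode & stat.S_ISVTX:  # leading 1
        names.append("sticky")
    return names


example_mode = 0o2775  # hypothetical directory mode with a leading 2
print(special_bits(example_mode))
# Render the mode the way `ls -ld` would show it for a directory.
print(stat.filemode(stat.S_IFDIR | example_mode))
```

In the `ls`-style string the setgid bit shows up as an `s` in the group execute position.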