[01:00:06] netops, Infrastructure-Foundations: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653 (Papaul) NEW
[04:44:22] FIRING: SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:40] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:54:22] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:59:22] RESOLVED: [2x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:22:51] Mail, Bitu, Infrastructure-Foundations: Don't get password reset emails for my alt through IDM - https://phabricator.wikimedia.org/T371612#10037613 (SLyngshede-WMF) p: Triage → High
[07:26:44] Mail, Bitu, Infrastructure-Foundations: Don't get password reset emails for my alt through IDM - https://phabricator.wikimedia.org/T371612#10037616 (SLyngshede-WMF) a: SLyngshede-WMF Just checked, your Wikitech account (Nintendofan885) is linked to a SUL account with the same name. I'll look into...
[08:30:15] netops, Infrastructure-Foundations, Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10037746 (ayounsi) The re-image cookbook is hitting this bug too: https://github.com/netbox-community/pynetbox/pull/632 `lang=pytb,lines=30 Traceback (most recent...
[08:32:27] o/
[08:32:36] hello
[08:45:40] netops, Infrastructure-Foundations, Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10037761 (ayounsi) Confirmed working: `lang=pytb, name=before >>> robj = api.extras.scripts.get('import_server_facts.ImportPuppetDB').url [...] TypeError: a bytes-...
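For context on the pynetbox bug in the task comments above: the snippet fetches a Netbox custom script record via the API and reads its URL, which raised a TypeError before the fix in pynetbox PR #632. A minimal sketch of that access pattern, assuming a placeholder Netbox URL and token:

```python
# Sketch of the pynetbox access pattern from the task comment above;
# the Netbox URL and token are placeholders, not real credentials.
import pynetbox

api = pynetbox.api("https://netbox.example.org", token="XXXX")

# Before the fix in pynetbox PR #632, attribute access on the returned
# record failed with "TypeError: a bytes-like object is required";
# with the fix it returns the script's URL as expected.
robj = api.extras.scripts.get("import_server_facts.ImportPuppetDB")
print(robj.url)
```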
[08:53:32] "This message is to inform you that as part of Google's effort to simplify peering operations and increase routing security, Google has decided to stop advertising and receiving prefix information from Route-Servers in Internet Exchanges."
[09:00:30] whaaat???
[09:01:12] beg your pardon?
[09:01:35] they'll stop using route servers and this somehow will simplify things?
[09:01:45] We peer with them almost everywhere anyway, I sent peering requests for the last 3 locations where we don't
[09:02:15] yeah, but... what about all of those orgs that don't directly peer with them but rely on the route server?
[09:02:23] the issue is Equinix San Jose, last time I asked to peer, they say "we exchange too much traffic, please contact us for a PNI", then I contact them and they never reply
[09:03:05] yeah it's going to be a big shift for lots of people
[09:03:23] bangs of automatic process/thresholds and not properly analysing the situation, networks involved etc
[09:03:33] (the pni request thing)
[09:04:07] the route server thing is strange, do they want individual sessions for more control somehow?
[09:04:23] that would be my guess, yeah
[09:04:36] they did a Q&A in Portuguese for IX.BR
[09:04:44] that email above is for SG.IX
[09:04:44] be fascinating to see the internal discussions that led to that
[09:05:20] do you have a link to the q&a?
[09:06:00] https://www.youtube.com/watch?v=Or0lPtqoZVQ
[09:07:00] haha thanks - didn't realise it was a video :)
[09:07:14] I'll dig out my Portuguese phrase book
[09:08:06] surprisingly no thread on the NANOG mailing list
[09:08:47] some threads online https://anuragbhatia.com/post/2024/04/google-to-stop-peering-via-rs/
[09:09:49] thanks
[09:11:29] topranks: also new pmacct release, as usual, extremely packed with new features and bug fixes: https://github.com/pmacct/pmacct/releases
[09:11:55] nothing worth upgrading I think, but some possible nice features here and there
[09:44:40] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052#10037907 (ayounsi) As a data point, `GetDeviceStats` runs ~5000 times per day, which clutters the DB and probably contributes to...
[09:48:29] XioNoX: indeed yeah, such a great project
[09:49:10] I agree though, don't think we need to upgrade
[11:04:04] XioNoX: thanks for the reimage investigation!
[11:05:09] thanks for the review!
[13:00:33] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: replace getstats.GetDeviceStats with ntc-netbox-plugin-metrics-ext - https://phabricator.wikimedia.org/T311052#10038568 (ayounsi) I played a bit with the plugin and it's quite easy to duplicate what we currently do: {F56922356} The only d...
[13:04:18] XioNoX: o/ if you have a min, I am trying https://netbox-next.wikimedia.org/extras/scripts/add/
[13:04:44] I encountered multiple errors, tried to work around them but not sure if I managed to :D
[13:05:05] I also tried to delete the existing provision script, and add mine from localhost
[13:05:20] but it complains about it anyway
[13:06:16] elukey: sure, what's up?
[13:06:23] what file are you trying to upload?
[13:08:42] my local provision_server.py
[13:09:06] with my patch basically
[13:09:34] elukey: ok, and what errors are you getting?
[13:11:07] I select both Data Source and File, the latter being customscripts/provision_server.py
[13:11:30] "Cannot upload a file and sync from an existing file"
[13:11:39] if you're uploading it manually, just select the file and no data source
[13:14:15] okok thanks :)
[13:28:00] if you're curious about the warning when running Homer: https://github.com/paramiko/paramiko/issues/2419 nothing worth doing for now, waiting for an update
[13:34:31] https://netbox-next.wikimedia.org/dcim/interfaces/35261/ -> Mac address :)
[13:34:42] elukey: nice!!
[13:35:01] fixing a couple of things and then the code change should be ready
[13:35:10] then you can tell me how many things I've done wrong :D
[13:36:34] cool!
[13:37:43] Unrelated, but I think we can remove "How many Cassandra instances" from that script, as it has been replaced by a dedicated script (CC topranks) https://netbox-next.wikimedia.org/extras/scripts/19/
[13:41:12] code review updated :)
[13:42:13] elukey: a couple of nits to start with
[13:42:35] elukey: also should it require the mac address if the server's vendor is supermicro?
[13:42:44] to not risk forgetting it
[13:42:55] "to start with" sounds worse than Riccardo's "LGTM but a couple of nits" :D
[13:43:10] :)
[13:43:40] ah so I can get the vendor slug in there since the server is already in netbox, and in that case warn the user
[13:43:43] yes yes good point
[13:43:57] you can't just warn, it's either blocking or passing
[13:44:12] yes, in the sense of erroring out, since it is needed
[13:44:24] or you can show a warning log line, but I doubt it's useful/read
[13:44:35] nono it needs to block
[13:45:00] elukey: from a quick look everything else is good :)
[13:45:19] thanks! Going to find a way to get the vendor and I'll test it
[13:45:36] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Priority Backlog 📥): GitLab sessions expire frequently - https://phabricator.wikimedia.org/T330359#10038681 (hashar) I am not using GitLab that often but I once got disconnected and reproduced it by simply...
[13:46:18] elukey: something like `device.device_type.manufacturer.slug`
[13:46:31] I was checking that, lemme try
[14:04:24] should be ready now, tested and it works fine
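To make the exchange above concrete, here is a minimal sketch of how a Netbox custom script could block (rather than merely warn) on a missing MAC address for Supermicro hardware, using the manufacturer slug. The class and variable names are illustrative, not the actual provision_server.py, and it assumes a NetBox version where AbortScript is importable from utilities.exceptions:

```python
# Illustrative Netbox custom script; names are hypothetical, not the
# real provision_server.py.
from dcim.models import Device
from extras.scripts import ObjectVar, Script, StringVar
from utilities.exceptions import AbortScript


class ProvisionServer(Script):
    device = ObjectVar(model=Device, description="Server to provision")
    mac_address = StringVar(required=False, description="Primary NIC MAC address")

    def run(self, data, commit):
        device = data["device"]
        # Supermicro servers need the MAC recorded up front, and a custom
        # script can only block or pass, so raise instead of warning.
        if device.device_type.manufacturer.slug == "supermicro" and not data.get("mac_address"):
            raise AbortScript("A MAC address is required for Supermicro servers")
        self.log_success(f"Provisioned {device}")
```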
[14:07:38] slyngs: o/ do you have a min for a brainbounce (on irc)?
[14:07:49] Sure
[14:08:44] thanks :) Yesterday I wanted to test the new debmonitor-server release, and I tried this https://phabricator.wikimedia.org/T368744#10035803
[14:09:29] I thought it was related to the new version, then I rolled back on 2003 and I still see the same issue
[14:09:43] but it doesn't happen if I point /etc/hosts to sretest1001
[14:09:49] err sorry debmonitor1003
[14:10:00] afaics they are the same, same db config etc.. (m2-master)
[14:10:58] my assumption is that moving the debmonitor.discovery.wmnet CNAME to debmonitor2003's IP should be the only thing needed for a failover
[14:11:14] but if so, I can't explain why it fails in the way I described in the task
[14:11:19] I'll just find the code
[14:19:19] That is weird
[14:19:52] I am very puzzled as well, now I am wondering if I did something wrong (testing or upgrade)
[14:23:24] also checked the code /usr/lib/python3/dist-packages/debmonitor/hosts/views.py, it is the right release (0.4.0)
[14:23:55] It's something internal to Django, because it seems to fail on HostPackage.objects.get_or_create, so it does a get, decides that it needs to "create" the object and then fails on save, because it already exists
[14:24:54] * elukey nods
[14:25:21] anything that could prevent the Django on 2003 from fetching the data from m2-master?
[14:25:38] We can test
[14:27:04] I'm just doing the query, without the _or_create, in a django shell
[14:27:09] host is sretest1001
[14:28:09] exactly yes
[14:28:24] very curious to know about the django shell :)
[14:31:00] sudo debmonitor shell <- Gives you a shell with all the models and everything loaded
[14:31:10] Very handy and very dangerous :-)
[14:31:14] TIL :)
[14:34:03] I wonder if it only happens with a specific package
[14:34:50] Dumb question, what happens if you run the client twice?
[14:35:16] lemme try
[14:39:15] I am a little suspicious of PackageVersionManager and its overriding of get_or_create
[14:40:50] something interesting - if I drop the package first from spicerack (spicerack.debmonitor().host_delete('sretest1001.eqiad.wmnet')) it doesn't work even if I try multiple times
[14:41:10] lemme try if I drop, then run with /etc/hosts pointing to debmonitor1003 and then to 2003
[14:41:40] Okay, exciting :-)
[14:43:10] with the last combination, it works
[14:43:16] namely the client completes
[14:46:08] So drop the package on 1003, then run the client again pointed to 2003, and the package is not created again
[14:47:37] my main worry is that if we ever need to fail over debmonitor we'll see fireworks
[14:47:43] have we ever done it?
[14:47:59] I don't know :-)
[14:48:18] fair :)
[14:48:21] I can't remember if we did with the last update
[14:48:41] The last update was the first one rolled out with a Debian package
[14:49:02] Maybe we built new hosts to upgrade to Bookworm
[14:50:04] My thinking, and I have no proof: in bin_packages/models.py we have "PackageVersionManager"
[14:51:18] That overrides get_or_create and does some in-memory caching... I think, so if we manage to hit both hosts with our queries for some reason, things may be expected to go very wrong
[14:52:41] right, this is a good point, so some state is hiding from us
[14:53:36] A really dumb test is to go in on 2003 and rename get_or_create to _get_or_create or something, restart debmonitor and try again
[14:54:06] Then again, that might not be a good idea
[14:54:19] I'm still not entirely sure what that code is supposed to do
[14:58:17] could it be that django's object mapping to the DB uses some sort of state/temp-key/etc.. that depends on the host?
[15:00:37] It does, to some extent. Especially when you deal with foreign keys and related objects. If you get an object out, and it has a bunch of related objects, then those are loaded either when you fetch the object, or when you access the relation for the first time. If the database below is changed, you won't see it, unless you call refresh_from_db
[15:05:40] it is weird since we already changed VMs etc. for debmonitor, so we should have seen it
[15:07:27] It's also only an issue if two requests happen to overlap, so it should be somewhat hard to trigger.
[15:08:58] I am not sure, what if we move debmonitor.discovery.wmnet to debmonitor2003 via DNS?
[15:09:18] I'd expect to see the failures every time we drop some host data
[15:09:22] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:40] or do you think it wouldn't happen in that case?
[15:09:47] this is what concerns me
[15:10:24] Understandably. I think we'll be fine, the issue is how to recover if we're not :-)
[15:11:08] yeah :D
[15:11:24] In any case, let us maybe not move the DNS on Friday afternoon.
[15:12:01] oh yes I wasn't suggesting that, don't worry :D
[15:12:56] I have to go start dinner, but let's test something on Monday... or if you feel like doing it now. Have the access log for Debmonitor for both hosts open at the same time, run the client and see if both get traffic
[15:13:33] If they both get like half the traffic then that would explain the problem
[15:13:59] Well... provide a more qualified guess at least
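A note on the suspected failure mode: Django's stock QuerySet.get_or_create already guards against the classic race by catching IntegrityError on the create and retrying the get, roughly as sketched below. A manager that overrides get_or_create without that guard, or that serves stale in-memory objects that no longer match the database, reintroduces exactly the "get misses, create collides" symptom described here. This is a generic sketch of the pattern, not Debmonitor's actual PackageVersionManager:

```python
# Generic sketch of the get-or-create race guard; "model" stands in for
# any Django model with a unique constraint (e.g. something like
# Debmonitor's HostPackage), not the project's actual manager code.
from django.db import IntegrityError, transaction

def safe_get_or_create(model, **kwargs):
    try:
        return model.objects.get(**kwargs), False
    except model.DoesNotExist:
        pass
    try:
        with transaction.atomic():
            return model.objects.create(**kwargs), True
    except IntegrityError:
        # A concurrent request (possibly served by the other Debmonitor
        # host, since both share the same m2-master DB) won the create:
        # the row exists now, so fetch it instead of failing.
        return model.objects.get(**kwargs), False
```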
[15:15:36] enjoy your evening and weekend! Thanks for the brainbounce
[15:15:43] will log off soon too
[15:16:35] You too, we'll give it another look Monday
[16:04:22] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:24] I'm gonna log off too folks, and FYI Monday is a public holiday here so I will catch up with you all Tuesday
[16:13:30] have a good weekend!
[17:45:20] cheers topranks
[22:50:40] FIRING: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:22] RESOLVED: SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:40] FIRING: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:54:22] RESOLVED: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed