[02:19:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:23:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:26] the current git status of netbox-extras on netbox-dev2002 is: [09:54:27] modified: customscripts/_common.py [09:54:28] modified: customscripts/provision_server.py [09:54:39] anyone still working on those or are leftover from past testing? [10:24:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:23] volans: I usually save the git diff in /root/ then reset [10:27:59] yes, but do we have a reason to keep them around? are they in a CR? [10:29:44] volans: dunno, but that's why I suggest to save the diff then reset it if you need to work on something else [10:30:49] sure, I wasn't blocked by that, [10:32:21] not sure I understand the question then :) [10:34:21] it's pointless to keep something that every time requires manual hacking if it's not needed :D so just checking if it's needed or leftover from old tests [10:35:07] ah yeah I agree, I think we usually think we will get back to it then forget [10:37:26] testing all it's easy [10:37:33] ips = IPAddress.objects.all() [10:37:33] v = Main() [10:37:33] for ip in ips: v.validate(ip) [10:37:46] sorry wrong tab, was referring to the validators in netbox [10:37:47] :D [10:38:59] volans: that's nice ! Could be worth adding to the wiki [10:39:22] sub section of https://wikitech.wikimedia.org/wiki/Netbox#Validators probably [10:39:36] sure, will do, I had to paste the class in the repl though [10:39:46] I can probably find a better way to load it [11:00:55] XioNoX: added a section Testing Validators: https://wikitech.wikimedia.org/wiki/Netbox#Validators [11:09:35] <3 [11:10:00] I'll probably do a full pass of all existing validators [11:15:46] and finding issues :/ [11:28:14] volans: I can’t promise those leftover modifications were not me (although I didn’t think so). Either way there is nothing there I need so feel free to reset [11:28:42] and sorry had to take my mam to doctor only seen the chat now [11:29:59] topranks: no worries and no hurry at all, it was just a low prio check if we do have WIP modifications in there or not. Happy to keep them if in use, but also no need to keep unnecessary burdeen around if not needed :) [11:31:22] yep absolutely. I tend to do any development on my laptop and push to the host so I’ll never have “unsaved work” there. Thanks for checking! [12:35:50] who validates the validators ? [12:35:56] Riccardo. [12:36:01] :) [12:36:09] lol [12:45:21] 10CAS-SSO, 06Infrastructure-Foundations: Create Tomcat 9 for Bookworm - https://phabricator.wikimedia.org/T359333 (10MoritzMuehlenhoff) [12:45:43] 10CAS-SSO, 06Infrastructure-Foundations: Build Tomcat 9 for Bookworm - https://phabricator.wikimedia.org/T359333#9606112 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [12:53:10] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:21] XioNoX, topranks: I have upgraded rpki2002 to Routinator 0.13.2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009247 [12:57:49] I have stopped Puppet on 2002 and fixed up the puppetised template manually (with what the patch contains) [12:57:55] looking [12:58:02] can you please verfiy that it works as expected [12:58:10] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:17] and that [12:58:19] By default, Routinator will now use the TALs of the five RIRs which are included in the Routinator binary. This is what most networks using RPKI want to use and means for them no initialisation is necessary at all any more. [12:58:25] applies to our installation as well? [12:58:47] yeah [12:59:02] sounds ok yeah [12:59:12] before there were some legal stuff, where they wouldn't bundle the ARIN TAL [12:59:15] couldn't* [12:59:21] we'd need to upgrade 1002 and 2002 in lockstep, since I believe applying the patch while still on 0.11 will also break [12:59:23] good to know that it's solved now [12:59:56] from a router POV rpki2002 is fine [13:00:12] 10Mail, 06Infrastructure-Foundations, 07User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9606145 (10Ladsgroup) >>! In T356984#9600958, @Tacsipacsi wrote: > Thanks for the explanation! So the concern is the amount of outgoing mail, specif... [13:02:00] moritzm: the prometheus endpoint isn't update through https://grafana.wikimedia.org/d/UwUa77GZk/rpki?orgId=1 [13:03:32] `ayounsi@rpki2002:~$ curl -v localhost:9556/metrics` -> Initial validation ongoing. Please wait. [13:03:36] I guess I'll wait [13:05:41] yeah, we can just wait until metrics are back, and then I'll do the upgrade dance for the rest [13:10:31] oh, and https://github.com/NLnetLabs/routinator/releases/tag/v0.13.0 also mentions the availability of debs for bookworm, nice [13:14:03] alright, metrics are back [13:14:49] router still happy [13:15:58] I have to step away for a bit [13:18:20] for leter, no hurry, we should check if the various routinator commands on wikitech are still working, they are veru useful to respond to pages and alerts [13:18:27] *later [13:18:57] volans: good call, I'll take a look now [13:19:09] <3 thx [13:28:50] all good still working the same [13:34:01] patch is merged and rpki1002 is now also upgraded [13:36:09] session to both is looking healthy on two CRs checked so I think we're good [13:36:36] same thing with the metrics but I'm sure it'll come good, I'll keep an eye on the graph [13:38:06] I can't see an option to make https://phabricator.wikimedia.org/T358581 public, is that a permission I lack or am I missing something? [13:43:30] probably some ACL, I'll ask Scott to make it public [13:58:54] metrics are back btw [14:00:06] mortizm: in terms of that task I don't have any option to change the visibility either [14:02:17] thanks [14:41:14] volans re: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/966492 , we don't have to meet later today if you're busy...if you just wanna merge/deploy to cumin2002 and give me the heads-up that's fine [14:42:06] inflatador: there are open comments, who's working on them? (see gerrit / inbox) [14:45:41] volans gotcha. I was thinking "merge and see what happens" ...but I'm in no great hurry either. I can talk it over with dcausse and fnegri . [14:47:16] those concerns seems easy to be tested in isolation way before merging, cutting a release, deploying, having 2 different versions in the 2 cumin hosts, etc... :) [14:48:54] I think we'll eventually have to do that anyway...we're not turning this major of a change loose without testing. The only other option is David's suggestion that cloud use a separate library [14:50:25] sure, but if we can test what's testable before it should reduce the number of test-releases [14:53:13] yeah...sorry, I think I'm creating a false sense of urgency. I'm a people-pleaser so I wanted to unblock Francesco. But he's OK w/waiting. [14:53:34] So let's take it slow and do some more testing as you suggested [14:55:55] we all want to unblock that patch because it's causing so much issues to keep elasticsearch-curator around [14:56:03] in terms of dependency hell [14:57:01] we could also go the other way around, add an opensearch module with what is in the latest PS, dupicating modules, tests and spicerack accessors [14:57:17] and then let the onwers migrate the cookbooks from one to another at their own pace [14:57:21] *but* [14:57:35] we'll not be able to fix the dependency hell until the removal of the old module [14:57:47] so we can't keep it around for a long time [14:59:38] Yep, python gonna python [15:00:18] It all depends on what's the least amount of pain. To me, it's just attacking and seeing what happens. But it's easy for me to say, I don't have to cut the release/deploy/etc ;) [15:03:26] different topic: I'm having a tough time reimaging wdqs1025. I keep getting "media test failure" on PXE boot...based on the BIOS, it's trying to boot off the right NIC, the firmware is the version from DC Ops' wikitech page, the DRAC UI says the NIC is plugged in...any suggestion? [15:03:55] inflatador: I'll have to cut a release for serviceops anyway but that bit is needed for the switchdc so I can't rollback that one, in case there are issue with yours [15:04:25] volans it's cool. We'll just wait it out for now, regroup after the offsites [15:04:30] inflatador: media test sounds like is not finding the disks... does it have a HW raid? [15:06:39] volans Y [15:06:56] is the raid properly configured? [15:07:45] probably not...it just went thru the decom process. This is an oddball host (reclaimed CP node) , my first time dealing with HW raid cards here [15:10:02] I can get into the DRAC, if I need to set up HW raid there LMK [15:11:39] you can ask dcops to have a look at the raid and check if it's properly configured [15:11:52] Y, it's not showing any disks at all [15:12:03] Yesterday I could see them. Sounds like they might need reseating then [15:30:34] volans yet another unrelated question, is the ganeti in eqiad/codfw storage all on RAID5 ? I see a RAID1 partman recipe [15:32:55] FYI, I'll leave a note on the capex sheet that we don't need a refresh of sretest2003/2004 (given that these are old reclaimed servers and if we need this again we can grab old ones again) [15:33:06] moritzm: ack, +1 [15:33:09] (based on what we talked about in the Monday meeting) [15:33:14] inflatador: 301 to moritzm and modules/profile/data/profile/installserver/preseed.yaml [15:36:09] excellent, thanks again! [15:37:02] the servers in eqiad/codfw are all raid5, yes [15:37:31] the raid1 variant is used in the smaller installations in the pop sites [16:36:46] /etc/cumin/aliases.yaml is currently broken on cumin hosts FYI [16:36:51] ganeti-all A:ganeti and A:ganeti-test and A:ganeti-routed [16:37:52] I'll send a fix [16:42:19] ngl erbized yaml is a plague [16:51:28] moritzm: ^^^ fyi, and sorry I missed that [16:52:04] already +1d :-) [16:59:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:02] 10Mail, 06Infrastructure-Foundations, 07User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9608335 (10Tacsipacsi) >>! In T356984#9606145, @Ladsgroup wrote: > Yeah, the second part is something that we need to make sure happen before being... [18:32:22] 10netops, 06Infrastructure-Foundations, 06SRE, 06Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#9608338 (10cmooney) p:05Medium→03Low [21:01:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed