[05:07:57] swfrench-wmf: dbctl seems to be broken
[05:08:06] root@cumin1002:~# dbctl config diff
[05:08:06] ERROR:conftool:File /etc/conftool/etcdrc not found
[05:08:06] ERROR:conftool:File /etc/conftool/etcdrc not found
[05:08:06] root@cumin1002:~#
[05:08:17] _joe_: is this something you could help with?
[05:08:45] <_joe_> marostegui: uhh give me a sec, but probably yes
[05:08:53] thank you
[05:09:03] <_joe_> marostegui: the fastest solution for now is to re-install the old version
[05:09:16] <_joe_> if you're in an emergency
[05:09:17] that's not broken
[05:09:26] that's error message spam, which is new
[05:09:37] I have a task open to fix
[05:09:55] <_joe_> hah
[05:09:58] T367919
[05:09:58] swfrench-wmf: I see, I was able to depool a host indeed
[05:09:58] T367919: Avoid error logging while searching configs during normal operation - https://phabricator.wikimedia.org/T367919
[05:10:08] <_joe_> swfrench-wmf: go away :D
[05:10:12] haha
[05:10:15] <_joe_> it's way too late in your TZ
[05:11:11] I may have let my IRC mention notifications run a bit later today than usual on account of the things I touched :)
[05:11:24] but yes, will disappear into the bushes again shortly
[05:11:29] swfrench-wmf: :*
[05:13:18] <_joe_> thanks, I just merged your fix
[05:13:25] <_joe_> now go enjoy your evening :)
[05:14:03] so, to summarize for wider visibility: the 3.0.0 conftool release has alas added some non-fatal error message spam, which we'll aim to fix in 3.0.1 :)
[05:14:08] thanks _joe_!
[05:14:45] <_joe_> yeah I'll take care of making a release this morning so that it's not causing too much surprise
[05:15:04] oh, that would be great! thank you very much
[05:15:05] thank you :*
[05:15:44] alright, I'll disappear again - have a nice Wednesday all, I'll be back Thursday :)
[05:25:35] Interestingly that error is preventing our schema changes from getting deployed, as the script doesn't commit the changes I think - so yeah, this would be high priority
[05:26:00] <_joe_> marostegui: yeah on it
[05:26:11] _joe_: thanks! :)
[05:27:29] <_joe_> marostegui: are your schema changes running from cumin1002?
[05:27:44] Yes
[05:27:52] <_joe_> ok so
[05:28:10] <_joe_> If you want to get unblocked now, I'll just downgrade there
[05:28:17] I can wait
[05:28:19] No worries
[05:29:08] <_joe_> we both waiting on gerrit rn :D
[05:31:12] I know I keep saying I plan to leave, but marostegui: when you get a chance, if you could point to the schema change script that's having issues, I would be interested in seeing how it's failing
[05:32:17] context: I did test a noop dbctl write (changing a note field on an instance - i.e., a change that leads to no commit diff), and it succeeded
[05:33:46] <_joe_> swfrench-wmf: this is a script that I guess shells out to dbctl
[05:33:51] <_joe_> and uses its output
[05:33:59] <_joe_> so, tech debt
[05:34:06] I am trying to understand what it is doing
[05:35:09] Because it got at Depooling db2152
[05:35:10] 2024-06-19 05:24:23.867965 dbctl instance db2152 depool
[05:35:10] ERROR:
[05:35:51] But yeah, looks like what _joe_ says:
[05:35:53] error:
[05:35:53] ERROR:conftool:File /etc/conftool/etcdrc not found
[05:35:53] ERROR:conftool:File /etc/conftool/etcdrc not found
[05:35:53] Waiting for dbctl to clear up
[05:36:11] ah, interesting - it didn't occur to me something might be invoking the tool and then parsing stderr
[05:36:26] swfrench-wmf: I think you should go away :)
[05:38:18] heh, yes, I should :)
[05:39:28] <_joe_> marostegui: I am having problems with uploading the package to reprepro
[05:39:36] <_joe_> it seems there's something that was left in a broken state
[05:39:45] _joe_: no worries, it is fine, this is not super urgent as long as it is known :)
[05:39:46] <_joe_> so now reprepro is unusable
[05:39:47] schema changes can wait
[05:42:23] <_joe_> marostegui: can you try dbctl on cumin1002?
[05:43:24] just went thru!
[05:43:40] <_joe_> ok cool
[05:43:50] <_joe_> now I need to rebuild for the other distros
[05:44:44] from my side it seems fixed!
[05:44:52] thank you
[08:36:30] Apologies all for the inconvenience re reprepro. That was my fault.
[08:42:36] So I have a question about the installservers. One integral part of net-booting (PXE) is /srv/tftpboot/lpxelinux.0 --- where did this file come from? I ask because we have once more run into the "Failed to load ldlinux.c32" bug with the new SMC machines. And rather than "just downgrade the firmware", I'd like to figure out if it's a bug in our version of the PXE binaries.
[08:43:20] Normally in Debian, that binary is provided by one of the packages that fall out of the syslinux source pkg.
[08:44:47] i.e. pxelinux, but it is not installed on the installserver.
[08:46:55] The md5sum of the file in /srv is different from what is shipped with Bullseye's pxelinux binpkg
[08:51:27] klausman: I think you want to look at modules/profile/files/puppetmaster/update-netboot-image.sh
[08:51:49] thx! having a look
[08:52:04] broadly, it looks to effectively be downloading http://cdimage.debian.org/cdimage/unofficial/non-free/firmware/"$distro"/current/firmware.tar.gz
[08:53:01] Mh, but I think this is what's in the distro dirs on the install server, i.e. /srv/tftpboot/bookworm-installer-12.0/ etc.
[08:54:00] My concern is the lpxelinux.0 loader that is basically the first bit of code loaded via PXE (and then trying to load a specific ldlinux.c32 depending on distro).
[08:55:25] My current working hypothesis is that lpxelinux.0 has a bug that newer firmwares tickle. Or at least I'd like to know what specific source said binary was built from.
[08:56:00] klausman: thanks for looking into it!
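A minimal sketch of the checksum comparison mentioned at 08:46:55, in case it helps anyone repeating the check. The helper names `file_digest` and `same_content` are made up for illustration. /srv/tftpboot/lpxelinux.0 is the path from the log; /usr/lib/PXELINUX/lpxelinux.0 is only an assumption about where an unpacked pxelinux package puts its copy, since the log does not state that path.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b, algorithm="sha256"):
    """True if both files hash to the same digest."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)


# Hypothetical usage (second path is an assumption, not from the log):
#   same_content("/srv/tftpboot/lpxelinux.0", "/usr/lib/PXELINUX/lpxelinux.0")
```

A mismatch only says the two binaries differ. It does not say which upstream version either one was built from, which is the open question in the discussion.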
[08:56:13] Scratching my own itch :)
[08:56:54] Plus, I think "just downgrade the firmware" is a) unsatisfying and b) will eventually be at odds with security and/or vendor support.
[08:57:14] Especially with newer 10G/25G cards, since they contain a _lot_ more magic.
[09:00:56] To add more background: I found https://bugs.launchpad.net/ubuntu/+source/syslinux/+bug/1577554 which sounds a bit like our bug, which transitively points to https://repo.or.cz/syslinux.git/commit/804efa7bb278a032d384c97e8530195b294e71bc --- but now I have the problem that I don't know whether "our" lpxelinux.0 contains that patch or not. Bullseye's syslinux does, but apart from a timestamp
[09:00:59] (Jan 31 2023) and a size 75607, I got nothing. And as mentioned, the md5sums of our file vs the package file differ.
[09:01:26] I could of course just copy over the pkg file and try it (making a backup of our original, naturally), but that seems a bit ... Cowboy.
[09:01:53] ITYM "agile" HTH ;p
[09:02:33] move fast, break stuff, never fix it
[09:04:20] I would expect our version to match what was in the relevant Debian (point?) release; ask infra-foundations, who I think own the netboot infra?
[09:07:01] our install images are identical to the official Debian images, only that the firmware cpio archive gets appended
[09:07:54] modules/profile/files/puppetmaster/update-netboot-image.sh
[09:10:03] But that's later in the chain. lpxelinux.0 is outside of the distro-specific dirs.
[09:11:04] Given the file's timestamp I am like 90% sure it was copied from i2003 when i2004 was set up
[09:24:46] (discussion resumed in #wikimedia-sre-foundations)
[09:26:52] Summary: our file is 6.0.3 20150819, Debian stable ships 6.0.4 20200816. Will do a non-destructive test this afternoon.
[09:28:42] <_joe_> btullis: heh np, just be extra careful with reprepro as it might be vital in some cases - we still rely on debian packages to deliver infra software quite a bit
[09:29:00] _joe_: Ack, will do.
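Regarding the dbctl exchange earlier in this log (05:33 to 05:36), where a wrapper script apparently treated log noise on stderr as a failure: the sketch below shows one way a caller could decide success from the exit status alone and keep stdout and stderr separate. It is not the actual schema change script, which is not shown in the log. `run_tool` is a hypothetical helper, and it assumes the wrapped command exits non-zero on a real failure, which the log does not confirm for dbctl.

```python
import subprocess
import sys


def run_tool(argv):
    """Run a command and return (ok, stdout, stderr).

    ok depends only on the exit status, so warnings or log spam
    written to stderr do not turn a successful run into a failure.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.returncode == 0, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in for a tool that succeeds but logs an error line to stderr,
    # mimicking the message seen in the log above.
    noisy = [
        sys.executable,
        "-c",
        "import sys; print('done'); "
        "print('ERROR:conftool:File /etc/conftool/etcdrc not found', file=sys.stderr)",
    ]
    ok, out, err = run_tool(noisy)
    print("ok:", ok)
    print("stdout:", out.strip())
    print("stderr:", err.strip())
```

In real use the argument list would be the command from the log, for example `["dbctl", "instance", "db2152", "depool"]`, with stderr kept for diagnostics rather than used as the success signal.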
[09:29:04] <_joe_> case in point there was an issue with conftool, and I needed to fix it before it broke more stuff :/
[10:48:54] who is the scap expert these days? How do I initialize a new repo in deploy1002:/srv/deployment ? I want to create "netbox-dev"
[10:49:39] my new test VM, which I think tries to fetch from it, fails with "Error: Execution of '/usr/bin/scap deploy-local --repo netbox-dev/deploy -D log_json:False' returned 70:"
[10:49:58] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/deployment_server/kubernetes.yaml#144 I believe
[10:50:10] running it manually I get ERROR:deploy-local:deploy-local failed: 404 Client Error: Not Found for url: http://deploy1002.eqiad.wmnet/netbox-dev/deploy/.git/DEPLOY_HEAD
[10:50:50] taavi: nice, thanks !
[10:52:44] <_joe_> you might need some elbow grease to make it work
[10:53:13] <_joe_> XioNoX: if you have created the directory already manually, or scap did, remove it before applying puppet with your change
[10:53:19] can I just mkdir or git clone, or is that going to break it all?
[10:53:43] I don't think I really need the full scap experience
[10:54:03] just that directory, and the existing netbox user
[11:04:15] <_joe_> that's going to break it all yes
[11:04:43] <_joe_> unless I misunderstood what your goal is
[11:04:59] <_joe_> just a repo?
[11:05:12] <_joe_> or to have code deployed via scap?
[11:05:34] <_joe_> in the first case, i think there's a class in puppet to init a git repo
[11:08:05] first case, yeah, the deploy will be with the deploy python code cookbook
[11:08:16] <_joe_> then better call volans
[11:08:46] that we've already been using for a long time on the netbox repo :D
[11:09:15] and it has not been managed via scap for a long time
[11:10:15] I wonder what would happen if we had to reimage the deploy host right now
[11:10:18] would it work?
[11:16:54] <_joe_> volans: I guess you'd want a separate automation for those cases
[11:17:09] <_joe_> you want puppet to create a user and clone a git repo, right?
[11:19:04] <_joe_> so I would think a dedicated define should do that for you
[11:19:22] <_joe_> volans: I guess you're keeping the deployment servers in sync yourself, right?
[11:19:45] <_joe_> as in, the cookbook does it
[11:20:26] the current netbox servers still go through this code path https://github.com/wikimedia/operations-puppet/blob/production/modules/service/manifests/uwsgi.pp#L77 not sure how safe it is to skip it
[11:21:22] eh "Note: this parameter will be removed onces ores.wmflabs.org stops using service::uwsgi" do we know if that happened?
[11:21:59] <_joe_> yes
[11:23:33] XioNoX: can we discuss this in today's office hours?
[11:24:41] add the item to the meeting notes please ;)
[11:24:45] volans: I won't be around at 6pm :(
[11:31:14] XioNoX: ack, I'm going for lunch now, we can chat later also with Luca, but the options here are either adapt the current status to work with what you want or start fresh ignoring the deployment host entirely and potentially using the reposync module in spicerack to manage the local copy of the repo when running the cookbook.
[11:33:59] I sent that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047482 migrating it all away from scap should be a side task to prevent scope creep. The netbox upgrade already grew quite out of hand
[12:13:38] I agree it's scope creep but at the same time if the current way doesn't work... :)
[12:25:56] volans: current way works (current netbox setup) but is probably not optimal
[12:56:58] XioNoX: I can definitely check into clearing out the last bits for scap, I know that everything is being slowed down but I agree with vol*ans that it is better to clean up now. Give us today/tomorrow to fix it, otherwise we proceed with your patch
[12:57:02] how does it sound?
[12:58:52] elukey: I'd argue that it's safer to add scap for netbox-dev, then remove it from there first, instead of removing it directly from production, but up to you
[13:27:49] taavi: yes, on the cumin host where you run the cookbook, download the appropriate firmware file into the appropriate directory under /srv/firmware/
[13:28:01] how do I do that?
[13:28:25] go to netbox, device page, click on the Dell link
[13:28:35] it brings you to the dell website with the serial already set
[13:29:11] go to the download page (or similar) and filter the firmwares available with the nic name (be careful between the 1G and 10G versions of the NICs)
[13:29:21] download the file and scp it to the cumin host
[13:29:39] assuming you can find the right version, if it's not there
[13:29:52] there is another dell page with older ones but I don't recall the URL from memory
[13:30:42] sorry I'm on mobile right now, can't be too much of a help
[13:31:05] taavi: OR
[13:31:17] try to run the cookbook from the other cumin host, maybe it's already cached there ;)
[13:34:09] volans: ha, /srv/firmware/poweredge-r640/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE sounds like what i'm looking for
[13:34:27] does 'RXP80' stand for anything meaningful I should check or is that just a part of the version or similar?
[13:35:07] so the file is already there?
[13:35:12] then the firmware upgrade cookbook
[13:35:29] yeah, the cookbook offers that file
[13:35:32] should list it to you
[13:35:34] :D
[13:35:35] great
[13:35:41] yeah the version is 21.85.21.92
[13:35:47] the other part I wouldn't bother much
[13:35:55] ok, I'll try, thanks :D
[13:37:03] wait, it doesn't even say if it's the 1g or 10g? weird
[13:57:09] effie: do you recall which dashboard https://phabricator.wikimedia.org/T366455 image is from?
[13:59:07] yes, I should have added a link, let me sort it
[13:59:20] just added one to the task
[14:01:55] AntiComposite: I was wondering how this appeared
[14:02:20] Krinkle: seems that at the moment the key is not generating more traffic than others, but I reckon it will be back :p
[14:02:45] eg this is from last night https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&viewPanel=58&from=1718746896360&to=1718770761002
[14:06:36] volans: seems like the upgrade worked, thanks!
[14:30:38] glad it did :)
[20:00:33] effie: thx. That should help LangEng triage it and understand it.