[04:57:27] FIRING: [2x] SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:02:27] RESOLVED: [2x] SystemdUnitCrashLoop: node-bgpalerter.service crashloop on rpki1001:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:26:45] 10netops, 06Infrastructure-Foundations, 06SRE, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10590752 (10Gehel)
[14:27:34] hi, looks like git cloning https://netbox-exports.wikimedia.org/dns.git is capped at 128k/s when cloning from my machine :]
[14:47:51] it goes pretty slow for me but I'm getting about 1.1Mbit/sec. Took like 5 mins to clone the 50MB :(
[14:49:17] runs fairly quick from cumin1002 though
[14:49:34] hashar: what is your need to clone this to your machine? i.e. do you do it frequently?
[14:50:21] the .git/objects/ directory is most of the size, I wonder if it's just a lot of messaging back-and-forth due to the size of the commit history?
[14:50:30] I tried this to see if it would be better but it didn't work
[14:50:31] ah sorry I should have given a bit more context
[14:50:35] https://www.irccloud.com/pastebin/wjedcWVS/
[14:51:04] I have rebuilt the releng/operations-dnslint image that is used to run dnslint in CI
[14:51:12] upgrading it from Buster (yes, no comment) to Bookworm
[14:51:34] and I wanted to test the dnslint process locally. After some minutes I wondered what I could have broken and found out the `git clone` was taking ages
[14:51:54] my guess is the repo on the server side is borked somehow
[14:52:06] or has some specificity that does not play nice with git
[14:52:07] ok
[14:52:12] BUT
[14:52:34] if I clone it from contint.wikimedia.org or a WMCS cloud instance, it is reasonably fast (some seconds rather than minutes)
[14:52:38] so I think it is fine to ignore it
[14:52:52] hmm ok, well that still is not great
[14:53:11] If it can wait I will talk to Riccardo on Monday and see what he thinks
[14:53:27] I think we should ignore it until it becomes a problem
[14:53:51] also sukhe said there is no intentional traffic shaping / rate limiting, whatever
[14:54:36] hashar: to the best of my knowledge, no.
[14:54:37] no there is nothing like that, and I checked a few network basics, I don't think there is a transport issue
[14:54:38] but maybe the repo on the server side could use some cleanup or an optimization of some sort
[14:55:04] it goes quick from within eqiad, so latency is what is making it go slower
[14:55:26] I do wonder if that's due to the size of the commit history or something
[14:55:28] * hashar blames either DNS or IPv6
[14:55:32] lol
[14:55:38] no winning with you guys :P
[14:55:42] I am SOOO ready to become a CIO
[14:55:55] I assume you only need the top of the latest commit, right, for the build?
[14:55:58] hahaha
[14:56:22] well git should send a large pack with everything bundled in it
[14:56:41] and maybe it takes time to craft that on the server side but then there would be the same slowness on other hosts
[14:56:51] so well who knows :)
[14:57:24] I think I will file it as a low-priority task and maybe debug it at some point in the future
[14:57:29] indeed it does. the contents of the repo are like 3MB of files, and 47MB of git objects
[14:58:09] I don't know enough about git to know if that should affect anything, or if we should expect the transfer to be as fast as a dumb 50MB file transfer
[15:00:51] oh
[15:00:53] I found it
[15:00:54] :)
[15:00:58] easy ™
[15:01:20] so git has a bunch of debug settings that can be turned on via environment variables
[15:01:20] oh yeah??
[15:01:24] which is sOOooo handy
[15:01:38] and you mentioned it is latency based / too many round trips
[15:01:52] and since we both do have a background in networking/telco we know exactly what it means
[15:01:58] stuff is doing too many dumb thins
[15:02:00] things
[15:02:17] haha exactly, that is the technical term for it :)
[15:02:24] here comes the GIT_TRACE_CURL=1 env variable which tells git to dump to stdout (fd=1) whatever curl is doing
[15:02:36] and the best hacking tool ever that is only known to the wisest folks: `grep`
[15:02:43] GIT_TRACE_CURL=1 git fetch |& grep '=> Send header: GET'
[15:02:45] +1
[15:02:55] that gives a constant stream of objects being retrieved:
[15:03:02] 16:01:01.122783 http.c:684 => Send header: GET /dns.git/objects/68/e442d4b3a09030290c1230fe933c702c28014f HTTP/2
[15:03:02] 16:01:01.542819 http.c:684 => Send header: GET /dns.git/objects/82/cd566abf334c29de39b42f6b2658c36ebd8a2a HTTP/2
[15:03:18] each being their own standalone https / TLS negotiation etc
[15:03:24] and all of them serially as far as I can tell
[15:03:33] ok yeah I suspected something like that
[15:03:41] and the sheer number of those objects is large in this repo
[15:04:08] esp. compared to the actual size of the data
[15:04:35] git count-objects -vH tells me there are roughly 19k objects
[15:04:53] so 19k TCP + TLS handshakes?? that's lots of fun
[15:05:08] if it takes 500 ms per request, we are looking at 158 minutes to clone
[15:05:09] :)
[15:05:11] I wonder why it was quick from a WMCS host?
[15:05:18] low latency as well?
[15:05:32] I am in Europe and on an ADSL connection
[15:05:35] ah sorry... you ran the clone from WMCS?
[15:05:44] I thought there was a mirror there you pulled to your own machine
[15:05:48] yeah I did try to clone from WMCS and that is reasonably fast
[15:05:59] right, same as me doing it from cumin1002
[15:06:24] so I think I can either: (a) ask for optical fiber at home (b) relocate near eqiad (c) set up a netbox mirror in France
[15:06:26] I tried the "--depth 1" in the clone as my understanding is that will only get the HEAD and potentially avoid this
[15:06:45] relocating is the obvious choice I think
[15:07:00] what I don't get is that git on the server side should craft a large pack that has everything
[15:07:01] optical fiber won't improve the latency much. well perhaps if you've got bad ADSL and interleaving and stuff
[15:07:29] hmm
[15:07:30] wait
[15:07:31] I guess as a hack you can clone to WMCS, then transfer it as a tarball or something. Though that's hardly something to make the official process
[15:07:40] $ ping netbox-exports.wikimedia.org
[15:07:40] PING netbox-exports.wikimedia.org(text-lb.drmrs.wikimedia.org (2a02:ec80:600:ed1a::1)) 56 data bytes
[15:07:44] ... 50ms
[15:07:49] so 50ms is fine
[15:07:59] but why is that in drmrs?
[15:08:05] That's to drmrs/esams yeah
[15:08:14] The text-lb LVS to be precise
[15:08:21] so that is dumb http serving that repo?
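
Aside: the arithmetic behind that "158 minutes" estimate, using the figures quoted above (roughly 19k objects per `git count-objects -vH`, 500 ms per request as the pessimistic case, ~50 ms as the measured ping). With the dumb protocol every loose object costs at least one serial round trip, so clone time scales with object count times latency rather than with repository size.

```python
# Illustrative only: clone time under the dumb HTTP protocol, where each
# loose object is fetched with its own serial request.
objects = 19_000                        # roughly what `git count-objects -vH` reported

for per_request in (0.5, 0.05):         # 500 ms (pessimistic) and ~50 ms (measured ping)
    minutes = objects * per_request / 60
    print(f"{per_request * 1000:.0f} ms/request -> ~{minutes:.0f} minutes")
# 500 ms/request -> ~158 minutes
# 50 ms/request -> ~16 minutes
```
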
[15:08:54] the load-balancer is in front of the repo yeah, I actually haven't tried to work out where the repo is (other than knowing it's in eqiad cos that's where performance is ok)
[15:09:40] you could try a trick in /etc/hosts to go direct to the LB in eqiad
[15:09:54] 2620:0:861:ed1a::1 netbox-exports.wikimedia.org
[15:09:58] netbox-exports is a DYNA, so it will return the geo IP of the text-lb closest to you. it's still behind the CDN but it's a "pass"
[15:10:15] but likely won't make a difference, it's the trans-atlantic latency that is the problem
[15:10:30] yep
[15:10:54] 50ms within France to Marseille isn't ideal though, you have my sympathies
[15:11:12] :P
[15:11:21] I'm at 20ms from Ireland to Amsterdam - which sounds like a longer trip
[15:11:21] that is still better than 150/200 ms from good ol' RTC
[15:11:28] I still feel spoiled to have "only" 50ms
[15:13:11] sukhe: do you know where this repo lives exactly? is there any way to clone with ssh possibly?
[15:13:23] (just for comparison)
[15:16:00] topranks: netbox::frontend so netbox2003.codfw.wmnet and netbox1003.eqiad.wmnet
[15:18:57] not sure if we can clone with ssh though
[15:19:42] last-modified: Mon, 22 Jul 2024 10:08:29 GMT
[15:19:42] last-modified: Wed, 27 Nov 2024 16:10:02 GMT
[15:19:43] so hmm
[15:19:50] the repository has a lot of commits / churn etc
[15:20:09] and those objects keep piling up in `.git/objects`
[15:20:21] yeah exactly, multiple commits every day
[15:20:23] because the clone happens over https and the dumb protocol (aka as if just Apache served it)
[15:20:30] git does a brute-force download of objects
[15:20:31] yeah I think that was the consensus last time we looked at this + the trans-atlantic latency topranks was talking about
[15:20:39] yeah, I suspect with "--depth 1" it would be quick
[15:20:48] when using the smart protocol (with git-http-daemon) the server would craft a single pack with everything in it
[15:20:56] my attempts to pull over ssh to try and validate that are failing though
[15:21:19] the last-modified headers I have pasted are for the two packfiles available. So the repo has not had a gc/repack since at least November
[15:21:31] there should be a systemd timer to do that on a weekly or so basis :)
[15:21:45] or ok yeah the server can be smarter if we use that rather than a standard web server that isn't aware of git-things?
[15:22:05] and it should ideally use git-http-daemon
[15:22:12] I'll file a task
[15:22:12] :)
[15:22:19] I am happy to have found the bottleneck
[15:22:22] you clearly understand all the git things I wanted to ask volan.s about :)
[15:22:22] thanks topranks !
[15:22:25] yep good stuff!
[15:22:31] please do file the task and we will discuss
[15:22:49] and sukhe tested that the newish dnslint image works in CI so we at least achieved the initial goal: get rid of Buster in the CI Docker image \o/
[15:22:50] the commits to that repo are only going to keep growing so probably we need to do something
[15:23:00] woot!
[15:23:18] yeah
[15:23:24] my job is tiring :(
[15:23:36] what was just a post-lunch "s/buster/bookworm"
[15:23:39] at least it's Friday :)
[15:23:57] becomes a "let's investigate how git deals with the dumb protocol when cloning a non-packed repo and how we can improve the Netbox architecture to make it faster"
[15:24:02] it's never like that is it?
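
Aside: a minimal sketch of the server-side cleanup being discussed — repacking so a dumb-HTTP clone fetches a couple of packfiles instead of thousands of loose objects. The repository path and the loose-object threshold are placeholders, not the real Netbox export location; something along these lines is what the weekly gc/repack timer mentioned above would run.

```python
# Hypothetical maintenance step for a bare repo served over dumb HTTP.
import subprocess

REPO = "/srv/netbox-exports/dns.git"   # placeholder path, not the actual location

def loose_object_count(repo: str) -> int:
    out = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        check=True, capture_output=True, text=True,
    ).stdout
    counts = dict(line.split(": ") for line in out.splitlines())
    return int(counts["count"])        # number of loose (unpacked) objects

if loose_object_count(REPO) > 1_000:   # arbitrary threshold for illustration
    # Pack loose objects and refresh info/refs + objects/info/packs,
    # the files dumb-HTTP clients rely on.
    subprocess.run(["git", "-C", REPO, "gc"], check=True)
    subprocess.run(["git", "-C", REPO, "update-server-info"], check=True)
```
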
[15:24:10] always :(
[15:24:12] such is our lot in life
[15:24:35] it is because we are too smart, too curious and too picky in what we accept as "normal failure"
[15:25:27] sukhe: and a dumb patch to dns.git does pass CI https://gerrit.wikimedia.org/r/c/operations/dns/+/1123666
[15:25:29] success!
[15:25:57] hashar: thanks! I tested it with fail-CI too https://gerrit.wikimedia.org/r/c/operations/dns/+/1123661 and that also worked
[15:26:50] \o/
[15:27:19] thanks for doing all the work :)
[15:38:10] topranks: sukhe: are requests load balanced between both netbox2003.codfw.wmnet and netbox1003.eqiad.wmnet ?
[15:38:16] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10591428 (10cmooney) Just want to confirm all the links are in place and working (the only ones I have not tested are the 100G t...
[15:38:50] I suspect they all go to the primary - netbox1003 - but I'm not 100%
[15:38:58] hashar: whatever is the active one, as per netbox.discovery.wmnet, which right now is netbox1003
[15:39:01] topranks: you are right
[15:39:08] cool
[15:55:53] 10netbox, 06Infrastructure-Foundations: https://netbox-exports.wikimedia.org/dns.git takes age to clone - https://phabricator.wikimedia.org/T387575 (10hashar) 03NEW
[16:04:18] 10netbox, 06Infrastructure-Foundations: https://netbox-exports.wikimedia.org/dns.git takes age to clone - https://phabricator.wikimedia.org/T387575#10591527 (10taavi) Fwiw, there is a mirror of this repo on Phabricator for this exact reason :-)
[16:06:55] so in short
[16:06:55] 10netbox, 06Infrastructure-Foundations: https://netbox-exports.wikimedia.org/dns.git takes age to clone - https://phabricator.wikimedia.org/T387575#10591530 (10hashar)
[16:07:17] A) running `git gc` on the git repo should pack the objects into a single pack, or at least a few of them
[16:07:24] instead of having hundreds of objects
[16:08:02] B) switch to git-http-backend to serve the git repos :)
[16:08:24] and done!
[16:08:42] at least I got rid of a Buster image :)
[16:14:42] I am off! Have a good weekend
[16:52:29] elukey: I'm done with ms-be2088, all yours!
[16:53:11] jhathaway: ack thanks!
[16:53:27] I just discovered https://phabricator.wikimedia.org/T387577#10591657
[16:53:33] and I feel very sad
[16:53:43] I need to come up with a different logic now
[16:59:17] elukey: so PXE booting all interfaces doesn't work or is too slow?
[16:59:42] jhathaway: it doesn't work, yes; sometimes dcops need to explicitly disable some nics
[17:00:03] and supermicro doesn't provide link status through redfish?
[17:00:56] I am rechecking, it seems yes, but there is no link to the corresponding key in the BIOS that has to be set to PXE
[17:01:22] for example
[17:01:26] >>> pprint(r.request("get", "/redfish/v1/Managers/1/EthernetInterfaces/1").json()["LinkStatus"])
[17:01:29] 'LinkUp'
[17:01:29] that is great
[17:01:48] but how can I then set "PXE" to something named like P1_AIOMAOC_AG_i2LAN1OPROM ?
[17:03:02] ah, so now way to correlate EthernetInterfaces/1 with P1_AIOMAOC_AG_i2LAN1OPROM
[17:03:09] s/now/no/
[17:03:09] exactly yes
[17:03:18] ugh
[17:03:20] but maybe there is some weird logic that I still don't know
[17:03:43] *just* open a ticket ;P
[17:04:12] I may have to do it.. Is there a procedure to do it? I meant to ask it the other week
[17:05:47] Willy just needs to request that they add you to their portal
[17:05:55] then you can open a ticket
[17:06:59] super
[17:09:00] how do you get the list of nics, e.g. P1_AIOMAOC_AG_i2LAN1OPROM
[17:09:46] I basically get the BIOS's Attributes and grep for anything with "LAN" inside
[17:10:26] in the provision cookbook we set "legacy" where needed, which I thought was the same as PXE but it may not be
[17:11:47] what redfish call gives you the bios attributes? I don't see any LAN devices under /redfish/v1/Systems/1
[17:11:54] on the ms-be2088 nod
[17:11:56] node
[17:12:10] probably not relevant but I played with trying to use redfish to enumerate the NICs from the PCIe IDs when trying to work out the linux interface naming
[17:12:12] https://phabricator.wikimedia.org/T347411#9203210
[17:12:49] top comment is great, lol
[17:13:02] ahahha yes! Thanks for the link, I'll read it
[17:13:04] it's true :D
[17:13:24] yeah tbh I don't think there is much there, apart from perhaps correlating via PCIe location being an approach
[17:13:36] "P1_AIOMAOC_AG_i2LAN1OPROM" doesn't seem related though
[17:18:31] the weird thing is that we have at least three different naming schemes in supermicro configs
[17:22:48] per paravoid's bug, it seems like systemd cast the naming methodology in stone and is pretty reticent to change it, which is both understandable and unfortunate
[17:26:43] I wouldn't say so, they are not the easiest project to work with but were open to a PR
[17:26:56] (that was many years ago though, who knows by now)
[17:27:16] https://github.com/systemd/systemd/issues/12261 is the upstream bug
[17:27:40] o/, yup that is the one I read
[17:27:59] hi :)
[17:28:00] I thought you did a nice job being diplomatic :)
[17:28:27] curious, did you ever send the email to the linux kernel mailing list?
[17:28:32] no
[17:28:36] never got around to it
[17:28:52] understandable, not a small ask, given the complexity
[17:30:34] honestly I had a good grasp of what it would take, I don't remember thinking it was a ton of work
[17:30:40] just other priorities at the time :(
[17:30:43] sorry!
[17:30:43] Hi Faidon :)
[17:30:47] hope all is good in your world
[17:31:42] hi! yeah, can't complain!
[17:35:11] no need to apologize! just curious
[17:35:54] Hi Faidon! o/
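
Aside: a rough sketch of the two Redfish views being compared above — per-NIC link status on one side, and the BIOS attribute keys that control PXE/OPROM per port on the other. The BMC address and credentials are placeholders; the Managers/1 path mirrors the pprint() example earlier, and Systems/1/Bios Attributes is the standard Redfish location for BIOS settings, but both vary by vendor and firmware. Nothing in either view ties an attribute key like P1_AIOMAOC_AG_i2LAN1OPROM back to an EthernetInterfaces entry, which is the gap being discussed.

```python
# Illustrative Redfish queries; host, credentials and exact paths are assumptions.
import requests

BMC = "https://ms-be2088.mgmt.example"   # placeholder management address
AUTH = ("root", "********")              # placeholder credentials

def rf(path: str) -> dict:
    r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=30)
    r.raise_for_status()
    return r.json()

# 1) Link status per interface, as in the pprint() example above.
for member in rf("/redfish/v1/Managers/1/EthernetInterfaces")["Members"]:
    nic = rf(member["@odata.id"])
    print(nic["Id"], nic.get("LinkStatus"))

# 2) BIOS attributes whose names mention LAN (the PXE/OPROM knobs).
attrs = rf("/redfish/v1/Systems/1/Bios")["Attributes"]
for key, value in attrs.items():
    if "LAN" in key:
        print(key, "=", value)
```
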
[17:38:44] could we correlate the mac from lldp, or do we not know that?
[17:39:19] what stage are we talking here, Jesse?
[17:39:41] it is something I've thought about yeah
[17:39:49] Dell attempted to fix this problem with https://github.com/dell/biosdevname way back in the day, but it was superseded by systemd
[17:40:00] it was still more consistent and reasonable than systemd for a number of years
[17:41:34] just mentioning it in case you find value in inspecting what it did to create consistent names
[17:41:43] thanks yeah, wasn't aware of it at all
[17:42:02] it's been a while since we looked at it tbh, but no doubt we'll have to revisit
[17:42:13] topranks: I was thinking about the provisioning stage, and e.lukey's issue, https://phabricator.wikimedia.org/T387577
[17:42:14] I think probably the best way forward is to pull the MAC from redfish
[17:42:30] and use the mac-based naming scheme in systemd
[17:43:09] the only thing we need to work out is how to change systemd to use that scheme instead of the default, and what that looks like if it boots the installer with the acpi or location-based name, and then reboots to the mac-based one
[17:44:08] nod, a bit ugly, but would definitely simplify identification
[17:44:51] yeah it's definitely ugly but may still be the best way forward
[17:45:03] yeah
[17:45:21] for Luca's issue we sort of have a convention to use the first port always
[17:45:37] all Supermicros will have a dual-port 10/25G NIC right?
[17:47:31] not sure
[17:49:56] I guess where I'm coming from is that if there is a convention that we follow, we may not have to do anything to "work out" which one is connected to the network in any given case
[17:50:02] we always know it'll be 10/25G port 1
[17:51:14] makes sense
[17:52:18] there is also the option of moving to the cloud :P
[17:56:34] I hear it rains there also :P
[17:57:48] lol
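
Aside: a sketch of the MAC-based direction suggested above — read each port's MAC out of Redfish and derive the name systemd's MAC-based policy (NamePolicy=mac) would assign, i.e. "enx" plus the MAC with separators stripped. Host, credentials and the Systems/1 path are placeholders; as noted earlier, some Supermicro firmware may not expose host NICs under Systems/1 at all.

```python
# Illustrative only: map Redfish MACs to systemd's MAC-based interface names.
import requests

BMC = "https://ms-be2088.mgmt.example"   # placeholder management address
AUTH = ("root", "********")              # placeholder credentials

def mac_to_ifname(mac: str) -> str:
    # systemd's "mac" naming policy: enx + lowercase MAC without separators.
    return "enx" + mac.replace(":", "").replace("-", "").lower()

base = BMC + "/redfish/v1/Systems/1/EthernetInterfaces"
for member in requests.get(base, auth=AUTH, verify=False, timeout=30).json()["Members"]:
    nic = requests.get(BMC + member["@odata.id"], auth=AUTH, verify=False, timeout=30).json()
    mac = nic.get("MACAddress") or nic.get("PermanentMACAddress")
    if mac:
        print(nic["Id"], mac, "->", mac_to_ifname(mac))
```
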
[18:46:55] FIRING: MaxConntrack: Max conntrack at 81.2% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:51:55] RESOLVED: MaxConntrack: Max conntrack at 81.84% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:19:44] FIRING: NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:24:44] FIRING: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:30:44] FIRING: NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[23:34:44] RESOLVED: [2x] NetboxAccounting: Netbox - Accounting job failed - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/12/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxAccounting
[23:35:44] FIRING: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts
[23:40:44] RESOLVED: [2x] NetboxPhysicalHosts: Netbox - Report parity errors between PuppetDB and Netbox for physical devices. - https://wikitech.wikimedia.org/wiki/Netbox#Report_Alert - https://netbox.wikimedia.org/extras/scripts/18/jobs/ - https://alerts.wikimedia.org/?q=alertname%3DNetboxPhysicalHosts