[05:54:49] this tool looks pretty nice: https://github.com/SigNoz/signoz (/cc o11y)
[06:04:52] <_joe_> XioNoX: that is a visualization tool for distributed tracing
[06:05:40] _joe_: yeah I thought Chris and/or o11y were looking into it
[06:06:53] <_joe_> ah no, it also does the collection, uhm
[07:53:21] Folks, I'm not feeling so hot; came down with a cold over the weekend and still not feeling great today, gonna take a sick day (cc jobo)
[07:54:30] topranks: sorry to hear that, take care. Should we find a substitute for the oncall?
[07:57:39] Ugh, yeah, I'm not best placed to handle anything.
[07:57:49] If someone was able to, that would be great; otherwise I'll keep my phone close and try my best to respond
[07:58:38] topranks: sorry I can't help with that as I'm oncall too already :D
[07:58:44] but don't worry, we'll figure it out
[08:02:48] Volans: thanks, if nobody can I'll dose myself with lemsip and espresso if there's an incident
[08:03:11] go to bed, close the laptop!
[08:40:34] I'm happy to take over topranks' on-call week, as I would be on-call the next one
[08:41:06] I'll swap today and discuss the rest of the week with topranks when he is online :)
[08:45:19] thanks vgutierrez! <3
[08:46:03] the override should trigger in 15 minutes... as I couldn't create one with a start date before now()
[08:47:57] ofc, who lives in the past :D
[09:05:02] there it is :)
[09:41:19] heads-up, at 10:00 I'm gonna depool sessionstore in eqiad to reimage some hosts
[09:41:31] do we have a Puppet datatype for a MAC address?
[09:55:03] <_joe_> hnowlan: ack, that should cause some latency on the appservers in eqiad
[09:55:14] <_joe_> but nothing too tragic, I expect a 30-40 ms bump
[09:59:20] _joe_: yeah, somewhat unavoidable given the kask issues. I'll keep it as short as I can
[09:59:51] <_joe_> ah np, just noting so that everyone is aware
[10:03:14] actually the depool will only be needed when the reimage completes, so I will hold off for a little bit
[10:06:49] unrelated to this work, but there's been a significant uptick in session loss since the 1st https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=1664537985052&to=1664791550466
[10:07:00] No idea if that means anything but it is unusual
[10:13:17] <_joe_> that seems like something that could relate to a deployment
[10:13:33] <_joe_> it is indeed, good catch
[10:21:08] arturo: Stdlib::MAC?
[10:21:22] vgutierrez: yeah, thanks, t.aavi was faster :-P
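(As an aside on the Stdlib::MAC answer above: a minimal sketch of how that type alias from puppetlabs-stdlib is used to validate a parameter. The class and parameter names here are hypothetical, not from the production repo.)

```puppet
# Hypothetical example class -- only the Stdlib::MAC alias itself comes from
# puppetlabs-stdlib; the class and parameter names are illustrative.
class network::static_lease (
  Stdlib::MAC $mac,  # e.g. '00:11:22:33:44:55'
) {
  notify { "reserving lease for ${mac}": }
}
```

Passing a malformed value such as 'not-a-mac' fails at catalog compilation time, rather than shipping a broken config to the host.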
[10:43:58] sessionstore reimage/juggling finished - went well enough that I'll be doing the remaining eqiad node now (after a few minutes of graph-watching)
[13:33:32] XioNoX: _joe_: really interesting tool, thanks for the link. All the current discussions have revolved around Jaeger; this tool is a bit younger but also has a few features that would be interesting for us. The good news is that both are OpenTracing implementations
[13:33:51] <_joe_> yep
[13:36:53] agreed, that's the first time I've heard of this tool, could be interesting
[13:37:19] Since the best way to do telemetry is to go through the otel collector anyway, it can easily be evaluated alongside Jaeger
[13:39:00] excuse me, OpenTelemetry, of which OpenTracing is an (archived?) subset
[13:45:46] yep, archived and folded into OpenTelemetry
[17:09:24] anybody familiar with how we deal with getting the debian installer to load specific storage card drivers necessary for install?
[17:10:15] we have a Dell HBA that needs the mpt3sas kernel module loaded to detect the root disks, and it wants to prompt for driver selection. The driver does exist in the installer; it's the manual selection part we're trying to avoid.
[17:10:29] https://phabricator.wikimedia.org/T317244 ^
[17:19:05] bblack: what's the debian installer question for the manual selection?
[17:19:19] to match the relevant preseed line
[17:19:51] although I would have thought it was autodetected tbh, but maybe it's not
[17:20:09] oh I linked the wrong ticket (related)
[17:20:35] eh :) sorry for asking, I didn't find it there
[17:20:40] https://phabricator.wikimedia.org/T319067
[17:20:46] ^ that has a screen cap
[17:21:03] of where the installer got stuck
[17:21:21] but from install_console, I was able to basically "modprobe mpt3sas" and the disks showed up fine
[17:22:06] are you installing bullseye or buster?
[17:22:33] also a potentially confusing factor that I'm now realizing: the initial imaging/install attempts have --os bullseye for the spare::system install, but in reality they'll be buster once they image into their real roles
[17:22:41] it's entirely possible this is a bullseye-only problem heh
[17:23:00] I see a kinda related possible issue in here:
[17:23:01] d-i hw-detect/load_firmware boolean false
[17:23:12] in modules/install_server/files/autoinstall/bullseye.cfg
[17:23:22] that is not in buster
[17:23:49] hmmmm
[17:24:10] either way, I might try just doing the initial imaging as buster for now, and address this later when we do the bullseye migration
[17:24:17] if it works, we're past this for now :)
[17:26:36] SGTM :) (see the preseed sketch at the end of this log)
[18:01:22] eh, buster doesn't work either. It stops on a slightly different screen about:
[18:01:25] "When RAID is configured, no additional changes to the partitions in the disks containing physical volumes are allowed. Please convince yourself that you are satisfied with the current partitioning scheme in these disks. Keep current partition layout and configure RAID?"
[18:01:39] but that's probably because it has no /dev/sd[ab], because it still didn't load the driver
[18:02:55] if you try to manually partition, is there any disk/partition?
[18:03:00] if not, that might be the case
[18:03:33] there's not, I'm on install_console
[18:03:48] also on buster, loading mpt3sas didn't do anything, whereas on bullseye it made them show up
[18:04:05] so I'm guessing buster's kernel lacks the right version of the driver to match this card
[18:04:11] :(
[18:05:00] that's going to be fun!
[18:05:29] we don't even want/need such a card; either way we're doing software RAID. Just a new Dell model bringing new annoyances.
[18:06:13] and it's probably not reasonable to port our stack to bullseye in the desired timeframe to get the hw swaps done, either
[20:10:39] do we still intend to run dc-switchover tests? I suppose in the new multi-dc world that might mean running masters out of an alternate datacenter, or perhaps sending all traffic to a single cluster?
[23:06:28] * ebernhardson finds it super tedious to track ports through 3-4 layers to ensure the right mw config points to the right cirrus cluster
[23:28:42] ebernhardson: we were just discussing that earlier today in the serviceops meeting -- yes, we'll still do switchover tests, which at the current stage would mean sending writes to codfw and depooling eqiad for reads
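(To close out the installer thread from [17:19:05]-[17:26:36] above: a hedged sketch of what a preseed-based fix might look like, assuming the standard debian-installer keys hw-detect/load_firmware and preseed/early_command. These lines are illustrative, not the actual contents of the autoinstall configs.)

```
# Illustrative preseed lines only -- not the real contents of
# modules/install_server/files/autoinstall/*.cfg.
# Let d-i load extra firmware without prompting (bullseye.cfg sets this to false):
d-i hw-detect/load_firmware boolean true
# Force-load the HBA driver early, mirroring the manual "modprobe mpt3sas"
# workaround run from install_console:
d-i preseed/early_command string modprobe mpt3sas
```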