[09:20:10] saw this (ancient) horror story via fedi - https://milk.com/wall-o-shame/bucket.html
[10:38:10] Amir1: did you depool the server?
[10:38:21] effie: yes, see the cookbook output in -operations
[10:39:08] So the other thing to do (given it's a replica not a master) is to open a ticket for the DBAs (and eyeball the server briefly, but largely just to put something sensible on the ticket)
[10:40:36] host isn't pingable still, I'll go look at the web idrac thingy
[10:41:47] alright, I will open a task
[10:42:33] thanks, add the dba tag to get it to the right folks
[10:45:01] server appears up from the console (trying to see if I can get in)
[10:47:07] OK, in from the console
[10:47:37] and, oddly, the host is now pingable from a cumin node
[10:48:40] host uptime is 162 days, nothing in kernel log
[10:48:57] could be a network hiccup?
[10:49:01] syslog has mysqld noting network loss and regain
[10:49:15] ping from the cumin node has stopped once more
[10:49:52] so yeah, I think something is wrong with the networking on this node most likely, but not badly enough for the kernel to spot link down.
[10:50:28] effie: have you got a task number? I'll summarise my findings there, and downtime the node until Monday, since I'm guessing otherwise the network is going to continue to flap all weekend
[10:51:03] yes, I subscribed you
[10:51:21] I do not see anything very concerning on grafana either https://grafana.wikimedia.org/goto/effvwvrqw7uv4d?orgId=1
[10:51:23] OK, cool, I see the email now. Thanks :)
[10:51:54] Emperor: want me to step in? I don't have much time, but on an ongoing issue I can
[10:52:05] though that spike after 10 UTC, dunno what it is
[10:52:32] sorry wrong url
[10:52:42] jynus: this is a replica server which I think we can leave depooled until Monday, so if you're stretched I don't think you have to drop other stuff today
[10:53:03] nah, but I usually see stuff quicker
[10:53:15] I prefer you reviewing my stuff in exchange 0:-)
[10:53:51] https://grafana.wikimedia.org/goto/dffvx3ys0z474c?orgId=1, nothing odd before the host's network started flapping
[10:54:01] jynus: give me a couple of minutes to finish writing up my observations, then it's all yours. I'll ping you here when I'm done
[10:54:43] lots of kills of sleeping threads
[10:54:47] Emperor: for sure
[10:54:55] that usually means a stall
[10:56:59] there was an abnormal size of inserts since 08:33
[10:57:23] jynus: I'm out of the system, and have commented on the task; I'm going to downtime until Monday morning.
[10:58:47] ok, then taking over analysis, will read what you said and continue from there
[10:59:07] !incidents
[10:59:07] 7747 (RESOLVED) Host db1258 (paged)
[10:59:08] 7745 (RESOLVED) This is a test incident (please ignore)
[10:59:19] downtime done.
[11:00:22] thank you both
[11:00:38] the mysql log is normal, for having tcp error ofc
[11:01:24] there was a spike on inserts at 10:50, but may have been a depool consequence
[11:01:58] so I think it is not mysql related, but host related
[11:05:34] "The network link is down. Either the network cable is not connected or the network device is not working"
[11:06:12] link is flapping
[11:06:20] weird dmesg didn't show that
[11:06:52] first event: 2025-10-01 12:15:38
[11:07:06] but got more frequent only recently
[11:20:14] thanks E.mperor j.ynus for the support
[12:32:33] moritzm: pki1002 is you or is it a real issue?
[12:32:56] not me, no
[12:33:47] pki1002 is the new WIP trixie host, it's only used by wikikube-staging
[12:33:51] it's not the main PKI host, that one is still pki1001
[12:33:52] ah, ok
[12:36:16] SEL says "A high-severity issue has occurred at the Power-On Self-Test (POST) phase which has resulted in the system BIOS to abruptly stop functioning.", I'll try to powercycle it in a bit and otherwise this will need a DC ops task
[12:36:55] so many hw issues today :-(
[12:51:22] sigh
[12:51:47] at least it is not cfssl itself that panicked, that would have been worse
[13:38:05] checking https://doc.wikimedia.org/spicerack/master/api/spicerack.service.html#spicerack.service.ServiceIPs I don't see an easy way of fetching all IPs (v4/v6) for a given site
[13:38:32] consider ncredir as an example
[13:38:39] https://www.irccloud.com/pastebin/rgZnXbuA/
[13:39:06] .all provides all the IPs but gets rid of the site dimension
[13:43:16] am I missing something here?
[13:43:41] cause .get(dc) requires me to know the label
[13:45:48] I guess that I can access `.data` directly and fetch per dc like >>> catalog.get("ncredir").ip.data['codfw']
[13:45:48] {'ncredirlb': '208.80.153.232', 'ncredirlb6': '2620:0:860:ed1a::9'}
[13:46:16] how did you get catalog?
[13:46:30] spicerack.Spicerack.service_catalog()
[13:46:39] from here https://doc.wikimedia.org/spicerack/master/api/spicerack.service.html#spicerack.service.Catalog there are some options but it may depend on the actual implementation of that
[13:56:14] vgutierrez: ah I see you mean site as in https://doc.wikimedia.org/spicerack/master/api/spicerack.service.html#spicerack.service.ServiceIPs.get
[13:56:22] yes
[14:01:34] ok I think I get what's happening
[14:01:54] the Catalog class wants a dict as init parameter, that is basically the yaml load of service.yaml from puppet
[14:02:43] and the "ip" key is a dict with sites as keys
[14:03:10] that is what you see in ServiceIPs I think.
[14:03:56] because it gets created like params["ip"] = ServiceIPs(data=params["ip"])
[14:05:20] so the ServiceIPs' get in theory should use the "sites" to loop through the dc keys, and return what's needed
[14:05:29] but it assumes "sites" is there as key, afaics
[14:05:56] if it is not super urgent we can try to patch it, but it will require a spicerack release
[14:06:13] so in the meantime you may use ip.data directly
[14:06:23] does it work for you vgutierrez ? I don't have anything better
[14:06:54] yes, ip.data works for me
[14:09:19] mmm maybe I am wrong, the code does ip_str = self.data.get(site, {}).get(label, ""), so it uses the site variable, in this case it should be the dc one
[14:09:29] lemme check better
[14:10:24] (I sound like a very bad LLM but I am not, I can assure you)
[14:13:00] yeah... and that works
[14:13:10] I can use get() to fetch an IP if I know the label
[14:13:14] or if it's the default one
[14:13:16] ok I got it
[14:13:19] but I need both IPv4 and IPv6
[14:13:19] >>> catalog.get("ncredir").ip.get('codfw', 'ncredirlb')
[14:13:19] IPv4Address('208.80.153.232')
[14:13:27] yeah
[14:13:34] no way of fetching the IPv6 one AFAIK
[14:13:42] well.. with the label ncredirlb6
[14:13:50] but yeah.. I need to know them first
[14:13:57] so that would be fetching them via .data
[14:13:58] oh ok you wanted something that worked with sites only
[14:14:02] site
[14:14:41] I need to be able to fetch all the service IPs in a single site
[14:14:57] for a given service
[14:15:07] without knowing its labels those vary from service to service
[14:15:23] if I need to loop through the data structure to fetch the labels I can also do that and fetch the IPs myself
[14:15:38] yep at the moment there is no way
[14:16:05] that's fine, I was wondering if I was missing something obvious
[14:16:22] nope, you got it right, in this case the best action is to run through data
[14:16:34] we can add a more lenient get in theory
[14:22:34] <_joe_> elukey: you don't learn do you?
[14:22:40] <_joe_> vgutierrez: "patches welcome"!
[14:22:57] _joe_: thanks
[14:23:03] <_joe_> vgutierrez: you're welcome
[14:23:05] really appreciated :)
[14:23:18] <_joe_> I was trying to teach elukey how it's done
[14:32:49] I mean both of you at the same time
[14:33:00] * elukey shakes his head
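
Note on the `.data` workaround discussed above: a minimal sketch of "run through data" to collect every IP (v4 and v6) for one site without knowing the labels. It assumes, as pasted in the log, that `ip.data` is a mapping of site to {label: IP string}. The helper name `ips_for_site` is hypothetical and not part of spicerack, and the dict below is only a stand-in for `catalog.get("ncredir").ip.data`, filled with the codfw values quoted in the conversation.

```python
import ipaddress
from typing import Dict, Mapping, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def ips_for_site(data: Mapping[str, Mapping[str, str]], site: str) -> Dict[str, IPAddress]:
    """Return label -> parsed IP for a single site.

    Hypothetical helper (not a spicerack API). An unknown site yields
    an empty dict rather than raising.
    """
    return {
        label: ipaddress.ip_address(ip_str)
        for label, ip_str in data.get(site, {}).items()
    }


# Stand-in for catalog.get("ncredir").ip.data (assumption: same shape as
# the codfw output pasted in the log).
ncredir_ip_data = {
    "codfw": {
        "ncredirlb": "208.80.153.232",
        "ncredirlb6": "2620:0:860:ed1a::9",
    },
}

codfw_ips = ips_for_site(ncredir_ip_data, "codfw")

# Split by address family without relying on label naming conventions.
v4 = [ip for ip in codfw_ips.values() if ip.version == 4]
v6 = [ip for ip in codfw_ips.values() if ip.version == 6]
print(v4, v6)
```

Splitting on `ip.version` instead of a label suffix such as `6` avoids depending on labels, which vary from service to service.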