[07:52:58] hey folks good morning
[07:53:26] not sure if it is monday and my brain is still lagging, but most of the grafana dashboards that I see have big holes in metrics
[07:53:30] https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&var-site=eqiad&var-deployment=mw-web&var-method=GET&var-code=200&var-handler=php&var-service=mediawiki
[07:53:36] https://grafana.wikimedia.org/d/StcefURWz/docker-registry?orgId=1
[07:53:42] is it the same for you?
[07:54:01] yes
[07:54:19] also a lot of alerts with NaN/Unknown in karma
[07:57:15] elukey: I'm taking a look. I found some Prometheus instances with problems this morning and I'm trying to recover them
[07:57:25] ciao tappof
[07:57:28] yes same here, gap between 4:00 and 7:50 UTC
[07:57:33] yes I was checking https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=prometheus&var-instance=All :(
[07:59:04] from https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Do11y I see that some prometheus instances have restarted, no idea what it means
[08:00:55] elukey: I'm restarting those instances because they are not accessible and they're not responding over COM2 via racadm
[08:01:14] okok I was checking the uptime too
[08:01:35] tappof: remember to `!log` things so we are all aware
[08:02:00] yeah elukey you're right
[08:04:29] okok so IIUC at least 3 prometheus hosts were totally frozen, and they were rebooted
[08:04:45] did racadm getsel show anything?
[08:05:44] going to check on 1005
[08:07:00] nothing for today, OEMs for the 29th
[08:07:43] same for 1006
[08:08:54] tappof: prometheus1008 seems down as well, ok to reboot?
[08:09:59] ah wow I can log into console and instead of the root login I see the kernel dumping a ton of stacktraces
[08:10:12] watchdog: BUG: soft lockup - CPU#47 stuck for 26546s! [prometheus:844496]
[08:11:50] powercycled
[08:11:53] elukey: I didn't get stack traces on the other hosts, they were just hanging ... free to reboot
[08:12:19] I am checking https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=prometheus&var-instance=All
[08:12:41] so 2005 and 2008 seem to show high load
[08:13:15] lemme check both
[08:14:34] 2005 is different, no cpu soft lockup but I can't log in as root, a lot of things dumped
[08:14:58] elukey: Which one did you start from? I'll take the other.
[08:15:10] 2005, just powercycled
[08:15:13] ok 2008 ... I'll take a look at 2008
[08:16:38] 2007 seems to offer ssh, but I see the 1-min load up to 22
[08:16:46] maybe it is a good use case to find what's wrong
[08:17:31] the kernel shows stuff like
[08:17:32] [Mon May 5 00:18:24 2025] BUG: kernel NULL pointer dereference, address: 0000000000000000
[08:17:37] that seems not ideal
[08:19:28] IIUC they all got rebooted for the new kernels, maybe something weird is going on?
[08:19:35] (last week I mean)
[08:20:21] https://phabricator.wikimedia.org/T392804 IIUC
[08:20:47] yes kernel version matches
[08:22:19] I'll add a comment
[08:22:50] thank you elukey, catching up on backlog
[08:25:41] added!
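For reference, a rough sketch of the kind of checks discussed above, assuming Dell iDRAC management with remote racadm; the hostnames, credentials, and time window are placeholders rather than the exact commands run during this incident:

```bash
# 1. Check the iDRAC System Event Log for hardware events around the freeze
#    (remote racadm syntax; the management hostname and credentials are placeholders).
racadm -r prometheus1005.mgmt.eqiad.wmnet -u root -p 'PASSWORD' getsel

# 2. On a host that still answers ssh (e.g. 2007 at this point), confirm the
#    running kernel, to compare against the version flagged in T392804.
ssh prometheus2007.codfw.wmnet uname -r

# 3. Look for soft-lockup / NULL-pointer traces in the kernel log since the
#    start of the metrics gap (~04:00 UTC).
ssh prometheus2007.codfw.wmnet "sudo journalctl -k --since '04:00' | grep -iE 'soft lockup|NULL pointer'"
```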
[08:26:08] Moritz is out for part of the morning so we can ping him later about it
[08:26:47] looks like I was lucky, 2007 is not responding to ssh now
[08:27:13] probably it makes sense, we rebooted the others and load hit it
[08:28:27] indeed
[08:30:21] cluster overview doesn't seem to show issues for any of the other DCs (besides eqiad/codfw)
[08:30:30] looks like centrallog hosts also suffered the same fate heh
[08:30:56] Logged into 2007, but I'm unable to perform any actions ... rebooting
[08:36:22] so far it seems that we are recovering
[08:37:00] I see metrics in the dashboards after the "holes"
[08:37:39] same
[08:43:41] I'll open a subtask of T392804 with what we know so far, I did find fstrim.service starting on prometheus2007 shortly before the kernel freaked out
[08:45:37] super
[09:01:53] vrts1003 and vrts2002 look like they are in the same state heh, I'll reboot
[09:43:44] this is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460
[09:45:02] I'm opening a task with all bookworm hosts using raid10, then we can revert them to 6.1.133 (and uninstall the 6.1.135 update if they are not yet rebooted)
[09:48:34] ack, thank you
[09:56:19] mirrors.wikimedia.org is timing out for me from apt, not sure if expected
[09:57:46] I see it now failing
[09:59:22] mirror1001 looks fine to me, network issue?
[10:03:33] nope, network looks fine
[10:04:48] but what I reported is real: https://grafana.wikimedia.org/goto/KeMsEbxNg?orgId=1 but now suspecting mirror traffic
[10:06:37] I created https://phabricator.wikimedia.org/T393366
[10:35:45] moritzm: I am not going to create a ticket because it is not a big issue, but I think the mirrors may get overloaded for some hours every day, peaking today https://grafana.wikimedia.org/goto/XHgAUxxNR?orgId=1
[10:41:33] or possibly "AI" scrapers, we've had a bunch of bytedance IPs hit it hard before as well
[10:42:49] but mirror1001 isn't behind the CDN, so the usual countermeasures are not available
[10:43:16] yeah, no biggie, just wanted to give you a heads up for awareness, not expecting you to take any action, but in case it repeats
[10:43:34] as usual, these things tend to get worse
[10:43:52] and we'll most likely phase out the mirror by end of the year, it sees very little use since Debian defaulted installations to the Debian CDN (deb.debian.org), so it's likely we're not refreshing the hardware
[10:44:00] let me send you the IPs that got the most hits
[10:44:09] ack
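Pulling the top client IPs out of the mirror's access log could look something like the sketch below; the log path and combined log format are assumptions, since mirror1001's exact web server setup isn't shown here:

```bash
# Hypothetical sketch: count requests per client IP in a standard
# combined-format access log and show the 20 busiest clients.
# Adjust the path to wherever mirror1001 actually writes its access log.
awk '{print $1}' /var/log/nginx/access.log \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -20
```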
[13:21:09] I'm testing a new haproxykafka version on cp7001, no impacts expected but for safety I've depooled && deactivated puppet on that host
[14:51:18] oncall from eu: nothing to report
[14:56:23] x-posting from serviceops, since that seems like mostly a bot channel. If anyone has time to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/838182 it would be appreciated. It's an envoy patch that'd let us pool/depool elasticsearch without requiring a mwconfig change
[15:03:17] <_joe_> inflatador: I am back today, I'll try to take a look
[15:10:43] _joe_ Thanks for taking a look. We broke envoy last time we pushed a patch w/out service ops review, so trying to be safer this time ;)
[20:42:10] ryankemper: inflatador: there are a fair number of wdqs-related alerts popping up in -operations, seemingly related to hosts touched by [0]. is that expected / known?
[20:42:10] [0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1141956
[20:43:05] swfrench-wmf just got back to my desk, but I do believe ryankemper is working on those
[20:43:18] swfrench-wmf: thanks for the ping. these are expected, I had forgotten to downtime however
[20:44:22] great, thank you both - just wanted to confirm that this was a known thing given their number :)
[20:45:37] NP, it's appreciated. "If you see something, say something" and all that ;P
[20:45:47] ^
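For what it's worth, a generic sketch of pre-silencing expected alerts with the upstream Alertmanager CLI (amtool) before planned work; the matcher label, duration, and Alertmanager URL below are placeholders, and the actual workflow here may go through Icinga downtimes or an SRE cookbook rather than amtool directly:

```bash
# Hedged sketch: create an Alertmanager silence covering the hosts being
# worked on, so expected alerts don't page during the maintenance window.
# The URL and the instance regex are placeholders.
amtool silence add \
  --alertmanager.url=http://alertmanager.example.org:9093 \
  --author="ryankemper" \
  --comment="expected alerts from planned wdqs maintenance" \
  --duration="4h" \
  'instance=~"wdqs[0-9]+.*"'
```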