[13:26:58] PR if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/917343
[14:31:59] o/
[14:32:48] I'm stuck in traffic and won't make it for triage today
[15:02:07] pfischer: ack
[15:03:23] ebernhardson: triage: https://meet.google.com/eki-rafx-cxi
[16:02:11] ryankemper I'll depool codfw w/conftool now
[16:02:43] and....done
[16:05:59] sigh...my external monitor doesn't want to connect :(
[16:16:58] workout, back in ~40
[16:18:48] no clue why that fixed it ... but rebooting makes everything work :)
[16:52:33] back
[17:03:24] quick break, back in ~15
[17:30:53] back
[18:09:03] lunch, back in ~1h
[18:12:34] o/
[18:15:03] \o
[18:56:41] back
[19:02:01] ebernhardson: 1:1? https://meet.google.com/stp-swkd-iho
[19:07:07] ebernhardson: rescheduling for Wednesday
[19:07:32] gehel: doh, I was distracted
[19:07:56] ebernhardson: if you're still there, I can jump back in
[19:08:15] sure
[19:08:38] looks like we got an alert for an eqiad wdqs host too...checking
[19:21:02] reboot cookbook appears broken. Lovely
[19:21:38] ah well, I'm asking for help in #wikimedia-sre
[19:33:49] Oops, I was using dry-run...
[20:09:43] I rebooted wdqs1004 and its lag has gone back down. On the other hand, I rebooted wdqs2004 and nothing has changed. I know we talked about maybe some issues with DFW k8s, let me check with service ops
[20:35:12] for wdqs...I'm reasonably certain it's related to the flink side but I dunno what exactly...kubectl shows 50-150 restarts for taskmanagers in codfw, while eqiad is running fine
[20:36:07] ebernhardson interesting, I might try redeploying in that case
[20:37:57] if the logs aren't making it into logstash, that might complicate things
[20:43:14] oh that's interesting, I think we are running out of heap
[20:45:12] inflatador: can we increase the heap size available?
[20:46:50] taking the list of pods from `kubectl get pods`, all three that show restarts have OOMs in their --previous logs
[20:48:45] I suppose I don't know why eqiad is fine and codfw is not, or if these are another symptom of something...
[21:41:10] ebernhardson: having trouble figuring out where we set the Java heap size wrt the actual docker image
[21:42:10] (changing the memory limit of the k8s pod can be done pretty simply in `charts/flink-session-cluster/templates/taskmanager-deployment.yaml` but I want to be sure Java itself is being supplied with enough heap, if that makes sense)
[21:42:26] ebernhardson I wanna say we bumped up the heap size in the past and it didn't help? we've been checking for the heap size in the docker entrypoint (which invokes a long java command) but haven't found it in the docker image or deployment-charts yet
[22:09:26] inflatador: poking around, the flink docs say it adds the JVM arguments itself from configuration. I'd have to guess, maybe task_manager_mem and the related requests/limits?
[22:11:55] last change there looks to have been a decrease, back in Sept 2021. It mentions JVM metrics for old/young gen counts, wonder where those are
[22:14:03] Hi, https://phabricator.wikimedia.org/T335974 has been UBN since Thursday but it doesn't look like anyone from the search team has looked at it yet?
[22:33:49] legoktm: I just took a look over it, unfortunately not able to reproduce from their instructions. Vector 2022 and 2010, same result. Hard to say how important it is if it's not easily reproducible
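
For context on the 16:02 depool: a minimal sketch of taking codfw out of rotation with conftool, assuming the change is made against a DNS discovery record; the `search` dnsdisc name below is a placeholder and is not taken from the log.

```bash
# Hedged sketch: depool codfw for one discovery service with conftool.
# "search" is an assumed dnsdisc name; substitute the real service.
sudo confctl --object-type discovery select 'dnsdisc=search,name=codfw' set/pooled=false

# Confirm that only the codfw record now shows pooled=false
sudo confctl --object-type discovery select 'dnsdisc=search' get
```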
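
The 20:46 OOM check (restart counts plus `--previous` logs) roughly corresponds to the commands below; the `rdf-streaming-updater` namespace and the pod name are assumptions for illustration, not confirmed in the log.

```bash
# List pods and look for high RESTARTS counts on the taskmanagers
kubectl -n rdf-streaming-updater get pods

# Logs from the previous (crashed) container instance, grepping for heap exhaustion
kubectl -n rdf-streaming-updater logs flink-taskmanager-<id> --previous | grep -i OutOfMemoryError

# "Last State" shows OOMKilled when the container hit its k8s memory limit
# rather than the JVM running out of heap internally
kubectl -n rdf-streaming-updater describe pod flink-taskmanager-<id> | grep -A3 'Last State'
```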
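
On the 21:41 question of where the heap is set: Flink 1.10+ computes the JVM -Xms/-Xmx flags itself from its own memory options rather than hard-coding them in the entrypoint, which would explain why no explicit heap flag shows up in the image or the charts. A sketch of how one might inspect that on a running taskmanager follows; the config path, the assumption that the JVM is PID 1 in the container, and whether the chart's `task_manager_mem` value feeds `taskmanager.memory.process.size` are all unverified assumptions.

```bash
# Effective Flink memory settings rendered into the pod
# (/opt/flink/conf is the upstream image default; the WMF image may differ)
kubectl -n rdf-streaming-updater exec flink-taskmanager-<id> -- \
  grep -E 'taskmanager\.memory' /opt/flink/conf/flink-conf.yaml

# JVM heap flags Flink derived from that config, read off the running process
# (assumes the JVM is PID 1 in the container)
kubectl -n rdf-streaming-updater exec flink-taskmanager-<id> -- \
  sh -c "tr '\0' '\n' < /proc/1/cmdline | grep -E '^-Xm[sx]'"
```

The practical point: raising the k8s memory limit alone does nothing for the JVM heap if the Flink memory setting stays the same, while raising the Flink setting past the container limit just trades in-JVM OOMs for OOMKilled restarts, so the two need to move together.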