[07:21:58] going to depool from wdqs2009 to wdqs2012, they're apparently serving user traffic but they don't have the data loaded
[07:22:18] T322869#8559767
[07:22:18] T322869: Fewer results from wdqs nodes running in codfw than eqiad - https://phabricator.wikimedia.org/T322869
[07:30:19] ryankemper, inflatador: I've run "sudo depool" on these 4 hosts ^, not sure if it's the right command to permanently depool them until the data is loaded. (there might be something that has pooled them unexpectedly)
[07:51:20] dcausse: ack I went ahead and marked them `inactive`
[07:51:33] yeah we'll want to look into what pooled them, maybe the cookbook although I'd find that a little surprising
[07:52:03] ryankemper: thanks!
[07:52:23] sorry for bothering you that late!
[07:53:56] no worries!
[09:11:50] Need to pick up the car from the shop tomorrow, won’t be around for retro
[11:54:10] lunch
[12:21:10] lunch
[14:22:30] Interesting re: depooling, we'll need to figure out what's happening as I'm pretty sure this isn't the first time
[14:26:26] o/
[14:31:26] I wonder if we couldn't take advantage of the data_loaded flag to at least have some alerting
[14:31:34] but yes, understanding how it happened would be great
[16:02:44] inflatador: https://meet.google.com/eki-rafx-cxi retrospective time!
[16:20:09] dcausse: would you be willing to try the pyflink-installed flink? :) i think it will be fine, but if it's not we can revert and build two different images?
[16:26:52] ottomata:
[16:27:32] ottomata: I'll have to download flink again to have the plugin I need, so not sure there's much value in reusing the pyflink image
[16:51:42] dcausse: really?
[16:52:04] ottomata: yes, the jars I need are not in pyflink :(
[16:52:15] but there are other jars you are going to have to download too
[16:52:22] e.g. we are not including the kafka connector or client
[16:53:00] i'd assume we'd download from archiva or maven in the blubber pipeline?
[16:53:07] or make them non-provided application deps in your pom?
[16:53:13] https://mvnrepository.com/artifact/org.apache.flink/flink-state-processor-api/1.16.0
[16:54:41] https://mvnrepository.com/artifact/org.apache.flink/flink-s3-fs-hadoop/1.16.0
[16:54:58] or is it https://mvnrepository.com/artifact/org.apache.flink/flink-s3-fs-presto/1.16.0
[16:55:51] if there are plugins / other jars we decide we want in the base image, we can install them in the base production image dockerfile. but i was trying to avoid adding dependencies that won't be used by all apps (take that with a grain of salt since i'm advocating for including python deps :p )
[16:56:12] inflatador: is 1009 the only wdqs instance that failed reload, or are there more?
[16:56:31] that way the app developer has control over which optional dependencies, and which versions of those, they use
[16:56:32] inflatador: i poked at the SMART disk stats and am suspicious of /dev/sdb
[16:56:52] was looking because the error is a checksum failure on disk reads
[16:56:53] ebernhardson 1009 failed due to corruption, 2009 and 2010 got into that weird oom state I mentioned
[16:57:08] inflatador: do you have dates on 2009 and 2010?
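For reference, one quick way to confirm or rule out a system-level OOM on those hosts is to look for kernel OOM-killer activity around the incident window. A rough sketch only; the `--since` date is a placeholder for the actual window, and `dmesg` only helps if the host has not rebooted since:

```
# Look for kernel OOM-killer activity around the incident (placeholder date).
# journalctl -k needs journald to still retain kernel logs that far back.
sudo journalctl -k --since "2023-01-24" | grep -iE 'out of memory|oom-kill'
# dmesg only covers the time since the last boot.
sudo dmesg -T | grep -iE 'out of memory|oom'
```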
[16:57:24] let me grab them from the dashboard; basically the logging disappears
[16:57:33] looks a lot like oom but no direct evidence
[16:57:44] inflatador: for oom we should be able to verify from the cluster overview dashboard
[16:57:51] well, if it's a system-level oom and not a jvm oom
[16:57:58] but it sounded like you were talking about system level
[16:58:22] yes, system level
[16:58:57] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wdqs2009&var-datasource=thanos&var-cluster=wdqs&from=now-30d&to=now you can see the queue climbing here
[16:59:42] hmm, with cached memory at a significant % that shouldn't be a system-level oom
[17:00:20] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wdqs2010&var-datasource=thanos&var-cluster=wdqs&from=now-2d&to=now wdqs2010
[17:00:31] also note that I made a 128GB swapfile for wdqs2010 and it didn't help
[17:00:32] ottomata: it might be possible to fetch them individually but that does not seem very practical; I'd end up with an image containing python that I don't need and would still have to fetch individual dependencies. I might just download flink from scratch, that seems easier?
[17:01:13] so it might not be oom. Maybe a fork bomb or something else that makes it unresponsive
[17:01:35] could be hanging due to NFS, even
[17:02:08] seems like it's blocked on io to me
[17:04:09] the TCP errors graph shows something as well
[17:05:10] temperature drops so the cpu is just idle...
[17:05:35] hmm, yeah it seems suspicious that the cpu queue length steadily increases until all the stats disappear
[17:05:41] i've never seen that metric though, not really sure what it is
[17:06:09] dcausse: there are some things in the image that we want to set for all uses
[17:06:15] ECS logging and log4j configs for example.
[17:06:15] "a saturation measure which becomes non-zero in the event of a CPU overload"
[17:06:31] the question i guess then is what got stuck, IO could be a thing
[17:06:46] dcausse: even without the pyflink image idea, i had considered removing everything in opt/
[17:06:54] :(
[17:06:59] opt/ is just provided as a convenience, you still have to install things from opt into plugins/
[17:07:07] there's a bunch of stuff in there we'll never use
[17:07:23] what about 2 images? one for java jobs and one for python?
[17:07:26] possible.
[17:07:34] i'm trying to convince you we don't need 2, but maybe we do :)
[17:07:46] but are you sure you'll need to download them individually?
[17:07:53] since you are building a java app anyway
[17:08:03] why not just specify those as deps in your pom, so they are included in your fat jar?
[17:08:10] just like any other dep you have?
[17:08:18] this won't work
[17:08:24] oh, no?
[17:08:37] cuz classloader stuff?
[17:08:37] I think plugins/ has to be there
[17:08:42] yes I think so
[17:08:59] s3 is required for flink H/A, not only for my job
[17:09:05] if we need to, the base image could install what we want at wmf into opt/
[17:09:25] i'd prefer not to, because there will probably always be other things that need to be downloaded anyway?
[17:09:52] s3 is required for flink H/A now, but some other app might choose to do HA in another way?
[17:10:04] but yes, perhaps we should just include the s3 stuff for HA in the image for that reason.
[17:10:16] we could do the kafka connector too, but i'm reluctant to do the kafka client
[17:10:23] i suppose the kafka client could be in your fat jar?
[17:10:32] yes kafka is part of the jar
[17:10:35] okay.
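To make the "install from opt into plugins/" step concrete: Flink only loads filesystem plugins from their own subdirectory under the plugins/ directory, so whichever image or blubber build is used would need something along these lines. A sketch only — the `$FLINK_HOME` layout and the 1.16.0 jar version are placeholders:

```
# Sketch: promote the S3 (presto) filesystem jar from opt/ into its own plugin
# directory so Flink's plugin classloader picks it up at startup.
mkdir -p "$FLINK_HOME/plugins/s3-fs-presto"
cp "$FLINK_HOME/opt/flink-s3-fs-presto-1.16.0.jar" "$FLINK_HOME/plugins/s3-fs-presto/"
```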
[17:11:03] okay, let's rephrase my question then: would you be willing to try the pip-installed image if we also include the basic flink opt jars?
[17:11:17] you'd still have to install them into plugins/ in your blubber config though?
[17:11:57] ottomata: deal :)
[17:12:09] haha okay, note the 'try', we can always revert and make two images later
[17:12:21] is it just those two from opt you need?
[17:13:09] that's only s3 actually (https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater/-/merge_requests/1/diffs#9ab3c1596bde96ebebbc1f7b2251e2e26376dd7e)
[17:13:20] state-processor can be part of the fat jar
[17:15:03] okay, and the presto one, got it.
[17:16:36] yes, the hadoop one cannot be used for checkpointing (only for sinks IIRC)
[17:17:34] okay. kafka connector?
[17:17:46] or can that be in the fat jar?
[17:18:15] inflatador: i suppose we think 2010 is currently in the stuck state. Is it also locked out of the sre server management console or just ssh?
[17:18:47] ebernhardson just SSH, you can get into the mgmt console but cannot type on the serial getty
[17:19:13] hmm
[17:19:14] ottomata: kafka should be part of the fat jar unless Gabriele needs it
[17:19:16] ebernhardson correction: I have not checked that yet on 2010, but that's what it was like last time on 2009
[17:19:56] I will try it once I'm out of the meeting
[17:20:49] kk
[17:24:09] k
[17:24:30] gabriele will need it but we'll use blubber or setuptools to get it
[17:24:53] ottomata: btw how do you build the image? for me it complains that openjdk-11-jre is not there
[17:25:34] whatcha doin?
[17:25:37] tried: docker-pkg -c config.yaml update flink --version 1.16-dcausse --reason "hop" ./images/flink/flink
[17:26:02] RuntimeError: Image openjdk-11-jre (dependency of docker-registry.wikimedia.org/flink:1.16.0-wmf4) not found
[17:26:08] oh wait
[17:26:20] cd to the repo root
[17:26:33] docker-pkg --info -c ./config.yaml build --use-cache images
[17:26:53] you might have to change the image version in the changelog if it is built locally, or is in the wmf docker registry already
[17:27:06] i haven't found a way to tell docker-pkg to build only one image
[17:27:09] it'll build all the images?
[17:27:11] oh ok
[17:27:15] it will only build what has changed
[17:27:23] it won't build things it has in the remote repo
[17:27:27] or has cached locally
[17:27:33] ah
[17:28:24] thanks! it's doing things now
[17:29:59] how can I check why a yarn container died?
[17:30:28] Exit code from container container_e50_1663082229270_775551_01_000006 is : 143
[17:31:08] is there a way to confirm that it has been oom-killed or something like that?
[17:31:34] dcausse: have to check the nodemanager logs usually, one sec, i think i added docs on that years ago
[17:32:09] dcausse: https://wikitech.wikimedia.org/wiki/Discovery/Analytics#Yarn_manager_logs
[17:32:28] thanks! looking
[17:32:43] dcausse: essentially the nodemanager logs are available over http, fetch them from the server that was running the container that was killed and look for your application id
[17:33:31] you may have to poke the spark logs a bit to figure out where your container was running
[17:35:29] this is flink but it only complains that one of its workers has disappeared
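A sketch of what that nodemanager-log lookup can look like, assuming the log on the worker is still within its rotation window; the host URL and container id are the ones shared later in this conversation:

```
# Fetch the nodemanager log from the worker that hosted the container and grep
# for the container id to see why it was stopped (with a little surrounding context).
curl -s http://an-worker1114.eqiad.wmnet:8042/logs/yarn-yarn-nodemanager-an-worker1114.log \
  | grep -A2 container_e50_1663082229270_775551_01_000006
```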
[17:37:31] sigh... it does not tell much
[17:37:35] 2023-01-26 14:41:50,021 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_e50_1663082229270_775551_01_000006
[17:37:37] 2023-01-26 14:41:50,021 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=analytics-search IP=10.64.138.3 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1663082229270_775551 CONTAINERID=container_e50_1663082229270_775551_01_000006
[17:38:16] Container container_e50_1663082229270_775551_01_000006 transitioned from RUNNING to KILLING
[17:38:31] does that mean it ended gracefully?
[17:39:01] yes, looking at the yarn logs that container was asked to exit and did so; still have to figure out why
[17:41:22] dcausse: it looks like before it quit it restarted a couple of times
[17:41:30] oh?
[17:41:52] do you see that here http://an-worker1114.eqiad.wmnet:8042/logs/yarn-yarn-nodemanager-an-worker1114.log ?
[17:42:10] dcausse: in `yarn logs -applicationId application_1663082229270_775551`
[17:43:21] maybe i'm not interpreting it right, but there are 'switched from RUNNING to CANCELING' messages at 14:36, then 14:39, then 14:41, then 14:43
[17:43:35] and between those it seems to restart
[17:43:39] I cannot even find the logs from this container in the app logs
[17:43:59] searching for "^Container: container_e50_1663082229270_775551_01_000006"
[17:44:22] dcausse: hmm, i get those logs as one of the first entries in the yarn logs output
[17:44:37] ah, the flink job restarted multiple times indeed
[17:45:07] but for this container I only see "The heartbeat of TaskManager with id container_e50_1663082229270_775551_01_000006(an-worker1114.eqiad.wmnet:8041) timed out."
[17:45:23] and then it requests: "Stopping worker container_e50_1663082229270_775551_01_000006(an-worker1114.eqiad.wmnet:8041)."
[17:45:55] hm.. might just be that the jvm is so completely stuck that the heartbeat is not getting through...
[17:46:59] and then flink is requesting yarn to kill the container, and that part is working ok
[17:47:07] oh, i'm totally not looking at this right. Indeed i only get logs for the jobmanager and am not seeing the individual container logs
[17:47:31] ok = SIGTERM stops the taskmanager jvm app, which then returns errcode 143
[17:48:11] I'll have to take a few stack dumps just before it gets stuck I guess
[17:48:28] indeed, it looks like the jobmanager starts an instance, isn't able to talk to it after it gets running, and then asks yarn to kill it
[17:49:02] I wish the logs of the container would not be lost like that
[17:49:19] why doesn't it remain attached to the yarn app?
[17:49:21] it seems odd, usually yarn will at least say it has a blank stdout log file
[17:53:48] but indeed, /var/log/hadoop-yarn/apps/analytics-search/logs/application_1663082229270_775551/ in hdfs is empty :S
[18:01:05] lunch, back in ~1h
[18:04:06] sigh, can't make any sense of these yarn logs... will continue debugging this tomorrow...
[18:04:15] dinner
[18:07:48] poked a bit more as well, but i can't explain why there are no logs for individual containers :(
[18:16:50] inflatador: for when you're back, i suspect NFS is our culprit there. I don't have significant proof, but there is a long history of nfs locking up linux instances when there are communication problems, and we see both high io-wait and oddities in the tcp metrics of 2010. I don't know why that would be the case though, we shouldn't be doing anything on nfs after the munge is complete.
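One way to test the NFS theory the next time a host wedges, assuming a shell is still reachable: processes stuck in uninterruptible sleep ("D" state) on an nfs/rpc wait channel are the classic symptom of a hung mount. A rough sketch:

```
# Tasks stuck in uninterruptible sleep, with their kernel wait channel;
# nfs/rpc wait channels point at a hung NFS mount.
ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'
# Confirm which NFS mounts exist on the host.
grep nfs /proc/mounts
```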
[19:02:26] back
[19:03:24] ebernhardson agreed on both counts, I think we'll have to try rsync in the long term
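A rough sketch of what the rsync approach could look like: pull the munged dump onto local disk on the wdqs host ahead of the reload so the import never reads from NFS. The source host and both paths below are hypothetical placeholders:

```
# Hypothetical: copy the munged dump to local disk before the reload so the
# import never touches NFS. Host and paths are placeholders only.
rsync -av --progress \
  dumps-source.example.wmnet:/srv/munged/wikidata/ \
  /srv/wdqs/munged/
```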