[08:43:32] today I tried to mess with blubber and docker, ending up in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/715918
[08:43:54] the docker image that I tested (that seems to work) weighs 1.12G, vs the 1.56G of the current one
[08:44:58] I have the feeling that we can trim more, but this seems to be a good start
[08:46:40] lemme know your thoughts :)
[08:48:04] ah lovely, jenkins fails
[08:48:18] the same error doesn't pop up locally
[08:50:05] ahhh maybe the test variant
[08:51:07] yes yes, I forgot about it
[09:04:37] ok, it should work now, jenkins is doing its thing
[09:08:23] mmmm not really, I get some error for the production variant now
[09:08:24] lovely
[09:09:31] I see why, the RUN statement for nltk runs before the COPY of the site-packages
[09:14:34] sent another change, let's see if jenkins is pleased
[09:18:43] yep, better now
[09:37:25] * elukey bbiab
[10:08:32] this is the current size of the new prod image
[10:08:33] somebody@6d4be97e87f8:~$ du -hs *
[10:08:33] 123M nltk_data
[10:08:41] somebody@6d4be97e87f8:/opt/lib/python$ du -hs *
[10:08:41] 651M site-packages
[10:09:08] and that gets us up to ~800MB, then we have ~230MB under /usr (libs etc..) and the rest are little bits of the OS
[10:10:23] more trimming may be difficult, the major win could be in dropping some python deps
[10:42:11] * elukey lunch!
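A minimal sketch of how that size breakdown can be reproduced against the built image; the image name below is a placeholder (the real one is in the Gerrit change above), and it assumes a Debian-based image with /bin/sh available:

  # 'blubber-test:latest' is a placeholder image name, not the real registry path
  docker images                        # overall size of the built image
  docker history blubber-test:latest   # size contributed by each RUN/COPY layer
  # re-run the du breakdown from inside the image (paths taken from the log above)
  docker run --rm --entrypoint /bin/sh blubber-test:latest \
      -c 'du -hs /opt/lib/python/site-packages /usr "$HOME"/nltk_data'

docker history in particular makes it easy to see which layer contributes most of the 1.12G.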
[13:32:20] we had almost a complete outage for ORES :D
[13:32:52] this change https://gerrit.wikimedia.org/r/c/operations/puppet/+/714640 had the side effect of knocking down almost all of our celery workers
[13:33:03] elukey: that needs an IR
[13:33:32] RhinosF1: it was a human mistake, we'll follow up
[13:49:36] RhinosF1: I followed up with Michael, with some links etc.. I think it is better than a full IR since we know the root cause. Lemme know if that is ok with you
[13:50:15] elukey: as long as he's aware of how to check for usage of puppet code
[13:50:53] RhinosF1: yeah, I think it comes with time, we have some fences to prevent mistakes but it still takes no time to cause an issue with an oversight
[13:51:37] elukey: yeah, it's a very simple oversight if you're not aware, but the impact can easily get very big
[13:53:03] agreed, yes
[13:53:07] thanks for following up
[13:54:13] No problem
[15:53:39] elukey: call me ignorant, but what is the specific problem with that change?
[15:54:09] Is it really just the order of --app foo worker --loglevel bar?
[15:55:23] klausman: I didn't have any idea about it until I saw alerts from icinga, and the celery logs were showing the classic output you get when you start a daemon with incorrect parameters.
[15:55:42] Huh. Well, TIL
[15:55:44] (basically puppet knocked the celery workers down one by one while applying the change)
[15:55:51] seems that stretch's version doesn't support it
[15:55:57] I mean, I know other programs where flags and args can't be ordered arbitrarily
[15:56:17] no idea on the specifics, but there was a mention of stretch in the commit msg
[15:56:22] (not supporting it)
[15:56:57] the merge was a little rushed, that celery class is shared among multiple systems and more research should have been done (mistakes happen, I followed up with the author)
[15:58:11] * klausman nods
[16:37:58] good to know, thanks Luca
[16:38:40] I assume it didn't actually go down? Since I didn't receive a blaring alert on my phone
[16:38:58] Otherwise I have to change my own alerting setup
[16:41:58] chrisalbon: I stopped puppet on some nodes that kept working, but the number of errored scores increased a lot for a brief moment
[16:42:45] Hmmm. Okay cool. Thanks for saving it. I need to change my alerts on my phone then
[16:43:21] https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1630507818289&to=1630510094031
[16:43:30] I think we may need to add more alers
[16:43:33] *alerts
[16:44:27] yeah
[16:51:26] elukey: nice catch on that. celery seems to always be a bit precarious w/ ores
[18:03:47] * elukey afk
[21:45:54] it would be nice if someone could add info about the ML k8s cluster(s) to https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters
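On klausman's --app ordering question from earlier: a hedged sketch of the two invocation styles being compared, reusing the placeholder names (foo, bar) from the conversation; the exact behavior of the celery version shipped on stretch wasn't confirmed in-channel, so treat this as illustrative only:

  # pre-change ordering, with options after the worker sub-command
  # (presumably what stretch's celery was happy with):
  celery worker --app foo --loglevel bar
  # reordered form from the puppet change, with the global --app option
  # before the sub-command, which stretch's celery reportedly rejected:
  celery --app foo worker --loglevel bar

Newer celery releases lean toward the second style (global options before the sub-command), which would explain why the reordering looked harmless in review.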