[00:38:13] so I joined this channel when I saw chrisalbon's tweet, but I also wanted to share my new paper that's coming out at OpenSym with this channel
[00:38:16] https://arxiv.org/abs/2108.10684
[00:38:42] I'll tweet more about it later, but wanted to share it here first
[00:39:37] it's about using ORES quality models in downstream analysis and, more generally, the pitfalls of using multinomial models for ordered outcomes.
[00:41:38] but it's okay, you can use a Bayesian ordinal model to convert your multinomial predictions into a continuous outcome
[00:41:56] you can also recalibrate at this step and get some accuracy for free
[06:25:00] good morning!
[07:21:35] still trying to figure out why the storage initializer doesn't work anymore
[07:25:40] ok I am getting somewhere
[07:25:58] in the secret's endpoint annotation, in theory, the https:// prefix is not needed
[07:26:08] I recall that I added it the first time that I tried, then removed it
[07:26:15] I just added it back and now I see
[07:26:16] botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=goodfaith%2Fenwiki%2F202105140814%2F&encoding-type=url"
[07:26:31] that is of course wrong, but at least I see thanos-swift
[08:34:34] kevinbazira: o/
[08:34:43] elukey o/
[08:35:43] I am wondering one thing - if we had an ORES "build" Docker image to copy data from (for example, we build on it and copy the results into a new image), would it reduce the final image's dependencies?
[08:38:25] for example, I see packages like git, gfortran, g++, etc..
[08:38:45] ideally those should not be in the final images (to trim down things that can be exploited etc..)
[08:40:38] ah yes I see, from the Dockerfiles in the repo
[08:42:19] ummm... if I understand correctly, you're saying we build one ORES image with all dependencies e.g. revscoring, etc.
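[Editor's aside: the paper's idea mentioned at the top of this log - converting multinomial class probabilities for an ordered outcome into a continuous score - can be sketched, in a much-simplified non-Bayesian form, as an expected value over numeric levels assigned to the ordered classes. The labels below are the standard enwiki article-quality classes; the numeric levels and probabilities are illustrative, not from the paper.]

```python
# Simplified sketch: turn ordered-class probabilities into a continuous
# score by taking the expectation over numeric class levels. The paper
# uses a Bayesian ordinal model (with recalibration); this only shows
# the basic intuition. Levels are an illustrative assumption.
ORDERED_LEVELS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

def expected_quality(probs):
    """probs: dict mapping class label -> predicted probability."""
    return sum(ORDERED_LEVELS[label] * p for label, p in probs.items())

score = expected_quality({"Stub": 0.1, "Start": 0.6, "C": 0.3})
# 0.1*0 + 0.6*1 + 0.3*2 = 1.2
```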
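[Editor's aside: the `https://https://` error above comes from prepending a scheme to an annotation value that already carries one. A defensive normalizer - a hypothetical helper for illustration, not the actual kfserving storage-initializer code - might look like:]

```python
def normalize_endpoint(raw, use_https=True):
    """Strip any scheme the annotation already carries, then add one.

    Hypothetical helper illustrating the bug seen above: if the secret's
    endpoint annotation already says "https://thanos-swift...", naively
    prepending a scheme yields "https://https://thanos-swift...".
    """
    for scheme in ("https://", "http://"):
        if raw.startswith(scheme):
            raw = raw[len(scheme):]
    return ("https://" if use_https else "http://") + raw

normalize_endpoint("https://thanos-swift.discovery.wmnet")
# -> "https://thanos-swift.discovery.wmnet", not a doubled scheme
```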
[08:42:20] Then when we need to build an image for the editquality or draftquality model-server, we copy from the ORES image?
[08:45:50] kevinbazira: we could do multiple things - 1) a base ORES image that collects common dependencies (like aspell), so that we don't have to repeat all of them in the Blubber config
[08:46:11] 2) if possible (and this is the part that I am not sure about) avoid dependencies like git/gfortran/g++ etc..
[08:46:45] creating a "build" image that compiles etc.. and from which we copy only the final artifacts/binaries/whatever to the image that we run in prod
[08:47:11] for example, do we need the git repositories on the image, or are those only there to support some initial step?
[08:47:20] like "git clone https://github.com/wikimedia/articlequality.git"
[08:47:45] (this is from the Dockerfile, so not sure if it is still also in the Blubber ones)
[08:49:12] the example that I have in mind is istio/knative/etc..
[08:49:22] we have a build image on which we build Go binaries
[08:49:30] that is huge, because it needs a ton of things
[08:50:01] but the images that we run in production are the result of a multi-stage build - they basically copy the binaries that are needed from the "build" image
[08:50:08] without the need for other deps
[08:50:22] I know that we are dealing with Python etc.. so it is different
[08:52:17] The "git clone" is in the Dockerfile and is yet to be removed, as stated here https://phabricator.wikimedia.org/T289127 - currently the Blubberfile uses https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/revscoring/articlequality/model-server/requirements.txt#L2
[08:54:36] Please share examples of the images you're referring to.
[08:55:04] or rather share the links to their Docker or Blubber files
[08:57:12] ah ok ok, so the articlequality git repo is fine
[08:57:31] IIUC even in the last link that you showed, the repo is cloned, right?
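[Editor's aside: the "build image" idea elukey describes - install toolchains and compile in a fat stage, then copy only the resulting artifacts into the slim runtime image - is what Docker multi-stage builds do. This is a generic sketch; the image names and paths are made up and do not reflect the actual inference-services Dockerfile or Blubber config.]

```dockerfile
# Sketch of a multi-stage build: toolchains (git, g++, gfortran, ...)
# live only in the "build" stage and never reach the runtime image.
# All image names and paths below are illustrative assumptions.
FROM docker-registry.example/python3-build AS build
RUN apt-get update && apt-get install -y git g++ gfortran
COPY requirements.txt .
RUN pip install --prefix=/opt/deps -r requirements.txt

FROM docker-registry.example/python3-slim
# Only the built artifacts are copied over; no compilers or git in prod.
COPY --from=build /opt/deps /usr/local
COPY model-server/ /srv/model-server/
ENTRYPOINT ["python3", "/srv/model-server/model.py"]
```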
[08:57:52] what I am asking is if the repo is needed only to build something and then never used afterwards, or if it is needed at runtime
[08:58:11] because it is big, and if we could remove it or move it to a build image we could trim down the final image size
[08:58:14] this is my point
[08:59:08] the other thing that I noticed is that we pin kfserving 0.3.0 in the image, meanwhile I am testing with 0.6.0
[09:00:31] kevinbazira: read the above points assuming that I know 0 about how revscoring works, so I am really ignorant about how it is all connected :D
[09:01:04] we can hop on Meet if you prefer, so I can explain
[09:01:58] ok let's do a quick meeting
[09:04:06] meet.google.com/tzn-jfea-tzh
[09:05:28] ah I didn't see your room
[09:05:30] joining
[09:32:25] thank you for the suggestions elukey, let's keep exploring them
[09:32:38] the smaller the images the better
[09:32:40] <3 thanks for the chat, I understand things better now
[09:33:46] kevinbazira: one thing that is completely unrelated, but I noticed it while checking Blubber - is there a reason to use kfserving 0.3.0 as a Python dep? (we have kfserving 0.6.0 on the prod cluster atm, this is why I am asking)
[09:36:52] We were using kfserving 0.3.0 in the development sandboxes
[09:37:47] This can be changed ... we shall engage Andy on this.
[09:42:56] I hope it will not be a horror of dependencies
[09:43:33] fingers crossed ...
[09:43:59] we've run into some dependency hell situations before :)
[10:39:51] * elukey lunch!
[13:54:12] another weird one - if I set usehttps to 0, then
[13:54:13] botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=goodfaith%2Fenwiki%2F202105140814%2F&encoding-type=url"
[13:54:27] that is ok, the endpoint url etc... is picked up
[13:55:40] I have the horrible suspicion that the issue is a value "1" vs '1'
[13:56:43] or something similar
[14:04:38] elukey: I'm gonna do the codfw reboots now.
I presume nothing of importance is running there atm?
[14:06:48] yep!
[14:18:08] I may be able to repro with a specific version of boto
[14:23:43] nope
[14:25:01] it seems to be the AWS_DEFAULT_REGION
[14:25:20] ok I am going to change it before going crazy
[14:28:46] namely https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/714773/
[14:34:58] morning all
[14:45:13] hello hello
[15:17:54] elukey: reboots in codfw done. I'll do eqiad tomorrow
[15:20:44] klausman: sure, there is a tutorial IIRC about how to do it gently when a cluster is running pods
[15:21:01] we could try to follow it, maybe thinking about a cookbook for the future
[15:21:22] (not urgent, but we'll have to come up with one, shared with SRE, at some point)
[15:24:33] Yeah, sounds good.
[15:43:03] elukey, the specs RobH sent over look good, as they are the same as our current specs, but I would love it if you and klausman could check https://phabricator.wikimedia.org/T286594
[15:44:08] yep yep I have it in my backlog
[15:44:19] great thanks
[15:55:34] the issue with kubeflow's storage init seems to be fixed!
[15:58:45] yessssss
[16:04:34] Noice.
[16:04:39] What was it, ultimately?
[16:10:44] nice!
[16:11:10] hopefully it was something minor like a new env var or something...
[16:31:42] klausman: the new region - for some reason boto told me that it failed to connect to AWS rather than thanos when using the wrong one (US was ok on buster, on bullseye us-east-1 is needed)
[16:31:49] totally weird debugging
[16:42:50] opened https://github.com/kubeflow/kfserving/pull/1780 to support AWS_DEFAULT_REGION (vs AWS_REGION, which doesn't work)
[16:42:57] will open another one for AWS_CA_BUNDLE
[16:46:34] I mean... I sorta get it
[16:46:53] "US" doesn't specify which US DC, so I figure they now allow for more specific selection?
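[Editor's aside: the fix above boils down to environment-variable resolution for the S3 region. A hedged sketch of the lookup order - preferring AWS_DEFAULT_REGION, which the PR adds support for, over AWS_REGION - follows; the function and the "us-east-1" fallback are illustrative assumptions, mirroring the value the bullseye swift stack required instead of "US".]

```python
import os

def resolve_region(env=None):
    # Hypothetical sketch of what the storage initializer needs to do:
    # prefer AWS_DEFAULT_REGION, fall back to AWS_REGION, and default to
    # "us-east-1" (an assumption; this is the region the upgraded
    # thanos-swift endpoint wanted, where "US" used to work).
    env = os.environ if env is None else env
    return env.get("AWS_DEFAULT_REGION") or env.get("AWS_REGION") or "us-east-1"

resolve_region({"AWS_REGION": "US"})  # legacy value is still honoured
resolve_region({})                    # falls back to "us-east-1"
```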
[16:47:09] Still not an exactly helpful error mode
[16:55:13] so US worked with the swift version on Buster, but not with the one on Bullseye, which wants us-east-1
[16:55:25] (Filippo upgraded the Thanos/swift hosts recently)
[16:55:35] I suspect Swift used to infer "us-east-1" from "US", but doesn't anymore
[16:55:48] and other services had to do the same, like Tegola https://phabricator.wikimedia.org/T289076
[16:56:03] ... or an API on AWS's side changed
[16:56:04] but the error msg from boto is complete madness
[16:57:03] full pull request for the two changes is in https://github.com/kubeflow/kfserving/pull/1780/files
[16:57:17] in theory, if this gets merged we could get rid of the two ENV vars on our docker image
[16:57:20] in theory
[16:58:03] Yeah, that would be nice. But I'm not holding my breath :)
[18:06:20] * elukey afk! o/
[19:22:42] hey congrats majavah
[19:22:44] https://phabricator.wikimedia.org/T289329
[20:39:49] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Add inference-services CI pipelines to the Zuul gate-and-submit - https://phabricator.wikimedia.org/T289562 (10mmodell) This change has been deployed to zuul.
[23:22:02] 10Lift-Wing, 10artificial-intelligence, 10draftquality-modeling, 10Machine-Learning-Team (Active Tasks): Configure draftquality deployment pipeline - https://phabricator.wikimedia.org/T287787 (10ACraze) Looks like the pipeline is working well now: https://integration.wikimedia.org/ci/job/inference-services...
[23:22:12] 10Lift-Wing, 10artificial-intelligence, 10draftquality-modeling, 10Machine-Learning-Team (Active Tasks): Configure draftquality deployment pipeline - https://phabricator.wikimedia.org/T287787 (10ACraze) 05Open→03Resolved
[23:22:14] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[23:26:42] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[23:26:54] 10Lift-Wing, 10drafttopic-modeling, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Configure revscoring topic deployment pipeline - https://phabricator.wikimedia.org/T287788 (10ACraze) 05Open→03Resolved Nice one @kevinbazira! Looks like the pipeline is working well now: https://integration...
[23:48:55] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (10ACraze)