[06:44:08] hello hello
[06:44:27] I am checking https://github.com/mediawiki-utilities/python-mwapi but from a preliminary check I don't find how to use internal endpoints
[07:16:31] from https://github.com/mediawiki-utilities/python-mwapi/blob/master/mwapi/session.py#L95 I don't think that the library uses any internal endpoint
[07:16:44] so ORES is hitting the external LVS endpoints
[07:54:27] kevinbazira, accraze - I found this nice (but a little old) article about python + multi-stage builds - https://pythonspeed.com/articles/multi-stage-docker-python/
[07:56:33] Thank you for sharing elukey 🙏
[07:58:42] the venv solution may be doable and easy-ish
[08:12:23] kevinbazira: I am starting to understand a little better the blubber files, and IIUC we are already using multi-stage right?
[08:12:36] we just "include" build into production
[08:15:04] ah no ok from https://wikitech.wikimedia.org/wiki/Blubber/Tutorial/HelloNode#Copies_for_local_files_and_artifacts it seems that "copies" triggers the multi-stage
[08:15:57] so maybe we can follow up with Releng to ask if they have similar python cases
[08:17:15] yes elukey, blubber files use multi-stage
[08:18:17] in our case, we could have a build stage that includes g++ gfortran python3-dev etc.., and that basically does pip install
[08:18:49] and then "production" could simply use "copies" to get the python venv or similar, installing the aspell/etc.. packages as well
[08:18:59] (that IIUC are not needed when doing pip install)
[08:20:22] yep, that's exactly what we are doing. the build stage handles those packages, then the production and test images "copy"
[08:21:13] kevinbazira: IIUC we include all the build image, we don't simply copy
[08:21:54] what's the difference? please elaborate.
[08:22:07] sure
[08:22:56] so IIUC, the resulting docker file (for the "production" variant) will include all that was done in the "build" variant, namely all its layers
[08:24:07] as if it were a single Dockerfile, basically (this is my understanding)
[08:24:25] check something like https://gist.githubusercontent.com/Akashdesarda/0200f5b83de6a2a00869a508474fff2f/raw/2beade6c498f019582d4b24db139f90c7e7eaa55/Dockerfile_multi_stage (that may not be the best way to do it but it is the first example that I found)
[08:25:14] in one "build" image you just pip install etc.. (in our case, including gfortran g++ that are only needed by pip when installing, not when we run the final model.py)
[08:25:55] then what we call "production" could simply completely skip the dev packages like gfortran, g++, etc.. and copy only the python files needed
[08:26:09] the example above is a little "raw", we could use a venv instead
[08:27:42] kevinbazira: not sure if --^ is more clear or not, lemme know (I am also brainstorming, not really sure if --^ is 100% doable)
[08:27:54] Thanks for sharing the example link. My understanding was that blubber variants did a similar thing: https://wikitech.wikimedia.org/wiki/Blubber/User_Guide#Variants
[08:29:41] We shall have to dig further and see how best this can be achieved using blubber, in case it needs to be done differently from what we are doing with the blubber variants, e.g.:
[08:29:50] kevinbazira: yep but check the diff between "includes" vs "copies"
[08:37:04] my understanding is that "includes" inherits a variant whereas "copies" copies files from a variant. is this correct?
[08:37:51] elukey: --^
[08:47:11] kevinbazira: yes it is my understanding as well
[08:48:53] Great!
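A minimal Blubber sketch of the "copies"-based split being discussed, assuming Blubber v4 syntax; the variant layout, paths, and package lists are illustrative guesses, not the actual articlequality blubber.yaml:

```yaml
version: v4
base: docker-registry.wikimedia.org/buster:latest  # assumed base image
variants:
  build:
    apt:
      # toolchain only needed while pip builds wheels
      packages: [g++, gfortran, python3-dev]
    python:
      version: python3
      requirements: [requirements.txt]
  production:
    apt:
      # runtime-only packages (aspell dictionaries etc..), no compilers
      packages: [aspell-en]
    copies:
      # pull only the installed Python artifacts out of "build",
      # instead of inheriting all of its layers via "includes"
      - from: build
        source: /opt/lib/python
        destination: /opt/lib/python
```

The point of the split is that the production variant never sees g++/gfortran at all: the final image only pays for the runtime apt packages plus the copied Python artifacts.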
maybe let me try to explain how this particular image works and then we can see how best to optimize it.
[08:49:15] On the line below, the "production" variant runs revscoring.
[08:49:15] https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/.pipeline/articlequality/blubber.yaml#L75
[08:49:53] revscoring requires apt packages + python requirements that are built in the "build" variant.
[08:49:53] https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/.pipeline/articlequality/blubber.yaml#L10
[08:50:17] I believe the reason above is why we are using "includes"
[08:51:45] kevinbazira: one thing that we could do, as an initial refactoring step, could be to move the apt installs of aspell etc.. to "production"
[08:52:01] (basically starting to move in there only the packages needed after the pip install)
[08:57:38] \o Buon giorno!
[08:57:53] elukey: I presume you meant https://wikitech.wikimedia.org/wiki/Kubernetes#Rebooting_a_worker_node <- this yesterday?
[08:59:17] buon giorno klausman o/
[08:59:18] klausman: good morning! Exactly yes
[09:03:07] Ok, I will do that manually. My main question about cookbooking that would be how it would map the to-be-rebooted node to the k8s instance it needs to talk to.
[09:03:22] I presume such mappings already exist in other domains
[09:04:08] elukey thanks for the suggestion. I am wondering how moving the apt packages from the "build" to the "production" variant would reduce the size of the production image, as the "production" variant is already inheriting from the "build" variant.
[09:04:08] I will try it and report back.
[09:04:44] kevinbazira: ah yes moving apt packages for the moment wouldn't change size, it is just a starting point for refactoring
[09:05:12] what I hope is that if we find a way to simply "copy", it will be easier if the blubber file is refactored
[09:05:19] (lemme know if it makes sense)
[09:10:12] elukey: doing reboots for etcd and control plane now. That should be entirely transparent
[09:10:27] +1
[09:10:38] elukey: yes, it makes sense.
[09:28:55] Starting to drain&reboot cluster nodes in eqiad
[09:49:28] `error when evicting pod "cluster-local-gateway-585f96dccc-54kqm" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.`
[09:49:47] elukey: it seems that a node running the cluster-gw in the current config can never be drained
[09:50:31] klausman: interesting!
[09:50:50] where do we define the replica count for the local cluster gw?
[09:51:39] in theory it is in the istioctl config, but for the moment there is nothing set
[09:51:55] I wonder if I can set it to 2 temporarily
[09:52:53] we can do it even permanently, seems to be a good thing
[09:53:08] so I usually check the helm manifests in the istio upstream repo
[09:53:16] to find the right key to add to the istioctl manifest
[09:53:42] https://github.com/istio/istio/blob/master/manifests/charts/gateways/istio-ingress/templates/deployment.yaml#L16
[09:53:59] so in the specs, we could add gateway.replicaCount: 2
[09:54:49] but I think that the ones in deployment-charts are not up to date, I used another one to allow port 80 and forgot to update
[09:54:53] so lemme look for it
[09:59:09] It'd be set in custom_deploy.d/istio/ml-serve/config.yaml right?
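As a reference for what that knob can look like, a hedged IstioOperator-style sketch with the gateway replica counts bumped to 2; whether the ml-serve config.yaml follows exactly this layout is an assumption:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          replicaCount: 2   # two replicas so one can always be evicted
      - name: cluster-local-gateway
        enabled: true
        k8s:
          replicaCount: 2
```

With two replicas, the PodDisruptionBudget can keep one gateway pod available while the other is evicted during a node drain, which is exactly what the error above was blocking.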
[09:59:34] exactly yes but that file is not up to date yet
[09:59:41] ah :)
[10:00:02] istio-system cluster-local-gateway-585f96dccc-54kqm 1/1 Running 1 33d
[10:00:05] istio-system cluster-local-gateway-585f96dccc-tnmjb 1/1 Running 0 30s
[10:00:08] istio-system istio-ingressgateway-7ffffd874b-5fg5g 1/1 Running 0 30s
[10:00:11] istio-system istio-ingressgateway-7ffffd874b-67zkp 1/1 Running 1 33d
[10:00:14] ta daaan
[10:00:15] I'll update the specs in a sec
[10:00:20] klausman: can you check if you can drain now?
[10:01:00] yes!
[10:01:29] Thanks <3 Do send the code review my way ;-P
[10:02:29] I think it is good to do it also for the pilot
[10:02:40] Ack
[10:03:31] lemme know when the host is up and I'll do it
[10:04:02] will do
[10:07:14] elukey: machine is back but still drained. Should I undrain before you proceed?
[10:08:00] yeah let's do it
[10:08:25] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714977 this is the diff
[10:08:55] Machine is undrained
[10:09:18] +1'd
[10:10:02] istio-system istiod-68d4cb6c9-lw2ls 1/1 Running 0 11s
[10:10:05] istio-system istiod-68d4cb6c9-mnl6n 1/1 Running 0 13d
[10:10:08] all good!
[10:10:44] will merge later on, you are free to proceed!
[10:10:47] Ok, will proceed with the rest of drain-reboots
[10:11:00] now I am wondering if knative/kubeflow pods will lead to the same errors
[10:11:09] We'll see™
[10:11:45] but it is very nice to do these drills when we are not live :D
[10:11:46] error when evicting pod "activator-64b9b4f644-5b7vv" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
[10:11:52] good :)
[10:13:07] past Luca added templated vars to the knative chart
[10:13:08] replicas: {{ .Values.core.autoscaler.replicaCount }}
[10:13:19] I like past Luca
[10:13:23] (also, current)
[10:14:28] but not to all pods of course, bad Luca
[10:14:29] ahhahaha
[10:14:39] No saint is without sin ;)
[10:15:36] ah yeah for the activator we have max/min replica count afaics
[10:17:01] So the min used to be 1, I presume?
[10:17:30] yes yes, I see that there is also a specific mention of the podDisruptionBudget etc..
[10:17:36] but not all is templated yet I think
[10:19:15] Yeah, I think the default for the pdb might not be ideal in all cases
[10:20:17] we also have HorizontalPodAutoscaler entries
[10:20:36] so basically we can either use HorizontalPodAutoscaler or spec: replicas: #n
[10:20:54] there is a mixture of those in the yaml config provided by upstream
[10:21:00] I have templated what was already there
[10:21:02] Of course
[10:21:07] (re: mixture)
[10:21:18] ok if we continue after lunch??
[10:21:31] (seems a not-so-easy code change)
[10:21:39] Sure, I have a growling in my belly as well :)
[10:22:10] ack thanks :D ttl
[12:40:42] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/715006 - should do the trick :)
[12:41:23] on it
[12:41:56] +1'd
[12:42:44] the main question that I have is how the replicas are getting traffic and how they sync between them
[12:42:58] probably they get requests via a service ip
[12:43:33] should be fine in theory
[12:47:29] Yes, that is how I understand it. Not sure if there is any relevant state for them to sync
[12:53:55] ack let's try
[12:57:49] klausman: pods doubled as expected!
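A sketch of the sort of values that change 715006 points at. Only core.autoscaler.replicaCount is quoted in the chat above; the activator keys are assumptions about how the chart might template its HorizontalPodAutoscaler bounds:

```yaml
core:
  autoscaler:
    replicaCount: 2   # feeds the templated "replicas: {{ .Values.core.autoscaler.replicaCount }}"
  activator:
    minReplicas: 2    # assumed HPA lower bound, so a drain never evicts the last activator pod
    maxReplicas: 5    # assumed HPA upper bound
```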
[12:58:02] Let's see if I can now drain the node
[12:58:04] I assume that we may end up in errors for kubeflow's controller
[12:58:23] drain indeed proceeds
[12:59:07] As for the normal way of doing this in prod: maybe there is a way for temporarily bumping replicas up (or adding a standby?) and then draining the existing one, but over the long term running only one?
[12:59:56] rebooting 1003
[13:01:03] no idea, we'll have to explore
[13:01:38] I have to say that having a static 2 for every pod that may be a SPOF would be nice
[13:09:04] Yeah, that is also true
[13:09:11] Now draining the last machine (1004)
[13:09:42] one thing that I'd like to understand now is why a prediction takes 28s to complete
[13:09:54] ORES legacy behavior ;)
[13:10:14] there is probably some bottleneck, it would be great to explore metrics/debugging-tools/etc..
[13:11:08] It looks like enwiki-goodfaith-predictor-default-xbzss-deployment-f96885gj9sj is taking a while to evict ;)
[13:13:25] the other is already up, weird
[13:14:17] I dunno what specifically kubectl et al were waiting for. Just termination?
[13:14:57] ah, it completed
[13:15:31] Maybe the startup is slow, so k8s waits until healthz reads OK
[13:16:08] could be yes
[13:22:53] I am also wondering about cpu/memory constraints that we have for the inference service pods
[13:22:55] aand we're done.
[13:23:20] That took longer than expected :)
[13:23:28] But then again, good we spotted the replica issue.
[13:24:05] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details is very nice
[13:24:27] klausman: yes yes it is good that we discover these things, the first few times it will take a bit
[13:25:30] ok so the pods don't seem to suffer from cpu/memory limits
[13:26:09] Yes, they seem to be far away from saturation
[13:31:48] we may need to direct logs to logstash, I see only syslog
[13:31:57] nothing really from the pods
[13:50:45] Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (elukey) p:Triage→High
[13:54:59] kevinbazira: o/ how much time does it take to get a score for enwiki goodfaith on our testing setup on AWS?
[13:55:21] in seconds I mean, how much time it takes for curl to return a result
[13:57:05] I found something in the kfserving-container logs
[13:57:06] [I 210826 13:16:15 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 31755.71ms
[13:57:09] [I 210826 13:52:12 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 31742.78ms
[13:57:30] 31s is... long
[13:59:38] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=kfserving-system&var-pod=All is interesting
[13:59:43] even if it shouldn't be the issue
[13:59:54] the kfserving controller seems to be throttled
[14:02:57] That should be fixable
[14:34:01] elukey it takes about 2 seconds
[14:34:19] real 0m2.190s
[14:34:19] user 0m0.468s
[14:34:20] sys 0m0.211s
[14:39:01] I got this result by logging into the KFv1.1 sandbox and running:
[14:39:01] $ cd ~/kevin-deployments/revscoring-inferenceservice
[14:39:01] $ time ./infer-enwiki-goodfaith.sh
[14:51:32] kevinbazira: thanks! I suspect that we are hitting some weird limitation or network bottleneck, very strange
[14:51:47] do you recall if you had to tune something on the kfserving side?
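The throttling seen on the kfserving controller is the usual symptom of a tight CPU limit on the pod. A generic sketch of the resources stanza involved, with purely illustrative numbers rather than the controller's real settings:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m      # CPU usage above this gets throttled by the kernel CFS quota
    memory: 300Mi  # exceeding this gets the container OOM-killed instead of throttled
```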
[14:53:21] mmm maybe I can use nsenter to see if using curl from the container takes ages
[14:57:05] yeah a simple curl runs very fasrt
[14:57:08] *fast
[15:04:42] Ummm... I'm not sure whether there was any kfserving tuning.
[15:04:42] Andy is the one who set up this environment and it uses MiniKF which is most likely different from the environment you're using.
[15:04:42] Here is the documentation he shared: https://docs.google.com/document/d/1aa6AxFN-89HgYuk-XfeLMcXSXd6S6SsPW5dR9XbIcyI/edit#
[15:05:31] yeah I am working on some limitations on pods, let's see what happens
[15:11:23] * elukey bbiab
[16:14:16] o/
[16:18:37] just reading all the past messages from today so far...
[16:18:59] elukey: wow yeah 31s is much much longer than anything we've seen on the sandbox clusters
[16:20:42] when i load tested enwiki-goodfaith the avg was ~30ms for small edits up to 2s for larger revisions
[16:22:21] accraze: o/
[16:22:42] I am currently reviewing all the pod constraints for memory/cpu, and I found https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/715053/, but not sure if related
[16:22:54] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details is very nice
[16:23:06] it breaks down pod usage/limits by cluster and namespace
[16:23:59] nice! didn't know we had grafana & prometheus wired up yet :)
[16:24:19] only for the cluster stuff, istio and knative are still not wired :(
[16:24:23] It's one of the upsides of (re)using puppet stuff
[16:28:28] ok super interesting
[16:28:44] now after the change for knative I see
[16:28:50] real 0m0.535s
[16:28:50] user 0m0.013s
[16:28:50] sys 0m0.007s
[16:29:05] that is A LITTLE faster
[16:29:28] every ms counts!
[16:29:53] ok half a second for a prediction looks good :D
[16:30:03] \o/
[16:30:30] it was just a templating issue?
[16:31:00] yeah it seems so, the knative pods were constrained in memory usage
[16:31:09] oh interesting
[16:31:10] wow
[16:31:31] I was reviewing the values and they were not matching with the settings, so I rechecked the template and discovered the mess that past Luca did
[16:31:39] lol
[16:31:47] past self is the worst person ever
[16:32:08] I get 0.5s on avg consistently now
[16:33:59] awesome, that's the average on the sandboxes right now too
[16:34:26] we are starting to see some light at the end of the tunnel :D
[16:35:41] accraze: did you see my conversation with Kevin about the multi-stage build? It may be very interesting to do, not sure if Releng has some guidance about the Python use case
[16:36:25] yeah was just reading that, definitely worth looking into, will make a task later today to investigate a bit more
[16:37:15] i know there are python-build images that dispose of the wheels/packaging/etc so I'm sure there is a way to slim things down
[16:37:52] the biggest unknown is what we can get rid of after installing the scipy stack
[16:38:12] accraze: one quick refactoring to prep for this could be to split the apt install of the packages, moving the aspell-related things to the production variant
[16:38:49] not sure about the others, but definitely gfortran g++ etc..
should stay only where we pip install
[16:38:55] (namely in the build variant)
[16:41:08] I am confident that we will trim down some hundreds of MBs
[16:41:33] yeah i agree, there is a lot of cruft included in the production image right now
[16:41:38] we'll still have to manage "jumbo" images
[16:41:53] but all bytes count :D
[16:42:26] agreed
[16:44:08] I must have spoken too soon earlier on, I am still getting the 30s slow request
[16:44:12] but now only sometimes
[16:44:33] wat
[16:44:42] still using the same revid?
[16:45:18] yes
[16:45:26] p99.9 curse
[16:45:31] lol
[16:45:48] it must be something else, I get it very rarely now
[16:46:04] could it be something related to psp?
[16:46:26] in theory those restrict only what a system user can do, like creating pods etc..
[16:47:05] but I checked all the other layers, we are not throttled anymore (istio/knative/kfserving/inference)
[16:47:41] accraze: I opened a task for the mwapi lib, I checked briefly this morning but I didn't find a way to specify an internal endpoint :(
[16:48:02] cool yeah i just saw that, I can dig in a bit today on it
[16:48:07] <3
[16:48:15] thanks all
[16:48:21] going to log off for the day folks, see you tomorrow!
[16:48:25] bye!
[16:49:39] see ya elukey
[17:38:50] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) I added a `wikitext.revision.list_items` feature to revscoring for tracking articles that are in outline form (...
[17:39:59] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) Once this is merged, I'll use this and other improvements to re-generate the models. Then we can use those mod...
[21:44:48] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (ACraze) The pipeline has been fixed, no more pip backtracking after pinning dep versions in the most recent patchset
[22:10:06] oof forgot how complex the ores api extractors code is
[22:21:20] abstractions on top of abstractions on top of abstractions XD
[22:22:12] the good news is that `mwapi.Session` seems to be interchangeable with `requests.Session()` which should do what we want
[22:47:11] Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (ACraze) @elukey - thanks for bringing this up! I dug into revscoring and mwapi today, it seems that `mwapi.Session` is interchangeable with [[ https://docs.python-requests.org...