[01:15:28] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Malayalam-Sites, User-notice: Deploy "add a link" to 12th round of wikis - https://phabricator.wikimedia.org/T308137 (Etonkovidova) In progress→Resolved
[01:31:16] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 (Etonkovidova) In progress→Resolved
[05:10:53] o/ Memory usage is stable until now!
[05:18:22] I am continuing with the deployments for the rest of the revscoring model servers
[05:35:35] (PS1) Ilias Sarantopoulos: Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/959383 (https://phabricator.wikimedia.org/T346446)
[05:37:31] (PS2) Ilias Sarantopoulos: Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/959383 (https://phabricator.wikimedia.org/T346446)
[05:38:38] (CR) Elukey: [C: +1] Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/959383 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:42:59] isaranto: o/ thanks!
Indeed it seems all good
[05:44:58] need to run errand, bbl
[05:46:10] (CR) Ilias Sarantopoulos: [C: +2] Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/959383 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:46:56] (Merged) jenkins-bot: Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/959383 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:56:34] (PS3) Ilias Sarantopoulos: fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446)
[05:58:13] Machine-Learning-Team, Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (isarantopoulos) All revscoring model servers have been re-deployed with this fix and successfully tested by running the httpbb tests.
[06:41:25] (CR) Elukey: fix: revscoring model server inputs (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[08:12:45] kevinbazira: o/
[08:13:02] one question - do we need to fix something in https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L137 ?
[08:13:20] now that we don't need embeddings, it doesn't make sense to always pull them and load them from swift no?
[08:17:26] elukey: o/
[08:17:37] The `related_articles` service is the one that required the embedding. Now that is switched off, we don't need to fix anything ATM.
[08:20:52] kevinbazira: IIUC from the code we still pull the embeddings from swift, is it the case?
[08:21:11] or is the code completely bypassed?
[08:22:28] Yes, when the service was switched off, that module wasn't loaded as shown in: https://phabricator.wikimedia.org/T339890#9181719
[08:27:09] kevinbazira: two suggestions - 1) let's make sure that memory is not used, you used --memory=10g in the docker run etc.. it is just to double check. 2) Please add some comments in the code review about the SWIFT env variables (namely we need them only if we turn on related_articles). 3) I'd also add a note in the memory requirements that we'd need more memory if related_articles is enabled, pointing
[08:27:15] to your task analysis.
[08:27:23] this would allow anybody, if you are not around, to know what to do etc..
[08:27:25] does it make sense?
[08:32:41] Machine-Learning-Team, Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) Open→Resolved
[08:32:54] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) Going through the [[ https://github.com/kserve/kserve/releases | Changelog ]] of kserve and issues I found out that after kserve 0.11 we explicitly need to...
[08:49:07] elukey: I tested the memory usage without the embedding and 1Gi is good. Notes regarding swift variables and disabling/enabling the related_articles service have been added to the patch.
[08:50:15] kevinbazira: super, +1ed, you are free to go :)
[08:52:31] thanks for the review. going to merge.
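The point about the SWIFT env variables being needed only when related_articles is turned on can be sketched roughly as below. This is an illustration only, not the recommendation-api code: the variable names `ENABLE_RELATED_ARTICLES` and `SWIFT_EMBEDDINGS_URL`, and both function names, are hypothetical.

```python
from typing import Callable, Mapping, Optional


def related_articles_enabled(env: Mapping[str, str]) -> bool:
    """Return True when the (hypothetical) feature flag is switched on."""
    flag = env.get("ENABLE_RELATED_ARTICLES", "false")
    return flag.strip().lower() in {"1", "true", "yes"}


def load_embeddings_if_needed(
    env: Mapping[str, str],
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Fetch the embeddings only if related_articles is enabled.

    `fetch` stands in for whatever actually downloads the object from Swift.
    """
    if not related_articles_enabled(env):
        # Service disabled: skip the download, so no Swift settings are
        # required and the embeddings never occupy memory.
        return None
    url = env.get("SWIFT_EMBEDDINGS_URL")  # hypothetical variable name
    if not url:
        raise RuntimeError(
            "SWIFT_EMBEDDINGS_URL must be set when related_articles is enabled"
        )
    return fetch(url)
```

In a real service the first argument would typically be `os.environ`, and the extra memory requirement would only apply on the enabled path.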
[08:57:26] (PS4) Ilias Sarantopoulos: fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446)
[08:59:37] (CR) Ilias Sarantopoulos: fix: revscoring model server inputs (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[09:06:34] (CR) Klausman: [C: +1] fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[09:06:57] elukey: I am on the deployment node now, checked and the merged change has been synced. Preparing to deploy on staging ...
[09:22:34] ack
[09:23:08] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) Adding middleware that manipulates the request would be a nice option but as I found out this is not supported yet. I found a [[ https://github.com/kserve/...
[09:29:06] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (kevinbazira)
[09:31:25] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (kevinbazira) The attempt to deploy the rec-api on LW staging returned: ` kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sy...
[09:31:57] elukey: https://phabricator.wikimedia.org/T347015#9186189
[09:34:15] kevinbazira: please don't use deploy1002, we have done the switchover, we need to use deploy2002
[09:34:38] it was announced in wikitech-l
[09:35:07] ok, I am changing the LW deployment docs to reflect deploy2002 instead of deploy1002
[09:48:28] kevinbazira: wait a sec, it is not always 2002 or 1002, it depends
[09:48:54] if you want to be sure, we need to use deployment.eqiad.wmnet
[09:49:02] it points to the right one, we could use this in the docs
[09:49:22] but in general, let's also pay attention to switchovers etc..
[09:49:32] yes, IIUC the switchover happens every 6 months
[09:49:52] elukey: ok, let me use that instead. thanks for the clarification :)
[09:50:15] exactly we'll now have a regular switch
[09:56:42] klausman: o/ Filed https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/959695 to improve the SLO calculations
[09:56:59] Looking
[09:57:38] LGTM!
[09:58:24] thanks!
[09:58:39] today I checked the RRLA budget and I saw 27% :D
[09:58:48] something was definitely off
[09:59:04] Oh. Yes.
[09:59:14] Though that can't all have been 3xx and 4xx?
[10:00:52] removed my +1, since I am not sure 3xx and 4xx should be part of the latency SLO
[10:01:02] (availability SLO: yes, absolutely)
[10:01:16] good point, we can discuss it
[10:01:46] rummaging through the SRE book for any info what is usually done
[10:02:41] https://cloud.google.com/blog/products/management-tools/practical-guide-to-setting-slos Search for "Latency SLI"
[10:03:01] > The proportion of HTTP GET requests for /checkout_service/response_counts that do not have 5XX status (3XX and 4XX excluded) that send their entire response within 500 ms measured at the Istio service mesh.
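One possible reading of the quoted latency SLI is that 3xx, 4xx and 5xx responses are dropped from both numerator and denominator, so only 1xx/2xx responses count. A minimal sketch under that assumption follows; the function name and the `(status, latency_ms)` input shape are made up for illustration and are not how the grafana-grizzly dashboards compute it.

```python
from typing import Iterable, Optional, Tuple


def latency_sli(
    requests: Iterable[Tuple[int, float]],
    threshold_ms: float = 500.0,
) -> Optional[float]:
    """Fraction of eligible requests answered within threshold_ms.

    Each request is a (status_code, latency_ms) pair. Eligible means
    status < 300: 3xx, 4xx and 5xx responses are left out entirely.
    Returns None when no request is eligible.
    """
    eligible = [latency for status, latency in requests if status < 300]
    if not eligible:
        return None
    fast = sum(1 for latency in eligible if latency <= threshold_ms)
    return fast / len(eligible)
```

With this definition a burst of fast 404s or slow 503s does not move the latency number at all; those responses would only show up in an availability SLI.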
[10:03:23] (bbiab sorry)
[10:03:39] I _think_ this means only 1xx and 2xx are counted for Latency (though I don't think we ever issue 1xx statuses)
[10:15:05] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (kevinbazira) On IRC, @elukey advised that we use `deploy2002` due to a recent datacenter switchover. The second attempt to deploy on staging returned: ` kevinbazira@deploy2002:/srv/deployment-char...
[10:15:40] elukey: https://phabricator.wikimedia.org/T347015#9186399
[10:17:38] klausman: +1 for the 3xx and 4xx
[10:17:43] kevinbazira: I'm taking a look!
[10:23:30] I ran the same command `helmfile -e ml-staging-codfw diff` and `helmfile -e ml-staging-codfw sync` . in the diff I saw everything (meaning nothing was applied) and the sync seems to take a while (or it is hanging)
[10:24:52] the pod is created, however I see it fails with a CrashLoopBackOff after some time
[10:32:20] back sorry
[10:32:27] klausman: ack will amend the change :)
[10:34:07] isaranto: did you get why it fails? If not let's redeploy to check
[10:35:40] (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net. [research/recommendation-api] - https://gerrit.wikimedia.org/r/957697 (owner: L10n-bot)
[10:35:44] on the sync command I got the same error as kevin
[10:37:22] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [research/recommendation-api] - https://gerrit.wikimedia.org/r/958441 (owner: L10n-bot)
[10:38:42] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (isarantopoulos) I tried to sync and got the same error as Kevin mentioned above. However I saw that a pod was created during the process which was failing with CrashLoopBackOff. The only issue I...
[10:39:01] from the pod I saw just what I wrote in the task --^ but now it is gone
[10:39:07] (the pod is gone)
[10:39:15] * isaranto goes for lunch
[10:42:04] yeah after a bit helm gives us
[10:42:06] *up
[10:46:04] kevinbazira: so to debug this
[10:46:13] (what I am doing basically)
[10:46:31] 1) deploy in a tmux session, detach and get to another shell on deploy2002
[10:46:51] 2) get the pods via kubectl, then the logs for the main container
[10:47:08] 3) kubectl describe pod etc.. to see what's failing
[10:47:27] so from 3), it is clear that k8s cannot make a correct health probe
[10:47:41] it also says "connection refused" for port 8080
[10:47:49] from 2) I noticed this: uWSGI http bound on :5000 fd 4
[10:48:54] so the issue is that uwsgi listens on 5000, but the deployment-charts settings say 8080 (it is the default for the chart)
[10:49:01] (see python-webapp's values.yaml)
[10:49:28] we can probably change 8080 with 5000 in the helmfile.d's values.yaml
[10:49:48] ok, let me change 8080 with 5000 in the helmfile.d's values.yaml
[10:50:10] lemme also know if you have doubts about the debugging above
[10:50:31] it is more important that we discuss that than the port :)
[10:51:10] thanks, I've been following ...
[10:51:37] super :)
[10:51:48] klausman: updated the CR! Thanks for the feedback
[10:53:42] LGTM!
[10:54:42] going afk for lunch, ttl!
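The diagnosis above boils down to comparing the port the app reports in its logs with the port the health probe targets. A small sketch of that comparison follows; the helper names are hypothetical, and it assumes the log text has already been captured (for example from `kubectl logs` on the main container) and contains a line like the one quoted in the chat.

```python
import re
from typing import Optional


def uwsgi_http_port(log_text: str) -> Optional[int]:
    """Extract the port from a 'uWSGI http bound on :PORT' log line."""
    match = re.search(r"uWSGI http bound on \S*:(\d+)", log_text)
    return int(match.group(1)) if match else None


def probe_port_mismatch(log_text: str, probe_port: int) -> Optional[str]:
    """Describe the mismatch between listening port and probe port, if any.

    Returns None when the ports agree or no uWSGI line is found.
    """
    listening = uwsgi_http_port(log_text)
    if listening is None or listening == probe_port:
        return None
    return f"app listens on {listening}, but the probe targets {probe_port}"
```

Fed the log line from the chat and the chart default of 8080, the second helper reports the mismatch that produced the "connection refused" probe failures and the CrashLoopBackOff.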
[11:02:55] * klausman lunch as well
[13:01:27] Good morning
[13:02:46] (PS1) Kevin Bazira: blubber: update recommendation-api-ng port [research/recommendation-api] - https://gerrit.wikimedia.org/r/958977 (https://phabricator.wikimedia.org/T347015)
[13:04:11] morning chris
[13:16:03] (CR) Elukey: [C: +2] blubber: update recommendation-api-ng port [research/recommendation-api] - https://gerrit.wikimedia.org/r/958977 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira)
[13:16:34] (Merged) jenkins-bot: blubber: update recommendation-api-ng port [research/recommendation-api] - https://gerrit.wikimedia.org/r/958977 (https://phabricator.wikimedia.org/T347015) (owner: Kevin Bazira)
[13:25:05] kevinbazira: the new image is ready --^
[13:25:40] elukey: updating the config ...
[14:03:13] kevinbazira: let's deploy!
[14:07:08] merged the update. now in the deployment node going to deploy ...
[14:10:01] kevinbazira: I think it will fail again, let's check the issue
[14:11:07] ok, I've run the `helmfile -e ml-staging-codfw sync` command. waiting to see what it returns.
[14:11:40] kevinbazira: yeah but when it returns helm will have already cleaned up the failing pod
[14:11:52] do you recall the 1) 2) 3) points above?
[14:12:00] so we try a live session :)
[14:13:56] yeah, no problem. here is the meet: https://meet.google.com/jkc-sivd-gtk
[14:14:11] kevinbazira: nono I mean we can do it now on IRC
[14:14:28] fine if we don't do meet (I can join if you prefer but not mandatory)
[14:15:38] ok, was waiting in the meet. let's discuss there.
[14:31:14] (debugging live with kevin, if people want to join etc..)
[14:32:01] coming!
[14:37:54] I have huge FOMO right now, but I'm getting my kids ready for school
[14:51:37] (CR) Elukey: [C: +1] fix: revscoring model server inputs (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[14:53:17] (CR) Ilias Sarantopoulos: [C: +2] fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[14:59:45] kevinbazira: +1!
[14:59:49] let's see if it works
[15:00:02] great. deploying...
[15:00:27] (Merged) jenkins-bot: fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[15:07:02] the pod is running:
[15:07:08] $ kubectl get pods
[15:07:08] NAME READY STATUS RESTARTS AGE
[15:07:08] recommendation-api-ng-main-7b8659c6f7-46dpd 2/2 Running 0 92s
[15:07:31] \o/
[15:09:52] great work kevinbazira 🎉
[15:10:26] thanks to isaranto and elukey :)
[15:11:57] wait a sec now we have to test the endpoint :)
[15:12:42] hihi yep, was looking at the docs but they only show for the revscoring models: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Internal_endpoints
[15:12:54] yes for these services is kinda new
[15:13:10] trying to find what we did for ores-legacy
[15:16:23] ah yes
[15:18:38] Great job! I just got back from dropping off my kids at school. I'm super bummed to have missed this
[15:21:21] sorry folks I need to check the TLS settings in staging, I'll explain afterwards and write docs
[15:21:59] sure, no problem.
[15:28:49] I think that we are missing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959789
[15:29:01] to enable all the istio configs etc..
including staging
[15:29:38] in theory after that we should be able to do:
[15:29:39] curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/spec
[15:30:39] yep diff looks good
[15:31:02] deploying now
[15:33:16] we use istio to route requests, and in staging we have a single VIP/LB endpoint so every time we don't have to add one etc..
[15:33:27] but the TLS config and the Istio routing config needs to be enabled :)
[15:34:06] yepp!
[15:34:16] kevinbazira: you can try the above curl command :)
[15:35:04] for prod we'll need to add a dedicated LV/VIP endpoint (recommendation-api-ng.discovery.wmnet), me or Tobias will handle the task
[15:37:21] thanks elukey, the curl command works. adding an update to the task ...
[15:39:57] going afk folks! Have a nice rest of the day!
[15:40:41] Enjoy your evening elukey!
[15:45:35] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (kevinbazira) The recommendation-api-ng has been successfully deployed on LiftWing staging: ` kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ curl https://recommendation-api-n...
[15:52:22] bye luca! have a nice evening :)
[15:56:45] o/
[15:57:30] \o