[05:07:50] Hola! o/
[05:22:45] proposal to enable multiprocessing for articlequality for enwiki in prod https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965933
[05:23:11] I started quite conservatively with 2 workers and 3 cpus
[05:32:17] at the moment I'm deploying all the changes to the rest of the revscoring services and will deploy the articlequality with mp later
[05:56:18] all revscoring servers (except articlequality) deployed and tested!
[05:56:40] morning!
[05:56:48] +1 for articlequality
[05:57:23] (PS5) Elukey: llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965192
[06:25:28] morning Luca!
[06:25:59] I'm also planning to deploy langid to the new llm namespace today. I just added some tests in httpbb
[06:26:19] and will update api gw docs
[06:28:41] super
[06:35:08] very curious about articlequality
[06:45:12] there is also https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/965192
[06:45:19] to upgrade the LLM image to kserve 0.11
[06:46:33] (CR) Ilias Sarantopoulos: [C: +1] llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965192 (owner: Elukey)
[06:47:14] thanksss
[06:47:17] building the img
[06:47:25] (CR) Elukey: [C: +2] llm: Upgrade to KServe 0.11.1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965192 (owner: Elukey)
[06:48:05] we are missing outlink and nsfw (that is not used)
[06:48:22] so in theory we could start thinking about migrating the control plane
[06:48:27] cc: klausman: --^ :)
[06:59:21] * isaranto Afk - be back in an hour
[06:59:36] We'll deploy articlequality when I'm back
[07:22:15] * elukey commute to the office
[07:41:18] (PS4) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607)
[07:45:00] (CR) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing (5 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[07:57:02] Lift-Wing, Machine-Learning-Team, Patch-For-Review: kserve CORS error - https://phabricator.wikimedia.org/T348511 (elukey) Fix deployed, commented on the pull request, let's see if the issue is fixed.
[07:57:09] isaranto: CORS fix deployed
[08:02:54] Ack! Will try to check
[08:04:23] Machine-Learning-Team, Item Quality Evaluator, Wikidata, wmde-wikidata-tech, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (noarave) https://github.com/wmde/wikidata-item-quality-evaluator/pull/21
[08:25:48] (CR) Elukey: "Very nice improvement! I left some other comments, and also I'd add a couple of tests for the new functions that you created, to verify t" [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[08:35:48] Just deployed articlequality, I'll be monitoring for changes/improvements in preprocessing latencies and let you know 🤞
[08:51:54] isaranto: I modified the kserve isvc dashboard to select the model name
[08:52:33] it is not perfect but it isolates values in a better way
[09:02:34] elukey: what was there previously? the pod name?
[09:04:28] isaranto: all the models in one place
[09:04:31] now you can select
[09:05:38] ok I see. I don't understand what are the different instances of the same model I see in the charts though. do u know?
[09:14:47] different pods basically
[09:14:53] do you want to see their names?
[09:15:41] The only time I need them is when I want to check when a new deployment kicked in
[09:16:01] because the revision is in the name of the pod
[09:17:47] isaranto: check now if it is better
[09:20:06] I think so. I am wondering if there could be a grouping (and avg) per revision as I find it more difficult to read
[09:20:29] but please dont do it now. We can try some other time, for the time being this is super helpful. thanks!
[09:23:07] Machine-Learning-Team, Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati) Thanks again for your fast reaction @xcollazo , much appreciated! I can conf...
[09:30:59] isaranto: with avg it could hide some pod latency differences, let's think about a solution okok
[09:31:22] ok!
[09:36:59] isaranto: I am looking at the latency metrics, and the predict ones for enwiki articlequality are weird
[09:37:17] it seems as if it is using MP for predict too
[09:40:01] indeed they are weird
[09:40:41] isaranto: /home/elukey/Wikimedia/inference-services/revscoring_model/model_servers/model_server_mp.py
[09:40:44] err
[09:40:49] self.inference_mp = strtobool(os.environ.get("INFERENCE_MP", "True"))
[09:41:33] Yeah just saw
[09:41:41] by default they are both activated
[09:41:53] but then we tested it in staging as well
[09:41:58] this means that in the load tests I did it was there
[09:42:22] I'd vote to set INFERENCE_MP to false
[09:42:28] me2
[09:42:35] let's file a change
[09:42:37] doing it now
[09:43:10] and perhaps afterwards we can remove the default values and throw an error if they are not defined
[09:43:40] or a warning with the value, so that we know that it is better to set it explicitly
[09:44:08] we can set it to False (both), seems to be the safest
[09:46:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/966142
[09:46:39] +1ed
[09:46:47] ty
[09:48:41] elukey: \o yes, re: kserve 0.11. Ideally I want to get that done soon, before you tend to more important matters at the end of the year :)
[09:49:39] (PS1) Elukey: blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647)
[09:50:04] (CR) CI reject: [V: -1] blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[09:52:01] deployed!
[09:52:08] (PS2) Elukey: blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647)
[09:52:22] isaranto: look at this beauty --^ :D
[09:52:30] just saw
[09:52:40] 100% failure
[09:52:55] yeah I added the wrong version
[09:52:57] should work now
[09:53:10] a lot of deployments
[09:53:48] aa I just deployed all revscoring this morning :)
[09:53:57] I know :)
[09:53:58] and I had updated the blubber image
[09:54:07] checking the kernel versions, maybe we are good
[09:54:14] upside = more frequent deployments
[09:54:48] https://www.youtube.com/watch?v=X_-q9xeOgG4
[10:01:40] isaranto: 20231008 seems good
[10:02:06] so I'll update all to 20231015 just to be consistent, but we don't have to redeploy revscoring
[10:04:27] ok https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/966144 looks good
[10:04:59] u mean that it contains the sec update?
[10:05:07] exactly
[10:05:09] (CR) Santhosh: [C: +1] "I tested and works as expected. +1 as I do not have +2 rights :-)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) (owner: Ilias Sarantopoulos)
[10:06:24] (PS5) Ilias Sarantopoulos: langid: allow local runs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404)
[10:10:15] (CR) Klausman: [C: +1] blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[10:30:34] (CR) Ilias Sarantopoulos: [C: +2] "Thanks for the review! I'll proceed with this for now which has been thoroughly tested and we'll keep an eye for a new release." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) (owner: Ilias Sarantopoulos)
[10:31:20] (Merged) jenkins-bot: langid: allow local runs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/965622 (https://phabricator.wikimedia.org/T347404) (owner: Ilias Sarantopoulos)
[10:34:44] (PS5) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607)
[10:37:56] (CR) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing (7 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[10:42:53] (PS3) Ilias Sarantopoulos: blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[10:43:53] (CR) Ilias Sarantopoulos: [C: +1] blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[10:44:29] elukey: I did a manual rebase cause I merged the langid patch which had the old blubber version. shall I merge it so that I can deploy langid with new blubber at once?
[10:46:08] (CR) Elukey: Use envoy proxy to access endpoints external to k8s/LiftWing (8 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[10:46:30] isaranto: sure sure!
[10:46:55] lets goooo
[10:47:05] I declare this day deployment monday
[10:47:37] isaranto: so revscoring images can be skipped, revscoring-la the same (they are running with the fix)
[10:47:42] the rest would need a deployment
[10:47:53] but we are in the middle of kserve 0.11 upgrades, so let's be careful
[10:48:26] sure, I'll proceed with langid for now so that I can focus on one thing (api-gw etc)
[10:48:58] +1
[10:54:29] going afk for lunch!
[10:56:22] (CR) Ilias Sarantopoulos: [C: +2] blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[11:02:03] (PS6) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607)
[11:05:10] (Merged) jenkins-bot: blubber: bump Bullseye version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966144 (https://phabricator.wikimedia.org/T348647) (owner: Elukey)
[11:05:49] (CR) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing (3 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[11:12:18] * isaranto afk - lunch!
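Note on the INFERENCE_MP exchange (09:40 to 09:44): the idea floated there, dropping the implicit "True" default and either failing or warning when the variable is unset, could look roughly like the sketch below. This is not the code in model_server_mp.py. The names read_flag and parse_bool and the strict/fallback parameters are hypothetical; only INFERENCE_MP comes from the log. A small local parser is used instead of strtobool, since distutils is no longer shipped with recent Python versions.

```python
import os
import warnings
from typing import Mapping, Optional

_TRUE = {"1", "y", "yes", "t", "true", "on"}
_FALSE = {"0", "n", "no", "f", "false", "off"}


def parse_bool(value: str) -> bool:
    """Parse a truthy/falsy string, rejecting anything unrecognised."""
    normalised = value.strip().lower()
    if normalised in _TRUE:
        return True
    if normalised in _FALSE:
        return False
    raise ValueError(f"invalid boolean value: {value!r}")


def read_flag(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    *,
    strict: bool = True,
    fallback: bool = False,
) -> bool:
    """Read a boolean flag from the environment without a silent default.

    strict=True  -> raise KeyError if the variable is not set.
    strict=False -> warn, then return `fallback` (hypothetical behaviour).
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        if strict:
            raise KeyError(f"{name} must be set explicitly")
        warnings.warn(f"{name} is not set, falling back to {fallback}")
        return fallback
    return parse_bool(raw)
```

With something like this, a deployment that forgets to set the flag fails at startup (or at least logs a warning) instead of silently enabling multiprocessing for predict, e.g. `inference_mp = read_flag("INFERENCE_MP")`.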
[11:18:43] * klausman also lunch
[12:08:27] Articlequality seems to be going well now. Inference latencies are back down and preprocess spikes at p99 seem to have been cut down to half at 5s instead of 10s. Too early to draw conclusions but I'll run a load test on eqiad to check
[12:29:47] Morning alll!
[13:01:29] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [research/recommendation-api] - https://gerrit.wikimedia.org/r/966157 (owner: L10n-bot)
[13:02:58] Hi Chris!
[13:07:43] morning!
[13:17:39] I updated the images in all the deployments https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/966199
[13:17:50] (CR) Elukey: Use envoy proxy to access endpoints external to k8s/LiftWing (3 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[13:18:22] (CR) Elukey: "There are 6 unsolved comments, once done I think that we should be good to go." [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[13:18:27] give me 10' to check them again before you review
[13:21:01] I need some tea first
[13:25:35] ahhaha ok :)
[13:35:52] I have a couple of changes to add pre-commit+ruff+black+Etc.. to rec-api
[13:36:05] basically a shameless copy of what Ilias did for inference services :D
[13:36:17] I'll send the patches once Kevin's change is merged
[13:42:21] <3
[13:42:43] elukey: the alerts patch is ready to do, I had missed that unresolved comment somehow
[13:45:51] a test is failing :(
[13:51:27] :(
[14:04:59] now CI is cooperating
[14:06:52] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (isarantopoulos) I have added a new entry in the [[ https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_langu...
[14:14:11] I also double checked the update of all the blubber images, they seem correct
[14:16:57] If there isn't any objection I'm going to merge langid deployment changes https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/965189
[14:16:57] and the kafka alert patch https://gerrit.wikimedia.org/r/c/operations/alerts/+/962056
[14:17:16] (CR) Kevin Bazira: Use envoy proxy to access endpoints external to k8s/LiftWing (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[14:17:24] isaranto: I have an extra request for the alert patch, sorry
[14:17:34] sending it in a bit (on the phone atm)
[14:19:13] no worries. I'm not in a hurry for that one. whenever u can
[14:29:01] (CR) Elukey: Use envoy proxy to access endpoints external to k8s/LiftWing (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[14:32:32] (CR) Elukey: Use envoy proxy to access endpoints external to k8s/LiftWing (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/965142 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[14:39:22] isaranto: re alert - I checked the metrics in thanos and in codfw there are two topics for the same "group"
[14:39:47] so I am not sure if it will work 100%, I wanted to propose something simpler like using the topic name directly
[14:39:50] (without avgs etc..)
[14:40:01] will check
[14:40:02] but the topic name has the var-site in the name
[14:40:07] and I am not sure how to render it
[14:40:16] I'm trying to debug langid deployment atm
[14:40:22] ah ookok
[14:40:24] later np
[14:47:16] hmm it cannot connect to swift - it cant find credentials
[14:47:52] s3cmd -H -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/llm/langid/
[14:52:21] I'm talking about the deployment. Storage-initializer is failing
[14:53:00] maybe I missed sth when configuring the deployment but I can't find a difference with other namespaces (at least from what I see in the deployment-charts repo)
[14:53:40] isaranto: ah ok sorry I didn't see "it"
[14:53:55] out of context it seemed as if you didn't find the command :)
[14:54:08] if you want I can check
[14:54:21] yeah my bad didnt explain exactly
[14:54:40] if u can that would be great, I am trying to find sth with no luck atm
[14:54:48] staging or prod?
[14:55:44] staging. I tried to deploy there and there is a pod in CrashLoopBackOff with a log in the storage-initializer `botocore.exceptions.NoCredentialsError: Unable to locate credentials`
[14:55:55] it may be the same in prod though
[14:56:26] ahhhh yes we are missing a change in puppet private!
[14:57:10] we have a way from puppet private to materialize helmfile yaml config files on the deployment node
[14:57:34] so that things like swift user/pass can be picked up and set up correctly, without the need for us to state them in clear text
[14:59:50] what is puppet private?
[15:00:47] chrisalbon: it is a puppet repo that only SREs can commit to, containing mostly passwords/certs/etc.. (all that we cannot put in clear text in a repo)
[15:05:20] isaranto: now we get
[15:05:21] Readiness probe failed: Get "http://10.194.61.202:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[15:05:33] so it may be the issue of the model being slow to come up
[15:05:39] but the storage init works!
[15:05:41] * elukey bbiab
[15:06:07] ah got it thanks elukey
[15:06:08] Thanks for explaining and taking care of this !
[15:22:37] the new issue is that I didnt add a file in the docker - and never tested the latest change. will file a new patch and test it
[15:32:24] super :)
[15:50:57] clumsy me
[15:52:06] https://blog.apnic.net/2023/06/02/ripe-86-bites-whats-the-time/ very interesting, I had it in my backlog
[15:52:21] very simple but nice read :)
[15:56:14] yes very simple 😛 https://blog.apnic.net/wp-content/uploads/2023/06/fig2-1.png
[15:56:22] thanks for sharing, will read it!
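Note on the storage-initializer failure (14:55 to 14:57): botocore raises NoCredentialsError when none of its credential sources yields a key pair, and one of those sources is the pair of environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. A quick way to confirm the diagnosis from inside a pod is a pre-flight check like the sketch below. It assumes the swift user/pass reach the container as those two variables, which the log does not state; missing_credentials is a hypothetical helper, not part of KServe or botocore.

```python
import os
from typing import List, Mapping, Optional

# Environment variables that botocore reads for static credentials.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credentials()` with no argument inspects the current process environment; an empty list means both variables are present, while a non-empty list points at the kind of missing secret wiring that was fixed in puppet private here.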
[15:58:27] haahah nono ok I skipped completely that part
[15:59:03] but the concept of time and how much effort we put into synchronizing ourselves is staggering
[16:02:32] this is also a nice list of things https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b923ca
[16:03:09] TIL nice
[16:03:48] I always remember this when sth goes wrong with time/timezones/ daylight saving time etc
[16:06:06] got to go folks, will continue tomorro o/
[16:06:12] *morrow
[16:06:45] have a nice evening isaranto :)
[16:24:58] (PS1) Ilias Sarantopoulos: langid: fix - python utils missing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966252 (https://phabricator.wikimedia.org/T340507)
[16:25:18] now I am officially afk :) - fixed and tested the above
[16:26:37] (CR) Elukey: [C: +2] langid: fix - python utils missing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966252 (https://phabricator.wikimedia.org/T340507) (owner: Ilias Sarantopoulos)
[16:27:24] (Merged) jenkins-bot: langid: fix - python utils missing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/966252 (https://phabricator.wikimedia.org/T340507) (owner: Ilias Sarantopoulos)
[16:27:38] I am going to update the docker image in a bit
[16:29:01] i can do it tomorrow, dont worry. thanks though
[16:36:20] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (elukey) @RobH Hi! Lemme know if I can help in any way to move this forward, it is a non standard request I know, sorry and thanks for the patience!
[16:37:32] chrisalbon: I have added you and Tobias to the procurement tasks about new lift wing nodes
[16:49:49] isaranto: langid up and running :)
[16:54:54] * elukey afk
[16:55:00] have a nice rest of the day folks
[18:09:27] night elukey!