[07:21:54] (PS2) Kevin Bazira: article-descriptions: fix wikipedia api summary endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123)
[07:33:10] (CR) Elukey: article-descriptions: fix wikipedia api summary endpoint (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:36:19] Hello folks! Going to be afk this morning, I'll answer to pings (if any) after lunch!
[07:40:51] Good morning o/
[08:24:12] (PS3) Kevin Bazira: article-descriptions: fix wikipedia api summary endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123)
[08:25:49] (CR) Kevin Bazira: article-descriptions: fix wikipedia api summary endpoint (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[09:34:16] morning!
[09:37:31] (CR) Klausman: [C: +1] article-descriptions: fix wikipedia api summary endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[09:37:48] morning
[09:38:11] elukey: I can +2 and push Kevin's change if you're ok with it. he did address your two comments
[10:05:51] (CR) Klausman: [V: +2 C: +2] article-descriptions: fix wikipedia api summary endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[10:06:37] (Merged) jenkins-bot: article-descriptions: fix wikipedia api summary endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976965 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[10:06:40] kevinbazira: I just +2'd the a-desc change and will push it in experimental as soon as it's merged
[10:08:42] oh, oops, deployment charts is of course still todo
[10:12:58] thanks klausman. let me work on the deployment charts
[10:13:36] :+1:
[10:28:24] I've pushed the patch. please review whenever you get a minute. thanks in advance: https://gerrit.wikimedia.org/r/978168
[10:35:53] lgtm! Just merged it!
[10:42:50] great. thanks isaranto!
[10:48:19] added a complimentary +1 :)
[10:48:59] super :)
[10:49:07] will push now
[10:51:09] ok, new pod is running, old is shutting down
[10:51:52] and done. Feel free to test the new image
[10:54:25] great. let me check ...
[10:55:17] From the kserve logs: 2023-11-29 10:52:23.764 uvicorn.access INFO: 127.0.0.6:0 1 - "POST /v1/models/article-descriptions%3Apredict HTTP/1.1" 500 Internal Server Error
[10:56:04] There's also a stack trace, let me paste that
[10:56:53] https://phabricator.wikimedia.org/P53939
[11:01:11] kubectl logs on my end show `HTTP 502` errors as shown below:
[11:01:11] ```
[11:01:11] INFO:root:Opening a new Asyncio session for mwapi.
[11:01:11] ERROR:root:Failed to retrieve first paragraph: 502, message='Bad Gateway', url=URL('http://rest-gateway.discovery.wmnet:4111/en.wikipedia.org/v1/page/summary/Clandonald')
[11:01:11] INFO:root:Opening a new Asyncio session for mwapi.
[11:01:11] ERROR:root:Failed to retrieve first paragraph: 502, message='Bad Gateway', url=URL('http://rest-gateway.discovery.wmnet:4111/fr.wikipedia.org/v1/page/summary/Clandonald')
[11:01:11] ```
[11:01:12] klausman: please confirm whether we are able to hit both:
[11:01:12] ```
[11:01:13] http://rest-gateway.discovery.wmnet:4111/en.wikipedia.org/v1/page/summary/Clandonald
[11:01:13] ```
[11:01:14] and:
[11:06:51] The first one definitely works
[11:07:12] https://phabricator.wikimedia.org/P53940
[11:08:11] ok, so no need to set a host header for this?
[11:08:29] I don't think so. Since the wiki is part of the request path
[11:08:57] okok, what about the second url?
[11:09:07] I don't see a second URL?
[11:09:54] here it is: `http://rest-gateway.discovery.wmnet:4111/fr.wikipedia.org/v1/page/summary/Clandonald`
[11:10:41] That works the same way, except the summary is in the expected French
[11:11:18] response code is 200, and the body is JSON, with `"extract":"Clandonald est un hameau (hamlet) du Comt\xc3\xa9 de Vermilion River, situ....`
[11:11:51] plus a bunch of other fields, analogous to the English response of the first URL
[11:12:41] great. so how come both are failing in the model-server?
[11:12:42] if you add a host header to the request you're sending do you get a `HTTP 502` error?
[11:12:54] testing...
[11:13:36] `r=requests.get(url, headers={"Host": "fr.wikipedia.org"},timeout=5)` results in a 502
[11:13:50] (`url` is the second one above)
[11:14:04] okok, so the host header is the issue. let me remove it. thanks for the test :)
[11:14:07] The response body is empty.
[11:15:06] nice job folks :)
[11:15:21] * elukey afk! back later
[11:27:47] * isaranto afk lunch
[11:30:43] (PS1) Kevin Bazira: article-descriptions: remove host header from rest-gateway endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978170 (https://phabricator.wikimedia.org/T343123)
[11:36:22] (CR) Klausman: [C: +1] article-descriptions: remove host header from rest-gateway endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978170 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[11:36:35] +1'd
[11:39:31] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978170 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[11:40:18] (Merged) jenkins-bot: article-descriptions: remove host header from rest-gateway endpoint [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978170 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[11:47:55] thanks klausman. I've updated the isvc image in: https://gerrit.wikimedia.org/r/978171
[11:48:03] Looking
[11:49:21] I don't see the image (yet) on https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-article-descriptions/tags/ So I'll give it a minute
[11:50:08] okok
[11:56:07] (PS3) AikoChou: revert-risk: add batch_model.py and USE_BATCHER env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/977135 (https://phabricator.wikimedia.org/T348536)
[12:00:27] kevinbazira: mh, it's still not showing. How long does that usually take?
[12:00:55] eh, website's just out of date, the docker pull works, +2'ing in a moment
[12:01:35] Good
[12:01:37] Morning
[12:01:50] klausman: the docker registry website is probably cached in your browser. please try using another browser or incognito.
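A minimal sketch of checking for a freshly published image tag without relying on the registry's web page, which the messages above suspect is stale. It assumes the registry answers the standard Docker Registry HTTP API v2 `tags/list` endpoint anonymously, which the log does not confirm; `tags_url`, `tag_is_published`, and `fetch_tags` are hypothetical helper names, not part of any Wikimedia tooling.

```python
import json
import urllib.request

REGISTRY = "https://docker-registry.wikimedia.org"
IMAGE = "wikimedia/machinelearning-liftwing-inference-services-article-descriptions"


def tags_url(registry, image):
    """Build the Docker Registry HTTP API v2 URL that lists an image's tags."""
    return f"{registry}/v2/{image}/tags/list"


def tag_is_published(payload, tag):
    """Return True if `tag` appears in a tags/list response body.

    The API returns {"name": ..., "tags": [...]}; "tags" can be null
    when the repository has no tags, so fall back to an empty list.
    """
    return tag in (payload.get("tags") or [])


def fetch_tags(registry=REGISTRY, image=IMAGE, timeout=5):
    """Fetch and decode the tag list (assumes anonymous read access)."""
    with urllib.request.urlopen(tags_url(registry, image), timeout=timeout) as resp:
        return json.load(resp)
```

Usage would be along the lines of `tag_is_published(fetch_tags(), "2023-11-29-114033-publish")`; a successful `docker pull` of the tag, as done in this conversation, remains the more direct check.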
[12:01:50] heyo Chris
[12:02:01] morning Chris o/
[12:02:26] I think the page just isn't regenerated very often
[12:03:30] Since I can docker pull 2023-11-29-114033-publish on my workstation, it should be fine
[12:05:17] Pushing to staging
[12:05:30] sure sure. another way I check this is using the CI logs: https://integration.wikimedia.org/ci/job/inference-services-pipeline-article-descriptions-publish/7/execution/node/64/log/
[12:07:46] kevinbazira: and should be ready for testing
[12:08:30] thanks klausman. let me check ...
[12:14:27] I am seeing this in the log: "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation." --- it's probably no biggy, but I thought you might want to know.
[12:20:30] doing some lunch, bbiab
[12:23:16] (PS4) AikoChou: revert-risk: add batch_model.py and USE_BATCHER env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/977135 (https://phabricator.wikimedia.org/T348536)
[12:25:51] (CR) AikoChou: revert-risk: add batch_model.py and USE_BATCHER env var (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/977135 (https://phabricator.wikimedia.org/T348536) (owner: AikoChou)
[12:45:10] klausman: yes, the max_length message is expected. it shouldn't affect the program flow.
[12:45:10] meanwhile, the request hangs for almost 6mins then times out:
[12:45:10] ```
[12:45:10] $ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
[12:45:10] request timeout
[12:45:11] real 5m0.052s
[12:45:11] user 0m0.015s
[12:45:12] sys 0m0.010s
[12:45:12] ```
[12:45:26] kubectl logs are showing:
[12:45:26] ```
[12:45:26] INFO:root:Opening a new Asyncio session for mwapi.
[12:45:26] ERROR:root:Failed to retrieve first paragraph: 503, message='Service Unavailable', url=URL('http://rest-gateway.discovery.wmnet:4111/fr.wikipedia.org/v1/page/summary/Clandonald')
[12:45:26] INFO:root:Opening a new Asyncio session for mwapi.
[12:45:26] ```
[12:45:38] looks like the model-server is failing to access this url: `http://rest-gateway.discovery.wmnet:4111/fr.wikipedia.org/v1/page/summary/Clandonald`
[12:45:38] on k8s/LiftWing.
[12:50:27] I spawned a python3 interpreter inside the kserve container and fetched that URL the usual way using the requests module, and it worked fine
[12:51:09] https://phabricator.wikimedia.org/P53946
[12:59:30] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Serbian-Sites, and 3 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (Sgs)
[12:59:43] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (Sgs)
[13:00:08] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Turkish-Sites, User-notice: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (Sgs)
[13:09:43] irrelevant to the above failure but kevinbazira: I figured out the issue we had with the header in mwapi.AsyncSession. we need to set session_params instead of header directly
[13:10:05] https://www.irccloud.com/pastebin/9PwfsSdI/
[13:10:38] oh I see. thanks for sharing isaranto!
[13:11:27] The team is firing on all cylinders today!
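The Host-header finding from earlier in the log (the wiki is selected by the request path, and adding `Host: fr.wikipedia.org` turned a 200 into a 502) can be sketched as a request builder. This is a minimal illustration using only the standard library rather than the `requests` call used in the conversation; `summary_url` and `build_summary_request` are hypothetical names, and the gateway address is the internal one quoted in the log, so the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

# Internal address quoted in the log; reachable only from inside the cluster.
REST_GATEWAY = "http://rest-gateway.discovery.wmnet:4111"


def summary_url(lang, title):
    """The wiki is part of the path, so no Host header is needed to pick it."""
    quoted = urllib.parse.quote(title, safe="")
    return f"{REST_GATEWAY}/{lang}.wikipedia.org/v1/page/summary/{quoted}"


def build_summary_request(lang, title, with_host_header=False):
    """Build (but do not send) the page-summary request.

    with_host_header=True reproduces the variant that, per the log,
    came back as a 502 with an empty body.
    """
    req = urllib.request.Request(summary_url(lang, title))
    if with_host_header:
        req.add_header("Host", f"{lang}.wikipedia.org")
    return req
```

The default (`with_host_header=False`) corresponds to the fix merged in https://gerrit.wikimedia.org/r/978170.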
[13:21:32] AND today :)
[13:21:35] kevinbazira: I can reproduce the long hang, but it's unclear to me what exactly the async call to the mwapi is hanging on
[13:21:38] haha
[13:29:00] (PS7) Ilias Sarantopoulos: article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940)
[13:30:06] (PS8) Ilias Sarantopoulos: article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940)
[13:31:38] aiko: I'll be done today with my tasks (I'll only have review changes if requested) and we can work on refactoring revertrisk together tomorrow if u are available
[13:38:33] kevinbazira: The patch for article-desc is ready for review. If you disagree with any of the changes lemme know and we can discuss here. the patch in integration/config needs to be merged first https://gerrit.wikimedia.org/r/c/integration/config/+/978535
[13:39:30] kevinbazira: I'll need your help to run it on sandbox the same way you do. Do you do docker pull etc after the image has been pushed ? or is there any other procedure?
[13:44:10] isranto: on the ml-sandbox I built the article-descriptions image with all the dependencies. whenever there's a new change to the model-server, I clone the new model-server and adjust the model-path (since the models are hosted locally in the container). Hereafter, I am able to test the kserve predictions locally.
[13:44:27] isaranto:--^
[13:44:46] (PS1) Elukey: article-descriptions: use a dedicated aiohttp session for rest-gateway [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123)
[13:45:10] ok, thanks! then you'll be able to follow the same procedure locally from now on
[13:45:26] not entirely sure if it will fix it --^
[13:45:34] but surely we need two separate sessions
[13:47:48] also, meta discussion :)
[13:47:49] (CR) Klausman: [C: +1] article-descriptions: use a dedicated aiohttp session for rest-gateway [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[13:47:50] Good point, Luca, +1'd
[13:47:56] locally it worked but I'm not using the rest-gateway
[13:48:09] we probably need to rethink how we manage aiohttp's async session
[13:48:17] because as of now they connect to localhost
[13:48:27] so we don't really need connection pools etc..
[13:48:41] (CR) Kevin Bazira: [C: +1] "thanks for the suggestion Luca. this is definitely worth a try!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[13:48:49] maybe using a new aiohttp session every time will be less error prone
[13:49:09] (the connection pool etc.. is in the istio-proxy/envoy)
[13:49:14] (PS9) Ilias Sarantopoulos: article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940)
[13:49:22] I can open a task
[13:52:48] mmm CI is taking a long time
[13:55:28] (CR) CI reject: [V: -1] article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[13:55:44] (CR) Elukey: [C: +2] article-descriptions: use a dedicated aiohttp session for rest-gateway [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[13:56:44] (Merged) jenkins-bot: article-descriptions: use a dedicated aiohttp session for rest-gateway [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[14:00:52] Machine-Learning-Team: Rethink aiohttp's session reuse in the isvc code - https://phabricator.wikimedia.org/T352290 (elukey)
[14:00:55] created --^
[14:00:57] lemme know :)
[14:00:58] isaranto: sure! let's work on it tmr :)
[14:01:58] isaranto: could you have a look at https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/977135? I managed to use KI_module as a class attribute in the constructor
[14:02:33] (PS5) AikoChou: revert-risk: add batch_model.py and USE_BATCHER env var [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/977135 (https://phabricator.wikimedia.org/T348536)
[14:03:18] klausman: reminder about the hw procurement tasks for lift wing expansion :)
[14:03:31] oh, right.
[14:05:28] elukey: thanks! I see https://gerrit.wikimedia.org/r/978542 was merged. should I update the isvc image or you're working on it?
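Whichever session strategy comes out of T352290, a hang like the five-minute one reported earlier is easier to diagnose when each upstream call carries its own deadline. The following is a minimal, stdlib-only sketch of that idea using `asyncio.wait_for`; it is not the model server's code, it does not use aiohttp, and `call_with_deadline` plus the two fake fetch coroutines are hypothetical stand-ins for the real mwapi and rest-gateway calls.

```python
import asyncio


async def call_with_deadline(make_call, timeout_s):
    """Await make_call(), but give up after timeout_s seconds.

    Returns the call's result, or None if the deadline passed, so the
    caller can log the failure instead of waiting for an outer timeout.
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def fake_hanging_fetch():
    """Stand-in for an upstream call that never answers."""
    await asyncio.sleep(3600)
    return {"extract": "never returned"}


async def fake_fast_fetch():
    """Stand-in for an upstream call that answers immediately."""
    await asyncio.sleep(0)
    return {"extract": "Clandonald est un hameau"}


if __name__ == "__main__":
    print(asyncio.run(call_with_deadline(fake_hanging_fetch, 0.05)))
    print(asyncio.run(call_with_deadline(fake_fast_fetch, 1.0)))
```

With aiohttp the equivalent knob would be a client timeout on the session or request; the sketch above only shows the deadline pattern itself.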
[14:06:50] kevinbazira: yep yep go ahead, I was waiting for the new image
[14:06:59] okok
[14:27:53] the CI job that publishes the article-descriptions image is taking longer than usual:
[14:27:53] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/978542/
[14:40:34] yeah weird
[14:40:55] I don't see anything in https://integration.wikimedia.org/ci/job/inference-services-pipeline-article-descriptions-publish/
[14:41:08] that should be the job that publishes the image
[14:42:39] https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-article-descriptions-publish/
[14:42:52] so this one wasn't kicked off after my +2
[14:43:13] (the last https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-article-descriptions-publish/7/ mentions another change)
[14:43:48] maybe if I add another +2?
[14:44:01] (CR) Klausman: [V: +2 C: +2] article-descriptions: use a dedicated aiohttp session for rest-gateway [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[14:44:25] it is merged already so it shouldn't matter
[14:44:43] Well, then I am just showing my support
[14:45:06] nono please it is good to reason out loud :)
[14:48:22] ok so I see https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-article-descriptions/45/
[14:50:13] (CR) Elukey: [C: +2] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/978542 (https://phabricator.wikimedia.org/T343123) (owner: Elukey)
[14:51:04] ah ok nice
[14:51:16] in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/978542
[14:51:33] you can see at the top (after the PR text box)
[14:52:06] "Checks"
[14:52:13] if you click on "Waiting for jobs"
[14:52:39] then info
[14:52:46] there is a publish job pending for some reason
[14:53:57] but nothing in https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-article-descriptions-publish/ sigh
[14:57:43] In wm-ops there is mention of a separate (still k8s) deploy taking long
[14:58:30] klausman: very useful page https://integration.wikimedia.org/zuul/
[14:58:40] at the bottom you see job stats, there is a huge queue
[14:59:17] and if you look for "inference" you'll see the patch in post merge waiting for a spot
[14:59:24] always forget to check the page
[14:59:34] so CI is a bit lagging
[14:59:40] nothing that we can do
[16:01:07] klausman: I am building the docker images on build2001 for istio etc.., and I am picking up the kserve ones as well
[16:06:40] roger re: zuul. and also ack re: kserve
[16:40:45] Stepping out folks, will be online a bit later for an openai webinar
[16:42:27] o/
[16:51:52] Machine-Learning-Team, serviceops: Multiple images fail to build from sources - https://phabricator.wikimedia.org/T350366 (elukey)
[17:24:38] kevinbazira: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/978542 completed, we have a docker image
[17:27:46] Machine-Learning-Team, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q2): Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390 (elukey) The new request SLI only for revscoring-articlequality (as opposed to a broader "all revscoring model server...
[17:28:51] elukey: thanks! I'll push this tomorrow and test it.
[17:28:53] getting afk. have a good evening. o/
[17:29:26] ack
[17:29:34] going afk as well, have a nice rest of the day folks
[17:34:50] chrisalbon: I have a very basic question about LLMs, please answer anytime, it is just a curiosity. How does a chatbot know when to call an external API to fetch data for a particular use case?
[17:35:14] Is NLP involved somehow (outside the transformer architecture), are there special tags to use, etc..?
[17:35:32] starting from a conversation of course
[17:40:14] Most common strategy I’ve seen is fine tuning the LLM with training data containing examples of what types of prompts should use the API.
[17:41:15] So you’d take a foundational model, then feed it examples like:
[17:41:45] “What was the score of the game last night” / use-api
[17:41:57] “Write a poem” / don’t use api
[17:46:30] Even just a few thousand examples used in fine tuning can produce really nice results
[18:06:29] ahhh okok nice, so it is always related to fine-tuning
[18:10:14] yeah, the nuances are really complex, like what if you want it to decide between nothing, use an API, or generate an image
[18:10:47] And fine tuning has a lot of issues. i.e. do you just fine tune the last layers of network? the whole thing? how do you handle overfitting etc etc
[18:10:53] But fine tuning is basically the answer
[19:45:02] (PS10) Ilias Sarantopoulos: article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940)
[19:45:48] (CR) CI reject: [V: -1] article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[19:49:14] (PS11) Ilias Sarantopoulos: article-descriptions: enable local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940)
[20:06:59] (CR) Ilias Sarantopoulos: "integration/config patch has been merged and CI works again. The patch has been rebased to include latest changes." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/976670 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
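The fine-tuning answer from the 17:40 to 18:10 exchange describes pairing prompts with a "use the API or not" decision. A minimal sketch of what such training rows might look like on disk follows. The JSON Lines layout, the field names `prompt` and `label`, and the label string `no-api` are assumptions made for illustration; the conversation only gives the two example prompts and the `use-api` marker.

```python
import json

# The two prompts come from the conversation; the label spelling
# "no-api" is an assumed stand-in for "don't use api".
EXAMPLES = [
    ("What was the score of the game last night", "use-api"),
    ("Write a poem", "no-api"),
]


def to_jsonl(examples):
    """Serialize (prompt, label) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "label": label}, ensure_ascii=False)
        for prompt, label in examples
    )


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

A real dataset would, per the conversation, hold a few thousand such rows, and could need more than two labels if the model also has to choose between options such as generating an image.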