[07:44:45] (CR) Elukey: [C: +1] fix(ores-legacy): filter context based on request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917352 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:45:58] hello folks :)
[07:46:13] .6
[07:46:15] err :)
[07:51:35] (CR) Ilias Sarantopoulos: "Added 1 comment. lemme know if what I'm proposing can't be done or doesn't make sense." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917875 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[07:52:01] (CR) Ilias Sarantopoulos: [C: +2] fix(ores-legacy): filter context based on request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917352 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:52:15] Hi Luca!
[07:52:16] .7
[07:52:18] :)
[07:54:55] (Merged) jenkins-bot: fix(ores-legacy): filter context based on request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917352 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:55:19] :)
[07:57:17] ah nice, kserve merged my patch!
[08:00:51] isaranto: IIRC I saw https://fluid-cloudnative.github.io/ used with KServe to cache huge model server binaries instead of pulling from s3 every time (may be useful for your research)
[08:02:37] nice work elukey!
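[Editor's note: the merged fix above restricts an ORES-style response to the context (wiki) named in the request. A minimal sketch of that idea, with an illustrative payload shape — the actual change lives in r/917352:]

```python
def filter_contexts(scores: dict, requested_context: str) -> dict:
    """Keep only the requested context (wiki) in an ORES-style v3
    scores payload. Payload shape is illustrative, not the actual
    ores-legacy data model."""
    return {ctx: data for ctx, data in scores.items() if ctx == requested_context}
```

Contexts not named in the request are simply dropped, so a query for `enwiki` no longer leaks scores for other wikis.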
(thanks for the link as well)
[08:04:21] \o/
[08:04:47] so I am testing again ores-legacy from stat1004, and I keep seeing the ClientError response
[08:05:09] I tested the localhost:6031 endpoint on the pod via nsenter and it works afaics
[08:06:29] and in the pod's logs I see
[08:06:30] 2023-05-10 08:05:55,551 app.utils INFO IP:10.64.5.104, User-Agent:curl/7.64.0
[08:06:33] 2023-05-10 08:05:55,552 app.liftwing.response INFO Made #1 calls to LiftWing
[08:06:36] 2023-05-10 08:05:55,552 app.utils INFO response_time:0.0008001327514648438s
[08:06:39] INFO: 10.64.5.104:0 - "GET /v3/scores/enwiki/1234331345/damaging HTTP/1.1" 200 OK
[08:07:23] hmm..
[08:08:20] also it is weird, since if the response is non-200 from Lift Wing we should see an error log in theory
[08:09:00] ah no wait, if aiohttp.ClientError happens we also generate an error response
[08:09:07] maybe it would be good to add a log in there too?
[08:09:22] ok, I'll work on that then!
[08:09:36] isaranto: sending a patch in a min :)
[08:09:55] when I check ores it seems that the revid doesn't exist https://ores.wikimedia.org/v3/scores/enwiki/1234331345/damaging
[08:10:16] but we still get the same result even with one that does exist e.g.
123
[08:12:33] (PS1) Elukey: ores-migration: add more logging when Lift Wing calls fail [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918362 (https://phabricator.wikimedia.org/T330414)
[08:13:02] there --^
[08:17:16] (CR) Ilias Sarantopoulos: [C: +1] ores-migration: add more logging when Lift Wing calls fail [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918362 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:18:29] ok, so the async post fails then
[08:18:30] thanksss
[08:18:44] I think so, maybe I am misconfiguring the local proxy
[08:18:51] (CR) Elukey: [C: +2] ores-migration: add more logging when Lift Wing calls fail [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918362 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:19:09] I suspect it is a PEBCAK use case, but with more logging I'll know for sure :D
[08:19:56] (Merged) jenkins-bot: ores-migration: add more logging when Lift Wing calls fail [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918362 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:28:48] so it is definitely a ClientError excp, but
[08:28:49] ERROR LiftWing call for model damaging and rev-id 1234331345 raised a ClientError excp with message: http://localhost:6031:30443/v1/models/enwiki-damaging:predict
[08:28:52] lol
[08:29:53] but the issue is the double port
[08:30:21] not sure if the error msg can be improved, mmmm
[08:32:59] (PS1) Elukey: ores-migration: remove port definition from LIFTWING_URL [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918386 (https://phabricator.wikimedia.org/T330414)
[08:33:01] ok, so shall we add the port as part of the URL, or even as a different env var?
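[Editor's note: the "add more logging" patch merged above makes client-side Lift Wing failures visible instead of silent. A hedged sketch of the pattern, with illustrative names — `post_fn` stands in for the service's real async aiohttp call:]

```python
import logging

logger = logging.getLogger("app.liftwing.response")

def call_liftwing(post_fn, url, payload, model, rev_id):
    """Wrap a Lift Wing call so a client error is logged before the
    error response is returned. Names are illustrative; the real
    service catches aiohttp.ClientError in an async handler."""
    try:
        return post_fn(url, payload)
    except Exception as exc:  # aiohttp.ClientError in the real service
        logger.error(
            "LiftWing call for model %s and rev-id %s raised a %s with message: %s",
            model, rev_id, type(exc).__name__, exc,
        )
        return {"error": type(exc).__name__, "model": model, "rev_id": rev_id}
```

It was exactly this kind of error log that surfaced the malformed `http://localhost:6031:30443/...` URL a few minutes later.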
[08:33:11] ok u did it already :)
[08:33:22] isaranto: I think adding the port to the env var is fine, lemme know your preference
[08:33:33] (CR) Ilias Sarantopoulos: [C: +1] ores-migration: remove port definition from LIFTWING_URL [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918386 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:33:58] ack thanks! Building + deploying again :)
[08:37:53] (CR) Elukey: [C: +2] ores-migration: remove port definition from LIFTWING_URL [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918386 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:38:39] the other mystery that I'd love to solve is why uvicorn is able to bind on port 80
[08:38:55] that is privileged, and nothing indicates that it can
[08:39:04] 1) it runs as "somebody", not root
[08:39:14] 2) no suid for uvicorn afaics
[08:39:17] (Merged) jenkins-bot: ores-migration: remove port definition from LIFTWING_URL [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/918386 (https://phabricator.wikimedia.org/T330414) (owner: Elukey)
[08:39:23] 3) no special net capabilities for the pod
[08:40:10] (PS2) Ilias Sarantopoulos: ores-legacy: switch app port to 8080 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/914786 (https://phabricator.wikimedia.org/T330414)
[08:40:30] ¯\_(ツ)_/¯
[08:40:43] I opened a patch to change it to 8080 https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/914786
[08:41:01] yep yep saw it!
[08:45:23] isaranto: works now!!!
[08:45:42] \o/
[08:46:28] great!
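[Editor's note: the double-port bug happened because LIFTWING_URL already carried a port and the code appended another one, yielding `http://localhost:6031:30443/...`. A sketch of the fix's idea — keep host and port in separate settings so they can only combine once. Env var names here are illustrative, not the service's actual config:]

```python
import os

def build_predict_url(model: str) -> str:
    """Build the Lift Wing predict URL from a host setting and a
    separate port setting, so a port baked into the base URL can no
    longer stack into host:6031:30443. Variable names are illustrative."""
    host = os.environ.get("LIFTWING_URL", "http://localhost")
    port = os.environ.get("LIFTWING_PORT", "6031")
    return f"{host}:{port}/v1/models/{model}:predict"
```

With the port removed from the base URL, the same deployment can point at a different port (e.g. the proxy's) by changing one variable.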
I tested it too
[08:47:09] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (elukey) Update from T335756: We can now test the ores-legacy staging endpoint via the following (from any stat100x host for example): ` elukey@stat1004:~$ tim...
[08:48:07] going to keep working on the staging VIP
[09:14:03] Lift-Wing, Machine-Learning-Team: Move Revert-risk multilingual model from staging to production - https://phabricator.wikimedia.org/T333124 (achou) Proposed model card for the model: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Multilingual_revert_risk_model_card
[09:33:42] ok so I am one patch away from having the LVS endpoint working for staging, but I'd need somebody from the traffic team to assist
[09:45:42] so I think I found why we can bind port 80
[09:45:45] root@ores-legacy-main-5468c74c5b-ddfm4:/# cat /proc/sys/net/ipv4/ip_unprivileged_port_start
[09:45:48] 0
[09:45:56] this is something that docker sets IIUC
[09:48:12] https://github.com/moby/moby/pull/41030
[09:51:58] https://medium.com/@olivier.gaumond/why-am-i-able-to-bind-a-privileged-port-in-my-container-without-the-net-bind-service-capability-60972a4d5496
[09:52:03] explained in more detail
[09:52:13] (CR) Elukey: [C: +2] ores-legacy: switch app port to 8080 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/914786 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[09:53:25] (Merged) jenkins-bot: ores-legacy: switch app port to 8080 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/914786 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[09:57:14] thanks for the resource, interesting..
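[Editor's note: the mystery above resolves as follows: the kernel sysctl `ip_unprivileged_port_start` defaults to 1024, but Docker sets it to 0 inside containers (moby PR #41030), so any process may bind any port without root or CAP_NET_BIND_SERVICE. A small sketch of that check, falling back to the kernel default when the sysctl file is unavailable:]

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")

def unprivileged_port_start(default: int = 1024) -> int:
    """Lowest port a non-root process may bind. Kernel default is 1024;
    Docker sets it to 0 in containers, which is why uvicorn could bind
    port 80 in the pod."""
    try:
        return int(SYSCTL.read_text().strip())
    except (OSError, ValueError):
        return default

def needs_net_bind_service(port: int, start: int) -> bool:
    # Binding below `start` requires root or CAP_NET_BIND_SERVICE.
    return port < start
```

With `start == 0` (in-container), binding port 80 needs no capability at all; on a stock host (`start == 1024`) it would fail for an unprivileged user.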
[09:59:11] big TIL for me
[09:59:27] isaranto: if you have a moment https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/918415 (and next) :(
[09:59:33] err :)
[10:01:07] the port is hardcoded in multiple places, but it is fine by me (I doubt we will ever change it)
[10:02:55] deploying now to see if all works
[10:40:15] * elukey lunch!
[11:23:43] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs) >>! In T308134#8790079, @Trizek-WMF wrote: > Any update? All wikis have now results except [[ https://jbo.wikipedia.org/w/index...
[11:24:35] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs) >>! In T308133#8788082, @Tgr wrote: > so the maintenance script is getting recommendations from the service, just not good enough...
[12:23:15] elukey: once you're back: what shall we do with the memory errors on ml-s1002?
[12:24:48] "nothing"/"wait" are options, but it's happened more than once now, albeit a year apart.
[13:01:04] managed to run the bloom-560m model in a kserve container \o/
[13:01:08] but...
[13:01:23] one request took 3.5 minutes
[13:01:24] hahahaha
[13:01:29] cpu ofc
[13:01:50] _but it ran_
[13:02:18] Bit surprised the k8s infra didn't terminate that connection because it was "obviously" taking too long.
[13:03:07] I didn't clarify: I ran it locally in a container the way we do with other models
[13:06:07] but definitely there is a lot of optimization that can happen over there. The same thing on my host runs in 5-10 seconds, so nothing to safely report at the moment
[13:07:13] ah, right
[13:32:15] isaranto: lol. I doubt it would go from 3.5 minutes to seconds but I know santhosh has had success with the ctranslate package (https://opennmt.net/CTranslate2/guides/transformers.html#bloom).
also bloom is technically not open-source because it has use restrictions (don't use it to generate disinformation etc.). for testing, to avoid concerns, you might want to stick with traditional OSI-licensed models like mt0:
[13:32:16] https://huggingface.co/bigscience/mt0-base
[13:32:46] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF) Let's go then with `gor` + all round 9 (except `jbo` and `ik`). Can we deploy next Wednesday, May 17? I'm adding it to T...
[13:33:54] thanks isaacj! was just trying out a proof of concept with bloom, it is not that we will deploy bloom specifically
[13:34:20] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF)
[13:34:21] regarding latency, it has to do with local docker configuration I guess
[13:35:07] ahh that makes sense. and frankly just knowing that even these small ones take 3.5 minutes is a good data point for helping to understand the feasibility of the models that are another order of magnitude or two larger
[13:37:26] exactly! 👌 Even 10 seconds on a "smaller" model (by smaller we mean 1-3 GB) demonstrates the need for GPUs in inference
[13:37:41] klausman: you can open a task to dcops to get a replacement of the memory banks, if there is a specific target
[13:37:52] isaranto: nice test and results!
[13:38:07] I will have to do some firmware-side log digging.
I think it was always DIMM_B1, but I need to be sure
[13:38:15] (PS2) AikoChou: revert-risk: add revert-risk wikidata model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917875 (https://phabricator.wikimedia.org/T333125)
[13:38:21] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF)
[13:38:48] klausman: don't spend too much time on it though, it is fine also to ask for advice from the dcops folks
[13:39:25] aye. I dunno how much of a hassle with the vendor/warranty it is, that would make a difference
[13:40:08] this is why I mentioned asking dcops first :) it may be very simple to get a replacement with racadm logs or similar if the host is under warranty
[13:40:53] ayup.
[14:03:56] isaranto: all settings for port 8080 work now!
[14:04:04] (CR) AikoChou: revert-risk: add revert-risk wikidata model server (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/917875 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[14:05:15] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (elukey) Changed all the settings to use port 8080 instead of 80, works fine again!
[14:11:31] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs)
[14:12:31] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs) In progress→Open
[15:13:04] isaranto: can you ssh to deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud by any chance?
[15:14:14] \o/
[15:14:22] yep now I have access!
[15:14:25] super :)
[15:14:26] thanks Luca!
[15:14:54] ok so you should also have a tab for "deployment-prep" in horizon.wikimedia.org (have you used it? Don't recall)
[15:15:17] in there you can create vms and do stuff, basically full power (so be careful :D)
[15:15:39] but the idea should be to be able to deploy to mediawiki from the above deploy03 node, when you merge the code in master
[15:15:58] (or maybe we could cherry-pick, not sure what's best, probably going through a full review is good anyway)
[15:17:52] haven't used horizon!
[15:25:58] I'm in
[15:26:55] super, this is basically the place to check for vms etc. related to various projects
[15:27:05] you can look for their names, sizes etc.
[15:27:30] in the top you should see a dropdown with your projects (namely where you can check vms etc.)
[15:49:16] ack!
[15:49:36] now I need to understand how to actually test things
[16:04:40] logging off folks, cu!
[16:34:52] * elukey afk!
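[Editor's note: the bloom-560m comparison earlier in the log (~3.5 minutes in-container vs ~5-10 seconds on the host) is the kind of number worth measuring consistently. A minimal, hypothetical timing harness — `fn` stands in for any model `predict()` call; nothing here is the team's actual tooling:]

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn a few times and return (last_result, best_seconds).
    Taking the best of several runs reduces noise from cold caches
    when comparing in-container vs on-host inference latency."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best
```

A two-order-of-magnitude gap between the two environments, as seen here, points at container configuration rather than the model itself.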