[07:05:43] Good morning! Thanks!
[07:17:01] isaranto: o/ all good with the prep steps for the hackathon??
[07:18:17] I carried stuff around but didn't get to do any technical related work yet
[07:18:39] However I'll try to deploy a model on lift wing this morning if I can make it..
[07:18:49] I'll ping you for reviews if u have time
[07:35:08] sure
[07:41:36] elukey: can we deploy this on experimental namespace? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/919293
[07:45:15] (CR) Elukey: [C: +1] LLM: model server example with bloom [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/919293 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[07:45:29] Mornin!
[07:45:31] isaranto: yes definitely, let's build the docker image and upload the model binary
[07:45:39] then we can add the helmfile config
[07:45:40] elukey: I'll do the codfw staging reboots today
[07:45:53] klausman: o/ sure ok, I'll wait to do the eqiad workers
[07:45:55] just to be sure
[07:45:58] Hopefully I won't interfere with the deploy
[07:46:11] the k8s cookbook is a little slow at the moment, it takes time when it drains the nodes
[07:46:23] maybe we can check in spicerack if anything could be improved
[07:46:38] it is the one with the openrail license but we're not going to expose it publicly anyway
[07:46:59] Lift-Wing, Machine-Learning-Team: Move Revert-risk language agnostic model from staging to production - https://phabricator.wikimedia.org/T332998 (achou) This model has been deployed to Lift Wing production. Note that the isvc/model has been renamed to `revertrisk-language-agnostic` test the internal en...
[07:49:12] elukey: oh, you already did staging :)
[07:49:54] Lift-Wing, Machine-Learning-Team: Move Revert-risk multilingual model from staging to production - https://phabricator.wikimedia.org/T333124 (achou) Test the internal endpoint and it works correctly: ` aikochou@deploy1002:~$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-multilin...
[07:50:27] well, I'll do the caching ones then
[07:55:29] isaranto: I am still a little skeptical about the license, but deploying it to staging's experimental seems ok for the moment
[07:58:05] elukey: yes it will be only for staging
[08:01:06] ok let's merge the code for the docker image (so it will build while we chat)
[08:01:37] * elukey commutes to the office
[08:02:36] thanks!
[08:02:53] (CR) Ilias Sarantopoulos: [C: +2] LLM: model server example with bloom [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/919293 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:10:36] (Merged) jenkins-bot: LLM: model server example with bloom [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/919293 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[08:16:30] Machine-Learning-Team, Research, Epic: Develop a ML-based service to predict reverts on Wikipedia(s) - https://phabricator.wikimedia.org/T314384 (achou)
[08:25:05] elukey: cassandra/cache in codfw done, now proceeding with etcds there
[08:27:08] ack
[08:31:26] I updated the image version so this is ready https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/919345
[08:31:31] hopefully :)
[08:33:04] isaranto: free to deploy to staging :)
[08:37:04] thanks a lot!
[08:49:45] etcd's done
[08:50:38] now doing ctrl nodes in codfw
[08:52:26] it works!
[08:52:34] ```
[08:52:34] curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-560m:predict -X POST -i -H "Host: bloom-560m.experimental.wikimedia.org" -d '{"prompt": "Once upon a time ", "result_length": 50}'
[08:52:34] ```
[08:52:49] elukey: btw, any idea why ml-serve2001.codfw.wmnet is in SchedulingDisabled?
[08:52:53] however in all models on staging I get this from time to time: OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to inference-staging.svc.codfw.wmnet:30443
[08:54:21] klausman: I may have tested the k8s cookbook on codfw IIRC yesterday, probably forgot to uncordon
[08:55:12] ack
[08:55:39] isaranto: mmm maybe it is due to the double gateway situation, does it happen in prod as well?
[08:55:46] (brb)
[08:56:05] I'll check
[08:59:32] ctrl nodes done
[08:59:52] While the day is young, I'll also do the workers in codfw unless there are any objections
[09:06:40] klausman: no objections, there is a BGP alert though
[09:07:46] Where can I see it?
[09:08:04] #wikimedia-operations
[09:08:11] it should also be in alerts.
[09:08:19] alerts.w.o sorry
[09:08:40] or if they haven't ported it yet, in icinga
[09:11:05] I can't find it in either Icinga or a.w.o
[09:11:37] maybe it recovered and it didn't show in #operations
[09:12:34] I'll wait another few minutes before I proceed with the workers (also, teapot empty)
[09:13:14] I am trying to check on the devices but I have an issue with my ssh key sigh
[09:16:06] ok finally in, checking
[09:17:01] all BGP conns established, good
[09:18:58] I suspect my reboot of the second ctrl node may have clipped the end of the checking window or sth
[09:19:16] Proceeding with workers now
[09:22:25] isaranto: I can repro, the good/weird thing is that it doesn't happen in prod
[09:25:01] yes I didn't see it on prod - I did some calls on codfw (enwiki-goodfaith) and it never came up while in staging it is easy to reproduce.
It happens to 2 out of 3 calls or something
[09:31:23] elukey: I have a bad gut feeling about ml-serve2001
[09:31:47] elukey: it was sent for reboot but nothing is happening on the console. It's been minutes already
[09:32:41] klausman: nono it is the slowness that I told you about, it takes ages
[09:32:48] at some point it will start
[09:32:51] Weird.
[09:32:58] Is this for all of these or just 2001?
[09:33:25] If so, it may be related to the ECC errors we have seen
[09:33:45] I think all, I suspect that the cookbook uses retries to verify the uncordon op and the back-off at some point is very large
[09:34:11] But I should still be able to use the console of the machine, no?
[09:34:40] via iDRAC I mean.
[09:34:56] yeah
[09:35:05] either it should have a login prompt or BIOS messages. But nothing
[09:35:25] powercycling it
[09:35:29] okok then something may be borked, you can try to powercycle
[09:39:13] ok, it booted now
[09:39:52] goood
[09:42:21] Machine-Learning-Team: Host open source LLM (bloom, etc.) on Lift Wing - https://phabricator.wikimedia.org/T333861 (isarantopoulos) [[ https://huggingface.co/bigscience/bloom-560m | Bloom-560m ]] has been deployed on Lift Wing staging in the experimental namespace and can be accessed like this: ` curl https...
[09:47:05] man, missed the good check on 2001 by seconds :D proceeding
[10:04:54] so the issue in staging feels like something related to kubeproxy, or similar
[10:05:14] the openssl error happens more or less half the time, so something weird is going on
[10:05:52] it is not a problem of ml-staging2001 vs 2002 afaics, tried to isolate them and they have the same behavior
[10:07:53] the tls conn from stat1004 gets a RST when we get the SSL_ERROR_SYSCALL
[10:08:19] can't find any meaningful log on istio proxies etc..
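A minimal sketch of how the intermittent failure rate discussed above ("2 out of 3 calls", "more or less half the time") could be measured from a client host. The URL, Host header and JSON body are copied from the curl command in the log; the function names `build_predict_request`, `send` and `failure_ratio` are hypothetical, and the sketch assumes the client already trusts the CA that signed the staging certificate.

```python
import json
import urllib.request
from typing import Callable

# Values copied from the curl command pasted in the channel.
STAGING_URL = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-560m:predict"
ISVC_HOST = "bloom-560m.experimental.wikimedia.org"


def build_predict_request(prompt: str, result_length: int) -> urllib.request.Request:
    """Build the same POST as the curl call: JSON body plus an explicit Host header."""
    body = json.dumps({"prompt": prompt, "result_length": result_length}).encode("utf-8")
    return urllib.request.Request(
        STAGING_URL, data=body, headers={"Host": ISVC_HOST}, method="POST"
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    Assumption: the system trust store contains the internal CA.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def failure_ratio(call: Callable[[], object], attempts: int) -> float:
    """Run `call` `attempts` times and return the fraction that raised OSError.

    Connection resets, ssl.SSLError and urllib.error.URLError are all OSError
    subclasses. So is urllib.error.HTTPError, which means HTTP error statuses
    are counted as failures here too.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except OSError:
            failures += 1
    return failures / attempts
```

From a host that can reach staging, something like `failure_ratio(lambda: send(build_predict_request("Once upon a time ", 50)), 20)` would give a rough number to compare before and after a gateway change. Passing the call in as a function keeps the counting logic testable without network access.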
[10:09:34] Maybe a packetfilter issue (or the usual: DNS/SANs)
[10:09:56] (SAN as in Subj Altname, not Storage :D)
[10:10:29] not sure if it is networking, it happens on both nodes
[10:13:53] the weird thing is that IIUC the gateway pods should terminate the conn
[10:17:07] ok it is fixed, I found an extra istio selector in the knative gateway
[10:17:15] SSL_ERROR_SYSCALL sounds like a connection reset during the SSL phase
[10:17:19] I think that it was me trying some stuff, and I didn't clean up
[10:17:26] it was a conn reset yes
[10:17:34] how the selector is connected is weird
[10:17:53] isaranto: staging issue should be fixed, lemme know if you still see the errors
[10:18:19] klausman: basically I tried a while ago to tie the knative gateways to specific istio selectors, to avoid having them on the new ingress pods for regular services
[10:18:51] for some weird reason this caused the problem, not entirely sure why though
[10:18:57] anyway, will check later
[10:19:03] * elukey lunch!
[10:25:04] (PS1) Gerrit maintenance bot: Update moved class FauxRequest [extensions/ORES] - https://gerrit.wikimedia.org/r/921275 (https://phabricator.wikimedia.org/T321681)
[10:52:22] (CR) Jforrester: [C: +2] Update moved class FauxRequest [extensions/ORES] - https://gerrit.wikimedia.org/r/921275 (https://phabricator.wikimedia.org/T321681) (owner: Gerrit maintenance bot)
[11:22:47] elukey: everything is ok, but I won't be using it for the demo since the results that model produces don't make any sense
[11:23:03] thanks for the help and quick execution!
[11:29:54] elukey: all workers in codfw done. lunch now :)
[11:48:03] Machine-Learning-Team, Research, Epic: Develop a ML-based service to predict reverts on Wikipedia(s) - https://phabricator.wikimedia.org/T314384 (Samwalton9) >>! In T314384#8863625, @achou wrote: > Both models (Language-Agnostic and Multilingual) have been deployed to Lift Wing production. (T332998,...
[11:49:05] (Merged) jenkins-bot: Update moved class FauxRequest [extensions/ORES] - https://gerrit.wikimedia.org/r/921275 (https://phabricator.wikimedia.org/T321681) (owner: Gerrit maintenance bot)
[12:20:34] klausman: ack, proceeding with ml-serve eqiad workers
[12:29:18] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira) Below are the results of training, evaluating, and publishing 18 rounds of add-a-link models. | **Round** | **Wikis** | **Models** |...
[12:32:08] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira) Below are the results of training, evaluating, and publishing 18 rounds of add-a-link models. | **Round** | **Wikis** | **Models** |...
[12:33:00] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira) Below are the results of training, evaluating, and publishing 18 rounds of add-a-link models. | **Round** | **Wikis** | **Models** |...
[12:37:07] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira) Next steps: 1. T309263 - Liaise with MGerlach from the Research team to improve the link-recommendation algorithm in order to: - sup...
[13:49:39] Morning all
[14:10:51] o/
[14:24:50] \o
[14:25:31] (PS1) Ilias Sarantopoulos: feaT: change bloom model token output sampling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921363 (https://phabricator.wikimedia.org/T333861)
[14:26:08] (PS2) Ilias Sarantopoulos: feat: change bloom model token output sampling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921363 (https://phabricator.wikimedia.org/T333861)
[14:26:32] (CR) Ilias Sarantopoulos: [C: +2] feat: change bloom model token output sampling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921363 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[14:28:56] (Merged) jenkins-bot: feat: change bloom model token output sampling [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921363 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[14:29:10] o/
[14:29:28] I'm merging the above patch to change the sampling technique used for inference for the bloom model
[14:29:44] ack
[14:37:01] sre folks: can either of you merge this patch?
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/921366
[14:37:24] Looking
[14:37:25] it is just a bump to use the image from the change I submitted above
[14:37:35] thank u <3
[14:37:54] +2'
[14:37:57] d
[14:45:07] (PS1) Ilias Sarantopoulos: fix: call class attribute [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921368 (https://phabricator.wikimedia.org/T333861)
[14:45:30] (CR) Ilias Sarantopoulos: [C: +2] fix: call class attribute [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921368 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[14:45:45] this is what u get when you don't wait for review --^
[14:46:14] sending a fix so another patch (last one for the day) coming your way. super thanks again!
[14:46:23] :D
[14:46:49] (CR) Klausman: [C: +1] fix: call class attribute [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921368 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[14:49:42] (Merged) jenkins-bot: fix: call class attribute [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/921368 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[14:55:58] here it is! https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/921371
[14:56:00] :D
[14:59:30] all nodes rebooted!
[15:15:06] have a great weekend folks!
[15:25:51] isaranto: nice work on the LLM!
[15:26:02] have a great weekend to all as well, logging off!
[15:29:31] \o