[07:41:33] morning!
[07:41:48] aiko: very weird.. If I contact ml-staging2001 it works nicely
[07:41:58] curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict" -X POST -d @input.json -i -H "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --http1.1 --resolve inference-staging.svc.codfw.wmnet:30443:10.192.0.201
[07:42:09] so it may be something related to the LVS/VIP
[07:43:57] ipvsadm on lvs2009 (the lb primary node) looks good
[07:43:58] TCP 10.2.1.58:30443 wrr -> 10.192.0.201:30443 Route 1 0 0 -> 10.192.48.174:30443 Route 1 0 0
[07:50:25] it may be https://gerrit.wikimedia.org/r/c/operations/puppet/+/920630/, lemme see if it is Luca's pebcak
[07:51:21] I see the loopback address on ml-staging2001 for inference-staging but probably it is not configured correctly now
[07:52:52] I don't see how 920630 would break anything, there's no semantic change there
[07:52:57] also, morning :)
[07:54:18] klausman: there is, check the diff
[07:54:20] now it works
[07:54:30] aiko: sorry for the trouble! Now it should work
[07:55:45] I completely fail to see a difference
[07:56:27] Weird. That puppet-side change has _order_?
[07:57:02] Oh, now I get it
[07:57:18] The section being there twice erases the first instance?
[07:57:20] # LVS service IPs to be bound to the loopback interface,
[07:57:20] # separate using spaces
[07:57:20] -LVS_SERVICE_IPS="10.2.1.83"
[07:57:21] +LVS_SERVICE_IPS="10.2.1.58 10.2.1.83"
[07:57:33] I was trying to find the problem from just the gerrit diff.
[07:57:50] nono the diff shows, it came to mind while I was debugging the no route to host
[07:58:25] (also, I misunderstood your line "it may be ..."
indicating that that change _caused_ the breakage, instead of fixing it)
[07:58:26] the host was rightfully rejecting packets targeted to an LVS IP not bound to any loopback address
[07:59:17] So I was thinking: how would merging two sections break anything, if two separate sections work fine
[07:59:27] But the latter wasn't the case :)
[08:01:08] I also fixed the prod code review
[08:05:19] Great to catch something like this in staging, right?
[08:10:59] definitely :)
[08:16:10] isaranto: really nice summary in https://phabricator.wikimedia.org/T328494#8856293
[08:16:40] I am a little scared about how many things we'll need to add to Lift Wing to make embeddings work in the future
[08:18:18] * elukey commuting to the office
[08:18:49] thanks! however imho embeddings can be totally independent work from lift wing.
[08:19:35] How would they be "visible" from the code on LiftWing? New images created regularly? Or online, like a feature store?
[08:22:08] the latter for sure
[08:22:53] I'd imagine that a model server should only know what to call to translate some text to an embedding
[08:23:37] or maybe some code in preprocess could do the work but not sure if it is something doable
[08:25:05] yes something like a feature store (or a vector database). however what I am arguing is whether the primary use case for embeddings would be models or wikipedia search
[08:25:30] true true
[08:25:35] In any case the big chunk of work is creating a consistent way of creating/updating embeddings as new articles and revisions come in
[08:25:49] aka a training pipeline
[08:27:55] isaranto: with the assumption that a model will not get random chunks of text to process.. right?
[08:28:00] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (isarantopoulos) At the moment there is some inconsistency in the error messages returned from ores vs ores-legacy: Example: call for a non existing revision id...
[08:28:08] otherwise an embedding would need to be created on the fly
[08:31:47] (trying to understand embedding, really ignorant)
[08:31:56] sure no worries :)
[08:35:24] you're always going to have some random chunk of text to process which means that you embed that in the embeddings space (see where it stands spatially). In the case of article embeddings you get a query -> transform it to a vector using the embedding model and then find the article embeddings that are closest to it
[08:36:22] the issue is that the model that creates the embeddings needs to be updated. otherwise you would be doing search with stale information
[08:55:43] once again I am lost in a sea of open browser tabs 😄
[09:02:36] So the model for creating embeddings would not be queried by users directly, right? Instead it would be part of a pipeline that creates embeddings for _another_ model?
[09:03:00] (I recently started looking at embeddings for StableDiffusion and got entirely lost within 10m)
[09:05:01] yep seems the right understanding...
[09:06:09] in the ideal world we'd have a kubeflow pipeline that periodically publishes a new model to generate embeddings from text, and use it as micro service from other models' preprocess step
[09:08:55] no, a service (e.g. search) would use the embeddings model to get a representation of its query and then compare it with the rest of the embeddings that are static and stored somewhere (elastic, database, feature store etc)
[09:10:05] isaranto: mmm why wouldn't an LLM model be able to use embeddings as well?
[09:10:41] in its preprocess() step I mean
[09:10:44] yes yes
[09:12:15] ok so both use cases are possible
[09:12:42] most LLMs/GPT models can handle raw input so it is not necessary but ofc there are use cases
[09:12:44] very nice, now I get why Fabian mentioned the fact that we should provide "wikipedia embeddings"
[09:13:03] namely we should provide a model that translates text -> vectors
[09:13:40] spot on!
[09:14:49] thanks for the clarification, sooo difficult to get a grasp of these things
[09:14:53] then there is also the "thing" you are actually embedding where you can have different models. a model that creates word embeddings, one for sentence embeddings, articles, even articles + image
[09:15:10] * elukey nods yes
[09:15:14] it would be better to discuss it "in person"
[09:15:32] anything that can be mapped to a vector space, so that nearest neighbor etc.. would work
[09:15:44] or better, would make sense and produce meaningful results
[09:15:50] yeah, u explain it better than me!
[09:23:15] thanks both of you.
[09:26:14] let's keep doing these braindumps/etc.. in here, it helps a lot to clarify things!
[09:52:22] morning! nice conversation above :) learned something :D
[09:53:20] elukey: thanks! no more routing errors. :) I got some other errors related to mwapi, I'll figure it out
[09:53:50] aiko: you can post them in here if you want, we can look at those together
[09:57:29] klausman: big miss from my side in https://gerrit.wikimedia.org/r/c/operations/puppet/+/920649 sigh :(
[09:57:42] sooo easy to make a mistake in service::catalog
[09:57:56] the good thing is that it is easy to spot the mistake via ipvsadm
[09:57:59] o/ aiko
[09:59:43] elukey: I think I know what the problem is and am working to fix it. will post here if I need help <3
[10:00:15] ack!
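The query -> vector -> closest article flow from the embeddings discussion above can be sketched roughly as follows. This is a toy illustration only: `toy_embed`, `VOCAB`, `ARTICLES` and `INDEX` are hypothetical names, and the word-count "embedding" is a stand-in for a real embedding model, not anything deployed on Lift Wing.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real embedding model learns its own representation.
VOCAB = ["cat", "dog", "rocket", "orbit", "pet"]


def toy_embed(text):
    """Stand-in for an embedding model: map text to a vector of word counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, index):
    """Return the title whose stored embedding is closest to the query vector."""
    return max(index, key=lambda title: cosine(query_vec, index[title]))


# Made-up article texts, embedded ahead of time (the "static" embeddings
# that would live in a database or feature store).
ARTICLES = {
    "Cat": "cat pet cat",
    "Dog": "dog pet dog",
    "Spaceflight": "rocket orbit rocket",
}
INDEX = {title: toy_embed(text) for title, text in ARTICLES.items()}

# At query time only the query is embedded, then compared against the index.
best = nearest(toy_embed("rocket to orbit"), INDEX)
print(best)
```

If the embedding model is retrained, `INDEX` has to be rebuilt with it, which is the stale-information concern raised in the discussion.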
[10:00:59] isaranto: o/
[10:02:00] klausman: also very lovely, in eqiad we have only 4 hosts pooled https://config-master.wikimedia.org/pybal/eqiad/inference
[10:12:47] (PS23) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[10:23:05] (PS24) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[10:24:46] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[10:32:14] I've been changing the wrong branch 🤦
[10:33:29] and I was ready to ask, why are messages sent since I've marked it as WIP
[10:35:10] (PS25) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[10:36:34] elukey: nice! "A stony road, but the view is worth it"
[10:37:17] Why were four hosts depooled? Is there any history as to when they were depooled?
[10:37:26] the prod VIP is set up! I need to add the DNS config and then we should be ok
[10:37:39] klausman: I think that we forgot to add the new nodes to the kubesvc service via conftool
[10:38:01] ooh.
[10:39:27] You'd think 50% depooling would trigger some sort of alert
[10:39:46] yeah but those were not even pooled, weight 0 and pooled false
[10:39:51] so pybal didn't bother
[10:39:58] anyway, all good now!
[10:40:04] going afk for lunch, ttl!
[10:41:09] same
[10:45:19] Machine-Learning-Team, Patch-For-Review: Host open source LLM (bloom, etc.) on Lift Wing - https://phabricator.wikimedia.org/T333861 (MoritzMuehlenhoff) Usual IANAL disclaimer ahead: If this were a software license this would not meet the standard required by OSI. They e.g.
cover this in the FAQ at https...
[10:51:01] the thresholds patch is ok now! oh my :Amir1 will hate me for these reviews <3
[10:52:49] haha, I will do my best
[12:03:20] Going afk for a medical exam
[12:43:15] elukey: okay to change the stream name? https://phabricator.wikimedia.org/T333468#8857312
[12:43:19] i'd like to do this today
[13:05:26] elukey: any more comments on 920208 or good to submit?
[13:09:46] ottomata: yesssss
[13:11:58] klausman: free to go
[13:15:46] alright, will merge in a hot minute
[13:15:50] (and deploy)
[13:19:17] isaranto: Moritz answered in https://phabricator.wikimedia.org/T333861#8858542 about BLOOM, very interesting
[13:23:29] klausman: just seen https://phabricator.wikimedia.org/T335835 :(
[13:23:32] reboots to do
[13:24:12] Oh, that's quite a pile
[13:25:01] deploy done, no errors
[13:25:03] I'm back!
[13:25:15] Yes I saw the answer
[13:25:55] ottomata: ready with the patch - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920696
[13:28:19] aiko: the rr config/namespaces should now be complete in prod
[13:31:18] \o/
[13:31:19] klausman: o/ thank you!! I will deploy rr models later :)
[13:31:24] this is a great result folks!
[13:31:55] if all looks good we can add the api-gateway config and tell research about it
[13:32:55] elukey: we should divide up the reboots between the two of us. I'm out tomorrow, and I dunno how you feel about reboots on Fridays.
[13:33:42] it is good, we can do them, nothing is really in production and we can take a slower pace
[13:34:08] Ben is using the k8s cookbook for rebooting, maybe it is not that painful
[13:34:31] It'd still take a while and you'd have to keep a bit of an eye on it.
[13:42:33] sure but we have to do it :)
[13:43:05] How about I do codfw and you do eqiad and we sync up when we start/stop?
[13:46:03] ack +1
[13:47:11] elukey: awesome okay
[13:53:39] ottomata: v1 really?
:D
[13:59:27] elukey: https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration
[13:59:32] oops
[13:59:36] https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning
[14:00:08] https://phabricator.wikimedia.org/T332212
[14:00:23] ottomata: ah ok so basically we tie stream names to their schema versions
[14:00:38] opt in yes, and major version only
[14:00:47] in the same way REST APIs often do
[14:00:56] sure sure
[14:01:27] gonna help us a lot with https://phabricator.wikimedia.org/T331399, because then we don't have to bikeshed a brand new name for 'page_links_change'
[14:03:26] going to read the tasks :)
[15:00:03] (PS26) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:02:26] (PS27) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:04:13] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:12:52] (CR) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:13:43] Lift-Wing, Machine-Learning-Team: Move Revert-risk multilingual model from staging to production - https://phabricator.wikimedia.org/T333124 (klausman) The changes from 920208 have been deployed.
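The "tie stream names to schema major versions" idea from the stream versioning exchange above could look something like this. It is a sketch under assumptions: the `.v<major>` suffix format and the function names are hypothetical and not taken from the Event Platform docs linked in the log.

```python
def major_version(schema_version):
    """Extract the major component from a semver-style string like '1.2.0'."""
    return int(schema_version.split(".")[0])


def versioned_stream_name(base_name, schema_version):
    """Append the schema major version to a stream name (assumed '.v<major>' suffix)."""
    return f"{base_name}.v{major_version(schema_version)}"


def needs_new_stream(old_schema_version, new_schema_version):
    """Only a major version bump changes the stream name."""
    return major_version(old_schema_version) != major_version(new_schema_version)


# 'example.stream' is a made-up base name for illustration.
print(versioned_stream_name("example.stream", "1.2.0"))
print(needs_new_stream("1.2.0", "1.3.0"))
print(needs_new_stream("1.3.0", "2.0.0"))
```

Minor and patch releases keep the same stream name, so consumers only have to move when a breaking schema change ships, much like a versioned REST API path.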
[15:18:52] klausman: tried to deploy rr-ml to eqiad, got Init:CrashLoopBackOff, seems like it failed in storage-initializer: " File "/usr/local/lib/python3.9/dist-packages/botocore/auth.py", line 418, in add_auth
[15:18:52] raise NoCredentialsError()
[15:18:52] botocore.exceptions.NoCredentialsError: Unable to locate credentials"
[15:19:17] taking a look
[15:22:48] (PS28) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:24:09] Yep, forgot the Swift credentials. Fixed on the PM, and running puppet-agent on the deploy machine. Not sure if that is enough
[15:24:23] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:25:10] aiko: try again, it might work now
[15:25:44] (PS29) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:26:41] klausman: I don't have permission to delete pods, or should I just wait for the pods to restart?
[15:27:19] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:27:23] I can delete them for you
[15:27:38] thanks!
[15:29:44] hrm. they're crashing now
[15:29:53] I gotta run to a meeting.
[15:29:59] elukey: can you take a look ^^^
[15:30:20] I flubbed the Swift credentials and now the pods are crashing. I _think_ I fixed the credentials problem
[15:31:07] sure
[15:31:12] thanks!
[15:31:14] elukey aren't you giving a presentation like right now
[15:31:32] oh, if you are, then nvm, I'll take a look afterwards :)
[15:31:53] good luck with the presentation!
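For the NoCredentialsError above, a small preflight check along these lines can help tell "credentials never reached the container" apart from "credentials are wrong". It is a minimal sketch, assuming the storage-initializer receives its S3/Swift credentials through the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables, which botocore does read. botocore also consults other sources (such as shared credential files), so an empty result here is a hint rather than proof. `missing_credentials` is a hypothetical helper.

```python
import os

# Environment variables that botocore checks for static credentials.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("missing credential variables: " + ", ".join(missing))
    else:
        print("credential variables are set")
```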
[15:32:01] chrisalbon: just finished it :)
[15:32:26] also hope things are better soon in Italy regarding the floods 🤞
[15:33:35] the pods still get the same credentials error
[15:36:16] isaranto: hope so :(
[15:36:22] isaranto: `curl "https://ores-legacy.discovery.wmnet:31443/v3/scores/enwiki/123433/damaging" -i` works :)
[15:37:03] that is nice
[15:37:21] I'm trying to deploy ORES with patchdemo. still trying though :)
[15:37:31] :)
[15:37:40] aiko: are you testing prod or staging?
[15:37:47] because I see no isvcs in staging
[15:38:01] anyway I'll fix prod
[15:38:05] ml-serve-eqiad
[15:38:57] for the revertrisk-multilingual and revertrisk
[15:41:06] aiko: all pods running, can you recheck?
[15:41:28] (PS30) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:41:54] elukey: ok!
[15:43:18] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:43:59] (PS31) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:44:51] (PS32) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[15:45:49] __^ don't judge the above rejections from CI: it is because I am hardcoding wikiID in order to test enwiki in patchdemo
[15:46:18] elukey: yep, both models work \o/
[15:46:26] great work :)
[15:46:32] I'll sync also codfw
[15:46:36] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[15:47:11] we may need to decide what to do in staging - I'd say
that we want those isvcs there to test new iterations, aiko what do you think?
[15:48:37] but now we need to query the revertrisk model like "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict" with Host: revertrisk.revertrisk.wikimedia.org... thinking maybe it's confusing
[15:49:33] yeah a little bit
[15:50:28] I am wondering why it is not `revertrisk-language-agnostic.revertrisk.wikimedia.org`
[15:50:38] model name?
[15:50:55] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/ml-services/revertrisk/values.yaml#43
[15:51:40] we only changed the MODEL_NAME, but not the inference_services name
[15:51:52] ahhh yes maybe we want to do that
[15:52:01] a little bit clearer what do you think?
[15:52:27] ok, I'm back
[15:52:56] yeah agreed!
[15:53:02] I considered the naming for the isvc, but went with the shorter name on a coin toss :)
[15:53:37] switching to rr-la as a name is fine by me. I should've asked before making the change
[15:53:57] elukey: what was I missing in repairing the services earlier?
[15:55:00] elukey: I'll send a patch for that :)
[15:55:38] klausman: the new swift creds needed a deploy :)
[15:55:42] + the delete pod etc..
[15:55:46] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (elukey)
[15:55:50] because they go into a config map
[15:55:51] Machine-Learning-Team, Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (elukey) Prod endpoints up! ` elukey@stat1004:~$ time curl "https://ores-legacy.discovery.wmnet:31443/v3/scores/enwiki/123433/damaging" -i --http1.1 HTTP/1.1 2...
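The naming confusion above comes from the model name (in the URL path) and the inference service name (in the Host header) being set independently. A rough sketch of how the two pieces combine, assuming the Host header follows an `<isvc name>.<namespace>.wikimedia.org` pattern as the two curl examples in this log suggest; `predict_request` is a hypothetical helper, not part of any Lift Wing client:

```python
def predict_request(endpoint, model_name, isvc_name, namespace):
    """Build the predict URL and Host header for an inference service.

    Assumes the Host pattern <isvc>.<namespace>.wikimedia.org, inferred
    from the examples in the log rather than from documentation.
    """
    url = f"{endpoint}/v1/models/{model_name}:predict"
    headers = {"Host": f"{isvc_name}.{namespace}.wikimedia.org"}
    return url, headers


# Current state discussed above: MODEL_NAME was renamed, the isvc was not.
url, headers = predict_request(
    "https://inference.svc.eqiad.wmnet:30443",
    model_name="revertrisk-language-agnostic",
    isvc_name="revertrisk",
    namespace="revertrisk",
)
print(url)
print(headers["Host"])
```

Renaming the isvc to match the model name would make both arguments the same string, which is the clearer option the discussion settles on.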
[15:56:00] Machine-Learning-Team, Patch-For-Review: Create k8s ingress config and VIP for ores-legacy - https://phabricator.wikimedia.org/T336726 (elukey) Open→Resolved ` elukey@stat1004:~$ time curl "https://ores-legacy.discovery.wmnet:31443/v3/scores/enwiki/123433/damaging" -i --http1.1 HTTP/1.1 200 OK d...
[15:56:23] elukey: ah, I thought that committing on pm and running p-a on the deploy host was enough. TIL :)
[15:57:20] klausman: basically running puppet creates the file configs that helmfile picks up to create config maps
[15:57:28] but it is not automatic
[15:57:38] elukey: regarding isvcs in staging, yep we can leave them there to test new iterations
[15:57:40] yeah, I somehow didn't think about the deploy->pod step
[16:00:12] (PS33) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[16:02:00] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[16:03:27] ottomata: I am going to log off in a bit, I think that we can deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920696 tomorrow without any rush, just post in here later on if we should proceed or not so I can read it tomorrow morning :)
[16:03:44] the stream is not yet used by anybody
[16:07:13] anywayyyy
[16:07:21] going afk folks! Have a nice rest of the day
[16:07:29] \o
[16:09:27] bye luca!
:)
[16:11:04] (PS34) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[16:12:50] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[16:18:41] (PS35) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170)
[16:18:56] o/
[16:20:23] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[16:42:38] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) ORES has been added to patchdemo. thanks @matmarex! I figured out an issue with any new installation. At the moment t...
[17:09:12] Machine-Learning-Team, Patch-For-Review: Host open source LLM (bloom, etc.) on Lift Wing - https://phabricator.wikimedia.org/T333861 (isarantopoulos) Thanks @MoritzMuehlenhoff for your valuable input! We have some way to go until we figure out what we are going to do with licensing regarding models devel...
[17:12:47] I think licensing in models at the moment is soo messed up. I've observed the following: someone takes a model with XYZ license (e.g. openrail) which is a bit more restrictive and requires new model versions to follow the same restrictions and instead they just distribute the model with an apache 2.0 license :)
[17:12:56] logging off folks, cu tomorrow!
[17:45:54] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis - https://phabricator.wikimedia.org/T308144 (kevinbazira)
[17:47:26] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis - https://phabricator.wikimedia.org/T308144 (kevinbazira) @kostajh, we published datasets for all models that passed the evaluation in this round.
[18:02:57] oh elukey it's ready to go! I posted in the +1 comment. go for it!