[07:03:54] (CR) Elukey: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[09:56:51] (CR) Kevin Bazira: [C: +2] "LGTM1" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[10:04:10] (Merged) jenkins-bot: editquality: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[10:11:40] kevinbazira: o/ if you want to bump the editquality images for --^ we can deploy
[10:12:04] elukey: o/
[10:12:15] ok let me deploy them now.
[10:18:32] elukey: i've pushed a patch https://gerrit.wikimedia.org/r/763480
[10:18:47] Lift-Wing: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (elukey) There was a nice talk about Feast during the last apply() meetup, recorded in https://www.youtube.com/watch?v=GXqK6HlYG6M&ab_channel=Tecton
[10:25:50] kevinbazira: good! gimme a min to run puppet on deploy1002 and you are good to go
[10:26:41] great. i'll be on standby!
[10:27:12] kevinbazira: green light
[10:28:25] thanks. starting deployment now.
[10:35:53] eqiad and codfw deployments have been completed successfully
[10:36:47] \o/
[10:37:17] new pods are up and running
[10:37:18] NAME READY STATUS RESTARTS AGE
[10:37:19] cswiki-damaging-predictor-default-9n79t-deployment-ddd68dck4ggs 2/2 Running 0 3m46s
[10:37:19] cswiki-goodfaith-predictor-default-llmvl-deployment-d784fc4z77t 2/2 Running 0 3m44s
[10:37:19] dewiki-damaging-predictor-default-wnj7j-deployment-5b56c48hsljw 2/2 Running 0 3m42s
[10:37:19] dewiki-goodfaith-predictor-default-rwb6k-deployment-55ccc49b2dl 2/2 Running 0 3m41s
[10:37:45] 400: Unrecognized request format: Expecting value: line 1 column 13 (char 12)400: Unrecognized request format: Expecting value: line 1 column 13 (char 12)
[10:37:49] nice!
[10:38:03] (I tried a rev_id: a133)
[10:39:34] Lift-Wing, Machine-Learning-Team (Active Tasks): Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (elukey) The approach looks very good, I tried to get a score with `{"rev_id": a12345}` and got: ` 400: Unrecognized request forma...
[10:53:06] <kevinbazira> yep, that error was handled well. :)
[11:31:53] <klausman> Nice work
[11:37:34] * elukey lunch
[14:43:06] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul) @elukey can you please get me the Partitioning/Raid information? Thanks
[14:45:32] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)
[14:45:37] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (elukey) @Papaul Hi! IIRC these nodes have two 2TB disks, so I'd go for the standard raid1 recipe: `echo partman/standard.cfg partman/raid1-2dev` Lemme...
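
The 400 response logged at 10:37 is consistent with a JSON decoding failure being turned into a bad-request error instead of a generic 500. A minimal Python sketch of that pattern follows. The `BadRequest` class and the `parse_request` and `describe` helpers are hypothetical names for illustration and are not taken from the inference-services code; the request body with a space after the opening brace is also an assumption, chosen so that the parser position lines up with the logged message.

```python
import json


class BadRequest(Exception):
    """Hypothetical exception carrying an HTTP status code and a reason."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.status_code = 400
        self.reason = reason


def parse_request(body: str) -> int:
    """Return the rev_id from a JSON request body, or raise BadRequest."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as err:
        # Malformed JSON is the client's fault, so report it as a 400.
        raise BadRequest(f"Unrecognized request format: {err}") from err
    rev_id = payload.get("rev_id") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(rev_id, int) or isinstance(rev_id, bool):
        raise BadRequest('Expected "rev_id" to be an integer')
    return rev_id


def describe(body: str) -> str:
    """Format the outcome roughly the way an HTTP client would see it."""
    try:
        return f"200: rev_id={parse_request(body)}"
    except BadRequest as err:
        return f"{err.status_code}: {err.reason}"


print(describe('{ "rev_id": a133 }'))   # unquoted value, not valid JSON
print(describe('{"rev_id": 12345}'))    # well-formed request
```

With CPython's standard `json` module, the first call should report the same "Expecting value: line 1 column 13 (char 12)" text seen in the channel, since the unquoted `a` sits at character offset 12 of that body.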
[14:52:18] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (elukey) Went ahead and merged the change, I've also ran puppet across install nodes, so you can install the os whenever you want :)
[14:52:32] <elukey> ml-cache codfw nodes --^
[14:52:38] <elukey> (online feature store)
[14:52:48] <elukey> the eqiad ones will come later on (hopefully)
[15:07:34] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul) @elukey thanks
[15:30:45] <elukey> I am doing a roll restart of all pods in eqiad to pick up new settings (k8s priorities, not related to isvc)
[15:30:48] <elukey> codfw already done
[15:56:51] <accraze> o/
[15:57:09] <accraze> glad to see the error handling patch works :)
[15:59:41] <elukey> o/ yep it worked nicely
[16:03:29] <accraze> whoa and more models were deployed?
[16:03:38] <accraze> things are coming along nice!
[16:05:20] <elukey> yep!
[16:09:44] <accraze> elukey: ahhh i think i found an editquality model that will break our helmfile abstraction :(
[16:09:53] <accraze> enwiktionary-reverted
[16:10:27] <accraze> we can skip for now and come back to it
[16:11:15] <accraze> for that we'll need to point it to https://en.wiktionary.org
[16:11:33] <elukey> accraze: we can deploy it via inference_services
[16:11:41] <elukey> with the full config
[16:11:47] <elukey> so breaking :)
[16:11:50] <elukey> *no
[16:11:55] <accraze> ohhhhh nice!
[16:33:32] <accraze> elukey: i made a cr with wiktionary and wikibooks models under `inference_services`
[16:33:35] <accraze> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/763556
[16:34:39] <elukey> accraze: lemme check one thing, I am wondering if the revscoring_inference_services could be extended
[16:34:51] <accraze> oh that would be cool
[16:34:53] <elukey> it works now what you created but I realized that it may be confusing
[16:35:05] <accraze> yeah that was my thought too...
[16:35:37] <accraze> i mean no biggie if we can't easily extend, but for clarity it would be helpful
[16:39:37] <elukey> mmm I may be able to add this to inference_service as well
[16:39:47] <elukey> I have to test some yaml horror
[16:40:01] <accraze> of course ;)
[17:00:44] <chrisalbon> One more dayyyyy
[17:02:23] <elukey> o/
[17:08:53] <chrisalbon> hey elukey
[17:28:16] <elukey> accraze: qq - do we need to override INFERENCE_NAME and WIKI_HOST? Or only WIKI_HOST?
[17:30:21] <accraze> elukey: hmmm good question, i think both?
[17:30:45] <elukey> it seems only WIKI_HOST to me
[17:30:51] <elukey> I'll make sure to be able to override both
[17:30:58] <elukey> but if we set
[17:31:01] <elukey> wiki: eswikitionary
[17:31:06] <elukey> model: goodfaith
[17:31:20] <elukey> we get INFERENCE_NAME = eswikitionary-goodfaith
[17:31:24] <elukey> that looks legit rith?
[17:31:26] <elukey> right
[17:31:36] <accraze> ohhhh i see what you are saying
[17:31:44] <accraze> actually yeah that should work
[17:54:03] <elukey> accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/763580 and next
[17:54:10] <elukey> lemme know what you think :)
[17:55:34] <elukey> ah I missed en.wiktionary.org
[17:57:35] <elukey> ok fixed
[17:58:13] <elukey> chrisalbon: if you have time - any opinion about the structure of https://wikitech.wikimedia.org/wiki/Machine_Learning ?
[17:58:33] <elukey> me and Aiko are adding info to https://wikitech.wikimedia.org/wiki/Machine_Learning/Onboarding
[17:58:41] <elukey> and I'd like to create a page for deployments
[17:58:54] <elukey> maybe Machine_Learning/LiftWing/Deploy ?
[17:59:09] <elukey> (with LiftWing being a page with a description of what it is)
[17:59:16] <chrisalbon> that makes sense
[17:59:35] <chrisalbon> What I don't want is a lot of depth in the tree heirarchy
[18:00:06] <chrisalbon> I want the homepage to have links to every major page so it is on the top of our minds to keep things updated
[18:00:31] <chrisalbon> i.e. I dont want https://wikitech.wikimedia.org/wiki/Machine_Learning/Onboarding/v2/mle/oct2-2023/v4/draft
[18:01:43] <chrisalbon> I want someone to go to https://wikitech.wikimedia.org/wiki/Machine_Learning/ and see it as a directory containing links to every page they might want to see, from onboarding to documentation
[18:02:55] <chrisalbon> TL;DR I want a shallow structure, not a deep nested structure. Because I think things get lost in a deep structure. A lot of the ORES documentation feels like that. There are some gems of info but you have to know where to look, which is bad.
[18:03:22] <elukey> chrisalbon: ack makes sense. What would you prefer for https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy ?
[18:03:30] <elukey> I'd like to move it to a more canonical location
[18:04:04] <chrisalbon> Machine_Learning/LiftWing/Deploy is good
[18:04:14] <elukey> okok!
[18:05:59] <accraze> +1
[18:13:27] <elukey> accraze: the code reviews are ready
[18:13:32] <elukey> lemme know if you like it
[18:16:47] <accraze> elukey: so we just use `domain` to use a non-wikipedia url? i think that's great
[18:17:23] <elukey> accraze: it should be "host", but yes that's the idea
[18:17:41] <accraze> awesome +1'd
[18:17:57] <elukey> ah snap I put "domain" in the fixture
[18:17:59] <elukey> bad Luca
[18:18:01] <elukey> fixing :D
[18:18:55] <elukey> accraze: going to wait for the diff and then merge, so you can deploy
[18:21:36] <elukey> created https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy
[18:26:14] <elukey> accraze: new models ready to be deployed :)
[18:26:50] <accraze> cool!!
[18:27:10] <accraze> doin it now
[18:27:26] <elukey> lemme know how it goes
[18:28:03] <elukey> chrisalbon: I watched a bit the apply() meetup, it seems that a lot of people are using Feast (including Twitter)
[18:36:07] <accraze> elukey: pods are up and running on eqiad and codfw, going to try inference now
[18:36:15] <elukey> ack
[18:39:36] <accraze> hmmm getting 500 on enwiktionarywiki-reverted
[18:41:02] <accraze> same with eswikibooks models
[18:41:51] <elukey> so the stack trace says, for query_revisions_by_revids
[18:41:58] <elukey> ValueError: Could not decode as JSON:
[18:42:11] <elukey> I think when doing return {rd['revid']: rd for rd in rev_docs}
[18:42:32] <elukey> revscoring extractor
[18:42:38] <elukey> maybe the revid is not correct?
[18:43:27] <accraze> oh hold up i was using the wrong input.json
[18:43:45] <elukey> good fuzzy testing
[18:43:59] <elukey> maybe another use case for a 400?
[18:49:25] <accraze> hmmm yeah not sure whats going on... im using a revid that i know exists on enwiktionary and still getting the same issue
[18:49:34] <accraze> lemme dig in to the model a bit more
[18:52:32] <elukey> ack, I have to go to dinner, will check later!
[18:52:45] <accraze> no worries, thanks for all the help elukey!
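
The naming scheme discussed at 17:31 (`wiki` plus `model` giving INFERENCE_NAME) and the `host` override confirmed at 18:17 could be sketched in Python roughly as below. The real logic lives in the deployment-charts templates and is not shown in this log, so the function name `predictor_env`, the rule for deriving a default host, and the `https://` prefix on WIKI_HOST are all assumptions made for illustration.

```python
from typing import Dict, Optional


def predictor_env(wiki: str, model: str, host: Optional[str] = None) -> Dict[str, str]:
    """Build the two environment variables mentioned in the discussion.

    Assumption: when no host is given, the wiki name has the form
    "<lang>wiki" and maps to "https://<lang>.wikipedia.org". Any other
    project (wiktionary, wikibooks, ...) needs an explicit host.
    """
    if host is None:
        suffix = "wiki"
        if not wiki.endswith(suffix) or len(wiki) <= len(suffix):
            raise ValueError(f"cannot derive a host for {wiki!r}; pass host explicitly")
        lang = wiki[: -len(suffix)]
        host = f"https://{lang}.wikipedia.org"
    return {
        # The name is always derived, so only the host needs overriding.
        "INFERENCE_NAME": f"{wiki}-{model}",
        "WIKI_HOST": host,
    }


# Wikipedia model: host derived from the wiki name (assumed rule).
print(predictor_env("cswiki", "damaging"))

# Non-Wikipedia model: host supplied explicitly.
print(predictor_env("enwiktionary", "reverted", host="https://en.wiktionary.org"))
```

Deriving INFERENCE_NAME from the two keys in every case is what makes a single `host` override sufficient for the wiktionary and wikibooks models.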
[19:00:28] <accraze> ok so i just verified these models work in the ores api
[19:01:14] <accraze> it must be us not able to connect to the other hosts (wikibooks, wiktionary etc) from our pod
[19:02:59] <accraze> everything else looks correct
[20:21:49] <accraze> ok confirming i can run the wiktionary models on ml-sandbox using WIKI_URL env var, so it has to be that we cannot connect w/ host
[20:51:21] <chrisalbon> wikitionary models?
[21:09:38] <chrisalbon> The more I talk to people, the more Feast feels like "the" feature store at the moment
[21:16:29] <accraze> chrisalbon: yeah we have some editquality models for https://en.wiktionary.org and https://es.wikibooks.org
[21:16:46] <accraze> they are an edge-case tho
[21:16:54] <chrisalbon> but only for englihs
[21:16:58] <chrisalbon> english
[21:17:11] <accraze> yeah and spanish for wikibooks
[21:19:32] <wikibugs> Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)