[06:55:06] ml-etcd2003 is back to it's original non-DRBD/plain disk storage
[07:09:12] Lift-Wing, artificial-intelligence, Machine-Learning-Team (Active Tasks): Create articletopic inference services - https://phabricator.wikimedia.org/T313307 (kevinbazira) a:kevinbazira
[07:37:50] moritzm: thanks!
[07:59:52] kevinbazira: o/
[08:00:04] so the name of the new namespace should be revscoring-articlequality-topic right?
[08:00:15] elukey: o/
[08:00:37] nope the name should be revscoring-articletopic
[08:00:56] ahh ok
[08:01:56] super
[08:10:16] klausman: o/ I just filed 3 code changes for the new revscoring-articletopic namespace, do you want to review them and roll them out?
[08:10:35] (IIRC we should have already done it in the past but the more people know the process the better)
[08:10:36] Sounds good. Gimme like 10m to finish breakfast :)
[08:10:43] yep yep even after lunch :)
[08:10:51] no real hurry
[08:16:37] (PS5) Elukey: Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982)
[08:20:35] (PS6) Elukey: Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982)
[08:22:20] elukey: +1'd two of them. I guess the third is 815903? I'm not set as reviewer on that one, so unsure if it is ready
[08:28:43] yep yep just added you
[08:30:09] you can go ahead and merge them
[08:30:15] roger
[08:30:31] be careful with the puppet one, some secrets are needed first
[08:30:35] (in the private repo)
[08:31:32] Did you already do the actual private stuff on the puppetmasters?
[08:32:13] nope
[08:33:36] (CR) Klausman: Update Python model servers and requirements to KServe 0.8 (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[08:33:50] Also, first review of the kserve 0.8 change
[08:35:09] Ok, merged the labs/private change, doing the pm-side changes now
[08:39:27] (PS7) Elukey: Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982)
[08:39:39] (CR) Elukey: Update Python model servers and requirements to KServe 0.8 (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[08:41:24] thanks for the review, should be fixed
[08:41:34] now I need to test the docker images one by one first, it will take a bit
[08:45:29] thank you elukey and klausman for adding the revscoring-articletopic namespace and configs.
[08:45:40] except it's broken :)
[08:45:49] woops
[08:46:26] elukey: https://phabricator.wikimedia.org/P31596
[08:46:55] Apparently, g+r on the kube files now is an error? Seeing as how all of the files in that dir are g+r, has this changed recently?
[08:48:00] broken how? :)
[08:48:05] See the paste
[08:48:20] But I am not sure I am reading that right. chmod g-r on the file did not fix the problem
[08:48:36] that's a warning, the real problem is related to the client auth
[08:49:04] I did the actual-secrets part on pm and ran puppet-merge, and ran the agent
[08:49:22] so the file /etc/kubernetes/revscoring-articletopic-deploy-ml-staging-codfw.config on deploy1002 is there
[08:49:36] but at the same time the k8s control plane needs to know about the new user
[08:50:00] ` Error: Failed to get release service-secrets in namespace revscoring-articletopic: exit status 1: WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/revscoring-articletopic-deploy-ml-staging-codfw.config` sounded to me like "I am trating this warning as an error"
[08:50:09] oh, right
[08:50:29] I am running puppet on the ml-stating-ctrl nodes, it should in theory add the new users
[08:50:34] yeah
[08:51:48] Still not working
[08:52:57] klausman: we need to sync the new namespace first in admin-ng
[08:53:15] ack
[08:55:27] synced them and diff now works. deploying to staging cluster
[08:56:30] super
[08:57:37] `No resources found in revscoring-articletopic namespace.` Huh.
[08:57:58] yeah if you look for pods nothing is there yet
[08:58:39] I should have looked closer when doing the diff
[08:59:24] syncing serve-* now as well
[08:59:48] and done
[09:00:34] (CR) Klausman: [C: +1] Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[09:00:38] ok so now all is set for kevinbazira to start adding inference services
[09:00:46] maybe let's start with staging first
[09:00:54] yarp
[09:01:06] kevinbazira: there is a new value yaml file for staging called "values-staging.yaml" that you'd need to use for staging
[09:02:28] thanks klausman and elukey. So the workflow has changed to first adding isvc to staging then later add it to prod?
[09:03:50] kevinbazira: the idea is to add one/two representative isvcs to staging only, just to see if they work, so we can test changes on a subset of prod when needed (say if in the future the docker image for articletopic changes etc..)
[09:04:19] staging is only two worker nodes so we don't have the same capacity as prod
[09:04:53] great. let me push a patch for this now.
[09:10:01] klausman: can you join #wikimedia-operations ? There is a weird error about staging-codfw
[09:10:07] did you sync in there too?
[09:10:16] maybe accidentally, let me check my bash history
[09:10:32] klausman@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[09:10:35] yep
[09:10:39] Dammit.
[09:11:04] it happens, it is staging no problem :)
[09:11:08] Janis seems to be already on it
[09:15:06] klausman / elukey: I see values-ml-staging-codfw.yaml doesn't have configurations for a predictor image. Does it pick it from values.yaml?
[09:17:04] kevinbazira: yeah so in the helmfile.yaml config the values.yaml has the priority over the values-staging.yaml, so the result is that if you don't override anything what in values.yaml will be inherited/picked-up by values-staging.yaml
[09:17:53] values.yaml doesn't have anything for articletopic yet
[09:18:03] ok, so in the case of articletopic I am going to add configs for the predictor image as they are currently missing.
[09:18:10] super
[09:18:17] thanks for the clarification elukey.
[09:18:39] <3
[09:19:49] need to run some errands, will join the chan later on in the afternoon!
[09:32:43] patch pushed. please review whenever you get a minute: https://gerrit.wikimedia.org/r/815911
[09:32:56] I'll deploy after this change has been merged.
[10:04:00] ml-etcd2001 is also back to it's original non-DRBD/plain disk storage
[11:54:47] thanks for the review klausman. going to run the deployment now.
[11:55:45] nothing in the docs specifies how to deploy on staging. I believe it is a variation of: helmfile -e ml-serve-codfw sync
[11:56:01] sec
[11:56:56] The staging cluster is ml-staging-codfw
[11:57:52] so in this case we'll have to run: helmfile -e ml-staging-codfw sync ??
[11:58:26] yes
[11:58:40] thanks klausman. let me deploy now.
[11:58:45] diff first ;)
[11:59:23] Yep, diff first. I am also going to add the new diff and deploy staging commands to the docs.
[12:08:52] ml-staging-codfw deployment has been completed successfully.
[12:08:52] checking pod now ...
[12:11:19] the new pod on staging is up and running
[12:11:19] NAME READY STATUS RESTARTS AGE
[12:11:19] arwiki-articletopic-predictor-default-4vr2s-deployment-5c8276r7 3/3 Running 0 2m52s
[12:11:24] Hooray!
[12:18:15] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Use non-blocking HTTP calls to get outlinks for Outlinks topic model - https://phabricator.wikimedia.org/T311043 (achou)
[12:41:31] I've re-arranged diff and deploy docs into a table and added docs on how to deploy to staging: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_deploy
[12:42:47] LGTM
[12:53:32] elukey: o/ do you have time for a short chat later? I wanna ask something about testing
[13:46:23] (PS3) AikoChou: WIP - outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[13:51:05] thanks for the docs kevinbazira!
[13:51:07] <- groveries, bbiab
[13:51:07] aiko: sure!
[13:51:11] groceries*
[13:51:31] (PS4) AikoChou: WIP - outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[14:00:03] elukey: is it ok for you to meet in 30min?
[14:00:55] aiko: sure
[14:19:43] where can I find a model.bin for outlink?
[14:20:42] ah it should be on swift
[14:21:14] mmm not really
[14:22:05] aiko: do you know?
[14:22:21] (I am trying to test the local transformer -> predictor setup
[14:25:08] (I'd also need to know how to call it, I see some parameters in model.py but I don't have any idea about them)
[14:26:31] maybe kevinbazira knows? (about where I can find a model.bin for outlink)
[14:27:11] no idea ... but let me check
[14:27:54] I didn't find any repo in github, maybe there is a special one
[14:28:38] https://analytics.wikimedia.org/published/datasets/one-off/isaacj/articletopic/model_alloutlinks_202012.bin
[14:29:21] in this phab task: https://phabricator.wikimedia.org/T276862
[14:31:10] elukey: I'm here https://meet.google.com/bre-pcbo-hcx
[14:31:17] ahhhh thanks!
[14:31:22] joining in a sec
[14:59:29] wow the outlink model file is huge :)
[15:13:40] possibly one of the trade-offs for this model being language agnostic.
[15:24:05] (PS5) AikoChou: WIP - outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[15:24:57] aiko, kevinbazira - I am currently testing the outlink transformer -> predictor locally on Docker :)
[15:25:00] all works
[15:25:25] I'll add some documentation
[15:28:12] (CR) CI reject: [V: -1] WIP - outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)
[15:28:41] elukey: yay! thanks Luca. That's really great :)
[15:31:08] (PS6) AikoChou: WIP - outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)
[15:35:13] (CR) AikoChou: WIP - outlink: use async HTTP calls to fetch data (5 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043) (owner: AikoChou)
[15:43:04] (CR) AikoChou: [C: +1] Update Python model servers and requirements to KServe 0.8 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/815721 (https://phabricator.wikimedia.org/T311982) (owner: Elukey)
[15:58:18] Machine-Learning-Team, MediaWiki-extensions-ORES, Technical-Debt: Migrate usage of Database::select to SelectQueryBuilder in ORES - https://phabricator.wikimedia.org/T312454 (Tgr)
[16:07:22] going afk folks!
[16:20:40] \o
[16:20:45] heading out as well