[00:58:14] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) It looks like we're still not gettin...
[05:25:41] (CR) Accraze: editquality: handle revision not found error (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[05:29:01] (PS2) Accraze: editquality: handle revision not found error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270)
[06:05:30] Lift-Wing: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (ACraze) To keep archives happy: We talked about Feast today at the ML Team technical meeting and discussed what we want to learn from a spike with [[ https://feast.dev/ | Feast ]]. 1. Figure out how we save and expos...
[06:06:02] Lift-Wing, Machine-Learning-Team: Implement an online feature store - https://phabricator.wikimedia.org/T294434 (ACraze)
[07:36:52] hello folks
[07:37:24] (CR) Elukey: [C: +1] editquality: handle revision not found error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[07:42:47] https://feast.dev/blog/a-state-of-feast/ is very interesting
[07:46:20] The future of feast part is more or less what I hoped, namely using Spark everywhere
[07:46:34] this would allow us, in theory, to use the DE Hadoop cluster as Offline feature store
[07:46:50] (modulo figuring out how to deal with Kerberos in Kubernetes)
[07:47:32] I also realized that the Feature registry is a service that can run anywhere, so it shouldn't really be on the Redis nodes
[07:47:52] we can probably have a couple of light VMs that talk with a database somewhere
[07:53:00] we could also explore the possibility of using Cassandra instead of Redis
[07:53:45] it would be much simpler to handle a multi-node set up, and we could have separate keyspaces for each use case (score cache, feast, ..)
[07:54:12] replication and routing client requests between instances is already built in (so no need for proxies/sharding)
[07:54:25] but of course it is a different beast from Redis, with different latencies
[08:07:11] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) While reading https://feast.dev/blog/a-state-of-feast/, I started to think if Cassandra could be an alternative to Redis for our use case, since it is supported by...
[08:07:28] added some thoughts in --^, I'll also talk with Joseph about it
[08:07:41] (one of our Cassandra experts :)
[08:07:58] on paper it seems a great solution/compromise
[08:08:00] we'll see
[08:42:56] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Temporarily "fixed" it disabling pup...
[08:55:50] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) I just realized that a RevisionNotFo...
[08:57:52] Machine-Learning-Team, ORES, Discovery-Search, Growth-Team (Current Sprint): Investigate what would be required to include countries in ORES and accessible via a search keyword - https://phabricator.wikimedia.org/T301671 (Tgr) Since this is about a long-term solution, should we move it out of the...
[09:15:23] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Quick example from deployment-deploy...
[09:37:17] Lift-Wing, Machine-Learning-Team: Support (or not) the ORES augmented feature output in liftwing - https://phabricator.wikimedia.org/T301766 (elukey) I'd be in favor to add this functionality, afaics from https://github.com/wikimedia/ores/commit/efe0b3111d5dc127601221934c7dd27ae371a266 it should be easy...
[11:05:57] I was trying to get some info from the istio egress gw's envoy admin port
[11:05:58] elukey@ml-serve1001:~$ sudo nsenter -t 91808 -n curl localhost:15000/clusters -s | wc -l
[11:06:01] 14508
[11:06:13] and I noticed that we are keeping a ton of old clusters that are already gone
[11:06:33] (a cluster is basically an endpoint for envoy)
[11:07:00] the egress gw is populated by istiod, and it also contains routes for the internal knative services afaics
[11:07:10] even if they don't need it
[11:07:25] but the whole config dump is now ~75k lines
[11:07:35] it smells like a bug
[11:08:51] like https://github.com/istio/istio/issues/9480
[11:09:53] but we have 1.9.5
[11:11:58] https://github.com/istio/istio/issues/36222
[11:18:27] anyway I'll check after lunch
[11:29:15] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[11:32:15] * elukey lunch!
[11:33:19] (Merged) jenkins-bot: editquality: handle revision not found error [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/764915 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[12:37:41] thanks for the merge elukey. going to deploy now.
[12:43:10] woops not yet merged ... that was the jenkins-bot 🤦‍♂️
[13:59:19] kevinbazira: o/ I reviewed the change, and I didn't find the itwiki damaging model :(
[13:59:22] can you check?
[15:39:30] elukey: did we agree on a naming scheme for the staging control plane?
[15:40:03] the machine names, I mean
[15:47:34] klausman: not yet IIRC
[15:48:39] ml-etcd-stagingXXX maybe?
[15:48:48] I can't think of anything succinct but useful
[15:49:30] yeah I think it is fine
[15:49:37] very long but fine
[15:49:46] Alright, will make a ticket for akosiaris to sign off on
[15:50:17] Just to be 1000% sure: we only need this in qiad, right?
[15:50:20] +e
[15:50:41] codfw :)
[15:50:47] Dammit :)
[15:51:25] 1G mem, 10G for /?
[15:51:45] I don't recall what we did for the ml-etcd nodes but I'd go for the same
[15:51:52] Ok, 3G.20G then
[15:52:35] it is convenient to have the same, but if you want we can consume less
[15:52:40] lemme check what service ops does
[15:53:13] yeah 3g/1core/20G
[15:53:57] Alright then
[15:54:06] and 3 machines because of election and quorum etc
[15:54:59] yep
[15:57:29] The name for the ml-serve-ctrl2001 equivalent is going to be messy
[15:57:37] ml-stage-ctrl2001?
[15:59:47] or maybe ml-staging-ctrl200[1,2] (the workers are called ml-staging200[1,2])
[16:02:02] that works, too
[16:06:34] super
[16:36:18] Lift-Wing, Machine-Learning-Team (Active Tasks): Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (elukey) While checking how to limit egress connections, I found https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/local_rate_limit_filter#config-...
[16:40:42] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (achou) Hi, I generated a list of URIs to tes...
[16:44:02] elukey: tried to use s3cmd but it looks like it doesn't exist on ml-serve1001. please see error returned below.
[16:44:17] kevinbazira@ml-serve1001:~$ s3cmd
[16:44:28] -bash: s3cmd: command not found
[16:45:42] kevinbazira: yeah we moved the s3cmd to the stat100x nodes, the docs are not updated afaics
[16:46:05] for example
[16:46:06] elukey@stat1004:~$ sudo s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/goodfaith/kowiki/20220214192234/
[16:46:09] 2022-02-14 19:22 9900788 s3://wmf-ml-models/goodfaith/kowiki/20220214192234/model.bin
[16:46:17] how have you loaded models so far?
[16:46:40] the idea is to avoid the k8s prod nodes now that we have all configs on the stat boxes, to separate concerns
[16:47:09] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (Halfak) https://github.com/wikimedia/ores/bl...
[16:47:30] elukey: it's on the list of models that Andy loaded: https://phabricator.wikimedia.org/T301413#7708487
[16:47:57] ahhhh
[16:50:21] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) Sometimes it happens in prod as well...
[17:05:31] I have updated the docs to reflect that s3cmd was moved to the stat100x nodes: https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Deploy&diff=1952353&oldid=1951661
[17:09:58] I was about to do it thanks!
[17:11:35] kevinbazira: I am going to add a section about how to add models as well
[17:12:02] great. thanks 🙏
[17:19:54] kevinbazira: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_upload_a_model_to_Swift
[17:20:56] thanks elukey. do we have to always use "sudo"?
[17:22:43] kevinbazira: nono error from my side
[17:22:54] me and Tobias are not in that group, so I used sudo
[17:23:00] you should be able to run it without it
[17:23:14] ack. thanks 🙏
[17:33:47] o/
[17:34:24] thanks for updating the model upload docs elukey & kevinbazira!
[17:34:28] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) @achou I just noticed that https://g...
[17:41:02] accraze: o/
[17:41:20] elukey: itwiki model has been uploaded
[17:41:22] kevinbazira@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/damaging/itwiki/20220224172913
[17:41:22] DIR s3://wmf-ml-models/damaging/itwiki/20220224172913/
[17:41:35] going to update the patch now
[17:45:12] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks): Upload model binaries to storage - https://phabricator.wikimedia.org/T301413 (kevinbazira) The itwiki damaging model was not here: s3://wmf-ml-models/damaging/itwiki/20220214192228/ So I have upload...
[17:49:23] klausman: nice :)
[17:49:29] err kevinbazira_ :)
[17:49:56] Aw
[17:52:14] second patch uploaded: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/765501/
[17:56:07] kevinbazira: ready to deploy when you want :)
[17:56:46] ack. starting deployment now.
[18:03:27] both eqiad and codfw deployments have been completed successfully.
[18:03:37] checking pods now ...
[18:12:57] 5/6 new pods are up and running.
[18:13:07] the itwiki-goodfaith-predictor is not running. status:CrashLoopBackOff
[18:16:00] Typo in a path somewhere maybe?
[18:16:44] to debug it
[18:16:53] kubectl logs itwiki-goodfaith-predictor-default-kd56q-deployment-6c4f57gqhzn -n revscoring-editquality-goodfaith kserve-container
[18:17:05] the itwiki-etc.. is the name of the pod
[18:17:40] if you swap "kserve-container" with "storage-initializer" the logs change
[18:17:54] (the storage initializer comes first, and everything looks good afaics)
[18:17:59] in the kserve-container I see
[18:18:06] File "/opt/lib/python/site-packages/revscoring/scoring/models/model.py", line 102, in load
[18:18:09] model = pickle.load(f.buffer)
[18:18:11] _pickle.UnpicklingError: invalid load key, 'v'.
[18:18:20] this is when the model is loaded
[18:18:31] kevinbazira: --^
[18:19:25] hmmm could it be that the itwiki goodfaith model was not uploaded as well?
[18:19:29] let me check
[18:19:51] IIRC the model was there
[18:21:29] I've checked and it doesn't seem to be there.
[18:21:33] let me upload it
[18:22:51] elukey@stat1004:~$ sudo s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/goodfaith/itwiki/20220214171756/
[18:22:54] 2022-02-14 17:17 132 s3://wmf-ml-models/goodfaith/itwiki/20220214171756/model.bin
[18:22:57] kevinbazira: --^
[18:22:59] it should be there no?
[18:23:16] last change the 14th
[18:27:32] it seems as if the model was not serialized correctly
[18:27:53] * elukey blames accraze
[18:27:54] :D
[18:28:37] need to step afk for a bit, will check later :)
[18:29:54] uh oh!
[18:29:56] lol
[18:30:14] lemme take a look, it could be a git lfs issue :(
[18:33:27] accraze: that could be the possibility because on https://github.com/wikimedia/editquality/blob/master/models/itwiki.goodfaith.gradient_boosting.model the model's size is: 8.65 MB
[18:33:38] but the uploaded file on swift has:
[18:33:45] kevinbazira@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg du -H s3://wmf-ml-models/damaging/itwiki/20220214171756/model.bin
[18:33:46] 132 1 objects s3://wmf-ml-models/damaging/itwiki/20220214171756/model.bin
[18:35:42] it's coming to 10PM on my end. I'll pick this up tomorrow.
[18:35:49] have a good day everyone.
[18:45:45] have a good night kevinbazira! i'll see if i can figure out what's going on
[18:46:41] great. thanks Andy 🙏
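
Editor's note on the "git lfs issue" hypothesis above: a Git LFS pointer file is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, followed by `oid sha256:...` and `size ...` lines. That would be consistent with both symptoms in the log: a model.bin of only 132 bytes, and pickle rejecting the first byte with "invalid load key, 'v'". The log does not confirm that this was the cause. The following is only a minimal sketch of a pre-upload check, assuming the file contents are available as bytes; `parse_lfs_pointer` and `build_example_pointer` are hypothetical helper names, and the oid and size values are made up for illustration.

```python
import pickle

# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data: bytes):
    """Return the pointer's key/value fields, or None if data is not a pointer.

    Hypothetical helper: it only inspects the leading bytes and splits each
    line on the first space; it does not validate the oid or the size.
    """
    if not data.startswith(LFS_SPEC_LINE):
        return None
    fields = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value
    return fields


def build_example_pointer(size: int) -> bytes:
    """Build a fake pointer for demonstration (the all-zero oid is made up)."""
    return (
        LFS_SPEC_LINE
        + b"\noid sha256:" + b"0" * 64
        + b"\nsize " + str(size).encode("ascii")
        + b"\n"
    )


if __name__ == "__main__":
    pointer = build_example_pointer(9070000)  # illustrative size, not from the log
    fields = parse_lfs_pointer(pointer)
    print("pointer length in bytes:", len(pointer))
    print("declared size of the real file:", fields["size"])

    # Feeding the pointer to pickle fails on the leading "v" of "version".
    try:
        pickle.loads(pointer)
    except pickle.UnpicklingError as exc:
        print("pickle error:", exc)

    # A real pickle is not mistaken for a pointer.
    print(parse_lfs_pointer(pickle.dumps({"ok": True})))
```

Running a check like this on the stat100x host before `s3cmd put` (or simply comparing the local file size with the size shown on GitHub) would flag a pointer file before it reaches the storage-initializer.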