[07:25:04] good morning folks!
[07:25:19] ores1001 seems to be working fine, I'd proceed with the rollout of the new mwparserfromhell dependency
[08:00:21] deploying ORES right now
[08:07:48] deploy completed!
[08:14:34] 10Machine-Learning-Team, 10ORES: Ores mwparserfromhell causes celery segfaults - https://phabricator.wikimedia.org/T296563 (10elukey) 05Open→03Resolved a:03elukey Change deployed fleetwide, will keep monitoring metrics but so far everything green.
[08:14:36] 10Machine-Learning-Team, 10ORES: Ores mwparserfromhell causes celery segfaults - https://phabricator.wikimedia.org/T296563 (10elukey)
[08:32:59] added a graph to the dashboard related to traffic by wiki
[08:33:00] https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=70&orgId=1&refresh=1m
[08:33:31] it would be great to have such breakdowns by model, so we could get a sense of how to scale Lift Wing
[15:09:04] o/
[15:10:22] o/
[15:10:45] elukey: thanks for looking into the sandbox last week :)
[15:11:07] also nice catch on the weird ores/mwparserfromhell issue!
[15:14:16] i plan on fixing up my kserve-local docs today, need to backtrack a bit and list out all the additional steps i needed to do w/ helm template / sed / etc.
[15:14:45] accraze: np!
The ores issue was a bit nasty, I think that we need to add some set of monitors
[15:15:16] I think that there is also value in trying to improve ORES metrics, to see how much traffic we currently get for specific models
[15:16:05] yeah that might help us make sure we don't miss anything on Lift Wing as well
[15:22:06] today I added https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=71&orgId=1&refresh=1m but it is not enough
[15:22:50] I think that ORES sends the model info in the statsd payload, but the local prometheus exporter (that receives statsd and converts it to prometheus metrics) doesn't expose it
[15:58:47] 10Lift-Wing, 10Machine-Learning-Team: Create blubberfile for Outlinks model server & transformer - https://phabricator.wikimedia.org/T290929 (10ACraze)
[16:01:27] 10Lift-Wing, 10Machine-Learning-Team: Create blubberfile for Outlinks model server & transformer - https://phabricator.wikimedia.org/T290929 (10ACraze) 05Open→03Resolved a:03ACraze The model-server and transformer service blubberfiles have been merged into the main branch for outlink topic models. The ne...
[16:01:29] 10Lift-Wing, 10Machine-Learning-Team: Configure outlink topic model deployment pipeline - https://phabricator.wikimedia.org/T290930 (10ACraze)
[16:07:12] elukey: i just found the url to that video you linked a couple weeks ago about ML SRE metrics etc... going to give it a watch this afternoon!
[16:07:23] ack!
[16:08:28] obv we should match ORES metrics, but i'm interested in seeing if there's a standardized approach for ML systems other people are starting to follow
[16:10:44] accraze: we should also try to figure out what the HTTP return code best practices are for KServer
[16:11:22] oh right! yes agreed
[16:15:53] param validation could be helpful
[16:44:56] Morning all!
[16:45:09] o/
[16:45:39] Primer.ai, the company Habeeb now works at, released their first Wikipedia paper https://arxiv.org/abs/2111.11372
[16:46:05] whoa, right on!
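(Editor's sketch of the statsd/prometheus discussion above: the exporter in question maps dotted statsd metric names to labeled prometheus metrics, and the model/wiki info survives only if the mapping pulls those path segments out as labels. The metric name layout below is invented for illustration, not the actual ORES statsd schema.)

```python
# Hypothetical sketch: turn a dotted, ORES-style statsd metric name into
# a prometheus metric name plus labels, so per-model traffic is queryable.
# The "prefix.wiki.model.name" layout is an assumption for illustration.
def statsd_to_prometheus(metric: str) -> tuple[str, dict]:
    """Split e.g. 'ores.enwiki.damaging.scores' into a prometheus
    metric name plus {'wiki': ..., 'model': ...} labels."""
    prefix, wiki, model, name = metric.split(".", 3)
    labels = {"wiki": wiki, "model": model}
    return f"{prefix}_{name}_total", labels
```

In practice this segment-to-label extraction would live in the exporter's mapping config rather than in code, but the transformation is the same.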
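(Editor's sketch of the "param validation" idea mentioned above: reject malformed scoring requests early with a 400 instead of letting the model server fall over with a 500. The field name and status choices are assumptions, not an established KServe convention.)

```python
# Hypothetical request validation for a model-server endpoint.
# "rev_id" is an illustrative field name, not a confirmed API contract.
from http import HTTPStatus

def validate_request(payload: dict) -> tuple[int, str]:
    """Return (http_status, message) for an incoming scoring request."""
    if not isinstance(payload, dict) or "rev_id" not in payload:
        return HTTPStatus.BAD_REQUEST, "missing required field: rev_id"
    if not isinstance(payload["rev_id"], int) or payload["rev_id"] <= 0:
        return HTTPStatus.BAD_REQUEST, "rev_id must be a positive integer"
    return HTTPStatus.OK, "ok"
```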
[16:48:17] Also I got the only Wikimedia swag I have ever gotten today, and TIL I might have gained a few pounds
[16:54:55] chrisalbon: what is this swag you speak of? hoodie from all hands?
[17:04:30] the 20th anniversary hoodie
[17:05:35] niiiice
[17:13:56] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Migrate from Kfserving to Kserve - https://phabricator.wikimedia.org/T293331 (10ACraze) From IRC (wikimedia-ml) 2021-11-22: > elukey: folks I think that we should migrate to KServer from kserve==0.7, I suspect that https://github.com/ks...
[17:46:52] KServer?
[17:47:37] it's the class name for our model-servers
[17:48:28] basically we need to update our model-servers to use the new kserve python package
[17:48:33] ah got it
[17:50:16] there's an issue with older kfserving versions w/ async workers which is giving us strange results when load testing on prod (w/ the new kserve stack)
[17:51:01] i think we are in a good place with the new ml-sandbox to start working on upgrading the existing images to use kserve==0.7
[17:52:42] \o/
[17:53:32] I have a basic setup for the egress gateway that kinda works, but it will need some more days of tests and configs
[17:53:40] and probably a new TLS cert for it
[17:54:12] If you want reviews/brainbouncing, lmk
[17:54:48] sure! It is still a little too WIP for the moment, but I am collecting info in the related task
[17:55:19] the egress solution is nice if we decide to move to a mTLS mesh, in that case we'll need to change a few things
[17:55:31] (but we'll need to hook our cluster to cfssl/PKI for sure)
[17:56:13] So this egress would be for models to talk outside the cluster, e.g. other people's feature stores?
[18:01:37] in theory for anything outside the cluster and http based, like the mw api (so we'll have a way to do limits/circuit-breaking/etc.)
[18:03:22] Ack
[19:25:09] (03PS1) 10Vlad.shapik: Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083)
[20:07:12] (03CR) 10Umherirrender: [C: 03+2] Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083) (owner: 10Vlad.shapik)
[20:19:03] (03Merged) 10jenkins-bot: Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083) (owner: 10Vlad.shapik)
[22:07:52] ok new ml-sandbox is looking pretty good, i sent some notes over to kevinbazira so we will see if he can get it working as well
[22:08:27] i outlined the manual install w/ helm template here: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/Local_Kserve
[22:10:09] one caveat: we still need to figure out a good strategy for model storage in dev, right now it's just using our old s3 bucket
[22:11:13] started looking into running minio on the new sandbox but with how docker etc. is set up, we'd each need to run our own
[22:11:46] we could also do a pvc... but that's a bit more involved and i would like to avoid it if possible
[22:13:35] i'm ok with a hacky solution as this most likely won't be our long-term solution for dev environments
[22:14:28] either way, tomorrow i hope to finally start upgrading the model-servers to use kserve v0.7.0
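(Editor's sketch of the egress idea discussed above: with an Istio mesh, reaching an HTTP service outside the cluster, like the mw api, typically starts with a ServiceEntry registering the external host, which is also where limits/circuit-breaking policy can attach. The hostname, names and ports below are assumptions for illustration, not the team's actual config.)

```yaml
# Illustrative only: an Istio ServiceEntry letting in-cluster model
# servers reach an external HTTP API through the egress path.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: mediawiki-api        # hypothetical name
spec:
  hosts:
    - api-ro.discovery.wmnet # hypothetical upstream host
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
```

A real setup would add a Gateway/VirtualService to force traffic through the egress gateway (and the TLS cert mentioned above), plus a DestinationRule for circuit-breaking.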
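(Editor's sketch of the kserve v0.7.0 upgrade discussed in this log: in the 0.7 rename, kfserving's KFModel/KFServer classes became kserve.Model/kserve.ModelServer, so the migration is largely re-basing each model-server on the new class. The model name and scoring logic below are invented for illustration; the import fallback just lets the sketch run where the kserve package is not installed.)

```python
# Minimal shape of a model-server on the new kserve 0.7 API.
try:
    from kserve import Model
except ImportError:
    class Model:  # stand-in with the same constructor shape
        def __init__(self, name: str):
            self.name = name

class OutlinkTopicModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = False

    def load(self) -> None:
        # a real server would fetch the model binary from storage here
        self._predict = lambda outlinks: {"topic": "stub", "score": 0.0}
        self.ready = True

    def predict(self, request: dict) -> dict:
        outlinks = request.get("outlinks", [])
        return {"predictions": self._predict(outlinks)}

# With kserve 0.7 installed, serving would then be started roughly as:
#   from kserve import ModelServer
#   model = OutlinkTopicModel("outlink-topic-model"); model.load()
#   ModelServer().start([model])
```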