[07:25:04] good morning folks!
[07:25:19] ores1001 seems to be working fine, I'd proceed with the rollout of the new mwparserfromhell dependency
[08:00:21] deploying ORES right now
[08:07:48] deploy completed!
[08:14:34] 10Machine-Learning-Team, 10ORES: Ores mwparserfromhell causes celery segfaults - https://phabricator.wikimedia.org/T296563 (10elukey) 05Open→03Resolved a:03elukey Change deployed fleetwide, will keep monitoring metrics but so far everything green.
[08:14:36] 10Machine-Learning-Team, 10ORES: Ores mwparserfromhell causes celery segfaults - https://phabricator.wikimedia.org/T296563 (10elukey)
[08:32:59] added a graph to the dashboard related to traffic by wiki
[08:33:00] https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=70&orgId=1&refresh=1m
[08:33:31] it would be great to have such breakdowns by model, so we could get a sense of how to scale Lift Wing
[15:09:04] o/
[15:10:22] o/
[15:10:45] elukey: thanks for looking into the sandbox last week :)
[15:11:07] also nice catch on the weird ores/mwparserfromhell issue!
[15:14:16] i plan on fixing up my kserve-local docs today, need to backtrack a bit and list out all the additional steps i needed to do w/ helm template / sed / etc.
[15:14:45] accraze: np!
The ores issue was a bit nasty, I think that we need to add some set of monitors
[15:15:16] I think that there is also value in trying to improve ORES metrics, to see how much traffic we currently get for specific models
[15:16:05] yeah that might help us make sure we don't miss anything on Lift Wing as well
[15:22:06] today I added https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=71&orgId=1&refresh=1m but it is not enough
[15:22:50] I think that ORES sends the model info in the statsd payload, but the local prometheus exporter (that receives statsd and converts it to prometheus metrics) doesn't expose it
[15:58:47] 10Lift-Wing, 10Machine-Learning-Team: Create blubberfile for Outlinks model server & transformer - https://phabricator.wikimedia.org/T290929 (10ACraze)
[16:01:27] 10Lift-Wing, 10Machine-Learning-Team: Create blubberfile for Outlinks model server & transformer - https://phabricator.wikimedia.org/T290929 (10ACraze) 05Open→03Resolved a:03ACraze The model-server and transformer service blubberfiles have been merged into the main branch for outlink topic models. The ne...
[16:01:29] 10Lift-Wing, 10Machine-Learning-Team: Configure outlink topic model deployment pipeline - https://phabricator.wikimedia.org/T290930 (10ACraze)
[16:07:12] elukey: i just found the url to that video you linked a couple weeks ago about ML SRE metrics etc... going to give it a watch this afternoon!
[16:07:23] ack!
[16:08:28] obv we should match ORES metrics, but i'm interested in seeing if there's a standardized approach for ML systems other people are starting to follow
[16:10:44] accraze: we should also try to figure out what the HTTP return code best practices are for KServer
[16:11:22] oh right! yes agreed
[16:15:53] param validation could be helpful
[16:44:56] Morning all!
[16:45:09] o/
[16:45:39] Primer.ai, the company Habeeb now works at, released their first Wikipedia paper https://arxiv.org/abs/2111.11372
[16:46:05] whoa, right on!
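(Editor's sketch of the statsd/prometheus discussion above: the exporter in question maps dotted statsd metric names to labeled prometheus metrics, and the model/wiki info survives only if the mapping pulls those path segments out as labels. The metric name layout below is invented for illustration, not the actual ORES statsd schema.)

```python
# Hypothetical sketch: turn a dotted, ORES-style statsd metric name into
# a prometheus metric name plus labels, so per-model traffic is queryable.
# The "prefix.wiki.model.name" layout is an assumption for illustration.
def statsd_to_prometheus(metric: str) -> tuple[str, dict]:
    """Split e.g. 'ores.enwiki.damaging.scores' into a prometheus
    metric name plus {'wiki': ..., 'model': ...} labels."""
    prefix, wiki, model, name = metric.split(".", 3)
    labels = {"wiki": wiki, "model": model}
    return f"{prefix}_{name}_total", labels
```

In practice this segment-to-label extraction would live in the exporter's mapping config rather than in code, but the transformation is the same.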
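(Editor's sketch of the "param validation" idea mentioned above: reject malformed scoring requests early with a 400 instead of letting the model server fall over with a 500. The field name and status choices are assumptions, not an established KServe convention.)

```python
# Hypothetical request validation for a model-server endpoint.
# "rev_id" is an illustrative field name, not a confirmed API contract.
from http import HTTPStatus

def validate_request(payload: dict) -> tuple[int, str]:
    """Return (http_status, message) for an incoming scoring request."""
    if not isinstance(payload, dict) or "rev_id" not in payload:
        return HTTPStatus.BAD_REQUEST, "missing required field: rev_id"
    if not isinstance(payload["rev_id"], int) or payload["rev_id"] <= 0:
        return HTTPStatus.BAD_REQUEST, "rev_id must be a positive integer"
    return HTTPStatus.OK, "ok"
```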
[16:48:17] Also I got the only Wikimedia swag I have ever gotten today, and TIL I might have gained a few pounds
[16:54:55] chrisalbon: what is this swag you speak of? hoodie from all hands?
[17:04:30] the 20th anniversary hoodie
[17:05:35] niiiice
[17:13:56] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Migrate from Kfserving to Kserve - https://phabricator.wikimedia.org/T293331 (10ACraze) From IRC (wikimedia-ml) 2021-11-22: > elukey: folks I think that we should migrate to KServer from kserve==0.7, I suspect that https://github.com/ks...
[17:46:52] KServer?
[17:47:37] it's the class name for our model-servers
[17:48:28] basically we need to update our model-servers to use the new kserve python package
[17:48:33] ah got it
[17:50:16] there's an issue with older kfserving versions w/ async workers which is giving us strange results when load testing on prod (w/ the new kserve stack)
[17:51:01] i think we are in a good place with the new ml-sandbox to start working on upgrading the existing images to use kserve==0.7
[17:52:42] \o/
[17:53:32] I have a basic setup for the egress gateway that kinda works, but it will need some more days of tests and configs
[17:53:40] and probably a new TLS cert for it
[17:54:12] If you want reviews/brainbouncing, lmk
[17:54:48] sure! It is still a little too WIP for the moment, but I am collecting info in the related task
[17:55:19] the egress solution is nice if we decide to move to a mTLS mesh, in that case we'll need to change a few things
[17:55:31] (but we'll need to hook our cluster to cfssl/PKI for sure)
[17:56:13] So this egress would be for models to talk outside the cluster, e.g. other people's feature stores?
[18:01:37] in theory for anything outside the cluster and http based, like the mw api (so we'll have a way to do limits/circuit-breaking/etc.)
[18:03:22] Ack
[19:25:09] (03PS1) 10Vlad.shapik: Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083)
[20:07:12] (03CR) 10Umherirrender: [C: 03+2] Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083) (owner: 10Vlad.shapik)
[20:19:03] (03Merged) 10jenkins-bot: Avoid using User::getOption() method [extensions/ORES] - 10https://gerrit.wikimedia.org/r/742521 (https://phabricator.wikimedia.org/T296083) (owner: 10Vlad.shapik)
[22:07:52] ok new ml-sandbox is looking pretty good, i sent some notes over to kevinbazira so we will see if he can get it working as well
[22:08:27] i outlined the manual install w/ helm template here: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/Local_Kserve
[22:10:09] one caveat: we still need to figure out a good strategy for model storage in dev, right now it's just using our old s3 bucket
[22:11:13] started looking into running minio on the new sandbox but with how docker etc. is set up, we'd each need to run our own
[22:11:46] we could also do a pvc... but that's a bit more involved and i would like to avoid it if possible
[22:13:35] i'm ok with a hacky solution as this most likely won't be our long-term solution for dev environments
[22:14:28] either way, tomorrow i hope to finally start upgrading the model-servers to use kserve v0.7.0
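(Editor's sketch of the egress idea discussed above: with an Istio mesh, reaching an HTTP service outside the cluster, like the mw api, typically starts with a ServiceEntry registering the external host, which is also where limits/circuit-breaking policy can attach. The hostname, names and ports below are assumptions for illustration, not the team's actual config.)

```yaml
# Illustrative only: an Istio ServiceEntry letting in-cluster model
# servers reach an external HTTP API through the egress path.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: mediawiki-api        # hypothetical name
spec:
  hosts:
    - api-ro.discovery.wmnet # hypothetical upstream host
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
```

A real setup would add a Gateway/VirtualService to force traffic through the egress gateway (and the TLS cert mentioned above), plus a DestinationRule for circuit-breaking.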
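(Editor's sketch of the kserve v0.7.0 upgrade discussed in this log: in the 0.7 rename, kfserving's KFModel/KFServer classes became kserve.Model/kserve.ModelServer, so the migration is largely re-basing each model-server on the new class. The model name and scoring logic below are invented for illustration; the import fallback just lets the sketch run where the kserve package is not installed.)

```python
# Minimal shape of a model-server on the new kserve 0.7 API.
try:
    from kserve import Model
except ImportError:
    class Model:  # stand-in with the same constructor shape
        def __init__(self, name: str):
            self.name = name

class OutlinkTopicModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = False

    def load(self) -> None:
        # a real server would fetch the model binary from storage here
        self._predict = lambda outlinks: {"topic": "stub", "score": 0.0}
        self.ready = True

    def predict(self, request: dict) -> dict:
        outlinks = request.get("outlinks", [])
        return {"predictions": self._predict(outlinks)}

# With kserve 0.7 installed, serving would then be started roughly as:
#   from kserve import ModelServer
#   model = OutlinkTopicModel("outlink-topic-model"); model.load()
#   ModelServer().start([model])
```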