[07:07:26] it is a little weird, draftquality is already very fast
[07:07:34] (good morning :)
[07:07:45] and seems to scale fine even with blocking http calls
[07:13:44] with async preprocess draftquality seems to run better with more clients, that is good
[07:13:48] klausman: ack, will do
[07:15:08] going to relocate in a bit, bbiab
[07:16:34] (03CR) 10Elukey: [C: 03+2] drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey)
[07:25:59] (03Merged) 10jenkins-bot: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey)
[08:20:08] Morning
[08:21:58] hello :)
[08:25:42] it seems that drafttopic and draftquality stop scaling up after 3 connections in parallel (without async), but latencies don't go up a lot
[08:25:49] with async preprocess it is better :)
[08:26:00] but the gain is less visible than for edit/articlequality
[08:26:14] going to test the edittopic image now, and then roll out to prod
[08:26:46] (I am editing the isvc by hand on staging for the moment, quicker)
[08:36:52] kevinbazira_: o/
[08:37:03] so IIUC drafttopic and articletopic use the same docker image, right?
[08:37:28] elukey: o/
[08:38:20] yes they do :)
[08:38:34] ack perfect, going to test articletopic as well :)
[08:38:52] I got a little confused when prepping the docker image change for deployment-charts
[08:40:27] sorry about the confusion. it was probably not well documented.
[08:40:47] they both use this image: https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-topic/tags/
[08:42:02] kevinbazira_: nono it is not your fault, it is mine, my brain is still not used to all the models that we have :)
[08:45:10] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/830788
[08:45:14] for the prod rollout
[08:47:19] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) Tested in staging (manually edited the isvc) the new docker images for {draft,article}topic (they share the same image) and draftquality....
[08:52:28] elukey: LGTM'd
[08:54:18] thanks!
[09:04:53] doing a complete rollout to staging
[09:11:02] all right staging has been updated
[09:11:20] I'll do another quick pass to see if anything weird pops up
[09:11:28] but overall I think we are ready for prod
[09:12:00] procedure-wise we could think about having another person check before a big prod rollout
[09:12:03] check in staging, I mean
[09:12:17] in this case the blast radius is zero since we are not serving live traffic
[09:12:23] but soon that will change
[09:12:33] so we could start now and agree on a procedure
[09:12:56] like: I cannot proceed with a rollout in production if somebody in my team hasn't done basic checks
[09:13:10] does that sound good, or is it too much?
[09:27:53] ok so I'll wait before the prod rollout to see if the above idea is good, so that somebody can check staging and give me the green light :)
[09:32:28] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) The staging cluster has been upgraded. I proposed on IRC to establish a rule for these kinds of deployments, to have somebody (other than...
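The async change being rolled out above moves the model's preprocess I/O off the blocking path. A minimal sketch of the pattern, assuming kserve's Model API and aiohttp; the class name, upstream endpoint, and payload fields below are illustrative, not the actual Lift Wing code:

    import aiohttp
    import kserve

    class RevscoringModel(kserve.Model):
        """Illustrative sketch: fetch revision data without blocking the event loop."""

        def __init__(self, name: str):
            super().__init__(name)
            self._session = None  # one shared HTTP session, created lazily

        async def preprocess(self, inputs: dict) -> dict:
            # The old, blocking style (roughly requests.get(...)) stalls the
            # worker for the full round trip. This coroutine instead yields to
            # the event loop while waiting on I/O, so other requests keep
            # being served.
            if self._session is None:
                self._session = aiohttp.ClientSession()
            async with self._session.get(
                "https://api.example.org/revision",  # placeholder upstream call
                params={"revid": inputs["rev_id"]},
            ) as resp:
                inputs["revision_data"] = await resp.json()
            return inputs

This matches the behavior observed above: with blocking calls each worker can only wait on one upstream request at a time, while the async version keeps scaling as the number of parallel clients grows.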
[09:38:35] elukey: I've checked both articletopic and drafttopic pods and they are all up and running. I guess staging is looking good so far :)
[09:38:35] NAME                                                              READY   STATUS    RESTARTS   AGE
[09:38:35] arwiki-articletopic-predictor-default-pxrdc-deployment-7b75gb6f  3/3     Running   0          32m
[09:38:35] cswiki-articletopic-predictor-default-7tkfb-deployment-cf8sfx6z  3/3     Running   0          32m
[09:38:35] enwiki-articletopic-predictor-default-x244n-deployment-6484vdkz  3/3     Running   0          56m
[09:38:36] arwiki-drafttopic-predictor-default-vfm28-deployment-6bc8bkqncl  3/3     Running   0          29m
[09:38:38] cswiki-drafttopic-predictor-default-s8nxg-deployment-55f7dfvsh8  3/3     Running   0          29m
[09:38:40] enwiki-drafttopic-predictor-default-jhk4l-deployment-7bcd9lbb8z  3/3     Running   0          72m
[09:39:03] kevinbazira_: thanks! I also have something deeper in mind, like doing some test requests etc..
[09:39:11] ideally we should have it all in CI
[09:39:21] but I am not sure if we have the tools to do it yet :D
[09:39:44] yep having them in CI would be great :)
[09:56:45] runc updates all done on ml*
[09:57:58] thanks!
[10:17:01] going afk for lunch in a bit! ttl
[10:17:34] (03CR) 10AikoChou: "Thanks for the review, Luca. :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/828481 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou)
[10:39:05] also lunch :)
[11:18:47] elukey: something like an httpbb script we run on ORES beta? but is it possible to integrate it into the CI test pipeline? I'm asking because we'll test it after deployment to staging, right? not before publishing the image
[11:33:34] Guys, I am taking the afternoon off, I got a migraine and water and fresh air aren't making it go away
[11:34:12] sorry klausman. hope you get well soon!
[11:56:04] klausman: rest and recover :)
[11:58:16] aiko: the httpbb script could be an option, yes! Maybe we could have a CI job that we kick off manually and that checks the sanity of staging
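As a rough stand-in for what such a sanity check could look like (this is not httpbb itself; the hostnames, model names, and payload below are made up for illustration), a script along these lines could send a canned request to each staging isvc and assert on the response:

    import requests

    # Hypothetical staging endpoints -- the real URLs and model names may differ.
    STAGING_MODELS = {
        "enwiki-articletopic": "https://inference-staging.example.wmnet/v1/models/enwiki-articletopic:predict",
        "enwiki-drafttopic": "https://inference-staging.example.wmnet/v1/models/enwiki-drafttopic:predict",
    }

    def check_staging() -> None:
        """Fail loudly if any staging isvc does not answer a basic prediction."""
        for name, url in STAGING_MODELS.items():
            resp = requests.post(url, json={"rev_id": 12345}, timeout=30)
            assert resp.status_code == 200, f"{name}: got HTTP {resp.status_code}"
            assert "predictions" in resp.json(), f"{name}: no predictions in response"
            print(f"{name}: OK")

    if __name__ == "__main__":
        check_staging()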
[11:58:39] my goal is to have somebody/something other than the deployer verify the status of staging before hitting prod
[11:58:43] especially for big changes
[12:01:27] and yes we'd run the test after publishing the image, and after the staging deployment
[12:02:09] I'll open a task with the httpbb idea, and proceed in this case
[12:17:33] ml-serve-codfw done
[12:17:50] I'll run some other wrk tests to see differences between async and non-async preprocess
[13:01:10] very weird, I get better performance in staging compared to ml-serve-codfw
[13:08:09] also, when comparing ml-serve-eqiad vs ml-serve-codfw, I don't see any discrepancy
[13:08:29] namely I don't see in ml-serve-eqiad the performance problems (scaling up conns) that I observed in staging
[13:15:04] I am a little puzzled
[13:18:20] staging seems to work way more nicely than ml-serve-codfw
[13:18:27] and I expected the opposite
[13:26:08] for example, I am testing articlequality
[13:26:39] the same wrk test, 10 conns, leads to 38 rps on staging and 26 on ml-serve-codfw
[13:27:45] it smells as if the istio+knative machinery (at least, our current versions) works less efficiently when there are a lot of isvcs
[13:29:54] checked the obvious and the docker images are the same (staging vs ml-serve-codfw)
[13:45:44] from https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?from=now-6h&orgId=1&to=now&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-articlequality&var-pod=enwiki-articlequality-predictor-default-czch4-deployment-5ps7g8 the enwiki articlequality pod looks a little throttled in CPU
[13:45:56] and some k8s workers are indeed overcommitted in "Limits"
[13:46:01] in ml-serve-codfw I mean
[13:46:07] but still, I can't explain it..
[13:46:11] .12
[13:46:13] err
[13:48:57] and a couple of k8s nodes in ml-serve-codfw have fewer than 10 pods, while the rest are running 25+
[13:49:32] we have 170 pods (including istio's etc..) across 8 nodes
[13:49:42] I can't believe that we are already hitting a limit
[13:57:56] going to take a break to clear my head :)
[15:12:32] 10Lift-Wing: SSL issues when querying LiftWing with Python - https://phabricator.wikimedia.org/T317328 (10Isaac)
[15:16:01] 10Lift-Wing: SSL issues when querying LiftWing with Python - https://phabricator.wikimedia.org/T317328 (10elukey) @Isaac does it work if you run `export REQUESTS_CA_BUNDLE=/etc/ssl/certs/wmf-ca-certificates.crt` before executing the code?
[15:31:47] 10Lift-Wing: SSL issues when querying LiftWing with Python - https://phabricator.wikimedia.org/T317328 (10Isaac) Yep - that seems to do it! Thanks @elukey ! Caveat that for some reason I still get the SSL error when doing it locally from a PySpark notebook (though it works on the workers now) but I suspect that'...
[15:33:48] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (10Isaac) > With these changes, testing the model prediction/batch prediction from Hadoop seems to be feasible. That would be great if yo...
[15:53:35] going afk for today folks o/
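For reference, the wrk comparisons above (e.g. 10 connections giving 38 rps on staging vs 26 on ml-serve-codfw) boil down to holding N connections open against an endpoint and counting completed requests. A rough asyncio equivalent, with a placeholder URL and payload, assuming aiohttp:

    import asyncio
    import time

    import aiohttp

    async def worker(session, url, payload, deadline, counter):
        # Keep one connection busy until the deadline, counting completions.
        while time.monotonic() < deadline:
            async with session.post(url, json=payload) as resp:
                await resp.read()
                counter[0] += 1

    async def measure(url, payload, conns=10, duration=30):
        """Return requests per second with `conns` concurrent connections."""
        counter = [0]
        deadline = time.monotonic() + duration
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(
                *(worker(session, url, payload, deadline, counter) for _ in range(conns))
            )
        return counter[0] / duration

    # e.g. (placeholder endpoint):
    # rps = asyncio.run(measure("https://staging.example.wmnet/v1/models/enwiki-articlequality:predict",
    #                           {"rev_id": 12345}))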
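And for the SSL issue in T317328, the fix above amounts to pointing Python's requests library at the WMF CA bundle, either via the environment variable from the task or per call; the endpoint below is a placeholder:

    import os

    import requests

    # Equivalent of `export REQUESTS_CA_BUNDLE=...` from the task comment:
    # requests reads this variable at request time, so setting it in-process
    # before the call also works.
    os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/wmf-ca-certificates.crt"

    resp = requests.post(
        "https://liftwing.example.wmnet/v1/models/enwiki-articlequality:predict",  # placeholder URL
        json={"rev_id": 12345},
    )

    # Alternatively, pass the bundle explicitly for a single call:
    # requests.post(url, json=payload, verify="/etc/ssl/certs/wmf-ca-certificates.crt")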