[00:15:44] FIRING: LiftWingServiceErrorRate: ...
[00:15:50] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[00:50:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[01:50:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[05:50:59] FIRING: LiftWingServiceErrorRate: ...
[05:50:59] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:42:55] Investigating ^^^
[07:51:15] kevinbazira: sent you a patch to switch the above service to MP
[07:58:20] klausman: o/ I've +1'd.
[07:58:26] thankyou!
[07:58:32] will deploy in a minute
[07:58:36] okok
[08:06:12] pushed to codfw, letting it soak for a bit
[08:19:30] deployed to eqiad, too
[08:20:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:20:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:02:06] * klausman lunch
[12:25:29] the new readability-predictor has been crashlooping in staging
[12:26:16] I think Ilias started looking into that last week, but I don't know if he found anything (or if he wrote it in a ticket)
[12:26:37] checking the pod, it shows the reason is OOMKilled
[12:28:00] klausman: ack, can you delete the pod? I want to see if it has the same issue after restarts
[12:28:10] ack, in a sec
[12:30:30] deleted it, but it's OOMing again
[12:31:53] One thing of note: the old revision (18) has a memory limit of 2Gi, the new one is 1Gi
[12:32:05] ohhh!
[12:32:12] ... I think.
[12:32:18] The describe output is a bit confusing
[12:32:26] why is that
[12:33:11] there's many containers and assorted limits. I just diffed the two pods. The limits are identical, so false alarm there
[12:34:02] okk
[12:34:36] maybe the new model needs more memory
[12:35:12] We could hand-edit it to a bigger value, see if it helps
[12:35:30] yes I was going to say it
[12:37:28] ok, bumping to 4Gi
[12:38:31] it's running!
[12:38:32] Looks like that made it start. Let's see how long it lives :)
[12:38:51] Can you fire some req's at it?
[12:39:05] yes, in a sec
[12:39:41] Good morning all
[12:39:46] heyo Chris
[12:40:11] It’s cold now
[12:41:17] hi Chris!
[12:42:04] I sent a request. it works, but the new model is very slow
[12:42:26] It also has only 1 CPU atm, I can bump that as well
[12:43:05] that's fine, I need to first do some load tests
[12:43:08] 2? 3? 4? wdyt?
[12:47:00] the performance should be similar to the old one based on research team. but here we see it slower (it was also slower when I tested locally)
[12:48:44] and it needs more memory. that's not ideal. we need to figure out if it is because of the model itself or sth else
[12:51:17] klausman: but thanks for the help! I'll ping you if I need to bump the cpu
[12:52:01] roger that. should I send a patch to deployment-charts that has the new memory limit?
[12:52:11] (just for staging)
[12:52:30] ohh yes that's better
[12:54:02] alright, will be ready in a jiffy
[12:55:58] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064019
[12:57:21] +1ed
[13:00:38] merged and deployed
[13:38:32] (CR) Máté Szabó: [C:+2] Add missing documentation to class properties [extensions/ORES] - https://gerrit.wikimedia.org/r/1063894 (owner: Umherirrender)
[14:01:57] (Merged) jenkins-bot: Add missing documentation to class properties [extensions/ORES] - https://gerrit.wikimedia.org/r/1063894 (owner: Umherirrender)
[14:12:24] Machine-Learning-Team: Fix API Gateway examples for Javascript - https://phabricator.wikimedia.org/T369865#10077684 (kevinbazira) Open→Resolved a:kevinbazira
[14:19:10] Machine-Learning-Team: Fix articlequality model-server local-run - https://phabricator.wikimedia.org/T371677#10077727 (kevinbazira) This issue was fixed, and the articlequality model-server can now be built and run locally, as shown in: https://github.com/wikimedia/machinelearning-liftwing-inference-services...
[14:19:28] Machine-Learning-Team: Fix articlequality model-server local-run - https://phabricator.wikimedia.org/T371677#10077729 (kevinbazira) Open→Resolved
[14:25:27] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#10077773 (kevinbazira) Open→Resolved
[14:31:53] Machine-Learning-Team, Structured-Data-Backlog: Pass the maximum number of uploads to the logo detection service - https://phabricator.wikimedia.org/T363505#10077811 (kevinbazira) Open→Resolved We are closing this ticket for now. Please feel free to reopen it if needed.
[14:37:51] (Abandoned) Kevin Bazira: logo-detection: use cookie to access stash images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028937 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[14:44:33] interesting.. load test results for readability are within the threshold
[14:46:38] I wonder if the old model was already slow in staging before
[14:48:06] for the same request, new model in staging took 6s and old one in prod took 0.9s
[14:48:43] (Abandoned) Kevin Bazira: logo-detection: restrict image processing to trusted domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[14:49:48] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064046
[14:52:03] klausman: ---^ if you have time
[14:52:31] Looking...
[14:53:12] LGTM!
[14:54:09] thankss
[15:05:48] Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#10077961 (kevinbazira) Open→Resolved
[15:06:53] Machine-Learning-Team, Automoderator, Moderator-Tools-Team: [SPIKE]Perform a load test for Multilingual Revert Risk on LiftWing[4H] - https://phabricator.wikimedia.org/T372298#10077972 (Scardenasmolinar)
[15:09:22] Machine-Learning-Team: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#10077987 (kevinbazira) Open→Resolved
[15:17:35] mykola did mention that the new model will require more RAM in https://phabricator.wikimedia.org/T369712#10038210
[16:21:15] Machine-Learning-Team, Automoderator, Moderator-Tools-Team (Kanban): [SPIKE]Perform a load test for Multilingual Revert Risk on LiftWing[4H] - https://phabricator.wikimedia.org/T372298#10078348 (Samwalton9-WMF)
[16:55:48] Lift-Wing, Machine-Learning-Team: Request to update Readability model on Lift Wing - https://phabricator.wikimedia.org/T369712#10078580 (achou) We've deployed the model to ml-staging. Initially, the service was crashlooping due to out of memory. The issue was resolved after increasing the memory to 4Gi (...
[16:58:37] ---^ wrapped out the findings for the readability load tests
[17:02:45] didn't have time to check slow queries. I will do it tmr :)
[17:03:24] logging off today!
[19:51:49] Lift-Wing, Machine-Learning-Team: Request to update Readability model on Lift Wing - https://phabricator.wikimedia.org/T369712#10079166 (Trokhymovych) Hi @achou, thanks so much for your work! I’ve run the tests and can confirm the scale of your observations. The old model averages 1.07s per item, while t...