[00:15:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[00:15:50] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[00:50:44] <jinxer-wm>	 FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate  - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[01:50:44] <jinxer-wm>	 FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate  - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[05:50:59] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[05:50:59] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:42:55] <klausman>	 Investigating ^^^
[07:51:15] <klausman>	 kevinbazira: sent you a patch to switch the above service to MP
[07:58:20] <kevinbazira>	 klausman: o/ I've +1'd.
[07:58:26] <klausman>	 thank you!
[07:58:32] <klausman>	 will deploy in a minute
[07:58:36] <kevinbazira>	 okok
[08:06:12] <klausman>	 pushed to codfw, letting it soak for a bit
[08:19:30] <klausman>	 deployed to eqiad, too
[08:20:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[08:20:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=enwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:02:06] * klausman lunch
[12:25:29] <aiko>	 the new readability-predictor has been crashlooping in staging
[12:26:16] <klausman>	 I think Ilias started looking into that last week, but I don't know if he found anything (or if he wrote it in a ticket)
[12:26:37] <aiko>	 checking the pod, it shows the reason is OOMKilled
[12:28:00] <aiko>	 klausman: ack, can you delete the pod? I want to see if it has the same issue after restarts
[12:28:10] <klausman>	 ack, in a sec
[12:30:30] <klausman>	 deleted it, but it's OOMing again
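(Editor's sketch, not part of the log.) The OOMKilled check above can be scripted against the pod's JSON, e.g. the output of `kubectl get pod <name> -o json`. The field names (`status.containerStatuses[].state` / `lastState.terminated.reason`) follow the Kubernetes Pod API; the sample pod and its container names are hypothetical stand-ins, not the real readability-predictor pod.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or previous
    termination reason is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # count each container once
    return names


# Hypothetical, hand-written sample in the shape of `kubectl get pod -o json`.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "predictor",  # assumed container name
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "sidecar",  # assumed container name
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

A crashlooping container usually reports `CrashLoopBackOff` in `state` and the actual cause in `lastState`, which is why the sketch checks both.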
[12:31:53] <klausman>	 One thing of note: the old revision (18) has a memory limit of 2Gi, the new one is 1Gi
[12:32:05] <aiko>	 ohhh!
[12:32:12] <klausman>	 ... I think. 
[12:32:18] <klausman>	 The describe output is a bit confusing
[12:32:26] <aiko>	 why is that
[12:33:11] <klausman>	 there's many containers and assorted limits. I just diffed the two pods. The limits are identical, so false alarm there
[12:34:02] <aiko>	 okk
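(Editor's sketch, not part of the log.) With several containers per pod, comparing limits by eye in `describe` output is error-prone. A minimal way to diff them, assuming both pods were dumped with `kubectl get pod <name> -o json`, is shown below. The `spec.containers[].resources.limits` path is the Kubernetes Pod API; the container names and limit values are hypothetical, since the log does not state the real ones.

```python
def container_limits(pod: dict) -> dict[str, dict[str, str]]:
    """Map container name -> its resource limits (empty dict if unset)."""
    containers = pod.get("spec", {}).get("containers", [])
    return {c["name"]: c.get("resources", {}).get("limits", {}) for c in containers}


def diff_limits(old_pod: dict, new_pod: dict) -> dict[str, tuple]:
    """Return {container: (old_limits, new_limits)} for containers that differ."""
    old, new = container_limits(old_pod), container_limits(new_pod)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


def make_pod(memory: str) -> dict:
    """Build a hypothetical two-container pod spec for illustration."""
    return {
        "spec": {
            "containers": [
                {"name": "predictor", "resources": {"limits": {"cpu": "1", "memory": memory}}},
                {"name": "sidecar", "resources": {"limits": {"cpu": "2", "memory": "1Gi"}}},
            ]
        }
    }


old_pod, new_pod, bumped_pod = make_pod("2Gi"), make_pod("2Gi"), make_pod("4Gi")

print(diff_limits(old_pod, new_pod))     # identical limits: empty dict
print(diff_limits(old_pod, bumped_pod))  # only the predictor's limits differ
```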
[12:34:36] <aiko>	 maybe the new model needs more memory
[12:35:12] <klausman>	 We could hand-edit it to a bigger value, see if it helps
[12:35:30] <aiko>	 yes I was going to say it
[12:37:28] <klausman>	 ok, bumping to 4Gi
[12:38:31] <aiko>	 it's running!
[12:38:32] <klausman>	 Looks like that made it start. Let's see how long it lives :)
[12:38:51] <klausman>	 Can you fire some req's at it?
[12:39:05] <aiko>	 yes, in a sec
[12:39:41] <chrisalbon>	 Good morning all
[12:39:46] <klausman>	 heyo Chris
[12:40:11] <chrisalbon>	 It’s cold now
[12:41:17] <aiko>	 hi Chris!
[12:42:04] <aiko>	 I sent a request. it works, but the new model is very slow
[12:42:26] <klausman>	 It also has only 1 CPU atm, I can bump that as well
[12:43:05] <aiko>	 that's fine, I need to first do some load tests
[12:43:08] <klausman>	 2? 3? 4? wdyt?
[12:47:00] <aiko>	 the performance should be similar to the old one according to the research team. but here we see it is slower (it was also slower when I tested locally)
[12:48:44] <aiko>	 and it needs more memory. that's not ideal. we need to figure out if it is because of the model itself or sth else
[12:51:17] <aiko>	 klausman: but thanks for the help! I'll ping you if I need to bump the cpu
[12:52:01] <klausman>	 roger that. should I send a patch to deployment-charts that has the new memory limit?
[12:52:11] <klausman>	 (just for staging)
[12:52:30] <aiko>	 ohh yes that's better
[12:54:02] <klausman>	 alright, will be ready in a jiffy
[12:55:58] <klausman>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064019
[12:57:21] <aiko>	 +1ed
[13:00:38] <klausman>	 merged and deployed
[13:38:32] <wikibugs>	 (CR) Máté Szabó: [C:+2] Add missing documentation to class properties [extensions/ORES] - https://gerrit.wikimedia.org/r/1063894 (owner: Umherirrender)
[14:01:57] <wikibugs>	 (Merged) jenkins-bot: Add missing documentation to class properties [extensions/ORES] - https://gerrit.wikimedia.org/r/1063894 (owner: Umherirrender)
[14:12:24] <wikibugs>	 Machine-Learning-Team: Fix API Gateway examples for Javascript - https://phabricator.wikimedia.org/T369865#10077684 (kevinbazira) Open→Resolved a:kevinbazira
[14:19:10] <wikibugs>	 Machine-Learning-Team: Fix articlequality model-server local-run - https://phabricator.wikimedia.org/T371677#10077727 (kevinbazira) This issue was fixed, and the articlequality model-server can now be built and run locally, as shown in: https://github.com/wikimedia/machinelearning-liftwing-inference-services...
[14:19:28] <wikibugs>	 Machine-Learning-Team: Fix articlequality model-server local-run - https://phabricator.wikimedia.org/T371677#10077729 (kevinbazira) Open→Resolved
[14:25:27] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#10077773 (kevinbazira) Open→Resolved
[14:31:53] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass the maximum number of uploads to the logo detection service - https://phabricator.wikimedia.org/T363505#10077811 (kevinbazira) Open→Resolved We are closing this ticket for now. Please feel free to reopen it if needed.
[14:37:51] <wikibugs>	 (Abandoned) Kevin Bazira: logo-detection: use cookie to access stash images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1028937 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[14:44:33] <aiko>	 interesting.. load test results for readability are within the threshold
[14:46:38] <aiko>	 I wonder if the old model was already slow in staging before
[14:48:06] <aiko>	 for the same request, new model in staging took 6s and old one in prod took 0.9s
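(Editor's sketch, not part of the log.) The two figures above work out to a slowdown of roughly 6.7x. Note that they compare staging against prod, so hardware and CPU limits may account for part of the gap. Below is a minimal way to take repeatable timings of the same request and compute the ratio. `time_call` and `slowdown` are hypothetical helper names, and the callable passed in stands for whatever sends the request; no real endpoint is assumed.

```python
import time
from statistics import median
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn `repeats` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def slowdown(new_seconds: float, old_seconds: float) -> float:
    """How many times slower the new measurement is than the old one."""
    return new_seconds / old_seconds


# Figures quoted in the log: 6s (new model, staging) vs 0.9s (old model, prod).
print(f"{slowdown(6.0, 0.9):.1f}x")  # → 6.7x
```

Using the median over several runs keeps a single cold-start request from dominating the comparison.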
[14:48:43] <wikibugs>	 (Abandoned) Kevin Bazira: logo-detection: restrict image processing to trusted domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449) (owner: Kevin Bazira)
[14:49:48] <aiko>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1064046
[14:52:03] <aiko>	 klausman: ---^ if you have time
[14:52:31] <klausman>	 Looking...
[14:53:12] <klausman>	 LGTM!
[14:54:09] <aiko>	 thankss
[15:05:48] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons - https://phabricator.wikimedia.org/T363449#10077961 (kevinbazira) Open→Resolved
[15:06:53] <wikibugs>	 Machine-Learning-Team, Automoderator, Moderator-Tools-Team: [SPIKE]Perform a load test for Multilingual Revert Risk on LiftWing[4H] - https://phabricator.wikimedia.org/T372298#10077972 (Scardenasmolinar)
[15:09:22] <wikibugs>	 Machine-Learning-Team: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#10077987 (kevinbazira) Open→Resolved
[15:17:35] <aiko>	 mykola did mention that the new model will require more RAM in https://phabricator.wikimedia.org/T369712#10038210
[16:21:15] <wikibugs>	 Machine-Learning-Team, Automoderator, Moderator-Tools-Team (Kanban): [SPIKE]Perform a load test for Multilingual Revert Risk on LiftWing[4H] - https://phabricator.wikimedia.org/T372298#10078348 (Samwalton9-WMF)
[16:55:48] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to update Readability model on Lift Wing - https://phabricator.wikimedia.org/T369712#10078580 (achou) We've deployed the model to ml-staging. Initially, the service was crashlooping due to out of memory. The issue was resolved after increasing the memory to 4Gi (...
[16:58:37] <aiko>	 ---^ wrapped up the findings for the readability load tests
[17:02:45] <aiko>	 didn't have time to check slow queries. I will do it tmr :)
[17:03:24] <aiko>	 logging off today!
[19:51:49] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to update Readability model on Lift Wing - https://phabricator.wikimedia.org/T369712#10079166 (Trokhymovych) Hi @achou, thanks so much for your work! I’ve run the tests and can confirm the scale of your observations. The old model averages 1.07s per item, while t...