[07:23:35] 10Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579190 (10kevinbazira) Thanks @klausman. As discussed yesterday, with the current configuration, a request that was taking <3s on staging is now >8s in prod as shown below:...
[09:45:02] Morning!
[09:45:21] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1006855 this should increase the resource caps for a-desc in prod.
[09:49:24] klausman: o/
[09:49:36] thank you for bumping up the caps
[09:52:59] let me push a patch to update the values in: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/article-descriptions/values.yaml#L68-L75
[09:55:01] We can also do a quick hot edit to see if it works
[09:55:09] Your call
[09:58:57] it's ok. let's do a hot-edit before the patch
[10:00:58] we can test with this request:
[10:00:58] ```
[10:00:58] time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
[10:00:58] ```
[10:01:20] ack. I am pushing the admin_ng right now
[10:01:28] okok
[10:09:42] Ok, rev 4 is started with the 16 CPU limit
[10:10:21] Latency looks like ~4.5s
[10:11:29] yes, that's the range I've also got.
[10:11:46] sec, poking and prodding.
[10:11:55] good thing it's dropped from >8s
[10:12:45] One thing I noticed: Grafana does not show the container maxing the 6 CPU allocation we gave it, but we'll see what 16 does.
[10:13:57] Seeing 3.5s now
[10:14:24] niiice ... I guess there's a caching layer involved
[10:14:37] https://grafana.wikimedia.org/goto/qxlnwsoIz?orgId=1 Graphs
[10:14:54] should I go ahead and push a patch to update the values?
[10:15:11] Yes, please do
[10:15:20] okok ..
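[Editor's sketch of the change discussed above: the caps live in the `resources` stanza of the linked values.yaml. The field names below are the standard Kubernetes requests/limits keys; the 16-CPU limit matches the log, while the request values and exact nesting in the real chart are assumptions and may differ.]

```yaml
# Hypothetical values.yaml fragment for the article-descriptions
# predictor after the cap bump. "16" CPUs is the limit mentioned in
# the log; the memory limit of 4Gi is the one discussed later in the
# day. Requests and key nesting are illustrative, not verbatim.
predictor:
  resources:
    requests:
      cpu: "8"        # assumed request value
      memory: 4Gi
    limits:
      cpu: "16"       # limit applied in rev 4 per the log
      memory: 4Gi     # later discussed raising to 5Gi
```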
[10:15:56] I think the actual benefit of the extra CPU quota is due to low query count, i.e. while my query benefits from the 16 CPUs, it's one a minute, and the graphs just don't show it/it's averaged out.
[10:24:49] sure sure. I've pushed the patch: https://gerrit.wikimedia.org/r/1006866
[10:26:07] +1'd
[10:30:37] thanks. going to merge and deploy ...
[10:36:48] helmfile diff doesn't seem to detect the change. is the hot-edit still there?
[10:38:04] Yes
[10:38:12] no diff means we got it right :)
[10:38:21] okok ...
[10:38:30] hihi
[10:39:46] Oh, I just got an error with the request you pasted
[10:41:01] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579784 (10klausman) I just got an error when querying the service: ` $ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptio...
[10:41:16] I put the error and the logs into the phab task
[10:49:28] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579815 (10kevinbazira) @klausman helped [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1006855 | increase the caps ]] on this model...
[10:51:00] hmmm... that's interesting. try running it without the `Content-Type` header as I did in: https://phabricator.wikimedia.org/T358467#9579815
[10:51:19] klausman: ^---
[10:52:42] Note that it was transient, I re-did the same request seconds later and it was fine
[10:53:53] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579821 (10klausman) One addendum to the 'None has no attribute "shape"': this happened only once, the same request seconds later (and before!) worked j...
[11:29:33] * klausman lunch
[11:49:28] (03PS1) 10AikoChou: revertrisk-multilingual: bump torch and transormers version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045)
[12:08:21] (03CR) 10AikoChou: "In the previous patch it ended up installing some nvidia stuff. Now testing these versions to see if they work." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[13:11:38] Good morning all!
[13:15:35] morning o/
[13:17:51] (03CR) 10Kevin Bazira: "I built the RRML model server successfully but when I query the isvc, I get a `TextClassificationPipeline` error as shown here:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[13:19:23] Heyo Chris! Back among the living? :)
[13:22:34] I feel so much better
[13:23:05] Flying west is usually easier for me, but a 9+h difference (and a 12+h flight) still does a number on you
[13:27:14] The jet lag!
[14:05:48] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580397 (10klausman)
[14:06:39] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9537731 (10klausman) I've updated the partman lines. I will update `modules/profile/data/profile/installserver/preseed.yaml` to include the new host in a moment, so standa...
[14:14:12] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580435 (10klausman)
[15:29:08] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9580754 (10calbon) a:03calbon
[15:30:33] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9572128 (10calbon) @Samwalton9-WMF Hey Sam, is this a task for the machine learning team or...
[15:45:05] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9580807 (10calbon) Can we investigate reducing the computational need to just the language requested?
[15:48:02] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9580812 (10calbon) a:03isarantopoulos
[15:48:52] 10Machine-Learning-Team, 10ORES, 10Technical-Debt: Replace usage of wfGetDB() in ORES before the 1.42 cut so it can be hard-deprecated - https://phabricator.wikimedia.org/T357654#9580821 (10calbon) a:03isarantopoulos
[16:19:00] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9580903 (10Samwalton9-WMF) A task for our team, it just looks like #machine-learning-team go...
[16:28:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[16:42:04] kevinbazira: one thing we didn't cover today: memory usage of art-desc. It's still flying close to the limit (4GiB). Should we increase it to 5Gi and see how that flies?
[18:03:52] heading out now, seeya tomorrow \o
[18:18:19] night klausman
[22:16:19] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Growth-Team, 06Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9581892 (10KStoller-WMF) >>! In T348298#9576023, @kostajh wrote: >> @KStoller-WMF suggested that we sta...