[07:23:35] 10Machine-Learning-Team: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579190 (10kevinbazira) Thanks @klausman. As discussed yesterday, with the current configuration, a request that was taking <3s on staging is now >8s in prod as shown below:...
[09:45:02] Morning!
[09:45:21] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1006855 this should increase the resource caps for a-desc in prod.
[09:49:24] klausman: o/
[09:49:36] thank you for bumping up the caps
[09:52:59] let me push a patch to update the values in: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/article-descriptions/values.yaml#L68-L75
[09:55:01] We can also do a quick hot edit to see if it works
[09:55:09] Your call
[09:58:57] it's ok. let's do a hot-edit before the patch
[10:00:58] we can test with this request:
[10:00:58] ```
[10:00:58] time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
[10:00:58] ```
[10:01:20] ack. I am pushing the admin_ng right now
[10:01:28] okok
[10:09:42] Ok, rev 4 is started with the 16 CPU limit
[10:10:21] Latency looks like ~4.5s
[10:11:29] yes, that's the range I've also got.
[10:11:46] sec, poking and prodding.
[10:11:55] good thing it's dropped from >8s
[10:12:45] One thing I noticed: Grafana does not show the container maxing the 6 CPU allocation we gave it, but we'll see what 16 does.
[10:13:57] Seeing 3.5s now
[10:14:24] niiice ... I guess there's a caching layer involved
[10:14:37] https://grafana.wikimedia.org/goto/qxlnwsoIz?orgId=1 Graphs
[10:14:54] should I go ahead and push a patch to update the values?
[10:15:11] Yes, please do
[10:15:20] okok ..
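[Editor's sketch of the change discussed above: the caps live in the `resources` stanza of the linked values.yaml. The field names below are the standard Kubernetes requests/limits keys; the 16-CPU limit matches the log, while the request values and exact nesting in the real chart are assumptions and may differ.]

```yaml
# Hypothetical values.yaml fragment for the article-descriptions
# predictor after the cap bump. "16" CPUs is the limit mentioned in
# the log; the memory limit of 4Gi is the one discussed later in the
# day. Requests and key nesting are illustrative, not verbatim.
predictor:
  resources:
    requests:
      cpu: "8"        # assumed request value
      memory: 4Gi
    limits:
      cpu: "16"       # limit applied in rev 4 per the log
      memory: 4Gi     # later discussed raising to 5Gi
```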
[10:15:56] I think the actual benefit of the extra CPU quota is due to low query count, i.e. while my query benefits from the 16 CPUs, it's one a minute, and the graphs just don't show it/it's averaged out.
[10:24:49] sure sure. I've pushed the patch: https://gerrit.wikimedia.org/r/1006866
[10:26:07] +1'd
[10:30:37] thanks. going to merge and deploy ...
[10:36:48] helmfile diff doesn't seem to detect the change. is the hot-edit still there?
[10:38:04] Yes
[10:38:12] no diff means we got it right :)
[10:38:21] okok ...
[10:38:30] hihi
[10:39:46] Oh, I just got an error with the request you pasted
[10:41:01] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579784 (10klausman) I just got an error when querying the service: ` $ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptio...
[10:41:16] I put the error and the logs into the phab task
[10:49:28] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579815 (10kevinbazira) @klausman helped [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1006855 | increase the caps ]] on this model...
[10:51:00] hmmm... that's interesting. try running it without the `Content-Type` header as I did in: https://phabricator.wikimedia.org/T358467#9579815
[10:51:19] klausman: ^---
[10:52:42] Note that it was transient, I re-did the same request seconds later and it was fine
[10:53:53] 10Machine-Learning-Team, 10Patch-For-Review: Move the article-descriptions model server from staging to production - https://phabricator.wikimedia.org/T358467#9579821 (10klausman) One addendum to the 'None has no attribute "shape"': this happened only once, the same request seconds later (and before!) worked j...
[11:29:33] * klausman lunch
[11:49:28] (03PS1) 10AikoChou: revertrisk-multilingual: bump torch and transormers version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045)
[12:08:21] (03CR) 10AikoChou: "In the previous patch it ended up installing some nvidia stuff. Now testing these versions to see if they work." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[13:11:38] Good morning all!
[13:15:35] morning o/
[13:17:51] (03CR) 10Kevin Bazira: "I built the RRML model server successfully but when I query the isvc, I get a `TextClassificationPipeline` error as shown here:" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1006909 (https://phabricator.wikimedia.org/T356045) (owner: 10AikoChou)
[13:19:23] Heyo Chris! Back among the living? :)
[13:22:34] I feel so much better
[13:23:05] Flying west is usually easier for me, but a 9+h difference (and a 12+h flight) still does a number on you
[13:27:14] The jet lag!
[14:05:48] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580397 (10klausman)
[14:06:39] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9537731 (10klausman) I've updated the partman lines. I will update `modules/profile/data/profile/installserver/preseed.yaml` to include the new host in a moment, so standa...
[14:14:12] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9580435 (10klausman)
[15:29:08] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9580754 (10calbon) a:03calbon
[15:30:33] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9572128 (10calbon) @Samwalton9-WMF Hey Sam, is this a task for the machine learning team or...
[15:45:05] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9580807 (10calbon) Can we investigate reducing the computational need to just the language requested?
[15:48:02] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog: Investigate increased preprocessing latencies on LW of article-descriptions model - https://phabricator.wikimedia.org/T358195#9580812 (10calbon) a:03isarantopoulos
[15:48:52] 10Machine-Learning-Team, 10ORES, 10Technical-Debt: Replace usage of wfGetDB() in ORES before the 1.42 cut so it can be hard-deprecated - https://phabricator.wikimedia.org/T357654#9580821 (10calbon) a:03isarantopoulos
[16:19:00] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Automoderator, 10Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9580903 (10Samwalton9-WMF) A task for our team, it just looks like #machine-learning-team go...
[16:28:48] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[16:42:04] kevinbazira: one thing we didn't cover today: memory usage of art-desc. It's still flying close to the limit (4GiB). Should we increase it to 5Gi and see how that flies?
[18:03:52] heading out now, seeya tomorrow \o
[18:18:19] night klausman
[22:16:19] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Growth-Team, 06Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#9581892 (10KStoller-WMF) >>! In T348298#9576023, @kostajh wrote: >> @KStoller-WMF suggested that we sta...