[07:29:06] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518#9836588 (SD0001) GET endpoints are more user-friendly as you can hit them on a browser - which can be a way to show an inference result to a non-technical user to explain why a bot or tool behaves the way it d...
[07:31:36] morning o/
[08:03:14] Lift-Wing, Machine-Learning-Team, ORES, ChangeProp, and 5 others: Selectively disable changeprop functionality that is no longer used - https://phabricator.wikimedia.org/T361483#9836666 (dcausse) @achou except expert search users explicitly searching for topics (which I suspect are rare) the grow...
[10:04:03] * klausman lunch
[10:52:04] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439#9837287 (hnowlan) It seems Envoy only normalises a subset of urlencoded characters: ` hnowlan@plunkett ~/Code/deployment-charts (hnowlan/...
[11:04:44] FIRING: LiftWingServiceErrorRate: ...
[11:04:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:15:17] eswikiiiii
[11:15:36] lol
[11:24:44] RESOLVED: LiftWingServiceErrorRate: ...
[11:24:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=eswiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:28:45] the max replicas for eswiki-damaging are set to 4, but autoscaling did not scale it. I only saw 1 replica
[11:32:35] I've saved the eswiki kserve logs. will check it later
[12:08:45] I'm having a look-see at the graphs, see if there is anything different than usual
[12:09:33] Looks very burst-y
[14:16:44] Machine-Learning-Team, Patch-For-Review: Tweak partman recipe for ML k8s workers - https://phabricator.wikimedia.org/T365971#9838033 (klausman)
[14:18:03] Machine-Learning-Team, Patch-For-Review: Allow setting huggingfaceserver cmd args from deployment-charts - https://phabricator.wikimedia.org/T365842#9838046 (klausman)
[14:20:39] Machine-Learning-Team: Append wikitech link and contact info to revscoring model servers - https://phabricator.wikimedia.org/T365834#9838049 (klausman)
[14:27:58] Machine-Learning-Team, Patch-For-Review: Run load tests for the rec-api-ng and update production resources to meet expected load - https://phabricator.wikimedia.org/T365554#9838060 (klausman)
[15:04:37] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9838207 (klausman) - Mistral crashlooping, startup checks usually 5m , so we bumped to 10m, but it didn't help - Bert model works, so likely Mi...
[15:05:58] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9838209 (klausman) - we had another instance of high lat (eswiki) - logs show fetch features being slow (extract_cache) - we...
[15:57:06] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, MediaWiki-Recent-changes, and 2 others: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#9838417 (Samwalton9-WMF) We're not sure what the steps are to launch this on id.wiki. The ORES exte...