[07:21:46] Good morning!
[07:53:09] making another attempt with mistral-7b-instruct this time https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018633
[08:36:28] https://docs.nixtla.io/docs/anomaly_detection - really really nice :D
[08:36:33] (hello :)
[08:39:28] o/
[08:42:29] It seems really nice, but it is a paid service :(
[08:44:17] aaaand mistral got OOMkilled
[08:55:40] yes yes, https://github.com/time-series-foundation-models/lag-llama is open
[08:55:46] but not so powerful
[08:56:13] it was to point out that transformers working on time series are also interesting
[08:59:13] ah yes! I've done a lot of work with transformers and time series in the past, but not with foundation models, just for specific use cases. What seems really nice in the library though is how easy it is to do things (predict, plot, etc.)
[08:59:50] turns out my OOM errors were because I was using 2Gi in the pod due to wrong indentation in the yaml https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018646
[09:00:59] however it is possible that this will fail again, due to the copying of the model. Let's see...
[09:21:01] morning o/
[09:23:50] Guten Tag aiko o/
[11:43:43] * isaranto lunch!
[12:30:32] o/
[12:33:04] \o
[12:33:31] I'm battling with resources on liftwing. Managed to run mistral-7b with 35Gi memory!
[12:34:19] Machine-Learning-Team, serviceops, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638#9703617 (JMeybohm)
[12:37:37] Machine-Learning-Team, serviceops, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638#9703621 (JMeybohm) Unfortunately version 1.4.3 of mesh.configuration still uses `uses_ingress` in one if-block. So the initially assumed version requirem...
[12:41:32] isaranto: oh my, 35..
[12:41:41] do you need any help? If so ping :)
[12:42:33] I will if needed, now that I can change the isvc in the experimental namespace I do it myself manually :)
[12:43:35] with some modifications in the huggingfaceserver (similar to the ones we did in article-descriptions and the llm image) we could cut the memory requirements roughly in half. I'll test it
[13:22:23] aiko: o/
[13:22:42] I was checking https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-component=All&var-namespace=revertrisk&var-model_name=revertrisk-language-agnostic&from=now-30d&to=now and I noticed that around mid-March the RR-LA's preprocess latency went up considerably
[13:23:05] did we change anything in that regard?
[13:30:48] * elukey wishes we had SLO dashboards working
[13:40:58] this is the upgrade that happened last week (the previous one was in December) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1014545/
[13:42:28] with the upgrade to knowledge_integrity v0.6 some increased latencies were expected, since we added validation using pydantic. However, the observed increases in Grafana do not align with the time of the deployment (the Grafana spikes started around ~19/3, while the deployment took place on 4/4)
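For context on the "cut the memory requirements to half" idea mentioned earlier ([12:43:35]): a 7B-parameter model stored as float32 needs roughly 28 GiB just for the weights, while half precision brings that down to roughly 14 GiB. Below is a minimal sketch of that kind of change, assuming the standard Hugging Face transformers API; the model id and flags are illustrative, not the actual huggingfaceserver patch.

# Illustrative sketch only, not the actual huggingfaceserver change:
# loading a 7B model in bfloat16 instead of float32 roughly halves the
# memory needed for the weights (~14 GiB vs ~28 GiB).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights instead of fp32
    low_cpu_mem_usage=True,      # avoid materialising a full extra copy while loading
)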
[15:09:37] "On Wednesday March 20th 2024, the SRE team will run a planned datacentre switchover, moving all wikis from codfw to eqiad." >>> could be this?
[15:11:23] elukey: --^ I'm wondering if it could be related to the datacentre switchover
[15:12:58] iiiinteresting
[15:15:08] aiko: take a look at https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revertrisk&var-backend=api-ro.discovery.wmnet&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-30d&to=now
[15:15:21] this should be the latency to api-ro as measured by the RR pods
[15:15:24] aha! the timing matches the switchover exactly. nice aiko!
[15:15:41] p75+ clearly increases
[15:16:06] yes indeed, nice one!
[15:17:05] the fact that p99 doesn't increase accordingly is weird
[15:18:42] worth opening a task, not really sure why we pay a penalty
[15:24:57] I think p99 does increase. there are some big outliers so it's not so obvious
[15:27:21] aiko: right right, the other one is .95 (not .75, my bad) and it jumps from ~50ms to ~80ms
[15:27:34] so yes yes, the latency towards the mwapi increased from the RR point of view
[15:27:56] we should hit eqiad though
[15:34:52] okok so p75 was on the kserve latencies, p95 on the istio metrics
[15:34:54] * elukey brainfault
[15:35:43] I still don't get what happened though, since api-ro.discovery.wmnet should be active-active, namely available in both eqiad and codfw
[15:35:58] so from kserve in eqiad, api-ro.discovery.wmnet should resolve to the eqiad endpoint/IP
[15:36:08] for us it shouldn't have changed anything
[15:36:16] does it make sense?
[15:38:23] yeah makes sense. it's weird that we don't hit eqiad
[15:38:55] or we are still hitting eqiad but something is adding the latency
[15:39:38] interesting
[15:50:10] Going afk folks, have a nice evening!
[15:51:48] bye Ilias! have a nice rest of the day :)
[15:55:24] o/
[16:23:36] going afk folks!
[16:23:40] have a nice rest of the day!
[16:33:18] o/ bye Luca!
[18:18:31] (PS9) Kosta Harlan: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[18:21:37] (CR) Kosta Harlan: Exclude first/only revision on page from scoring (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[18:23:02] (CR) Kosta Harlan: [C:+1] update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
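On the api-ro.discovery.wmnet question discussed above ([15:35:43] onwards): a quick way to check is to resolve the name from inside a RR pod, since for an active-active discovery record the answer should point at the local datacenter (eqiad for the ml-serve eqiad cluster). This is only a sketch assuming the pod uses the cluster's normal resolver, not an established debugging step.

# Sketch: resolve api-ro.discovery.wmnet from inside a pod and print the
# resulting IPs, to confirm which datacenter endpoint the RR pods are sent to.
import socket

addrs = sorted({info[4][0] for info in socket.getaddrinfo("api-ro.discovery.wmnet", 443)})
print(addrs)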