[06:44:28] Good morning folks!
[07:23:36] gry: we don't have an LLM running in production that one could use, but we're working on it in https://phabricator.wikimedia.org/T371395. Unfortunately no access can be provided at the moment, as it is only available through our internal infrastructure.
[08:30:12] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143366 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l...
[08:43:36] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143390 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-lab10...
[08:47:47] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143392 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l...
[09:10:04] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-lab10...
[09:12:49] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1002 for host ml-l...
[09:41:47] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143519 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1002 for host ml-lab10...
[09:42:36] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10143533 (klausman)
[09:56:16] Lift-Wing, Machine-Learning-Team: Transfer article-descriptions model to article-models ns - https://phabricator.wikimedia.org/T374697 (isarantopoulos) NEW
[10:17:45] * klausman lunch
[10:35:38] * isaranto lunch
[11:08:12] Machine-Learning-Team, Temporary accounts: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#10143793 (kostajh) This is done, as far as I know, so I am marking it resolved. Expect to see some calls the endpoi...
[11:08:13] Machine-Learning-Team, Temporary accounts: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#10143794 (kostajh) Open→Resolved
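For readers following the revert-risk thread above: both models are reachable over plain HTTP. A minimal sketch of calling the language-agnostic model for an existing revision through the public API Gateway; the endpoint path and the {rev_id, lang} payload follow the public Lift Wing documentation as I recall it, so treat both as assumptions to verify:

```python
import requests

# Assumed public Lift Wing endpoint for the language-agnostic revert-risk
# model; verify the path and payload against the Lift Wing / API Portal docs.
URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/revertrisk-language-agnostic:predict"
)

resp = requests.post(URL, json={"rev_id": 12345, "lang": "en"}, timeout=30)
resp.raise_for_status()
print(resp.json())  # expected: a prediction with revert probabilities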
[11:26:32] hello
[11:36:35] (PS2) AikoChou: locust: entry for reference_quality model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902)
[11:40:58] isaranto: why the hugging face thing and not some other? is there some note published on that?
[11:42:27] gry: because it is a viable way to host models in a production environment, using the huggingfaceserver from kserve https://github.com/kserve/kserve/tree/master/python/huggingfaceserver
[11:43:27] isaranto: is it the same as "Hugging Face Transformers"?
[11:43:27] (PS3) AikoChou: locust: entry for reference_quality model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902)
[11:44:33] and our infra is already using kserve. By huggingface models we mean models that exist in the HF repository https://huggingface.co/models. Almost all open source models end up there, so it is a good place to start
[11:44:48] isaranto: is there something like what you want that is available to the public to test and play with elsewhere?
[11:46:01] would you be making all these models available or only some?
[11:50:32] gry: I'm not sure if there is anything publicly available elsewhere, but one could test locally with tools like ollama https://github.com/ollama/ollama.
[11:50:32] Only some models, and not with public access - at least in the beginning, due to safety concerns and resources
[11:52:00] Machine-Learning-Team, Temporary accounts: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#10143922 (Strainu) I see an opportunity to use this endpoint for evaluating new pages. Any opinions @achou ?
[11:52:18] isaranto: i thought more like "gry, you can visit gemini.google.com now, it will be the same as the gemini we will be hosting"
[11:52:39] isaranto: do you anticipate having ollama available early on?
[11:53:13] i would like to code something that uses an LLM, preferably one similar to what you will have
[11:56:53] gry: got it! So, ollama is something you could try locally or on a test server to host some LLM from huggingface, so you can try it with your app. We're not sure which LLM we will host in the end. This is a candidate, for example: https://cohere.com/blog/aya23 - as we're interested in multilingual efforts, not English only.
[11:57:22] yes i will need multilingual too
[11:57:29] but as I said, unfortunately we won't be able to provide outside access - definitely not in the early stages of the project
[11:57:51] i suspect you might need to host multiple ones, as each wiki might have different goals in mind
[11:58:21] so you could start asking them soon which llms or which goals they would be most interested in
[11:58:54] ack!
[11:59:08] tasks at e.g. wiktionary can be very different from wikipedia or Wikimedia commons
[11:59:48] aiko: do you have the human-readable table from the ref need load test? I was thinking maybe we can start adding these to a README which we can revisit later
[12:02:29] gry: you're totally right! There is already work done by other folks on narrowing down use cases and models. For example, this one here: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/FY24-25_WE.3.1.3_content_simplification
[12:03:40] isaranto: this looks good. so for you i would suggest actively reviewing the needs of smaller wikis.
[12:04:47] isaranto: for me i need advice on what i can use, preferably already hosted elsewhere, that is likely to be offered by you in future, that can handle the task of identifying locations and other analysis based on source texts.
[12:06:14] isaranto: so i can do code for my app in alpha stage using some llm hosted somewhere (or easy to install while only knowing debian, and not kserve or vagrant) and without needing to do a complete rewrite when you are ready to provide public access.
[12:07:35] definitely! knowledge equity is important. I'm sorry, but I don't know which model we'll be hosting. My advice would be to host a small version of a model (e.g. llama3). Again, as I said, we don't plan to provide public access in the near future, so don't build it with that in mind
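To make the "test locally with ollama" suggestion concrete, a minimal sketch, assuming an ollama server is running on its default port 11434 and a model has already been pulled (e.g. `ollama pull llama3`); the helper name and prompt are illustrative, not anything Lift Wing will expose:

```python
import requests

# Assumes `ollama serve` is running locally and `ollama pull llama3` was done.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ask_llm(prompt: str, model: str = "llama3") -> str:
    """Send a single prompt to the local ollama server and return its reply."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]


if __name__ == "__main__":
    print(ask_llm("List the place names mentioned in: 'Paris is close to Versailles.'"))
```

Swapping the model name is the only change needed to prototype against whichever multilingual model ends up being hosted.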
[12:16:47] isaranto: yes I have it. Okay! I'll add it to a README under the ref-quality dir
[12:17:22] isaranto: thank you. i will subscribe to that phab ticket.
[12:17:46] aiko: thanks! Other than that, the patch looks good!
[12:17:48] isaranto: http://www.wikimedia.org has a list of sister projects.
[12:18:16] isaranto: ping me when you want me to test something, i am happy to help.
[12:18:57] gry: Will do! Nice chatting, and thank you for your contributions <3
[12:18:59] <3
[12:19:15] :-)
[12:27:39] isaranto: thanks for the review. I'm testing the model with more cpu in exp and checking the latency. it was at the default value of 1 cpu; like RRML, we give it 4 cpus.
[12:28:10] (CR) Ilias Sarantopoulos: [C:+1] locust: entry for reference_quality model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[12:29:35] Machine-Learning-Team, Temporary accounts: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#10144068 (kostajh) >>! In T356102#10143922, @Strainu wrote: > I see an opportunity to use this endpoint for evaluat...
[12:31:26] isaranto: so 1 is clearly not enough
[12:31:53] ack!
[12:32:34] I think the inputs are ok. Perhaps we can get another set so that we can cross-check
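For readers following the load-test thread: a locust entry for a model is essentially a user class that replays predict requests against the inference host. A minimal sketch of the shape such an entry can take; the class name, endpoint path, and payload below are illustrative assumptions, not the actual contents of the reference_quality entry in the repo:

```python
from locust import HttpUser, between, task


class ReferenceQualityUser(HttpUser):
    """Illustrative load-test user; run with e.g.
    `locust -f this_file.py --host=https://<inference-host>` (host is an assumption)."""

    wait_time = between(1, 3)  # seconds each simulated user waits between tasks

    @task
    def predict(self):
        # Hypothetical kserve-style predict call; path and payload are assumptions.
        self.client.post(
            "/v1/models/reference-quality:predict",
            json={"rev_id": 12345, "lang": "en"},
        )
```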
[13:10:04] * isaranto afk - bbl
[14:08:09] Machine-Learning-Team, Temporary accounts: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#10144392 (diego) >>! In T356102#10143922, @Strainu wrote: > I see an opportunity to use this endpoint for evaluatin...
[14:21:41] we have two input sets, sample_all.input and sample_top_view.input
[14:27:38] https://phabricator.wikimedia.org/P69115
[14:28:01] ---^ results for 1, 2, 4 cpus
[15:05:11] at least there is a vast improvement with more cpus!
[15:05:46] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10144567 (Jclark-ctr) Open→Resolved
[15:07:44] FIRING: LiftWingServiceErrorRate: ...
[15:07:50] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=fiwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:08:21] Friday afternoon alert, yayyyyyy!
[15:12:14] always the best
[15:12:36] eq-damage, fiwiki
[15:12:51] It seems it already recovered
[15:13:27] ack, Murphy would approve!
[15:13:31] I found the logs with the error
[15:15:23] seems it was a 503 while fetching data from mwapi
[15:20:39] Latency is picking up again :-/
[15:29:06] I don't think it's an MWAPI 503, but our own. The pattern looks pretty much like the big-wikitext-causes-slowdown we've seen
[15:36:20] probably so; here are some logs, trying to repro atm: https://phabricator.wikimedia.org/P69116
[15:37:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:37:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=fiwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:45:31] isaranto: should we turn on MP, add some resources, and hope it gets us over the weekend?
[15:45:56] I found one revid, looking for more
[15:46:02] The latency is going down again, but who knows for how long
[15:46:20] but yes, I'd do it. I'll open a patch, ok?
[15:46:24] sgtm
[15:51:34] klausman: in the attached logs I do see the 503, which is different from the standard case we had in the past (like in this old/parent task -> https://phabricator.wikimedia.org/T363336)
[15:51:47] ack
[15:52:56] the issue is within get_extractor_cache, but afaiu it seems that it is a network issue, not time spent on cpu
[15:54:58] something like a packet drop?
[15:58:00] dunno! but it is failing to fetch the features from mwapi. I don't recall what we do with retries, so I'm looking at that
[16:03:22] ack
[16:17:21] I don't see any retry logic in https://github.com/mediawiki-utilities/python-mwapi/blob/master/mwapi/async_session.py
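Since python-mwapi's AsyncSession has no built-in retries, one option is to add backoff at the caller. A minimal sketch of that approach; the wrapper is an illustration, not code from the repo, and which exception types a failed fetch surfaces as (aiohttp's vs. mwapi's own wrappers) is an assumption here:

```python
import asyncio

import aiohttp
import mwapi


async def get_with_retries(session: mwapi.AsyncSession, max_attempts: int = 3, **params):
    """Retry a MediaWiki API GET on transient failures with exponential backoff.

    The exception types caught below are an assumption; adjust the except
    clause to whatever errors mwapi actually raises on a 503.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await session.get(**params)
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise
            await asyncio.sleep(2 ** attempt)  # back off 2s, 4s, ... before retrying
```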
[16:19:42] just realized it is Friday the 13th
[16:20:14] lol
[16:22:45] It's not a bad omen if you only realize it afterwards ;)
[16:27:05] klausman: do you have some time to help me with https://phabricator.wikimedia.org/T370149#10140125 on Monday or Tuesday?
[16:29:15] I'll make sure to provide more updates on the task. In summary, I'm struggling to resolve issues related to ROCm failures. There are "official" instructions for a fix for `libamdhip64.so`, but once I apply that I run into other issues
[16:29:40] anyway, sorry for the info dump!
[16:31:20] Sure, can do!
[16:43:42] ty!
[16:46:59] (PS4) AikoChou: locust: entry for reference_quality models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902)
[16:54:58] (PS5) AikoChou: locust: entry for reference_quality models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902)
[16:55:51] (CR) AikoChou: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[16:56:47] (CR) AikoChou: [V:+2 C:+2] locust: entry for reference_quality models [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1072541 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[16:59:26] https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/test/locust/models/reference_quality/
[16:59:39] the format is not working properly lol. I'll submit a patch to fix it
[18:03:31] * isaranto afk - have a nice weekend