[04:02:02] Machine-Learning-Team, ORES, FY2023-24-WE 2.1 Typography and palette customizations, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Web-Team-Backlog (FY2024-25 Q1 Sprint 1): Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9975741 (Edtadros)
[04:04:23] Machine-Learning-Team, ORES, FY2023-24-WE 2.1 Typography and palette customizations, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), and 2 others: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9975744 (Edtadros) ### Test Result - Prod **Status:** ✅ PASS...
[04:05:26] Machine-Learning-Team, ORES, FY2023-24-WE 2.1 Typography and palette customizations, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), and 2 others: Special:ORESModels doesnt work in night theme - https://phabricator.wikimedia.org/T366379#9975747 (Edtadros)
[05:44:22] o/
[06:58:38] I sent a patch to enable mp for arwiki-damaging that fired the alert https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1053835
[06:58:55] this would enable it only for large revisions (above 100KB)
[06:59:12] perhaps we could enable it if it fires again (this was the second time)
[07:43:37] Machine-Learning-Team, Patch-For-Review: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#9975861 (kevinbazira)
[07:45:48] isaranto: o/ morning
[07:46:15] the arwiki-damaging lgtm. i've +1'd
[07:47:22] also, the article_descriptions model-server migrated to the src dir is up and running in prod (both eqiad and codfw): https://phabricator.wikimedia.org/P66378
[08:02:53] Hey Kevin!
[08:02:56] Nice!
[08:22:47] kevinbazira: regarding nsfw I suggest we don't add support in the Makefile. the model isn't even deployed so we can focus on other things. wdyt?
[08:23:22] isaranto: sounds good to me.
[08:23:30] should we move it to the src dir?
[08:24:30] we could move it, but imo that is not urgent either
[08:25:03] okok
[09:49:57] Morning!
[09:50:38] I'll be pushing the policy update (and only that, i.e. not the example deletion) to eqiad in a bit. There should be no disruption, but as usual, please report any oddities
[09:50:47] morning Tobias o/
[09:54:15] hey Ilias \o
[09:58:37] klausman: we had an alert for arwiki-damaging yesterday. shall we enable multiprocessing for that one since it is the 2nd time it has fired? or shall we wait and do it if we have another one?
[09:59:31] On the one hand, it would be a Friday change. On the other hand, we've done it before, and it never broke™
[10:04:33] I can even do it on Monday. I'm just trying to answer the question "after how many alerts shall we enable multiprocessing?", since we use more resources when doing so
[10:05:11] The old adage is "Once is happenstance, twice is coincidence, three times is enemy action"
[10:05:50] So we might go with 3x by default, unless it's a high-value service
[10:06:20] I agree, cause also third time's the charm (although this quote has a different meaning :P )
[10:06:46] kevinbazira: you mentioned you use the ml-sandbox regularly. For what specifically? I am trying to figure out if we are still using the whole minikube setup.
[10:10:28] klausman: o/ I no longer use minikube. since the ml-sandbox is way faster than my local environment, I use it to build and test model-servers before making patches or when reviewing patches.
[10:10:31] Machine-Learning-Team: Update kserve and knative-serving charts for new-style Calico network policies - https://phabricator.wikimedia.org/T365479#9976106 (klausman) All pushed and working.
[10:11:01] kevinbazira: ah, so similar to how some people use statboxes (but there you wouldn't have docker, I presume)
[10:11:24] yep, no docker on statboxes :)
[10:12:18] elukey: I am wondering what to do with the ml-sandbox, since it's still on buster. We don't really use minikube anymore, as far as I can tell. We might make a new bookworm vm of similar size (disk, ram, cpu) and install the docker necessities there. You have more experience with the old sandbox, wdyt?
[10:13:16] hmm I could use ml-sandbox for testing llms with the hf image. had forgotten about that (so I don't have to deal with mac/m1 issues)
[10:18:02] also a good idea
[10:18:20] just no GPU :) and I dunno how fast the CPU-mode would be
[10:19:24] also true, at least if there is space I can test smaller models
[10:19:35] yeah, totally agreed.
[10:20:01] It also has the upside of good networking, if-when you have to fetch big images or models
[10:25:25] Since we can't re-use the machine name (I think), what would be a good new name?
[10:27:46] ml-testing, ml-playground
[10:28:32] I like testing a bit more than playground, but could be convinced either way
[10:39:40] ok, testing it is
[10:43:32] klausman: I think we should re-think about https://phabricator.wikimedia.org/T305447
[10:43:51] if we upgrade to something else
[10:44:08] Well, as far as I can tell, wikikube is not used by anyone anymore.
[10:44:39] Kevin mentioned using it for image building, but not wikikube (if I understood correctly)
[10:45:16] this is something that the ML team needs to decide, but minikube is useful to test the whole stack (or, an approximation of it) before staging
[10:45:28] if it is not worth it anymore, then it is good to be nuked
[10:46:48] we have to update to bookworm anyway, so I created a VM already. I'll keep the old one around as long as possible.
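Note on the threshold discussed at 10:04-10:06: the rule of thumb proposed there (enable multiprocessing on the third alert by default, with an exception for high-value services) could be sketched roughly as below. This is only an illustration of the idea. The function name, the `high_value` flag, and in particular the lower threshold of 1 for high-value services are assumptions; the chat does not say how early a high-value service should be switched over.

```python
def should_enable_multiprocessing(
    alert_count: int,
    high_value: bool = False,
    default_threshold: int = 3,      # "3x by default", as suggested in the chat
    high_value_threshold: int = 1,   # assumption: not specified in the chat
) -> bool:
    """Return True once a service has alerted often enough to justify
    the extra resources that multiprocessing uses."""
    threshold = high_value_threshold if high_value else default_threshold
    return alert_count >= threshold


# arwiki-damaging had fired twice at the time of the discussion.
print(should_enable_multiprocessing(2))
print(should_enable_multiprocessing(3))
```

Keeping both thresholds as parameters makes the trade-off explicit: a lower number reacts faster but spends more resources on services that may have alerted by coincidence.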
[10:50:34] no real need, if nobody uses the sandbox with minikube then nuke it
[10:50:39] less garbage to keep around :)
[10:50:53] and we still have the setup instructions on Wikitech
[10:51:09] Now I just need to figure out how to best migrate the /srv volume
[10:51:47] Normally I'd create a new one of same size and let people copy over what they need, but we're out of disk quota for the project
[10:52:15] yeah I'm ok with nuking it as well since no one is using it
[10:52:20] * isaranto lunch!
[11:01:30] kevinbazira: if it's ok with you, I'd shut down the sandbox, and attach the old /srv volume (docker, homedirs) to the new machine and set everything up. But it would mean no access for a few hours at least
[11:02:23] klausman: yes please go ahead. thanks!
[11:28:45] ok, I think the basic setup is done, feel free to give it a try (at ml-testing.machine-learning.eqiad1.wikimedia.cloud)
[11:31:28] docker is installed and the images and containers should be copied over (but none running, of course)
[11:32:10] ok, I'll give it a try in a bit
[11:32:48] Oh, homedirs are of course also there, since they live on /srv
[11:33:10] I'm gonna have lunch, but I'll keep an eye here if anything is needed.
[11:36:07] ack!
[11:36:21] we can chat afterwards if anything is needed, enjoy your lunch!
[11:52:01] klausman: I've been able to ssh into ml-testing and all lgtm. thank you for transferring docker and homedirs. the only change I've noticed so far is that now I have to use `sudo docker ps -a` instead of just `docker ps -a`, which is fine with me :)
[11:52:37] same here! no permission for docker, everything else works fine
[11:53:36] I'll see if that can be made to work as it used to (probably)
[12:10:38] thank you <3
[12:23:10] Lift-Wing, Machine-Learning-Team: [httpbb] fix failing httpbb test in production enwiki-articletopic - https://phabricator.wikimedia.org/T363334#9976336 (isarantopoulos) I've ran all tests for staging and prod. Staging is fine but I get this error on prod eqiad which seems transient: ` httpbb --host infe...
[12:26:46] (PS1) Ilias Sarantopoulos: docs: update httpbb instructions in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053906
[12:36:29] Alright, it's fixed, no more sudo necessary
[12:36:58] Fix was to add `"group": "wikidev"` to `/etc/docker/daemon.json`
[12:38:21] ack, thanks Tobias!
[12:41:33] super! danke Tobias!
[13:00:34] Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation): Cloud VPS "machine-learning" project Buster deprecation - https://phabricator.wikimedia.org/T367537#9976479 (klausman) We have decided to create a new VM, ml-testing. Since the odler use case of wikikube is not really relevant anymore, all tha...
[13:30:01] (CR) Kevin Bazira: [C:+1] docs: update httpbb instructions in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053906 (owner: Ilias Sarantopoulos)
[13:51:49] Machine-Learning-Team, Content-Transform-Team, Research: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9976639 (isarantopoulos) I see you've done a lot of great work on feature engineering and preprocessing so I don't mean to interfere with your work! My suggestion...
[13:51:56] (CR) Ilias Sarantopoulos: [V:+2 C:+2] docs: update httpbb instructions in README.md [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053906 (owner: Ilias Sarantopoulos)
[14:03:44] Lift-Wing, Machine-Learning-Team: [httpbb] fix failing httpbb test in production enwiki-articletopic - https://phabricator.wikimedia.org/T363334#9976668 (isarantopoulos) Open→Resolved a:isarantopoulos Resolving this as it can't be reproduced.
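Note on the Docker fix reported at 12:36:58: the `group` key in `/etc/docker/daemon.json` sets which Unix group owns the Docker socket, which is why members of `wikidev` no longer need `sudo`. A minimal sketch of adding that key without clobbering other settings is below. The helper name and the sample `data-root` value are assumptions made for illustration (the chat only says that Docker data lives on /srv), and the daemon would typically need a restart before the change takes effect.

```python
import json


def with_socket_group(config_text: str, group: str) -> str:
    """Return daemon.json content with the "group" key set,
    preserving any other keys that were already present."""
    config = json.loads(config_text) if config_text.strip() else {}
    config["group"] = group
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


# Hypothetical existing config; the real file on ml-testing may differ.
existing = '{"data-root": "/srv/docker"}'
print(with_socket_group(existing, "wikidev"))
```

Working on the text rather than the file keeps the sketch side-effect free; writing the result back to `/etc/docker/daemon.json` would need root.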
[14:15:11] (CR) Klausman: [C:+1] docs: update httpbb instructions in README.md (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053906 (owner: Ilias Sarantopoulos)
[14:34:49] Logging off folks, have a nice weekend!
[14:43:53] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9976791 (AUgolnikova-WMF)
[15:44:07] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9977032 (mfossati) > @matthiasmullie wrote: Hi @kevinbazira; we fina...
[17:02:44] FIRING: LiftWingServiceErrorRate: ...
[17:02:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=enwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:22:09] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.12; 2024-07-02), Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard - https://phabricator.wikimedia.org/T364551#9977508 (kevinbazira) Great to see that the Media Detection API is n...
[17:22:44] RESOLVED: LiftWingServiceErrorRate: ...
[17:22:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=enwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:31:15] FIRING: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[17:31:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[17:51:15] RESOLVED: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[17:51:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[20:58:47] Great alerts on a Friday night :)
[20:59:29] Luckily they were resolved, but I'll take a look tomorrow
[22:40:28] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2024-25): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#9978595 (JTannerWMF)