[00:59:07] FIRING: [4x] ErrorBudgetBurn: liftwing - liftwing-revscoring-latency - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:27:18] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10395668 (MunizaA) >>! In T371344#10392746, @isarantopoulos wrote: > > I tried to install the wheel from this in a new env and although it installs it cant be used > ` > ImportError: /home/isarant...
[01:53:00] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[02:44:07] FIRING: [8x] ErrorBudgetBurn: liftwing - liftwing-revscoring-latency - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:49:26] kevinbazira: o/ do you know any prebuilt aya-expanse-32b GPTQ quantized model? the only one I could find on huggingface is this one: https://huggingface.co/2z299/aya-expanse-32b-GPTQ-4bit
[02:49:54] I plan to run benchmark across the aya32b, awq, and gptq models, similar to the llama 8b experiment from yesterday's meeting
[03:59:08] FIRING: [8x] ErrorBudgetBurn: liftwing - liftwing-revscoring-latency - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:11:37] (Abandoned) Santhosh: performance: Use asynchronous iterator for fetching from collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1100055 (https://phabricator.wikimedia.org/T381366) (owner: Santhosh)
[05:16:46] aiko: o/ looks like the one you found is the only one available at the moment: https://huggingface.co/models?other=gptq&sort=trending&search=aya+32b
[05:36:37] Machine-Learning-Team: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10395856 (kevinbazira) >>! In T371344#10394886, @isarantopoulos wrote: > @kevinbazira could you mention the process you followed as well as the environment on which you built it? (python+pytorch v...
[05:53:00] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:59:07] FIRING: [6x] ErrorBudgetBurn: liftwing - liftwing-revscoring-latency - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:06:37] hello!
[09:27:09] morning!
[09:49:52] I have silenced the ErrorBudgetBurn alerts
[09:53:00] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:11:42] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10396327 (brouberol) a:brouberol
[10:49:17] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10396504 (brouberol) ` brouberol@cephosd1001:~$ sudo radosgw-admin user create --uid=postgresql-airflow-ml --display-n...
[10:52:57] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10396515 (brouberol) ` brouberol@stat1008:~$ s3cmd --access_key=$access_key --secret_key=$secret_key --host=rgw.eqiad....
[11:03:16] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10396557 (brouberol) ` brouberol@krb1001:~$ sudo kadmin.local addprinc -randkey airflow/airflow-ml.discovery.wmnet@WIK...
[11:06:24] kevinbazira: regarding the API gw: do we need to add the model to the api gw?
[11:06:37] if it is going to be used internally we can skip it for now
[11:06:59] isaranto: oh, btw, thanks for taking care of the SLOs alerts
[11:07:47] isaranto: earlier today, I pushed a patch to add it to the api gw: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1102150
[11:08:04] isaranto: o/ have anybody followed up with Keith about the SLO alerts and next stepS?
[11:08:14] I see that he is onboarding rec-api-ng
[11:09:57] yes I have. We are going to go with pyrra. We'll need to work on adjusting the thresholds on the puppet repo
[11:11:11] kevinbazira: perhaps you misunderstood my question. I wasn't asking you to do it, I was asking if it is something that we need to do
[11:25:10] isaranto: ah ok nice! Did they fix https://phabricator.wikimedia.org/T352756 ?
[11:25:28] I mean, we are going to see the same in my opinion
[11:25:36] for big namespaces like revscoring at least
[11:26:13] I don't know if it is fixed yet. I'll need to follow up on that
[11:27:28] I can help in case, I followed the problem a lot and we added all the workarounds on our side (including dropping extra istio labels etc..)
[11:27:54] ideally it would be nice to have this for all k8s services using the istio ingress, wikikube ones included (the ones like rec-api-ng and ores-legacy basically)
[11:30:58] isaranto: you can see the issue via https://slo.wikimedia.org/ -> revert risk lang agnostic availability -> 4weeks
[11:31:06] ok, I'll need to re-read the tasks and then ping you cause tbh I don't remember the details atm
[11:31:24] from Nov 26 to Dec 8/9 there is a gap
[11:31:35] aha
[12:53:51] * isaranto afk lunch
[13:17:04] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902#10397020 (klausman) The management interface works now, for unclear reasons. Maybe it just took forever to recover from reset(s)? It's all ver...
[13:17:33] kevinbazira: the tests for art-country should be visible on the deployment server
[13:18:46] klausman: danke! 🙏
[13:19:08] gern geschehen :)
[13:25:58] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397032 (brouberol) The following group members will get the `Op` Airflow role: https://ldap.toolforge.org/group/airf...
[13:28:18] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397037 (brouberol) Both airflow and the cloudnative PG cluster were deployed ` brouberol@deploy2002:~$ kubectl get p...
[13:41:13] kevinbazira: o/ I tried the aya-8b gptq model you built and it performed super fast! https://phabricator.wikimedia.org/P71700
[13:41:45] the prompt you used only takes 2.14 s
[13:43:06] aiko: nice!
[13:43:42] don't know what your setup was. I pasted the steps I installed gptq here https://phabricator.wikimedia.org/P71700#287363
[13:45:05] I pip installed auto_gptq, instead of building it the way you did. that's probably where the issue stemmed from
[13:51:14] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: [SPIKE] How could we add topic filtering to Recent Changes? [16H] - https://phabricator.wikimedia.org/T381569#10397116 (Samwalton9-WMF)
[13:51:14] how did u install it? using the rocm 5.7 wheel?
[13:51:19] I'm referring to this pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
[13:51:52] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906#10397117 (Samwalton9-WMF)
[13:51:59] in the docs it says there are only prebuilt wheels for rocm 5.7 so for 6.1 we'd have to build it from source
[13:55:03] folks please put your findings with clear instructions on the phabricator tasks. pastes are nice for quick sharing but if we don't also report on the task that information gets lost (along with the work)
[13:55:46] I was following the HF docs: https://huggingface.co/docs/transformers/v4.47.1/en/quantization/gptq#gptq
[13:55:46] so I used: https://phabricator.wikimedia.org/P71473$1
[13:58:07] I'll build it from source and give it another try
[13:59:20] okk
[14:00:06] kevinbazira: you don't have to run it again
[14:00:59] let's just look for explicit ROCm/AMD instructions in each package. e.g. on GH you'll be able to see how to install for each backend https://github.com/AutoGPTQ/AutoGPTQ
[14:01:30] I'll have to ... since I want to run other tests with gptq + fa2
[14:01:56] and gptq + exllama2, etc
[14:02:25] ok!
[14:02:49] then let's produce the heatmaps for all these using llmperf
[14:03:21] there are instructions here -> https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/blob/mnz/llmperf/llmperf/README.md?ref_type=heads
[14:04:18] anyway don't want to interrupt your flow atm we can discuss about this in 1h
[14:36:46] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397267 (brouberol) Open→Resolved All done! {F57799575} The URL is https://airflow-ml.wikimedia.org For all of y'all who are...
[14:59:16] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397346 (isarantopoulos) Thanks @brouberol ! I'm getting an error when check the k8s namespace for ariflow-ml. ` kube_env airflow-ml ds...
[15:16:24] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397451 (brouberol) Ah, it does, but it's owned by `root` and `analytics-deployers`, which you might not be a member of. Let me see whe...
[15:19:29] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397460 (brouberol) @isarantopoulos I've temporarily chown the user config files to `root:deploy-ml-service`. Can you confirm that it w...
[15:22:36] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397478 (BTullis) >>! In T380258#10397451, @brouberol wrote: > Ah, it does, but it's owned by `root` and `analytics-deployers`, which y...
[15:28:50] Machine-Learning-Team, ORES, MediaWiki-extensions-WikimediaEvents, Patch-For-Review: Emit revertrisk scores to statsd and plot in Grafana - https://phabricator.wikimedia.org/T356158#10397522 (kostajh) a:kostajh→None
[15:29:21] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397525 (isarantopoulos) I can now use the configuration but it throws an error: ` isaranto@deploy2002:~$kube_env airflow-ml dse-k8s-e...
[15:34:59] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397542 (BTullis) We have used `analytics-deployers` for all of the other instances, but even then I think that it's mainly #data-platf...
[15:48:27] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10397570 (kevinbazira) @Isaac, thank you for the confirmation. The article-country inference service is now [[ https://phabricator.wikimedia...
[15:49:02] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397574 (brouberol) @BTullis It just struck me that you need access to the `-deploy` user credentials to use the airflow CLI, as you ne...
[15:59:53] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397601 (BTullis) >>! In T380258#10397574, @brouberol wrote: > @BTullis It just struck me that you need access to the `-deploy` user cr...
[16:01:47] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397622 (isarantopoulos) I was just following the documentation to see if everything works. At the moment I don't need need it for anyt...
[16:04:51] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.30 - 2024.12.20): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10397631 (brouberol) Pleasure!
[17:17:58] (PS1) Nik Gkountas: shuffle recommendations for search and popular cases [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102354
[17:19:40] going afk folks, have a nice evening/rest of day o/
[17:31:50] (PS1) Nik Gkountas: remove support for default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102356 (https://phabricator.wikimedia.org/T374597)
[17:33:43] (PS2) Nik Gkountas: remove support for default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102356 (https://phabricator.wikimedia.org/T374597)
[18:51:56] (PS1) Sbisson: Extra logging in the cache_updater task [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102369 (https://phabricator.wikimedia.org/T381889)
[20:43:15] hi
[22:12:21] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902#10399191 (Jclark-ctr) Open→Resolved