[04:25:04] (CR) Kevin Bazira: [C:+2] revertrisk-wikidata: improve error handling and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202710 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[04:26:15] (Merged) jenkins-bot: revertrisk-wikidata: improve error handling and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1202710 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[05:02:57] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11352405 (kevinbazira) The revertrisk-wikidata model-server has been containerized and inte...
[07:31:15] good morning
[08:14:23] morning
[08:14:59] https://www.irccloud.com/pastebin/uPEhN2OR/
[08:15:28] elukey: what am i missing here?
[08:15:46] trying to update the knative settings first
[08:29:18] is it looking for chartversions somewhere under values/eqiad/ or values/common ?
[08:52:07] dpogorzelski: o/
[08:52:26] so admin_ng has a single shared helmfile.yaml config, which is present in the parent dir
[08:52:37] so if you cd .. and execute the diff you should see it
[08:53:07] you'll likely see a diff for various things; if you want to restrict it to knative you can use `-l name=knative-serving diff`
[08:53:21] one note - before doing prod we usually test in ml-staging-codfw
[08:53:36] and as an unwritten rule we don't deploy changes to production on a Friday :D
[08:53:51] "what could go wrong" :P
[08:53:53] so staging is probably ok for today, I'd wait for monday before proceeding on ml-serve-eqiad
[08:53:57] yeah :D
[08:53:57] kk
[08:55:30] ok, seems good diff-wise, both knative settings and new service
[08:55:53] a bunch of other stuff to be deleted across other parts from within admin_ng
[08:56:04] i'll wait until monday then
[08:56:20] https://drive.google.com/file/d/1r04GY9ueeR2--jqJOTdxUduIjjp285if/view is great stuff btw, best type of content
[08:56:41] can't access :(
[08:57:25] Life of LW request/ ML Team Mid-Week Planning Meeting (2023-11-15 07_23 GMT-8).mp4 it's your presentation :D
[08:58:16] which geoip db/data set do you use?
[08:58:36] maxmind or smth else?
[08:59:05] they removed access!!111!! See how a former ML member is considered :P
[08:59:26] you work a lot for something and then you get excluded :D
[08:59:42] anyway, glad that it was useful!
[09:00:08] re: geoip - yes we use maxmind, if nothing has changed recently
[09:52:59] * dpogorzelski seems https://config-master.wikimedia.org/pybal/eqiad/inference doesn't list 1012. are nodes added manually to this list?
[09:58:44] also dns discovery seems to balance between codfw and eqiad, and you mentioned in the video that choosing to route calls to a specific service into a single location is hard and risky, but I assume that in the current situation with the gpu nodes that will become a necessity
[10:02:44] if the api gateway is envoy could it be solved there? or perhaps it's an issue that was already addressed since the recording of the video? also are there any further followup recordings on this topic?
[10:10:35] good point for config-master!
[10:10:50] so https://wikitech.wikimedia.org/wiki/Conftool#Add_a_server_node_to_a_service is a good read
[10:11:13] we use confd to dynamically modify weights and pooling state of a service behind an LVS load balancer
[10:11:20] in this cae
[10:11:23] *in this case
[10:11:35] elukey@puppetserver1001:~$ sudo confctl select 'name=ml-serve1012.eqiad.wmnet' get
[10:11:35] {"ml-serve1012.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=ml_serve,service=kubesvc"}
[10:12:03] so it is in a list (see conftool-data/node/eqiad.yaml in puppet)
[10:12:25] but its weight is 0 so it is not displayed in config-master
[10:12:57] so in theory it should be sufficient to prep a conftool change to set the weight to 10 and pooled=true
[10:14:34] dpogorzelski: if you want to prepare it I can review it and then you can execute it
[10:17:10] re: discovery - it is handled via DNS and it is an abstraction for the "real" LVS services. For example, inference.discovery.wmnet is abstracting inference.svc.eqiad.wmnet and inference.svc.codfw.wmnet. If you use the eqiad LVS service for example, you'll directly hit the ml-serve-eqiad cluster. Now this could in theory be used at the API gateway level to route requests for a given set of pods (maybe requiring to run on ml-serve1012) only to eqiad
[10:17:25] but as you mentioned it is risky since we cannot really depool a DC etc..
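(Editor's sketch, not from the channel.) The discovery explanation above can be illustrated with a tiny routing table. This is only a sketch of the idea: the three hostnames are the ones quoted in the discussion, while `PINNED_TO_DC`, `upstream_for` and the service name `example-gpu-model` are hypothetical and do not describe how the real API gateway is configured.

```python
# Hostnames quoted in the discussion; everything else here is a hypothetical illustration.
DISCOVERY = "inference.discovery.wmnet"
PER_DC = {
    "eqiad": "inference.svc.eqiad.wmnet",
    "codfw": "inference.svc.codfw.wmnet",
}

# Hypothetical: services whose pods can only run on GPU nodes present in one DC.
PINNED_TO_DC = {"example-gpu-model": "eqiad"}


def upstream_for(service):
    """Return the hostname a gateway could target for the given service."""
    dc = PINNED_TO_DC.get(service)
    # A pinned service bypasses discovery, so depooling that DC would take it down.
    return PER_DC[dc] if dc else DISCOVERY


print(upstream_for("example-gpu-model"))
print(upstream_for("some-other-model"))
```

The trade-off mentioned in the log shows up in the comment inside `upstream_for`: anything pinned to a per-DC name no longer follows the discovery record when a DC is depooled.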
[10:17:55] the main motivation for this situation is that in codfw we don't have enough power to host and rack GPU nodes like ml-serve1012
[10:18:08] (they are 8U IIRC, consuming a lot of kW)
[10:18:35] so at the moment we only have two in eqiad, but the long term plan is to have the same in codfw
[10:19:04] when we'll get the power that we need is a question mark, the DCops team hasn't got any confirmation from the DC owners yet
[10:19:46] for the time being we'll probably have to set a generous SLO target for specific services running in eqiad only
[10:19:54] I don't have other ideas :(
[10:20:16] hi
[10:21:05] thanks for the info, and yea i'll prep the change
[10:21:25] my work has a chatbot with a button "add agent" where i can supply a few docs and ask it to review a document. does this place have one like that for wmf projects?
[10:21:50] ask it to read rules, it helps to determine if a page is notable
[10:21:55] or something
[10:40:30] GrHi! At the moment we don't support agents or chatbots, we are working towards adding LLM support for various projects but nothing in the direction that you asked.
[10:40:38] Gry: --^
[10:40:49] misspelled the nick the first time sorry :)
[10:42:24] elukey: ok
[10:43:03] elukey: can i do that thing, that i mentioned, on my laptop or a vps, as a trial
[10:43:25] is there someone here like you, who could give me a few hints
[10:46:05] I am not 100% sure what your use case is, but if it is around agents we may not have the experience that you need.
[10:47:28] well
[10:48:17] i have the nastiest of the wikis, a wikinews, where work is time sensitive. need a chatbot that can instantly tell the users what it thinks needs to be fixed.
[10:48:54] other issue: coi editors in wikipedia, it could be a gatekeeper and tell self promotion to go away
[10:49:18] other issue is wiktionary where wiki markup knowledge is required to add stuff
[10:49:58] chatbot could "learn" and provide instant help to newbies somewhere where it is transparent, so it can be corrected if needed
[10:50:18] hope that explains it a bit
[10:52:58] elukey: ^^^
[10:55:13] Gry: really interesting, I'd suggest creating a Phabricator task with the Machine-Learning team tag and listing all the use cases with as many details as you can
[10:55:26] so the team will be able to figure out how to best help you
[10:55:30] does it sound good?
[10:56:54] sure i will do that this weekend!
[11:58:13] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11353174 (achou) > LiftWing is only in eqiad (right?) LiftWing is in both [[ https://config-master.wikimedia.org/pybal/eqiad/inference | eqiad ]] and [[ https://config-master.wikim...
[13:13:17] elukey: hmmm the host is there in conftool-data/node/eqiad.yaml but there's no place to set weight or pooled parameters there
[13:16:20] or anywhere else under conftool-data from what i can see
[13:17:10] o/
[13:17:10] thanks for the review George
[13:17:10] going to deploy rrwikidata in the experimental ns ...
[13:20:24] dpogorzelski: so the weights and pool status are applied via the conftool command on puppetserver (see the one that I pasted earlier on)
[13:20:36] they are dynamic and stored in etcd
[13:20:43] then rendered in config files etc..
[13:20:47] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11353371 (kevinbazira) The revertrisk-wikidata model-server has been deployed in the LiftWing experimental namespace. It is...
[13:20:57] so you can pool/depool them etc.. without using puppet
[13:21:29] rrwikidata is up and running in LW experimental ns: https://phabricator.wikimedia.org/T406179#11353371
[13:26:54] gotcha, i misunderstood
[13:30:40] `confctl select 'name=ml-serve1012.eqiad.wmnet' set/pooled=yes:weight=10`
[13:35:57] Machine-Learning-Team, Data-Platform-SRE (2025.11.07 - 2025.11.28), Editing-team (Tracking), Essential-Work: Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11353536 (Gehel)
[13:36:39] This is good, but sometimes a given host may have multiple services etc.. so to address only the one that you need, it is usually good to also add selectors like `cluster=ml_serve,service=kubesvc` instead of only "name"
[13:36:57] (you can find the right value in the get call's result)
[13:42:39] `confctl select 'name=ml-serve1011.eqiad.wmnet,dc=eqiad,cluster=ml_serve,service=kubesvc' set/pooled=yes:weight=10`
[13:47:30] ok, changed
[13:48:17] but it's still not listed under https://config-master.wikimedia.org/pybal/eqiad/inference
[13:48:21] something else missing?
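(Editor's sketch, not from the channel.) The advice above, taking the extra selectors from the `get` call's result, can be sketched in a few lines. The JSON line is the output pasted earlier in this log; `parse_get_line` and `build_set_command` are hypothetical helper names, and the snippet only assembles a command string, it does not call confctl.

```python
import json

# Output line copied from the earlier `confctl ... get` call in this log.
GET_OUTPUT = (
    '{"ml-serve1012.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, '
    '"tags": "dc=eqiad,cluster=ml_serve,service=kubesvc"}'
)


def parse_get_line(line):
    """Split one JSON line of `get` output into (host, state, tags)."""
    record = json.loads(line)
    tags = record.pop("tags")
    # After removing "tags", the only remaining key is the hostname.
    host, state = next(iter(record.items()))
    return host, state, tags


def build_set_command(host, tags, pooled="yes", weight=10):
    """Assemble a set command that selects on the name plus all tags."""
    selector = f"name={host},{tags}"
    return f"confctl select '{selector}' set/pooled={pooled}:weight={weight}"


host, state, tags = parse_get_line(GET_OUTPUT)
print(state)
print(build_set_command(host, tags))
```

Deriving the hostname and tags from the `get` result, rather than retyping them, keeps the selector tied to the exact object that was inspected.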
[15:00:35] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11353962 (ldelench_wmf)
[15:31:05] dpogorzelski: ml-serve1012 vs ml-serve1011 :)
[15:33:22] fair :)
[15:50:26] (CR) Nik Gkountas: [C:+2] Sort page collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202781 (owner: Sbisson)
[15:51:52] (Merged) jenkins-bot: Sort page collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202781 (owner: Sbisson)
[16:04:40] (PS1) Nik Gkountas: page collection groups: include single ‘/’ collections without siblings [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203047
[16:06:17] (CR) Nik Gkountas: [C:+2] Don't ignore sitematrix and interwiki map errors [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202795 (owner: Sbisson)
[16:06:59] (Merged) jenkins-bot: Don't ignore sitematrix and interwiki map errors [research/recommendation-api] - https://gerrit.wikimedia.org/r/1202795 (owner: Sbisson)
[16:19:22] (CR) Sbisson: [C:+2] page collection groups: include single ‘/’ collections without siblings [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203047 (owner: Nik Gkountas)
[16:19:43] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11354277 (DPogorzelski-WMF) regarding 1. i suspect i can just re-use this part https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029139/ regarding 2. would flipping egress to...
[16:20:01] (Merged) jenkins-bot: page collection groups: include single ‘/’ collections without siblings [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203047 (owner: Nik Gkountas)
[16:23:28] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11354310 (DPogorzelski-WMF) @Eevans i guess we can just start with a set of shared credentials and split later if needed
[16:45:15] Machine-Learning-Team, Goal: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11354392 (achou) **Weekly Report** Progress update on the hypothesis for the week, including if something has shipped: - Cassandra + Data Gateway are ready T401021#1134...
[16:50:53] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11354434 (elukey) Option A would require some talk with SRE but given [[ https://grafana.wikimedia.org/d/000000234/kafka-by-topic?from=now-6M&to=now&timezone=utc&orgId=1&var-dataso...
[17:46:56] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11354578 (Kgraessle) a:Kgraessle
[19:08:02] (PS1) Sbisson: Page collection validation script [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203077