[06:09:59] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Patch-For-Review, 10User-notice: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (10elukey) @Quiddity Hi! As reported in the email, the only issue that was reported is T343308, so the RC... [06:12:46] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, 10User-ItamarWMDE: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10elukey) @Ladsgroup we have a Revertrisk model dedicated to Wikidata in staging (more info in T333125), but I thi... [06:32:27] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10Lydia_Pintscher) [08:08:06] hello folks! [09:37:11] Mornin! [09:52:36] 10Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (10elukey) 05Open→03Resolved a:03elukey I double-checked and the docs are updated on ores-legacy.wikimedia.org, so we are good to close :) [10:01:49] elukey: I think I may have finally ironed out most of the Grizzly problems for the SLO dashboards [10:02:26] elukey: one problem is that we can't really make one dash for all the Revscoring services: the panels/widgets get very crowded, and about half of the time, the resultset from Thanos is so big, it doesn't even load. [10:02:43] So we'd have to have separate dashes for e.g. rs-articletopic, rs-damaging etc. [10:03:24] https://grafana.wikimedia.org/dashboard/snapshot/JfCJctZrWp7QeAJNQGyG7tHGnp0mbYED?orgId=1 Here's e.g. Revertrisk (the UID warning can be ignored) [10:03:52] This is not permanent/committed yet, just Grizzly's preview functionality. [10:04:31] https://grafana.wikimedia.org/dashboard/snapshot/3NT9oV6VXG0krO5kpYYGo82MEybdr7s4?orgId=1 And here's articletopic.
[10:05:06] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10Lydia_Pintscher) [10:05:20] I haven't defined the remaining revscoring ones, since I want to check with observability folks what to do on the git repo side, since I don't want to pollute the file with 1000 revscoring configs (or maybe that is ok, we'll see) [10:05:21] klausman: great stuff! [10:05:54] one question - could we just use a single SLO for every model server? For example, I think it is fine to have a single one for damaging, one for goodfaith, etc.. [10:06:11] not counting the single-wiki pods I mean [10:06:15] yes, that is the fanout I mentioned [10:06:31] Oh, I see. Hmm. I will have to see how to express that in a Grizzly way [10:07:07] it should hopefully be a matter of grouping the metrics more [10:07:38] for article topic the error budget data is still readable, but it may be confusing [10:07:54] if we have only one it will be easier [10:08:16] (also we can definitely group some pods under a single umbrella) [10:08:20] but really nice progress! [10:13:00] Like this? https://grafana.wikimedia.org/dashboard/snapshot/Ay2SbQX42pQAo0tcS2dPhJ5eK8wk4vYM?orgId=1 [10:13:49] I also tried having all revscoring models (but not separated by wiki) on one dashboard, but that still seems to hit the query size problem of Thanos [10:14:06] But having e.g. all the editquality ones on one dash might fit. [10:15:12] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10MPGuy2824) [10:16:39] klausman: exactly, yes!
[10:16:43] https://grafana.wikimedia.org/dashboard/snapshot/lcr8lU3tTWEX1ew9nQWwODCFJ4kUGXKW?orgId=1 [10:16:53] so this is what EQ would look like, then [10:17:45] But that already fails half the time because of Thanos :-/ [10:17:58] yeah a ton of metrics, sigh [10:18:12] maybe we can follow up with observability, there must be a way to aggregate metrics [10:18:24] so the new aggregated ones would be easy to load [10:18:28] We could do it at scraping time. [10:18:45] i.e. a Prometheus recording rule. [10:18:54] But I'm not sure it would be easy [10:19:03] let's ask observability [10:19:08] will do [10:19:13] super [10:19:25] aiko: o/ [10:20:58] elukey: hi luca [10:21:11] aiko: qq - what is the status of RR Wikidata? [10:22:09] elukey: the model is in staging, experimental [10:22:51] aiko: yep yep, but I was wondering if Research has plans for prod [10:23:07] the Wikidata folks asked if we have something for them, so they can skip revscoring models [10:23:13] but if it is not ready no problem [10:25:00] elukey: I remember it is not ready yet [10:25:08] ahh okok [10:35:40] * elukey lunch! [10:42:24] same [10:49:19] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10achou) @elukey Research team's plan for the RevertRisk Wikidata model is to evaluate it in Q1, and then improve and deploy it in Q2. [12:35:07] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10elukey) Thanks for the task! Could you give us more info about how PageTriage uses ORES?
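[editor's note] The pre-aggregation at scraping time mentioned above would be a Prometheus recording rule. A sketch of what such a rule could look like (metric and label names here are illustrative assumptions, not the actual Lift Wing metric names):

```yaml
groups:
  - name: liftwing_slo_aggregations
    rules:
      # Pre-aggregate the per-pod request counters into one series
      # per model server, so the SLO dashboards pull far fewer
      # series out of Thanos.
      - record: model_server:request_duration_seconds_count:rate5m
        expr: |
          sum by (model_server, response_code) (
            rate(request_duration_seconds_count[5m])
          )
```

Dashboards would then query the recorded `model_server:...` series instead of the raw per-pod ones.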
[12:37:11] ok wow, a single activator (knative) container seems to run 78 threads [12:37:19] I can imagine why we get throttled so fast [12:40:14] https://github.com/knative/serving/blob/main/config/core/deployments/activator.yaml#L52 makes zero sense [12:40:26] but it probably depends on the use case [12:41:04] I think that for knative we could try without limits (hence without a cgroup for the cpu usage) [12:44:18] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10Novem_Linguae) It powers the red "potential issues" labels such as "Vandalism", "Spam", and "Attack" in https://en.wikipedia.org/wiki/Special:NewPage... [12:49:10] will start with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/945772/ [12:51:07] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10elukey) @Novem_Linguae very nice, thanks! Do you know how the special page loads ORES scores? Does it use the ores_classification table? [12:59:18] elukey: are the activators written in Go? Then a lot of OS threads is normal, and doesn't necessarily indicate "busy" [13:01:34] The Go runtime uses a lot of threads, but they're not all always running. [13:03:44] klausman: they are, but the problem is how the quota is assigned, at least IIUC [13:04:09] Hmm. I thought it was just CPU seconds/second [13:04:26] the time slice is 100ms (for each CPU) and your threads get scheduled according to Limits [13:04:38] the more threads you have, the quicker you will burn through the quota assigned [13:04:48] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10Novem_Linguae) Not 100% sure, but searching the code suggests it does.
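[editor's note] "Trying without limits" means dropping the CPU limit while keeping the request, so the container escapes the CFS quota entirely. A rough sketch of the container resources in a deployment-charts values file (the exact chart structure and numbers are assumptions):

```yaml
resources:
  requests:
    cpu: 300m          # the scheduler still reserves capacity
    memory: 600Mi
  limits:
    # no cpu limit -> no CFS quota -> no CPU throttling
    memory: 600Mi      # keep a memory limit to protect the node
```

Memory limits are usually kept even when CPU limits are removed, since exceeding memory gets the pod OOM-killed rather than throttled.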
{F37173741} [13:04:50] I was reading https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718 [13:04:55] and others [13:05:44] if you check the cpu usage is not a lot [13:05:54] but they get throttled quite frequently [13:07:52] Yeah... We could give no-limits a go early next week, keep an eye on it for a day or three, see how it fares [13:09:36] klausman: last change worked nicely so far - https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?forceLogin&from=now-1h&orgId=1&to=now&var-dc=thanos&var-ignore_container_regex=&var-prometheus=k8s-mlserve&var-service=knative-serving&var-site=eqiad&var-sum_by=container&var-sum_by=pod&var-container=activator [13:10:12] oh my [13:11:10] That's a factor of 100 or more [13:11:41] this was istio before / after the limits change [13:11:42] https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?forceLogin&from=now-7d&orgId=1&to=now&var-dc=thanos&var-ignore_container_regex=&var-prometheus=k8s-mlserve&var-service=istio-system&var-site=eqiad&var-sum_by=container&var-sum_by=pod&var-container=All [13:11:51] we still have some throttling that is not great [13:12:28] how the CFS quotas are calculated is beyond my understanding at the moment [13:13:22] weird that controllers and autoscalers didn't really show a change [13:15:16] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10elukey) Makes sense thanks! In theory PageTriage shouldn't do anything, since the ORES extension populates the `ores_classification` table and most o... [13:16:26] ok so the activators are getting more throttled now [13:16:35] so it was maybe a post-deployment thing [13:16:43] but way less than before [13:17:21] You mean after deploying the throttling has to ramp back up before it's "steady state" [13:17:23] ? 
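[editor's note] Back-of-the-envelope arithmetic for why a many-threaded container burns its CFS quota so quickly, assuming the default 100ms period and a 1-CPU limit (the per-thread slice is an illustrative number, not a measurement):

```python
# CFS grants a container quota = cpu_limit * period of CPU time per period.
CFS_PERIOD_MS = 100                    # kernel default
cpu_limit = 1.0                        # e.g. "limits: cpu: 1"
quota_ms = cpu_limit * CFS_PERIOD_MS   # 100 ms of CPU time per 100 ms period

# If 78 runnable threads each want a ~2 ms slice within one period,
# the quota is exhausted long before the period ends and the rest
# of the demand is throttled until the next period.
threads, slice_ms = 78, 2
wanted_ms = threads * slice_ms                 # 156 ms of demand
throttled_ms = max(0, wanted_ms - quota_ms)    # 56 ms throttled
print(f"wanted {wanted_ms} ms, allowed {quota_ms:.0f} ms, "
      f"~{throttled_ms:.0f} ms of demand throttled per period")
```

This is why low average CPU usage and frequent throttling can coexist: the demand is bursty across many threads within each 100 ms window.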
[13:18:00] well, some fluctuation after the deploy is expected; it is now only 1/10th of what it was before [13:18:11] yeah, agreed. [13:18:54] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10Novem_Linguae) Sounds good. Should we close this ticket then? [13:20:26] 10Lift-Wing, 10Machine-Learning-Team, 10ORES, 10PageTriage: Migrate PageTriage to use LiftWing instead of ORES - https://phabricator.wikimedia.org/T343514 (10elukey) 05Open→03Resolved a:03elukey Let's do it, thanks for the brainbounce! Really appreciated :) [13:20:49] I'll raise the istio gw's resources a little too - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/945778 [13:22:11] :+1: [13:32:23] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10elukey) Ack, thanks! Then I'd suggest starting with damaging and goodfaith, and migrating later on when RR Wikidata is ready? [13:56:17] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10MGerlach) weekly update: * no update [14:30:59] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10diego) >>! In T343419#9068806, @achou wrote: > @elukey Research team's plan for the RevertRisk Wikidata model is to evaluate it i... [14:31:55] Okay, there are more ORES uses than I anticipated [14:34:56] yeah :) [14:35:11] it is a good occasion to remind people to use User Agents [14:35:24] since most of what we see are generic UAs (like the go/python ones etc..)
[14:35:50] I don't see a match between what I find in the logs and what people want to migrate [14:36:01] Yeah [14:37:12] It seems clear over the last two weeks that the migration is a bigger task than we initially thought. That is fine, let's just acknowledge it to ourselves and adapt [14:37:36] yep [14:39:36] random idea - if we return HTTP cache headers from our model servers, our CDN will cache results for X time without any need for a score cache [14:39:49] this is something that we don't currently do, and it could be an easy win [14:40:39] I don't know if changeprop could be instructed to call the api-gateway to warm up the CDN cache; it would break a lot of production boundaries [14:41:03] but we could think about HTTP caching [14:41:08] will open a task on monday [14:49:08] okay cool [14:49:12] Let's chat Monday [15:01:09] chrisalbon_: just sent another email to wikitech-l for the revision-score stream [15:01:13] we should be all covered now [15:02:29] great thanks [15:04:22] logging off for today! [15:04:29] have a nice weekend folsk [15:04:32] *folks [15:04:33] have a great weekend! [15:05:12] bye luca!
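[editor's note] The HTTP caching idea discussed above only requires the model servers to emit a `Cache-Control` header on score responses. A minimal sketch of a header-building helper (the function name and parameters are hypothetical, not an existing Lift Wing API):

```python
def cache_headers(ttl_seconds: int, private: bool = False) -> dict:
    """Build response headers that let a CDN cache a model score.

    ttl_seconds: how long the CDN (and clients) may reuse the response.
    private: if True, only the client may store it, not shared caches.
    """
    scope = "private" if private else "public"
    return {"Cache-Control": f"{scope}, max-age={ttl_seconds}"}

# A score for a given rev_id never changes once computed, so it can
# be cached fairly aggressively at the CDN layer.
headers = cache_headers(ttl_seconds=3600)
print(headers["Cache-Control"])
```

Since predictions for an immutable revision are deterministic per model version, `max-age` mostly trades cache hit rate against how quickly a model redeploy becomes visible.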
[17:27:41] 10Machine-Learning-Team: Store and fetch the recommendation-api embedding from Swift - https://phabricator.wikimedia.org/T343576 (10kevinbazira) [19:44:04] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10leila) @elukey Research accepts accountability for the readability model for a period of 12 months (We will revisit then if we want to contin... [20:58:26] just curious what y'all are using for word tokenization across various languages [22:18:34] 10Machine-Learning-Team, 10Item Quality Evaluator, 10Wikidata, 10Wikidata Dev Team, and 2 others: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (10diego)