[07:03:06] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (elukey) Great analysis Aiko! One thing that I still don't understand is why now it works fine, meanwhile it doesn't when we switch to Lift Wing. IIUC `wgOresFiltersThresholds`...
[07:32:11] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (elukey) Relevant task https://phabricator.wikimedia.org/T319170#8807964 From https://ores.wikimedia.org/v3/scores/fiwiki/?models=goodfaith&model_info=statistics.thresholds.fals...
[08:54:35] aiko: o/
[08:54:41] I am rolling out the new version of eventgate
[08:54:42] (PS2) AikoChou: events: update prediction_classification_change schema to 1.1.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/944227 (https://phabricator.wikimedia.org/T343002)
[08:54:58] (CR) Elukey: [C: +1] events: update prediction_classification_change schema to 1.1.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/944227 (https://phabricator.wikimedia.org/T343002) (owner: AikoChou)
[08:55:13] elukey: nice, thanks!
[08:55:44] aiko: we can merge your events.py change and then update the outlink's docker image
[08:56:15] yess I'm going to do it
[08:56:28] (CR) AikoChou: [C: +2] events: update prediction_classification_change schema to 1.1.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/944227 (https://phabricator.wikimedia.org/T343002) (owner: AikoChou)
[08:59:32] aiko: eventgate rolled out
[09:03:01] Machine-Learning-Team, Patch-For-Review: Some outlink events are rejected by EventGate - https://phabricator.wikimedia.org/T343002 (elukey) New eventgate version rolled out (that is able to accept the new schema that Aiko created). Now we are going to rollout https://gerrit.wikimedia.org/r/c/machinelearn...
[09:03:06] (Merged) jenkins-bot: events: update prediction_classification_change schema to 1.1.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/944227 (https://phabricator.wikimedia.org/T343002) (owner: AikoChou)
[09:18:56] elukey: rolled out :)
[09:23:50] niceee
[09:23:56] let's see if the 500s go away
[09:29:27] 🤞
[09:34:30] aiko: did you see what I wrote for the fiwiki thing?
[09:34:36] I am even more puzzled than yesterday
[09:35:36] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.22; 2023-08-15), Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (elukey) Currently blocked by T343308
[09:35:50] klausman: o/ what is the rate limit strangeness?
[09:36:35] The discrepancy of me having a wme-tier token but being presented a 50k ratelimit, as we discovered last week.
[09:37:11] (plus that WME seemed to have no limit at all, as you reported)
[09:37:31] klausman: nope wme correctly have limits, I updated the channels etc.. about it
[09:37:35] elukey: let me see now
[09:37:53] oh, oops, missed that. Well my odd RL still is odd, and I want to make sure I understand what's going on
[09:37:55] and I also tested the internal tier, worked nicely with wrk (namely rate limited at 100k)
[09:38:09] klausman: can you repro it now? Or is it a past thing?
[09:39:53] That (repro) is what I'm working on rn
[09:39:58] super
[09:41:11] Machine-Learning-Team, Wikidata: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (elukey)
[09:51:21] Machine-Learning-Team, Item Quality Evaluator, Wikidata: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (elukey)
[09:51:45] elukey: https://phabricator.wikimedia.org/P50060 confirmed reproducible
[09:52:21] klausman: could you please add the steps to repro?
[09:52:29] sure
[09:52:38] does it happen only with the wme tier? Or internal too?
[09:53:00] Only WME tier/via APIGW, as far as I can tell
[09:53:44] it is strange, wme tested it and it worked
[09:53:52] but they used a personal api token though
[09:54:32] well, so did I, my working hypothesis rn is that my token is not correctly on the higher tier or I mis-c&p'd something
[09:57:39] Machine-Learning-Team, Item Quality Evaluator, Wikidata: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (elukey) https://github.com/Ladsgroup/Vandalism-dashboard/issues/15
[09:58:43] elukey: I just decoded the token I am using. It has a 250k qph rate limit encoded.
[09:59:59] klausman: can you create another one and see if the same applies?
[10:00:13] will do, after making notes re: repro
[10:01:54] gah, on-the-hour ratelimit reset just happened :D
[10:04:24] I suggested another token to start completely fresh
[10:04:44] maybe you have an old one that for some reason yields this result
[10:04:53] There is more weirdness going on, too
[10:05:31] I can reliably eat all my (50k) quota and then trigger 429s with my tool, but if I then make one request with curl (same endpoint), I get a 200
[10:05:44] Wonder if the rate limit is also tied to the UA?
[10:07:23] nope that's not it.
[10:10:01] I'll keep digging
[10:10:09] klausman: the CDN also have some rate limit, maybe your tool is too aggressive?
[10:10:20] what UA do you use?
[10:10:31] But wouldn't WME have the same problem?
[10:11:47] klausman: it depends on the UA probably, which one do you use?
[10:11:51] the standard golang one?
[10:12:00] no, my own, but I used the same for curl
[10:13:28] ack then not an issue of the CDN
[10:13:33] I also see the X-Ratelimit-Limit value
[10:13:54] it doesn't seem a local rate limit
[10:14:20] I think something in my tool re: Auth token is broken
[10:14:58] as in: it creates _some_ Auth header, but it's malformed
[10:15:12] ahh and you get the anonymous rate limit
[10:15:14] it would make sense
[10:16:08] Oh god, I am such an idiot
[10:16:18] if key == "" {
[10:16:19] bearer = fmt.Sprintf("Bearer %s", key)
[10:16:21] }
[10:16:23] see a problem here?
[10:16:34] != :)
[10:16:38] exactly
[10:17:05] it is a gooood news since it means that we don't have a horrible bug to fix in our infra!
[10:17:32] yes, I will gladly wear the dunce cap for a day in exchange for that :)
[10:17:39] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (achou) @elukey In Ilias's comment https://phabricator.wikimedia.org/T319170#8807964, the example url is querying thresholds for frwiki **damaging** model. ` https://ores.wikime...
[10:18:04] Ok, 429s are not happening anymore. Seeing if I can eat all the 250k quota as well, and get those ratelimit messages
[10:18:44] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (elukey) ahhh okok sigh I thought swapping damaging with goodfaith was enough, sorry for the confusion :(
[10:18:52] aiko: sorry for the noise in the task :(
[10:20:17] Machine-Learning-Team: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (elukey) >>! In T343308#9065876, @achou wrote: > The result is the same as the hardcoded values we have in `wgOresModelThresholds` for `fiwiki`, but I suspect that maybe somethi...
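Editor's note on the inverted comparison quoted at 10:16: a minimal Go sketch of the corrected check. The helper name `authHeader` and the sample token are hypothetical; the actual tool's code is not shown in the log beyond the three pasted lines.

```go
package main

import "fmt"

// authHeader is a hypothetical helper, not the tool's real code.
// It returns the Authorization header value for a token, or an
// empty string when no token is configured, so the caller can
// skip the header and fall back to the anonymous tier.
func authHeader(key string) string {
	// The pasted snippet used `key == ""`, so a real token was
	// never turned into a "Bearer ..." value.
	if key != "" {
		return fmt.Sprintf("Bearer %s", key)
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", authHeader("example-token"))
	fmt.Printf("%q\n", authHeader(""))
}
```

With the comparison flipped, a configured token always yields a well-formed header, and an unset token yields none at all, which matches the anonymous-tier behaviour tested later in the log.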
[10:20:33] elukey: this also means that if someone wants to use the 50k ratelimit, they can just send an empty access token
[10:20:51] gooood
[10:21:04] or not set it at all
[10:21:12] I am unsure about the latter
[10:21:29] I am relatively sure, but I can test it
[10:21:33] will test and document that after the 250k trigger attempt
[10:23:16] elukey: ahh that's totally fine! It's great that more people know how it works :)
[10:23:42] aiko: do you think that the parsing of the true => null thing triggers the issue then?
[10:25:07] elukey: maybe, because only fiwiki has that config
[10:25:57] I'm going to look into ThresholdLookupConfig.php
[10:26:05] super <3
[10:26:33] check also the tests, maybe we can add a unit test use case
[10:26:37] and see how it behaves
[10:27:20] okeee
[10:30:41] * elukey lunch!
[10:31:14] elukey: one more thing
[10:31:27] elukey: 250k limit triggered correctly with correct token usage
[10:31:36] and now bon appetit :)
[12:17:32] nice!
[12:23:32] tested the 50k limit without authorization header, worked nicely
[12:25:22] so the api-gateway works
[12:43:18] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [research/recommendation-api] - https://gerrit.wikimedia.org/r/945562 (owner: L10n-bot)
[12:45:59] elukey: I see the API latency alerts more often these days. Should we increase the thresholds for the alerts? Or maybe the API components need more quota?
I'll see if I can figure out if the latter have been throttled on CPU
[12:49:52] calico-kubecontrollers, calico-typha, eventrouter and helm-state-metrics all seem to be occasionally throttled, especially the first two
[12:50:32] https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad+prometheus%2Fk8s-mlserve&var-namespace=kube-system&var-pod=eventrouter-778c744b69-gq9fr&var-pod=helm-state-metrics-6b4477d4d7-t6hjc&var-pod=calico-kube-controllers-b47dfd47-jdtx4&var-pod=calico-typha-7dfd5d6865-4f95s&var-pod=calico-typha-7dfd5d6865-cd6nk&var-pod=calico-typha-7dfd5d6865-rk57n&var-c
[12:50:34] ontainer=All&from=1690980620844&to=1691067020844
[12:50:49] crap, that got truncated. maybe this will work: https://grafana.wikimedia.org/goto/PkehoR64k?orgId=1
[12:51:20] access denied :(
[12:51:43] ah, but clicking on the "Login" bit at the bottom left takes you to a working dash
[12:56:32] I created https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling to easily inspect throttling
[12:56:56] ah, even betterer
[12:57:09] maybe open a task so we can schedule it appropriately
[12:57:20] it is not 100% clear to me how to solve these problems
[12:57:29] keep increasing the Limit may not be the solution
[12:57:34] Agreed
[12:58:12] I think some throttling is fine, in and of itself it's not a problem.
But if API latency is high (too high), the most-throttled services might need a quota bump
[12:58:18] I'll make a task
[12:58:41] it depends where, in anything that handles traffic it is an issue
[12:59:00] I honestly don't know what the typha service does
[12:59:02] I raised the knative limits yesterday, but we still see throttling despite a high limit and low cpu usage
[12:59:46] so the 100ms slot that CFS manages has probably a complicated heuristic to assign compute quota
[13:00:18] see https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?forceLogin&orgId=1&var-dc=thanos&var-ignore_container_regex=&var-prometheus=k8s-mlserve&var-service=knative-serving&var-site=eqiad&var-sum_by=container&var-sum_by=pod
[13:00:42] a solution could be to avoid Limits, ending up basically in not using the cgroup for cpu
[13:00:47] but it may backfire
[13:01:52] Machine-Learning-Team: Investigate high API latency on LW k8s - https://phabricator.wikimedia.org/T343446 (klausman)
[13:32:01] And to answer the question of what Typha does: it's a service that reduces the load on the k8s API by doing fan-out/-in of r/o calls, in order to keep the API load constant as the cluster grows.
[13:32:26] > Without Typha, every calico node would have to register its own watch with API Server, and the load on API server would multiply as you scale up the number of nodes. By having Typha, all the watch events are off-loaded to Typha and read only once from API server. Hence Typha is not optional, but is a necessary component of your Calico deployment for any decent sized production cluster.
[13:32:33] (from https://medium.com/@bikramgupta/why-use-typha-in-your-kubernetes-calico-deployments-5c0ca4da30dd)
[13:33:44] So one possibility is that since Typha gets throttled, its API calls slow down, increasing perceived latency.
[13:54:00] hello
[14:12:25] \o
[14:12:41] (Luca, Aiko and I are in the mtg with research atm)
[14:17:26] I'm on a zoom call with the community
[14:21:59] :+1:
[14:53:54] oh wow nice
[15:53:16] I found some ip addresses of folks using eventstreams' revision-score
[15:53:45] may be difficult to contact, some from AWS that hopefully are all WME-related
[16:14:19] * elukey afk!
[16:14:24] have a nice rest of the day folks!
[16:18:06] bye luca :)
[16:43:35] Machine-Learning-Team, Item Quality Evaluator, Wikidata: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (Lydia_Pintscher)
[16:45:51] Machine-Learning-Team, Item Quality Evaluator, Wikidata, User-ItamarWMDE: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (Lydia_Pintscher) @Ladsgroup @Lucas_Werkmeister_WMDE @Michael or anyone else following: Can you think of other tools? @ItamarWMDE flagging that w...
[16:46:21] night all!
[16:46:47] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, User-ItamarWMDE: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (Lydia_Pintscher) @Arian_Bozorg putting this on your radar for next or following sprint planning
[16:47:09] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, User-ItamarWMDE: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (Lydia_Pintscher) p:Triage→High
[21:17:31] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, User-ItamarWMDE: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419 (Ladsgroup) The damaging and goodfaith model of Wikidata is quite different from the rest of wikis and I don't think the [[...
[21:34:41] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review, User-notice: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (Quiddity) Thanks for the draft, it's greatly appreciated! I'm wondering if the phrase "weird results" co...
[21:46:08] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review, User-notice: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (JJMC89) I was hoping someone from the #machine-learning-team would clarify my draft. "weird results" was...