[07:15:44] good morning!
[07:15:49] very interesting results Aiko
[08:43:12] * elukey quick break
[09:35:16] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10Isaac) Tagging @Tgr as someone who might know more about Growth's consumptio...
[09:36:01] aiko: o/ very nice tests that you ran yesterday!
[09:36:17] qq - is the docker image for outlink the same across all clusters?
[09:39:56] 10Lift-Wing, 10Documentation, 10Machine-Learning-Team (Active Tasks): Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (10Isaac) Thanks @achou for putting this together. I can't speak so well to the deployment / tech-stack side of things though I really appreciate the depth you wen...
[09:47:28] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10elukey) >>! In T317768#8238625, @Isaac wrote: > > Maybe out of scope, but I...
[10:18:58] o/ yep, it's the same docker image on eqiad, codfw and staging
[10:21:52] <- early lunch
[10:36:55] I am rolling out the same images for revscoring-based models also in eqiad
[10:36:59] for better comparison
[10:37:22] (I had rolled out only articlequality in eqiad)
[10:42:20] ran the same test for editquality goodfaith enwiki, staging is still more performant
[10:42:39] p99 latency is super good
[11:00:02] I checked the rack configs, ml-staging2001 has some mw2* nodes in the same rack, but just a few
[11:00:19] and p99 latency is always good (since we hit random appservers)
[11:01:02] * elukey lunch!
[11:03:11] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10Tgr) >>! In T317768#8238625, @Isaac wrote: > Tagging @Tgr as someone who mig...
[12:44:21] I am playing with some istio configs on ml-serve-codfw, it may impair its functionality
[13:18:23] ok so I have finally understood more of the sidecar metrics for istio
[13:18:40] https://istio.io/latest/docs/reference/config/metrics/#metrics
[13:18:47] For TCP traffic, Istio generates the following metrics:
[13:18:54] those are available indeed, but not the http ones
[13:19:27] what we do in the isvc pods is to start a https connection to api-ro.discovery.wmnet, and the sidecar rightfully lets it pass
[13:19:38] it cannot inspect more than TCP, so cannot add metrics
[13:20:13] the serviceops team, when it proxies traffic to the tls-proxy (envoy), uses http://localhost:$port traffic afaics
[13:20:30] so we could try to do the same, but IIRC it didn't work in my past test
[13:20:32] *tests
[13:20:39] I'll try to work more on it
[13:37:51] 10Machine-Learning-Team, 10ORES, 10MediaWiki-Core-Preferences, 10Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't work in mobile web - https://phabricator.wikimedia.org/T314026 (10eigyan) The file responsible for styling the RecentChanges page markup with prediction filter...
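To make the 13:18-13:20 point concrete, here is a minimal Python sketch (not the actual isvc code) of the two call styles being compared: HTTPS straight from the pod, which the sidecar can only account for as TCP bytes, versus plain HTTP to a local proxy that performs the TLS upgrade, which yields per-request HTTP telemetry. The revision id, local proxy port and Host-header handling are assumptions for illustration only.

```python
# Sketch of the two ways an isvc pod could reach the MW API, assuming the
# standard `requests` library; CA-bundle/verification details are omitted.
import requests

REV_ID = 1234567  # hypothetical revision id, just for illustration

# 1) What the isvc pods do today: HTTPS directly to api-ro.discovery.wmnet.
#    The istio sidecar only sees an opaque TLS byte stream, so it can emit
#    TCP-level metrics (bytes, connections) but no HTTP request metrics.
resp = requests.get(
    "https://api-ro.discovery.wmnet/w/api.php",
    params={"action": "query", "revids": REV_ID, "format": "json"},
    headers={"Host": "en.wikipedia.org"},  # assumption: wiki selected via Host
    timeout=5,
)

# 2) The serviceops-style pattern mentioned at 13:20: plain HTTP to a local
#    proxy (http://localhost:$port) and let envoy/istio originate TLS towards
#    the upstream. Because the proxy sees cleartext HTTP, it can record
#    per-request metrics (status codes, durations) and apply routing policy.
LOCAL_PROXY_PORT = 6500  # assumption: whatever port the local proxy listens on
resp = requests.get(
    f"http://localhost:{LOCAL_PROXY_PORT}/w/api.php",
    params={"action": "query", "revids": REV_ID, "format": "json"},
    headers={"Host": "en.wikipedia.org"},
    timeout=5,
)
```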
[13:54:09] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10Isaac) More or less copying over a comment from another task that's more pertinent here though likely beyond scope: the ORES Extension has the [[https://www.mediawiki.org/wiki/Extension:ORES#Database_schema|t...
[13:56:30] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10Isaac) > A lot of traffic going through ORES is duplicate between the revisi...
[14:07:47] interesting - if I run multiple http calls to ml-serve-codfw for the same URL, not via wrk but via simple curl, at some point I get high-latency responses
[14:07:51] like 700/800 ms
[14:08:08] but if I check the "duration" field in the istio-proxy logs, I don't see anything more than 200/300ms
[14:08:17] even less, nothing more than 200ms
[14:08:45] so the mw api call succeeds and is very fast, then something adds the latency
[14:14:17] for example
[14:14:17] [I 220915 14:05:13 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 787.08ms
[14:15:10] this is from kserve
[14:15:21] that matches the high latency reported by curl
[14:15:47] but istio-proxy reports 117ms of latency
[14:23:12] maybe it is tornado blocking somehow?
[14:46:43] testing on staging, it may be broken for a while
[14:51:59] when you say "simple curl", are those requests to the locally running pod or directly to api-ro?
[14:52:12] ah, local.
[14:53:04] So if the latency does not come from the api call, where are we spending half a second? Very puzzling, since it's not consistent
[15:09:54] yeah
[15:10:13] I am trying to test a way to avoid the TLS conn from kserve
[15:10:43] something like plain http en.wikipedia.org -> istio-proxy <-> api-ro.discovery.wmnet:443 (https)
[15:10:55] in this way we'd get HTTP metrics and latencies
[15:11:13] (from istio-proxy I mean) without having to add 100k envoy metrics :D
[15:11:23] but of course istio doesn't like me
[15:11:47] if I manage to do it, we'll have metrics about MW API call latencies to verify what I wrote above
[15:21:57] It's going to be DNS latency, isn't it? :D
[15:53:34] * elukey bbiab
[16:20:02] https://istio.io/latest/docs/tasks/traffic-management/egress/egress-tls-origination/ is basically what I was looking for..
[16:20:17] "The application can send unencrypted HTTP requests and Istio will then encrypt them for the application."
[16:20:42] "Another benefit of sending unencrypted HTTP requests from the source, and letting Istio perform the TLS upgrade, is that Istio can produce better telemetry and provide more routing control for requests that are not encrypted."
[16:20:46] exxxactly
[16:20:46] AIUI, this must be configured per (outside) service, right?
[16:21:05] yes yes, we already configure something similar per service
[16:21:13] the ServiceEntry istio CRDs
[16:21:44] Sounds like a good plan even if the telemetry doesn't help us with the prod/staging discrepancy
[16:22:39] I'm a bit wary of using it for outside (non-WMF) services if we ever get to that. Getting routing wrong might result in un-auth/encrypt'd queries and that would be Not Good.
[16:23:04] definitely yes
[16:31:19] doesn't work, I'll try to work on it tomorrow morning
[16:31:31] staging is still not working but it is not a big deal :)
[16:31:33] ttl!
[16:31:47] \o
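The 14:07-14:15 debugging above boils down to comparing the client-side, end-to-end latency against the upstream "duration" that istio-proxy and kserve log. A small sketch of that comparison is below, assuming the `requests` library; the URL, port, payload and iteration count are assumptions, not the team's actual test script.

```python
# Time repeated calls to the local predict endpoint and print percentiles,
# to compare against the "duration" field in the istio-proxy/kserve logs.
import statistics
import time

import requests

URL = "http://127.0.0.1:8080/v1/models/enwiki-goodfaith:predict"  # assumed local port
PAYLOAD = {"rev_id": 1234567}  # hypothetical revision id

latencies_ms = []
for _ in range(50):
    start = time.monotonic()
    resp = requests.post(URL, json=PAYLOAD, timeout=10)
    latencies_ms.append((time.monotonic() - start) * 1000)
    resp.raise_for_status()

p50 = statistics.median(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile
print(f"client-side p50={p50:.0f}ms p99={p99:.0f}ms max={max(latencies_ms):.0f}ms")

# If the client-side tail is much higher than what istio-proxy reports for the
# upstream MW API call, the extra time is being spent inside the pod
# (e.g. in the tornado/kserve worker), not on the network.
```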
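For the egress TLS origination idea discussed at 16:20-16:21, the Istio task linked above uses a ServiceEntry plus a DestinationRule that originates TLS on the sidecar. The sketch below expresses those two CRDs with the kubernetes Python client so the example stays in one language; in practice they would be rendered by the existing charts rather than applied ad hoc, and the namespace, resource names and port numbers here are assumptions for illustration.

```python
# Sketch of Istio egress TLS origination for api-ro.discovery.wmnet,
# mirroring the istio.io egress-tls-origination task; not the actual config.
from kubernetes import client, config

NAMESPACE = "revscoring-editquality-goodfaith"  # hypothetical isvc namespace

# ServiceEntry: let the app call api-ro.discovery.wmnet on plain-HTTP port 80
# inside the mesh, while the real upstream listens on 443.
service_entry = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "ServiceEntry",
    "metadata": {"name": "api-ro-discovery-wmnet"},
    "spec": {
        "hosts": ["api-ro.discovery.wmnet"],
        "ports": [
            {"number": 80, "name": "http-port", "protocol": "HTTP", "targetPort": 443},
            {"number": 443, "name": "https-port", "protocol": "HTTPS"},
        ],
        "resolution": "DNS",
    },
}

# DestinationRule: for traffic sent to port 80, have the sidecar originate TLS
# (mode SIMPLE) towards the upstream. This is what turns the opaque TCP stream
# into HTTP that istio-proxy can measure and route.
destination_rule = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "DestinationRule",
    "metadata": {"name": "api-ro-discovery-wmnet-tls-origination"},
    "spec": {
        "host": "api-ro.discovery.wmnet",
        "trafficPolicy": {
            "portLevelSettings": [
                {"port": {"number": 80}, "tls": {"mode": "SIMPLE"}}
            ]
        },
    },
}


def main() -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    api = client.CustomObjectsApi()
    for body, plural in ((service_entry, "serviceentries"),
                         (destination_rule, "destinationrules")):
        api.create_namespaced_custom_object(
            group="networking.istio.io",
            version="v1beta1",
            namespace=NAMESPACE,
            plural=plural,
            body=body,
        )


if __name__ == "__main__":
    main()
```

With the sidecar doing the TLS upgrade, HTTP-level metrics such as istio_requests_total and the request duration histograms should start appearing for the api-ro destination, which is the telemetry the 15:10 messages are after.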