[04:48:19] Machine-Learning-Team, ORES: ORES doesn't work (at least for ruwiki) - https://phabricator.wikimedia.org/T362503 (MBH) NEW
[04:52:09] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712213 (MBH)
[06:22:34] (PS7) Kevin Bazira: logo-detection: add KServe custom model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803)
[06:23:04] (CR) Kevin Bazira: logo-detection: add KServe custom model-server (8 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[07:09:05] Good morning!
[07:17:43] ml-team, ORES doesn't work. https://phabricator.wikimedia.org/T362503
[07:29:50] thanks for the ping
[07:29:55] I'm taking a look
[07:30:49] isaranto: you have more knowledge of how the extension works, can you peek at that side, while I check our (LW) side?
[07:31:03] also, good morning :)
[07:32:18] good morning!
[07:32:33] klausman: welcome back! I'm looking as we speak
[07:33:06] excellent, thanks
[07:38:23] yeah it seems we can't access mwapi to fetch the features
[07:39:12] So in Grafana, I see that yesterday at 14:44 UTC, we started serving a lot of non-200 responses. Digging further
[07:41:40] it is related to the switch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018998
[07:42:02] here is the task for more context: https://phabricator.wikimedia.org/T362316
[07:42:22] I saw those PRs still open and wondered if they were complete
[07:42:36] As in: all bits to do that switch were in place
[07:43:36] But that was submitted on the 12th, bit odd that it would only start breaking yesterday
[07:44:22] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712315 (isarantopoulos) Thanks for reporting this. We are investigating and will report back with more information.
[07:46:52] Also, the change was only to the staging pods, prod should not be affected
[07:47:01] need to dig a bit more, it may be specific to some requests and unrelated to the switch. An example of a successful request: https://ores.wikimedia.org/v3/scores/ruwiki/723/goodfaith
[07:47:18] I'm running all the httpbb tests we have to check all wikis
[07:47:30] ack
[07:48:11] +https://phabricator.wikimedia.org/T362506 :)
[07:50:38] Yeah, from the isvc's POV, these are upstream timeouts for mwapi
[07:51:33] mwapi.errors.TimeoutError at /srv/rev/revscoring_model/model_servers/extractor_utils.py:91
[07:59:17] httpbb tests gave a failure only for ruwiki, so let's focus on that and leave ukwiki investigation for later
[07:59:40] ruwiki-damaging I mean
[08:01:06] ack
[08:14:49] I'm trying to find "our" requests in the logstash dashboards for mwapi, but I am unsure what they look like since it's all abstracted away in the mwapi package
[08:19:39] I'm trying to figure out why requests coming from goodfaith succeed while damaging ones fail
[08:19:50] That is the other oddity
[08:20:32] Though I don't see too many requests there, maybe 2-3/min
[08:21:54] Have you tried hitting the codfw endpoint for the affected service, to see if it's DC-specific?
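For reference, a minimal sketch of the kind of per-DC check being suggested here, in Python. The endpoint and Host header follow the usual Lift Wing internal-usage pattern, but treat the exact hostnames, port, and model/namespace names as assumptions:

```python
import requests

# Hypothetical cross-DC probe: send the same revision to the ruwiki-damaging
# isvc in each data centre and compare status codes and latency.
PAYLOAD = {"rev_id": 723}
HOST_HEADER = "ruwiki-damaging.revscoring-editquality-damaging.wikimedia.org"

for dc in ("eqiad", "codfw"):
    url = f"https://inference.svc.{dc}.wmnet:30443/v1/models/ruwiki-damaging:predict"
    try:
        # NOTE: TLS verification may need the internal CA bundle (verify=...).
        r = requests.post(url, json=PAYLOAD, headers={"Host": HOST_HEADER}, timeout=30)
        print(dc, r.status_code, f"{r.elapsed.total_seconds():.2f}s")
    except requests.RequestException as exc:
        print(dc, "request failed:", exc)
```

If codfw answers quickly while eqiad times out (which is what the later statbox test at [08:28:24] showed), that points at something DC-local rather than at the model or at MediaWiki itself.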
[08:22:42] So far I see no errors in its logs, but it's low traffic, too
[08:23:34] One thing we can try is restarting the isvc, if that "fixes" it, we still need to figure out the root cause, but at least the functionality would be back
[08:23:45] wdyt?
[08:25:43] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712429 (Aklapper)
[08:28:24] ok I tried the following:
[08:28:24] - run it locally using ru.wikipedia.org as the WIKI_URL. all fine!
[08:28:24] - hit the ruwiki-damaging on codfw from statbox. all fine again so it is DC related to some extent https://phabricator.wikimedia.org/P60489
[08:28:42] ack
[08:29:34] shall we just restart to check?
[08:29:46] yeah, I'll do it
[08:30:02] +1 I think we can give it a try because the most important thing is to first restore the prod service
[08:30:17] morning!
[08:30:29] morning aiko
[08:30:40] the question we need to answer is the following (correct me or add stuff): why don't we get a response from mwapi in eqiad? why does it only happen for ruwiki-damaging
[08:30:45] hey Aiko!
[08:31:14] And post restart, the errors are gone
[08:32:20] hypothesis: the istio sidecar cached something or broke in another way that resulted in it trying to connect to a wrong address (it's always DNS)
[08:32:47] As for the UK wiki, I don't see errors in the logs for that eq-damaging service
[08:33:06] checking the Istio dashboards, the outages occurred around 14:45 yesterday
[08:33:14] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ruwiki-damaging-predictor-default-00017&var-backend=ruwiki-damaging-predictor-default-00017-private&var-response_code=All&var-quantile=All&from=1713096000000&to=1713121199000
[08:33:25] Yep, agreed
[08:35:00] Nice Tobias!
[08:35:17] there is still something off though. Requests are taking 8-13s instead of 300-500ms
[08:35:36] I haven't been able to repro any ukwiki failures, did you?
[08:35:53] ok now they're much faster
[08:36:07] > I haven't been able to repro any ukwiki failures, did you?
[08:36:07] nope
[08:37:46] Updated the bug
[08:38:02] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712479 (klausman) We have restarted the associated service and its logs show no more errors. It's not quite root-caused yet, but the functionality should be back to working order now....
[08:38:33] when it happened, ruwiki-goodfaith was also affected but it resumed functioning afterwards
[08:38:36] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default-00019&var-backend=ruwiki-goodfaith-predictor-default-00019-private&var-response_code=All&var-quantile=All&from=1713096000000&to=1713121199000
[08:38:48] Huh, that is extra odd
[08:38:55] thanks Tobias for the update
[08:39:06] But then again, sounds like some caching issue in Istio
[08:39:40] it was because it is "goodfaith" 😛
[08:47:45] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712490 (MBH) Looks like ukwiki works now.
[08:49:51] I've also saved the logs of all containers in the malfunctioning pod, so we can do more digging
[08:50:07] nice!
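As an aside, saving the per-container logs before (or right after) a restart can be scripted; a rough sketch, assuming kubectl access to the right cluster and using the namespace and pod name mentioned later in this log:

```python
import subprocess

NAMESPACE = "revscoring-editquality-damaging"
POD = "ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx"

# List the containers in the pod, then dump each one's logs to a local file.
containers = subprocess.check_output(
    ["kubectl", "-n", NAMESPACE, "get", "pod", POD,
     "-o", "jsonpath={.spec.containers[*].name}"], text=True).split()

for name in containers:
    logs = subprocess.run(
        ["kubectl", "-n", NAMESPACE, "logs", POD, "-c", name],
        capture_output=True, text=True).stdout
    with open(f"{POD}.{name}.log", "w") as fh:
        fh.write(logs)
```

This grabs the sidecars (istio-proxy, queue-proxy) as well as the kserve container, which is exactly the data that turned out to be missing later in the investigation.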
[08:54:09] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9712500 (Q-bit-array) As a creator of the other ticket (T362506) I would add that the ORES/LiftWing infrastructure in the Russian Wikipedia was quite unstable during the whole last wee...
[09:15:43] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[09:24:22] (CR) Kosta Harlan: Exclude first/only revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[10:03:43] * klausman lunch
[10:34:24] * isaranto lunch!
[12:36:59] hello folks!
[12:37:31] very interesting morning!
[12:37:46] morning Luca o/
[12:42:42] Heyo, Luca
[12:42:57] Yes, it was interesting as a first thing after two weeks off :)
[12:47:39] ok so I'd have a suggestion for the task - we should add a complete timeline with UTC date/times and actions taken by us
[12:47:47] start of the outage, recovery, etc..
[12:47:54] so it will be easier to dig into logs etc..
[12:48:46] can anybody take care of it if you have time?
[12:49:39] sure, will do
[12:49:54] I've pored over the logs for a while now, but I haven't found any leads
[13:00:34] Good morning all
[13:02:20] o/
[13:02:29] \o Chris!
[13:09:05] o/
[13:10:24] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9713264 (klausman) Timeline (times in UTC): 20240414 14:45:00 Rate of 200 responses from the service drops to essentially zero, average latency of requests jumps to over 5s. 20240414...
[13:10:33] elukey: I have a preliminary timeline on the bug.
[13:10:40] super thanks
[13:11:12] I am trying to figure out how to get older k8s-side logs of the container (it restarted at 0830 UTC, don't know yet why)
[13:11:45] there are multiple containers at stake in here, I'd be curious to see the istio-proxy and the queue-proxy
[13:11:46] ah nvm, that was me, I got timezone-confused
[13:12:09] on the deployment server I dumped some logs in my homedir
[13:13:41] ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx <- this was the malfunctioning container, but the logs are gone from k8s
[13:15:22] crap, I saved the wrong (post-restart) ones :((
[13:15:59] they should be on logstash don't worry
[13:16:17] yeah, digging there now
[13:16:45] This is the point of view of the queue-proxy
[13:16:46] https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve?forceLogin&orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-kubernetes_namespace_controller=kserve&var-kubernetes_namespace_queue_proxy=revscoring-editquality-damaging&var-app=ruwiki-damaging-predictor-default-00017&from=now-12h&to=now
[13:16:58] that in theory sits between kserve and the istio-proxy
[13:17:24] IIRC we drop the sidecar istio proxy logs by default since they are too big
[13:17:28] we keep only the istio gateway ones
[13:17:49] the only one that I can see from queue-proxy is https://logstash.wikimedia.org/goto/a955efec41bd274cfd5c9eca988a60c1
[13:19:26] Unfortunately not much detail beyond "it was a timeout". lemme check if the IP(s) it's contacting are right
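A quick way to do that kind of check is a plain resolve-and-connect probe; a sketch, with the upstream hostname assumed to be the MediaWiki API read-only discovery record:

```python
import socket
import time

HOST, PORT = "api-ro.discovery.wmnet", 443  # assumed upstream; adjust as needed

# Resolve the name and try a TCP connect to each returned address, so a DNS
# problem (wrong or stale IP) can be told apart from a connectivity problem.
for family, _, _, _, sockaddr in socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP):
    addr = sockaddr[0]
    start = time.monotonic()
    try:
        with socket.create_connection((addr, PORT), timeout=5):
            print(f"{addr}: connected in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        print(f"{addr}: connect failed after {time.monotonic() - start:.2f}s ({exc})")
```

One caveat: inside the pod, outbound traffic is intercepted by the istio sidecar, so a probe like this only approximates what envoy itself sees for the api-ro destination.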
[13:20:00] yes but we have multiple timeouts at stake now
[13:20:07] 1) there seems to be an mwapi timeout error
[13:20:14] 2) the queue proxy mentions a timeout
[13:20:41] and probably the istio-proxy has another one, we should set it via destination rules
[13:21:24] for api-ro it is:
[13:21:26] Connection Pool:
[13:21:26]   Http:
[13:21:26]     Idle Timeout: 5s
[13:21:29]     Max Requests Per Connection: 1000
[13:21:31]     Max Retries: 0
[13:21:34]   Tcp:
[13:21:36]     Connect Timeout: 30s
[13:21:39]     Max Connections: 100
[13:23:43] from
[13:23:43] https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ruwiki-damaging-predictor-default-00017-private&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-12h&to=now
[13:24:15] right before the restart there were a lot of 0 return codes, DC (downstream closes IIRC)
[13:24:45] Yes, but I don't know if that is maybe a timeout on the extension side
[13:25:16] i.e. our timeout for mwapi etc is longer than the one on the extension side. So a DC might be that or something else.
[13:25:18] I think those are the bots that hit ores-legacy
[13:26:39] the weird thing is that if I select the api-ro backend, I don't see a clear pattern in errors
[13:27:01] if it was the mw api, I'd have expected some clear indication
[13:27:02] mmmmm
[13:27:15] Yeah I tried finding anything on the api-ro side, but eventually gave up
[13:29:58] no indication of memory/cpu pressure on any container either
[13:30:21] envoy/istio-proxy may have been borked for some reason, it is the only thing that I can think of
[13:30:27] yeah, the only stuff in that department is what happened when things started working again and we had a backlog burst
[13:30:40] do you recall on what node the failing pod was running?
[13:30:50] maybe it was specific to one k8s worker
[13:30:56] 1001, I think
[13:35:09] ml-serve1008.eqiad.wmnet I am pretty sure. The new pod runs there and the restart was very quick, so it likely didn't change hosts
[13:39:00] so from the logs it seems 1004
[13:39:05] the kubelet logs I mean
[13:39:17] the pod should be revscoring-editquality-damaging/ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx in theory
[13:39:34] yes, that is definitely the right pod ID
[13:41:50] 1004's kubelet has some logs related to the pod shutting down etc.., but nothing useful
[13:42:58] other pods are running on it and nothing happened to them
[13:44:06] Nothing relevant in dmesg, either
[13:45:58] o/ mercelisv:
[13:46:14] check
[13:51:52] my current hypothesis is that something got broken in Istio for outgoing connections, something like the IP for api-ro being wrong (or changing and the cached DNS response never expiring). Alternatively, iptables for that pod got scrambled. How either of these would happen, I don't know.
[13:56:42] another alternative could be that for some reason connections started to pile up, maybe because of some responses from the mwapi
[13:58:49] Also plausible, but then something should be visible on Logstash for the api-ro?
[13:59:48] maybe it was a little hiccup, nothing that would totally be seen in the percentiles, that caused a ton of pileups
[14:00:27] hmm. I also considered a network problem, but that would affect just ruwiki requests
[14:00:34] wouldn't*
[14:00:48] also we have another variable, namely aiohttp's connection pool
[14:01:30] we do keep one local to the model-server, in the past I wondered if we could migrate away from it to just use a plain new conn every time (since it would be only to localhost, istio already does connection pooling)
[14:01:39] we do set 5s of timeout for aiohttp though
[14:03:05] yeah, connpools on the kserve container probably are not all that useful. Not that it's proven that they were the cause. But them malfunctioning "somehow" could cause this
[14:05:12] https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve?forceLogin&orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-kubernetes_namespace_controller=kserve&var-kubernetes_namespace_queue_proxy=revscoring-editquality-damaging&var-app=ruwiki-damaging-predictor-default-00017&from=1713105184263&to=1713106818043&viewPanel=21
[14:05:26] this is when everything happened, from the point of view of the queue proxy
[14:06:33] now one thing is very interesting - why did 502 responses take so long to complete?
[14:07:26] almost a minute is very long indeed.
[14:08:52] the 5s timeout you mentioned is our clientside timeout for requests to mwapi?
[14:09:10] in theory yes, what we set for aiohttp's sessions
[14:09:15] that we pass to mwapi
[14:09:51] hm. what if the TCP-side connection went fine, but the packet rate was very slow or 0...
[14:12:03] another thing to consider is the cpu time used by kserve
[14:12:04] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-damaging&var-pod=ruwiki-damaging-predictor-default-00017-deployment-64c6587tgp2c&var-pod=ruwiki-damaging-predictor-default-00017-deployment-64c6587v5lxx&var-pod=ruwiki-damaging-predictor-default-00017-deployment-64c6587xv4zc&var-container=All&fro
[14:12:10] m=now-30d&to=now&viewPanel=45
[14:12:55] it seems to spike a lot, and the main issue with the aiohttp stuff is that we have a single thread that gets totally blocked on cpu time
[14:13:12] Still waiting for the dashboard to load
[14:13:26] (yes I added the split-off part that IRC did)
[14:15:58] (CR) Kevin Bazira: "Thanks everyone for taking the time to review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
[14:18:51] o/ I'm back in the conversation
[14:20:48] I think the CPU load is a symptom rather than an indicator of root cause: due to the timeouts, there are many requests in flight, and the service starts stepping on its own feet.
[14:21:55] the spikes were also before the timeouts, I think that in general the kserve pod for ruwiki spends a lot of cpu time during preprocess, which slows down requests (whenever cpu time is being burned, everything else is stalled). Not sure if it was the root cause for today's trouble though
[14:23:31] I wish I had gotten the right istio-proxy logs, then we would have more data on what happened at 14:45
[14:25:04] shall we gather some action points to explore in the following days?
[14:26:37] I agree with moving away from the aiohttp connection pool (using a new conn every time). It is supposed to be faster but it may cause issues
[14:27:02] Plus, the pooling gains we already get from Istio, most likely
[14:33:02] as for AIs, I will take another look if I can tease something from the mwapi dashboards on Logstash. Removing pooling from aiohttp is probably not super urgent, but we might want to give it a try (with testing of course, so we don't have a perf regression). I'm also considering digging a bit into non-ruwiki damaging isvcs, to see if they have the same error at a low rate.
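To make the "plain new conn every time" idea concrete, here is a minimal sketch of what the mwapi call path could look like without a long-lived connection pool; the exact mwapi/aiohttp wiring in the model servers may differ, so treat the AsyncSession arguments and names as assumptions:

```python
import aiohttp
import mwapi  # python-mwapi; its AsyncSession can wrap an external aiohttp session

async def fetch_rev(wiki_url: str, rev_id: int):
    # Fresh connection per request: no keep-alive pooling inside the model
    # server, leaving connection reuse to the istio sidecar instead.
    timeout = aiohttp.ClientTimeout(total=5)             # the 5s client-side timeout mentioned above
    connector = aiohttp.TCPConnector(force_close=True)   # close the socket after each request
    async with aiohttp.ClientSession(timeout=timeout, connector=connector) as http:
        session = mwapi.AsyncSession(wiki_url, user_agent="revscoring-isvc", session=http)
        return await session.get(action="query", prop="revisions", revids=rev_id)

# e.g. asyncio.run(fetch_rev("https://ru.wikipedia.org", 723))
```

With a pattern like this, the pooling caveats described in the aiohttp persistent-session docs (linked at [14:40:50] below) mostly go away, at the cost of one extra handshake per request to the local sidecar.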
[14:33:22] kevinbazira: plz wait until tomorrow so Luca and Aiko have time to re-review since we are focusing on the production issue we had
[14:38:23] Have to run an errand, be back in a bit
[14:40:50] this is interesting https://docs.aiohttp.org/en/latest/client_advanced.html#persistent-session
[14:46:32] added some info to the task
[14:46:47] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9713711 (elukey) Something strange: from the istio gateway logs, the HTTP response code logged is `0` https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693 Usually this...
[14:50:34] thanks Luca!
[14:51:46] I was wondering about the 0 response code. Could it be the case that the client drops the connection (has set a timeout)? When I was checking this morning I would get a 5xx after approx 300 seconds
[14:52:02] ah ok this is a good data point
[14:52:26] maybe we have a very long timeout at the api-gateway level?
[14:52:36] and conns pile up instead of failing fast
[14:53:59] in any case, probably the most pressing thing is to get some basic alerts, until we have the SLO ones in place
[14:54:38] timeout at the api-gateway is 30s
[14:54:56] but how is it possible that Ilias got a 50x after 300s?
[14:55:15] and also our istio gateway should have tighter timeouts
[14:56:23] lemme double check my logs in case my memory is failing me
[15:00:44] I can't find any reference, so nevermind the 300s. It was definitely a lot but unfortunately I didn't log a `time curl` command at the time
[15:01:13] so they could have been 30s
[15:10:31] Mh, istio typically logs the amount of bytes transferred; if it's 0, it's likely a complete TCP SYN timeout; if there are some bytes, the tcp side would be ok, but the timeout hit before the req/resp was complete. Not sure if this is helpful, but something that occurred to me.
[15:23:44] everything points to the queue proxy afaics
[15:23:53] the revision-timeout-seconds is the default, 5 mins
[15:25:31] maybe queue proxy keeps the connection open to the kserve container even if the client aborted the conn
[15:25:57] waiting every time for 5 mins, so ending up with connection pileups if kserve doesn't return anything quickly
[15:26:17] sockstat says
[15:26:18] TCP: inuse 211 orphan 3 tw 113 alloc 13542 mem 6117
[15:26:58] so the queue proxy had ~200 conns inuse, 113 in time wait, and alloc (so also closed ones IIUC) ~13k
[15:27:13] that seems really a lot for a single pod
[15:31:23] not sure if alloc is the # of connections though. "mem" doesn't fit that either
[15:31:31] I'll see if I can figure it out from the code
[15:32:07] I think it just dumps sockstat
[15:32:46] but 211 + 113 in timewait are a lot
[15:32:58] for the traffic it handles, that is my point
[15:33:37] that is still true, though IIRC, time_wait sockets cost very little. maybe we should get those numbers from a non-broken service
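Assuming, as suggested above, that the queue proxy is just dumping /proc/net/sockstat, the comparison with a non-broken service is easy to script; a small helper sketch:

```python
import re

def parse_sockstat(text: str) -> dict:
    """Turn the 'TCP: inuse N orphan N tw N ...' line of /proc/net/sockstat into a dict."""
    line = re.search(r"^TCP: (.+)$", text, re.MULTILINE).group(1)
    fields = line.split()
    return {key: int(value) for key, value in zip(fields[::2], fields[1::2])}

# The ruwiki dump above and the enwiki one quoted just below:
ruwiki = parse_sockstat("TCP: inuse 211 orphan 3 tw 113 alloc 13542 mem 6117")
enwiki = parse_sockstat("TCP: inuse 634 orphan 0 tw 58 alloc 12231 mem 6433")
print({k: ruwiki[k] - enwiki[k] for k in ruwiki})
```

Note that on these kernels the sockstat counters are namespace-wide rather than per-process, so they describe the whole pod network namespace, not only the queue proxy.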
[15:34:17] sure but they are an indication of being busy, the pod gets few rpses
[15:34:44] oh, totally agreed
[15:43:44] So I just docker exec't into an enwiki editqual-damaging queue proxy container and sockstat says:
[15:43:48] TCP: inuse 634 orphan 0 tw 58 alloc 12231 mem 6433
[15:44:08] Granted, enwiki is fairly busy
[15:44:26] but it tells us that the ruwiki numbers for alloc and mem are fine.
[15:44:59] that time_wait and orphan are much lower might also be due to higher traffic (higher resource reuse pressure)
[15:45:18] But I doubt ruwiki has 1/3 the traffic of enwiki
[15:47:08] yep yep
[15:52:36] another option that I can think of is that, for some reason, a ruwiki revision-id passing through preprocess (maybe it was super heavy) caused revscoring to become completely unusable
[15:53:04] preprocess latency right after the issue is a constant 10s
[15:54:05] added to the summary
[15:56:29] good observation. it ties in a bit with the previous report that ruwiki stuff was "unstable" in the week before.
[16:14:18] INFO:root:Function get_revscoring_extractor_cache took 438.1846 seconds to execute.
[16:15:21] we don't log the rev-id IIRC, but the one above is all cpu time
[16:15:41] so pod totally borked, piling up connections and causing timeouts
[16:16:00] the first one that I can see happened around 14:59
[16:16:51] INFO:root:Function get_revscoring_extractor_cache took 511.2014 seconds to execute. at 15:54
[16:16:55] sigh
[16:16:58] isaranto: --^
[16:17:26] we should probably have a slow-logger action in our isvcs
[16:17:38] in which we dump how to reproduce
[16:17:47] :(
[16:17:50] like "this is the request payload that caused it"
[16:17:58] what do you mean by "slow-logger"?
[16:18:17] something that logs a warning if something takes more than x seconds to complete
[16:18:22] like preprocess
[16:18:28] but with useful info
[16:18:52] otherwise we don't know the trigger
[16:18:57] it will surely re-happen in the future
[16:19:52] +1
[16:20:47] also make sure we log the payload. I know we do in some exceptions but I'm not sure it happens all the time
[16:22:31] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9714358 (elukey) I think that multiple requests caused a ton of time spent in preprocess(), causing the isvc to totally stall and get into a weird state (most probably revscoring ended...
[16:22:48] ok added some links, please double check when you have a moment that it makes sense
[16:22:54] I think it is the most plausible explanation
[16:23:11] also we need to add monitoring
[16:23:52] awesome, thanks!
[16:24:20] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9714360 (elukey) >>! In T362503#9712500, @Q-bit-array wrote: > As a creator of the other ticket (T362506) I would add that the ORES/LiftWing infrastructure in the Russian Wikipedia was...
[16:24:44] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9714369 (elukey) p:Triage→High
[16:26:10] +1 on monitoring. Tomorrow we have our planning and we could discuss this on Wednesday
[16:27:12] I think we have some action items already (connection pooling, logging the request payload, etc.), and we can look into adding some alerts separate from SLOs (if that is what you meant)
[16:31:50] (CR) Elukey: "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1017453 (https://phabricator.wikimedia.org/T361803) (owner: Kevin Bazira)
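On the slow-logger idea discussed at [16:17:26]-[16:20:47], a minimal sketch of what such a guard around preprocess could look like; names, the threshold, and the assumed (self, payload, ...) signature are illustrative, not the actual model-server code:

```python
import functools
import json
import logging
import time

SLOW_THRESHOLD_S = 5.0  # assumed threshold; tune per model server

def log_if_slow(func):
    """Log a reproducible warning (duration + request payload) when a step is slow."""
    @functools.wraps(func)
    async def wrapper(self, payload, *args, **kwargs):
        start = time.monotonic()
        try:
            return await func(self, payload, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD_S:
                logging.warning(
                    "%s took %.1fs, payload: %s",
                    func.__name__, elapsed, json.dumps(payload, default=str),
                )
    return wrapper
```

Logging the payload alongside the duration is what would make a pathological revision like the 438s/511s get_revscoring_extractor_cache calls above replayable after the fact.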
[16:33:38] isaranto: exactly yes, the SLO monitoring/dashboards/etc.. may stall for a while until we solve the issues on the olly side
[16:33:58] we should probably alert for really clear issues like this one
[16:48:46] going afk for today folks!
[16:48:49] have a nice rest of the day!
[16:49:46] ciao Luca, I'm going afk as well in a bit folks!
[16:57:22] (PS11) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[16:59:08] (CR) Jsn.sherman: Exclude first/only revision on page from scoring (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[17:09:58] heading out as well, \o
[20:29:26] (CR) Umherirrender: [C:-1] Migrate usage of Database::delete, insert, update and upsert to QueryBuilder (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/1007862 (https://phabricator.wikimedia.org/T358831) (owner: MPGuy2824)
[23:21:41] Machine-Learning-Team, Research: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9715829 (XiaoXiao-WMF) @achou please let us know if there is a corresponding ML board related to this task? We'd like to know if...