[01:25:12] 10Machine-Learning-Team: Upload recommendation-api preprocessed numpy binaries to Swift - https://phabricator.wikimedia.org/T346411 (10kevinbazira)
[01:28:46] 10Machine-Learning-Team: Upload recommendation-api preprocessed numpy binaries to Swift - https://phabricator.wikimedia.org/T346411 (10kevinbazira) The recommendation-api preprocessed numpy binaries were uploaded successfully to [[ https://wikitech.wikimedia.org/wiki/Thanos | Thanos Swift ]]. Below are their sto...
[05:42:58] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "Nice work Kevin! I left some comments, feel free to follow or disregard them as you wish!" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[06:33:25] (03CR) 10Elukey: "Nice work! I think we have some things to fix but almost there!" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[06:46:26] 10Machine-Learning-Team: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10isarantopoulos) a:03isarantopoulos As part of the initial investigation we can create a group/team for the alerts in [[ https://gerrit.wikimedia.org/g/operations/alerts | operations/alerts repo ]] and then create a d...
[06:55:11] 10Machine-Learning-Team: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10elukey) I think that we should coordinate with SRE (@RLazarus for example) before proceeding further with SLO alarming, we don't want to derail from the SRE recommendations :) My 2c: having an alert that fires at certain...
[06:56:58] Afk for approx 1h
[07:00:04] hello folks
[07:09:45] 10Machine-Learning-Team, 10Patch-For-Review: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift - https://phabricator.wikimedia.org/T346218 (10kevinbazira)
[07:33:28] 10Machine-Learning-Team: Tune LiftWing autoscaling settings for Knative - https://phabricator.wikimedia.org/T344058 (10elukey) 05Open→03Resolved We applied the `rps` strategy to all our isvcs, and re-calibrated autoscaling settings. The autoscaling graphs look much better now, I am inclined to close. Thanks...
[07:34:00] 10Machine-Learning-Team, 10Wikimedia Enterprise: Elevate LiftWing access to WME tier for development and production environment - https://phabricator.wikimedia.org/T346032 (10elukey) 05Open→03Resolved
[07:34:45] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10elukey) a:05MGerlach→03achou
[07:35:14] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10elukey) a:05achou→03klausman
[07:35:57] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10elukey) @klausman Assigned the task to you since there are a couple of steps that are more related to SRE (lemme know if you don't have time,...
[08:11:21] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10elukey) Really nice finding! It seems to match exactly https://sal.toolforge.org/log/HIAEhIoBGiVuUzOdD...
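
A minimal sketch of what the float32-from-Swift idea above (T346218/T346411) could look like; the Swift URL, container and object names here are hypothetical placeholders, not the actual values from the Gerrit patch:

    # Minimal sketch (not the actual patch in Gerrit 956846): download a
    # preprocessed float32 .npy object from Thanos Swift over HTTP and load it.
    # The URL and object path below are hypothetical placeholders.
    import io

    import numpy as np
    import requests

    SWIFT_URL = "https://thanos-swift.discovery.wmnet/v1/AUTH_recommendation-api"  # assumption
    OBJECT_PATH = "recommendation-api/embeddings_float32.npy"                      # assumption

    def load_embeddings() -> np.ndarray:
        resp = requests.get(f"{SWIFT_URL}/{OBJECT_PATH}", timeout=30)
        resp.raise_for_status()
        # np.load reads dtype/shape from the .npy header; storing float32
        # instead of float64 halves the download and memory footprint.
        arr = np.load(io.BytesIO(resp.content))
        assert arr.dtype == np.float32
        return arr
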
[08:18:22] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10elukey) [[ https://grafana-rw.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?forceLogin&orgId=1&var-backend=e...
[08:20:55] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10Ladsgroup) I suggest upping concurrency value of ORESFetchScore jobs in change prop (it's in helmfile....
[08:21:37] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10elukey) The Kafka consumer lag dashboard [[ https://grafana-rw.wikimedia.org/d/000000484/kafka-consume...
[08:25:37] Amir1: o/ thanks for the suggestion, filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957864
[08:25:51] Thanks elukey !
[08:25:58] 60 is way too much IMO :D
[08:26:10] let's start with 25, if not enough slowly bump it
[08:26:28] an increase in this means a decrease for other jobs
[08:26:37] so they might run into issues
[08:26:52] Amir1: let's do 30?
[08:26:58] deal
[08:27:07] ack, also contacted serviceops
[08:27:11] we should do negotiation more often
[08:27:18] definitely :D
[08:28:05] the biggest difference is that LW makes reqs for each model for each revision, while old ores batched them, so this one is making around 4 times more reqs in one job
[08:29:31] good point, I'll write it
[08:30:23] updated
[08:31:11] Amir1: all right deploying
[08:31:19] awesome
[08:33:28] nice finding folks!
[08:34:06] also enwiki is one of the few (or only) wikis that requests scores for 4 models (most wikis just have damaging and goodfaith)
[08:34:43] ah didn't know that
[08:34:46] what are the others?
[08:36:42] articlequality and draftquality
[08:38:19] I didn't check their latency, maybe we could do something
[08:38:38] I'm checking now
[08:40:31] it is even lower than damaging. Still the issue is that these requests are sequential, which means Job time = sum(time it takes for all requests)
[08:44:32] ahhh TIL, I didn't realize they were sequential
[08:44:39] but with PHP it makes sense yes
[08:48:45] ok changes deployed
[08:50:16] watching https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqiad%20prometheus%2Fk8s&var-job=ORESFetchScoreJob&from=now-1h&to=now
[08:50:25] 10Machine-Learning-Team: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10isarantopoulos) Sure! I agree, we'll follow what SRE does regarding SLO alarming.
[08:52:50] let's see if it improves
[08:52:52] * elukey bbiab
[08:57:59] 🤞
[09:12:54] back!
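
To make the 08:28-08:40 point above concrete, a tiny back-of-the-envelope sketch of sequential per-model requests versus an old-ORES-style batched call; the latencies are made-up numbers, not measurements:

    # Illustration of why per-model, sequential requests inflate ORESFetchScoreJob
    # run time. The per-model latencies below are invented example values.
    latencies = {"damaging": 0.4, "goodfaith": 0.4, "articlequality": 0.3, "draftquality": 0.3}

    # One request per model, one after another: the job waits for the sum.
    sequential_job_time = sum(latencies.values())

    # Old ORES-style single batched call: roughly the slowest model plus overhead.
    batched_job_time = max(latencies.values()) + 0.2

    print(f"sequential: {sequential_job_time:.1f}s, batched: ~{batched_job_time:.1f}s")
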
[09:13:02] p50 backlog time is decreasing
[09:13:39] consumer lag for the job seems down right now
[09:13:56] one weird thing - "retry" seems to have a neverending lag
[09:18:15] kafkacat -C -t eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob -b kafka-main1001.eqiad.wmnet:9092 -o latest
[09:18:27] this shows a constant amount of events
[09:19:44] it seems that we throw 500s
[09:22:59] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10isarantopoulos) Another interesting thing seen [[ https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?o...
[09:23:20] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10elukey) The metrics now look better! One thing that I noticed is that we have a lot of events in the O...
[09:26:27] isaranto: we have a serious issue I think
[09:26:37] I am trying
[09:26:38] curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/eswiki-damaging:predict -X POST -d '{"rev_id": 153735}' -i -H "Host: eswiki-damaging.revscoring-editquality-damaging.wikimedia.org" --http1.1
[09:26:50] it is one of the rev_ids listed in the retry topic
[09:26:52] and it hangs
[09:26:58] meanwhile if I hit codfw it works
[09:27:02] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10Ladsgroup) The backlog is back to the previous values \o/ https://grafana.wikimedia.org/d/CbmStnlGk/jo...
[09:27:36] 10Machine-Learning-Team, 10Observability-Alerting: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10Aklapper)
[09:28:12] indeed
[09:29:23] it makes zero sense, from the istio metrics I don't see high latencies
[09:30:33] eswiki-damaging-predictor-default-00011-deployment-754dcd86j6kb 2/3 Running 0 2d22h
[09:30:36] what the hell
[09:31:40] ouch
[09:31:50] it hung again (?)
[09:32:12] Tobias deleted eswiki, ruwiki, zhwiki and nlwiki because they were failing
[09:32:58] it is the queue proxy
[09:32:59] aggressive probe error (failed 72 times): dial tcp 127.0.0.1:8080: i/o timeout
[09:35:11] the same is happening again for ruwiki and nlwiki
[09:35:24] all the ones in the retry queue
[09:36:04] the queue proxy fails since the kserve container is not responding, for some reason
[09:37:14] I am going to kill all pods but one to keep investigating
[09:38:39] isaranto: do you recall when the last kill action was?
[09:39:12] ah wait I have a suspicion
[09:39:34] we don't really have logs right now with kserve 0.10, since there was all the mess of config that got fixed in 0.11
[09:39:45] so it is probably kserve emitting some horror that we don't see
[09:41:04] elukey: the last kill action was 2d22h ago!
[09:41:22] exactly when the pods were started
[09:41:35] I don't recall if we deployed anything, so maybe kserve went into a weird space with some requests
[09:41:46] we need to plan the upgrade to 0.11 asap
[09:42:12] I noticed that they were failing and asked Tobias to delete them (it was Tuesday)
[09:45:10] the retry lag seems to be improving
[09:45:25] and its offset increment in kafka is improving as well
[09:45:29] uff this is not great
[09:45:37] we have a serious bug
[09:48:42] klausman: o/ around?
[09:54:11] yes!
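
A rough Python equivalent of the curl at 09:26 above, with an explicit client timeout so a blackholing pod shows up as an exception instead of hanging the shell; the timeout value and CA bundle path are assumptions:

    # Sketch of the same predict call with a hard client-side timeout.
    import requests

    url = "https://inference.svc.eqiad.wmnet:30443/v1/models/eswiki-damaging:predict"
    headers = {"Host": "eswiki-damaging.revscoring-editquality-damaging.wikimedia.org"}

    try:
        resp = requests.post(url, json={"rev_id": 153735}, headers=headers,
                             timeout=10,
                             verify="/etc/ssl/certs/ca-certificates.crt")  # assumed CA bundle path
        print(resp.status_code, resp.json())
    except requests.exceptions.Timeout:
        print("request hung for >10s: the pod is likely blackholing traffic")
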
[09:54:21] isaranto: my first instinct is to go forward with kserve 0.11 in say damaging, at least our docker imgs
[09:55:12] I was halfway into making the change to expose readability on the APIGW, let me catch up
[09:55:12] klausman: hello :) We have a serious issue with revscoring pods (and potentially all isvcs), namely that sometimes the kserve container goes awol and starts blackholing traffic
[09:55:15] yes I agree, so we can also create proper access logs and dashboards!
[09:55:33] exactly yes, and then we can upgrade the control plan
[09:55:35] *plane
[09:55:59] trying to build revscoring with kserve 0.11
[09:56:10] huh.
[09:58:34] elukey: when you say "sometimes", what is the rough ratio?
[09:59:20] we are still trying to figure it out, you killed pods a couple of days ago for the same reason
[09:59:31] today it was 4 pods in damaging and 4 in goodfaith more or less
[10:00:08] So the pod goes broken and answers no req's at all?
[10:00:28] the reqs hang until the client timeout kicks in IIUC
[10:00:43] the queue proxy container fails to contact 127.0.0.1 on the kserve port
[10:00:51] I just tried the eswiki one, and it seems fine, I guess you recently restarted it?
[10:00:54] so somehow the kserve container hangs, but we have no logs
[10:01:09] klausman: yep check pod update timings
[10:02:06] the "no logs" part is seriously annoying
[10:04:09] so my plan is to start the upgrade to kserve 0.11 in our docker images, say damaging, so we can roll it out next week in hopefully one namespace
[10:04:22] while you work on the control plane, that can be done separately
[10:04:33] since the migration is yours, do you like the idea?
[10:04:42] or do you prefer something else?
[10:05:04] eah, it seems like the best option
[10:05:07] +y
[10:05:39] I just hope nothing on the container side relies on changes in the control plane (it would surprise me a bit, but who knows)
[10:06:45] never happened so far, hopefully we can test it in staging
[10:07:45] ack
[10:07:58] I presume these errors are normal:
[10:08:00] 2023-09-15 10:07:14.292 72 root ERROR [get_revscoring_extractor_cache():99] Received a badrevids error from the MW API for rev-id 153740127. Complete response: {'batchcomplete': '', 'query': {'badrevids': {'153740127': {'revid': 153740127, 'missing': ''}}}}
[10:09:09] sort of
[10:09:39] ack
[10:11:15] sry the ack above was for a previous msg (somehow irc hadn't received the latest ones)
[10:11:20] Ok, I am running kubetail on all the editquality kserve containers, filtering the usual timing noise. If another container falls over, we may get some logs from just before it fails
[10:13:29] klausman: I checked the pod logs before killing, there wasn't anything interesting
[10:13:54] Hurm.
[10:14:18] I wonder if I can find something in the k8s logs. What was the pod ID of one that you killed?
[10:14:41] you can see some info in get events, in theory
[10:15:04] an upstream reference is https://github.com/kserve/kserve/issues/1151
[10:15:20] but it basically says that the kserve container is misbehaving
[10:15:22] klausman: those errors happen when a page is deleted before we have a chance to read its content. MW API will throw badrevids
[10:18:00] the weird thing is that I don't see traces of traffic dropped etc.. from the istio metrics
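
For reference, a small illustration of the badrevids case described at 10:08/10:15 above, based on the response structure quoted in the log; the function and exception are illustrative, not the actual inference-services code:

    # When a page is deleted (or the revision is gone) before we fetch its
    # content, the MW API reports the rev-id under query.badrevids.
    def check_badrevids(mw_response: dict, rev_id: int) -> None:
        bad = mw_response.get("query", {}).get("badrevids", {})
        if str(rev_id) in bad or rev_id in bad:
            raise ValueError(f"rev-id {rev_id} reported as a badrevid by the MW API")

    sample = {"batchcomplete": "", "query": {"badrevids": {"153740127": {"revid": 153740127, "missing": ""}}}}
    check_badrevids(sample, 153740127)  # raises ValueError
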
[10:19:17] nothing even from https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve?orgId=1&from=now-7d&to=now
[10:20:31] #24 96.66 kserve 0.11.0 depends on tabulate<0.10.0 and >=0.9.0
[10:20:31] #24 96.66 revscoring 2.11.10 depends on tabulate<0.8.999 and >=0.8.7
[10:20:37] * elukey cries in a corner
[10:21:08] revscoring, it has been a while
[10:23:31] oh god.
[10:28:08] and there is an issue with PyYaml, so we need to file a req for yamlconf (
[10:33:23] the issue seems to be https://github.com/yaml/pyyaml/issues/724
[10:33:24] I can take a look at the dependencies as well!
[10:35:21] so https://github.com/yaml/pyyaml/pull/702 seems to have fixed it
[10:35:41] isaranto: I am going to send some patches for you to review
[10:35:50] but yamlconf will need a release first
[10:38:13] isaranto: https://github.com/halfak/yamlconf/pull/8
[10:38:51] not sure how to add reviewers though
[10:38:56] * elukey sends an email to Aaron
[10:39:51] I have an appointment in 20m. Ok for me to leave you to it? I should be back by 1430 or so
[10:40:09] yeah nothing on fire atm, I am going afk as well for a couple of hours
[10:40:24] (email sent)
[10:40:45] isaranto: going to open a task for the upgrade later on, I'll prioritize it
[10:41:12] in the meantime, we should figure out if there is a metric that indicates when a pod is misbehaving
[10:41:15] and alarm on it
[10:41:28] (https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-ORESFetchScoreJob&from=now-3h&to=now looks good! retry is getting better)
[10:43:22] * elukey lunch
[10:47:21] for the release we can just ping aaron in a comment like we did when we needed to bump it again https://github.com/halfak/yamlconf/pull/7
[10:47:32] otherwise we can use our own fork
[10:49:06] this could be the first set of alerts we work on (k8s system metrics about pods) instead of SLOs
[10:52:00] (03PS7) 10Kevin Bazira: Load preprocessed numpy arrays from swift [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218)
[10:55:59] (03CR) 10Kevin Bazira: Load preprocessed numpy arrays from swift (0310 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[10:58:44] Looking at kube-metrics we could set an alert on `kube_pod_status_ready`, but I don't see these kinds of metrics available in our cluster in https://prometheus-eqiad.wikimedia.org/k8s-mlserve/
[10:59:47] unless they are gathered in another prometheus instance e.g. https://prometheus-eqiad.wikimedia.org/ops/
[10:59:47] I remember you mentioned that there is work being done regarding kube-metrics ...
[11:06:05] elukey: and Amir1: thanks for all the help and actually resolving the issue <3
[11:06:29] <3 glad to be useful sometimes
[11:09:01] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10isarantopoulos) Great work @elukey and @Ladsgroup ! I'll keep an eye on this and if all continues wel...
[11:22:55] \o/
[11:31:36] * isaranto lunch
[12:36:57] Amir1: "sometimes" --- don't sell yourself short, you're doing great work
[13:04:41] yep!
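
A quick way to see the tabulate conflict quoted at 10:20 above, using the constraints verbatim; this is just an illustration of why pip cannot resolve them:

    # The two tabulate constraints have no overlap, so no single version can
    # satisfy both kserve 0.11.0 and revscoring 2.11.10 at the same time.
    from packaging.specifiers import SpecifierSet

    kserve_req = SpecifierSet(">=0.9.0,<0.10.0")       # kserve 0.11.0
    revscoring_req = SpecifierSet(">=0.8.7,<0.8.999")  # revscoring 2.11.10

    candidates = ["0.8.7", "0.8.9", "0.8.10", "0.9.0", "0.9.3"]
    both = [v for v in candidates if v in kserve_req and v in revscoring_req]
    print(both)  # [] -> revscoring needs a new release with a relaxed upper bound
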
[13:04:59] isaranto: kube-metrics still not deployed, I have to work with Kamila on importing the chart
[13:12:42] (03CR) 10Elukey: Load preprocessed numpy arrays from swift (038 comments) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[13:16:38] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10elukey) We found a serious bug though, namely sometimes the kserve container inside an isvc pod stops...
[13:18:04] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes: User: Wikipedia recent changes list the edit highlighting by ORES has disappeared - https://phabricator.wikimedia.org/T346175 (10calbon) Thanks for this work. Let's talk about that ticket on Tuesday
[13:23:26] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey)
[13:26:22] 10Machine-Learning-Team: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (10elukey)
[13:29:59] 10Machine-Learning-Team: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (10elukey)
[13:33:20] elukey: do you want me to work on python dependencies for kserve 0.11?
[13:33:38] or to rephrase it: how may I help?
[13:37:23] isaranto: I'll send some stuff to review :)
[13:37:58] * isaranto sharpening his local python env
[13:42:18] <3
[13:55:07] ah! my GH notifications work again :)
[14:01:03] 10Machine-Learning-Team: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (10elukey) Filed https://github.com/halfak/yamlconf/pull/8 for yamlconf Filed https://github.com/wikimedia/revscoring/pull/547 for revscoring
[14:06:59] elukey: I am reviewing now, cross-checking if these dependencies will play nicely with inf-services
[14:07:31] isaranto: wait a sec, still working on it :)
[14:07:40] kk
[14:08:23] nevermind, I thought it was ready to review
[14:08:47] (03CR) 10Kevin Bazira: Load preprocessed numpy arrays from swift (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[14:11:55] nono still trying to make the revscoring docker image deps right :)
[14:12:08] In theory no more changes for revscoring buuuut
[14:12:09] :)
[14:12:17] little errand, bbiab!
[14:36:35] isaranto: done! The revscoring version builds correctly with kserve 0.11
[14:36:39] free to review :) thanks!
[14:36:43] It is Friday afternoon, I appreciate all the effort but if it can be held over until Monday, now is the time to write the final commit message and close the IDE
[14:37:17] Basically don't take on something huge you can't put down until 8pm on a Friday
[14:38:12] chrisalbon: we are not upgrading anything today, it is just prep work :)
[14:38:21] okay whew
[14:38:28] it will require a lot of testing etc..
[14:38:39] but since it is a multi-piece thing, I started to day
[14:38:41] *today
[14:39:10] (03PS1) 10Elukey: WIP - Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446)
[14:52:40] (03CR) 10Ilias Sarantopoulos: Load preprocessed numpy arrays from swift (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[14:55:42] (03CR) 10Elukey: Load preprocessed numpy arrays from swift (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/956846 (https://phabricator.wikimedia.org/T346218) (owner: 10Kevin Bazira)
[15:02:09] kevinbazira: thanks for the explanation and sorry for the confusion
[15:03:28] elukey: no problem. thanks for the reviews :)
[15:12:37] I'm building the prod revscoring image on the isvc repo as well and will let you know (it unfortunately takes a long time to rebuild)
[15:13:21] really happy that we also bump the ray version, as there are issues with Apple Silicon CPUs on older versions!
[15:21:54] done, works great! I even ran a model server with it
[15:22:05] nice!
[15:22:09] did you try https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/957948 ?
[15:23:00] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I've tested this with the updated version of yamlconf from the upcoming 2.11.11 release in https://github.com/wikimedia/revscoring/pull/54" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446) (owner: 10Elukey)
[15:23:09] super thanks :)
[15:23:15] yes I did
[15:23:32] we can think about a rollout on Monday then
[15:23:41] in staging of course, then we can load test etc..
[15:24:02] we never saw these issues before, I suspect they may be related to autoscaling
[15:24:09] I tried it with the yamlconf version from the revscoring repo. feel free to merge the GH PR at least so we have the new version
[15:24:30] By Monday we may have the new yamlconf version on PyPI
[15:24:38] otherwise I'll merge, wdyt?
[15:24:47] yes you're right!
[15:26:00] I think next week we should focus on rolling this out + creating proper access logs as well as the dashboard for ores-legacy https://phabricator.wikimedia.org/T341547
[15:26:20] sure yes
[15:26:35] I was trying to find alternatives for alerts but I don't see any other clean way if we don't have kube-metrics
[15:27:03] the indirect way would be to publish a custom metric
[15:38:37] elukey: klausman: eswiki-damaging is failing again. Could you delete the pod once more?
[15:38:56] what the..
[15:39:24] I mean since I don't have permissions to do it (who can I ask for them?)
[15:40:35] not sure what we should do to add the extra perms, I think only SRE have them and it is difficult to change this (we'd be the first I think)
[15:40:56] kowiki goodfaith same thing
[15:41:06] I could create a task and ask for them
[15:41:39] otherwise I would have to create a patch in dep-charts and issue a dummy commit that changes sth so that I can go and sync in order to do this
[15:41:52] nono it is fine for the moment to ask me or Tobias
[15:42:17] I want to check knative logs fifrst
[15:42:19] *first
[15:43:02] ok!
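
One possible shape for the "publish a custom metric" idea at 15:27 above, sketched with prometheus_client; the metric name, port and wiring are assumptions and this is not how the current model servers expose metrics:

    # Sketch: a heartbeat gauge exposed by the model-server process itself, so an
    # external alert can fire when predictions stop (the metric goes stale).
    import time

    from prometheus_client import Gauge, start_http_server

    LAST_PREDICTION = Gauge("model_server_last_successful_prediction_timestamp",
                            "Unix time of the last successfully served prediction")

    def record_success() -> None:
        # Call this at the end of every successful predict() in the model server.
        LAST_PREDICTION.set_to_current_time()

    def start_metrics_endpoint(port: int = 9100) -> None:
        # Serves /metrics in a background thread; Prometheus scrapes it, and an
        # alert on time() - metric > threshold would catch a hung container.
        start_http_server(port)

    if __name__ == "__main__":
        start_metrics_endpoint()
        record_success()
        while True:
            time.sleep(60)
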
[15:49:22] it is very weird that I see
[15:49:24] "Readiness probe failed: HTTP probe failed with statuscode: 500"
[15:49:33] this is for kowiki's kserve container
[15:49:49] in the kserve logs there are traces of requests only up to 10 UTC
[15:52:31] and this happens only for revscoring afaics
[15:52:46] mmmmm
[15:53:02] maybe there is a specific request, triggered by mediawiki calls, that hits kserve badly
[15:56:21] isaranto: is there a way to see if we hit kowiki's goodfaith from mediawiki?
[15:57:14] (probably not but worth asking)
[15:57:14] ORES extension is enabled for kowiki so we hit LW
[15:57:25] if that answers your question..
[15:57:42] I think the issue could be a spike in requests https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=kowiki-goodfaith-predictor-default-00012-private&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=now-9h&to=now
[15:57:51] yes yes I wanted to know if we hit kowiki's in the past hour, to figure out why it failed
[15:57:54] I see a spike in requests/s in all these containers
[15:58:26] the FLAG is downstream close
[15:58:28] weird
[15:58:50] `the FLAG is downstream close` ??
[15:59:10] second row, second graph
[16:00:28] ah, got it
[16:02:39] ok lemme try to nsenter the container
[16:06:33] since for all these containers we have spikes around 10-11 UTC I could bump minReplicas to 2 to see if it will resolve the issue at least for the weekend, and then we can check on Monday why the autoscale-up didn't trigger (if that was the issue)
[16:06:37] wdyt?
[16:12:02] kowiki has one unusual message: [kowiki-damaging-predictor-default-00010-deployment-55b857fscfxc] 2023-09-15 14:30:50.760 72 root ERROR [get_revscoring_extractor_cache():160] An error has occurred while fetching feature values from
[16:12:04] the MW API: 503, message='Service Unavailable', url=URL('http://api-ro.discovery.wmnet/w/api.php?action=query&prop=revisions&revids=35570662&rvslots=main&rvprop=contentmodel%7Ccomment%7Csize%7Cuser%7Ccontent%7Cids%7Cuserid%7Ctimestamp&format=json')
[16:12:24] But I don't think that's indicative of our problem
[16:13:27] kevinbazira: it is good faith not damaging for kowiki
[16:13:48] Bounced the eswiki pod
[16:14:03] klausman: could you please coordinate with me before bouncing pods?
[16:14:10] sure.
[16:14:14] we are trying to figure out what the problem is
[16:14:17] thanks :)
[16:14:27] I checked just now and the age of the pod was 6h31m, so I suspected you hadn't touched it
[16:15:40] As for kowiki, I'll not touch it unless you (plural) tell me to
[16:15:53] kowiki-goodfaith-predictor-default-00012-deployment-574db495bz8 is the pod name
[16:16:30] yes we are working on it
[16:17:11] I think the problem is the python process that runs the model server, somehow it is stuck not serving requests, but attaching gdb to it is not trivial
[16:17:19] Sorry about jumping the gun with eswiki :-/
[16:17:36] np just remember to coordinate before taking actions, it happens :)
[16:17:54] strace maybe as a simpler approach? that way you don't need sources, but you could see if it's spinning on a particular syscall
[16:18:06] (or ltrace)
[16:18:31] not sure if it works 100%, gdb works fine from the host but it says that it is in a different pid namespace
[16:18:34] etc..
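
A sketch of the kind of spot-check discussed above (pods stuck in the 2/3 state), using the Kubernetes Python client; it assumes a kubeconfig with read access and that the user container is named kserve-container:

    # List pods in the revscoring namespaces and flag any whose kserve
    # container is not ready. Purely illustrative, not production tooling.
    from kubernetes import client, config

    NAMESPACES = ["revscoring-editquality-damaging", "revscoring-editquality-goodfaith"]

    def find_unready_kserve_pods() -> None:
        config.load_kube_config()
        v1 = client.CoreV1Api()
        for ns in NAMESPACES:
            for pod in v1.list_namespaced_pod(ns).items:
                for cs in pod.status.container_statuses or []:
                    if cs.name == "kserve-container" and not cs.ready:
                        print(f"{ns}/{pod.metadata.name}: kserve container not ready "
                              f"(restarts={cs.restart_count})")

    if __name__ == "__main__":
        find_unready_kserve_pods()
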
[16:18:39] ah, right
[16:18:40] so I am not sure if what I am reading is reliable
[16:19:29] I got https://phabricator.wikimedia.org/P52514 so far
[16:19:57] nothing seems out of the ordinary
[16:20:40] but if I curl 127.0.0.1:8080 via nsenter it hangs
[16:22:31] how do you find the PID of the server?
[16:22:41] at 9:52 UTC (time of the last request in the logs) there is an increase in aborted requests
[16:22:41] https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=kowiki-goodfaith-predictor-default-00012-private&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=1694771244000&to=1694776030000&viewPanel=2
[16:23:20] klausman: docker ps on the node
[16:23:25] then docker inspect
[16:23:30] ack thx
[16:24:35] isaranto: very weird
[16:26:54] strace shows
[16:26:55] futex(0x7f05d7361540, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1694795166, tv_nsec=123714523}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[16:26:59] but not sure what it is
[16:27:10] ah yes
[16:27:20] just kidding I have no idea :)
[16:27:35] me too :D
[16:28:23] also checked memory/cpu for the container, all good
[16:28:26] Yeah, spinning on a futex like that is normal for multithreaded programs. That it's doing nothing else is odd however
[16:29:03] Looks like a livelock to me
[16:29:22] I'm wondering if using 2 pods (having minReplicas: 2) would resolve the issue for now. I mean if one of the pods hangs, the other could serve requests
[16:30:41] or if I'm completely off and I should stop thinking about autoscaling :P
[16:30:54] the main issue is that it will eventually fail
[16:32:23] What I find odd is that the curl to 8080 hangs on the tcp dial already. Could this be a network (netfilter) issue instead of the process inside being broken?
[16:33:46] could be an istio weirdness, but in the logs I don't see much
[16:33:55] it fails the health probe at some point
[16:34:03] but this is due to the python process hanging
[16:34:34] at this point I'd propose a roll restart of all pods
[16:34:51] all revscoring? or all-all?
[16:35:04] goodfaith and damaging on ml-serve-eqiad
[16:35:54] having the access logs will be essential to understand this issue, but maybe it is just a weird state
[16:36:37] ok to start?
[16:37:03] yes from me
[16:38:26] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) Tried to log on a ml-serve node running a pod hanging, and tried to run gdb (without nsenter): ` (gdb) thread apply all py-bt Thread 5 (Thread 0x7f0571338700 (LW...
[16:38:31] klausman: ok?
[16:39:40] starting with goodfaith, I'll do small batches
[16:39:55] sadly we cannot use the roll restart helmfile thing, we need to kill one-by-one
[16:42:15] ah no wait there is a change to deploy
[16:42:30] a no-op, Aiko increased the version of the kserve-inference chart
[16:42:46] isaranto: ok to deploy that instead? It will clean all pods
[16:43:49] yes!
[16:44:06] ack doing damaging then
[16:44:19] I'll wait for the current pods in terminating state in goodfaith to finish
[16:48:44] elukey: sgtm
[16:50:28] 10Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (10elukey) As a desperate attempt, we restarted all the pods in goodfaith/damaging (eqiad and codfw). It is likely not gonna help but worth trying.
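
A rough sketch of the "kill one-by-one in small batches" roll restart mentioned above, again with the Kubernetes Python client; batch size and pause are arbitrary, and the actual restart was done via a helmfile deploy of the updated chart:

    # Delete pods a few at a time and wait in between so Knative can bring up
    # replacements before the next batch goes away. Illustrative only.
    import time

    from kubernetes import client, config

    def roll_restart(namespace: str, batch_size: int = 2, pause_s: int = 60) -> None:
        config.load_kube_config()
        v1 = client.CoreV1Api()
        pods = [p.metadata.name for p in v1.list_namespaced_pod(namespace).items]
        for i in range(0, len(pods), batch_size):
            for name in pods[i:i + batch_size]:
                v1.delete_namespaced_pod(name, namespace)
            time.sleep(pause_s)  # let new pods become ready before continuing

    roll_restart("revscoring-editquality-goodfaith")
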
[16:51:51] ok everything cleaned up
[16:52:22] at this point we can spot check over the weekend and kill pods if needed (hopefully not)
[16:52:31] but it is also fine if we don't do it
[16:53:09] going afk folks, hopefully we'll have a good weekend :)
[16:53:10] o/
[16:55:12] thank uu
[16:55:20] logging off as well o/
[16:56:11] o/ heading out as well
[17:53:25] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10ORES, 10MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), 10Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (10jsn.sherman) So that approach fixed one specific problem, but changing m...
[18:37:13] 10Machine-Learning-Team, 10Observability-Alerting: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (10RLazarus) We have some plans for SLO-based alerting in the pipeline, but nothing implemented yet. The summary is that @elukey is exactly right, as ever: we'll alert on error budget //burn rate/...
[18:40:00] (03PS1) 10Jsn.sherman: Don't use live configuration [extensions/ORES] - 10https://gerrit.wikimedia.org/r/957970 (https://phabricator.wikimedia.org/T345922)
[18:42:36] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10ORES, 10MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), and 2 others: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (10jsn.sherman)
[18:46:39] (03CR) 10Novem Linguae: Don't use live configuration (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/957970 (https://phabricator.wikimedia.org/T345922) (owner: 10Jsn.sherman)
[19:11:50] (03CR) 10Jsn.sherman: Don't use live configuration (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/957970 (https://phabricator.wikimedia.org/T345922) (owner: 10Jsn.sherman)