[07:25:36] klausman: o/ kart_ and I are trying to deploy the rec-api on staging but it's hanging when we run `helmfile -e ml-staging-codfw sync`
[07:25:51] `helmfile -e ml-staging-codfw diff` shows that the image is now coming from an internal endpoint, but we didn't make this change:
[07:25:51] ```
- image: "docker-registry.wikimedia.org/wikimedia/research-recommendation-api:2024-09-24-105455-production"
+ image: "docker-registry.discovery.wmnet/wikimedia/research-recommendation-api:2024-11-08-142328-production"
```
[07:30:48] here is the error log from kart_: https://pastebin.com/FJTK5TDN
[07:35:06] logstash: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.11.11?id=vScYGpMBPAEUXp-LY8nG
[07:35:40] kevinbazira: I'm here as well now :)
[07:37:05] kart_: nice! I've shared the logs in the communication right before you joined :)
[08:41:48] (PS1) Kevin Bazira: article-country: normalize score based on categories and properties [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1089646 (https://phabricator.wikimedia.org/T371897)
[08:49:58] ah, I already packed away my work laptop because I am flying to AMS this afternoon.
[08:51:24] elukey: would you be able to assist Kevin and Kartik?
[08:57:08] klausman: I can check later on, currently handling my kid screaming :D Hope it is not urgent
[08:57:36] elukey: no problem. please take your time :)
[08:57:47] klausman: travel safely!
[09:27:28] kevinbazira: so in theory the change shouldn't be a problem, the correct setting is the internal endpoint; other ML services use it as well
[09:27:58] in theory, if the deployment is hanging it probably means that the new pod failed to come up
[09:28:06] did you check the logs of the new pod while deploying?
[09:28:39] okok, logs show: `Readiness probe failed`
[09:29:46] yep this is a good start, but there should be more
[09:29:59] what I usually do is
[09:30:12] 1) create a tmux session on the deployment server, then start the deployment with helmfile
[09:30:34] 2) if it hangs, I just detach the tmux session and check the new pods
[09:30:51] kubectl get pods -n etc.. + kubectl describe pod etc.. + kubectl logs etc..
[09:31:34] pinging kart_: to follow the conversation as he'll be deploying rec-api
[09:32:54] kevinbazira: you can let kart_ retry the deployment and, while it hangs, you can check point 2)
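Put together, the debugging loop elukey describes above looks roughly like this (a minimal sketch using the commands and names that appear in this log; the tmux session name is arbitrary and extra flags may be needed depending on the pod layout):

```
# On the deployment server: run the deploy inside tmux so it can be detached
# while the new pod is inspected (session name is arbitrary).
tmux new -s rec-api-deploy
helmfile -e ml-staging-codfw sync        # hangs if the new pod never becomes Ready
# Detach with Ctrl-b d, then point kubectl at the right cluster/namespace:
kube_env recommendation-api-ng ml-staging-codfw
kubectl get pods                                                    # find the new pod
kubectl describe pod recommendation-api-ng-main-bb9f594c6-vs8pf     # events, probe failures
kubectl logs recommendation-api-ng-main-bb9f594c6-vs8pf             # add -c <container> if the pod has several containers
```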
[09:37:10] `kube_env recommendation-api-ng ml-staging-codfw` and `kubectl get pods` shows a pod from 20d ago.
[09:46:22] yep yep, you need to check while kart_ is deploying
[09:47:58] okok, I've pinged him on slack. will proceed when he's available ...
[09:48:28] * elukey bbl!
[10:51:50] kevinbazira: back! did you find the issue?
[10:52:01] otherwise we can try to redeploy and check together
[10:54:00] elukey: kart_ wasn't back yet, so we can do the deployment together and I'll update him
[10:54:58] good! Lemme try to deploy
[10:55:23] started!
[10:55:25] let's check
[10:56:06] so we have:
[10:56:07] recommendation-api-ng-main-bb9f594c6-vs8pf 1/2 Running 0 32s
[10:56:34] Just back. Sorry for the delay. I can watch here :)
[10:56:47] kart_: o/
[10:56:54] so from kubectl describe pod etc.. I see
[10:56:55] Readiness probe failed: Get "http://10.194.61.160:8080/docs": read tcp 10.192.48.174:54034->10.194.61.160:8080: read: connection reset by peer
[10:57:22] is it possible that the new image changes /docs in any way?
[10:58:03] Yes. I remember seeing similar errors for other services. Something changed in k8s recently?
[10:58:29] Wait. Different issue. Sorry.
[10:59:09] kart_: is the same image deployed on wmcloud? I see the docs work there: https://recommend.wmcloud.org/docs
[10:59:11] elukey: We have the new image running fine on wmcloud.
[10:59:47] kevinbazira: yes. I deployed the latest tag this morning.
[11:03:48] and does it listen on the same port? 8080?
[11:04:36] yes. same port. 8080.
[11:05:29] yep, just verified for a transient pod
[11:06:02] elukey: did we have any change in using the docker image from docker-registry.discovery.wmnet/wikimedia/research-recommendation-api:2024-11-08-142328-production instead of docker-registry.wikimedia.org/wikimedia/research-recommendation-api:2024-09-24-105455-production ?
[11:06:19] I saw that in the diff before deployment?
[11:08:39] shouldn't be a problem; from the logs I see this though:
[11:08:39] {"@timestamp":"2024-11-11T11:06:29.894Z","log.level":"error","message":"Could not fetch the list","ecs.version":"1.6.0","log":{"logger":"recommendation","origin":{"file":{"line":474,"name":"fetcher.py"},"function":"get_collection_pages"},"original":"Could not fetch the list"},"process":{"name":"MainProcess","pid":10,"thread":{"id":140447997997760,"name":"MainThread"}}}
[11:10:26] kart_: did you add http calls to services that don't go through the localhost envoy instance?
[11:16:22] I am asking since I see
[11:16:23] opt/lib/venv/lib/python3.11/site-packages/httpcore/_exceptions.py\", line 14, in map_exceptions\n raise to_exc(exc) from exc\nhttpcore.ConnectTimeout\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/app/recommendation/external_data/fetcher.py\", line 35, in get\n response = await httpx_client.get(\n
[11:16:33] I was checking for any errors on the wmcloud instance. Let me check with Nik if there were any such recent changes..
[11:16:39] kevinbazira: I just did `kubectl logs $pod etc..`
[11:17:03] kart_: you wouldn't see those in the wmcloud instance since most probably all the egress connections are not firewalled
[11:17:26] on k8s we allow only a subset of them, and most of the time we suggest going through the mesh/envoy localhost instance
[11:19:05] Right.
[11:19:31] okok, in the logs I see http://localhost:6500/w/api.php?action=query
[11:19:33] etc..
[11:19:41] basically the debug in https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+/refs/heads/master/recommendation/external_data/fetcher.py#31
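The distinction elukey is drawing can be made concrete with two curl calls from inside the pod (a rough sketch: port 6500 and the pod name come from the log above, while the Host header value, the container name, and curl being present in the image are assumptions):

```
POD=recommendation-api-ng-main-bb9f594c6-vs8pf   # pod name seen earlier in this log

# Through the mesh: the local envoy listener forwards MW API traffic, with the
# target wiki selected via the Host header (header value is an assumption).
kubectl exec "$POD" -c recommendation-api-ng-main -- \
  curl -s -H 'Host: meta.wikimedia.org' 'http://localhost:6500/w/api.php?action=query&format=json&meta=siteinfo'

# Direct to the public endpoint: fine on wmcloud, but on k8s this is the kind of
# egress call that can be firewalled and end in httpcore.ConnectTimeout.
kubectl exec "$POD" -c recommendation-api-ng-main -- \
  curl -s 'https://meta.wikimedia.org/w/api.php?action=query&format=json&meta=siteinfo'
```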
[11:21:57] I've asked Nik if he can join here.
[11:23:11] these are all the logs: https://phabricator.wikimedia.org/P70998
[11:24:25] Thanks.
[11:25:33] curl "https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=pagecollectionsmetadata&titles="
[11:25:36] {"batchcomplete":true,"warnings":{"query":{"warnings":"Unrecognized value for parameter \"list\": pagecollectionsmetadata"}}}
[11:26:18] thanks, elukey!
[11:26:37] maybe it is a new HTTP call to the mw-api that returns zero results
[11:26:55] if this helps, these are the changes that have been made since the last rec-api deployment:
[11:26:55] https://github.com/wikimedia/research-recommendation-api/compare/e917a66..01a5f5a
[11:31:29] `"list": "pagecollectionsmetadata",` was added in this change: https://github.com/wikimedia/research-recommendation-api/commit/810e0e44b6f25afe0f2ce506d936062a671a770e
[11:34:56] I think the issue may be that on wmcloud, without the local proxy, it works, while on k8s with the mesh settings it doesn't
[11:35:15] Yes.
[11:44:31] Checking with Nik if we can test the service behind a proxy as well.
[13:10:25] looks like the meta API call made here:
[13:10:25] https://github.com/wikimedia/research-recommendation-api/blob/7436e34a54c4b72c12617ab236de4506c9ace0d9/recommendation/external_data/fetcher.py#L456-L469
[13:10:25] returns data without the required pages, which ends up logging this error:
[13:10:26] https://github.com/wikimedia/research-recommendation-api/blob/7436e34a54c4b72c12617ab236de4506c9ace0d9/recommendation/external_data/fetcher.py#L474
[13:10:26] the same error appears in the logs shared by elukey: https://phabricator.wikimedia.org/P70998
[13:10:28] I simulated that API call outside the envoy proxy and indeed it doesn't return the required pages:
[13:10:28] https://meta.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=categorymembers&gcmlimit=max&gcmnamespace=0&gcmtitle=Category:Pages%20including%20a%20page%20collection&prop=info
[14:00:51] Shared this with Nik, who doesn't use IRC :) Trying to get him here.
[14:02:08] nice work kevinbazira!
[15:29:36] (PS1) Nik Gkountas: fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813
[15:32:01] (CR) Nik Gkountas: [C: +2] API Continue support [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[15:32:44] (Merged) jenkins-bot: API Continue support [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[15:43:05] (CR) Sbisson: [C: +2] fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813 (owner: Nik Gkountas)
[15:43:45] (Merged) jenkins-bot: fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813 (owner: Nik Gkountas)
[20:04:35] (CR) Sbisson: [C: +2] "I will merge this as is since it's already a big improvement but I will note in the task that we should do a little more before considerin" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088376 (https://phabricator.wikimedia.org/T379036) (owner: Nik Gkountas)
[20:05:16] (Merged) jenkins-bot: update cache in a single thread [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088376 (https://phabricator.wikimedia.org/T379036) (owner: Nik Gkountas)
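For reference, the 13:10 check kevinbazira describes can be reproduced from any machine with curl and jq (a sketch; the jq filter is just one way to see whether the generator returned any pages, not part of the original investigation):

```
# Query the categorymembers generator that the new pagecollections code relies on
# and check whether any pages come back; an empty or missing .query.pages matches
# the "Could not fetch the list" error seen in the pod logs.
curl -s 'https://meta.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=categorymembers&gcmlimit=max&gcmnamespace=0&gcmtitle=Category:Pages%20including%20a%20page%20collection&prop=info' \
  | jq '.query.pages // "no pages returned"'
```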