[07:25:36] klausman: o/ kart_ and I are trying to deploy the rec-api on staging but it's hanging when we run `helmfile -e ml-staging-codfw sync`
[07:25:51] `helmfile -e ml-staging-codfw diff` shows that the image is now coming from an internal endpoint, but we didn't make this change:
[07:25:51] ```
- image: "docker-registry.wikimedia.org/wikimedia/research-recommendation-api:2024-09-24-105455-production"
+ image: "docker-registry.discovery.wmnet/wikimedia/research-recommendation-api:2024-11-08-142328-production"
```
[07:30:48] here is the error log from kart_: https://pastebin.com/FJTK5TDN
[07:35:06] logstash: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.11.11?id=vScYGpMBPAEUXp-LY8nG
[07:35:40] kevinbazira: I'm here as well now :)
[07:37:05] kart_: nice! I've shared the logs in the communication right before you joined :)
[08:41:48] (PS1) Kevin Bazira: article-country: normalize score based on categories and properties [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1089646 (https://phabricator.wikimedia.org/T371897)
[08:49:58] ah, I already packed away my work laptop because I am flying to AMS this afternoon.
[08:51:24] elukey: would you be able to assist Kevin and Kartik?
[08:57:08] klausman: I can check later on, currently handling my kid screaming :D Hope it is not urgent
[08:57:36] elukey: no problem. please take your time :)
[08:57:47] klausman: travel safely!
[09:27:28] kevinbazira: so in theory the change shouldn't be a problem, the correct setting is the internal endpoint; other ML services use it as well
[09:27:58] in theory, if the deployment is hanging it probably means that the new pod failed to come up
[09:28:06] did you check the logs of the new pod while deploying?
[09:28:39] okok, logs show: `Readiness probe failed`
[09:29:46] yep this is a good start, but there should be more
[09:29:59] what I usually do is
[09:30:12] 1) create a tmux session on the deployment server, then start the deployment with helmfile
[09:30:34] 2) if it hangs, I just detach the tmux session and check the new pods
[09:30:51] kubectl get pods -n etc.. + kubectl describe pod etc.. + kubectl logs etc..
[09:31:34] pinging kart_: to follow the conversation as he'll be deploying rec-api
[09:32:54] kevinbazira: you can let kart_ retry the deployment and, while it hangs, you can check point 2)
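Put together, the debugging loop elukey describes above looks roughly like this (a minimal sketch using the commands and names that appear in this log; the tmux session name is arbitrary and extra flags may be needed depending on the pod layout):

```
# On the deployment server: run the deploy inside tmux so it can be detached
# while the new pod is inspected (session name is arbitrary).
tmux new -s rec-api-deploy
helmfile -e ml-staging-codfw sync        # hangs if the new pod never becomes Ready
# Detach with Ctrl-b d, then point kubectl at the right cluster/namespace:
kube_env recommendation-api-ng ml-staging-codfw
kubectl get pods                                                    # find the new pod
kubectl describe pod recommendation-api-ng-main-bb9f594c6-vs8pf     # events, probe failures
kubectl logs recommendation-api-ng-main-bb9f594c6-vs8pf             # add -c <container> if the pod has several containers
```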
[09:37:10] `kube_env recommendation-api-ng ml-staging-codfw` and `kubectl get pods` shows a pod from 20d ago.
[09:46:22] yep yep, you need to check while kart_ is deploying
[09:47:58] okok, I've pinged him on slack. will proceed when he's available ...
[09:48:28] * elukey bbl!
[10:51:50] kevinbazira: back! did you find the issue?
[10:52:01] otherwise we can try to redeploy and check together
[10:54:00] elukey: kart_ wasn't back yet, so we can do the deployment together and I'll update him
[10:54:58] good! Lemme try to deploy
[10:55:23] started!
[10:55:25] let's check
[10:56:06] so we have:
[10:56:07] recommendation-api-ng-main-bb9f594c6-vs8pf 1/2 Running 0 32s
[10:56:34] Just back. Sorry for the delay. I can watch here :)
[10:56:47] kart_: o/
[10:56:54] so from kubectl describe pod etc.. I see
[10:56:55] Readiness probe failed: Get "http://10.194.61.160:8080/docs": read tcp 10.192.48.174:54034->10.194.61.160:8080: read: connection reset by peer
[10:57:22] is it possible that the new image changes /docs in any way?
[10:58:03] Yes. I remember seeing similar errors for other services. Something changed in k8s recently?
[10:58:29] Wait. Different issue. Sorry.
[10:59:09] kart_: is the same image deployed on wmcloud? I see the docs work there: https://recommend.wmcloud.org/docs
[10:59:11] elukey: We have the new image running fine on wmcloud.
[10:59:47] kevinbazira: yes. I deployed the latest tag this morning.
[11:03:48] and does it listen on the same port? 8080?
[11:04:36] yes. same port. 8080.
[11:05:29] yep, just verified for a transient pod
[11:06:02] elukey: did we have any change in using the docker image from docker-registry.discovery.wmnet/wikimedia/research-recommendation-api:2024-11-08-142328-production instead of docker-registry.wikimedia.org/wikimedia/research-recommendation-api:2024-09-24-105455-production ?
[11:06:19] I saw that in the diff before deployment?
[11:08:39] shouldn't be a problem; from the logs I see this though:
[11:08:39] {"@timestamp":"2024-11-11T11:06:29.894Z","log.level":"error","message":"Could not fetch the list","ecs.version":"1.6.0","log":{"logger":"recommendation","origin":{"file":{"line":474,"name":"fetcher.py"},"function":"get_collection_pages"},"original":"Could not fetch the list"},"process":{"name":"MainProcess","pid":10,"thread":{"id":140447997997760,"name":"MainThread"}}}
[11:10:26] kart_: did you add http calls to services that don't go through the localhost envoy instance?
[11:16:22] I am asking since I see
[11:16:23] opt/lib/venv/lib/python3.11/site-packages/httpcore/_exceptions.py\", line 14, in map_exceptions\n raise to_exc(exc) from exc\nhttpcore.ConnectTimeout\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/app/recommendation/external_data/fetcher.py\", line 35, in get\n response = await httpx_client.get(\n
[11:16:33] I was checking for any errors on the wmcloud instance. Let me check with Nik if there were any such recent changes..
[11:16:39] kevinbazira: I just did `kubectl logs $pod etc..`
[11:17:03] kart_: you wouldn't see those in the wmcloud instance since most probably all the egress connections are not firewalled
[11:17:26] on k8s we allow only a subset of them, and most of the time we suggest going through the mesh/envoy localhost instance
[11:19:05] Right.
[11:19:31] okok, in the logs I see http://localhost:6500/w/api.php?action=query
[11:19:33] etc..
[11:19:41] basically the debug in https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+/refs/heads/master/recommendation/external_data/fetcher.py#31
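The distinction elukey is drawing can be made concrete with two curl calls from inside the pod (a rough sketch: port 6500 and the pod name come from the log above, while the Host header value, the container name, and curl being present in the image are assumptions):

```
POD=recommendation-api-ng-main-bb9f594c6-vs8pf   # pod name seen earlier in this log

# Through the mesh: the local envoy listener forwards MW API traffic, with the
# target wiki selected via the Host header (header value is an assumption).
kubectl exec "$POD" -c recommendation-api-ng-main -- \
  curl -s -H 'Host: meta.wikimedia.org' 'http://localhost:6500/w/api.php?action=query&format=json&meta=siteinfo'

# Direct to the public endpoint: fine on wmcloud, but on k8s this is the kind of
# egress call that can be firewalled and end in httpcore.ConnectTimeout.
kubectl exec "$POD" -c recommendation-api-ng-main -- \
  curl -s 'https://meta.wikimedia.org/w/api.php?action=query&format=json&meta=siteinfo'
```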
[11:21:57] I've asked Nik if he can join here.
[11:23:11] these are all the logs: https://phabricator.wikimedia.org/P70998
[11:24:25] Thanks.
[11:25:33] curl "https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=pagecollectionsmetadata&titles="
[11:25:36] {"batchcomplete":true,"warnings":{"query":{"warnings":"Unrecognized value for parameter \"list\": pagecollectionsmetadata"}}}
[11:26:18] thanks, elukey!
[11:26:37] maybe it is a new HTTP call to the mw-api that returns zero results
[11:26:55] if this helps, these are the changes that have been made since the last rec-api deployment:
[11:26:55] https://github.com/wikimedia/research-recommendation-api/compare/e917a66..01a5f5a
[11:31:29] `"list": "pagecollectionsmetadata",` was added in this change: https://github.com/wikimedia/research-recommendation-api/commit/810e0e44b6f25afe0f2ce506d936062a671a770e
[11:34:56] I think the issue may be that on wmcloud, without the local proxy, it works, while on k8s with the mesh settings it doesn't
[11:35:15] Yes.
[11:44:31] Checking with Nik if we can test the service behind a proxy as well.
[13:10:25] looks like the meta API call made here:
[13:10:25] https://github.com/wikimedia/research-recommendation-api/blob/7436e34a54c4b72c12617ab236de4506c9ace0d9/recommendation/external_data/fetcher.py#L456-L469
[13:10:25] returns data without the required pages, which ends up logging this error:
[13:10:26] https://github.com/wikimedia/research-recommendation-api/blob/7436e34a54c4b72c12617ab236de4506c9ace0d9/recommendation/external_data/fetcher.py#L474
[13:10:26] the same error appears in the logs shared by elukey: https://phabricator.wikimedia.org/P70998
[13:10:28] I simulated that API call outside the envoy proxy and indeed it doesn't return the required pages:
[13:10:28] https://meta.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=categorymembers&gcmlimit=max&gcmnamespace=0&gcmtitle=Category:Pages%20including%20a%20page%20collection&prop=info
[14:00:51] Shared this with Nik, who doesn't use IRC :) Trying to get him here.
[14:02:08] nice work kevinbazira!
[15:29:36] (PS1) Nik Gkountas: fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813
[15:32:01] (CR) Nik Gkountas: [C: +2] API Continue support [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[15:32:44] (Merged) jenkins-bot: API Continue support [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061713 (https://phabricator.wikimedia.org/T379037) (owner: Santhosh)
[15:43:05] (CR) Sbisson: [C: +2] fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813 (owner: Nik Gkountas)
[15:43:45] (Merged) jenkins-bot: fix UnboundLocalError in candidate_finders.py [research/recommendation-api] - https://gerrit.wikimedia.org/r/1089813 (owner: Nik Gkountas)
[20:04:35] (CR) Sbisson: [C: +2] "I will merge this as is since it's already a big improvement but I will note in the task that we should do a little more before considerin" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088376 (https://phabricator.wikimedia.org/T379036) (owner: Nik Gkountas)
[20:05:16] (Merged) jenkins-bot: update cache in a single thread [research/recommendation-api] - https://gerrit.wikimedia.org/r/1088376 (https://phabricator.wikimedia.org/T379036) (owner: Nik Gkountas)
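For reference, the 13:10 check kevinbazira describes can be reproduced from any machine with curl and jq (a sketch; the jq filter is just one way to see whether the generator returned any pages, not part of the original investigation):

```
# Query the categorymembers generator that the new pagecollections code relies on
# and check whether any pages come back; an empty or missing .query.pages matches
# the "Could not fetch the list" error seen in the pod logs.
curl -s 'https://meta.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=categorymembers&gcmlimit=max&gcmnamespace=0&gcmtitle=Category:Pages%20including%20a%20page%20collection&prop=info' \
  | jq '.query.pages // "no pages returned"'
```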