[07:10:19] elukey: Is there any way to test a simple curl command against the staging recommendation-api? [07:10:53] Also, it seems logs are restricted to root? I can't run: https://phabricator.wikimedia.org/P70998 [07:11:15] Neither: https://phabricator.wikimedia.org/P69386 kevinbazira :/ [07:12:49] kart_: o/ sorry about this experience with your first rec-api deployment. SREs are the ones who gave us access to run kube_env commands. [07:14:42] I unfortunately don't have the rights to grant you access to kube_env nor to run curl within the rec-api staging pod. [07:15:08] kevinbazira: no issue! There is always new learning in doing something new :) [08:24:56] kart_: o/ [08:25:39] re: curl command - I can try one for sure, but it needs to be done in a special way since the docker image doesn't carry the curl binary IIRC (so I need to use a tool like nsenter, which is root only) [08:26:28] ah but I may have misunderstood - do you mean curling the staging endpoint, or making a call from the new rec-api pod towards an endpoint? [08:27:57] re: logs, in theory you should be able to fetch them, you are in the deploy-ml-group afaics [08:28:32] have you tried kube_env recommendation-api-ng ml-staging-codfw and then kubectl get pods -n etc.. for example? What error do you get? [08:37:56] https://www.irccloud.com/pastebin/02EYm8MV/ [08:38:04] elukey: ^ [08:38:41] Can we run something like: `kubectl exec recommendation-api-ng-main-bb9f594c6-89lwb -- curl -vk http://localhost:6500/w/api.php` [08:39:20] This will check if the pod is reaching a certain internal API [08:50:27] kart_: something is weird in your bash config I think, can you try the following (on deploy2002) [08:50:30] /etc/profile.d/kube-conf.sh [08:50:36] /etc/profile.d/kube-env.sh [08:50:43] and then retry kube_env [08:51:21] for the exec - that works only if curl is installed as a dependency/package on the docker image; usually we don't do that for security reasons [08:55:39] ah. Was it an issue with screen when running kube_env?? 
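[Editor's note] Since the rec-api image carries no curl binary (hence the nsenter workaround above), a stdlib-only Python probe can stand in for the `kubectl exec ... curl` check, assuming the image ships python3 (likely for a Python service, but an assumption). The pod name and the localhost:6500 endpoint come from the channel; everything else here is a hypothetical sketch, e.g. run as `kubectl exec recommendation-api-ng-main-bb9f594c6-89lwb -- python3 probe.py`:

```python
# Hypothetical in-pod replacement for `curl -vk http://localhost:6500/w/api.php`
# when the container image ships python3 but no curl binary (an assumption).
import urllib.request
import urllib.error

def probe(url, timeout=5):
    """Return (http_status, error_message); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(200)  # drain a bit of the body to prove the service answers
            return resp.status, None
    except urllib.error.URLError as exc:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return None, str(exc.reason)

# Inside the pod one would call, as discussed in the channel:
#   probe("http://localhost:6500/w/api.php")
```

A `(200, None)` result corresponds to the successful curl check elukey later performed via nsenter; `(None, "...")` surfaces the transport error instead.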
[08:56:43] in theory no, tmux should support it, it is strange that it doesn't happen automatically [08:57:05] the /etc/profile.d should be global for all users [08:57:18] so unless you have a specific bash config that overrides it or similar, it should always work [08:57:40] anyway, we can check this later, I think we solved one problem :D [08:58:58] yes. My bash profile is simple: `screen -R` [08:59:08] ahhh [08:59:26] all right then it must be it, not entirely sure why but it may wipe all the global settings [09:00:21] Must be overriding. This is strange. I'll fix that later.. [09:04:01] kart_: I've also verified (via nsenter, on the k8s node) that localhost:6500 works fine (on the current pod) [09:05:24] but we can do it again when you deploy, just to be sure [09:05:49] Interesting that `$ helmfile -e ml-staging-codfw diff --context 5` still shows a diff, which means we never got to deploy the newer image. [09:06:14] elukey: sure. Thanks! [09:06:46] I'll retry deploying the newer image in some time and report any issues. Let's see how it goes. kevinbazira I'll sync up before that. [09:07:51] kart_: ping me before proceeding so I can test the curl command.. for the image, it makes sense that it still shows the old one, since if the readiness probe fails during a deployment, after 300s a rollback is kicked off automatically [09:08:20] thanks elukey! [09:08:21] kart_: sure sure ... 
[09:19:09] trying to deploy folks, I want to test one thing [09:19:57] kart_: localhost:6500 works on new pods, I verified it [09:23:48] also this is worth following up [09:23:49] log.error(f"Error response {exc.response.status_code} while requesting {exc.request.url!r}.") -> AttributeError: 'ConnectTimeout' object has no attribute 'response' [09:24:26] that should be https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+/refs/heads/master/recommendation/external_data/fetcher.py#63 [09:24:40] in theory a ConnectTimeout error should be caught and handled differently [09:24:53] and we could get better info [09:27:03] kart_: --^ [10:53:17] Noted. Let me add this to the task. [11:47:21] Added in: https://phabricator.wikimedia.org/T379592#10311833 [11:55:22] kart_: https://phabricator.wikimedia.org/T379592#10311878 [11:55:53] my suspicion is that in the new code we don't use localhost:6500 everywhere [12:00:48] aha. Thanks! [12:01:14] or better, it may be something else that issues an https:// call [12:01:30] from netstat I only see that we are targeting the frontend LBs and port 443 [12:06:13] o/ elukey thanks for the help, we are at the offsite this week and communication is a bit slow on our side [12:06:24] isaranto: o/ sure sure I figured :) [12:34:57] ml-etcd1002 will switch temporarily to DRBD to move to a new ganeti node, latencies might go up a bit [12:58:38] and completed [16:38:12] (PS1) KartikMistry: Update wmflabs > wmcloud in header [research/recommendation-api] - https://gerrit.wikimedia.org/r/1090514 [17:04:12] (CR) Sbisson: [C:+2] Update wmflabs > wmcloud in header [research/recommendation-api] - https://gerrit.wikimedia.org/r/1090514 (owner: KartikMistry) [17:04:51] (Merged) jenkins-bot: Update wmflabs > wmcloud in header [research/recommendation-api] - https://gerrit.wikimedia.org/r/1090514 (owner: KartikMistry) [20:58:20] Machine-Learning-Team, Patch-For-Review: Test the 
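[Editor's note] The AttributeError discussed above happens because the error logger in fetcher.py reads `exc.response` on every exception, while transport-level exceptions like ConnectTimeout carry only a `.request`. Assuming the service uses httpx (whose exception hierarchy matches the logged message), the fix is to handle status errors and transport errors separately. A stdlib-only sketch with stand-in classes, so the pattern runs without httpx installed:

```python
# Stand-ins mimicking httpx's exception hierarchy (an assumption based on
# the traceback): RequestError/ConnectTimeout carry only .request,
# HTTPStatusError carries both .request and .response.

class RequestError(Exception):
    """Stand-in for httpx.RequestError: transport failure, no .response."""
    def __init__(self, message, request):
        super().__init__(message)
        self.request = request

class ConnectTimeout(RequestError):
    """Stand-in for httpx.ConnectTimeout."""

class HTTPStatusError(Exception):
    """Stand-in for httpx.HTTPStatusError: 4xx/5xx reply, has .response."""
    def __init__(self, message, request, response):
        super().__init__(message)
        self.request = request
        self.response = response

def log_fetch_error(exc):
    """Build a log message, touching exc.response only when it exists."""
    if isinstance(exc, HTTPStatusError):
        return f"Error response {exc.response} while requesting {exc.request!r}."
    if isinstance(exc, RequestError):
        # Timeouts and connection errors: report the request and the cause,
        # instead of crashing on a missing .response attribute.
        return f"Transport error while requesting {exc.request!r}: {exc}"
    return f"Unexpected error: {exc}"
```

In the real fetcher this would translate into separate `except httpx.HTTPStatusError` and `except httpx.RequestError` clauses, which also yields the "better info" mentioned in the channel (the timeout cause rather than a crash in the logger).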
feasibility of deployment of Aya-23 model in LiftWing - https://phabricator.wikimedia.org/T379052#10314444 (isarantopoulos) [20:59:50] Machine-Learning-Team, Patch-For-Review: Test the feasibility of deployment of Aya-23 model in LiftWing - https://phabricator.wikimedia.org/T379052#10314465 (isarantopoulos) The aya-expanse-8B model has been deployed. Example request: ` curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/comp...