[03:10:50] <wikibugs>	 (03CR) 10KartikMistry: Extra logging for cache debugging (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098584 (owner: 10Sbisson)
[05:21:45] <wikibugs>	 (03PS1) 10Santhosh: entrypoint.sh: Remove poetry install [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098688
[05:22:14] <wikibugs>	 (03CR) 10Santhosh: [C:03+2] Use cx-server language pairs API v2 [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098555 (owner: 10Sbisson)
[05:23:54] <wikibugs>	 (03Merged) 10jenkins-bot: Use cx-server language pairs API v2 [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098555 (owner: 10Sbisson)
[05:24:35] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] entrypoint.sh: Remove poetry install [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098688 (owner: 10Santhosh)
[05:25:27] <wikibugs>	 (03Merged) 10jenkins-bot: entrypoint.sh: Remove poetry install [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098688 (owner: 10Santhosh)
[05:39:39] <wikibugs>	 (03CR) 10Santhosh: "This patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098684" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098584 (owner: 10Sbisson)
[05:39:49] <wikibugs>	 (03CR) 10Santhosh: [C:04-1] Extra logging for cache debugging [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098584 (owner: 10Sbisson)
[06:07:37] <kart_>	 I'm deploying rec-api but seems it is timing out. Can we increase limit for liveliness? As readiness probe seems shutting down the staging instance.
[06:26:27] <kart_>	 Can we increase 'timeout: 600' from helmfile.yaml?
[06:37:14] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10364540 (10kevinbazira)
[06:38:47] <wikibugs>	 (03PS1) 10Kevin Bazira: test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120)
[06:44:31] <kart_>	 kevinbazira: hi. What is the procedure of deployment if we change anything in helmfile.yaml? Seems it is not seen in the diff?
[06:50:03] <kevinbazira>	 kart_: hi o/
[06:50:40] <kart_>	 hola
[06:51:03] <kart_>	 kevinbazira: patch in question is: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098811
[07:03:06] <kart_>	 Started deployment with ^, let's see!
[07:13:52] <kart_>	 kartik@deploy2002:~$ kubectl get pods
[07:13:52] <kart_>	 NAME                                         READY   STATUS             RESTARTS        AGE
[07:13:52] <kart_>	 recommendation-api-ng-main-9fffcd85f-gzvqp   1/2     CrashLoopBackOff   6 (2m29s ago)   10m
[07:13:52] <kart_>	 again. Do we know why another healthy pod disappear?
[07:22:50] <kevinbazira>	 I checked on staging and it didn't start because the readiness probe failed: https://phabricator.wikimedia.org/P71311
[07:25:14] <kart_>	 Yes! It was timing out and rediness probe was killing. So, I increased timeout in helmfile.yaml but I guess that didn't help.
[07:26:19] <kevinbazira>	 both see to be running now:
[07:26:19] <kevinbazira>	 ```
[07:26:19] <kevinbazira>	 kevinbazira@deploy2002:~$ kubectl get pods
[07:26:19] <kevinbazira>	 NAME                                          READY   STATUS    RESTARTS   AGE
[07:26:19] <kevinbazira>	 recommendation-api-ng-main-66755847d5-zbwgz   2/2     Running   0          3m25s
[07:26:19] <kevinbazira>	 ```
[07:28:59] <kart_>	 That's rolled back version, right?
[07:33:08] <kart_>	 Any other way to debug why it is timing out to deploy?
[07:37:24] <kevinbazira>	 looking at the pod and its logs usually helps:
[07:37:24] <kevinbazira>	 ```
[07:37:24] <kevinbazira>	 $ kubectl describe pod recommendation-api-ng-main-9fffcd85f-gzvqp
[07:37:24] <kevinbazira>	 $ kubectl logs recommendation-api-ng-main-9fffcd85f-gzvqp -c recommendation-api-ng-main
[07:37:24] <kevinbazira>	 ```
[07:37:24] <kevinbazira>	 klausman whenever you get a minute please help kart with this issue. thanks!
[07:45:35] <kart_>	 Yes. logs and describe wasn't helpful in this case though :/
[08:21:23] <isaranto>	 hello folks!
[08:22:32] <isaranto>	 CI should show a diff if there is a change in the manifest that is produced
[08:22:54] <isaranto>	 it seems that here we had nothing https://integration.wikimedia.org/ci/job/helm-lint/21774/console (it is the link from this  patch  https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098811)
[08:36:15] <isaranto>	 lemme try sth and post back
[08:45:04] <isaranto>	 did you folks manage to deploy it ? I am trying on staging but helmfile sync  is taking too long
[08:49:52] <isaranto>	 after we get past that the readiness prob is another issue: iiuc the app is taking too long to start because it is warming up a cache, so we should increase the readiness prob to something that makes more sense (not 10s) - I filed a patch for this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098877
[09:01:24] <kart_>	 Sorry, I was in the meeting.
[09:01:56] <isaranto>	 no worries :)
[09:02:10] <kart_>	 Thanks isaranto
[09:02:12] <isaranto>	 this is the timeout I get when running helfmile sync https://phabricator.wikimedia.org/P71320
[09:02:30] <isaranto>	 are you getting the same or did you manage to deploy it before getting the other error?
[09:03:01] <kart_>	 No. I had same error.
[09:03:27] <kart_>	 So, we can +2 this change and try deploy?
[09:04:54] <kart_>	 isaranto: ^
[09:05:19] <isaranto>	 I'm trying to figure out why we can't sync , then we can try it
[09:05:28] <kart_>	 Sure
[09:08:09] <isaranto>	 regarding the value you changed previously (setting it from 600 to 1200) that is the timeout for the helm command so it doesn't affect the application but the deployment
[09:08:49] <isaranto>	 for example in our case it is the 600s in this command `helm3 upgrade --install --reset-values main wmf-stable/python-webapp --timeout 600s`
[09:10:19] <isaranto>	 to get more info about the deployment  you can use `helmfile --log-level=info -e ml-staging-codfw sync` - perhaps this has too much verbosity because it will output all the manifests
[09:10:20] <kart_>	 Noted!
[09:11:13] <isaranto>	 I can't figure it out let's merge the other patch and try again. Can you review this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098877
[09:11:30] <isaranto>	 what would a good timeout be in order to have time to warm up the cache?
[09:11:54] <isaranto>	 (that is if the cache is the reason it is taking too long)
[09:12:22] <kart_>	 A little over 5 mins. ~6 mins. That is what docker run docker-registry.wikimedia.org/wikimedia/research-recommendation-api:stable takes. <-- from Santhosh.
[09:12:42] <isaranto>	 ack
[09:13:47] <kart_>	 Should I try: `helmfile --log-level=info -e ml-staging-codfw sync` ?
[09:15:42] <isaranto>	 iirc that is the default logging level so you can even omit this
[09:16:08] <isaranto>	 I try helmfile diff and I don't see the timeout change in staging, only when I do diff for prod
[09:16:09] <kart_>	 OK!
[09:16:22] <kart_>	 Interesting!
[09:17:00] <kart_>	 Because we didn't change in staging?
[09:17:03] <isaranto>	 oh ofc, because staging has a different yaml
[09:18:16] <kart_>	 yes.
[09:19:32] <isaranto>	 ok here is the new patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098890
[09:23:40] <kart_>	 +2ed
[09:24:10] <isaranto>	 I am deploying to staging now...
[09:26:02] <kart_>	 Nice!
[09:29:44] <isaranto>	 helmfile sync is still taking too long on staging... :(
[09:29:56] <kart_>	 :/
[09:30:11] <isaranto>	 klausman: o/ do you have any idea what could be the issue? 
[09:30:42] <isaranto>	 the summary is that helmfile sync is hanging in ml-staging for rec-api
[09:30:44] <isaranto>	 `/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync`
[09:33:22] <kart_>	 Taking some break, but IRC is on :)
[09:37:56] <isaranto>	 👍
[09:45:21] <wikibugs>	 (03PS1) 10Santhosh: Parallelize API requests for Wikidata IDs and page titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895
[09:46:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Parallelize API requests for Wikidata IDs and page titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895 (owner: 10Santhosh)
[09:46:17] <wikibugs>	 (03PS2) 10Santhosh: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895
[09:46:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895 (owner: 10Santhosh)
[09:51:08] <wikibugs>	 (03PS3) 10Santhosh: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895
[10:18:17] <elukey>	 isaranto: what is the status of the pods while helmfile is syncing?
[10:18:36] <elukey>	 the hanging is usually the pod not coming up for some reason (namely not reaching Ready)
[10:18:55] <elukey>	 totally outside perspective but using a readiness probe of 600s indicates that something is really wrong :D
[10:19:28] <elukey>	 code-wise or elsewhere, it means that a pod can take up to 10 minutes to respond correctly to the first probe
[10:19:43] <elukey>	 during an emergency or a scale up this can bite us very badly
[10:20:05] <isaranto>	 ack, thanks for the valuable input!
[10:20:13] <elukey>	 it is ok in the case of loading huge models, but in this case not sure.. cache warming shouldn't take that much
[10:20:28] <isaranto>	 the pod is up and running in this case and helmfile sync is just stuck
[10:21:18] <isaranto>	 it seems that there was an attempt to deploy and then it failed cause I see the last pod has been up for 85m (but it is the previous version)
[10:21:21] <isaranto>	 lemme recheck
[10:22:21] <wikibugs>	 (03PS1) 10Kevin Bazira: article-country: sort results by score [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897)
[10:24:54] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10364906 (10isarantopoulos) >>! In T371344#10362864, @MunizaA wrote: > It looks you can override just invocations of `nvcc` or `hipcc` without overriding invocations of g++ or...
[10:26:47] <klausman>	 morning! reading backlog
[10:28:11] <isaranto>	 o/
[10:28:53] <elukey>	 isaranto: is it staging? If so please retry so we'll check live
[10:29:17] <isaranto>	 yes it is staging , I am retrying already and I am checking the events in teh ns
[10:29:20] <isaranto>	 *the ns
[10:30:17] <elukey>	 isaranto: the pod is not up
[10:30:17] <elukey>	 NAME                                          READY   STATUS    RESTARTS      AGE
[10:30:21] <elukey>	 recommendation-api-ng-main-648ffcf9f4-cw9qg   1/2     Running   2 (40s ago)   2m41s
[10:30:50] <isaranto>	 ok now I see some issues in the events
[10:31:09] <isaranto>	 this is the new pod - helmfile sync is still running
[10:31:49] <elukey>	 ERROR : recommendation : fetcher : get_articles_by_titles : Could not find dbname for wiki prefix toollabs
[10:32:00] <isaranto>	 if I cancel the command or it gets a timeout we rollback to the previous version
[10:32:17] <klausman>	 I have no obvious thing to add, besides that I agree with Luca that 10m readiness wait is already _very_ long. 
[10:34:18] <klausman>	 isaranto: when you canceled helm just now, how long had it been running?
[10:34:34] <isaranto>	 thanks for that error msg. I am able to see logs while the pod is trying to run
[10:34:56] <isaranto>	 iirc it will run until helm timeouts (600s) 
[10:36:15] <elukey>	 in general the pattern that I saw recently in the logs is that any exception fired in the python code causes uvicorn to crash, that is not great
[10:36:37] <elukey>	 the code should be probably resilient enough to log and progress (if something is not terribly wrong)
[10:36:49] <elukey>	 otherwise we'll keep seeing crash looping behaviors
[10:38:21] <isaranto>	 kart_: I pasted some pod logs here https://phabricator.wikimedia.org/P71322
[10:39:03] <isaranto>	 iiuc the readiness prob should be low, the app should start instantly , otherwise helmfile sync will hang as the pod will not be ready. is that correct?
[10:40:24] <klausman>	 It seems to me that during the `DEBUG : recommendation : fetcher : get` phase, the pod does not answer readiness probes at all, and so Helm/k8s/... assume it is not working
[10:41:19] <klausman>	 Is this cache warming? If so, the tool should probably just serve (slowly) until the cache is comepletely warmed, no?
[10:45:23] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365002 (10MunizaA) Looking at your paste, it seems like its loading hip from `/usr`: ` In file included from /usr/include/hip/hip_fp16.h:29: ` Can you run `hipconfig` to chec...
[10:46:04] <isaranto>	 iiuc from the latest commits it is a cache warming. I agree that the api should work in parallel
[10:46:43] <klausman>	 what URL/endpoint is used for healthchecking? Have we tried hitting it while helm is hanging?
[10:49:52] <elukey>	 isaranto: the app shouldn't start instantly, it is fine to have some bootstrap time and it is fine to change the readiness probe imho to values not more than 20/30s (and those already are a lot). The main issue with more is that scaling up will be really painful, same thing when deploying etc..
[10:50:18] <elukey>	 also per-host-cache is not a great recipe in my opinion
[10:50:32] <elukey>	 maybe they could use cassandra?
[10:50:57] <isaranto>	 understood  (and I agree on the cache per host)
[10:52:55] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365034 (10isarantopoulos) Thank Muniza! I have no idea how this was set (I don't see anything in my bash history). Unsetting the HIP_CLANG_PATH and setting `export HIP_PATH=/...
[11:02:51] <isaranto>	 cassandra would be the best option in this case. for now I have created patches to revert the readiness probs
[11:03:54] <isaranto>	 I'm not aware what kind of cache the team wants to have but it seems that the current setup is problematic
[11:08:00] <klausman>	 Agreed. Maybe loading a blob from S3 instead of using Cassandra would be a short-term option, similar to the blob research already gets there.
[11:09:47] <isaranto>	 since recapi is not an kserve-inference-service (doesnt have the storage-initializer) they would still have to setup fetching from swift 
[11:10:26] <klausman>	 Ah, right, there's that. 
[11:10:54] <isaranto>	 that said, I understand that this would be easier than setting up Cassandra usage
[11:16:16] <elukey>	 speaking of swift, I don't recall if we decided anything but https://phabricator.wikimedia.org/T279621 is completed and Data Persistence is looking for getting users
[11:16:43] <elukey>	 the apus cluster is the hopefully s3-like replacement for thanos swift, that we have been abusing :D
[11:16:57] <elukey>	 I'd suggest to follow up with data persistence to prioritize the ML binaries
[11:17:16] <elukey>	 the use case is small, and it would be nice to properly test it with some support from them
[11:17:50] <isaranto>	 we are most likely going to migrate from January
[11:18:33] <elukey>	 is data persistence aware?
[11:19:25] <klausman>	 Yes
[11:20:01] <klausman>	 I had a chat with Ben last week about assorted bits like Ceph homedirs for the ml-lab machines, and I mentioned us being interested in moving our S3 stuff off of Thanos-Swift
[11:24:43] <elukey>	 but that is not data persistence, it is data platform :)
[11:25:13] <elukey>	 you'd need to chat with Emperor (Matthew)
[11:25:39] <elukey>	 Data platform manages IIRC eqiad-only clusters, meanwhile apus is properly replicated in codfw too
[11:25:50] <elukey>	 it is ceph based with s3 api
[11:26:05] <elukey>	 klausman: --^
[11:26:42] <klausman>	 argh, too many dpe's
[11:29:31] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: 10Kevin Bazira)
[11:32:04] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: 10Kevin Bazira)
[11:32:47] <wikibugs>	 (03Merged) 10jenkins-bot: test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: 10Kevin Bazira)
[11:36:24] <kart_>	 isaranto, elukey, klausman If we look at, https://logstash.wikimedia.org/goto/4ee90bc2612a31b188eebd4477e8b8fa - app seems restarting at every one minute interval? 'Run `poetry install` to resolve and get rid of this message.' - msg is appearing when server starts.
[11:36:27] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[11:36:34] <wikibugs>	 (03PS2) 10Kevin Bazira: article-country: sort results by score [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897)
[11:37:28] <wikibugs>	 (03PS1) 10Kevin Bazira: test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120)
[11:37:51] <klausman>	 kart_: https://phabricator.wikimedia.org/P71332 It's part of this message
[11:38:24] <klausman>	 the every-minute-restart is likely unrelated
[11:38:55] <isaranto>	 more logs available here -> https://phabricator.wikimedia.org/P71322
[11:40:37] <kart_>	 so, it appears when server starts and that's repeating at every 1 minute interval (logstash is filtered for mlstaging2002 + this message = shows how restarts happens at every one minute)
[11:41:00] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] article-country: sort results by score [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[11:41:18] <klausman>	 Yes, it would be logged at every restart, I just meant that it itself is not the reason for the restart :)
[11:41:38] <kart_>	 yes :)
[11:42:50] <elukey>	 kart_: o/ another thing that may be worth following up is the fact that any exception that is not caught seems to cause uvicorn to completely crash. I may be wrong but I'd check if a catch-all is needed to prevent unnecessary failures
[11:43:28] <kart_>	 Q: Is it possible that it is being killed and restart because it isn't ready within readiness probe?
[11:43:47] <klausman>	 Yeah, usually we'd have something like a catchall that sends a 500 response and tries to carry on
[11:44:28] <wikibugs>	 (03Merged) 10jenkins-bot: article-country: sort results by score [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[11:44:47] <klausman>	 Currently trying to figure out the value/timeout for readiness probes
[11:45:21] <klausman>	     Readiness:  http-get http://:8080/docs delay=0s timeout=10s period=10s #success=1 #failure=3
[11:45:43] <wikibugs>	 (03PS2) 10Kevin Bazira: test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120)
[11:46:04] <klausman>	 So it does a probe every 10s, starting immediately, any single success is sufficient, after 3 failures the conntainer is considered dead
[11:46:29] <klausman>	 (this was the main container, the tls proxy container ahs the same settings)
[11:47:04] <klausman>	 (except it's /healthz instead of /docs)
[11:50:22] * isaranto afk for 10'
[11:59:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: 10Kevin Bazira)
[12:17:29] <kart_>	 klausman: any way to increase that for a test? say around 65 sec as we see server restarting at every one minute?
[12:17:55] <klausman>	 yeah, I think that can be done
[12:17:57] <klausman>	 lemme see
[12:21:03] <klausman>	 mmmh, the helmfile has extra diffs, let me pastebin it
[12:21:29] <klausman>	 https://phabricator.wikimedia.org/P71341
[12:22:07] <klausman>	 That failureThreshold should be 10 of course, but what about the entrypoint?
[12:23:21] <klausman>	 Ah , I guess that's part of the intended update?
[12:25:50] <klausman>	 82s         Warning   Unhealthy           pod/recommendation-api-ng-main-6bf5c46666-r2gd5    Liveness probe failed: dial tcp 10.194.61.169:8080: connect: connection refused
[12:26:38] <klausman>	 So it's not the readinessprobe that fails
[12:29:13] <kart_>	 Yes, that was the intended update.
[12:29:32] <klausman>	 The weird thing is that the pod has no liveness probes defined
[12:30:29] <klausman>	 (hitting ^c now to revert, since this is clearly broken differently)
[12:32:17] <kart_>	 hmm. I need to be AFK for sometime, but will check message while waiting at Dentist :)
[12:40:19] <klausman>	 ack, I am going to fetch lunch and some groceries. Maybe the fresh air will job some memories
[13:01:43] <wikibugs>	 (03PS4) 10Nik Gkountas: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895 (owner: 10Santhosh)
[13:22:49] <wikibugs>	 (03CR) 10Nik Gkountas: [C:03+2] Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895 (owner: 10Santhosh)
[13:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098895 (owner: 10Santhosh)
[14:01:11] <klausman>	 elukey: I have a question about templating you may have the answer for. If we look at charts/python-webapp/templates/vendor/app/generic_1.0.2.tpl and .../values.yaml there seems to be a default definition for the liveness check being on port 8080. However, if I use kubectl describe on the recapi-ng pods, there is no mention of liveness checks. I wonder what the default is for the
[14:01:13] <klausman>	 timeout/retries etc
[14:02:32] <klausman>	 aaargh. thanks describe for inlining things
[14:02:46] <klausman>	 Liveness:   tcp-socket :8080 delay=0s timeout=1s period=10s #success=1 #failure=3
[14:08:18] <klausman>	 kart_: isaranto tadaaa!
[14:08:57] <klausman>	 I bumped the liveness probe to 10x10s needed for failur, and now at least helm is satisfied. We'll see if it keeps running for more than100s :)
[14:19:17] <klausman>	 It's been running for 12m now with zero restarts
[14:29:33] <kart_>	 aha
[14:30:00] <kart_>	 Thanks klausman, Let me know when it is OK to proceed for production deployment.
[14:30:31] <klausman>	 we will have to decide whether my changes to the liveness probe are actually ok, or rather, what theresholds are right
[14:30:42] <klausman>	 also, the service changes (helm) aren't committed anywhere yet
[14:31:22] <klausman>	 https://phabricator.wikimedia.org/P71348 This is the change I live-pushed
[14:32:03] <klausman>	 I think the Threshold specifically should probably lower. 115s (15 initial, then 10 failed probes with an interval of 10s) is likely too long to be robust enough
[14:32:53] <klausman>	 I think the default is 0 initial delay and 3 probes with 10s interval.
[14:33:40] <klausman>	 We could leave the initial interval at 10, but require six failures at 10s spacing, that would make it a minute. Keep in mind that the langer we make it, the higher the likelihood that in a failure case that users will get errors or no response.
[14:34:46] <klausman>	 The liveness probe is for the steady state, readiness is for determining whether a new container is ready for traffic (cf. https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/)
[14:35:15] <klausman>	 (also note startup probes, which have yet different meaning)
[14:35:50] <klausman>	 Atm, the liveness probe also is a simple TCP connection check (not GET /docs or anything), so that is also not ideal.
[14:38:23] <klausman>	 But at least we now know that the probes, as configured by default from the python-webapp helm chart, are not suitable for rec-api-ng as-is.
[14:40:12] <klausman>	 Since I do not know what the UX side of the extension does with a timing-out service, it's hard for me to tell what the acceptable tradeoffs are.
[14:45:02] <kart_>	 From UX side, recommendation service suggests articles to users in the suggestion tab of ContentTranslation. We added collection feature which is customized set of articles to suggest to users. 
[14:47:43] <kart_>	 Are there any other services dealing with complex liveness probe? We can probably do some work on it too.
[14:49:03] <klausman>	 For us, usually we only need to tweak readiness probes since we download potentially large ML models.
[14:49:36] <klausman>	 I'd have to lookup the specifics in our kservice charts regarding probe thresholds etc.
[14:54:07] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365672 (10isarantopoulos) I think we need to configure rocm properly first:  after setting HIP_PATH=/opt/rocm I get this in hipconfig:  ` == hip-clang HSA_PATH         : /opt...
[15:01:03] <wikibugs>	 06Machine-Learning-Team, 10Recommendation-API, 10SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108 (10Nikerabbit) 03NEW
[15:01:26] <aiko>	 5 mins
[15:04:15] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365728 (10isarantopoulos) Alright I tried so many things in a disorganized way so now I tried to do things from start  This is what I am trying and it seems to be building  `...
[15:15:28] <elukey>	 klausman: sorry I was in meeting
[15:15:31] <elukey>	 *meetings
[15:16:07] <elukey>	 tweaking the liveness probe it is fine, weird that tcp socket connections don't happen in 1s
[15:16:12] <wikibugs>	 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): [M] Create the logo detection model card - https://phabricator.wikimedia.org/T370759#10365768 (10mfossati) 05Open→03In progress a:03mfossati
[15:16:23] <elukey>	 but maybe if the bootstrap is heavy it could happen, not sure
[15:16:46] <elukey>	 it feels to me that rec-api tries to do something heavy that should be reviewed
[15:34:45] <klausman>	 Agreed.
[16:18:46] <wikibugs>	 (03PS1) 10Sbisson: Populate the cache async on startup [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838)
[16:26:52] <isaranto>	 Going afk folks,cu tomorrow!
[16:28:11] <klausman>	 \o
[16:37:21] <wikibugs>	 (03CR) 10Nik Gkountas: [C:03+2] Populate the cache async on startup [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838) (owner: 10Sbisson)
[16:38:01] <wikibugs>	 (03Merged) 10jenkins-bot: Populate the cache async on startup [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838) (owner: 10Sbisson)
[16:55:43] <wikibugs>	 (03Abandoned) 10Sbisson: Extra logging for cache debugging [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1098584 (owner: 10Sbisson)
[22:05:39] <Woogler>	 Subject: Seeking Advice on Using ORES Without Affecting Wikipedia Revisions
[22:05:40] <Woogler>	 Hi everyone,
[22:05:40] <Woogler>	 I’m Sangwook Lee, a Ph.D. student in Computer Engineering at Virginia Tech.
[22:05:41] <Woogler>	 I’m working on a research project where I need to measure the quality of edits made by experimental participants on specific Wikipedia articles using ORES’s article quality prediction model. I’ve encountered a challenge: the ORES API seems to evaluate only revisions that are stored on Wikipedia and identified by a revision ID.
[22:05:41] <Woogler>	 To assess our participants’ edits, we’d need to save each edit as a new revision on Wikipedia, which would leave behind a revision history solely for experimental purposes. I believe this might not be ideal for the Wikipedia community.
[22:05:42] <Woogler>	 I’m reaching out to see if there’s a way to use the ORES model to evaluate edits without saving them to Wikipedia, thus avoiding any unintended impact. Any guidance or suggestions would be greatly appreciated!
[22:05:42] <Woogler>	 Thank you for your time.
[22:05:43] <Woogler>	 Best regards,
[22:05:43] <Woogler>	 Sangwook Lee
[22:14:26] <Woogler>	 I I tried emailing ml@wikimedia.org, but it appears that the email address doesn’t exist. So, I’m reaching out here.
[22:40:47] <Woogler>	 I would appreciate it if you could reply via chat or send a response to my email at sangwooklee@vt.edu.