[03:10:50] (CR) KartikMistry: Extra logging for cache debugging (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098584 (owner: Sbisson)
[05:21:45] (PS1) Santhosh: entrypoint.sh: Remove poetry install [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098688
[05:22:14] (CR) Santhosh: [C: +2] Use cx-server language pairs API v2 [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098555 (owner: Sbisson)
[05:23:54] (Merged) jenkins-bot: Use cx-server language pairs API v2 [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098555 (owner: Sbisson)
[05:24:35] (CR) KartikMistry: [C: +2] entrypoint.sh: Remove poetry install [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098688 (owner: Santhosh)
[05:25:27] (Merged) jenkins-bot: entrypoint.sh: Remove poetry install [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098688 (owner: Santhosh)
[05:39:39] (CR) Santhosh: "This patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098684" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098584 (owner: Sbisson)
[05:39:49] (CR) Santhosh: [C: -1] Extra logging for cache debugging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098584 (owner: Sbisson)
[06:07:37] I'm deploying rec-api but it seems to be timing out. Can we increase the limit for liveness? The readiness probe seems to be shutting down the staging instance.
[06:26:27] Can we increase 'timeout: 600' in helmfile.yaml?
[06:37:14] Machine-Learning-Team, Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10364540 (kevinbazira)
[06:38:47] (PS1) Kevin Bazira: test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120)
[06:44:31] kevinbazira: hi. What is the deployment procedure if we change anything in helmfile.yaml? It seems the change is not shown in the diff?
[06:50:03] kart_: hi o/
[06:50:40] hola
[06:51:03] kevinbazira: the patch in question is: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098811
[07:03:06] Started deployment with ^, let's see!
[07:13:52] kartik@deploy2002:~$ kubectl get pods
[07:13:52] NAME READY STATUS RESTARTS AGE
[07:13:52] recommendation-api-ng-main-9fffcd85f-gzvqp 1/2 CrashLoopBackOff 6 (2m29s ago) 10m
[07:13:52] again. Do we know why another healthy pod disappeared?
[07:22:50] I checked on staging and it didn't start because the readiness probe failed: https://phabricator.wikimedia.org/P71311
[07:25:14] Yes! It was timing out and the readiness probe was killing it. So, I increased the timeout in helmfile.yaml but I guess that didn't help.
[07:26:19] both seem to be running now:
[07:26:19] ```
[07:26:19] kevinbazira@deploy2002:~$ kubectl get pods
[07:26:19] NAME READY STATUS RESTARTS AGE
[07:26:19] recommendation-api-ng-main-66755847d5-zbwgz 2/2 Running 0 3m25s
[07:26:19] ```
[07:28:59] That's the rolled-back version, right?
[07:33:08] Any other way to debug why the deployment is timing out?
[07:37:24] looking at the pod and its logs usually helps:
[07:37:24] ```
[07:37:24] $ kubectl describe pod recommendation-api-ng-main-9fffcd85f-gzvqp
[07:37:24] $ kubectl logs recommendation-api-ng-main-9fffcd85f-gzvqp -c recommendation-api-ng-main
[07:37:24] ```
[07:37:24] klausman whenever you get a minute please help kart with this issue. thanks!
[07:45:35] Yes. logs and describe weren't helpful in this case though :/
[08:21:23] hello folks!
[08:22:32] CI should show a diff if there is a change in the manifest that is produced
[08:22:54] it seems that here we had nothing https://integration.wikimedia.org/ci/job/helm-lint/21774/console (it is the link from this patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098811)
[08:36:15] lemme try sth and post back
[08:45:04] did you folks manage to deploy it? I am trying on staging but helmfile sync is taking too long
[08:49:52] after we get past that, the readiness probe is another issue: iiuc the app is taking too long to start because it is warming up a cache, so we should increase the readiness probe to something that makes more sense (not 10s) - I filed a patch for this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098877
[09:01:24] Sorry, I was in a meeting.
[09:01:56] no worries :)
[09:02:10] Thanks isaranto
[09:02:12] this is the timeout I get when running helmfile sync https://phabricator.wikimedia.org/P71320
[09:02:30] are you getting the same or did you manage to deploy it before getting the other error?
[09:03:01] No. I had the same error.
[09:03:27] So, can we +2 this change and try to deploy?
[09:04:54] isaranto: ^
[09:05:19] I'm trying to figure out why we can't sync, then we can try it
[09:05:28] Sure
[09:08:09] regarding the value you changed previously (setting it from 600 to 1200): that is the timeout for the helm command, so it doesn't affect the application, only the deployment
[09:08:49] for example in our case it is the 600s in this command `helm3 upgrade --install --reset-values main wmf-stable/python-webapp --timeout 600s`
[09:10:19] to get more info about the deployment you can use `helmfile --log-level=info -e ml-staging-codfw sync` - perhaps this has too much verbosity because it will output all the manifests
[09:10:20] Noted!
[09:11:13] I can't figure it out; let's merge the other patch and try again. Can you review this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098877
[09:11:30] what would a good timeout be in order to have time to warm up the cache?
[09:11:54] (that is, if the cache is the reason it is taking too long)
[09:12:22] A little over 5 mins. ~6 mins. That is what `docker run docker-registry.wikimedia.org/wikimedia/research-recommendation-api:stable` takes. <-- from Santhosh.
[09:12:42] ack
[09:13:47] Should I try: `helmfile --log-level=info -e ml-staging-codfw sync` ?
[09:15:42] iirc that is the default logging level so you can even omit it
[09:16:08] I tried helmfile diff and I don't see the timeout change in staging, only when I do the diff for prod
[09:16:09] OK!
[09:16:22] Interesting!
[09:17:00] Because we didn't change it in staging?
[09:17:03] oh ofc, because staging has a different yaml
[09:18:16] yes.
[09:19:32] ok here is the new patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1098890
[09:23:40] +2ed
[09:24:10] I am deploying to staging now...
[09:26:02] Nice!
[09:29:44] helmfile sync is still taking too long on staging... :(
[09:29:56] :/
[09:30:11] klausman: o/ do you have any idea what could be the issue?
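For reference, the distinction drawn above matters for this incident: helm's `--timeout` only bounds how long the deploy command waits before rolling back, while the readiness probe decides how quickly Kubernetes marks the new pod as not Ready. A back-of-the-envelope sketch in Python (the ~6-minute warm-up figure is the `docker run` timing quoted above; the 0s/10s/3-failure readiness defaults are the chart values quoted later in the log) shows why raising the helm timeout alone could not fix the hang:

```python
# Rough numbers taken from the conversation; estimates, not chart measurements.
warmup_s = 6 * 60        # ~6 min cache warm-up observed with `docker run ... :stable`
helm_timeout_s = 600     # helm3 upgrade ... --timeout 600s

# Default readiness probe of the python-webapp chart (quoted later in the log):
# initialDelay=0s, period=10s, failureThreshold=3.
readiness_window_s = 0 + 10 * 3

print(f"cache warm-up: ~{warmup_s}s, readiness window: {readiness_window_s}s")
# -> the pod exhausts its readiness window long before the warm-up finishes,
#    so it never turns Ready; helmfile sync then waits out the helm timeout
#    (600s) and rolls back. Raising that timeout only delays the rollback.
print(f"helm waits up to {helm_timeout_s}s for Ready before rolling back")
```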
[09:30:42] the summary is that helmfile sync is hanging in ml-staging for rec-api
[09:30:44] `/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync`
[09:33:22] Taking a short break, but IRC is on :)
[09:37:56] 👍
[09:45:21] (PS1) Santhosh: Parallelize API requests for Wikidata IDs and page titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895
[09:46:01] (CR) CI reject: [V: -1] Parallelize API requests for Wikidata IDs and page titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895 (owner: Santhosh)
[09:46:17] (PS2) Santhosh: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895
[09:46:55] (CR) CI reject: [V: -1] Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895 (owner: Santhosh)
[09:51:08] (PS3) Santhosh: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895
[10:18:17] isaranto: what is the status of the pods while helmfile is syncing?
[10:18:36] the hanging is usually the pod not coming up for some reason (namely not reaching Ready)
[10:18:55] totally outside perspective but using a readiness probe of 600s indicates that something is really wrong :D
[10:19:28] code-wise or elsewhere, it means that a pod can take up to 10 minutes to respond correctly to the first probe
[10:19:43] during an emergency or a scale-up this can bite us very badly
[10:20:05] ack, thanks for the valuable input!
[10:20:13] it is ok in the case of loading huge models, but in this case I'm not sure... cache warming shouldn't take that long
[10:20:28] the pod is up and running in this case and helmfile sync is just stuck
[10:21:18] it seems that there was an attempt to deploy and then it failed, because I see the last pod has been up for 85m (but it is the previous version)
[10:21:21] lemme recheck
[10:22:21] (PS1) Kevin Bazira: article-country: sort results by score [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897)
[10:24:54] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10364906 (isarantopoulos) >>! In T371344#10362864, @MunizaA wrote: > It looks like you can override just invocations of `nvcc` or `hipcc` without overriding invocations of g++ or...
[10:26:47] morning! reading backlog
[10:28:11] o/
[10:28:53] isaranto: is it staging? If so please retry so we can check live
[10:29:17] yes, it is staging. I am retrying already and I am checking the events in the ns
[10:30:17] isaranto: the pod is not up
[10:30:17] NAME READY STATUS RESTARTS AGE
[10:30:21] recommendation-api-ng-main-648ffcf9f4-cw9qg 1/2 Running 2 (40s ago) 2m41s
[10:30:50] ok now I see some issues in the events
[10:31:09] this is the new pod - helmfile sync is still running
[10:31:49] ERROR : recommendation : fetcher : get_articles_by_titles : Could not find dbname for wiki prefix toollabs
[10:32:00] if I cancel the command or it gets a timeout we roll back to the previous version
[10:32:17] I have no obvious thing to add, besides that I agree with Luca that a 10m readiness wait is already _very_ long.
[10:34:18] isaranto: when you canceled helm just now, how long had it been running?
[10:34:34] thanks for that error msg. I am able to see logs while the pod is trying to run
[10:34:56] iirc it will run until helm times out (600s)
[10:36:15] in general the pattern that I saw recently in the logs is that any exception fired in the python code causes uvicorn to crash, that is not great
[10:36:37] the code should probably be resilient enough to log and progress (if something is not terribly wrong)
[10:36:49] otherwise we'll keep seeing crash-looping behaviors
[10:38:21] kart_: I pasted some pod logs here https://phabricator.wikimedia.org/P71322
[10:39:03] iiuc the readiness probe should be low, the app should start instantly, otherwise helmfile sync will hang as the pod will not be ready. Is that correct?
[10:40:24] It seems to me that during the `DEBUG : recommendation : fetcher : get` phase, the pod does not answer readiness probes at all, and so Helm/k8s/... assume it is not working
[10:41:19] Is this cache warming? If so, the tool should probably just serve (slowly) until the cache is completely warmed, no?
[10:45:23] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365002 (MunizaA) Looking at your paste, it seems like it's loading hip from `/usr`: ` In file included from /usr/include/hip/hip_fp16.h:29: ` Can you run `hipconfig` to chec...
[10:46:04] iiuc from the latest commits it is cache warming. I agree that the api should work in parallel
[10:46:43] what URL/endpoint is used for healthchecking? Have we tried hitting it while helm is hanging?
[10:49:52] isaranto: the app shouldn't start instantly, it is fine to have some bootstrap time and it is fine to change the readiness probe imho to values not more than 20/30s (and those already are a lot). The main issue with more is that scaling up will be really painful, same thing when deploying etc..
[10:50:18] also a per-host cache is not a great recipe in my opinion
[10:50:32] maybe they could use cassandra?
[10:50:57] understood (and I agree on the cache per host)
[10:52:55] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365034 (isarantopoulos) Thanks Muniza! I have no idea how this was set (I don't see anything in my bash history). Unsetting the HIP_CLANG_PATH and setting `export HIP_PATH=/...
[11:02:51] cassandra would be the best option in this case. for now I have created patches to revert the readiness probes
[11:03:54] I'm not sure what kind of cache the team wants to have, but it seems that the current setup is problematic
[11:08:00] Agreed. Maybe loading a blob from S3 instead of using Cassandra would be a short-term option, similar to the blob research already gets there.
[11:09:47] since recapi is not a kserve inference service (it doesn't have the storage-initializer) they would still have to set up fetching from swift
[11:10:26] Ah, right, there's that.
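On the "serve (slowly) until the cache is warmed" idea above — which is essentially what the "Populate the cache async on startup" patch merged later in this log does — here is a minimal sketch of the pattern. It assumes a FastAPI app (the chart probes `/docs`, FastAPI's default docs route) and a hypothetical `warm_cache()` coroutine; it illustrates the approach, it is not the actual rec-api code:

```python
import asyncio
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI

log = logging.getLogger("recommendation.cache")


async def warm_cache() -> None:
    """Hypothetical placeholder for the expensive cache-priming work."""


def _log_warmup_result(task: asyncio.Task) -> None:
    # Log and carry on instead of letting a warm-up failure kill the worker.
    if task.cancelled():
        return
    if task.exception():
        log.error("cache warm-up failed: %r", task.exception())
    else:
        log.info("cache warm-up finished")


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start warming in the background instead of blocking startup, so the app
    # binds :8080 right away and readiness/liveness probes can succeed.
    task = asyncio.create_task(warm_cache())
    task.add_done_callback(_log_warmup_result)
    yield
    task.cancel()


app = FastAPI(lifespan=lifespan)


@app.get("/healthz")
async def healthz():
    # Probes are answered even while the cache is still warming; requests that
    # need the cache can fall back to slower, uncached code paths meanwhile.
    return {"status": "ok"}
```

With this shape the probe thresholds can stay close to the chart defaults, which also addresses the scaling concern raised above.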
[11:10:54] that said, I understand that this would be easier than setting up Cassandra usage
[11:16:16] speaking of swift, I don't recall if we decided anything, but https://phabricator.wikimedia.org/T279621 is completed and Data Persistence is looking for users
[11:16:43] the apus cluster is, hopefully, the s3-like replacement for thanos swift, which we have been abusing :D
[11:16:57] I'd suggest following up with Data Persistence to prioritize the ML binaries
[11:17:16] the use case is small, and it would be nice to properly test it with some support from them
[11:17:50] we are most likely going to migrate from January
[11:18:33] is data persistence aware?
[11:19:25] Yes
[11:20:01] I had a chat with Ben last week about assorted bits like Ceph homedirs for the ml-lab machines, and I mentioned us being interested in moving our S3 stuff off of Thanos-Swift
[11:24:43] but that is not data persistence, it is data platform :)
[11:25:13] you'd need to chat with Emperor (Matthew)
[11:25:39] Data Platform manages, IIRC, eqiad-only clusters, while apus is properly replicated in codfw too
[11:25:50] it is ceph-based with an s3 api
[11:26:05] klausman: --^
[11:26:42] argh, too many DPEs
[11:29:31] (CR) Ilias Sarantopoulos: [C: +1] test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:32:04] (CR) Kevin Bazira: [C: +2] test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:32:47] (Merged) jenkins-bot: test: update revscoring predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098814 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:36:24] isaranto, elukey, klausman: If we look at https://logstash.wikimedia.org/goto/4ee90bc2612a31b188eebd4477e8b8fa - the app seems to be restarting at one-minute intervals? 'Run `poetry install` to resolve and get rid of this message.' - the msg appears when the server starts.
[11:36:27] (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:36:34] (PS2) Kevin Bazira: article-country: sort results by score [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897)
[11:37:28] (PS1) Kevin Bazira: test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120)
[11:37:51] kart_: https://phabricator.wikimedia.org/P71332 It's part of this message
[11:38:24] the every-minute restart is likely unrelated
[11:38:55] more logs available here -> https://phabricator.wikimedia.org/P71322
[11:40:37] so, it appears when the server starts, and it repeats at one-minute intervals (logstash is filtered for mlstaging2002 + this message = it shows how the restarts happen every minute)
[11:41:00] (CR) Kevin Bazira: [C: +2] article-country: sort results by score [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:41:18] Yes, it would be logged at every restart, I just meant that it itself is not the reason for the restart :)
[11:41:38] yes :)
[11:42:50] kart_: o/ another thing that may be worth following up on is the fact that any exception that is not caught seems to cause uvicorn to completely crash. I may be wrong but I'd check if a catch-all is needed to prevent unnecessary failures
[11:43:28] Q: Is it possible that it is being killed and restarted because it isn't ready within the readiness probe window?
[11:43:47] Yeah, usually we'd have something like a catchall that sends a 500 response and tries to carry on
[11:44:28] (Merged) jenkins-bot: article-country: sort results by score [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098901 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:44:47] Currently trying to figure out the value/timeout for readiness probes
[11:45:21] Readiness: http-get http://:8080/docs delay=0s timeout=10s period=10s #success=1 #failure=3
[11:45:43] (PS2) Kevin Bazira: test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120)
[11:46:04] So it does a probe every 10s, starting immediately, any single success is sufficient, after 3 failures the container is considered dead
[11:46:29] (this was the main container, the tls proxy container has the same settings)
[11:47:04] (except it's /healthz instead of /docs)
[11:50:22] * isaranto afk for 10'
[11:59:42] (CR) CI reject: [V: -1] test: update llm test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098919 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:17:29] klausman: any way to increase that for a test? say around 65 sec, as we see the server restarting every minute?
[12:17:55] yeah, I think that can be done
[12:17:57] lemme see
[12:21:03] mmmh, the helmfile has extra diffs, let me pastebin it
[12:21:29] https://phabricator.wikimedia.org/P71341
[12:22:07] That failureThreshold should be 10 of course, but what about the entrypoint?
[12:23:21] Ah, I guess that's part of the intended update?
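A minimal sketch of the catch-all described above, assuming a FastAPI/Starlette app (names here are illustrative, not the actual rec-api code): an unhandled exception is logged and turned into a 500 response, so the uvicorn worker keeps serving instead of crashing and tripping the probes.

```python
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

log = logging.getLogger("recommendation")
app = FastAPI()


@app.middleware("http")
async def catch_all_errors(request: Request, call_next):
    try:
        return await call_next(request)
    except Exception:
        # Log with traceback and keep the worker alive: the client gets a 500
        # and the pod does not crash-loop.
        log.exception("unhandled error while serving %s", request.url.path)
        return JSONResponse({"error": "internal server error"}, status_code=500)
```

Note that exceptions raised outside request handling (startup code, background tasks such as the cache warm-up) are not covered by request middleware and need their own try/except, which is likely where the crashes seen in the pod logs come from.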
[12:25:50] 82s Warning Unhealthy pod/recommendation-api-ng-main-6bf5c46666-r2gd5 Liveness probe failed: dial tcp 10.194.61.169:8080: connect: connection refused
[12:26:38] So it's not the readiness probe that fails
[12:29:13] Yes, that was the intended update.
[12:29:32] The weird thing is that the pod has no liveness probes defined
[12:30:29] (hitting ^c now to revert, since this is clearly broken differently)
[12:32:17] hmm. I need to be AFK for some time, but will check messages while waiting at the dentist :)
[12:40:19] ack, I am going to fetch lunch and some groceries. Maybe the fresh air will jog some memories
[13:01:43] (PS4) Nik Gkountas: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895 (owner: Santhosh)
[13:22:49] (CR) Nik Gkountas: [C: +2] Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895 (owner: Santhosh)
[13:23:28] (Merged) jenkins-bot: Parallelize API requests for get_articles_by_qids and get_articles_by_titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098895 (owner: Santhosh)
[14:01:11] elukey: I have a question about templating you may have the answer for. If we look at charts/python-webapp/templates/vendor/app/generic_1.0.2.tpl and .../values.yaml there seems to be a default definition for the liveness check being on port 8080. However, if I use kubectl describe on the recapi-ng pods, there is no mention of liveness checks. I wonder what the default is for the
[14:01:13] timeout/retries etc
[14:02:32] aaargh. thanks, describe, for inlining things
[14:02:46] Liveness: tcp-socket :8080 delay=0s timeout=1s period=10s #success=1 #failure=3
[14:08:18] kart_: isaranto tadaaa!
[14:08:57] I bumped the liveness probe to 10x10s needed for failure, and now at least helm is satisfied. We'll see if it keeps running for more than 100s :)
[14:19:17] It's been running for 12m now with zero restarts
[14:29:33] aha
[14:30:00] Thanks klausman. Let me know when it is OK to proceed with the production deployment.
[14:30:31] we will have to decide whether my changes to the liveness probe are actually ok, or rather, what thresholds are right
[14:30:42] also, the service changes (helm) aren't committed anywhere yet
[14:31:22] https://phabricator.wikimedia.org/P71348 This is the change I live-pushed
[14:32:03] I think the Threshold specifically should probably be lower. 115s (15 initial, then 10 failed probes with an interval of 10s) is likely too long to be robust enough
[14:32:53] I think the default is 0 initial delay and 3 probes with a 10s interval.
[14:33:40] We could leave the initial interval at 10, but require six failures at 10s spacing; that would make it a minute. Keep in mind that the longer we make it, the higher the likelihood that in a failure case users will get errors or no response.
[14:34:46] The liveness probe is for the steady state, readiness is for determining whether a new container is ready for traffic (cf. https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/)
[14:35:15] (also note startup probes, which have yet another meaning)
[14:35:50] Atm, the liveness probe also is a simple TCP connection check (not GET /docs or anything), so that is also not ideal.
[14:38:23] But at least we now know that the probes, as configured by default from the python-webapp helm chart, are not suitable for rec-api-ng as-is.
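For reference, the probe windows being discussed, spelled out with the numbers from the log (worst case, ignoring the per-probe `timeoutSeconds`):

```python
def probe_window_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Worst-case seconds before Kubernetes declares the probe failed."""
    return initial_delay_s + period_s * failure_threshold


print(probe_window_s(0, 10, 3))    # 30  -> chart default for both probes
print(probe_window_s(15, 10, 10))  # 115 -> the live-pushed liveness change (P71348)
print(probe_window_s(0, 10, 6))    # 60  -> the "six failures at 10s spacing" compromise
```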
[14:40:12] Since I do not know what the UX side of the extension does with a timing-out service, it's hard for me to tell what the acceptable tradeoffs are.
[14:45:02] From the UX side, the recommendation service suggests articles to users in the suggestion tab of ContentTranslation. We added a collection feature, which is a customized set of articles to suggest to users.
[14:47:43] Are there any other services dealing with complex liveness probes? We can probably do some work on it too.
[14:49:03] For us, usually we only need to tweak readiness probes since we download potentially large ML models.
[14:49:36] I'd have to look up the specifics in our kservice charts regarding probe thresholds etc.
[14:54:07] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365672 (isarantopoulos) I think we need to configure rocm properly first: after setting HIP_PATH=/opt/rocm I get this in hipconfig: ` == hip-clang HSA_PATH : /opt...
[15:01:03] Machine-Learning-Team, Recommendation-API, SRE-Access-Requests: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108 (Nikerabbit) NEW
[15:01:26] 5 mins
[15:04:15] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10365728 (isarantopoulos) Alright, I tried so many things in a disorganized way, so now I tried to do things from the start. This is what I am trying and it seems to be building `...
[15:15:28] klausman: sorry, I was in meetings
[15:16:07] tweaking the liveness probe is fine; weird that tcp socket connections don't happen in 1s
[15:16:12] Machine-Learning-Team, Structured-Data-Backlog (Current Work): [M] Create the logo detection model card - https://phabricator.wikimedia.org/T370759#10365768 (mfossati) Open→In progress a: mfossati
[15:16:23] but maybe if the bootstrap is heavy it could happen, not sure
[15:16:46] it feels to me that rec-api tries to do something heavy that should be reviewed
[15:34:45] Agreed.
[16:18:46] (PS1) Sbisson: Populate the cache async on startup [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838)
[16:26:52] Going afk folks, cu tomorrow!
[16:28:11] \o
[16:37:21] (CR) Nik Gkountas: [C: +2] Populate the cache async on startup [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[16:38:01] (Merged) jenkins-bot: Populate the cache async on startup [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098991 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[16:55:43] (Abandoned) Sbisson: Extra logging for cache debugging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098584 (owner: Sbisson)
[22:05:39] Subject: Seeking Advice on Using ORES Without Affecting Wikipedia Revisions
[22:05:40] Hi everyone,
[22:05:40] I’m Sangwook Lee, a Ph.D. student in Computer Engineering at Virginia Tech.
[22:05:41] I’m working on a research project where I need to measure the quality of edits made by experimental participants on specific Wikipedia articles using ORES’s article quality prediction model. I’ve encountered a challenge: the ORES API seems to evaluate only revisions that are stored on Wikipedia and identified by a revision ID.
[22:05:41] To assess our participants’ edits, we’d need to save each edit as a new revision on Wikipedia, which would leave behind a revision history solely for experimental purposes. I believe this might not be ideal for the Wikipedia community.
[22:05:42] I’m reaching out to see if there’s a way to use the ORES model to evaluate edits without saving them to Wikipedia, thus avoiding any unintended impact. Any guidance or suggestions would be greatly appreciated!
[22:05:42] Thank you for your time.
[22:05:43] Best regards,
[22:05:43] Sangwook Lee
[22:14:26] I tried emailing ml@wikimedia.org, but it appears that the email address doesn’t exist. So, I’m reaching out here.
[22:40:47] I would appreciate it if you could reply via chat or send a response to my email at sangwooklee@vt.edu.
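For context on the question above: both the legacy ores.wikimedia.org service and its Lift Wing successor score article quality by revision ID, so they can only evaluate content that already exists as a saved revision, which is exactly the constraint described. The sketch below only illustrates that interface (the model name, anonymous access, and User-Agent handling are assumptions based on the public Lift Wing documentation); it does not score unsaved text.

```python
import json
import urllib.request

# Hypothetical example: score an existing English Wikipedia revision by ID.
REV_ID = 123456789  # placeholder revision ID
URL = ("https://api.wikimedia.org/service/lw/inference/v1/"
       "models/enwiki-articlequality:predict")

req = urllib.request.Request(
    URL,
    data=json.dumps({"rev_id": REV_ID}).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # A descriptive User-Agent is expected when calling Wikimedia APIs.
        "User-Agent": "example-research-script/0.1 (contact: researcher@example.org)",
    },
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # article-quality prediction and class probabilities
```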