[06:52:32] Good morning folks!
[06:52:46] running an errand, be back in ~ 1h
[07:20:38] morning!
[07:37:49] hello folks!
[07:46:16] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955869/ to move ores-legacy to python-webapp
[07:46:51] the diff seems ok overall, we have also updated some modules in the chart after a suggestion from serviceops (so some configs etc.. do vary for that reason)
[08:03:18] deployed in staging, all working!
[08:04:28] going to roll it out to prod as well
[08:06:26] kevinbazira: o/
[08:06:59] we need to create the new namespace and puppet config for recommendation-api before you can deploy the helmfile.d config
[08:07:08] I'll work on it while you file the patch
[08:09:28] niiice
[08:12:51] all deployed, looks fine!
[08:29:51] I don't like 100% the recommendation-api-ng name, we are still in time to change it if we want
[08:30:08] but I really can't think of another name
[08:30:35] now I have to create the namespace on k8s, after that we'll have to stick with it :D
[08:31:01] recommendation-api-new2-test?
[08:31:02] Haha
[08:32:20] My mind is really stuck.. I am fine with the name as long as we can distinguish which repo/deployment we are talking about when we mention it
[08:32:22] I was about to suggest a Trump's catch phrase but I decided to stop it :D
[08:32:56] yeah ok -ng is fine
[08:33:07] I am going to create the settings
[08:37:38] *-ng is fine with me too, we've been using it in other places like the CI jobs, helm charts, etc :)
[08:53:32] ok so secrets etc.. created
[08:53:37] and filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955885/
[08:53:42] for the new namespace
[08:55:23] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/955887/
[08:55:26] for the puppet part
[08:55:49] please review when you have a moment, after that we should be ready
[08:56:02] reviewing now
[09:00:11] thanks!
[09:00:38] I'm not sure what the second patch does...any more info for education purposes?
[09:00:54] ah yes it is mostly for k8s configs
[09:01:12] basically for whoever can deploy our apps etc..
[09:01:15] for deployment permissions?
[09:01:34] deploy-ml-service is a posix group
[09:01:38] yes exactly
[09:01:50] plus there is also a puppet private config behind the scenes
[09:01:50] ack, thanks!
[09:02:04] to generate the helmfile private yamls on the deploy1002 node
[09:02:11] for example, to keep passwords etc..
[09:02:17] that we can reference in deployment-charts
[09:03:35] interesting to see how secrets are kept :)
[09:04:03] btw -ng is always a good suffix for an ML service :)
[09:04:04] https://en.wikipedia.org/wiki/Andrew_Ng
[09:04:24] ahahha yes
[09:04:45] lol that's a good one
[09:08:05] 😂
[09:10:45] syncing the new namespace to all clusters
[09:10:56] kevinbazira: you are free to send the helmfile.d change now
[09:11:18] ok, let me prepare it and send it in a bit!
[09:13:49] by checking at the SLO dashboards I see that the damaging model server error budget is gone already 😢 https://grafana.wikimedia.org/d/slo-Lift_Wing_Revscoring/lift-wing-revscoring-slo-s?orgId=1
[09:14:04] Machine-Learning-Team: Utilize ChatGPT for categorizing and extract metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (Hoi)
[09:14:42] sry, my mistake, I mean we started burning budget a lot
[09:18:13] Machine-Learning-Team: Utilize ChatGPT for categorizing and extractinb metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (Reedy)
[09:18:21] elukey: I've pushed the helmfile.d change here: https://gerrit.wikimedia.org/r/955018
[09:18:27] Machine-Learning-Team: Utilize ChatGPT for categorizing and extracting metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (Hoi)
[09:23:33] isaranto: yeah we can check what's happening in the logs, but the first round of the SLO try will be a disaster for sure :D
[09:24:25] yeah it is understood. it is also a shift in the way of thinking in a good direction!
[09:43:12] kevinbazira: I left some comments :)
[10:07:13] elukey: ack, I've pushed patchset 2
[10:22:46] kevinbazira: I don't recall exactly but what environment variables can we configure for swift in rec-api-ng?
[10:23:15] the user pass one needs to be set in the puppet private, I need to check that everything is good
[10:23:33] then we need to change the URL, this is something that we have to do in deployment-charts
[10:23:38] via the config etc.. bits
[10:25:46] elukey: these are the env vars we need to set: https://github.com/wikimedia/research-recommendation-api/blob/master/recommendation/api/types/related_articles/candidate_finder.py#L140-L145
[10:29:59] ack so we need to set SWIFT_AUTHURL in deployment-charts :)
[10:30:06] writing it in the code review
[10:32:04] done
[10:35:24] * elukey lunch!
[10:58:10] * isaranto lunch as well
[11:39:45] (PS1) AikoChou: test: add load test script and input for ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910
[11:41:46] * aiko lunch
[12:15:14] (CR) Elukey: [C: +1] test: add load test script and input for ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910 (owner: AikoChou)
[12:21:40] (CR) Ilias Sarantopoulos: [C: +1] "Looks great!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910 (owner: AikoChou)
[12:34:49] elukey: in order to deploy the api-gateway I do diff and sync for staging and production environments right?
[12:35:05] is that deployed to codfw and eqiad then? I want to try to deploy I think I have access
[12:40:06] yeah "staging" "codfw" "eqiad"
[12:40:28] no deployments whatsoever ongoing
[12:40:29] so +1
[12:40:35] maybe drop a line in #serviceops
[12:42:05] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) I am not 100% of the ores-legacy one, but https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1 The main difference with the others is that I had to factor in 2xx 3xx 4xx among the "g...
[12:44:51] ack, will do!
[12:52:55] elukey: I don't see anything different for `lw_inference_editquality_damaging` when I do a diff. Is it expected?
[12:53:08] I don't see anything that could change tbh
[12:53:31] isaranto: did you check with git log if your change is in?
[12:54:04] * isaranto sighs...
[12:54:25] because puppet needs to run to get it pulled
[12:54:28] lemme force it
[12:54:29] seems like I was impatient. sry for the hassle :)
[12:54:43] dont bother, I will w8
[12:56:53] isaranto: ready to go
[12:57:10] <3 thank uuu
[12:59:33] kevinbazira: one qs - what user did you use to test the swift code running on rec-api-ng? mlserve:prod?
[13:00:05] elukey: yes, SWIFT_USER="mlserve:prod"
[13:02:05] ok perfect, set up everything in puppet private
[13:03:53] great, thanks!
[13:04:48] left a nit but I think we are ready to go
[13:05:08] isaranto: if you want to check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955018 as well
[13:05:27] kevinbazira: I am 99% sure that it will not work at first, but we can refine as we go
[13:05:30] :)
[13:05:39] (not because of your code but we have surely forgotten something)
[13:06:41] no problem, we shall refine as we move along :)
[13:12:52] I've fixed the nit!
[13:14:29] Apologies for being sloppy...I never added the change for enwiktionary in the previous commit (I just had updated the commit msg) 🤦
[13:17:40] isaranto: is it only for reverted right?
[13:17:49] Machine-Learning-Team, Commons: Utilize ChatGPT for categorizing and extracting metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (Reedy)
[13:18:56] kevinbazira: let's try to deploy in staging!
[13:19:44] yes only reverted
[13:21:04] elukey: ack, going to merge now and deploy in a bit!
[13:32:49] in the deployment node now, the new change is reflecting. going to deploy on staging ...
[13:39:02] RuntimeError: Failed to get object from Swift: Auth GET failed: http://localhost:6022/auth/v1.0 404 Not Found
[13:42:12] great, everything works in the API-gateway now!
[13:42:23] nice!
[13:42:27] wooghoo!
[13:43:05] question: with the recent changes in our board what do we do with the completed tasks? Do we just resolve them and remove them? I'm asking since there is no completed column
[13:44:04] there should be in theory
[13:45:36] kevinbazira: did the helm deployment fail? (plus rollback)
[13:47:08] ok, I found it, it was a hidden column
[13:47:38] Because the work never ends. It is a deep lesson about life.
[13:48:32] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) I tried to review the exact metrics that we used in our SLO/SLI calculations, and something doesn't feel right. For istio we have the following recording rules: ` - record: destsvc_rev_ns_rc_rf:ist...
[13:50:22] haha
[13:50:47] there is no such thing as completed work :)
[13:51:24] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) Deployment settings for the recommendation-api-ng have been merged but when we try to deploy on staging we get: ` kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-ser...
[13:52:02] Machine-Learning-Team, Patch-For-Review: Remove traffic from old eswikibooks and eswikiquote deployments - https://phabricator.wikimedia.org/T345850 (isarantopoulos) a:isarantopoulos
[13:52:15] elukey: https://phabricator.wikimedia.org/T339890#9152478
[13:54:57] kevinbazira: yes see above, I checked the logs and I saw
[13:55:07] RuntimeError: Failed to get object from Swift: Auth GET failed: http://localhost:6022/auth/v1.0 404 Not Found
[13:58:06] alright, since http://localhost:6022/auth/v1.0 is failing, would https://thanos-swift.discovery.wmnet/auth/v1.0 work?
[13:58:23] not really, we wouldn't go through the local proxy
[13:58:24] Machine-Learning-Team, Patch-For-Review: Remove traffic from old eswikibooks and eswikiquote deployments - https://phabricator.wikimedia.org/T345850 (isarantopoulos) The following endpoints have been made accessible via the API- Gateway: ` - /lw/inference/v1/models/enwiktionary-reverted:predict - /lw/inf...
[13:58:38] there is probably something off with the swift fetch settings
[14:00:14] should I send you all the settings I was using including the password privately via dm just to be sure all settings match what we've set privately?
[14:01:31] for the pass it is fine the last 4 chars
[14:01:36] but yes let's do it!
[14:03:05] ok, sent!
[14:10:08] kevinbazira: everything checks out
[14:11:46] (PS1) Jsn.sherman: DO NOT MERGE - CI TEST [extensions/ORES] - https://gerrit.wikimedia.org/r/955947
[14:19:02] kevinbazira: where have you tested the code? On a stat10xx node?
[14:19:15] (also, did you use https etc..)
[14:19:24] just to rule out missing bits
[14:19:40] thumbor uses swift client, the only relevant thing that I found is https://github.com/wikimedia/operations-software-thumbor-plugins/commit/bdb2461f252bb9097f20b18caeafe40b30306fa1
[14:21:58] elukey: yes, I tested on the stat1008 node and used uri: https://thanos-swift.discovery.wmnet/auth/v1.0
[14:22:25] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (elukey) ` 2023-09-08 13:37:33,415 recommendation.api.types.related_articles.candidate_finder fetch_embedding():157 ERROR -- Failed to get object from Swift Traceback (most recent call last):...
[14:22:52] kevinbazira: do you still have the code on stat1008?
[14:28:09] sending it privately since it has all settings
[14:28:36] kevinbazira: you can tell me the path on the node, I copy it
[14:30:25] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman)
[14:30:46] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman)
[14:35:05] elukey: on stat1008 run `$ python3 /home/kevinbazira/test-rec-api-swift-settings.py`
[14:42:30] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Patch-For-Review: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) Looping in listed maintainer @Ladsgroup: I'm really not sure what to do about this, but it's blocking builds for #paget...
[14:42:39] isaranto: --^
[14:42:42] did you see this?
[14:49:42] kevinbazira: thanks! We can probably restart the investigation on monday
[14:49:56] yes just saw this and looking into it
[14:50:22] elukey: no problem. thank you for your help today!
[14:50:35] np1
[14:50:36] !
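(editor's note) The Swift thread above boils down to comparing the auth settings used on stat1008 with the ones the pod receives. A minimal sketch of that comparison, under stated assumptions: `SWIFT_AUTHURL` and `SWIFT_USER` are the names mentioned in the channel, while `SWIFT_SECRET_KEY`, the function names, and the dummy secret are hypothetical and not taken from the recommendation-api code. It only builds the Swift v1.0 auth request (a GET carrying `X-Auth-User` and `X-Auth-Key`) and never sends it; the redaction keeps the last 4 characters of the key, as suggested in the chat.

```python
import os

# SWIFT_AUTHURL and SWIFT_USER appear in the chat; SWIFT_SECRET_KEY is a
# hypothetical name for the password variable (assumption, not from the repo).
REQUIRED = ("SWIFT_AUTHURL", "SWIFT_USER", "SWIFT_SECRET_KEY")


def swift_v1_auth_request(env=None):
    """Return (url, headers) for a Swift v1.0 auth GET, without sending it."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing Swift settings: " + ", ".join(missing))
    headers = {
        "X-Auth-User": env["SWIFT_USER"],
        "X-Auth-Key": env["SWIFT_SECRET_KEY"],
    }
    return env["SWIFT_AUTHURL"], headers


def redact(headers):
    """Copy the headers, masking all but the last 4 characters of the key."""
    key = headers["X-Auth-Key"]
    masked = "*" * max(len(key) - 4, 0) + key[-4:]
    return {**headers, "X-Auth-Key": masked}


if __name__ == "__main__":
    # Dummy values for illustration; the secret is made up.
    example = {
        "SWIFT_AUTHURL": "http://localhost:6022/auth/v1.0",
        "SWIFT_USER": "mlserve:prod",
        "SWIFT_SECRET_KEY": "dummy-secret-1234",
    }
    url, headers = swift_v1_auth_request(example)
    print(url, redact(headers))
```

Running the same helper with the stat1008 settings and with the pod's environment would make a URL or user mismatch visible without pasting the full password into the channel.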
[14:51:24] going afk for 30' will be back to fix the extension CI - although we never encountered this ¯\_(ツ)_/¯
[14:52:09] (PS1) Ladsgroup: Avoid hard-coding non-deterministic revision id in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/955959 (https://phabricator.wikimedia.org/T345922)
[14:56:10] (CR) Ladsgroup: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/955959 (https://phabricator.wikimedia.org/T345922) (owner: Ladsgroup)
[15:00:18] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Patch-For-Review: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (Ladsgroup) I made a patch that according to my localhost fixes the issue. Just want to say while I'm one of the authors of that exte...
[15:00:48] (PS2) AikoChou: test: add load test script and input for ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910
[15:03:19] (CR) AikoChou: test: add load test script and input for ores-legacy (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910 (owner: AikoChou)
[15:08:00] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Patch-For-Review: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) >>! In T345922#9152643, @Ladsgroup wrote: > I made a patch that according to my localhost fixes the issue. Just want to...
[15:13:35] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Patch-For-Review: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (Ladsgroup) No worries! The patch is straightforward and jenkins is green. Feel free to +2 it to unblock your team's work.
[15:15:17] (CR) Jsn.sherman: "thanks for the quick patch!" [extensions/ORES] - https://gerrit.wikimedia.org/r/955959 (https://phabricator.wikimedia.org/T345922) (owner: Ladsgroup)
[15:15:31] (CR) Jsn.sherman: [C: +2] Avoid hard-coding non-deterministic revision id in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/955959 (https://phabricator.wikimedia.org/T345922) (owner: Ladsgroup)
[15:16:04] revert-risk autoscaling changes deployed!
[15:17:08] nice!
[15:17:16] I am going afk for the weekend folks!
[15:17:17] have a nice one
[15:18:49] have a great weekend Luca! :)
[15:20:19] (Merged) jenkins-bot: Avoid hard-coding non-deterministic revision id in tests [extensions/ORES] - https://gerrit.wikimedia.org/r/955959 (https://phabricator.wikimedia.org/T345922) (owner: Ladsgroup)
[15:23:42] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (MGerlach) weekly update: * no update
[15:39:46] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, Patch-For-Review: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) Thanks again for getting us unstuck; this totally solves the issue in ci and in a vanilla mediawiki docker environment....
[15:43:05] (CR) AikoChou: [C: +2] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[16:00:12] Amir1: thanks for resolving the above issue
[16:07:38] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), and 2 others: ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) a:jsn.sherman
[16:33:40] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/955910 (owner: AikoChou)
[16:42:49] Going afk, have a nice weekend folks!
[16:44:23] bye Ilias! :)
[16:52:13] load test result for ores-legacy https://phabricator.wikimedia.org/P52344
[17:00:38] the result looks not bad, especially since the test requests contain multiple models/revids which means one request actually generates multiple calls to liftwing
[17:02:04] logging off as well!
[17:13:53] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman)
[17:14:44] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman)
[17:28:18] Machine-Learning-Team, MediaWiki-extensions-ORES, ORES, MW-1.41-notes (1.41.0-wmf.26; 2023-09-12), Moderator-Tools-Team (Kanban): ORES Extension master branch is failing tests - https://phabricator.wikimedia.org/T345922 (jsn.sherman) My next step will be to define the actual default config in...