[06:18:32] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES: Add revertrisk-multilingual to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (10isarantopoulos) [07:01:46] hello folks [07:04:52] morning elukey o/ [07:04:53] hope you're feeling better today! [07:15:44] yes yes thanks :) [07:18:31] great! [07:18:32] I tried testing the workers on staging in vain. running `helmfile -e ml-staging-codfw sync` throws this error: [07:18:32] Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set ... json: cannot unmarshal number into Go struct field Container.spec.template.spec.containers.args of type string [07:19:30] the `helmfile -e ml-staging-codfw diff` shows that there's a change to `chart: python-webapp-0.0.3`. This was not part of the patch I pushed. Hope it's not the one affecting the sync. [07:19:51] isaranto, klausman: o/ re: https://gerrit.wikimedia.org/r/c/operations/alerts/+/962056 - the 1 hour time window may lead to false positives, I checked in thanos the past weeks of data and sometimes we had peaks that auto-resolved.. Anyway, we can keep it in this way and refine. The thing that is missing is the runbook - we need to have one under our namespace in Wikitech that explains the problem, [07:19:57] and how to fix it [07:20:07] so everybody in the team can start working on it anytime [07:20:51] kevinbazira: lemme check [07:24:26] kevinbazira: ah no wait the error explains it well [07:24:38] so it says that the issue is in spec->containers->args [07:24:43] that is what we are changing [07:25:07] so it wants a string rather than a number [07:25:47] mmm so it wants things quoted [07:25:57] not 2 but say "2" [07:26:06] you already added it in the values.yaml [07:26:16] but the app template may not quote things [07:29:18] kevinbazira: so the "app" stuff is rendered in the deployment-charts -> modules -> app -> generic_1.0.0.tpl [07:29:32] we copy that file automatically in the python-webapp chart [07:30:08] (deployment-charts -> python-webapp -> templates -> vendor -> app) [07:30:18] and it does this [07:30:19] args: {{- range .Values.app.args }} - {{ . }} {{- end }} [07:30:52] my suspicion is that we are the first ones to add numbers :D [07:32:07] thanks alot for this breakdown elukey :) [07:32:08] so now that we've written the args as required but the app is not taking them should we just create a new image with the entrypoint we want and avoid the args override? [07:32:44] kevinbazira: we can do it, but we could also change the module to add the quotes, so we can modify args as we want in the future (we may need it) [07:32:57] I can send a quick patch, but we'll need serviceops' review before proceeding [07:33:11] ok, no problem. thanks! [07:38:09] kevinbazira: so the patch should be https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/963944 [07:38:17] hopefully it will be accepted [07:39:08] lemme also add the changes for python-webapp [07:39:47] super! [07:42:40] kevinbazira: so https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/963945 [07:42:48] if you see the CI's diff, the args are now quoted [07:44:03] ok so let's wait for serviceops' approva [07:44:06] *approval [07:44:49] great. I've seen the args and they are quoted: https://integration.wikimedia.org/ci/job/helm-lint/13256/console [07:47:59] kevinbazira: lemme know if you have questions about the modules etc.. [07:48:19] to update and manage them we use a tool called "sextant", more info in https://gitlab.wikimedia.org/repos/sre/sextant [07:50:09] sure I will. I've worked with "sextant" before we ended up going with the python-webapp chart. not an expert on it yet. :) [07:50:41] o/ [07:50:54] isaranto: o/ [07:51:22] elukey: I'll create some runbooks! Regarding the 1h window what do u think would be a better one? [07:51:53] tbh I'd like a way to test alerts before they are deployed but haven't found sth similar on our stack [07:52:04] isaranto: let's leave 1h, in the future we can increase it if needed, just mention it in the runbook so we don't spend hours and hours if something auto-resolve (namely if changeprop is slow by itself etc..) [07:52:17] checking alerts is difficult yes [07:54:49] isaranto: I was wondering what to do, is it ok if I open a task to start importing the batcher docker image and config for kserve? [07:55:16] it should be a quick one, production-images + deployment-charts [07:55:27] then you and Aiko will be able to use it anytime [08:02:18] ah wow the batcher uses the agent image [08:02:22] and we already have support for it [08:02:29] {{done}} [08:02:29] :D [08:03:36] Hehe [08:03:44] Auto resolved! [08:03:58] wow: https://github.com/wikimedia/research-recommendation-api/blob/a9022b7e49f9793021cbe42da523ea478dacf431/recommendation/api/external_data/wikidata.py#L105 [08:04:03] multi process? Really? [08:08:29] in theory that thing makes http requests eventually right? Isn't multi-threading ok? [08:09:09] We can overscale a ton rec-api-ng but it is a waste of resources, and I'd love to keep them as much as possible for autoscaling of isvcs [08:10:57] 10Machine-Learning-Team: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10elukey) @achou as FYI our Kserve Docker images + deployment charts config should already support the kserve batcher, so you are free to test/enable it anytime :) [08:13:16] (03Abandoned) 10Elukey: WIP - Add read-only cache support to ores-legacy [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/922561 (https://phabricator.wikimedia.org/T337287) (owner: 10Elukey) [08:17:38] isaranto: I removed some private config for ores-legacy, added when I was testing the redis support.. going to deploy the update (basically it removes a configmap + env vars) [08:17:41] is it ok? [08:18:11] (s/ConfigMap/Secret) [08:18:32] ofc! [08:19:22] isaranto: do you mind, for safety, to double check the diff for ml-serve-codfw and +1 it? [08:19:30] (helmfile diff I mena) [08:19:31] *mean [08:19:36] already deployed to staging [08:21:47] on it [08:21:54] +1 on which patch though? [08:22:10] nono sorry since it is a change in puppet private I mentioned helmfile diff [08:22:23] for say ml-serve-codfw [08:22:33] ack [08:23:33] yep everything seems ok, just some redis stuff gone + checksum of secrets changed [08:26:53] done! [09:01:31] (03PS1) 10Ilias Sarantopoulos: revscoring: fix WIKI_URL validation [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963959 [09:02:49] (03PS2) 10Ilias Sarantopoulos: revscoring: fix WIKI_URL validation [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963959 [09:03:43] I encountered the above issue while testing multiprocessing in revscoring. probably worth fixing in other servers as well [09:19:27] would runbooks better go under https://wikitech.wikimedia.org/wiki/Machine_Learning/Runbooks? [09:20:56] (03CR) 10Elukey: [C: 03+1] revscoring: fix WIKI_URL validation (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963959 (owner: 10Ilias Sarantopoulos) [09:21:51] isaranto: not sure what's best, I like /LiftWing/Alerts, but we can do anything [09:23:00] for the sha512 checksum I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/963956 [09:23:21] the idea is to reserve /srv/published/wmf-ml-models only for ml-admins on every stat100x node [09:23:29] so we are the only ones able to publish model binaries [09:23:42] ack [09:24:39] also I agree with LiftWing/Alerts. Although the first alert is not directly about LiftWing but I'd rather have it over there [09:43:20] elukey: re: permissions, I think we want 0775 (r-x for other, otherwise people can't cd into the directory (or its subdirectories) [09:44:25] klausman: right right sorry [09:44:35] I don't think that people will have to do it [09:44:44] but I am ok to have it [09:45:22] klausman: amended [09:45:27] <3 [09:46:34] after that I'll add all the sha512 files [09:46:39] Roger! [09:46:46] and then support for it in the model-upload script [09:46:55] What a lovely Yak :) [09:46:58] does it make sense? Or do we want a separate script for uploading? [09:47:12] Yak?? :) [09:47:45] Yak shaving :) [09:48:13] ah TIL [09:48:14] > Any apparently useless activity which, by allowing one to overcome intermediate difficulties, allows one to solve a larger problem. [09:48:52] I don't have goals this quarter! :P [09:48:56] before the upload can be done properly, you have to change the perms, and for that, the group must be there [09:49:11] Yeah , we'll find something for you, I am sure ;) [09:49:32] klausman: I don't get the group comment, what do you mean? [09:50:13] I was explaining the steps that made me make the yak shaving comments [09:51:34] ah okok :) [09:51:56] in general we should think more about security steps [09:52:00] I feel it is the time [09:52:14] yeah, agreed [09:53:13] klausman: not sure if you saw it but I added a step in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_a_model [09:53:25] to explicitly ask in phab the sha512 of the file etc.. [09:53:35] so at least we have a trace of the binary handover [09:54:05] elukey@stat1004:~$ ls -ld /srv/published/wmf-ml-models/ [09:54:05] drwxrwxr-x 2 root ml-team-admins 4096 Oct 6 09:52 /srv/published/wmf-ml-models/ [09:54:08] all good [09:54:20] Sounds good, re: sha256 [09:54:37] I'll add a note for the team in slack [10:27:12] ok uploaded all the shas, let's see when it syncs how it goes [10:29:21] 10Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (10elukey) The original upload task was T334111. All the stat boxes now have a directory called `/srv/published/wmf-ml-models` that can be writable only by members of the posix... [10:30:42] https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/language-agnostic/20221026144108/ [10:32:36] to avoid reloading all model binaries I assume good faith, nothing has been changed since when Aiko uploaded the files [10:32:59] I quickly skimmed last-modified of the files and except some legit change, all good [10:33:18] last bit is to add support for uploading the models [10:34:40] * elukey lunch! [10:35:30] same [11:25:20] * isaranto lunch for me as well! [13:27:30] how's the wording `When ORESFetchScoreJobKafkaLag alert fires it means that there is a lag between messages landing in Kafka topics and message consumption rate from Changeprop.`? [13:27:42] just asking to see if I am misunderstanding anything [13:29:39] +1 [13:29:54] one thing to add could be that CP is probably retrying multiple times [13:30:02] and Lift Wing is returning errors [13:31:10] ack [13:31:11] Cofeeee [13:51:28] first version ! https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts [13:53:07] lemme know if sth isn't clear or it is wrong and ofc feel free to change the page as you think it is better [13:54:11] kevinbazira: rec-api-ng running in staging :) [13:54:59] isaranto: really nice +1! [13:55:13] a, I forgot to mention the thing about 1h [13:57:55] isaranto: I created the TOC for the page, we can use https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert [13:58:10] with multiple alerts it is easier to get to the right bit [13:58:25] noice! [13:58:36] adding the memory alert as well [14:04:13] done! [14:04:24] patches updated with the link to the runbooks [14:05:29] Nice runbook entries, those are always worth their weight in gold. [14:06:22] (03PS3) 10Ilias Sarantopoulos: revscoring: fix WIKI_URL validation [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963959 [14:06:37] (03CR) 10Ilias Sarantopoulos: revscoring: fix WIKI_URL validation (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963959 (owner: 10Ilias Sarantopoulos) [14:19:03] Let's hope we don't need them that much! :D [15:16:43] klausman: tried to improve the k8s high latency alert - https://gerrit.wikimedia.org/r/c/operations/alerts/+/964025/ (it always confuses me) [15:17:06] Looking [15:17:41] the idea is to land to a specific dashboard that shows the issue [15:17:46] rather than a generic one [15:18:15] Ah, yes. The sixteen levels of templating and indirection don't help :) [15:18:35] LGTM! [15:18:36] the current alert gets us to the main dashboard [15:18:45] (after being confused a bit about "moon" :D) [15:18:46] that is not alwasy straightforward [15:22:32] one question about the goals - did we discuss about the kserve 0.11 upgrade? [15:25:20] anyway, going to log off, have a nice weekend folks! [15:33:15] we said that we need to serve a certain amount of time for maintenance [15:33:30] but we didn't add a goal. let's revisit next week [15:36:16] and now i spend half the day writing weekly updates [15:58:23] automate [15:58:36] :P [16:01:40] ChatGPT is so wordy [16:02:47] i know (and i am wordy)! you need your own autosummarizer. but the _really_ hard part: dom elements or apis of gdocs and asana [16:03:04] i kid i kid, i suppose it's not that bad [20:48:26] (03PS2) 10Jdlrobson: Don't use live configuration [extensions/ORES] - 10https://gerrit.wikimedia.org/r/957970 (https://phabricator.wikimedia.org/T345922) (owner: 10Jsn.sherman)