[06:31:56] ragesoss: o/ it depends on the use case - for regular bare metal nodes, we tend to use the canonical Python version brought by the Debian OS version, for k8s we do the same but we rely on the version installed in the Docker image. In general we tend not to have multiple versions of Python on the same node/container/etc. (not sure if I have answered your question)
[08:52:50] (CR) AikoChou: editquality: refactor preprocess common code (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:05:36] (CR) AikoChou: articlequality: refactor code to use the new extractor_utils module (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:07:17] (CR) AikoChou: [C: +1] drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:09:35] (CR) AikoChou: [C: +1] "LGTM :) I was wondering have you tested if the predicted scores with mw_http_cache are the same as the predicted scores without mw_http_ca" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:39:11] aiko: thanks for the reviews!
[09:39:23] the tool that I had in mind yesterday is https://www.benthos.dev/, in Go, not Python
[09:41:51] (CR) Elukey: editquality: refactor preprocess common code (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[10:15:20] (PS4) Elukey: editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915)
[10:15:22] (PS2) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[10:15:24] (PS2) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[10:15:26] (PS2) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[10:15:49] (CR) Elukey: editquality: refactor preprocess common code (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[10:26:59] elukey: I approve of Go tools ;)
[10:28:00] elukey: speaking of: we might want to propose making a .deb for https://github.com/stern/stern --- it's basically multitail for k8s container logs, and it's very useful for debugging. While the same can be done with shell scripts and the like, stern is much nicer to use, and more robust
[10:29:05] sure!
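A minimal sketch of the multitail-style stern usage klausman describes above; the flags shown are real stern options, but the pod-name pattern, namespace, and container name are hypothetical:

    # Tail logs from every pod whose name matches the regex, across restarts;
    # the pattern and namespace are illustrative only.
    stern "enwiki-articlequality" --namespace kserve-test --timestamps

    # Narrow to one container and only recent log lines:
    stern "enwiki-.*" -n kserve-test --container kserve-container --since 15m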
[10:31:29] artificial-intelligence, Technical-Tool-Request: Auto copyeditor - https://phabricator.wikimedia.org/T317050 (Aklapper) p: Triage→Low
[10:31:42] (PS3) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[10:31:44] (PS3) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[10:31:46] (PS3) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[10:32:17] (CR) Elukey: "Code updated with Aiko's suggestions! Lemme know if it is missing anything :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[10:32:40] aiko: code updated with your suggestions :)
[10:34:21] * elukey lunch!
[11:20:10] Guys. GUYS. From my home workstation:
[11:20:23] curl "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-articlequality:predict" -X POST -d '{ "rev_id": 123456 }' -H "Authorization: Bearer $TOKEN";echo
[11:20:24] {"predictions": {"prediction": "Stub", "probability": {"B": 0.017382693143129683, "C": 0.011305576384229396, "FA": 0.002078191955918339, "GA": 0.0029161293780774434, "Start": 0.05709479871741571, "Stub": 0.9092226104212294}}}
[11:43:50] <- lunch
[12:17:47] (CR) AikoChou: editquality: refactor preprocess common code (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[12:18:35] (CR) AikoChou: [C: +1] articlequality: refactor code to use the new extractor_utils module (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[12:52:19] klausman: wow nice!
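The $TOKEN in the curl above is an API Gateway bearer token. As a hedged sketch of one way to obtain it, assuming an owner-only OAuth 2.0 client already registered on Meta (CLIENT_ID and CLIENT_SECRET are placeholders, and jq is assumed to be installed):

    # Exchange client credentials for a bearer token usable against
    # api.wikimedia.org; the OAuth client itself must already exist.
    TOKEN=$(curl -s -X POST "https://meta.wikimedia.org/w/rest.php/oauth2/access_token" \
        -d grant_type=client_credentials \
        -d client_id="$CLIENT_ID" \
        -d client_secret="$CLIENT_SECRET" | jq -r .access_token)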
[12:53:02] There's still some open stuff about routing to different LW services/endpoints, but as a proof of concept that the API GW can do Its Thing for us, it works
[12:54:17] Machine-Learning-Team, artificial-intelligence, Research: [Epic] Article importance prediction model - https://phabricator.wikimedia.org/T155541 (Isaac) a: Isaac→None
[12:54:21] Machine-Learning-Team, artificial-intelligence, Research: [Epic] Article importance prediction model - https://phabricator.wikimedia.org/T155541 (Isaac) a: Isaac
[12:54:53] (PS5) Elukey: editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915)
[12:54:55] (PS4) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[12:54:57] (PS4) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[12:54:59] (PS4) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[12:55:11] (CR) Elukey: editquality: refactor preprocess common code (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[12:57:17] klausman: do we have some basic rate limit in place for the lw service? Just to avoid surprises these days, since everything is now public
[12:58:22] Yes, I think 5000 qph (500 for anon users), and there is some barrier to entry (making an MW account, getting an API token, getting a bearer JWT from MW OAuth, and so on)
[12:58:53] atm, not having a token at all does not work, i.e. you get an error from the GW itself and LW is not bothered
[13:05:30] ack
[13:06:26] (CR) Elukey: [C: +2] editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:08:42] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (akosiaris)
[13:09:01] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (akosiaris) p: Triage→High
[13:14:31] (Merged) jenkins-bot: editquality: refactor preprocess common code [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/829847 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:36:39] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host rdb1009.eqiad.wmnet with OS bullseye
[13:40:21] (PS5) Elukey: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915)
[13:40:23] (PS5) Elukey: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915)
[13:40:25] (PS5) Elukey: drafttopic: move preprocess to async [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830084 (https://phabricator.wikimedia.org/T313915)
[13:40:27] (PS1) Elukey: Fix variable name in extractor_utils.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830635 (https://phabricator.wikimedia.org/T313915)
[13:41:14] anybody free for a quick code review? https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/830635
[13:41:41] missed a mistyped variable name, and found it only when testing in staging
[13:46:19] I'll have a look
[13:46:49] (CR) Klausman: [C: +1] Fix variable name in extractor_utils.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830635 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[13:46:52] LGTM!
[13:50:43] elukey: thanks! i meant on your own machine for development, rather than on production hardware. like, i'm standing up a dev environment for a django app, and realizing that it has problems running on the python 3.10 that my debian system has. (i used pyenv to install 3.9... just wondering if there are other things that python experts would recommend instead.)
[13:58:09] danke :)
[13:59:10] ragesoss: ahhh okok, maybe using Docker could help to test on various environments
[13:59:51] (CR) Elukey: [C: +2] Fix variable name in extractor_utils.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830635 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[14:00:05] elukey: okay. i'll try that out next time. pyenv got me where i needed to be this time.
[14:00:12] ack!
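For reference, a sketch of the two approaches discussed above for getting a Python version the host distribution doesn't ship; the exact 3.9.x version and image tag are illustrative:

    # pyenv route (what ragesoss used): build and pin a 3.9.x per project.
    pyenv install 3.9.18
    pyenv local 3.9.18    # writes .python-version in the project directory

    # Docker route (what elukey suggests): run the test suite against an
    # official python image without touching the host interpreter.
    docker run --rm -v "$PWD":/app -w /app python:3.9-slim \
        sh -c "pip install -r requirements.txt && python -m pytest"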
[14:04:12] going to get some groceries while the new Docker image is being built and published :)
[14:07:19] (Merged) jenkins-bot: Fix variable name in extractor_utils.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830635 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[14:08:23] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host rdb1009.eqiad.wmnet with OS bullseye completed: - rdb1009 (**PASS**) - Downtime...
[14:23:12] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host rdb1010.eqiad.wmnet with OS bullseye
[14:46:24] testing the new docker image
[14:46:44] I am also comparing wrk results before/after, just to make sure that I didn't introduce anything weird
[14:46:56] wrk?
[14:47:20] it is a benchmark tool on deploy1002, basic but nice
[14:47:28] I am using it for perf evaluation
[14:47:30] ah, right.
[14:47:36] basically like apachebench?
[14:47:41] yes yes
[14:47:47] :+1:
[14:49:29] all good from the tests, proceeding with articlequality
[14:50:02] (CR) Elukey: [C: +2] articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[14:54:49] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host rdb1010.eqiad.wmnet with OS bullseye completed: - rdb1010 (**PASS**) - Downtime...
[14:55:14] (Merged) jenkins-bot: articlequality: refactor code to use the new extractor_utils module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830058 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[14:57:01] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (akosiaris) Open→Resolved Hosts reimaged. Aside from a small backlog for changeprop and an increase in latency for api-gateway for a bit, no other side effects. Closing thi...
[15:07:08] Machine-Learning-Team, ORES, serviceops: Reimage rdb1009, rdb1010 as bullseye - https://phabricator.wikimedia.org/T317189 (MoritzMuehlenhoff) Thanks :-)
[15:13:57] (CR) Elukey: [C: +2] draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[15:15:49] aiko: o/ when you have a moment, lemme know more details about your question in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/830061, I didn't see it till now, sorry. The results should be the same (as for the other models), any concern for draftquality?
[15:21:31] (Merged) jenkins-bot: draftquality: move to async preprocess [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/830061 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
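A sketch of the kind of wrk run mentioned above for before/after comparisons; wrk needs a small Lua script for POST bodies, and the thread/connection counts, duration, and target URL here are all illustrative, not the actual values used on deploy1002:

    # Describe the POST request wrk should replay.
    cat > post.lua <<'EOF'
    wrk.method = "POST"
    wrk.body   = '{ "rev_id": 123456 }'
    wrk.headers["Content-Type"] = "application/json"
    EOF

    # 4 threads, 32 connections, 60 seconds, with a latency distribution.
    wrk -t4 -c32 -d60s -s post.lua --latency \
        "https://inference-staging.example.org/v1/models/enwiki-articlequality:predict"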
[15:39:15] elukey: o/ If the results are the same, then that's no problem! The question was for all models, not just for draftquality :)
[15:39:40] ahhh okok
[15:39:48] yes yes, I checked that everything works :)
[15:40:01] nice \o/
[15:40:57] I am currently checking the perf for draftquality on staging
[15:41:07] before the new async stuff
[15:41:11] and it already seems to scale well
[15:41:37] or at least better than edit/articlequality (before async)
[15:42:28] but it doesn't make a lot of sense, I saw multiple calls to the MW API from my tests
[15:42:31] mmmmm
[15:44:38] the perf in the prod clusters is worse though
[15:45:47] ah lol, requests in staging are aborting early (probably due to istio throttling)
[15:46:15] (in prod sorry)
[15:46:26] okok, need to file a code change first
[15:50:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/830661/
[16:04:55] going afk for the evening, have a nice rest of the day folks
[16:05:07] will restart working on draftquality tomorrow morning
[16:17:53] there's a security update for runc, which has been deployed to the wikikube cluster without any issues; should I also go ahead for the ml cluster(s), or do you want to specifically test some things first? (can also do one DC only initially)
[17:21:21] Can you do codfw-staging first, then codfw, then eqiad?
[17:21:49] If so: go ahead
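The runc rollout itself isn't shown in the log; as a loosely hedged sketch, a per-node sequence for updating a container runtime on a k8s cluster often looks like the following (the node name is hypothetical, and WMF's actual reimage/update cookbooks are not represented here):

    # Stop scheduling onto the node and evict its pods.
    kubectl cordon ml-serve2001.codfw.wmnet
    kubectl drain ml-serve2001.codfw.wmnet --ignore-daemonsets --delete-emptydir-data

    # Apply the security update on the node, then put it back in service.
    ssh ml-serve2001.codfw.wmnet 'sudo apt-get update && sudo apt-get install -y runc'
    kubectl uncordon ml-serve2001.codfw.wmnet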