[06:30:48] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Add language support for Swahili (sw) - https://phabricator.wikimedia.org/T162271 (kevinbazira) a: kevinbazira→None [07:03:20] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Improve ml-serve's Istio logs - https://phabricator.wikimedia.org/T300707 (elukey) p: Triage→Medium a: elukey [07:35:53] good morning :) [07:36:16] the circuit breaking settings for istio may be more difficult than expected, without using the full mesh [07:36:39] it seems that the rules that I have added are enforced by the istio sidecar proxies, not by the egress gateway [07:36:47] that is a little counterintuitive [07:37:25] it makes sense for any generic service defined within the mesh, but for external services I hoped for a simpler config [07:37:51] there are also rate limit configs, but it seems that they need redis as a backend [08:57:17] https://github.com/istio/istio/blob/01e46847e53cb17a7293f781642ad88adf2806e1/tests/integration/telemetry/policy/testdata/enable_envoy_local_ratelimit.yaml [09:19:25] Machine-Learning-Team, ORES, Technical-Debt: Inject Config to ORESService, convert tests to unit tests - https://phabricator.wikimedia.org/T232440 (kostajh) a: kostajh→None [09:36:20] Lift-Wing, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (elukey) @ACraze @kevinbazira I am currently wondering if the transformer strategy is a good compromise for the revscoring use case. We put a lot of work... [09:37:33] ok so in theory we can come up with a horrible envoy config to pass to istio [09:37:46] not what I had expected but.. [11:37:32] * elukey lunch! [16:11:05] Morning all! 
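(Editor's note on the circuit-breaking discussion at 07:36 above: the settings in question are the standard Istio `DestinationRule` knobs, which, as elukey observed, are enforced by the client-side sidecar proxies rather than by an egress gateway. A minimal, hypothetical sketch for an external service follows; the host name and all thresholds are illustrative, not the actual Lift Wing config.)

```yaml
# Hypothetical sketch: circuit breaking for a mesh-external service.
# Host and thresholds are made up for illustration.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: api-external
spec:
  hosts:
  - api.example.wmnet        # hypothetical external host
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-external-circuit-breaker
spec:
  host: api.example.wmnet
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 50  # queue depth before rejecting
    outlierDetection:                # eject endpoints that keep failing
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

(The linked `enable_envoy_local_ratelimit.yaml` test shows the alternative mentioned at 07:37/09:37: a raw `EnvoyFilter` for local rate limiting, which avoids the redis-backed global rate limit service at the cost of a much more verbose config.)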
[16:11:11] o/ [16:11:28] morning :) [16:11:55] I am fighting with partman and debian installs for the new overlayfs thing, for today I decided to stop looking into istio :D [16:24:39] Machine-Learning-Team, serviceops, Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) Tested the recipe on ml-serve2005, and it looks good. We have 2x480GB SSDs + 2x2TB HDDs (that we don't currently use), and the recipe u... [16:35:51] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul) [16:36:50] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul) Open→Resolved complete [16:36:53] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) [16:37:46] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) Open→Resolved complete [16:44:04] (CR) Accraze: [C: +2] topic: add transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [16:44:13] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (kevinbazira) Thank you for clarifying on the CPU limit, @ACraze. I removed the isvc that was running and created a new one: ` root@ml-sandbox:/srv/home/kevinbazira# kubectl get inferenc... 
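(Editor's note on the overlayfs migration task at 16:24:39: the Docker-side end state is roughly a storage-driver switch in the daemon config. The actual change goes through puppet/hiera as discussed later in the log; this is only an illustrative sketch of what the resulting daemon.json would contain.)

```json
{
  "storage-driver": "overlay2",
  "data-root": "/var/lib/docker"
}
```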
[16:49:58] (CR) Accraze: [V: +2 C: +2] topic: add transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [16:50:23] (CR) Accraze: "Manually merging due to pipeline not setup in config repo yet" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [17:00:37] accraze: o/ [17:00:55] do we have an estimate of how much a model weighs in MBs? [17:00:59] more or less [17:04:01] elukey: it depends on which model class, editquality is ~10MB, articlequality is ~40MB, draftquality is ~2MB, topic is ~45MB [17:04:30] I just saw your note about transformers, let's definitely talk about it at the technical meeting, I'm coming to similar conclusions for the revscoring models [17:05:20] we may be better off just putting everything into a single predictor model.py [17:05:53] (articlequality being the exception since it does not need the model in the transformer) [17:06:26] yeah I hate the fact that you and Kevin spent a lot of time on transformers, but we now have a good understanding of them (for future models etc..) [17:06:37] for the size, super good, thanks a lot [17:06:51] for each kubernetes worker node we have 4 disks [17:06:56] 2x480GB SSDs [17:07:02] 2x2TB HDDs [17:07:30] we are using only the SSDs, and I have to change the partitions on all workers for the Overlay FS migration in Docker [17:07:54] using non-SSDs for docker images may be a problem (due to their slowness etc..) [17:08:12] but we could use the HDDs for model storage, if the models get too big [17:08:15] but it doesn't seem so [17:09:29] nice, other models might be bigger... i feel like the outlink binary is > 200MB maybe? [17:10:06] but yeah it doesn't seem like we'll run out soon :) [17:14:00] is there a worry about running out of disk space? 
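(Editor's note on the "single predictor model.py" idea at 17:05:20: the shape being discussed can be sketched in plain Python, without the kserve dependency. Everything here is hypothetical and illustrative; `fetch_features` stands in for the MediaWiki API / revscoring feature-extraction step that the separate transformer currently performs, and the lambda stands in for the loaded model binary.)

```python
# Hypothetical sketch of folding feature retrieval into the predictor
# itself, instead of running it in a separate KServe transformer.
# All names and values are illustrative.

def fetch_features(rev_id: int) -> dict:
    """Stand-in for the MW API / revscoring feature extraction step."""
    return {"rev_id": rev_id, "chars_added": 120, "is_anon": False}

class RevscoringPredictor:
    """One model.py doing both preprocess and predict."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # stand-in for the loaded model binary

    def preprocess(self, request: dict) -> dict:
        # What the transformer used to do, now inlined in the predictor.
        return fetch_features(request["rev_id"])

    def predict(self, features: dict) -> dict:
        return {"probability": self.score_fn(features)}

predictor = RevscoringPredictor(lambda f: 0.97 if not f["is_anon"] else 0.42)
result = predictor.predict(predictor.preprocess({"rev_id": 12345}))
```

(The trade-off matches the chat: one fewer pod hop and a simpler deployment, at the cost of losing the reusable transformer, which articlequality would not have needed anyway.)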
[17:16:48] nono we should be very good, I was double checking the sizes that I had in mind.. I think that we may not need to use the HDDs, but in case, there will be (slow) space to use [17:16:57] I am inclined not to use it for the moment though [17:17:03] SSDs perform so much better [17:18:10] ah cool, got it [17:24:18] accraze: what percentage of existing ORES models have we migrated to Lift Wing, or close to it? [17:24:48] only the enwiki ones for the moment (that are deployed and live) [17:24:57] okay cool [17:27:34] ^^ yup [17:28:14] in theory we could quickly add others (just need to upload the model binary and then write the config in deployment-charts) [17:30:29] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Editquality Transformer - https://phabricator.wikimedia.org/T298943 (ACraze) Confirming I was able to run the editquality transformer image on ml-sandbox last week: ` root@ml-sandbox:/srv/home/accraze/isvcs/editquality# ./test.sh enw... [17:57:35] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) I have cleared out the old s3 buckets and have added documentation for our dev model storage: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/ML-Sandbox#Model_Sto... 
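(Editor's note on the sizing conversation above: a quick back-of-the-envelope check supports the "we should be very good" conclusion. The per-class sizes come from accraze's numbers at 17:04/17:09; the per-class deployment count is made up for illustration, and the usable capacity assumes the RAID1 mirror over the 2x480GB SSDs, i.e. roughly one disk's worth before the OS and Docker take their share.)

```python
# Back-of-the-envelope model storage estimate. Sizes (MB) come from the
# chat above; the deployment count per class is a made-up assumption.
model_size_mb = {
    "editquality": 10,
    "articlequality": 40,
    "draftquality": 2,
    "topic": 45,
    "outlink": 200,  # "> 200MB maybe" per accraze, so a rough upper guess
}
# Hypothetical: every class eventually deployed for ~100 wikis.
deployments_per_class = 100

total_mb = sum(model_size_mb.values()) * deployments_per_class
usable_gb = 480  # RAID1 mirrors the 2x480GB SSDs, so ~one disk usable

print(f"~{total_mb / 1024:.0f} GB of models vs ~{usable_gb} GB usable")
```

Even with generous assumptions the models land around 30 GB, an order of magnitude below the SSD capacity, which matches the decision to keep the HDDs out of the picture for now.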
[17:59:31] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) [18:14:27] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (calbon) In progress→Resolved a: calbon [18:15:11] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) a: calbon→ACraze [18:22:02] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (calbon) [18:32:18] klausman: we didn't have time to discuss https://phabricator.wikimedia.org/T300744 but FYI I am testing new settings on ml-serve2005, which is one of the new codfw nodes [18:32:26] (new partman recipe etc..) [18:32:42] Oh, nice. [18:32:53] in theory we should move to bullseye + overlay soonish [18:32:59] If you want a second set of eyes or similar, lmk [18:33:09] but I'd like to test it on buster + overlay first [18:33:21] Yes, ideally we'd be able to do both/either [18:33:26] yeah if you want to check the new partitioning scheme etc.. everything is in the task [18:34:01] are the fs'es all ext4? 
[18:34:07] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (ACraze) [18:34:11] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/759678 tomorrow to force the support for overlay via hiera (the current devicemapper code is a little convoluted) [18:34:21] after that, ml-serve2005 should be ready to go [18:34:25] yes all ext4 [18:34:43] raid1 between the two 480G SSDs, multiple lvm volumes [18:34:55] (Janis and Giuseppe asked for more configurability) [18:35:03] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Editquality Transformer - https://phabricator.wikimedia.org/T298943 (ACraze) Open→Resolved [18:35:16] I experimented with using btrfs subvolumes in a previous project, eliminating overlayfs. But I don't think that risk is worth it (there are advantages, but you stray from what everyone else does etc) [18:35:52] ah yes Janis mentioned it, maybe in the future we could try btrfs (totally ignorant about it atm) [18:36:11] I've been using it privately for close to a decade now :) [18:36:30] lemme know if you have anything against the settings, if not I'll proceed tomorrow :) [18:36:45] (it is also easy to wipe ml-serve2005 again if needed) [18:37:08] I'll give them a read and let you know :) When in doubt, you're probably right ;) [18:37:19] migrating to bullseye is another can of worms, since we'll need to adjust puppet and apt (copy packages etc..) [18:37:40] and hope that we'll not see horrible things like the iptables regression that we saw in buster :( [18:37:54] this is why I want to keep the overlay testing separate [18:38:01] too many variables :D [18:38:14] Aye [18:38:19] (but we'll hopefully reimage one-by-one with both overlay and bullseye) [18:38:23] ack perfect, thanks :) [18:38:38] I am going afk! have a nice (rest of the) day folks :) [18:39:44] seeya elukey! 
[18:39:49] (last but not least - the 2x2TB HDDs are out of the game for the moment; using them for /var/lib/docker seems like not a great choice, but we can keep them in case storage is needed for models) [18:44:23] elukey: I agree with your plan and rationale re: the buster/bullseye and repartitioning move. While I get Joe's point about reimaging twice, I also feel that the add'l exercise is building confidence. Plus, we can give a mixed cluster (some nodes Buster, some Bullseye) a try. Might provide interesting differential info. [18:44:58] Sorry, s/Joe/Jayme/ [18:51:23] (CR) Accraze: [C: +2] nlwiki articlequality, hiwiki editquality, ores observability [services/ores/deploy] - https://gerrit.wikimedia.org/r/755731 (https://phabricator.wikimedia.org/T300195) (owner: Halfak) [18:52:46] (CR) Accraze: [V: +2 C: +2] nlwiki articlequality, hiwiki editquality, ores observability [services/ores/deploy] - https://gerrit.wikimedia.org/r/755731 (https://phabricator.wikimedia.org/T300195) (owner: Halfak) [18:54:59] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (ACraze) The deploy CR has been merged and we... [20:59:34] rebuilding ml-sandbox to provision more cpu, should be back up in a bit [21:01:02] trying to attach a lime explainer to enwiki-goodfaith to see if we can get feature importance back from the `:explain` api [21:02:38] it should work in theory, but I keep hitting the cpu limits on the sandbox (just like when kevinbazira tried testing the transformer last week) [21:30:42] ml-sandbox cluster is back up
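(Editor's note on the `:explain` experiment at 21:01: KServe's v1 data plane exposes `POST /v1/models/<name>:explain` with the same `{"instances": [...]}` payload as `:predict`, and the attached explainer (lime, in this case) returns feature attributions instead of scores. A small sketch of the request construction follows; the host and revision id are hypothetical, and no network call is made.)

```python
import json

def build_explain_request(host: str, model: str, rev_id: int):
    """Build (url, body) for a KServe v1 :explain call.

    The v1 protocol uses the same payload shape for :predict and
    :explain; only the URL suffix differs. Host/rev_id here are
    illustrative, not real endpoints.
    """
    url = f"http://{host}/v1/models/{model}:explain"
    body = json.dumps({"instances": [{"rev_id": rev_id}]}).encode()
    return url, body

# Hypothetical host and revision, just to show the shapes involved.
url, body = build_explain_request("ml-sandbox.local", "enwiki-goodfaith", 123456)
```

(Actually issuing the request, e.g. with urllib, is where the sandbox CPU limits mentioned at 21:02 bite: the lime explainer perturbs the input many times per call, so it is far heavier than a plain `:predict`.)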