[06:30:48] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Add language support for Swahili (sw) - https://phabricator.wikimedia.org/T162271 (kevinbazira) a: kevinbazira→None [07:03:20] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Improve ml-serve's Istio logs - https://phabricator.wikimedia.org/T300707 (elukey) p: Triage→Medium a: elukey [07:35:53] good morning :) [07:36:16] the circuit breaking settings for istio may be more difficult than expected, without using the full mesh [07:36:39] it seems that the rules that I have added are enforced by the istio sidecar proxies, not by the egress gateway [07:36:47] that is a little counterintuitive [07:37:25] it makes sense for any generic service defined within the mesh, but for external services I hoped for a simpler config [07:37:51] there are also rate limit configs, but it seems that they need redis as a backend [08:57:17] https://github.com/istio/istio/blob/01e46847e53cb17a7293f781642ad88adf2806e1/tests/integration/telemetry/policy/testdata/enable_envoy_local_ratelimit.yaml [09:19:25] Machine-Learning-Team, ORES, Technical-Debt: Inject Config to ORESService, convert tests to unit tests - https://phabricator.wikimedia.org/T232440 (kostajh) a: kostajh→None [09:36:20] Lift-Wing, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (elukey) @ACraze @kevinbazira I am currently wondering if the transformer strategy is a good compromise for the revscoring use case. We put a lot of work... [09:37:33] ok so in theory we can come up with a horrible envoy config to pass to istio [09:37:46] not what I had expected but.. [11:37:32] * elukey lunch! [16:11:05] Morning all! 
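(Editor's note on the circuit-breaking discussion at 07:36 above: the settings in question are the standard Istio `DestinationRule` knobs, which, as elukey observed, are enforced by the client-side sidecar proxies rather than by an egress gateway. A minimal, hypothetical sketch for an external service follows; the host name and all thresholds are illustrative, not the actual Lift Wing config.)

```yaml
# Hypothetical sketch: circuit breaking for a mesh-external service.
# Host and thresholds are made up for illustration.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: api-external
spec:
  hosts:
  - api.example.wmnet        # hypothetical external host
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-external-circuit-breaker
spec:
  host: api.example.wmnet
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 50  # queue depth before rejecting
    outlierDetection:                # eject endpoints that keep failing
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```

(The linked `enable_envoy_local_ratelimit.yaml` test shows the alternative mentioned at 07:37/09:37: a raw `EnvoyFilter` for local rate limiting, which avoids the redis-backed global rate limit service at the cost of a much more verbose config.)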
[16:11:11] o/ [16:11:28] morning :) [16:11:55] I am fighting with partman and debian installs for the new overlayfs thing, for today I decided to stop looking into istio :D [16:24:39] Machine-Learning-Team, serviceops, Patch-For-Review: Move Docker settings for kubernetes workers to overlay fs - https://phabricator.wikimedia.org/T300744 (elukey) Tested the recipe on ml-serve2005, and it looks good. We have 2x480GB SSDs + 2x2TB HDDs (that we don't currently use), and the recipe u... [16:35:51] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul) [16:36:50] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul) Open→Resolved complete [16:36:53] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) [16:37:46] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) Open→Resolved complete [16:44:04] (CR) Accraze: [C: +2] topic: add transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [16:44:13] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (kevinbazira) Thank you for clarifying on the CPU limit, @ACraze. I removed the isvc that was running and created a new one: ` root@ml-sandbox:/srv/home/kevinbazira# kubectl get inferenc... 
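(Editor's note on the overlayfs migration task at 16:24:39: the Docker-side end state is roughly a storage-driver switch in the daemon config. The actual change goes through puppet/hiera as discussed later in the log; this is only an illustrative sketch of what the resulting daemon.json would contain.)

```json
{
  "storage-driver": "overlay2",
  "data-root": "/var/lib/docker"
}
```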
[16:49:58] (CR) Accraze: [V: +2 C: +2] topic: add transformer blubberfile [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [16:50:23] (CR) Accraze: "Manually merging due to pipeline not setup in config repo yet" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/760517 (https://phabricator.wikimedia.org/T298990) (owner: Kevin Bazira) [17:00:37] accraze: o/ [17:00:55] do we have an estimate of how much a model weighs in MBs? [17:00:59] more or less [17:04:01] elukey: it depends on which model class, editquality is ~10MB, articlequality is ~40MB, draftquality is ~2MB, topic is ~45MB [17:04:30] I just saw your note about transformers, let's definitely talk about it at the technical meeting, I'm coming to similar conclusions for the revscoring models [17:05:20] we may be better off just putting everything into a single predictor model.py [17:05:53] (articlequality being the exception since it does not need the model in the transformer) [17:06:26] yeah I hate the fact that you and Kevin spent a lot of time on transformers, but we now have a good understanding of them (for future models etc..) [17:06:37] for the size, super good, thanks a lot [17:06:51] for each kubernetes worker node we have 4 disks [17:06:56] 2x480GB SSDs [17:07:02] 2x2TB HDDs [17:07:30] we are using only the SSDs, and I have to change the partitions on all workers for the Overlay FS migration in Docker [17:07:54] using non-SSDs for docker images may be a problem (due to their slowness etc..) [17:08:12] but we could use the HDDs for model storage, if the models get too big [17:08:15] but it doesn't seem so [17:09:29] nice, other models might be bigger... i feel like the outlink binary is > 200MB maybe? [17:10:06] but yeah it doesn't seem like we'll run out soon :) [17:14:00] is there a worry about running out of disk space? 
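(Editor's note on the "single predictor model.py" idea at 17:05:20: the shape being discussed can be sketched in plain Python, without the kserve dependency. Everything here is hypothetical and illustrative; `fetch_features` stands in for the MediaWiki API / revscoring feature-extraction step that the separate transformer currently performs, and the lambda stands in for the loaded model binary.)

```python
# Hypothetical sketch of folding feature retrieval into the predictor
# itself, instead of running it in a separate KServe transformer.
# All names and values are illustrative.

def fetch_features(rev_id: int) -> dict:
    """Stand-in for the MW API / revscoring feature extraction step."""
    return {"rev_id": rev_id, "chars_added": 120, "is_anon": False}

class RevscoringPredictor:
    """One model.py doing both preprocess and predict."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # stand-in for the loaded model binary

    def preprocess(self, request: dict) -> dict:
        # What the transformer used to do, now inlined in the predictor.
        return fetch_features(request["rev_id"])

    def predict(self, features: dict) -> dict:
        return {"probability": self.score_fn(features)}

predictor = RevscoringPredictor(lambda f: 0.97 if not f["is_anon"] else 0.42)
result = predictor.predict(predictor.preprocess({"rev_id": 12345}))
```

(The trade-off matches the chat: one fewer pod hop and a simpler deployment, at the cost of losing the reusable transformer, which articlequality would not have needed anyway.)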
[17:16:48] nono we should be very good, I was double checking the sizes that I had in mind.. I think that we may not need to use the HDDs, but in case, there will be (slow) space to use [17:16:57] I am inclined not to use it for the moment though [17:17:03] SSDs perform so much better [17:18:10] ah cool, got it [17:24:18] accraze: what percentage of existing ORES models have we migrated to Lift Wing, or close to it? [17:24:48] only the enwiki ones for the moment (that are deployed and live) [17:24:57] okay cool [17:27:34] ^^ yup [17:28:14] in theory we could quickly add others (just need to upload the model binary and then write the config in deployment-charts) [17:30:29] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Editquality Transformer - https://phabricator.wikimedia.org/T298943 (ACraze) Confirming I was able to run the editquality transformer image on ml-sandbox last week: ` root@ml-sandbox:/srv/home/accraze/isvcs/editquality# ./test.sh enw... [17:57:35] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) I have cleared out the old s3 buckets and have added documentation for our dev model storage: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/ML-Sandbox#Model_Sto... 
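(Editor's note on the sizing conversation above: a quick back-of-the-envelope check supports the "we should be very good" conclusion. The per-class sizes come from accraze's numbers at 17:04/17:09; the per-class deployment count is made up for illustration, and the usable capacity assumes the RAID1 mirror over the 2x480GB SSDs, i.e. roughly one disk's worth before the OS and Docker take their share.)

```python
# Back-of-the-envelope model storage estimate. Sizes (MB) come from the
# chat above; the deployment count per class is a made-up assumption.
model_size_mb = {
    "editquality": 10,
    "articlequality": 40,
    "draftquality": 2,
    "topic": 45,
    "outlink": 200,  # "> 200MB maybe" per accraze, so a rough upper guess
}
# Hypothetical: every class eventually deployed for ~100 wikis.
deployments_per_class = 100

total_mb = sum(model_size_mb.values()) * deployments_per_class
usable_gb = 480  # RAID1 mirrors the 2x480GB SSDs, so ~one disk usable

print(f"~{total_mb / 1024:.0f} GB of models vs ~{usable_gb} GB usable")
```

Even with generous assumptions the models land around 30 GB, an order of magnitude below the SSD capacity, which matches the decision to keep the HDDs out of the picture for now.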
[17:59:31] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) [18:14:27] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (calbon) In progress→Resolved a: calbon [18:15:11] Lift-Wing, Machine-Learning-Team (Active Tasks): Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) a: calbon→ACraze [18:22:02] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (calbon) [18:32:18] klausman: we didn't have time to discuss https://phabricator.wikimedia.org/T300744 but FYI I am testing new settings on ml-serve2005, which is one of the new codfw nodes [18:32:26] (new partman recipe etc..) [18:32:42] Oh, nice. [18:32:53] in theory we should move to bullseye + overlay soonish [18:32:59] If you want a second set of eyes or similar, lmk [18:33:09] but I'd like to test it on buster + overlay first [18:33:21] Yes, ideally we'd be able to do both/either [18:33:26] yeah if you want to check the new partitioning scheme etc.. everything is in the task [18:34:01] are the fs'es all ext4? 
[18:34:07] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (ACraze) [18:34:11] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/759678 tomorrow to force the support for overlay via hiera (the current devicemapper code is a little convoluted) [18:34:21] after that, ml-serve2005 should be ready to go [18:34:25] yes all ext4 [18:34:43] raid1 between the two 480G SSDs, multiple lvm volumes [18:34:55] (Janis and Giuseppe asked for more configurability) [18:35:03] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Editquality Transformer - https://phabricator.wikimedia.org/T298943 (ACraze) Open→Resolved [18:35:16] I experimented with using btrfs subvolumes in a previous project, eliminating overlayfs. But I don't think that risk is worth it (there are advantages, but you stray from what everyone else does etc) [18:35:52] ah yes Janis mentioned it, maybe in the future we could try btrfs (totally ignorant about it atm) [18:36:11] I've been using it privately for close to a decade now :) [18:36:30] lemme know if you have anything against the settings, if not I'll proceed tomorrow :) [18:36:45] (it is also easy to wipe ml-serve2005 again if needed) [18:37:08] I'll give them a read and let you know :) When in doubt, you're probably right ;) [18:37:19] migrating to bullseye is another can of worms, since we'll need to adjust puppet and apt (copy packages etc..) [18:37:40] and hope that we'll not see horrible things like the iptables regression that we saw in buster :( [18:37:54] this is why I want to keep the overlay testing separate [18:38:01] too many variables :D [18:38:14] Aye [18:38:19] (but we'll hopefully reimage one-by-one with both overlay and bullseye) [18:38:23] ack perfect, thanks :) [18:38:38] I am going afk! have a nice (rest of the) day folks :) [18:39:44] seeya elukey! 
[18:39:49] (last but not least - the 2x2TB HDDs are out of the game for the moment; using them for /var/lib/docker seems like not a great choice, but we can keep them in case storage is needed for models) [18:44:23] elukey: I agree with your plan and rationale re: the buster/bullseye and repartitioning move. While I get Joe's point about reimaging twice, I also feel that the add'l exercise is building confidence. Plus, we can give a mixed cluster (some nodes Buster, some Bullseye) a try. Might provide interesting differential info. [18:44:58] Sorry, s/Joe/Jayme/ [18:51:23] (CR) Accraze: [C: +2] nlwiki articlequality, hiwiki editquality, ores observability [services/ores/deploy] - https://gerrit.wikimedia.org/r/755731 (https://phabricator.wikimedia.org/T300195) (owner: Halfak) [18:52:46] (CR) Accraze: [V: +2 C: +2] nlwiki articlequality, hiwiki editquality, ores observability [services/ores/deploy] - https://gerrit.wikimedia.org/r/755731 (https://phabricator.wikimedia.org/T300195) (owner: Halfak) [18:54:59] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (ACraze) The deploy CR has been merged and we... [20:59:34] rebuilding ml-sandbox to provision more cpu, should be back up in a bit [21:01:02] trying to attach a lime explainer to enwiki-goodfaith to see if we can get feature importance back from the `:explain` api [21:02:38] it should work in theory, but I keep hitting the cpu limits on the sandbox (just like when kevinbazira tried testing the transformer last week) [21:30:42] ml-sandbox cluster is back up
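(Editor's note on the `:explain` experiment at 21:01: KServe's v1 data plane exposes `POST /v1/models/<name>:explain` with the same `{"instances": [...]}` payload as `:predict`, and the attached explainer (lime, in this case) returns feature attributions instead of scores. A small sketch of the request construction follows; the host and revision id are hypothetical, and no network call is made.)

```python
import json

def build_explain_request(host: str, model: str, rev_id: int):
    """Build (url, body) for a KServe v1 :explain call.

    The v1 protocol uses the same payload shape for :predict and
    :explain; only the URL suffix differs. Host/rev_id here are
    illustrative, not real endpoints.
    """
    url = f"http://{host}/v1/models/{model}:explain"
    body = json.dumps({"instances": [{"rev_id": rev_id}]}).encode()
    return url, body

# Hypothetical host and revision, just to show the shapes involved.
url, body = build_explain_request("ml-sandbox.local", "enwiki-goodfaith", 123456)
```

(Actually issuing the request, e.g. with urllib, is where the sandbox CPU limits mentioned at 21:02 bite: the lime explainer perturbs the input many times per call, so it is far heavier than a plain `:predict`.)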