[07:35:16] <elukey>	 morning!
[08:03:12] <isaranto>	  /o
[09:08:45] <elukey>	 isaranto: o/ if you have time I think that https://phabricator.wikimedia.org/T322006 is a good start, so you can familiarize yourself with blubber and CI
[09:09:04] <elukey>	 Kevin is already busy with AddALink and Ores so it should be fine to pick it up!
[09:11:09] <isaranto>	 Sure! I’m on it
[09:11:39] <wikibugs>	 Machine-Learning-Team: Add new syntax directive to blubber.yaml files to enable users to directly use docker build with blubber.yaml. - https://phabricator.wikimedia.org/T322006 (isarantopoulos) a:kevinbazira→isarantopoulos
[09:28:53] <isaranto>	 elukey: After I test that the images are built locally is there a way to also test CI (image generation) before I merge a patch? Also, where can I find the commands that run in CI?
[09:30:51] <elukey>	 isaranto: CI should basically run blubber as you run it locally, so if the image generation is fine we should be good. I am not 100% sure where the config for the command to run is defined, I'd start from the jobs in jenkins.wikimedia.org
[09:32:00] <elukey>	 need to run an unexpected errand, bbiab folks
[09:36:12] <isaranto>	 Thanks!
[10:29:55] <wikibugs>	 Machine-Learning-Team, Data-Engineering-Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score-<model> - https://phabricator.wikimedia.org/T317768 (EChetty)
[11:04:09] <isaranto>	 Hey, I have the following problem: when building a docker image on my M1 MacBook I get the error ` # qemu: uncaught target signal 11 (Segmentation fault) - core dumped` and the build takes forever. Has anyone encountered it?
[11:24:19] <aiko>	 Seems like it's a problem with M1 (arm64) chips. I use an M1 chip too, but I haven't tried the new syntax directive to build with the blubber file. Maybe you can test it in ml-sandbox instead?
[11:26:27] <elukey>	 isaranto: didn't see it yet, I was about to ask - are you on mac + m1 ?
[11:35:32] <wikibugs>	 (PS1) Elukey: Improve fetch_features handling in revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[11:36:43] <wikibugs>	 (CR) Elukey: "Still need to test all model servers, not ready for a full review yet :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[11:36:51] * elukey lunch!
[11:53:30] <wikibugs>	 Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (akosiaris) @LSobanski, @elukey, I am gonna remove #serviceops, I don't see aside from some best practices review what we can do more about this.
[12:27:47] <isaranto>	 will try it on ml sandbox. One of the reasons I never switched to M1 until now :)
[12:28:07] <isaranto>	 Am I the only one on the team with M1 Mac?
[12:39:40] <klausman>	 No, I think Aiko also has one
[12:59:57] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou)
[13:00:37] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou)
[13:04:25] <aiko>	 I have one :)
[14:00:17] <wikibugs>	 (PS1) AikoChou: revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023)
[14:16:58] <chrisalbon>	 Morning all!
[14:26:39] <elukey>	 o/
[14:28:35] <wikibugs>	 (CR) Elukey: [C: +1] revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[14:43:10] <isaranto>	 Morning Chris!
[14:44:51] <isaranto>	 Do I need some other permission to run stuff on ml sandbox? Seems like I don’t have any space left under /home/isaranto and I can't run any minikube or docker commands
[14:51:54] <elukey>	 isaranto: nono you are admin, but the /srv partition is full (your home is under it)
[14:52:15] <elukey>	 the sandbox is self-managed and doesn't use puppet, it is still a little brittle as a testing env
[14:52:18] <elukey>	 lemme see
[14:53:18] <elukey>	 49G	docker
[14:53:18] <elukey>	 12G	home
[14:53:18] <elukey>	 16K	lost+found
[14:54:35] <elukey>	 so docker image ls shows some old stuff
[14:55:10] <elukey>	 aiko: can some of the image-content-filtration images be deleted?
[14:57:14] <elukey>	 we probably need a bigger vm or a sandbox 2
[14:58:16] <elukey>	 aiko: I removed some old images of yours
[14:58:26] <elukey>	 isaranto: you should be able to execute commands now, can you try?
[15:02:00] <isaranto>	 elukey: Great! I have a problem with docker  `Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/json": dial unix /var/run/docker.sock: connect: permission denied`
[15:02:21] <elukey>	 isaranto: did you try with sudo? 
[15:03:02] <isaranto>	 sudo works :) . Thanks again
[15:03:03] <elukey>	 weird though
[15:03:09] <elukey>	 srw-rw---- 1 root docker 0 Apr  4  2022 /var/run/docker.sock
[15:03:17] <elukey>	 and you are in the docker group
[15:03:25] <elukey>	 it works for me without sudo
[15:03:27] <elukey>	 mmmm
[15:03:35] <elukey>	 what command did you run?
[15:08:08] <isaranto>	 I wasn’t in the docker group but I added myself (perhaps it requires a restart?). I just ran docker and docker image ls
[15:08:40] <isaranto>	 I added myself via `sudo usermod -aG docker isaranto`
[15:08:58] <elukey>	 maybe try to log out and ssh back in
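Why logging out and back in helps: `usermod -aG` updates the group database, but a session that is already running keeps the supplementary groups it was given at login. A minimal sketch of that difference, assuming a Unix host; the helper name `group_status` is hypothetical and is not part of docker or the sandbox tooling:

```python
import grp
import os


def group_status(group_name, user_name):
    """Compare the group database with the current process's groups.

    Returns None if the group does not exist, otherwise a dict with:
      in_database: user_name is listed as a member in the group database
      in_session:  the running process actually carries the group's gid
    """
    try:
        entry = grp.getgrnam(group_name)
    except KeyError:
        return None  # no such group on this host
    session_gids = set(os.getgroups()) | {os.getegid()}
    return {
        "in_database": user_name in entry.gr_mem,
        "in_session": entry.gr_gid in session_gids,
    }


if __name__ == "__main__":
    # "docker" and "isaranto" are the names from the conversation above.
    print(group_status("docker", "isaranto"))
```

If `in_database` is true while `in_session` is false, the membership was added after the session started, which would match the permission error on `/var/run/docker.sock` seen here.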
[15:11:20] <wikibugs>	 Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Observability-Logging, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (JArguello-WMF)
[15:14:58] <wikibugs>	 (CR) AikoChou: [C: +2] "Thanks for the review! I accidentally removed your +1 😂" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[15:19:00] <isaranto>	 Works now :D
[15:19:45] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Research: Upload new outlinks topic model to LiftWing - https://phabricator.wikimedia.org/T322881 (Isaac) > I plan to deploy this new model to Lift Wing along with a new docker image that contains some logging changes (https://gerrit.wikimedia.org/r/c/machinelearning...
[15:22:31] <wikibugs>	 (Merged) jenkins-bot: revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[15:25:45] <wikibugs>	 Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (Isaac) > I tested outlink with benthos for around 9 hours the other day (here is the grafana metrics), I observed it returned ~1800 Bad Requests error with "No matching article or the...
[15:39:52] <wikibugs>	 Machine-Learning-Team, Gerrit, Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (hashar) I ran Tyler's script from `/home/thcipriani/elapsed_gc_time.py` ` gerrit1001:~$ python /home/thcipriani/elapsed_gc_time.py|...
[15:44:25] <wikibugs>	 Machine-Learning-Team, Gerrit, Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (thcipriani) >>! In T237807#8392885, @hashar wrote: > I ran Tyler's script from `/home/thcipriani/elapsed_gc_time.py` > ` > gerrit10...
[15:47:20] <elukey>	 klausman: quick review :)
[15:47:24] <elukey>	 I'd like to drop
[15:47:25] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/533/
[15:47:33] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/534/
[15:47:56] <klausman>	 Looking
[15:49:13] <klausman>	 We're not using those?
[15:49:43] <elukey>	 yeah sorry those are staging ones
[15:50:00] <elukey>	 gimme a sec, netbox is not helping
[15:50:38] <elukey>	 so
[15:50:39] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/383/
[15:51:33] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/382/
[15:51:38] <elukey>	 these should be the old codfw ones
[15:52:20] <elukey>	 and then
[15:52:21] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/380/
[15:52:27] <elukey>	 https://netbox.wikimedia.org/ipam/prefixes/381/
[15:52:55] <elukey>	 Janis sent a code change earlier on and there were the old cidrs in there
[15:53:02] <elukey>	 it is confusing to keep both on netbox
[15:53:35] <klausman>	 LGTM deleting 383 and 382
[15:53:52] <klausman>	 380 and 381 would also cover non-ML allocations, so not 100% sure.
[15:54:36] <elukey>	 what do you mean?
[15:57:53] <elukey>	 klausman: --^
[16:00:12] <klausman>	 Sorry my bad, I misread the parent prefix description. 380/381 are fine, too
[16:00:39] <elukey>	 ack thanks
[16:02:43] <elukey>	 done
[16:43:13] <elukey>	 aiko: I am almost sure that the fetch_features code is mostly cpu bound, with the process pool it goes so much better
[16:43:46] <elukey>	 the main sad point is that the revscoring cache is not totally pickle-able, so if we use multi-process we have to skip it
[16:43:51] <elukey>	 but it is not a big deal 
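The trade-off described above can be sketched in a few lines: CPU-bound work moves to a process pool, and because arguments sent to worker processes must be pickled, a cache that is not picklable is simply left out of the multi-process path. This is only an illustration under stated assumptions, not the actual inference-services code: `FeatureCache`, `extract_features` and this `fetch_features` signature are hypothetical stand-ins, and a `threading.Lock` is used merely as a convenient example of an unpicklable attribute.

```python
import pickle
import threading
from concurrent.futures import ProcessPoolExecutor


class FeatureCache:
    """Hypothetical cache; the lock makes instances unpicklable."""

    def __init__(self):
        self.values = {}
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if obj can be sent to a worker process via pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


def extract_features(rev_id, cache=None):
    """Hypothetical CPU-bound step, with an optional in-process cache."""
    if cache is not None and rev_id in cache.values:
        return cache.values[rev_id]
    result = sum(i * i for i in range(rev_id % 1000))
    if cache is not None:
        with cache._lock:
            cache.values[rev_id] = result
    return result


def fetch_features(rev_ids, cache=None, pool=None):
    """Use the cache in-process; skip it when a process pool is given."""
    if pool is None:
        return [extract_features(rev_id, cache) for rev_id in rev_ids]
    # The cache is not passed along: it could not be pickled anyway.
    return list(pool.map(extract_features, rev_ids))


if __name__ == "__main__":
    print(is_picklable(FeatureCache()))
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(fetch_features([10, 20, 30], pool=pool))
```

Keeping the pool optional mirrors the idea of enabling multi-process support per model server through a variable, with the single-process path still able to use the cache.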
[17:42:15] <wikibugs>	 (PS2) Elukey: [WIP] Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[17:44:07] <elukey>	 aiko: not completely tested but --^ should help a lot 
[17:44:25] <elukey>	 after your last changes for articlequality the model-servers are basically the same
[17:44:29] <elukey>	 except some minor bits
[17:44:48] <elukey>	 in this way we have a single class to modify from now on
[17:48:13] <elukey>	 I'll finish it tomorrow, but I'll be able to add the Multi Process support to all the model servers in one go (enabled or not via variable of course)
[17:48:24] <elukey>	 let me know folks if you like the idea
[18:01:41] <elukey>	 have a good rest of the day folks :)
[18:01:44] * elukey afk
[18:07:54] <klausman>	 \o