[07:35:16] morning!
[08:03:12] /o
[09:08:45] isaranto: o/ if you have time I think that https://phabricator.wikimedia.org/T322006 is a good start, so you can familiarize yourself with blubber and CI
[09:09:04] Kevin is already busy with AddALink and Ores so it should be fine to pick it up!
[09:11:09] Sure! I’m on it
[09:11:39] Machine-Learning-Team: Add new syntax directive to blubber.yaml files to enable users to directly use docker build with blubber.yaml. - https://phabricator.wikimedia.org/T322006 (isarantopoulos) a:kevinbazira→isarantopoulos
[09:28:53] elukey: After I test that the images are built locally is there a way to also test CI (image generation) before I merge a patch? Also, where can I find the commands that run in CI?
[09:30:51] isaranto: CI should basically run blubber as you run it locally, so if the image generation is fine we should be good. I am not 100% sure where the config for the command to run is defined, I'd start from the jobs in jenkins.wikimedia.org
[09:32:00] need to run an unexpected errand, bbiab folks
[09:36:12] Thanks!
[10:29:55] Machine-Learning-Team, Data-Engineering-Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (EChetty)
[11:04:09] Hey I have the following problem: building docker image on my M1 macbook and I get ` # qemu: uncaught target signal 11 (Segmentation fault) - core dumped` error and it takes forever to build it. Anyone encountered it?
[11:24:19] seems like it's a problem with M1 (arm64) chips. I use M1 chip too but I haven't tried the new syntax directive to build a blubberfile. Maybe you can test it in ml-sandbox instead?
[11:26:27] isaranto: didn't see it yet, I was about to ask - are you on mac + m1 ?
[11:35:32] (PS1) Elukey: Improve fetch_features handling in revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[11:36:43] (CR) Elukey: "Still need to test all model servers, not ready for a full review yet :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[11:36:51] * elukey lunch!
[11:53:30] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (akosiaris) @LSobanski, @elukey, I am gonna remove #serviceops, I don't see aside from some best practices review what we can do more about this.
[12:27:47] will try it on ml sandbox. One of the reasons I never switched to M1 until now :)
[12:28:07] Am I the only one on the team with M1 Mac?
[12:39:40] No, I think Aiko also has one
[12:59:57] Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou)
[13:00:37] Lift-Wing, Machine-Learning-Team: Test batch prediction for revert-risk model - https://phabricator.wikimedia.org/T323023 (achou)
[13:04:25] I have one :)
[14:00:17] (PS1) AikoChou: revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023)
[14:16:58] Morning all!
[14:26:39] o/
[14:28:35] (CR) Elukey: [C: +1] revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[14:43:10] Morning Chris!
[14:44:51] Do I need some other permission to run stuff on ml sandbox?
Seems like I don’t have any space left under /home/isaranto and I can’t run any minikube or docker commands
[14:51:54] isaranto: nono you are admin, but the /srv partition is full (your home is under it)
[14:52:15] the sandbox is self-managed and doesn't use puppet, it is still a little brittle as testing env
[14:52:18] lemme see
[14:53:18] 49G docker
[14:53:18] 12G home
[14:53:18] 16K lost+found
[14:54:35] so docker image ls shows some old stuff
[14:55:10] aiko: can some of the image-content-filtration images be deleted?
[14:57:14] we probably need a bigger vm or a sandbox 2
[14:58:16] aiko: I removed some old images of yours
[14:58:26] isaranto: you should be able to execute commands now, can you try?
[15:02:00] elukey: Great! I have a problem with docker `Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/images/json": dial unix /var/run/docker.sock: connect: permission denied`
[15:02:21] isaranto: did you try with sudo?
[15:03:02] sudo works :) . Thanks again
[15:03:03] weird though
[15:03:09] srw-rw---- 1 root docker 0 Apr 4 2022 /var/run/docker.sock
[15:03:17] and you are in the docker group
[15:03:25] it works for me without sudo
[15:03:27] mmmm
[15:03:35] what command did you run?
[15:08:08] I wasn’t in the docker group but I added myself (perhaps it requires a restart?). I just ran docker and docker image ls
[15:08:40] I added myself through `sudo usermod -aG docker isaranto`
[15:08:58] maybe try to log out and ssh back in
[15:11:20] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Observability-Logging, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (JArguello-WMF)
[15:14:58] (CR) AikoChou: [C: +2] "Thanks for the review!
I accidentally removed your +1 😂" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[15:19:00] Works now :D
[15:19:45] Lift-Wing, Machine-Learning-Team, Research: Upload new outlinks topic model to LiftWing - https://phabricator.wikimedia.org/T322881 (Isaac) > I plan to deploy this new model to Lift Wing along with a new docker image that contains some logging changes (https://gerrit.wikimedia.org/r/c/machinelearning...
[15:22:31] (Merged) jenkins-bot: revertrisk: change output and remove HTTPError exception [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856556 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[15:25:45] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (Isaac) > I tested outlink with benthos for around 9 hours the other day (here is the grafana metrics), I observed it returned ~1800 Bad Requests error with "No matching article or the...
[15:39:52] Machine-Learning-Team, Gerrit, Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (hashar) I ran Tyler's script from `/home/thcipriani/elapsed_gc_time.py` ` gerrit1001:~$ python /home/thcipriani/elapsed_gc_time.py|...
[15:44:25] Machine-Learning-Team, Gerrit, Release-Engineering-Team (Seen): gerrit: scoring/ores/editquality takes a long time to git gc - https://phabricator.wikimedia.org/T237807 (thcipriani) >>! In T237807#8392885, @hashar wrote: > I ran Tyler's script from `/home/thcipriani/elapsed_gc_time.py` > ` > gerrit10...
[15:47:20] klausman: quick review :)
[15:47:24] I'd like to drop
[15:47:25] https://netbox.wikimedia.org/ipam/prefixes/533/
[15:47:33] https://netbox.wikimedia.org/ipam/prefixes/534/
[15:47:56] Looking
[15:49:13] We're not using those?
[15:49:43] yeah sorry those are staging ones
[15:50:00] gimme a sec, netbox is not helping
[15:50:38] so
[15:50:39] https://netbox.wikimedia.org/ipam/prefixes/383/
[15:51:33] https://netbox.wikimedia.org/ipam/prefixes/382/
[15:51:38] these should be the old codfw ones
[15:52:20] and then
[15:52:21] https://netbox.wikimedia.org/ipam/prefixes/380/
[15:52:27] https://netbox.wikimedia.org/ipam/prefixes/381/
[15:52:55] Janis sent a code change earlier on and there were the old cidrs in there
[15:53:02] it is confusing to keep both on netbox
[15:53:35] LGTM deleting 383 and 382
[15:53:52] 380 and 381 would also cover non-ML allocations, so not 100% sure.
[15:54:36] what do you mean?
[15:57:53] klausman: --^
[16:00:12] Sorry my bad, I misread the parent prefix description. 380/381 are fine, too
[16:00:39] ack thanks
[16:02:43] done
[16:43:13] aiko: I am almost sure that the fetch_features code is mostly cpu bound, with the process pool it goes so much better
[16:43:46] the main sad point is that the revscoring cache is not totally pickle-able, so if we use multi-process we have to skip it
[16:43:51] but it is not a big deal
[17:42:15] (PS2) Elukey: [WIP] Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[17:44:07] aiko: not completely tested but --^ should help a lot
[17:44:25] after your last changes for articlequality the model-servers are basically the same
[17:44:29] except some minor bits
[17:44:48] in this way we have a single class to modify from now on
[17:48:13] I'll finish it tomorrow, but I'll be able to add the Multi Process support to all the model servers in one go (enabled or not via variable of course)
[17:48:24] let me know folks if you like the idea
[18:01:41] have a good rest of the day folks :)
[18:01:44] * elukey afk
[18:07:54] \o
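
Note on the 16:43 process-pool remark: a rough Python sketch of the idea, not the code from the Gerrit change. Everything here is a hypothetical stand-in: `FeatureCache`, `is_picklable`, the toy `fetch_features` body and `get_features` are invented for illustration, and the real revscoring cache and model-server classes will look different. It only shows why a cache that cannot be pickled has to be skipped when work is handed to another process, and how the pool can stay optional.

```python
import asyncio
import pickle
import threading
from concurrent.futures import Executor
from typing import Dict, List, Optional


class FeatureCache:
    """Hypothetical cache. The lock makes instances unpicklable,
    which mimics the problem mentioned in the chat."""

    def __init__(self) -> None:
        self.data: Dict[int, List[int]] = {}
        self._lock = threading.Lock()


def is_picklable(obj: object) -> bool:
    """Return True if obj could be sent to a worker process."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def fetch_features(rev_id: int, cache: Optional[FeatureCache] = None) -> List[int]:
    """Toy CPU-bound stand-in for feature extraction."""
    if cache is not None and rev_id in cache.data:
        return cache.data[rev_id]
    features = [sum(i * i for i in range(rev_id % 1000)), rev_id % 7]
    if cache is not None:
        with cache._lock:
            cache.data[rev_id] = features
    return features


async def get_features(
    rev_id: int, cache: FeatureCache, pool: Optional[Executor] = None
) -> List[int]:
    """Run inline with the cache, or in the pool without it."""
    if pool is None:
        # Single-process path: the cache can be used directly.
        return fetch_features(rev_id, cache)
    loop = asyncio.get_running_loop()
    # Arguments are pickled when they cross a process boundary,
    # so the cache is deliberately not passed to the worker.
    return await loop.run_in_executor(pool, fetch_features, rev_id)
```

A `concurrent.futures.ProcessPoolExecutor` created once at startup could be passed as `pool` when the multi-process variable is set; leaving `pool` as `None` keeps the cached single-process behaviour.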