[06:56:42] o/ I refactored the revscoring services to be able to run them locally and will now fix the multiprocessing code
[06:56:59] Will be afk for the next ~1h
[08:14:10] isaranto: cool!! I'll have a look and try it locally
[08:16:01] Irrelevant to this patch but: I had issues installing dependencies on a clean virtual env after Mac update.
[08:16:05] Just fyi
[08:22:14] Lift-Wing, Machine-Learning-Team, I18n, NewFunctionality-Worktype, Patch-For-Review: Create a language detection service in LiftWing - https://phabricator.wikimedia.org/T340507 (santhosh) Hi @isarantopoulos I drafted the model card here: https://meta.wikimedia.org/wiki/Machine_learning_model...
[08:26:25] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144 (Trizek-WMF)
[08:28:34] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144 (Trizek-WMF) @kevinbazira, I just learned that English Wikipedia has a script to track and remove what is considered overlinking. This...
[08:50:57] isaranto: ack!
[09:59:29] Decomming ores1001 as the last of the worker nodes right now, will then proceed with orespoolcounter VMs
[10:00:10] Machine-Learning-Team, decommission-hardware, Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (klausman)
[10:09:22] Machine-Learning-Team, decommission-hardware, Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores1001.eqiad.wmnet` - ores1001.eq...
[10:09:56] Decommissioning the orespoolcounter VMs now.
[10:23:26] Machine-Learning-Team, decommission-hardware, Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `orespoolcounter[2003-2004].codfw.wm...
[10:45:08] * klausman lunch
[11:00:35] FYI, I think the decom cookbook didn't successfully run for ores2008, I can still see it in puppetdb, e.g. "sudo cumin ores*" shows it as well
[11:06:11] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (isarantopoulos) @kevinbazira thanks a lot for all the thorough investigation! @Isaac Thanks a lot for your recommendations they are really helpful. In the codebase there is a script...
[11:06:24] * isaranto lunch!
[11:14:03] isaranto: I also got dependency issues on mac, so you managed to run it on stat machine?
[11:15:23] or where did you test the local runs?
[11:15:36] moritzm: yes, I made a mistake running the decom book for 2008. How does one manually remove a host from puppet?
[11:15:57] I managed to remove it from debmonitor, but couldn't figure out puppetdb
[11:18:54] we don't have a separate cookbook for that, but to drop it manually you can run the following on puppetmaster1001:
[11:19:04] sudo puppet node clean ores2008.codfw.wmnet
[11:19:11] sudo puppet node deactivate ores2008.codfw.wmnet
[11:19:28] that'll drop it from puppetdb
[11:19:32] Aiko: I tested docker containers as before. Using an already existing environment also worked though. The issue was when creating a new one. Will investigate when I have some time
[11:19:37] moritzm: ty! done that now
[11:20:56] ack, all of ORES is gone now \o/
[11:21:22] There's still puppet stuff, but I'm working on it. Except pcc complains about running out of disk space %-)
[11:27:02] isaranto: I see, thanks. dependency issues are really annoying
[12:03:37] Good
[12:03:42] Morning
[12:09:28] Sometimes those two are indeed entirely separate :)
[12:28:50] lol
[12:28:59] I’m drinking coffee
[12:32:31] Haha
[12:32:43] Good morning then!
[12:33:59] morning o/
[13:15:52] I removed the memory alert (to add it in a separate patch) so we can release our first alert sooner https://gerrit.wikimedia.org/r/c/operations/alerts/+/962056
[13:45:41] Machine-Learning-Team, Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (CodeReviewBot) mfossati updated https://gitlab.wikimedia.org/repos/data-engineering/ai...
[14:14:25] klausman: sry +2 on the alerts. I reset my vote as I saw that you added Holan as reviewer
[14:15:20] I set him on CC, but that was not the intent :D
[14:15:36] I'm having issues with the alert in this patch https://gerrit.wikimedia.org/r/c/operations/alerts/+/963724
[14:15:43] could you take a look please?
[14:16:24] on it
[14:16:28] it seems that the alert won't fire (according to CI) even if I explicitly specify the namespace to `="revcoring-editquality-damaging"`
[14:16:41] thank you :)
[14:18:29] That is kinda weird (also it would help to know which l/v pair the check doesn't like)
[14:19:31] Try dropping the prometheus= label. That is only added on Thanos
[14:20:59] ofc that's it!
[14:21:07] thanks Tobias!
[14:21:12] np :)
[14:21:23] the test that was failing noted in the docstrings "Ensure non-global alerts don't reference external labels.
[14:21:23] In this case the alert will never fire because external labels don't show up
[14:21:23] when evaluating non-global (i.e Prometheus, not Thanos) alerts."
[14:21:48] Would be nice if the error/assert said that Thanos bit, too
[14:22:29] is there a way to specify the cluster though? or do I set it on top where it says that deploy bit?
[14:23:23] deploy bit?
[14:24:11] on the top of each yaml
[14:24:11] ```
[14:24:11] # deploy-tag: ops
[14:24:11] # deploy-site: eqiad, codfw
[14:24:11] ```
[14:25:24] I am unsure, honestly. I think those tags are about how the files in gerrit/git go to prod
[14:25:33] Not what service the alert is about.
[14:32:55] ack, thanks!
[14:33:16] I just forgot, will go read docs to remember
[14:41:46] ok, I found it! that was it. the deploy-tag is the prometheus instance we want to deploy to, so in our case it would be `k8s-mlserve`. and the deploy site is a bit more self-explanatory (if we want it both on eqiad and codfw)
[14:43:15] ack, thanks for digging it up.
[14:44:52] Documentation on alerts/alertmanager in Wikitech is great!
[14:46:36] I updated both patches and they are good to go for first round of reviews
[15:00:24] (PS5) Ilias Sarantopoulos: revscoring: allow local runs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404)
[15:01:04] ruff "ate" my __init__.py so I fixed it and updated the patch
[15:06:56] (CR) CI reject: [V: -1] revscoring: allow local runs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404) (owner: Ilias Sarantopoulos)
[15:14:17] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404) (owner: Ilias Sarantopoulos)
[15:23:40] (PS1) Ilias Sarantopoulos: revscoring: fix mp [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963754
[15:24:54] getting some weird error for multiprocessing, will continue tomorrow
[15:27:44] Machine-Learning-Team: [revscoring] Multiprocessing code - https://phabricator.wikimedia.org/T348265 (isarantopoulos)
[15:27:54] Machine-Learning-Team: [revscoring] Fix Multiprocessing code - https://phabricator.wikimedia.org/T348265 (isarantopoulos) a:isarantopoulos
[15:43:24] night all!
[15:44:31] Machine-Learning-Team, Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (diego) >>! In T347136#9225891, @prabhat wrote: > In the last 50 hours, we haven't seen any "Unsupported lang" issue. > Thanks for fixing this. Great! Thanks to all the ML team, specially to...
[16:11:43] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144 (kevinbazira) @Trizek-WMF, thank you so much for sharing this script that helps to curb overlinking. I am looping in @MGerlach, since...
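
Note on the 14:16 to 14:21 exchange: the failing CI test was described as ensuring that non-global alerts don't reference external labels such as `prometheus=`. The following is only a rough sketch of that kind of check, not the actual test in operations/alerts. The function name, the `EXTERNAL_LABELS` set (in particular `site`), and the metric name in the usage example are assumptions made for illustration, and the regex-based parsing is far simpler than a real PromQL parser.

```python
import re

# Assumed set of external labels. `prometheus` is the one named in the chat;
# `site` is a guess added purely for illustration.
EXTERNAL_LABELS = {"prometheus", "site"}

# A label name followed by a PromQL matcher operator and an opening quote.
_MATCHER = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)\s*(=~|!~|!=|=)\s*"')


def external_labels_in(expr):
    """Return the external labels used as matchers inside {...} selectors."""
    found = set()
    # Only inspect the contents of {...} so metric names are never treated as labels.
    for selector in re.findall(r"\{([^}]*)\}", expr):
        for name, _op in _MATCHER.findall(selector):
            if name in EXTERNAL_LABELS:
                found.add(name)
    return found
```

For example, calling `external_labels_in('some_metric{namespace="revscoring-editquality-damaging", prometheus="k8s-mlserve"}')` would flag `prometheus`, while the same expression without that matcher would return an empty set, which matches the fix suggested at 14:19:31.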