[05:48:55] (CR) AikoChou: "Kevin, thanks for working on this! I have some questions/comments:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/970831 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:53:30] hi, as mentioned in the SRE meeting yesterday, we're moving systems over to the puppetserver 7 setup, for ML I'll start with the staging cluster (master/worker/etcd)
[08:54:00] don't expect any issues, the aux and dse clusters are already on Puppet 7, as is the wikikube_staging cluster
[09:08:45] moritzm: hi! We can take care of it if you want
[09:08:50] so you can focus on other tasks
[09:09:42] klausman: o/ can you sync with Moritz about the upgrade?
[09:09:51] will do
[09:09:59] more info https://phabricator.wikimedia.org/T349619
[09:10:08] fine either way, I can also do the migration, then you folks check whether everything works as expected?
[09:10:12] It was already mentioned in the SRE meeting yesterday, so I was wondering when he'd poke us :)
[09:10:42] moritzm: nono we can do it, thanks for the ping
[09:10:53] excellent, thanks :-)
[09:29:09] starting with ml-staging-etcd2003.codfw.wmnet to get a feel for the cookbook
[09:35:18] klausman: is there a puppet change to make first?
[09:35:29] No, it's in the middle of the cookbook
[09:35:42] I was about to send the review to you :)
[09:36:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/974115
[09:36:58] can you run pcc?
[09:37:01] https://phabricator.wikimedia.org/P53381 cookbook execution looks like this
[09:37:25] okok
[09:37:28] go ahead then
[09:38:02] running pcc atm, but I _think_ that still always uses P5
[09:38:29] fine to proceed
[09:46:45] moritzm: I saw one error in the middle of the execution for some homedirs, but the cookbook exited with success status, so I presume it's not a problem/to be expected?
[09:47:12] https://phabricator.wikimedia.org/P53382
[09:48:24] yeah, that's known and harmless, let me find the task
[09:48:39] https://phabricator.wikimedia.org/T350809
[09:49:29] ack, thanks.
[09:58:52] elukey: I forced an all-services check on icinga, and ran p-a on etcd2003, AM is also all green. Proceeding with the rest of the etcds.
[09:59:13] (in staging, that is)
[09:59:14] ack
[10:15:26] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (klausman) Luca has raised a few questions that may reveal relevant information: Where is the code running from? How many requests are issued? and h...
[10:26:17] doing the masters now (again, first one host, then the role)
[10:27:05] elukey: should I let you review the change again or just proceed?
[10:27:23] go ahead :)
[10:28:34] ack
[10:59:31] moritzm: I presume "Warning: The current total number of facts: 3883 exceeds the number of facts limit: 2048" is also ignorable?
[11:01:58] I still have to open a task about this, it's harmless (just a warning), we still need to decide what to do about it:
[11:02:13] there is an option to bump the warning threshold
[11:02:28] but this could hide legit problems
[11:02:46] and in general excessive facts tend to cause slowdowns in puppetdb
[11:03:21] so one option might be to set the threshold for select corner cases
[11:09:32] ack, thx
[11:19:08] Ok, ml-staging all puppet-v7-ized. I'll let that soak overnight and will do the prod clusters starting tomorrow. I may do ml-cache this afternoon, since that is currently not seeing any prod traffic.
[11:19:14] * klausman lunch
[11:25:58] sounds good
[11:30:05] * elukey lunch
[12:09:49] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 2 (Growth Team)), Serbian-Sites, and 3 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections entr...
[12:33:32] Machine-Learning-Team, Add-Link, Growth-Team, Turkish-Sites, User-notice: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (Sgs) I ran this script for adding the link-recommendation task type and populating the excluded sections entries: `lang=bash PHAB=T...
[12:46:49] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 2 (Growth Team)), Patch-For-Review, and 4 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (Sgs)
[12:48:55] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 2 (Growth Team)), Patch-For-Review, and 2 others: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (Sgs) a:Sgs
[13:56:33] ml-cache now also on v7 and host overrides cleaned up.
[14:01:34] Hello
[14:03:05] Morning Chris
[14:11:21] Little later getting up, my kid was sick last night
[14:38:51] morning!
[14:38:58] morning!
[14:39:05] I tried to wake up at 4, but I couldn't. So I ended up waking up at 6 :D
[14:39:28] ha
[14:39:35] You definitely don't need to wake up at 4
[14:46:12] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Patch-For-Review, and 2 others: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (KStoller-WMF)
[14:46:25] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (KStoller-WMF)
[14:46:29] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Patch-For-Review, and 4 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (KStoller-WMF)
[14:54:06] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (mpopov)
[15:17:21] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (mpopov) >>! In T351114#9329734, @klausman wrote: > Luca has raised a few questions that may reveal relevant information: > > Where is the code runn...
[15:29:29] Machine-Learning-Team: Fix the Lift Wing documentation about how to decode the ACCESS TOKEN - https://phabricator.wikimedia.org/T350762 (klausman) a:klausman
[15:35:52] Machine-Learning-Team, Growth-Team, GrowthExperiments: importOresTopics script fails to import topics - https://phabricator.wikimedia.org/T350137 (klausman) a:klausman
[15:40:11] Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring: Add language support for Malay language (ms) - https://phabricator.wikimedia.org/T349968 (klausman) a:calbon @calbon Can you weigh in on this? AIUI, this would be nontrivial update to Revscoring.
[15:56:21] Machine-Learning-Team: Apply multi-processing to preprocess() in isvcs that suffer from high latency - https://phabricator.wikimedia.org/T349274 (klausman) a:elukey
[15:58:35] Machine-Learning-Team, Project-Admins: Create three Phab Projects for Machine Learning: Lift Wing, Pilot Flag, Test Grounds - https://phabricator.wikimedia.org/T264774 (klausman) Resolved→Open a:Aklapper→calbon
[16:01:37] Machine-Learning-Team, Goal: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (kevinbazira) 1.Isaac from the research team tested the deployed rec-api and shared 2 edge cases: - the rec-api wasn't returning results besides 'spec' param, we investigated t...
[16:46:55] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (Sgs)
[16:47:23] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Patch-For-Review, and 4 others: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (Sgs)
[16:47:37] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 3 (Growth Team)), Patch-For-Review, and 2 others: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (Sgs)
[17:03:21] Machine-Learning-Team, ORES: Add deprecation warnings to ORES-related repositories on Github - https://phabricator.wikimedia.org/T349632 (achou) As far as I know, these are also ORES-related repositories: https://github.com/wikimedia/editquality https://github.com/wikimedia/draftquality https://github.co...
[17:09:30] * elukey afk!
[17:09:54] Machine-Learning-Team, Patch-For-Review: Upgrade Revert Risk Language-agnostic docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347550 (achou)
[17:09:56] Machine-Learning-Team: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (achou)
[17:10:35] Machine-Learning-Team: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (achou)
[17:10:37] Machine-Learning-Team, Research: Upgrade xgboost in knowledge_integrity - https://phabricator.wikimedia.org/T350389 (achou)
[17:11:42] Machine-Learning-Team: Upgrade Revert Risk Multilingual docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347551 (achou)
[17:11:44] Machine-Learning-Team: Increased latencies with Kserve 0.11.1 (cgroups v2) - https://phabricator.wikimedia.org/T349844 (achou)
[17:14:43] Lift-Wing, Machine-Learning-Team: Multilingual model fails from python - https://phabricator.wikimedia.org/T351022 (achou) a:achou
[17:21:39] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (Iflorez)
[17:22:24] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (Iflorez)
[17:50:00] * klausman heading out as well
[17:50:00] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (achou) a:achou
[17:51:32] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (achou) p:Triage→Medium
[17:51:40] Machine-Learning-Team, Product-Analytics, User-Iflorez: Transient error while running lift wing topic model - https://phabricator.wikimedia.org/T351114 (Iflorez) @mpopov thank you for your suggestions, I'm mulling them over and considering integrating with the below. Fabian K recommends using a UDF...
[18:26:49] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (achou) Hi @Strainu, Revertrisk models require a revision to have a valid parent revision in order to measure the difference in quality (see the model car...
[18:48:21] Lift-Wing, Machine-Learning-Team: Multilingual model fails from python - https://phabricator.wikimedia.org/T351022 (achou) p:Triage→Medium
[18:49:04] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (achou)
[18:49:32] Lift-Wing, Machine-Learning-Team: Multilingual model fails from python - https://phabricator.wikimedia.org/T351022 (achou)
[19:17:55] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (Strainu) Thank you for pointing that out @achou. Feel free to close the task, but as a note, the wording in the LA page is a bit ambiguous for me - I tho...
[20:02:04] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (calbon) I wonder if we should make this more clear in the error message, otherwise people won't realize they can't check parentless revisions.
[20:26:04] * aiko lunch!
[22:26:41] Lift-Wing, Machine-Learning-Team: Multilingual model fails from python - https://phabricator.wikimedia.org/T351022 (achou) Hi @Strainu, 1. Revision `15891878`: yes, the multilingual model is likely slow in processing this revision due to its size. When using the internal endpoint for the same revision,...
[22:49:50] Lift-Wing, Machine-Learning-Team: Revertrisk models are unable to provide scores for single-revision pages - https://phabricator.wikimedia.org/T351021 (achou) @calbon Yes, I think we should do that. We have received the same report from multiple users, so we should make it more clear. I'll create a new t...
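
Note on T351021: the thread says Revertrisk models need a valid parent revision and that the error message should say so more clearly. A minimal sketch of what such a check could look like follows. It is not the inference-services code. The function names and the message wording are hypothetical, and it assumes a missing parent is represented as `None` or `0`.

```python
from typing import Optional


def has_parent(parent_id: Optional[int]) -> bool:
    """Return True when the revision has a usable parent revision.

    Assumption (not from the log): a missing parent shows up as None or 0.
    """
    return parent_id is not None and parent_id > 0


def parentless_error(rev_id: int, parent_id: Optional[int]) -> Optional[str]:
    """Return an explicit error message for a parentless revision, else None.

    Hypothetical helper: the wording only illustrates the "make it more
    clear" suggestion from the task discussion.
    """
    if has_parent(parent_id):
        return None
    return (
        f"Revision {rev_id} has no parent revision; "
        "revert-risk scores compare a revision against its parent, "
        "so the first revision of a page cannot be scored."
    )
```

A caller could run `parentless_error(rev_id, parent_id)` before scoring and return the message to the user instead of a generic failure.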