[06:05:23] hello folks! I have some errands to do this morning, will join a little later IRC etc..
[07:21:09] Machine-Learning-Team, Add-Link, Growth-Team, incubator.wikimedia.org: Integrate the model training and the deployment of "Add a link" to new Wikipedias exiting the Incubator - https://phabricator.wikimedia.org/T308146 (kevinbazira) If you would like to check wiki models before they are deployed,...
[10:08:25] folks, I sent an email to Eric Evans for the cassandra CR, and IIUC it would be preferred to use the multi-instance setup (in our case, one instance per node)
[10:08:34] so I am going to refactor the code review to reflect it
[10:08:53] another interesting thing is that the AQS cluster is being expanded and "mirrored" to codfw
[10:09:24] it should become multi-tenant for use cases where "generated datasets" are computed in Data Engineering land and loaded into cassandra periodically
[10:09:30] it may be a good use case for the feature store
[10:09:39] but probably not for the score cache/datastore?
[10:09:51] will try to follow up and see how things unfold
[10:38:15] What kind of replication latency would there be for AQS?
[10:42:40] do you mean eqiad <-> codfw?
Should be the same as ml-cache in theory
[10:42:53] we'd have our own keyspace, and cassandra would handle the replication cross-dc
[10:52:54] (everything is a supposition at this point, I'll try to follow up more)
[10:53:13] but in theory the clusters for the score cache/datastore should be good to be created
[10:55:45] Ack
[10:55:56] It was mostly a matter of curiosity, not a technical concern
[11:06:14] ah yes yes please the more we discuss this the better, I don't have a lot of ideas clear yet
[11:12:53] ok so I have allocated the new DNS records for
[11:12:58] ml-cache100[123]-a.eqiad.wmnet
[11:13:07] that will correspond to the single cassandra instance
[11:13:14] I am going to update my code review accordingly
[11:31:49] ack, poke me for re-review
[11:31:59] <- lunch and errands
[11:36:39] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (achou) I read the kserve docs: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference There are two ways to run parallel inference:...
[11:40:44] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (achou) > My only doubt is if our preprocess can become a co-routine, since we use a dedicated library to call the mw api. For example, in the predict code it seems that kserve use...
[12:04:31] Lift-Wing, Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) >>! In T296173#7952700, @achou wrote: > I read the kserve docs: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference > > T...
[12:39:08] ok code review upgraded and followed up with Eric
[12:44:39] Machine-Learning-Team, ORES, Growth-Team, MediaWiki-Watchlist: Remove "Likely to have problems" highlight when an edit is marked as patrolled - https://phabricator.wikimedia.org/T309100 (Lectrician1)
[13:03:32] (CR) Elukey: "To keep archives happy - let's move to 2.11.4, already released in Pypi :) Sorry for the trouble!" [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/791576 (https://phabricator.wikimedia.org/T302851) (owner: AikoChou)
[13:06:23] Machine-Learning-Team: Revscoring library branching proposal - https://phabricator.wikimedia.org/T304063 (elukey) Open→Declined Coming back to this after T303801. We migrated ORES to Debian Buster and Python 3.7, updating wheels and dependencies. The revscoring library was fully compatible with the n...
[13:11:00] I closed --^ since revscoring works like a charm on python37
[13:11:15] but one thing that we may want to do is to bump revscoring on our docker images
[13:11:40] It may be a little painful for deps at first (even if I think it should be doable), but surely good in the long term
[13:18:33] Machine-Learning-Team: Bump revscoring to 2.11.4 on our Docker images for Lift Wing - https://phabricator.wikimedia.org/T309102 (elukey)
[13:18:36] and now I opened --^
[13:18:47] people will hate me I know but I think it is wise to do it
[13:23:09] Nah, you are entirely right
[13:23:35] and now we also know how to release revscoring! :P
[13:23:41] chrisalbon will be so happy
[13:36:25] nobody will hate Luca!!! :)
[13:43:51] morning all
[13:45:47] ...
[13:45:54] (PS3) AikoChou: Update revscoring to 2.11.4 [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/791576 (https://phabricator.wikimedia.org/T302851)
[13:46:32] What is the benefit of the AQS change?
I woke up like 10 minutes ago
[13:48:44] (CR) AikoChou: Update revscoring to 2.11.4 (1 comment) [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/791576 (https://phabricator.wikimedia.org/T302851) (owner: AikoChou)
[13:58:44] chrisalbon: in theory cassandra etc.. managed by another team, that supports multiple use cases (so it takes care of scaling, multi-dc, encryption, reboots, upgrades for us)
[15:08:21] Machine-Learning-Team, Add-Link, Growth-Team, incubator.wikimedia.org: Integrate the model training and the deployment of "Add a link" to new Wikipedias exiting the Incubator - https://phabricator.wikimedia.org/T308146 (kostajh) >>! In T308146#7951993, @kevinbazira wrote: > If you would like to c...
[15:55:17] (CR) Elukey: [C: +2] Update revscoring to 2.11.4 [research/ores/wheels] (python37) - https://gerrit.wikimedia.org/r/791576 (https://phabricator.wikimedia.org/T302851) (owner: AikoChou)
[15:55:51] aiko: just merged your change! So now the next step is to file another one for the ORES deploy repo, updating the git submodule
[15:56:20] what I usually do is to `cd` into the submodule, `git pull`, and then go back to the main dir
[15:59:52] and then `git diff` should show a change in the wheels submodule sha
[16:00:26] then we will be able to cherry pick the code review in deployment-prep's deploy server and deploy it
[16:17:03] elukey: Ok! Do I need to update the revscoring version in frozen-requirements.txt?
[16:36:53] aiko: yeah let's do it!
[16:39:01] ORES, Machine-Learning-Team (Active Tasks), Patch-For-Review: revscoring feature extraction error for wikitext papes in Wikidata - https://phabricator.wikimedia.org/T302851 (elukey) Next steps: 1) Create a change to the ores-deploy repo to bump the wheels submodule 2) Cherry pick the change in `depl...
[16:43:03] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Unable to run helmfile and check pods - https://phabricator.wikimedia.org/T307927 (elukey)
[16:47:22] Machine-Learning-Team, Analytics-Radar, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (elukey) Time flies and both ROCm and tensorflow-io got several releases. https://github.com/tensorflow/io/releases/tag/v0.23.0 is out and contains the pull request that I made f...
[16:50:03] Machine-Learning-Team, ORES, Infrastructure-Foundations, Puppet: Restructure ORES labs redis puppet role - https://phabricator.wikimedia.org/T281495 (elukey) Open→Resolved a:elukey This has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/785111 in theory, closing...
[16:51:12] going afk, have a nice one team
[17:47:29] (PS1) AikoChou: Update wheels submodule with latest changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/798894
[18:21:04] ORES, Machine-Learning-Team (Active Tasks), Patch-For-Review: revscoring feature extraction error for wikitext papes in Wikidata - https://phabricator.wikimedia.org/T302851 (achou) Note for ores-deploy repo: When first time cloning a project with submodules in it, by default you get empty submodule...
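
Editor's note: the submodule bump described at 15:56 to 15:59 (`cd` into the submodule, `git pull`, go back, `git diff`) can be tried end to end in a throwaway sandbox. The sketch below is only an illustration under stated assumptions: the `wheels` and `deploy` directories are local stand-ins created on the spot, not the real Gerrit repositories, and the `submodules/wheels` path, the `wheels.txt` file, the commit messages, and the `g` helper are all hypothetical. The pull names `origin HEAD` explicitly so that it does not depend on branch tracking in the sandbox; the chat uses a plain `git pull`.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox. "wheels" and "deploy" are hypothetical local stand-ins,
# not the real repositories mentioned in the log.
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical helper: run git with a fixed identity so commits work on an
# unconfigured machine. protocol.file.allow lets recent git versions clone a
# submodule from a local path.
g() {
    git -c user.name=demo -c user.email=demo@example.org \
        -c init.defaultBranch=main -c protocol.file.allow=always "$@"
}

# 1. Stand-in for the wheels repo, with one commit.
g init -q wheels
(
    cd wheels
    echo "first" > wheels.txt
    g add wheels.txt
    g commit -q -m "Initial wheels"
)

# 2. Stand-in for the deploy repo, tracking wheels as a submodule.
g init -q deploy
(
    cd deploy
    g submodule --quiet add "$sandbox/wheels" submodules/wheels
    g commit -q -m "Add wheels submodule"
)
old_sha=$(cd deploy && g rev-parse HEAD:submodules/wheels)

# 3. A new change gets merged in wheels.
(
    cd wheels
    echo "second" > wheels.txt
    g commit -q -am "Update wheels"
)

# 4. The steps from the chat: cd into the submodule, pull, go back, diff.
cd deploy
( cd submodules/wheels && g pull -q --ff-only origin HEAD )
g --no-pager diff -- submodules/wheels   # shows the old and new "Subproject commit" sha
new_sha=$(cd submodules/wheels && g rev-parse HEAD)

# 5. Record the new submodule sha in the deploy repo.
g add submodules/wheels
g commit -q -m "Update wheels submodule with latest changes"
recorded=$(g rev-parse HEAD:submodules/wheels)
echo "wheels submodule: $old_sha -> $recorded"
```

As a general git behaviour, separate from the sandbox above: a fresh clone of a repository with submodules leaves the submodule directories empty until `git submodule update --init` is run (or the clone is made with `--recurse-submodules`).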