[07:28:19] good morning!
[09:05:24] FYI, I'll start moving the ml-etcd1* VMs to DRBD disk storage soonish (so that we're able to shuffle them around when the Ganeti servers in eqiad get reimaged to Buster)
[09:05:33] might increase latency slightly
[09:06:13] although I noticed that for codfw one of the nodes was never converted to "plain", but was on DRBD all the time and that doesn't seem to have caused prior issues either
[09:23:07] moritzm: ack!
[09:36:05] chrisalbon, klausman - o/ re: scoring cache - I probably don't have a lot of context, but I would be very reluctant to have a critical piece of the serving infrastructure not handled by us. Of course we trust DE 100%, this is not the point; it is more in SLO terms: I see very different requirements between what DE needs to guarantee and what we will have to guarantee with our serving layer. For training I am 100%
[09:36:11] onboard to integrate with DE as much as possible, and I'll push for it, but it is a completely different use case.
[09:37:14] I am also a little confused about why DE would need to maintain a scoring cache service, since it is something tailored for us (do other services use a score cache other than ores-like systems?)
[09:44:42] --
[09:45:17] about the ores metrics, I found that "sum(increase(ores_response_total[5m])) by (wiki)" returns a nice breakdown of non-200 responses (you can test it via https://thanos.wikimedia.org)
[09:45:26] that doesn't match with the scores errored though
[09:45:42] (also the wiki label is used for http response codes, not great)
[11:45:30] * elukey lunch!
[14:21:21] I am tcpdumping the port on ores1001 where statsd datapoints are sent
[14:21:23] and I see
[14:21:24] ores.ores1001.score_errored.wikidatawiki:1|c
[14:21:24] ores.ores1001.score_errored.wikidatawiki.itemquality:1|c
[14:21:50] (I am grepping for score_errored)
[14:22:03] I am wondering if it is always like that
[14:22:14] if so we could add a label for the offending model
[14:22:54] yeah there are other things like
[14:22:55] ores.ores1001.precache_cache_miss.wikidatawiki:1|c
[14:22:55] ores.ores1001.precache_cache_miss.wikidatawiki.itemquality:1|c
[14:32:55] the ml-etcd change mentioned this morning is now complete
[14:34:00] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey)
[14:45:03] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) I am trying to see if there is a way to have per-model metrics. The ores daemons push statsd metrics to a local endpoint, that in turn aggregates them and exposes them for Prometheus. I tcpdumped the stats...
[14:45:19] elukey that is a good point. I still want to talk to DE and see their thoughts
[14:45:45] o/ yes yes makes sense
[14:45:50] moritzm: thanks!
[15:53:57] o/
[16:15:41] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) @ACraze, thanks for your review of the model repo updates. Can you also look at the patchset linked above in T...
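A quick illustration of the statsd observation at 14:21-14:22 above: the counter names already carry the wiki and, in a second variant, the model, so the model could plausibly be surfaced as a label. The sketch below is a minimal, hypothetical way to watch those counters without tcpdump, by binding the local statsd UDP port directly; the port number (8125) and the exact name layout are assumptions, not taken from the ORES or statsd-exporter configuration.

```python
# Hypothetical sketch (not the ORES code): listen on the local statsd UDP port
# and parse counters like the ones tcpdumped above, e.g.
#   ores.ores1001.score_errored.wikidatawiki:1|c
#   ores.ores1001.score_errored.wikidatawiki.itemquality:1|c
# to check whether a per-model variant is always emitted alongside the
# per-wiki one. Port 8125 (the statsd default) is an assumption here.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 8125))

while True:
    data, _addr = sock.recvfrom(4096)
    for line in data.decode("utf-8", errors="replace").splitlines():
        name, _, _value = line.partition(":")  # drop the "1|c" value part
        parts = name.split(".")
        if len(parts) >= 4 and parts[2] == "score_errored":
            # Expected shape: ores.<host>.score_errored.<wiki>[.<model>]
            wiki = parts[3]
            model = parts[4] if len(parts) > 4 else "(no model segment)"
            print(f"wiki={wiki} model={model}")
```

If the per-model variant turns out to always accompany the per-wiki one, a mapping in the statsd-to-Prometheus bridge could expose `model` alongside the existing `wiki` label.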
[16:24:07] (CR) Accraze: [C: +2] Updated wheels for revscoring-2.11 and python 3.7 [research/ores/wheels] - https://gerrit.wikimedia.org/r/748390 (owner: Halfak)
[16:24:16] (Merged) jenkins-bot: Updated wheels for revscoring-2.11 and python 3.7 [research/ores/wheels] - https://gerrit.wikimedia.org/r/748390 (owner: Halfak)
[16:45:14] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) Fixed and cleaned up old panels in our dashboards.
[17:29:56] Any other Apple M1 users want to help test out the new revscoring docs?
[17:29:58] https://github.com/wikimedia/revscoring/pull/511
[17:31:22] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) >>! In T299137#7619691, @elukey wrote: > The main weirdness that I see is that for score_processed the same value repeats across multiple model types (is it due to something that I don't see?). For score_er...
[17:38:33] accraze: o/
[17:38:51] I was checking https://ores.wikimedia.org/v3/#/ and I noticed that there is an API call to score the same revid for multiple models at the same time
[17:39:07] are we planning to offer it as well on lift wing?
[17:39:26] with the current scheme it might be a little cumbersome to implement
[17:44:21] elukey: good question!
[17:44:59] my guess is not for mvp, but yeah eventually we will want something like that
[17:52:50] how we handle that w/ api gateway etc is still an open question
[17:53:15] seems a little sneaky to implement right now
[17:53:25] in the sense that it may require some hacks
[17:54:03] How is this different than, say, shadow model deployments or A/B testing?
[17:56:44] this would be an endpoint that knows all relevant models for a given wiki and performs inference with all of them on a given revid
[17:57:13] chrisalbon: that's knative/istio handling the routing of a give request to a set of model versions
[17:57:18] *given
[17:57:20] ah got it
[17:57:33] but maybe there is a trick to use
[17:58:08] in some cases i don't know how much this makes sense, why would you want to get articlequality & draftquality on the same revid?
[17:58:21] no idea
[17:58:23] :)
[17:58:50] different subject - how do we want to name the new caching nodes?
[17:58:59] ml-cache100[1-3] ?
[17:59:05] klausman: --^
[17:59:51] mmh. yeah, that works.
[18:00:12] it's a bit generic, but then again, who knows what we'll run exactly on them (in the long term)
[18:01:17] Let's be super vague
[18:01:35] like ml-services100[1-3]
[18:01:39] or something
[18:01:57] node100[1-3]
[18:02:02] lol
[18:02:06] Nobody even knows who owns it!
[18:02:35] These might just host the cache, or the cache and a feature store, or a cache, feature store, and a LabelStudio instance, etc. etc. etc.
[18:03:45] chrisalbon: I can't tell if you are joking or serious :D
[18:04:00] lol, I am serious
[18:04:24] We shouldn't call it ml-cache if in the future it might host more than just the cache
[18:05:18] in this case cache would mean feature cache + score cache (in my mind)
[18:05:27] but more than that probably not
[18:05:39] otherwise the workloads would be too different
[18:05:41] hmm, alright that is fine then
[18:05:55] long live ml-cache
[18:06:36] I mean I am open to any discussion! But having very different services on the same set of nodes is something that I am not super fond of
[18:06:42] what does the team think?
[18:07:01] As I said, I think it's the right level of generic.
[18:07:28] accraze?
[18:07:31] It's more of a "what does it do" rather than "how does it do it".
[18:07:36] i'm down!
[18:07:40] all right
[18:08:02] klausman: if you want to review the specs, the tasks are https://phabricator.wikimedia.org/T297640 and https://phabricator.wikimedia.org/T297638
[18:11:05] LGTM
[18:13:33] perfect, just approved and re-assigned to Rob
[18:13:37] thanks all
[18:13:42] thanks awesome
[18:16:59] going to repost https://www.applyconf.com/
[18:17:04] I'll try to join
[18:17:26] the agenda for the meetup looks interesting
[18:17:44] Using Redis as your Online Feature Store: 2021 highlights & 2022 directions
[18:20:37] going afk for today folks, have a nice evening/day!
[18:20:51] \o
[18:24:46] see ya elukey
[19:02:18] Machine-Learning-Team, observability: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (Aklapper)
[19:53:28] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze) I wrote a guide on developing production images for our inference services: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLear...
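Returning to the multi-model question raised at 17:38 above: the ORES v3 API linked there does allow scoring one revid with several models in a single request. Below is a rough client-side sketch of that call; the exact URL shape and the pipe-separated models parameter are recalled from the v3 docs and should be verified against https://ores.wikimedia.org/v3/#/ before relying on them.

```python
# Rough client-side sketch of the multi-model ORES v3 call discussed at 17:38.
# The URL shape (revid in the path, pipe-separated "models" query parameter)
# is recalled from the v3 docs and may not match the current spec exactly.
import requests

ORES = "https://ores.wikimedia.org/v3/scores"

def score_many(context: str, revid: int, models: list[str]) -> dict:
    """Ask ORES to score one revision with several models in one request."""
    resp = requests.get(
        f"{ORES}/{context}/{revid}",
        params={"models": "|".join(models)},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example: articlequality and draftquality for the same enwiki revision
# (revid 12345678 is a placeholder).
# print(score_many("enwiki", 12345678, ["articlequality", "draftquality"]))
```

On Lift Wing, where each model is deployed as its own inference service behind the API gateway, the equivalent would presumably be a fan-out layer or parallel client calls rather than a single endpoint, which is the "cumbersome to implement" part mentioned in the discussion.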