[07:28:19] good morning!
[09:05:24] FYI, I'll start moving the ml-etcd1* VMs to DRBD disk storage soonish (so that we're able to shuffle them around when the Ganeti servers in eqiad get reimaged to Buster)
[09:05:33] might increase latency slightly
[09:06:13] although I noticed that for codfw one of the nodes was never converted to "plain", but was on DRBD all the time and that doesn't seem to have caused prior issues either
[09:23:07] moritzm: ack!
[09:36:05] chrisalbon, klausman - o/ re: scoring cache - I probably don't have a lot of context, but I would be very reluctant to have a critical piece of the serving infrastructure not handled by us. Of course we trust DE 100%, this is not the point; it is more in SLO terms: I see very different requirements between what DE needs to guarantee and what we will have to guarantee with our serving layer. For training I am 100%
[09:36:11] onboard to integrate with DE as much as possible, and I'll push for it, but it is a completely different use case.
[09:37:14] I am also a little confused about why DE would need to maintain a scoring cache service, since it is something tailored for us (do other services use a score cache other than ores-like systems?)
[09:44:42] --
[09:45:17] about the ores metrics, I found that "sum(increase(ores_response_total[5m])) by (wiki)" returns a nice breakdown of non-200 responses (you can test it via https://thanos.wikimedia.org)
[09:45:26] that doesn't match with the scores errored though
[09:45:42] (also the wiki label is used for http response codes, not great)
[11:45:30] * elukey lunch!
[14:21:21] I am tcpdumping the port on ores1001 where statsd datapoints are sent
[14:21:23] and I see
[14:21:24] ores.ores1001.score_errored.wikidatawiki:1|c
[14:21:24] ores.ores1001.score_errored.wikidatawiki.itemquality:1|c
[14:21:50] (I am grepping for score_errored)
[14:22:03] I am wondering if it is always like that
[14:22:14] if so we could add a label for the offending model
[14:22:54] yeah there are other things like
[14:22:55] ores.ores1001.precache_cache_miss.wikidatawiki:1|c
[14:22:55] ores.ores1001.precache_cache_miss.wikidatawiki.itemquality:1|c
[14:32:55] the ml-etcd change mentioned this morning is now complete
[14:34:00] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey)
[14:45:03] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) I am trying to see if there is a way to have per-model metrics. The ores daemons push statsd metrics to a local endpoint, that in turn aggregates them and exposes them for Prometheus. I tcpdumped the stats...
[14:45:19] elukey that is a good point. I still want to talk to DE and see their thoughts
[14:45:45] o/ yes yes makes sense
[14:45:50] moritzm: thanks!
[15:53:57] o/
[16:15:41] Machine-Learning-Team, artificial-intelligence, Wikilabels, articlequality-modeling: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) @ACraze, thanks for your review of the model repo updates. Can you also look at the patchset linked above in T...
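A quick illustration of the statsd observation at 14:21-14:22 above: the counter names already carry the wiki and, in a second variant, the model, so the model could plausibly be surfaced as a label. The sketch below is a minimal, hypothetical way to watch those counters without tcpdump, by binding the local statsd UDP port directly; the port number (8125) and the exact name layout are assumptions, not taken from the ORES or statsd-exporter configuration.

```python
# Hypothetical sketch (not the ORES code): listen on the local statsd UDP port
# and parse counters like the ones tcpdumped above, e.g.
#   ores.ores1001.score_errored.wikidatawiki:1|c
#   ores.ores1001.score_errored.wikidatawiki.itemquality:1|c
# to check whether a per-model variant is always emitted alongside the
# per-wiki one. Port 8125 (the statsd default) is an assumption here.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 8125))

while True:
    data, _addr = sock.recvfrom(4096)
    for line in data.decode("utf-8", errors="replace").splitlines():
        name, _, _value = line.partition(":")  # drop the "1|c" value part
        parts = name.split(".")
        if len(parts) >= 4 and parts[2] == "score_errored":
            # Expected shape: ores.<host>.score_errored.<wiki>[.<model>]
            wiki = parts[3]
            model = parts[4] if len(parts) > 4 else "(no model segment)"
            print(f"wiki={wiki} model={model}")
```

If the per-model variant turns out to always accompany the per-wiki one, a mapping in the statsd-to-Prometheus bridge could expose `model` alongside the existing `wiki` label.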
[16:24:07] (CR) Accraze: [C: +2] Updated wheels for revscoring-2.11 and python 3.7 [research/ores/wheels] - https://gerrit.wikimedia.org/r/748390 (owner: Halfak)
[16:24:16] (Merged) jenkins-bot: Updated wheels for revscoring-2.11 and python 3.7 [research/ores/wheels] - https://gerrit.wikimedia.org/r/748390 (owner: Halfak)
[16:45:14] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) Fixed and cleaned up old panels in our dashboards.
[17:29:56] Any other Apple M1 users want to help test out the new revscoring docs?
[17:29:58] https://github.com/wikimedia/revscoring/pull/511
[17:31:22] Machine-Learning-Team: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (elukey) >>! In T299137#7619691, @elukey wrote: > The main weirdness that I see is that for score_processed the same value repeats across multiple model types (is it due to something that I don't see?). For score_er...
[17:38:33] accraze: o/
[17:38:51] I was checking https://ores.wikimedia.org/v3/#/ and I noticed that there is an API call to score the same revid for multiple models at the same time
[17:39:07] are we planning to offer it as well on lift wing?
[17:39:26] with the current scheme it might be a little cumbersome to implement
[17:44:21] elukey: good question!
[17:44:59] my guess is not for mvp, but yeah eventually we will want something like that
[17:52:50] how we handle that w/ api gateway etc is still an open question
[17:53:15] seems a little sneaky to implement right now
[17:53:25] in the sense that it may require some hacks
[17:54:03] How is this different than, say, shadow model deployments or A/B testing?
[17:56:44] this would be an endpoint that knows all relevant models for a given wiki and performs inference with all of them on a given revid
[17:57:13] chrisalbon: that's knative/istio handling the routing of a give request to a set of model versions
[17:57:18] *given
[17:57:20] ah got it
[17:57:33] but maybe there is a trick to use
[17:58:08] in some cases i don't know how much this makes sense, why would you want to get articlequality & draftquality on the same revid?
[17:58:21] no idea
[17:58:23] :)
[17:58:50] different subject - how do we want to name the new caching nodes?
[17:58:59] ml-cache100[1-3] ?
[17:59:05] klausman: --^
[17:59:51] mmh. yeah, that works.
[18:00:12] it's a bit generic, but then again, who knows what we'll run exactly on them (in the long term)
[18:01:17] Let's be super vague
[18:01:35] like ml-services100[1-3]
[18:01:39] or something
[18:01:57] node100[1-3]
[18:02:02] lol
[18:02:06] Nobody even knows who owns it!
[18:02:35] These might just host the cache, or the cache and a feature store, or a cache, feature store, and a LabelStudio instance, etc. etc. etc.
[18:03:45] chrisalbon: I can't tell if you are joking or serious :D
[18:04:00] lol, I am serious
[18:04:24] We shouldn't call it ml-cache if in the future it might host more than just the cache
[18:05:18] in this case cache would mean feature cache + score cache (in my mind)
[18:05:27] but more than that probably not
[18:05:39] otherwise the workloads would be too different
[18:05:41] hmm, alright that is fine then
[18:05:55] long live ml-cache
[18:06:36] I mean I am open to any discussion! But having very different services on the same set of nodes is something that I am not super fond of
[18:06:42] what does the team think?
[18:07:01] As I said, I think it's the right level of generic.
[18:07:28] accraze?
[18:07:31] It's more of a "what does it do" rather than "how does it do it".
[18:07:36] i'm down!
[18:07:40] all right
[18:08:02] klausman: if you want to review the specs, the tasks are https://phabricator.wikimedia.org/T297640 and https://phabricator.wikimedia.org/T297638
[18:11:05] LGTM
[18:13:33] perfect, just approved and re-assigned to Rob
[18:13:37] thanks all
[18:13:42] thanks awesome
[18:16:59] going to repost https://www.applyconf.com/
[18:17:04] I'll try to join
[18:17:26] the agenda for the meetup looks interesting
[18:17:44] Using Redis as your Online Feature Store: 2021 highlights & 2022 directions
[18:20:37] going afk for today folks, have a nice evening/day!
[18:20:51] \o
[18:24:46] see ya elukey
[19:02:18] Machine-Learning-Team, observability: Improve ORES observability - https://phabricator.wikimedia.org/T299137 (Aklapper)
[19:53:28] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (ACraze) I wrote a guide on developing production images for our inference services: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLear...
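Returning to the multi-model question raised at 17:38 above: the ORES v3 API linked there does allow scoring one revid with several models in a single request. Below is a rough client-side sketch of that call; the exact URL shape and the pipe-separated models parameter are recalled from the v3 docs and should be verified against https://ores.wikimedia.org/v3/#/ before relying on them.

```python
# Rough client-side sketch of the multi-model ORES v3 call discussed at 17:38.
# The URL shape (revid in the path, pipe-separated "models" query parameter)
# is recalled from the v3 docs and may not match the current spec exactly.
import requests

ORES = "https://ores.wikimedia.org/v3/scores"

def score_many(context: str, revid: int, models: list[str]) -> dict:
    """Ask ORES to score one revision with several models in one request."""
    resp = requests.get(
        f"{ORES}/{context}/{revid}",
        params={"models": "|".join(models)},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example: articlequality and draftquality for the same enwiki revision
# (revid 12345678 is a placeholder).
# print(score_many("enwiki", 12345678, ["articlequality", "draftquality"]))
```

On Lift Wing, where each model is deployed as its own inference service behind the API gateway, the equivalent would presumably be a fan-out layer or parallel client calls rather than a single endpoint, which is the "cumbersome to implement" part mentioned in the discussion.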