[09:18:40] errand
[13:12:30] Hey! I have a doc with the goals for Q2. Let me know if there is something missing: https://docs.google.com/document/d/1A9ZwQI3mFPJyzR3GQUlYPo0eYbpqzJ3dTGpqD9FcLG0/edit
[13:15:10] Trey314159: in particular, if you could review the descriptions for the "language stuff" section. Maybe we want to split it into 2 different sections for Kuromoji vs keyboard detection
[13:28:09] o/
[13:30:48] o/
[13:38:54] pfischer: regarding T373459, would producing two sets of metrics be what we want? one set as early as possible in the producer and another set as close as possible to the end (ideally right after writing to elastic)
[13:38:55] T373459: SUP: set up alerting for page_change_weighted_tags ingestion - https://phabricator.wikimedia.org/T373459
[13:39:05] then set up some alerting comparing the two?
[13:46:08] dcausse: looking
[13:49:18] dcausse: Hm, I was hoping that we could work with the delta between now and `meta.dt`, where now is - as you suggested - right after we wrote to ES.
[13:51:31] pfischer: so no need to monitor what tags are entering the pipeline? just the lag of the time when we write the tags into elastic?
[13:53:51] dcausse: Ah, you are right, we would not be able to get that from my approach.
[13:54:06] \o
[13:54:10] o/
[13:54:13] o/
[13:54:49] if it's to give feedback for tag users it might be sufficient to have a dashboard showing a kind of lag per tag group I think?
[13:54:56] dcausse: Well, then it's what you suggested: two metrics labeled by prefix and then you can match the graphs
[13:55:08] ok
[13:55:38] for alerting I think we indeed need to compare
[13:59:32] dcausse: That will be a bit harder since we allow clients to decide if a tag can wait to be merged with a primary or bypass the merging, that might change the shape of the second graph, so it's no longer a shifted version of the first one.
[14:01:18] But if we shift the observed window of the second graph by 10min, maybe we get good-enough results so the SUMs are close enough.
[14:02:04] yes... true, might be a bit tricky to adjust this to avoid false positives
[14:02:30] hard to tell without having the data at hand...
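For illustration, a minimal sketch of the comparison discussed above, summing the intake-side counts over a window shifted back by roughly 10 minutes and checking them against what reached Elasticsearch, could look like the following. The class name, per-minute count maps, window sizes and the 0.9 threshold are hypothetical; the SUP does not necessarily expose its metrics in this form.

    import java.util.Map;

    // Hypothetical sketch: compare weighted-tag events entering the producer with
    // events written to Elasticsearch, shifting the intake window back by ~10
    // minutes so the two SUMs line up despite the merge/bypass delay.
    final class WeightedTagLagCheck {

        /** Both maps: epoch minute -> event count for one tag prefix / group. */
        static boolean looksHealthy(Map<Long, Long> producedPerMinute,
                                    Map<Long, Long> writtenPerMinute,
                                    long nowMinute) {
            final long windowMinutes = 30; // length of the comparison window (assumption)
            final long shiftMinutes = 10;  // tolerated lag between intake and ES write

            // Events that entered the pipeline 10..40 minutes ago ...
            long produced = sum(producedPerMinute,
                    nowMinute - shiftMinutes - windowMinutes, nowMinute - shiftMinutes);
            // ... should by now have shown up in Elasticsearch within the last 30 minutes.
            long written = sum(writtenPerMinute, nowMinute - windowMinutes, nowMinute);

            if (produced == 0) {
                return true; // nothing entered the pipeline, nothing to alert on
            }
            // Alert if fewer than 90% of the produced events made it to Elasticsearch.
            return (double) written / produced >= 0.9;
        }

        private static long sum(Map<Long, Long> perMinute, long fromInclusive, long toExclusive) {
            long total = 0;
            for (long minute = fromInclusive; minute < toExclusive; minute++) {
                total += perMinute.getOrDefault(minute, 0L);
            }
            return total;
        }
    }

Shifting the intake window rather than the write window keeps the check causal: it only asks about events that have already had the full grace period to be merged and written, which should reduce the false positives worried about above.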
[14:03:08] dcausse: somewhat related to the sources of weighted tags: I am currently writing the description for T372912 and trying to summarise how the image suggestion pipeline works. https://etherpad.wikimedia.org/p/T372912
[14:03:08] T372912: Migrate image recommendation to use page_weighted_tags_changed stream - https://phabricator.wikimedia.org/T372912
[14:03:25] Does that make sense?
[14:04:13] pfischer: I don't see it, did you save it?
[14:04:40] ah sorry, in the etherpad
[14:04:50] looking :)
[14:10:23] pfischer: added a couple of thoughts regarding the requirements, esp. the fact that ideally teams should be relatively autonomous in using this system (not relying on a spark job owned by search)
[14:23:10] pfischer: sounds good to me
[14:23:48] dcausse: thanks!
[14:26:28] dcausse: Who else should be involved to resolve the open question(s)? If I remember ottomata correctly, jumbo would be the place for such events.
[14:27:28] pfischer: if we said jumbo already then this answers the question I guess?
[14:29:29] using jumbo will make the two producer jobs slightly different, only producer@eqiad will consume from jumbo I think?
[14:30:45] which I think is completely fine and desirable actually
[14:32:06] that means an extra set of options to allow configuring multiple kafka source brokers
[14:34:35] Yes. Does eventgate support jumbo? https://wikitech.wikimedia.org/wiki/Kafka suggests that it writes to kafka-main only
[14:35:25] Ah, that was configurable IIRC
[14:36:30] pfischer: I think eventgate-analytics is writing to jumbo
[14:49:48] dcausse: Yes, I forgot about that.
[14:49:57] errand, back in 90'
[15:28:18] volunteer time, back in ~2h
[17:04:47] gehel, I've reviewed and updated the Language Stuff goals. I don't have strong opinions on keeping it as one goal or splitting it (so I left it as one).
[17:11:36] dinner
[18:02:37] back
[19:30:56] doctor appointment, back in ~2h
[21:09:18] interesting reading: https://dtunkelang.medium.com/llms-and-rag-are-great-but-dont-throw-away-your-inverted-index-741d33630b7f
[21:19:40] back
[21:31:22] ebernhardson: indeed, thank you! That takes a bit of the hype-pressure out of the push for embeddings
[21:33:03] pfischer: daniel is also a well respected search consultant. He has wonderful content to dig deeper into at http://contentunderstanding.com/ and http://queryunderstanding.com/
[21:33:16] i suspect in general we would get more value from those approaches than from embeddings directly
[21:34:28] (and, more of an aside, he taught a search relevance class with our previous CTO that they asked me to present how we do things at)
[21:36:48] but it's also not clear how those kinds of things apply to our most common search, autocomplete. May require some level of re-imagining what autocomplete is (users currently expect a prefix-ish search)
[21:36:51] Nice! Where was that taught?
[21:37:21] pfischer: it was an online course they taught through Corise
[21:45:54] ebernhardson: I'll at least dig a bit deeper into Daniel's content you linked, tomorrow. Regarding autocomplete: I guess so, the challenges are different ones. On my hunt for publications about all kinds of spelling mistakes (there's not so much on Trey's keyboard layout shift) I came across a few things from MS, for example,
[21:45:54] https://www.microsoft.com/en-us/research/blog/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages
[21:48:25] interesting, will give it a read through. I've done some research on spelling corrections a few years ago but the info at the time wasn't too compelling
[21:51:33] alright, I'm off
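As an aside to the spelling-mistakes thread above: the "keyboard layout shift" presumably refers to queries typed with the wrong active layout (for example Russian text entered while a Latin QWERTY layout is active). A toy sketch of the position-for-position remapping idea, with an abbreviated QWERTY to ЙЦУКЕН table, might look like this; it is not the logic actually used in CirrusSearch, and all names are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    // Toy sketch: remap a query typed on a QWERTY layout to the characters the
    // same key positions would produce on a Russian ЙЦУКЕН layout.
    final class LayoutShift {
        private static final Map<Character, Character> QWERTY_TO_RU = new HashMap<>();
        static {
            // Abbreviated table: the three main letter rows only; both strings
            // must stay the same length, position for position.
            String latin = "qwertyuiop[]asdfghjkl;'zxcvbnm,.";
            String cyrillic = "йцукенгшщзхъфывапролджэячсмитьбю";
            for (int i = 0; i < latin.length(); i++) {
                QWERTY_TO_RU.put(latin.charAt(i), cyrillic.charAt(i));
            }
        }

        static String latinToRussian(String query) {
            StringBuilder out = new StringBuilder(query.length());
            for (char c : query.toLowerCase().toCharArray()) {
                out.append(QWERTY_TO_RU.getOrDefault(c, c));
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // "ghbdtn" is what "привет" looks like when typed on a QWERTY layout.
            System.out.println(latinToRussian("ghbdtn"));
        }
    }

The remapping itself is the easy part; deciding when to apply it (for example, only when the remapped query looks more plausible in the target language or returns substantially more results) is where the interesting work lies, which may be why there is little published on it.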