[05:43:00] o/ good morning!
[06:06:32] morning morning
[06:07:37] good morning :)
[06:29:43] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11084610 (10BWojtowicz-WMF)
[07:00:49] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11084646 (10BWojtowicz-WMF) Based on the estimates given in [[ https://phabricator.wikimedia.org/T401778#11081270 | message above ]] and discussion during...
[07:02:19] folks nobody is responding in wikimedia-operations. Should I start the deployment ?
[07:05:34] still nobody responding I am gonna start it
[07:08:19] yeah let's go!
[07:08:34] it seems that we are the only ones with a patch in this window
[07:08:41] yes
[07:12:02] isaranto: alright is time for syncing. It is like the trwiki that I do not see the filter right now in the ores. Should I continue the deployment as we did last time ?
[07:13:31] But I see the machine-learning platform in the SpecialVersion website
[07:14:36] I hit sync
[07:14:38] lets see
[07:25:26] georgekyz: this deployment is basically a no-op. we dont want to add the filter, we just added the threshold so that others can us it
[07:25:32] *use it.
[07:27:29] oh that's right, we haven't activated the 'revertrisklanguageagnostic' under the 'wgOresModels' section.
[07:27:39] alright perfect, the deployment was successful
[07:29:18] grea, thanks!
[07:29:33] thnx for being around
[07:29:34] *great -- I don't know why I'm constantly mistyping this morning :P
[07:29:45] haha np
[07:51:51] morning!
[07:56:56] 10Lift-Wing, 06Machine-Learning-Team, 13Patch-For-Review: Increase character limit in tone check model - https://phabricator.wikimedia.org/T401696#11084800 (10gkyziridis) == Update == ` $ kubectl get pods NAME READY STATUS RESTARTS AGE edit...
[08:03:26] Tone-check latest version deployed on prod
[08:03:56] \o/
[08:44:31] Hi team, I have a small question about the article topics cache. I think ideally, we could make cache as compact as possible by storing e.g. only the top prediction+probability and not list of all predictions with their probabilities. This could potentially save _a lot_ of disk space. However, at the moment we return all predictions over a threshold (either sent by user or 0.5 by default). Additionally, I see that when sending our predictions
[08:44:31] to EventGate, we’re sending all predictions+probabilities.
[08:45:05] I’m wondering if for the purpose of Year in Review we could work around this? We could e.g. introduce a `get_top` flag to be used by Year in Review team (or even True by default?), which would return only the top prediction. In this case if user wants more detailed list of predictions with probabilities, they would not use the cache.
[08:45:17] What do you think?
[08:46:31] Also additional question for Year in Review in particular - what would we like to do if there’s no prediction over a threshold for a page? Do we set the page topic to unknown or use the top prediction available?
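A minimal Python sketch of the compact-entry idea raised at [08:44:31]-[08:45:05] above. The `get_top` flag and the 0.5 default threshold come straight from the chat; the function name, the topic -> probability input shape, and the cache entry layout are hypothetical illustrations, not the actual model output or cache design.

```python
def build_cache_entry(
    predictions: dict[str, float],
    threshold: float = 0.5,
    get_top: bool = False,
) -> dict:
    """Turn a topic -> probability mapping into a cache entry.

    With get_top=True only the single best prediction is kept, which keeps
    each row small; otherwise every prediction above the threshold is kept,
    mirroring the shape of the current response.
    """
    if get_top:
        topic, prob = max(predictions.items(), key=lambda kv: kv[1])
        return {"predictions": [{"topic": topic, "probability": prob}]}
    return {
        "predictions": [
            {"topic": t, "probability": p}
            for t, p in sorted(predictions.items(), key=lambda kv: -kv[1])
            if p >= threshold
        ]
    }


# Example: a Year-in-Review style lookup vs. the current threshold behaviour.
scores = {"STEM.STEM*": 0.93, "History_and_Society.History": 0.61, "Culture.Sports": 0.12}
print(build_cache_entry(scores, get_top=True))  # only the single best topic
print(build_cache_entry(scores))                # everything over the 0.5 threshold
```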
[08:49:22] 10Lift-Wing, 06Machine-Learning-Team: Increase character limit in tone check model - https://phabricator.wikimedia.org/T401696#11084951 (10gkyziridis) 05Open→03Resolved
[08:49:23] 06Machine-Learning-Team: Investigate revertrisk threshold generation for enwiki - https://phabricator.wikimedia.org/T400590#11084953 (10gkyziridis) 05Open→03Resolved
[08:59:41] makes sense to me only using the top prediction+probability. In the YiR meeting notes, I saw they mentioned "Mapping of article title -> topic should be enough". but we can confirm with them
[09:02:32] 06Machine-Learning-Team, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Create an analytics service user for the ML team - https://phabricator.wikimedia.org/T400902#11084989 (10kevinbazira) @brouberol thank you for the recent updates to the HDFS permissions and ownership for `/wmf/cache/artifacts/airflow/m...
[09:20:28] bartosz: I think the second part would depend on what the YIR team prefers.
[09:21:01] as for not caching everything for the normal YIR queries (only getting the best predicition) sounds good to me.
[09:21:26] there is a question there what we want to do for the steady-state queries (after the backfill), but we can revisit that later
[09:27:03] thank you for your inputs Aiko and Tobias! I agree that the 2nd point is something that we need to discuss with the YIR team
[09:43:39] indeed we can explore this option but it does get tricky for any other use case that will be hitting the cache afterwards. Is size a limitation at the moment? We could proactively ask the team and then be able to pivot if we see that there is an issue
[09:50:33] bartosz: I can ping the appropriate ppl from the Apps team directly on the phab ticket. The main question if they are ok with this is to find out what would be an ideal number of categories to return -- one that would limit the size of the cache but provide meaningful insights to the team
[09:53:13] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11085239 (10isarantopoulos) @Seddon @Dbrant We are exploring the option of adding a cache for all articles so that you can utilize it when making request...
[09:53:45] isaranto: I think indeed the major savings would come from not storing list of all predictions and only the predictions over a threshold. This means we save 3-5 predictions instead of the 64 predictions in each row and it cuts down the estimated DB size from ~300GB to ~30GB assuming we have 65mil entries
[09:54:47] But it also means we won't be able to support sending all predictions to eventgate when using cache
[10:19:54] 1. iirc the responses in outlink article topic dont include all 64 topics. this is the case with the older revscoring model but not with this one (this applies the threshold as you mentioned).
[10:19:54] 2. when sending events to eventgate we can leave the get_top_k parameter empty (or whatever it will be called) so we can store all the values
[10:26:21] 2. How often do we send the events to eventgate? alternatively, we can set the threshold to 0 to sent all predictions there. This could potentially mean that we don't need to introduce the `get_top_k` - we'll just keep operating as we do currently, sending the articles over the threshold.
[10:30:56] we send an event for every page change event (so basically every time a new revision is made) and we are using the 0.5 threshold. If we put 0 as a threshold then we would be sending all the categories.
In any case we don't want to change the current behavior for the events that we publish as these are ingested by search -- unless there is a pretty valid reason for it, but we'd have to coordinate with the team
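A back-of-envelope check of the ~300 GB vs ~30 GB figures quoted at [09:53:45]. The row count and the 64-vs-3-5 prediction counts come from the chat; the per-prediction and per-row byte costs below are assumed values picked only to show the quoted estimates are plausible, not measurements from the task.

```python
N_ROWS = 65_000_000     # ~65 million cache entries, per the chat
BYTES_PER_PRED = 70     # rough JSON cost of one topic name + probability (assumption)
ROW_OVERHEAD = 150      # page id / title / timestamp etc. per row (assumption)

full = N_ROWS * (ROW_OVERHEAD + 64 * BYTES_PER_PRED)     # every one of the 64 topics
compact = N_ROWS * (ROW_OVERHEAD + 4 * BYTES_PER_PRED)   # only the ~3-5 topics over 0.5

print(f"full:    ~{full / 1e9:.0f} GB")     # ~301 GB, in line with the ~300GB estimate
print(f"compact: ~{compact / 1e9:.0f} GB")  # ~28 GB, in line with the ~30GB estimate
```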
[10:34:05] I see, thank you! So in this case it would be useful to store all 64 predictions in the cache so that we can use cache for eventgate as well?
[10:42:20] just to clarify this bit: the current model+service doesn't return all 64 categories it returns the categories that have >50% probability so in many cases it is just 2-3 categories in the response. We would definitely want to keep it this way for the events that we send to eventgate
[10:43:02] It seems to me that we return only >50% probability categories, but we also send the `all_predictions` list to the eventgate
[10:43:43] ok, then I'm wrong I need to go back and check :D
[10:44:12] I think using cache for eventgate wouldn't make much sense - we want a new prediction each time a page changes. However, we could ingest to Cache when processing those :D
[10:44:27] re:eventgate +caching. this is an interesting bit and perhaps more related to cache invalidation which we can decide later. Since we generate a prediction for each page change event it would make sense for this stream to not utilize the cache and update it
[10:44:46] exactly!
[10:45:42] it is probably one of the cases (if not the only) where we need to run inference instead of using a pre-computed score
[10:46:33] yess, agree!
[11:10:10] indeed we return all the probabilities, here you can see responses from the stream https://superset.wikimedia.org/sqllab/?savedQueryId=1103
[11:10:39] thanks for clarifying this Bartosz!
[11:19:56] looking at the code, the long list of probabilities is only published in the events and are not part of the returned response. So in this case I think that we can just save the whole response since we are only interested in the predicted categories and they wouldn't be that many with the default 0.5 threshold
[11:40:50] so since we dont return all the categories it seems that we are ok as is right?
[11:41:27] just asking to figure out if my comment in the task is still relevant https://phabricator.wikimedia.org/T401778#11085239
[12:35:40] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Apply the Tone Check model to published articles, to learn whether we can build a pool of high-quality structured tasks for new editors - https://phabricator.wikimedia.org/T392283#11086077 (10achou) ### Architectural options #### Idea 1: Offline + serving pr...
[13:34:18] 06Machine-Learning-Team, 10Data-Platform-SRE (2025.07.26 - 2025.08.15), 10Editing-team (Tracking): Build model training pipeline for tone check using WMF ML Airflow instance - https://phabricator.wikimedia.org/T396495#11086241 (10BTullis) Sorry for the delay on this one. Tagging it so that ping it up asap an...
[14:32:09] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11086477 (10BWojtowicz-WMF) I'm sharing my notes on the Cache design. Those are not final yet and feedback is hugely welcome on any of the points below!...
[14:34:31] ^ I've shared my initial cache design notes, I'd love to get some feedback when you'll have a little free time :D
[15:01:33] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11086596 (10isarantopoulos) >>!
In T401778#11085238, @isarantopoulos wrote: > @Seddon @Dbrant > We are exploring the option of adding a cache for all art...
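A minimal sketch of the flow the exchange at [10:44:12]-[10:45:42] above converges on: the page-change stream never reads the cache, it always runs fresh inference, publishes to EventGate as today, and then refreshes the cache entry it just recomputed. Every function here is a hypothetical stub, not the real Lift Wing, EventGate, or cache API.

```python
def run_articletopic_inference(rev_id: int) -> dict[str, float]:
    """Stand-in for calling the article-topic model on Lift Wing for a revision."""
    return {"STEM.STEM*": 0.93, "Culture.Biography.Biography*": 0.61}


def publish_to_eventgate(event: dict, prediction: dict[str, float]) -> None:
    """Stand-in for emitting the prediction event, unchanged from today's behaviour."""


CACHE: dict[int, dict[str, float]] = {}  # page_id -> cached prediction (illustration only)


def handle_page_change(event: dict) -> None:
    # A new revision always gets a fresh prediction; the cache is never read here.
    prediction = run_articletopic_inference(event["rev_id"])
    publish_to_eventgate(event, prediction)   # search keeps receiving its current feed
    CACHE[event["page_id"]] = prediction      # refresh the entry for later read-only callers


handle_page_change({"rev_id": 123456, "page_id": 42})
```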
[16:28:25] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11086948 (10Ottomata) Relevant: {T401260}
[16:37:45] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11086986 (10Ottomata) > For the duration of Year in Review processing, we plan to not invalidate the cache to: > > * Keep the topic predictions consisten...
[17:07:48] 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11087094 (10Ottomata) Re backfill / populating the cache, see also https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#Saneitizer_(backgro...
[20:49:49] 06Machine-Learning-Team, 06Data-Persistence, 06Growth-Team, 10Improve-Tone-Structured-Task, 07OKR-Work: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11087728 (10Eevans) Hi @achou, >>! In T401021#11082998, @achou wrote: > [ ... ] >...
[21:52:36] 06Machine-Learning-Team, 05Goal: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968 (10SSalgaonkar-WMF) 03NEW