[03:58:42] We've brought the last of the new wdqs codfw hosts into service for https://phabricator.wikimedia.org/T332314. Next up, decom'ing `wdqs200[4-6]` (https://phabricator.wikimedia.org/T342035)
[09:04:01] pfischer: any updates you want to add to https://etherpad.wikimedia.org/p/search-standup ?
[09:15:13] Trey314159: did you do some marketing on the language unpacking Q&A session?
[09:28:32] ryankemper, inflatador: there are 4 procurement tickets on https://phabricator.wikimedia.org/tag/data-platform-sre/ for wdqs and cloudelastic. Could you make sure those are reviewed so that we can place the order?
[09:43:04] gehel: yes, one moment please
[09:43:42] pfischer: I'm pushing the update out in a few minutes!
[09:48:51] gehel: done
[09:49:03] thx
[09:50:26] update posted: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-07-21
[09:50:33] lunch!
[10:42:48] lunch
[10:50:55] I'm trying to make an informed guess of the size of revision-based events holding page content. So I created a histogram of the current state of the database, see https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit?usp=sharing. However, that only depicts the past. I would assume that the rate at which pages are created declines over time, and that if a revision occurs it's more likely caused by a page edit. Now I'm struggling to answer the question: what's the probability of a revision being in a certain size range? Is that a question for the analytics team?
[12:08:45] pfischer: I wonder if you could use the enriched page-state stream; it's a stream with the wikitext in it, and it should be mostly page edits. Perhaps they have stats on the doc size?
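(Editorial aside, not part of the conversation: the size-range probability question above is just an empirical distribution over observed revision sizes. A minimal Python sketch, using made-up sample sizes rather than real wiki data, could look like this.)

```python
from collections import Counter

def size_bucket(size_bytes: int, bucket_kb: int = 10) -> int:
    """Map a revision size in bytes to a bucket index (each bucket is bucket_kb KiB wide)."""
    return size_bytes // (bucket_kb * 1024)

def size_range_probabilities(sizes, bucket_kb: int = 10):
    """Empirical probability that a revision falls into each size bucket."""
    counts = Counter(size_bucket(s, bucket_kb) for s in sizes)
    total = len(sizes)
    return {b: c / total for b, c in sorted(counts.items())}

# Hypothetical sample of revision sizes in bytes (illustrative only).
sample = [1_986, 4_096, 22_000, 25_000, 40_910, 8_000, 21_500, 95_000]
for bucket, p in size_range_probabilities(sample).items():
    print(f"{bucket * 10:>3}-{(bucket + 1) * 10:<3} KiB: {p:.2f}")
```

Run against a real export of revision sizes (e.g. the spreadsheet data), this gives the per-range probabilities the question asks for, for the historical data at least.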
[12:10:07] alternatively we could guesstimate this from the doc sizes captured by CirrusSearch; it's all updates (revision based + page edits), so perhaps this is close enough, but no guarantee sadly
[12:10:31] s/revision based + page edits/revision based + page rerenders/
[12:11:44] also the rev size as stored in the db might be lower than the actual cirrus doc; the cirrus doc has the wikitext in the source field and the plain-text version in the text field
[12:20:32] stats from the mw enrichment job: https://w.wiki/762y (should be roughly the kafka compressed size of the wikitext for rev-based updates)
[12:23:28] https://w.wiki/762z: should be the avg size as seen by CirrusSearch right after generating the cirrus doc for all page changes (rev based + rerenders + saneitized); it's the uncompressed size of the cirrus doc content
[12:30:16] you also have https://stats.wikimedia.org/#/all-projects which contains nice metrics (new pages vs modifications)
[12:32:13] the number of pages created does not seem to decline much; it's been stable for the last 2 years at around 3M/month +/- 1M
[13:38:58] o/
[13:45:18] gehel :eyes
[13:48:02] Thanks dcausse, that was insightful! I looked at the moving median for cirrus' doc_size and that's between 20kb and 25kb. Looking at page_content_change (https://w.wiki/764G), the median message size was 22kb. I don't know if they use compression (and batching?) and if snappy can achieve roughly 50% compression (so far I've only seen ~20%). I would estimate that cirrus' doc_size accommodates wiki_source and rendered text, so roughly twice the wikitext source (do we have an estimate of the size change from wikitext to plain text?)
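(Editorial aside, not part of the conversation: the estimate being discussed is simple arithmetic on the rough figures quoted above, so a sketch may help. The 22kb median and the assumed ~2:1 snappy ratio and 2x source-to-doc factor are the conversation's ballpark numbers, not measurements.)

```python
# Back-of-envelope estimate of uncompressed event size from the observed
# median Kafka message size, under an assumed snappy compression ratio.
median_kafka_msg_kb = 22      # observed median page_content_change message size
assumed_snappy_ratio = 0.5    # assumed ~2:1 compression

est_uncompressed_kb = median_kafka_msg_kb / assumed_snappy_ratio
# If the cirrus doc holds the wikitext (source field) plus the rendered
# plain text (text field), doc_size is roughly 2x the wikitext alone.
est_cirrus_doc_kb = est_uncompressed_kb * 2

print(f"estimated uncompressed wikitext: ~{est_uncompressed_kb:.0f} KiB")
print(f"estimated cirrus doc size:       ~{est_cirrus_doc_kb:.0f} KiB")
```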
[13:50:38] the mw enrichment certainly uses batching, but I'm not sure about the exact numbers; might have to ask Gabriele. Snappy compression should always be used unless changed/disabled explicitly
[13:54:21] re wikitext size vs cirrus_doc size, I'm unsure, but 2x does not sound too bad (perhaps a bit more to account for all the other metadata)
[14:01:08] re snappy, I did a few tests on the cirrus doc; I believe we can expect 2:1
[14:05:07] a small page: 1986 -> 1009 (50.8%); a 40k page: 40910 -> 18539 (45.3%) (using snappy_test_tool from github.com/google/snappy)
[14:17:36] gehel (re: unpacking Q&A) I posted to foundation-optional, added it to the staff calendar, and posted to #product-tech-dept on Slack. I was going to follow up today and take a moment during triage on Monday to send a final reminder
[14:18:12] Trey314159: sounds good! I don't really have any other place to promote it.
[14:19:05] I might actually be able to join on Monday! The kids are with their aunt. (I also have plans for a nice dinner with France, but that might be a bit later)
[14:19:17] Cool!
[14:19:27] (Though dinner takes precedence, for sure!)
[14:23:24] btw, ryankemper, I might skip our 1:1 next Monday, depending on those same dinner plans!
[14:31:31] pfischer: do you have a moment to discuss kafka sizes?
[14:46:23] \o
[14:46:45] o/
[14:46:57] dcausse: are your two ExtensionRegistry patches ready to go? My +2 finger is itchy and I'm almost ready to push the button. ... but then a wild ebernhardson appears!
[14:47:10] lol, i can look at that sure
[14:47:13] :)
[14:48:00] ebernhardson: I'm happy to +2 them if dcausse is ready. But if you want to look, you can do it instead
[14:52:06] if cindy passes it probably works; it all looks pretty reasonable. I can't say i fully understand some bits yet (would take more thinking), but it probably works :)
[14:55:38] dcausse: i suppose one thing i'm not quite following: where were the profiles previously loaded from? glanced around the removed code but didn't really see it
[14:56:05] * ebernhardson could just need more coffee :P
[14:56:54] ebernhardson: I think you're looking for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/940254/5/includes/ServiceWiring.php ?
[14:58:34] dcausse: oh i really am just needing more coffee... it's the extractAttribute() bit that was in ConfigProfileRepository
[14:59:07] oh sorry, yes, misunderstood your question
[15:08:50] I got +2 sniped! (Thanks for looking into the whole thing in the first place, David!)
[15:37:02] ryankemper: instead of Prometheus study today, let's see if we can knock out those procurement tix, if that works for you
[16:10:18] going offline, have a nice week-end
[16:53:32] inflatador: yup sounds good!
[17:32:48] cool! Heading to lunch, back in ~45
[18:27:11] back
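(Editorial aside, not part of the conversation: the snappy spot checks at 14:05 were done with `snappy_test_tool`. Measuring a compression ratio is easy to reproduce in a few lines; snappy itself is a third-party package in Python, so this self-contained sketch uses stdlib zlib as a stand-in. The technique is the same, but zlib's ratios will differ from snappy's, and the payloads here are synthetic, not real cirrus docs.)

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size as a fraction of the original (smaller is better)."""
    return len(zlib.compress(data, level)) / len(data)

# Illustrative payloads; real wikitext compresses differently.
small_page = b"== Heading ==\nSome wikitext content. " * 50
large_page = b"{{Infobox|name=Example}} More prose here. " * 1000

for name, payload in [("small", small_page), ("large", large_page)]:
    r = compression_ratio(payload)
    print(f"{name}: {len(payload)} -> {int(len(payload) * r)} bytes ({r:.1%})")
```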