[09:23:27] <gehel>	 I just realized that I did not send a notification for the Search Platform Office Hours. 
[09:23:47] <gehel>	 Given the silent first week, I propose to wait until next week
[14:12:40] <inflatador>	 <o/
[14:28:09] <inflatador>	 Dentist apt, back in ~1h
[14:34:24] <Krinkle>	 gehel: could you help me estimate the size of the apifeatureusage dataset? I'd like to explore some alternatives but don't have a good sense of how much data is in there in practice. I imagine when aggregated as (YYYY-MM-DD, User-Agent, feature key, count) it's probably relatively small but then again we probably have a large and long tail of UAs, so that's probably where the bulk of it is. still curious on the order of magnitude 
[14:34:24] <Krinkle>	 though.
[14:34:46] <Krinkle>	 i.e. megabytes per day, or gigabytes.
[14:40:13] <pfischer>	 Krinkle: I’m afraid gehel is officially out of office this week
[14:49:22] <gehel>	 Krinkle: as said already, I'm out this week. But inflatador or ryankemper should be able to give the info
[15:22:54] <ebernhardson>	 Krinkle: varies from 1g to 4.4g per day, with a mean around 1.7gb/day
[15:23:27] <ebernhardson>	 that's the index size though, the exact data size is a bit harder to look up
[15:23:50] <ebernhardson>	 mean comes from: curl https://search.svc.eqiad.wmnet:9243/_cat/indices?bytes=b | grep apifeatureusage | awk 'BEGIN { cnt = 0; sum = 0} { cnt += 1; sum += $10/1024/1024/1024} END { print sum/cnt }'
[15:38:32] <inflatador>	 back
[15:40:02] <inflatador>	 ebernhardson would dumping the index and restoring it to an empty instance help estimate the data size?
[15:41:13] <ebernhardson>	 inflatador: depends on definition of size :)  You can aggregate with a script in elastic, so a small script can be written to perhaps sum up the lenghths of all strings, and then ask elastic to aggregate that over an index. 
[15:41:37] <ebernhardson>	 or could use a scroll/dump to get all the raw json content out of an index and write it to disk
[15:49:39] <inflatador>	 I guess I don't understand the difference between difference between data size and index size
[15:50:32] <ebernhardson>	 inflatador: index is the total on-disk data structures, at a minimum on-disk has two copies of all indexed data, depending on analysis chains used it could be more
[15:51:32] <ebernhardson>	 it looks like for these indexes no fancy analysis is setup, so it should be the indexed data structures, plus the compressed raw json
[15:56:50] <inflatador>	 OK, so the size of the compressed raw json is missing then? Is that deterministic? Like if we dumped the data and restored it somewhere else, would it be roughly the same size? (Thinking of MySQL where that would usually not be the case)
[15:57:58] <inflatador>	 Krinkle is there a task or associated with this? I probably need to study a bit before I can give you a decent answer ;)
[15:58:19] <ebernhardson>	 inflatador: it's more that the 1g-4.4g estimate is overcounting or undercounting, depending on the definition we want.  The index size reported is the total on-disk size of everything, but that means most data is represented twice on disk
[15:58:53] <ebernhardson>	 in the search indices it's worse though, we probably have 7 or so different representations of a title, for example.  Even bulk text gets represented maybe 4 different times
[16:01:35] <Trey314159>	 I moved the search office hours event on the staff calendar to next week, based on what gehel said above
[16:02:08] <ebernhardson>	 ahh, ok
[16:04:51] <inflatador>	 As far as the on-disk data structures, that's not a 1:1 with shards, right?
[16:07:04] <ebernhardson>	 inflatador: hmm, i don't quite follow the question. Shards ar basically defined by the data structures that make them up
[16:08:14] <inflatador>	 ebernhardson I was just curious if  "most data is represented twice on disk" was a direct reference to sharding. If it is, then I'm following you
[16:08:28] <ebernhardson>	 wed meeting starting, we can talk there :)
[18:31:22] <inflatador>	 lunch, back in ~1h
[19:23:49] <Krinkle>	 inflatador: the associated task is T313731
[19:23:50] <stashbot>	 T313731: Long term plan for reducing maintenance workload on the Search Platform team of supporting APIFeatureUsage - https://phabricator.wikimedia.org/T313731
[19:27:37] <Krinkle>	 ebernhardson: I'm thinking of exploring a simple mysql approach where the rows are essentially ensured/incremented as (YYYY-MM-DD, feature key, User-Agent, count), i.e. queue jobs instead of events, and periodically prune old data with a job as well. To consider that seriously, one of the things I'm interested in knowing is how large such table would end up being (and separately, what the read/write rate might look like, but that I can 
[19:27:37] <Krinkle>	 derive later).
[19:28:33] <inflatador>	 Krinkle ACK, thanks. Sounds like apifeatureusage touches PII?
[19:28:35] <Krinkle>	 The index would likely be a truncated length, or hash, not sure yet. first looking at the broad strokes in terms of how big things are to then derive how convoluted a solution can be justified.
[19:30:19] <ebernhardson>	 Krinkle: the total primary data we retain is ~1.2TB in 771 indices.  Which actually makes me a bit curious, i think that old data is supposed to be pruned but perhaps it isn't
[19:30:49] <Krinkle>	 "green open apifeatureusage-2022.10.13    "
[19:30:50] <Krinkle>	 indeed
[19:31:04] <inflatador>	 LOL
[19:31:05] <ebernhardson>	 Krinkle: probably a gb per day is a plausible estimate, with some days doubling that
[19:31:46] <Krinkle>	 profile::apifeatureusage::logstash::curator_actions:  description: 'apifeatureusage: delete older than 91 days'
[19:31:52] <Krinkle>	 seems relevant
[19:31:56] <Krinkle>	 anyway, I'll close that tab.
[19:33:12] <Krinkle>	 I'm trying to put together ApiFeatureUsageQueryEngineElastica.php with your `curl` invocation to see if I can make it dump one day's worth of data. Not sure if that's a stupid idea yet, but 1GB seems like something I could plumb through curl for temporary analysis.
[19:34:12] <Krinkle>	 mwgrep is probably a good example of a standalone curl query to elastic.
[19:35:40] <ebernhardson>	 Krinkle: for a day worth of data, i would probably try and pull it from kafka.  If i wanted to pull a day of data from an index i would probably use the python repl with the elasticsearch client to use elasticsearch.helpers.scan, because elasticsearch limits to 10k results in a single request
[19:36:21] <Krinkle>	 right, I see. I can get it from the events, that should do nicely with kafkacat.
[19:45:05] <Krinkle>	 hm.. I can't seem to find any mention on Wikitech or easily in Puppet which concrete kafka topics this ends up in, or which brokers to get it from, ideally after it has been reduced to apifeatureusage.  
[19:45:14] <Krinkle>	 but it looks like api-feature-usage.log exists on mwlog as well, I thought maybe they'd been excluded
[19:45:29] <ebernhardson>	 hmm, should be in the bits related to logstash. But if it's on mwlog i suppose that should work too
[19:46:28] <ebernhardson>	 oh, maybe we don't use kafka here. It seems to be configured to receive over rsyslog udp
[19:46:41] <ebernhardson>	 i thought everything had moved
[19:50:12] <Krinkle>	 "logstash::output::elasticsearch "
[19:50:39] <Krinkle>	 I'm guessing that means after rsyslog (which I presume does use kafka here, but it's all one firehose at that point) it goes straight to elastic?
[19:50:56] <ebernhardson>	 Krinkle: yes, that looks to be how it's configured
[19:57:35] <inflatador>	 heading to appointment, back in ~90
[20:19:40] <Krinkle>	 fwiw, the UI seems to use 2021-11-24 as default start date. Not sure if intentional or some automatic side effect of lack of pruning, but at least it is exposed as such and indeed returns non-empty data  
[20:19:49] <Krinkle>	 Special:ApiFeatureUsage that is
[20:20:13] <Krinkle>	 e.g. enter "WikipediaApp" as UA query.
[20:23:03] <Krinkle>	 using Jan 1st as example, it saw 27M events during that 24 hour window, which would translate to an average of 300 write queries per second. Formatting it as (feature,agent) pairs, there were 62,444 unique pairs that day (i.e. rows), multiplied by 90 days, would mean ~5M rows. 
[20:23:29] <ebernhardson>	 that doesn't sound too bad 
[20:23:37] <Krinkle>	 seems pretty small, but the insert rate is probably the main thing to think about further
[20:23:52] <Krinkle>	 insert/update rate*
[20:24:55] <ebernhardson>	 some sort of pre-aggregation could probably get the update rate under control, but that's non-trivial in most of our infra
[20:26:15] <Krinkle>	 yeah, given the trivial nature of the increment, it might make more sense as a post-send deferred than an actual job. it'd cost far less overhead that way, and would better reflect the second tier nature (i.e. doesn't need to be recoverable if lost of any reason).
[20:26:47] <Krinkle>	 as opposed to quueing an event through several layers and services, and then to spawn a new job process just to increment a number.
[20:27:34] <ebernhardson>	 yea, that makes sense.  all those extra layers would be significant overhead compared to the increment and commit.  It does mean db writes from GET requests, but in postsend probably we do that already
[20:27:52] <ebernhardson>	 hmm, actually these are form submits, probably post
[20:28:29] <ebernhardson>	 ahh, they are GET's
[20:28:47] <Krinkle>	 it can be either. depends on the API parameter.
[20:29:26] <Krinkle>	 but yeah, would not be unusual to do from a deferred, or even on a GET, i.e. watchlist seen markers are updated during logged-in page views.
[20:29:53] <Krinkle>	 but to reduce cross-dc writes, we'd want this to be on x2 or something, not a core wiki db for sure.
[20:30:12] <Krinkle>	 anyway, I'll write a summary tomorrow on-task. just exploring the idea right now.
[20:30:21] <Krinkle>	 looks like there's something there that might work.
[20:30:52] <ebernhardson>	 ya, seems plausible
[21:33:13] <inflatador>	 back