[09:47:34] lunch + out for the afternoon
[09:57:43] I was puzzling over the "multiple wikis in the same index isn't something cirrus can do" point that you mentioned about a month ago, ebernhardson. Did you end up making any tickets, patches, etc. around that that I could take a look at? We're trying to figure out whether having hundreds or low thousands of wikibases (and thus mediawikis) use WikibaseCirrusSearch is a possible or sensible direction to go in, or if we should be building something else.
[12:01:54] ebernhardson: hoping you can help me with some airflow/python stuff. I'm trying to put a couple of tables in `partition_names` of `NamedHivePartitionSensor`. Along with those tables, I also want to include `eventgate_partitions(a_table)`. Since eventgate_partitions returns a Sequence[str], I can't seem to simply `+` a list with it. What's a nice solution to this?
[13:19:41] greetings
[13:47:39] dropping off kids, back in ~15-20
[14:01:23] back
[14:47:55] \o
[14:48:06] tanny411: hmm, off the top of my head i'm not sure, but i can look into it
[14:50:27] tarrow: we didn't really go too far down that path. we started generalizing the way we handle document ids; currently they are page_ids, so multiple wikis in the same index would have overwritten each other
[14:51:18] tarrow: what we never started on, and is the real work, is understanding how you do maintenance like that. How do you reindex wikis? How do you change the analysis chain for a language? Today we create a new index, copy into it, and delete the old one. Could we do that if we were talking about 5TB of data each time and multiple billions of docs? Didn't seem fun
[14:52:43] tarrow: one other random idea i remembered though: you could tune $wgCirrusSearchNamespaceMappings in such a way that there is no content/general index difference, so everything would go into the content index. Cirrus would still end up creating the general index though, as it's hard-coded in a few places
[14:53:08] but we could probably find a way, without too much work, for cirrus to accept a single index per wiki
[14:57:47] tanny411: you should be able to take advantage of airflow templating to do what you need. here is an example of an object with arguments that then gets resolved into a sequence at rendering time: https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/plugins/wmf_airflow/template.py#L110
[14:58:38] tanny411: you might also be able to do something simpler, i'm not entirely sure
[14:59:46] oh, you should already be using it. sorry, i'm avoiding coffee at the moment and maybe it's a bit of a weakness :P hmm
[15:01:39] tanny411: it's a bit ugly, but all eventgate_partitions does is invoke the TemplatedSeq with a particularly formatted template. you should be able to create a TemplatedSeq directly with an appropriate template; maybe a helper function would make the template less obvious. Otherwise i would end up with a java-ish solution where you wrap the TemplatedSeq with some new `SequentialSeqs` class that flattens multiple seqs into one, or some such
[15:08:02] ebernhardson: ooh! That's a good lead. I'll take a look at that
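(A minimal sketch of the `SequentialSeqs` idea floated above — hypothetical code, not part of wmf_airflow. It flattens several Sequence[str] objects into one while deferring all access to the wrapped sequences; whether airflow's template rendering reaches inside a custom wrapper like this depends on the same rendering hooks TemplatedSeq relies on, so it would need testing.)

```python
from typing import Sequence


class SequentialSeqs(Sequence[str]):
    """Present several Sequence[str] objects as one flat sequence.

    Access is delegated to the wrapped sequences, so anything that
    resolves lazily (e.g. a templated sequence) is only touched when
    this wrapper is itself iterated.
    """

    def __init__(self, *seqs: Sequence[str]):
        self.seqs = seqs

    def __len__(self) -> int:
        return sum(len(seq) for seq in self.seqs)

    def __getitem__(self, idx: int) -> str:
        # Walk the wrapped sequences until idx falls inside one of them.
        for seq in self.seqs:
            if idx < len(seq):
                return seq[idx]
            idx -= len(seq)
        raise IndexError(idx)


# Hypothetical usage with the sensor from the question (partition name
# strings are made up for illustration):
# sensor = NamedHivePartitionSensor(
#     task_id='wait_for_tables',
#     partition_names=SequentialSeqs(
#         ['mydb.my_table/year=2022/month=6/day=20'],
#         eventgate_partitions(a_table),
#     ),
# )
```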
[15:55:48] working out, back in ~45
[16:31:18] back
[17:06:08] is there a convenient way for me to query the WDQS to see what the timestamp of the last inserted revision was? basically to see how close to or far from "real time" it is
[17:30:48] lunch/errands... back in ~1h
[17:39:19] hare: iirc the updater keeps a timestamp in the rdf store, but i don't remember if that's the old or new updater, looking
[17:39:37] hare: the updater inserts a triple with the latest update timestamp. David should be back tomorrow and he should have more specific info
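(For Wikidata's public WDQS, that latest-update triple is exposed as `schema:dateModified` on `<http://www.wikidata.org>`, so lag can be checked with a one-triple query. A sketch, assuming the public Wikidata endpoint; a private wikibase would use its own endpoint and wiki node, and as noted above which updater maintains the triple is worth confirming.)

```python
import requests

# SPARQL query for the updater's last-update timestamp.
QUERY = """
SELECT ?dateModified WHERE {
  <http://www.wikidata.org> schema:dateModified ?dateModified
}
"""

resp = requests.get(
    'https://query.wikidata.org/sparql',
    params={'query': QUERY, 'format': 'json'},
    headers={'User-Agent': 'wdqs-lag-check/0.1 (example)'},
)
resp.raise_for_status()

# Standard SPARQL JSON results format.
bindings = resp.json()['results']['bindings']
print(bindings[0]['dateModified']['value'])  # e.g. '2022-06-20T18:00:00Z'
```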
[18:42:29] back
[19:15:53] * ebernhardson was impressed that the error message had its own url, https://s.apache.org/sbnn-error . Except that says 'Page Not Found'. Slightly less impressed :P
[19:18:56] LOL
[19:39:00] an interesting search development: https://github.com/brave/goggles-quickstart
[19:39:10] (still not sure i trust brave :P)
[19:44:23] ebernhardson: trust as in like privacy-wise, or trust as in "the search is actually useful"?
[19:44:47] if it's the latter I definitely agree. I've been trying out brave instead of ddg the last couple months and the search is not great (although it's cool that they actually have their own index, so at least I get *different* results)
[19:44:47] ryankemper: trust as in, the last time i saw news about them it was something about cryptocurrency and ads :)
[19:45:31] ebernhardson: ah, yeah, that was probably related to their basic attention token. IIRC it's basically that people get paid in BAT for looking at ads, and I assume you can "pay" BAT to get ads displayed to other people
[19:45:45] so actually not as scammy as most of the crypto stuff is, although it is still a little silly
[19:46:36] oh I see, it was prob something like https://www.theverge.com/2020/6/8/21283769/brave-browser-affiliate-links-crypto-privacy-ceo-apology
[19:47:06] fair enough. i suppose i do try and maintain that there is probably a use for crypto, but it has been overrun with profit seekers and schemes that look little different from traditional pump-n-dumps. I suppose it's much easier to be wary of the whole industry than to try and pick apart which are potentially legitimate uses
[19:47:43] ebernhardson: you will get no argument from me there, IMO 99.9% of crypto is just a 21st century ponzi scheme
[19:47:55] In a way I almost respect the ones that are more shameless about it and don't even try to pretend there's actual utility behind it, haha
[19:48:49] but yeah it cracks me up how many coins there are with really beautiful webpages and all these weird rules about "we burn x% of crypto every time period so the value of your crypto always goes up", and somehow people always fall for the hype
[19:49:10] * ryankemper is an old man though and remembers when the "currency" part of cryptocurrency was actually the core value proposition
[19:52:33] lol, I suppose i haven't looked too closely, but i was certainly surprised that Luna was giving a 20% APY and people thought that was somehow legitimate
[19:56:39] unrelated: metrics seem to suggest the cirrus latency alerts on 6/16 and 6/17 (maybe others, haven't looked closer) are 1080 getting into a bad spot somehow
[19:57:16] https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles?orgId=1&from=1655445844000&to=1655447134000 and https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles?orgId=1&from=1655358844000&to=1655360134000
[19:57:35] not sure what it means though :S
[20:01:05] that feeling you get when you start typing S3 into DDG and it autocompletes to "the time i got reincarnated as a slime s3"
[20:10:28] no great ideas on 1080. its heap is a bit full, but not nearly as bad as other instances we've seen. I'll restart the elastic instance anyway and hope it stops complaining as much
[20:18:02] ACK
[20:56:11] * ebernhardson now realizes the reason hourly updates stopped complaining is because we don't let the next hour run if the previous hour failed. So it's just been waiting since the 16th
[22:08:21] ebernhardson: have you tried creating an ES snapshot in relforge yet? we were going to try it if not
[22:35:25] inflatador: hmm, i can't remember exactly, so guessing no :)
[22:37:11] poking at my bash history on relforge, doesn't seem like it
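(For reference, a generic sketch of the standard Elasticsearch `_snapshot` API flow being discussed. The host, repository name, filesystem location, and index pattern are all placeholders for whatever relforge actually uses, and an `fs` repository additionally requires its location to be listed under `path.repo` in elasticsearch.yml on every node.)

```python
import requests

ES = 'http://localhost:9200'  # placeholder for a relforge host

# Register a filesystem snapshot repository (idempotent).
requests.put(f'{ES}/_snapshot/my_backup', json={
    'type': 'fs',
    'settings': {'location': '/srv/es-snapshots'},
}).raise_for_status()

# Snapshot selected indices and block until the snapshot completes.
requests.put(
    f'{ES}/_snapshot/my_backup/snapshot_1',
    params={'wait_for_completion': 'true'},
    json={'indices': 'my_index_*'},
).raise_for_status()

# Inspect the result (state, shard counts, failures).
print(requests.get(f'{ES}/_snapshot/my_backup/snapshot_1').json())
```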