[01:27:16] ebernhardson: seaborn is nice and is pretty close to pyplot
[08:48:49] EU Morning search people! I was looking at WDQS and wanted to confirm that you now don't maintain some `allowlist.txt` in addition to this one (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/query_service/templates/allowlist.txt.epp) generated by puppet?
[08:49:11] there isn't one you also keep in the source repo, is there?
[08:56:31] tarrow: The allow list that you linked should be the only one used in our production. There might be leftovers from a previous implementation in other places, but those should not have any impact.
[08:58:28] gehel: awesome; and is there a recommended process to change that? Phab ticket? Write a Gerrit patch?
[09:06:00] tarrow: phab ticket is good! You can tag it with [data-platform-sre]. If you attach a gerrit patch, even better. Otherwise, we'll write the patch and merge.
[09:06:16] also, is there a good way I could get the "hydrated" form of that file too?
[09:09:09] gehel: maybe blazegraph has an api to get them?
[09:13:56] you should have all the relevant parts in the template itself. The variable part is only for managing internal endpoints, like the federation between the scholarly and main graphs.
[09:20:28] tarrow: That allow list implementation is a custom extension to Blazegraph. We did not implement a way to query it.
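The allowlist being discussed gates which remote SPARQL endpoints a WDQS query may federate to. A minimal sketch of what such a federated query looks like, assuming a hypothetical helper name (`federated_query`) and using the scholarly endpoint mentioned later in the conversation:

```python
# Hypothetical sketch: build a SPARQL query that federates to a remote
# endpoint via a SERVICE clause. Whether Blazegraph accepts the remote
# endpoint is governed by the allowlist discussed above.

def federated_query(remote_endpoint: str) -> str:
    """Return a SPARQL query pulling a few triples from a remote endpoint."""
    return f"""
SELECT ?s ?p ?o WHERE {{
  SERVICE <{remote_endpoint}> {{
    ?s ?p ?o .
  }}
}} LIMIT 10
""".strip()

query = federated_query("https://query-scholarly.wikidata.org/sparql")
print(query)
```

If the endpoint inside `SERVICE <...>` is not on the allowlist, the query service rejects the query rather than forwarding it.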
[09:36:04] gehel: The thing is we (for example) on wikibase cloud also want to allow people to query the wikidata scholarly graph; it would be nice not to have to manually maintain that bit of the list
[09:40:05] tarrow: the part that's hydrated by the puppet template engine might not be something you can use as-is and varies depending on the endpoint; one line looks like: 'https://wdqs-scholarly.discovery.wmnet/sparql,https://query-scholarly.wikidata.org/sparql,https://query.wikidata.org/subgraph/scholarly_articles'
[09:40:55] I see this allow list as deployment-specific configuration. The fact that we have a few endpoints that we all want to federate is accidental IMHO.
[09:43:43] gehel: I totally agree but... for all the people who are running 3rd party query services, selecting this list isn't trivial; a good starting point is of course wikidata, so knowing that complete list is helpful
[09:44:55] in an ideal world everyone cleverly and wisely picks this list; in reality people want to cargo cult the best list from somewhere
[09:47:47] sadly with the internally federated endpoints this list is no longer the same
[09:48:32] the endpoints we allow are "internal" service URLs that can't be used outside of the WMF infra
[09:56:28] dcausse: are they? to me those look like publicly accessible URLs and endpoints?
[09:56:45] surely those URLs are the ones users have to write in SERVICE lines?
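Going by the example line quoted above, a hydrated allowlist line can carry several comma-separated URLs that alias the same endpoint. A small sketch of flattening such a file into a set of URLs (the exact file format is an assumption based on that one example line; the comment-skipping is also an assumption):

```python
# Sketch (format assumed from the example line above): each allowlist line
# may hold several comma-separated URLs aliasing one endpoint.

def parse_allowlist(text: str) -> set[str]:
    """Collect every URL from an allowlist blob into a flat set."""
    urls = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        urls.update(u.strip() for u in line.split(",") if u.strip())
    return urls

hydrated = ("https://wdqs-scholarly.discovery.wmnet/sparql,"
            "https://query-scholarly.wikidata.org/sparql,"
            "https://query.wikidata.org/subgraph/scholarly_articles")
print(len(parse_allowlist(hydrated)))  # the example line holds 3 URLs
```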
[10:09:15] the "https://wdqs-scholarly.discovery.wmnet/sparql" one is definitely not public, and "https://query.wikidata.org/subgraph/scholarly_articles" is an alias that matches a shortcut wdsubgraph:scholarly_articles
[10:09:32] and is not a "real" endpoint
[10:10:29] additionally, to get the full list you'd have to blend the allowlist from query-main and query-scholarly, because the allowlist from query-scholarly does not allow federating itself
[10:12:08] in other words, there is no single list with both https://query-scholarly.wikidata.org/sparql and https://query-main.wikidata.org/sparql in it
[10:42:39] dcausse: Thanks! Super clear; I'm thinking about what then to best recommend to users
[10:54:45] lunch
[12:06:13] dcausse: if we wanted to build some shiny UI (for example) on scholarly or main to hint to them places they could federate to, we'd really just need to manually inject into the scholarly UI the fact that one can federate to main and vice versa etc.
[12:06:21] ?
[13:28:27] status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-12-06
[13:30:17] tarrow: I suppose? but if discoverability of the federated endpoints is very important then I suppose we might want to build a new API endpoint
[13:30:19] tarrow: yes, I think that what we want to advertise in terms of federation needs to be decided on a case by case basis. At the moment, we have a lot of focus on scholarly articles, as this is the latest split.
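dcausse's point above is that no single deployed allowlist names both split endpoints, because each service's allowlist omits the service itself; the full federation picture only exists as the union of the two. A toy sketch of that blending (the two wikidata URLs are from the discussion; the extra entry and variable names are illustrative assumptions):

```python
# Sketch: the complete federation list is the union of the query-main and
# query-scholarly allowlists, since each list omits its own endpoint.
MAIN = "https://query-main.wikidata.org/sparql"
SCHOLARLY = "https://query-scholarly.wikidata.org/sparql"

# Illustrative contents; real allowlists hold many third-party endpoints.
allowlist_main = {SCHOLARLY, "https://example.org/sparql"}
allowlist_scholarly = {MAIN, "https://example.org/sparql"}

blended = allowlist_main | allowlist_scholarly
print(sorted(blended))  # only the blended set contains both split endpoints
```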
But over time, most of the possible federation should be roughly similar
[13:32:42] in general (my assumption) you come to wdqs because you have a precise purpose and you already know what data you want to query
[14:16:52] o/
[15:58:29] heading out, have a nice weekend
[17:40:05] .o/
[19:10:59] lunch/appointment, back in ~2h
[19:38:55] left wondering if i'm using the same clickthrough definition we did in the old ab test... most of the metrics so far are reasonably aligned with the old ones, but clickthrough was 33% before and i have 53% here
[20:57:02] We're running low on HDFS space. Search Platform has significant usage in /user/analytics-search and /wmf/data/discovery. Is there something we don't use in there that we could delete? - T381707
[20:57:03] T381707: Low available space on Hadoop / HDFS - https://phabricator.wikimedia.org/T381707
[20:57:20] The cause of the low space is certainly not Search, but if we can do something to help...
[21:20:41] back
[21:24:02] i can poke around
[21:24:52] curious there is 25T in /user/analytics-search, we shouldn't be storing data there
[21:27:28] hmm, so it looks like that auto-cleans itself after 30 days, but in our case 30 days of trash is 25TB
[21:29:19] our main data uses are basically 25T for trash, 30T for cirrus dumps, 25T for wikidata dumps, and ~7TB for everything else
[21:30:49] Don't spend too much time on it, but if we can recover 25T, that would already help a little bit...
[21:31:32] gehel: it wouldn't really recover; it would start building again and be back to 25T in 30 days
[21:32:13] maybe we can stop moving things into the trash; i would have to review what is actually in there and how it ends up there. Maybe we can skip that (i think our hdfs generically does that for any deleted file)
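On the "skip the trash" idea at the end: HDFS's `hdfs dfs -rm` supports a `-skipTrash` flag that deletes immediately instead of parking data under `.Trash` for the retention window. A sketch that only constructs the invocation (the dataset path suffix is hypothetical, and nothing is executed here):

```python
# Sketch: -skipTrash bypasses the ~30-day trash retention discussed above.
# The command is built but not run; the dataset name is made up.
import subprocess  # needed only if you actually run the command

def skip_trash_rm(path: str) -> list[str]:
    """Return the hdfs CLI invocation that deletes `path`, bypassing trash."""
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", path]

cmd = skip_trash_rm("/user/analytics-search/some-old-dataset")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The trade-off is the obvious one: skipping trash frees the 25T immediately and keeps it from rebuilding, but also removes the 30-day safety net for accidental deletes.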