[12:43:03] Hello.
[12:43:06] Lol 69
[12:43:31] https://www.wikidata.org/wiki/Q119726160 is a dupe of https://www.wikidata.org/wiki/Q108267123 can anyone merge?
[12:53:44] hello?
[12:54:00] yes?
[12:56:47] https://www.wikidata.org/wiki/Q119726160 is a dupe of https://www.wikidata.org/wiki/Q108267123 can anyone merge?
[12:57:14] no need to repeat yourself. someone surely will pick it up.
[19:49:25] Er, hi there. I have a question about the wikidatawiki/entities download, and I've no idea if I'm in the right place...?
[20:47:27] Ah, low-traffic, OK. Well, in case this is the right place to ask: I know that the latest-all.json files are the most recent versions of the dumps, but is there any way to determine programmatically the date of the latest dump (short of parsing https://dumps.wikimedia.org/wikidatawiki/entities/ and checking each subdirectory for the file)?
[20:54:02] just make a HEAD request against the file in question and check the returned timestamp?
[20:55:49] Thought of that, but it's unfortunately the modification time, not the date of the dump itself. So the latest-all.json dump is currently the same as wikidata-20230612-all.json.gz, which has a modification date of 2023-06-14.
[20:55:55] e.g.: wget -S --method=HEAD https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
[20:56:05] or: perl -MLWP::Simple -wE 'say head("https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz")->last_modified'
[20:56:34] ah. so the time of the *start* of the dumping, not the time of the *end* of the dumping.
[20:56:38] It's a reasonable approximation, but if I'm trying to express in a reproducible way "this is the file I downloaded for this information", it'd be nice to be able to say specifically "the 20230612 dump" and not just "the dump retrieved on 2023-06-14".
[20:56:47] (here dumping includes stuff like compressing, writing, ...)
[20:57:27] hm.
[20:57:43] It'd be nice if the /latest* URLs were redirects to the actual file in one of the /YYYYMMDD/ subdirectories.
[20:58:05] Rather than hardlinks, I assume.
[20:58:07] For Wikipedia and other things, I can find out what the latest version is from the index.json file at the root URL, but that doesn't contain the "entities" files.
[20:58:50] JAA: totally! If I could be redirected to a download of "wikidata-20230612-all...", my problem would be solved.
[21:01:37] the ETag matches for the hardlinks.
[21:03:25] Definitely thought about using the ETag, but again, in terms of documenting my process, it's nicer to be able to say "the 20230612 dump" than "the one with ETag 648a3e7e-13386d9bbf".
[21:04:10] totally with you there.
[21:04:27] You could fetch the ETag of all relevant files and compare, but eww.
[21:04:33] If there's no way, of course, "2023-06-14" is, as I said, a reasonable approximation, and anyone trying to check my work can certainly go to the /entities/ directory and find the closest match.
[21:04:38] just wanting to throw raw ideas into the pot so you can cook the best that can be done with what you have.
[21:06:12] "eww" is pretty much my thought. :-) Even figuring out what "all relevant files" are is, like I said, kind of unpleasant, since it means poking into each directory in turn (and even getting a list of directories means parsing an HTML listing).
[21:07:11] the format of the filenames seems static.
[21:07:17] noob here, what are some reasons an item would keep being reverted when my edits are simply adding more recent information from the same source?
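A Python version of the same HEAD probe discussed above, showing both headers that come up in this thread (Last-Modified and ETag). This is a minimal sketch using only the standard library; it assumes the URL behaves the same way as in the wget and perl examples:

  import urllib.request

  url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
  req = urllib.request.Request(url, method="HEAD")
  with urllib.request.urlopen(req) as resp:
      # Last-Modified reflects when the file was written, not the dump's date stamp.
      print(resp.headers.get("Last-Modified"))
      # Per the discussion, the ETag is shared with the dated hardlink,
      # e.g. wikidata-20230612-all.json.gz.
      print(resp.headers.get("ETag"))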
[21:08:36] e.g.: https://dumps.wikimedia.org/wikidatawiki/entities/%Y%m%d/wikidata-%Y%m%d-all.json.gz
[21:09:08] which would avoid the parsing and would be a good guess for what to try.
[21:09:23] Guest62: any example of that?
[21:09:43] Right, but you'd need to know what the YMD are. :-) (One option being "check every date", but... well, it's better than parsing HTML, but worse than, like, not having to make repeated requests to the site until I get one I like.)
[21:11:16] if you're stateful you can just try the current date(s) and cache the responses from your last runs.
[21:11:43] so it would be only a few requests each time (max one per day on average).
[21:12:44] also, if you're only interested in the most recent one, you can use today and then go back one day until you hit it (with some limit so it won't go wild in case of an error).
[21:12:50] all not perfect.
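A sketch of the walk-back idea, combined with the ETag comparison from earlier in the thread: start at today, probe the dated URL pattern, and stop at the first date whose ETag matches latest-all.json.gz. The 30-day limit is an arbitrary safety cap (the "some limit" mentioned above), not anything the server guarantees:

  import urllib.request
  import urllib.error
  from datetime import date, timedelta

  BASE = "https://dumps.wikimedia.org/wikidatawiki/entities"

  def etag(url):
      # HEAD request; returns the ETag header, or None if the file is missing.
      req = urllib.request.Request(url, method="HEAD")
      try:
          with urllib.request.urlopen(req) as resp:
              return resp.headers.get("ETag")
      except urllib.error.HTTPError:
          return None

  def latest_dump_date(max_days_back=30):
      # Walk back from today; the first dated file whose ETag matches
      # latest-all.json.gz is the dump that "latest" currently points to.
      target = etag(f"{BASE}/latest-all.json.gz")
      if target is None:
          return None
      for n in range(max_days_back):
          d = (date.today() - timedelta(days=n)).strftime("%Y%m%d")
          if etag(f"{BASE}/{d}/wikidata-{d}-all.json.gz") == target:
              return d
      return None  # nothing matched within the window

  print(latest_dump_date())  # e.g. "20230612"

As noted in the chat, caching the result between runs would keep this to roughly one request per day on average.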