[00:37:43] I used to have a script that could kick off all the jobs, the hard part was collecting/aggregating the output out of jenkins
[09:51:18] [[Tech]]; 2409:4081:686:D05E:FDB8:2605:5C82:6332; /* santosh 2121 */ new section; https://meta.wikimedia.org/w/index.php?diff=24642555&oldid=24635327&rcid=26515038
[09:51:56] [[Tech]]; Tegel; Reverted changes by [[Special:Contributions/2409:4081:686:D05E:FDB8:2605:5C82:6332|2409:4081:686:D05E:FDB8:2605:5C82:6332]] ([[User talk:2409:4081:686:D05E:FDB8:2605:5C82:6332|talk]]) to last version by ArchiverBot; https://meta.wikimedia.org/w/index.php?diff=24642556&oldid=24642555&rcid=26515040
[12:41:06] hello! someone willing to +2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/892946/ for me? It unbreaks page deletions on gucwiki (T330746).
[12:41:08] T330746: MediaWiki\Page\PageAssertionException: The given PageIdentity {pageIdentity} does not represent a proper page - https://phabricator.wikimedia.org/T330746
[12:50:55] already done it seems
[17:31:53] hi all - I have been reading a bit about knowledge retrieval from Wikipedia
[17:32:27] not a current grad student, but I do have some relevant background and experience.. there is a lot of research going on, however!! so tips appreciated
[17:32:50] I see today refs to a KILT natural language corpus for Wikipedia
[17:33:17] page 13, Appendix A here https://arxiv.org/pdf/2009.02252.pdf
[17:33:28] hi dbb, is there anything specific you'd appreciate tips on? :)
[17:33:40] getting to it :-)
[17:34:29] a "dump" of wikipedia.. I think I might need to generate my own that is specific to the inquiry.. these researchers are trading their corpus between each other..
[17:34:51] it's so much for one person though.. hard to find the right spot to start
[17:35:51] I did set up a wikilabs VM at one time, but that was before covid.. so not recently
[17:36:10] I believe my tools lab ID is still valid from that
[17:37:18] finding other pre-built wikipedia dumps is useful, but I suspect the one I am interested in now would be smaller than most
[17:38:18] for this particular inquiry I want to match to OpenStreetMap things, and definitely include non-English
[17:38:46] (I do work with OSM data dumps regularly already here)
[17:39:24] dbb: so, you want to have a language corpus from Wikipedia's texts? is that a correct understanding?
[17:42:38] yes
[17:43:02] the basis of the content is geography..
[17:43:24] so you find areas, get related content.. repeat for an area, with some bounds
[17:43:55] like many things, it is not perfect in wikipedia data but it could be done somehow
[17:44:20] if I get this straight I can add wikidata content.. but.. one step at a time right?
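A minimal sketch of the "find areas, get related content, repeat with some bounds" step described above, using the MediaWiki API's geosearch list (served by the GeoData extension that comes up later in this log). It assumes the Python requests library; the coordinates and radius are placeholder values.

# Find Wikipedia articles near a point via the GeoData geosearch API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def articles_near(lat, lon, radius_m=10000, limit=50):
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",
        "gsradius": radius_m,  # metres; the API caps this at 10000
        "gslimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["query"]["geosearch"]

# Example: articles around the sample NL coordinate that shows up later in the log.
for hit in articles_near(51.46111, 4.66944):
    print(hit["pageid"], hit["title"], hit["dist"])

Repeating this per area (or per OSM feature) and then fetching the text of the returned titles is one way to assemble an area-scoped corpus; other language editions work the same way by swapping the API host.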
[17:46:01] https://corpus.tools/ is a tool that's sometimes used to generate a corpus (it needs HTML of articles, which can be downloaded from https://dumps.wikimedia.org/other/enterprise_html/ for example)
[17:46:51] if you want data about geography content, you might be interested in exploring dbpedia.org (which tries to give structure to data from Wikipedia)
[17:47:26] hm looking
[17:49:12] querying Wikidata can also provide some info about geographical entities, but data retrieved that way may or may not be equal to Wikipedia's data (it's a project on its own)
[17:49:43] Linked Open Data is related to RDF I think
[17:50:12] I can do graph queries manually but I would like to stay out of some RDF excursion
[17:50:55] https://en.wikipedia.org/wiki/Resource_Description_Framework <- not for me
[17:56:54] that KILT structured dump is new to me, and perhaps to most here.. but it appears that they combined "text data from wikipedia" with their own simple JSON format to identify page name and paragraphs
[17:57:22] and, the researchers in that paper did not do that themselves, they got a copy of the completed set from another team
[17:57:43] so - the mental overhead of doing the tech details to make that.. disappears
[17:59:50] I think everyone knows that SPARQL is a mixed situation..
[18:00:01] hm they have a geo experiment https://en.wikipedia.org/wiki/OGC_GeoSPARQL
[18:00:24] that is similar to what I am thinking.. tons of work though
[18:26:00] I would assume that wikipedia dumps its geo db as well (?)
[18:29:12] yup
[18:31:32] what geo db? Aren't coordinates generally just template-managed data in article space? The only structured geo data that I know of would be in wikidata and accessible via WDQS + SPARQL
[18:33:03] there once was a geo dataset extracted from articles co-hosted with the wiki replica data, but that was community-maintained stuff that we unfortunately had to remove support for ~6 years ago now.
[18:33:23] coincidentally around the time I last poked at this :-p
[18:40:43] The 2017 wiki replica refresh -- https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ -- is when that data was removed.
[18:42:23] The "why" of it is in that blog post, but the TL;DR is that we have no way to sync the data when failing over to a different server or to recover the data when rebuilding after a crash.
[18:54:31] bd808 / dbb: I mean the DB table. I think https://dumps.wikimedia.org/enwiki/20230220/enwiki-20230220-geo_tags.sql.gz is it
[18:55:04] oh? looking
[19:03:13] dbb: These would be the coordinates that are stored on pages via {{#coordinates:... parser functions (e.g. from https://www.mediawiki.org/wiki/Extension:GeoData )
[19:03:36] Wikimedia uses the elastic backend mostly, but it seems like stuff is still stored in the DB
[19:03:49] as well
[19:03:54] hm
[19:04:15] As an aside, kind of odd that the GeoData extension calls waitForReplication in the middle of LinksUpdate... I feel like that's the wrong time
[19:04:19] I think jsanz is the geo guy at Elastic now? one of them? he is on email lists here
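Picking up the Wikidata suggestion from [17:49:12]: a minimal sketch of asking the Wikidata Query Service for items with coordinates (P625) inside a small bounding box, without committing to a full RDF toolchain. The WDQS endpoint, box service, and label service are real features; the box corners and User-Agent string are placeholder values, and it assumes the Python requests library.

# Query WDQS for items with coordinates inside a bounding box.
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel ?coord WHERE {
  SERVICE wikibase:box {
    ?item wdt:P625 ?coord .
    bd:serviceParam wikibase:cornerSouthWest "Point(4.60 51.40)"^^geo:wktLiteral .
    bd:serviceParam wikibase:cornerNorthEast "Point(4.75 51.50)"^^geo:wktLiteral .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""

resp = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "geo-corpus-sketch/0.1 (placeholder contact)"},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"], row["coord"]["value"])

Each result carries the Wikidata item URI, whose sitelinks can then be followed to the corresponding Wikipedia articles in any language.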
[19:05:05] I mean, we use elastic, but we don't provide dumps of that, so that's not something that affects you
[19:05:33] But we duplicate the data streams to both mysql and elasticsearch, and we do provide downloads of the mysql table, which should have identical information
[19:05:44] I might have interviewed for that job in SF
[19:05:53] anyway you are right of course
[19:06:33] I think what bd808 is referring to is the old geohack thing, which is from a long time ago, and has been 95% replaced with the GeoData mediawiki extension (I think, but I don't know that much about this area)
[19:06:52] that dump file linked - I am looking at it in a terminal and I see comma-separated tuples like this '(631553526,19175792,'earth',0,51.46111111,4.66944444,10000,'city',NULL,'NL',NULL,NULL,NULL)'
[19:07:06] The version numbers at https://johnresig.com/blog/javascript-testing-does-not-scale/ are surprising looking back. I can't remember what it felt like to have a point in time where the "current" browser versions are Firefox 2 and 3, Chrome 1 and 2, and Safari 3. I mean, this wasn't the early Internet, this was only 2010.
[19:07:09] Yes, it is a mysql dump file
[19:07:44] dbb: Normally to view it, you would import it into a mysql database (I mean, of course you don't have to)
[19:08:07] Either the first or second number (don't know off the top of my head, but it should say at the beginning) is a page id
[19:08:23] do you have unique page IDs though?
[19:08:41] Which you can match up with real page ids at https://dumps.wikimedia.org/enwiki/20230220/enwiki-20230220-page.sql.gz (Warning, 1.9GB file)
[19:08:51] bawolff: fwiw, the call here is indeed redundant: https://gerrit.wikimedia.org/g/mediawiki/extensions/GeoData/+/3eee38aa615c644240750ddae4011b573546101c/maintenance/updateIndexGranularity.php#53
[19:09:03] ah oh hm
[19:09:05] because Maintenance->commitTransaction already takes care of waiting for replication
[19:10:15] Fixed - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/893041
[19:10:36] bawolff: https://paste.debian.net/1272466/
[19:11:18] Krinkle: I was thinking more about https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GeoData/+/3eee38aa615c644240750ddae4011b573546101c/includes/Hooks.php#192 - I would assume that waiting for replication is something that the job runner script would do between jobs, and you wouldn't really do it from the middle of a job
[19:12:38] dbb: So yes, that means the second number would be the page id
[19:12:50] that dump file appears to be just one table
[19:13:15] That is correct
[19:13:36] dbb: The particular example you posted earlier seems to be a test page - en.wikipedia.org/?curid=19175792
[19:14:24] Of the DB tables that are available to be downloaded (not all are), generally they are split up into different files
[19:15:13] bawolff: TIL that that data is less ad hoc than I remembered. Thanks.
[19:15:38] bawolff: thx
[19:15:45] bd808: There was definitely a point in time when what you said would have been absolutely correct
[19:16:36] bawolff: if a job runs multiple links updates (e.g. a recursive links update batch) there will generally be a commit+waitFor between such chunks in the Job subclass. And if you use the built-in runJobs.php then there will also often be waits between jobs. But in prod we use kafka -> http runjobs, one job each, so there aren't waits between jobs at that level.
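Stepping back to dbb's dump question: a sketch of pulling (page id, lat, lon) out of the geo_tags dump directly, without importing it into MySQL, based on the sample row quoted at [19:06:52] and the confirmation at [19:12:38] that the second field is the page id. The exact column layout should be checked against the CREATE TABLE statement at the top of the dump; the regex here is a rough convenience, not a full SQL parser.

# Stream (page_id, lat, lon) tuples out of a gzipped geo_tags SQL dump.
import gzip
import re

# Matches the start of a row, assuming the layout seen in the sample:
# (gt_id, gt_page_id, 'globe', primary, lat, lon, ...
ROW = re.compile(r"\((\d+),(\d+),'[^']*',\d+,(-?[\d.]+),(-?[\d.]+),")

def iter_geo_rows(path="enwiki-20230220-geo_tags.sql.gz"):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue
            for _gt_id, page_id, lat, lon in ROW.findall(line):
                yield int(page_id), float(lat), float(lon)

if __name__ == "__main__":
    for page_id, lat, lon in iter_geo_rows():
        print(page_id, lat, lon)

The page ids can then be joined against the page.sql.gz dump mentioned at [19:08:41] to recover titles.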
[19:16:36] It's common and recommended practice, when not in web requests, to batch writes and commit in batches. And pretty much all utility methods we provide will generally combine (commit + waitFor + re-open same type of transaction).
[19:16:57] bawolff: the danger of sticking around a project too long is that the world has progressed past the last snapshot your brain took of some part of it. ;)
[19:18:01] hopefully this chat will freshen things up :-)
[19:18:02] This code in GeoData looks like it is doing batching within a single links update. That LGTM. But yeah, if it really needs multiple db batches, maybe it's big enough to be a job instead of a deferred update. I can't think off-hand of other deferreds that use multiple db batches.
[19:18:37] but generally, if it is justified to batch into multiple commits, probably okay to do waitFor as well, especially given it's often implied.
[19:19:05] e.g. LBF::commitAndWaitForReplication or Maint::commitTrans
[19:19:11] Hmm, maybe some pages have a lot of coordinates or something
[19:19:23] My gut feeling is that this code would normally be inserting a single row into the db
[19:19:27] but that could be wrong
[19:19:56] In any case, even if it wasn't necessary, it definitely wouldn't be harming anything
[19:20:18] Depending on how fault tolerant it is, maybe it doesn't need a transaction at all then. Could use an AutoCommit deferred, and then just batch the delete() directly without any trx wrapper
[19:21:29] Hmm, added in 886a379c7ade38360be1 - guess it really was necessary
[19:23:48] looking at LinksUpdate and LinksTable.php in core, that seems to do the same thing
[19:23:55] batch writes and commitAndWait between
[19:24:08] e.g. for imagelinks and other N rows written
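A language-agnostic illustration (not MediaWiki code) of the "batch writes and commitAndWait between" pattern discussed above: write in chunks, commit each chunk as its own transaction, and pause for replication between chunks so replicas are not flooded. wait_for_replication and the geo_tags table here are stand-ins for something like MediaWiki's commitAndWaitForReplication() and the extension's real schema; with plain SQLite there are no replicas, so the wait is a no-op.

# Illustration of chunked writes with a commit-and-wait between chunks.
import sqlite3

def wait_for_replication(conn):
    # Placeholder: in a replicated setup this would block until replicas
    # have caught up with the primary; SQLite has no replicas, so do nothing.
    pass

def delete_rows_in_batches(conn, page_id, row_ids, batch_size=500):
    """Delete a page's rows in chunks, committing between chunks."""
    for start in range(0, len(row_ids), batch_size):
        chunk = row_ids[start:start + batch_size]
        placeholders = ",".join("?" * len(chunk))
        with conn:  # the sqlite3 connection commits this chunk on exit
            conn.execute(
                f"DELETE FROM geo_tags WHERE gt_page_id = ? AND gt_id IN ({placeholders})",
                [page_id, *chunk],
            )
        wait_for_replication(conn)  # the pause the chat is debating

# Tiny demo with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geo_tags (gt_id INTEGER, gt_page_id INTEGER)")
conn.executemany("INSERT INTO geo_tags VALUES (?, ?)", [(i, 42) for i in range(1200)])
delete_rows_in_batches(conn, 42, list(range(1200)))
print(conn.execute("SELECT COUNT(*) FROM geo_tags").fetchone())  # -> (0,)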