[07:39:36] hi! do WMF wikis support cropping of images via the filepath? for example, https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Rotkehlchen_bird.jpg/446px-Rotkehlchen_bird.jpg is a full image: is there any way to crop that down to the bird only, specifically by the URL?
[07:39:42] the motivation is this: https://phabricator.wikimedia.org/T269818, so on-wiki solutions like {{css image crop}} and CropTool won't work
[07:54:03] no, that's not a facility that mediawiki provides
[07:55:35] Change on meta.wikimedia.org a page Tech was modified, changed by ArchiverBot link https://meta.wikimedia.org/w/index.php?diff=21757677 edit summary: [-1130] Bot: Archiving 1 thread (older than 30 days) to [[Tech/Archives/2021]]
[07:57:20] Ok, thanks, I'll feed back on that task
[07:57:38] great
[11:12:07] inductiveload: there were some discussions of doing that in relation to DjVu processing
[11:12:38] you can probably find some discussions initiated by alex_brollo where some people described the difficulties of such an approach
[11:14:22] Nemo_bis: i mean, it's not really an issue for me, in that I don't think it's too onerous to crop the image on the OCR tool side before farming out to the OCR program/API
[11:15:27] inductiveload: indeed that's the traditional approach, it just gets tedious when hundreds or thousands of images need to be extracted for a single book
[11:15:34] have you seen https://phabricator.wikimedia.org/T159640 and https://phabricator.wikimedia.org/T9757 yet
[11:15:54] in general, cropping images out of DjVus is a dreadful idea in terms of quality
[11:16:27] e.g. https://en.wikisource.org/wiki/File:Comparison_of_images_derived_from_DjVu_and_from_original_source.png
[11:16:41] so such a thing is only really a stopgap to doing it properly from upstream sources
[11:17:14] inductiveload: I mean illustrations which are interspersed within text in the transcriptions
[11:17:29] Uh, cscott has had https://phabricator.wikimedia.org/T37756 assigned for several years now :)
[11:17:54] and the same thing from a Google PDF: https://en.wikisource.org/wiki/File:Comparison_of_images_derived_from_DjVu_and_from_PDF.png
[11:17:54] https://ws-image-uploader.toolforge.org/ is an attempt to make the whole process of uploading images from a book a little less painful
[11:18:33] inductiveload: that's rather old DjVu with very high compression, not the fairest comparison :)
[11:19:15] Nemo_bis: pre bugzilla
[11:19:24] That probably needs unlicking
[11:19:26] inductiveload: I've added this piece of advice to the help page but I never got any comments: "The simplest way to increase quality is to change --bg-subsample (default 3, max 12) to 2 or 1 (best quality)" https://en.wikisource.org/wiki/Help:DjVu_files#Method_3_-_pdf2djvu
[11:19:46] Nemo_bis: it's pretty representative of many, many IA DjVus tho
[11:20:10] inductiveload: yes, true, but the easiest solution in such a case is to recreate the DjVu file (well, maybe that's what you were saying too)
[11:20:20] Indeed one should not extract illustrations from the DjVu itself if possible
[11:20:53] I recommend going to the IA and downloading the page images directly
[11:21:03] JP2 if you can, but the JPG is OK too
[11:21:08] RhinosF1: feel free to :) I don't do it myself as aklapper reminds me I'm not innocent myself ;)
[11:21:39] if the IA "source" is a Google Books PDF, then you are indeed out of luck
[11:21:52] inductiveload: the JP2 may be a derivative too, you need to check in the meta file which one is the source
[11:21:59] yeah :)
[11:22:26] if the scan is in colour, it's a good bet that the JP2 is the one
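As a rough sketch of that "check the meta file" step: an item's `_files.xml` listing (the naturalhistorymo00goss example comes up just below) records for each file whether it is an original or a derivative. A minimal Python example, assuming the `requests` library; the `source`/`format` fields reflect the usual `_files.xml` layout and should be checked against the actual file:

```python
# Sketch: list which files in an Internet Archive item are flagged as
# originals rather than derivatives, so you can tell whether the JP2s
# (or the PDF) are the actual source scans before downloading them.
import xml.etree.ElementTree as ET

import requests

ITEM = "naturalhistorymo00goss"  # example item from this discussion

resp = requests.get(f"https://archive.org/download/{ITEM}/{ITEM}_files.xml")
resp.raise_for_status()

for f in ET.fromstring(resp.content).findall("file"):
    source = f.get("source", "?")            # "original" or "derivative"
    fmt = f.findtext("format", default="?")  # e.g. a JP2 ZIP, PDF, DjVu
    print(f"{source:10} {fmt:40} {f.get('name')}")
```

Files flagged as original are the ones worth pulling down for cropping; the derivative sets have already been through the IA's own reprocessing.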
[11:23:12] and even for the "nice" scans, the DjVus are compressed with an MRC compressor, which is specifically harsh on images
[11:23:38] well, it's easier to just check https://archive.org/download/naturalhistorymo00goss/naturalhistorymo00goss_files.xml
[11:24:32] yes, I left some comments about the new MRC compression... it's nice but still being worked on
[11:24:52] the JPX PDFs are evil tho
[11:24:53] At some point it may become useful for us too (and the documentation is always interesting)
[11:25:10] and that's all the IA gives you these days
[11:26:00] it takes ~15 seconds per page to decode on my e-reader
[11:26:26] and it's noticeably laggy even on my desktop
[11:26:51] inductiveload: speaking of which, do you know an answer to https://softwarerecs.stackexchange.com/q/25497/13986
[11:27:00] yes, that depends a lot on the client
[11:27:29] no idea
[11:27:33] your e-reader might be relying on mupdf, it's useful to report such issues; cf. https://bugs.ghostscript.com/show_bug.cgi?id=696255
[11:27:45] it is indeed mupdf
[11:27:53] it does work, it's just deathly slow
[11:28:14] still useful to report test cases upstream :)
[11:28:30] 1 GHz Freescale Solo Lite CPU
[11:28:45] (I've not checked new ones; hard to believe the last time I checked was 6 years ago already)
[11:29:32] i moaned at the IA, they ignored me
[11:30:17] i would not be surprised if 10-15s is the limit for that CPU anyway, since it's 1-2s on a Ryzen 3600
[11:31:22] unless the device has JPEG2000 decode hardware, which I would be extremely surprised to see since $$$$
[11:32:14] Devs may not be aware of the real issues because they are very hard to reproduce without specific information, especially when hardware acceleration is involved
[11:32:24] How do those PDFs perform when uploaded to Commons?
[11:32:30] not well
[11:32:40] look at basically any Faebot upload
[11:32:47] they're all from the IA
[11:32:50] so they're all JPX
[11:33:04] Really? I thought he started those uploads before IA started generating those PDFs
[11:33:24] And they started from newspapers/periodicals IIRC. Did Faebot upload these too?
[11:36:00] https://commons.wikimedia.org/wiki/File:Medical_Heritage_Library_(IA_00110080RX3.nlm.nih.gov).pdf
[11:36:30] 3 JPX images per page, two RGB and a mask
[11:36:42] (because it's MRC'd)
[11:37:14] if they come from google originally they might be the original CCITT images, IDK
[11:39:16] if you just select a random page from inside a book that looks like it's not been cached, you can see how it goes
[11:39:47] * inductiveload uploaded an image: (28KiB) < https://libera.ems.host/_matrix/media/r0/download/matrix.org/ITJDlQMnsmMPqxiCtpTuWcfR/2021-07-19_123930_1029x94_screenshot.png >
[11:42:59] it seems a bit better recently - about 2-4 months ago, it was really bad, like 10s sometimes
[11:45:58] those numbers come from https://archive.org/details/0021005.nlm.nih.gov
[11:46:10] and that PDF was generated in 2014
[14:42:31] Nemo_bis: on the subject of the IA and DjVus, etc: what do you think about the IA-upload client-side JS thing?
[15:44:59] inductiveload: which one?
[15:46:18] for validate-on-type of filenames
[15:47:02] though there are quite a few tools I'd like to see in that UI in general in the longer run
[15:47:47] for example: choose authors from Wikidata, a licence selector, a category chooser, etc.
[15:56:56] inductiveload: I don't know... in the end, all that interface does is produce an XML file which is then rsync'ed forward. Surely it would be more sensible to interface with the API via the Python client?
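If "the Python client" here means the official `internetarchive` package, driving the IA API from it looks roughly like the following; the identifier, filename and metadata are placeholders, and credentials are assumed to be configured beforehand (e.g. with `ia configure`):

```python
from internetarchive import upload

# Placeholder identifier, file and metadata, for illustration only.
metadata = {
    "mediatype": "texts",
    "title": "Example scanned book",
    "creator": "Example Author",
    "licenseurl": "https://creativecommons.org/publicdomain/mark/1.0/",
}

responses = upload(
    "example-scanned-book-1900",              # item identifier, must be unique
    files=["example-scanned-book-1900.pdf"],  # local file(s) to send
    metadata=metadata,
)
print([r.status_code for r in responses])
```

`upload()` returns one HTTP response per file, so a tool built on it could report failures per file rather than only at the end.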
[15:57:42] right, but then you have to manage the state all on the server, reject and drop the user back at the form
[15:58:53] https://ws-image-uploader.toolforge.org/ does nearly everything client side and only really delegates to the server for the upload (because I do not know how to work OAuth2)
[15:59:23] which means you can have as-you-type autofill and all sorts of nice things
[16:04:43] the license is a bit annoying for one - I always forget to set it and leave it as {{pd-scan}}
[16:10:12] Lucas_WMDE: ping?
[16:10:17] hi
[16:10:34] can I DM you about... that thing
[16:10:37] sure
[16:39:41] o/ should https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines be updated? The guidelines say that message body lines should be <=100 chars long, and I configured my settings to hard wrap at 80, but I regularly see that people have to autoformat the message in my patches, presumably because the new Gerrit UI seems to hard wrap at 70 or so chars.
[16:41:24] hm, I thought the standard width was 72
[16:41:44] mszabo: people are the worst
[16:41:47] (https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html)
[16:42:44] yea, but the wiki page says "Wrap the message body so that lines are less than 100 characters long" and commit-message-validator doesn't complain either
[16:42:48] I guess it could be updated to say 72?
[16:43:45] my interpretation would be that 100 used to be accepted, at least
[16:43:45] looks like gerrit went down now that I dared insult its commit message display
[16:43:48] but maybe that's not the case anymore
[16:43:58] mszabo: scheduled maintenance, don't worry :)
[16:44:17] the page should probably at least mention 72
[16:46:44] "In startup we are not believe in downtime. We are call it planned maintenance." --DevOps Borat
[16:47:22] unknowingly planned maintenance
[16:47:58] Mentioning 72 makes sense. I thought the tolerance for 100 was for things like URLs?
[16:48:33] URLs in the first line of a commit message would be strange though
[16:48:57] Weren't we talking about the body only?
[16:49:23] oh, ok
[16:49:30] per wp:bold I updated the page to use 72
[16:49:41] URLs already seem to be explicitly exempted from the line length rules on the page
[16:49:57] once gerrit comes back I guess we can update commit-message-validator as well
[16:50:07] cool, was about to say just do it. thx
[16:50:13] "less than 72"? Did you mean "up to 72"?
[16:50:16] https://www.mediawiki.org/w/index.php?title=Gerrit%2FCommit_message_guidelines&type=revision&diff=4710845&oldid=4623112
[16:50:46] makes sense, Gerrit wraps after the 72nd char
[16:50:50] off-by-one error :)
[16:50:52] :)
[17:51:17] hm, looks like I cannot do a developer checkout of commit-message-validator
[17:55:33] ?
[18:05:31] nvm: https://gerrit.wikimedia.org/r/c/integration/commit-message-validator/+/705438
[18:07:01] Jerkins says no
[18:10:27] ironically it said no because of a line over 72 chars, heh
[18:10:35] :D
[18:46:30] alright, should be good now :)
[18:48:32] easy ride
[21:20:27] Change on meta.wikimedia.org a page Tech was modified, changed by Valp link https://meta.wikimedia.org/w/index.php?diff=21760467 edit summary: [+140] /* URL problem at French Wikisource */
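To tie off the commit-message width thread above, a minimal sketch of a 72-column body check. This is not commit-message-validator's actual implementation; the URL exemption is simplified to "skip any line containing ://", and the subject line is simply ignored.

```python
import sys

MAX_BODY_WIDTH = 72  # Gerrit wraps after the 72nd character


def long_body_lines(message: str) -> list[str]:
    """Report commit message body lines longer than 72 characters."""
    problems = []
    # Line 1 is the subject; body lines start from line 2.
    for lineno, line in enumerate(message.splitlines()[1:], start=2):
        if len(line) > MAX_BODY_WIDTH and "://" not in line:
            problems.append(f"line {lineno}: {len(line)} chars (max {MAX_BODY_WIDTH})")
    return problems


if __name__ == "__main__":
    for problem in long_body_lines(sys.stdin.read()):
        print(problem)
```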