[00:17:56] !log toolsbeta rebooting the control plane nodes for kubernetes because it can't make things worse T289390
[00:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[00:18:01] T289390: Certificate generation is broken in toolsbeta - https://phabricator.wikimedia.org/T289390
[18:39:25] o/ I didn't see any information about quotas on https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases
[18:39:33] are there some guidelines around that?
[18:40:45] I guess the usual "be reasonable"
[18:43:09] there have been phab tasks opened about excessive database size
[18:43:23] I have no idea what's a reasonable size limit for a Toolforge user table.
[18:44:06] There's a usage page somewhere
[18:45:29] https://phabricator.wikimedia.org/T224152
[18:46:39] tgr: do you have any specific large datasets to store there?
[18:46:48] the largest rn are 400 GB, 100 GB, and 90 GB
[18:46:56] the rest are below 50 GB
[18:47:36] ( https://tool-db-usage.toolforge.org/ )
[18:47:38] we're hoping to bring managed individual Trove databases (currently available for Cloud VPS projects) to larger Toolforge users at some point, but not yet
[18:55:03] majavah: I was thinking of making a dataset of image captions. I can start with smaller wikis and move upwards, just need to know when to stop.
[19:01:37] image captions... from wiki uses?
[19:04:37] yes, something like the imagelinks table but with most of the image markup parameters included
[19:04:55] it could be interesting
[19:05:08] lots of parsing ahead, though
[19:05:32] I'm hoping mwparserfromhell can deal with it, admittedly haven't looked into it yet
[19:10:08] it may be better to parse from the generated HTML
[19:10:16] albeit that means lots of requests :/
[19:12:20] yeah, I don't think that's realistic
[19:12:58] we'll have HTML dumps some day but I don't want to wait until then
[19:15:52] tgr: I'd like to have at least some general direction on how much data you'd like to store; plus toolsdb is a rather undersized resource and there might be other scaling issues with large amounts of data, which makes me really hesitant to promise anything
[19:17:18] I have no idea how much data it is going to be. If it's too much I can just stop and use a sampled dataset or find another method. I just need to know how much is too much.
[19:18:02] There's a Cloud VPS project too as an option w/ Trove
[19:18:04] it can probably be estimated by doing 1-2 wikis and extrapolating
[19:18:27] you want to check all pages or only content namespaces?
[19:20:44] content only
[21:07:11] on eswiki, that would be nearly 10M pages
[21:07:26] SELECT DISTINCT COUNT(il_from) FROM imagelinks WHERE il_from_namespace IN (0, 104) → 9959234
[21:11:37] 10M rows, you mean
[21:12:05] I assume most of those don't have any captions though.
[21:12:48] and in any case caption length varies wildly, so even with the number of rows known, it's hard to make an estimate.
[21:13:15] Probably way below 50 GB though.
[21:26:21] I mean 10M pages embedding at least one image
[21:26:52] I was mostly concerned about processing
[21:27:23] processing one page will take a similar amount of time, whether it has 1 image or 20
[21:27:46] in fact, some of those will be non-existent images
[21:28:28] if it took us one second per page, that's 115 days = 3.8 months
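
For the wikitext route discussed at 19:05 (mwparserfromhell), a minimal sketch of what caption extraction could look like. The namespace prefixes, the option-keyword list, and the "last unnamed parameter is the caption" heuristic are simplifications added here for illustration, not anything agreed on in the discussion; real image-link syntax has more cases.

```python
# Minimal sketch: extract image links and a best-guess caption from wikitext
# with mwparserfromhell. The prefix/keyword lists and the "last unnamed
# parameter is the caption" heuristic are simplifications of real image syntax.
import mwparserfromhell

IMAGE_NS_PREFIXES = ("file:", "image:", "archivo:", "imagen:")  # en + es aliases; extend per wiki
OPTION_KEYWORDS = {"thumb", "thumbnail", "frame", "frameless", "border",
                   "left", "right", "center", "none", "upright"}

def image_captions(wikitext):
    """Yield (file title, parameter list, caption guess) for each image link."""
    code = mwparserfromhell.parse(wikitext)
    for link in code.filter_wikilinks():
        title = str(link.title).strip()
        if not title.lower().startswith(IMAGE_NS_PREFIXES):
            continue
        params = [p.strip() for p in str(link.text).split("|")] if link.text else []
        # Caption guess: the last parameter that is neither a known option
        # keyword nor a name=value option such as alt=... or upright=1.2.
        caption = next(
            (p for p in reversed(params)
             if p and "=" not in p and p.lower() not in OPTION_KEYWORDS),
            "",
        )
        yield title, params, caption

for title, params, caption in image_captions(
    "[[File:Example.jpg|thumb|upright=1.2|A sample caption]]"
):
    print(title, caption)
```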
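The HTML route raised at 19:10 (and dismissed as unrealistic at scale, since it costs a request per page) could look roughly like the sketch below. The endpoint and CSS selectors are assumptions to verify per wiki: Parsoid output wraps captions in `<figcaption>`, while legacy parser output uses `div.thumbcaption`.

```python
# Sketch of the HTML route: fetch rendered HTML via the action API and pull
# caption elements out of it. Selectors depend on which parser generated the
# HTML (Parsoid: <figcaption>; legacy: div.thumbcaption) -- verify per wiki.
import requests
from bs4 import BeautifulSoup

API = "https://es.wikipedia.org/w/api.php"

def html_captions(title):
    resp = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": 2,
    }, headers={"User-Agent": "caption-dataset-sketch (example contact)"})
    resp.raise_for_status()
    html = resp.json()["parse"]["text"]
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True)
            for el in soup.select("figcaption, div.thumbcaption")]

print(html_captions("Madrid"))
```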
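On the size/time estimate: the query quoted at 21:07 (`SELECT DISTINCT COUNT(il_from) ...`) returns the number of matching imagelinks rows, not distinct pages; counting pages that embed at least one image would use `COUNT(DISTINCT il_from)`, which matches the "10M rows, you mean" exchange. A rough extrapolation sketch against the wiki replicas, assuming the `toolforge` Python package's `connect()` helper is available; the bytes-per-page figure is a hypothetical placeholder, not a measured value.

```python
# Rough sketch for estimating dataset size and processing time from one
# wiki's replica, then extrapolating. Assumes the `toolforge` package on
# Toolforge; AVG_BYTES_PER_PAGE is a hypothetical placeholder.
import toolforge

conn = toolforge.connect('eswiki')  # read-only wiki replica

with conn.cursor() as cur:
    # Pages in the content namespaces (0 and 104 on eswiki) that embed at
    # least one image -- note COUNT(DISTINCT ...), not DISTINCT COUNT(...).
    cur.execute(
        "SELECT COUNT(DISTINCT il_from) "
        "FROM imagelinks "
        "WHERE il_from_namespace IN (0, 104)"
    )
    (pages,) = cur.fetchone()

AVG_BYTES_PER_PAGE = 200          # hypothetical: captions + keys per page
estimated_gb = pages * AVG_BYTES_PER_PAGE / 1024**3
print(f"{pages} pages, ~{estimated_gb:.1f} GB for eswiki")

# At one second of parsing per page, processing time in days
# (the log's "one second per page -> 115 days" figure for ~10M pages):
print(f"~{pages / 86400:.0f} days at 1 page/second")
```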