[11:13:01] * Emperor sees T355914 and is sad
[11:16:09] (enwiki wants to change thumbs from 220 to 250)
[11:19:11] why sad?
[11:21:06] 250 isn't a pre-generated size; last time Amir.1 looked, only about 2% of thumb requests were for 250px. So I think making this change would cause us to have to generate new thumbs for most of the images on enwiki.
[11:21:23] I see
[11:21:43] which seems likely to result in thumbor going down in flames
[11:22:05] hnowlan: do you agree, as the person most often left holding the baby^Wthumbor?
[11:22:09] would it be possible to request an agreement across wikis?
[11:22:27] (assuming that would help)
[11:23:25] I don't even know whether most enwiki requests go to enwiki or to commonswiki
[11:23:56] I'm not sure it would; the extra disk from storing 250 as well as 220 is probably negligible, it's the spike in thumbor load that I'm worried about.
[11:24:17] I see
[11:25:19] what about mw-layer sampling (e.g. starting with 0.01% of requests and going up)?
[11:25:31] Emperor: yeah it would be a significant bump - due to some recent improvements it has gotten significantly *better* in terms of performance, but I'm assuming this would be a case of 100x the usual RPS
[11:25:54] some kind of pregeneration would be the way to do it
[11:26:00] I see
[11:26:35] jynus: that would (if it's easy to do) presumably have the effect of gradually generating the new-size thumbs rather than all at once.
[11:26:51] a percentage would also be good as it'd be easy to tune up/down
[11:27:17] I just don't like you being sad \o/
[11:27:18] hnowlan: as an aside, now that thumbor is on k8s, how easy is it to say "let's throw some more resource at thumbor"?
[11:27:45] Emperor: very easy, change number, line go up :D
[11:28:08] modulo how much spare capacity we have in k8s, presumably
[11:28:46] jynus: do you know how we would go about doing that sort of gradual ease-in and/or who we would need to talk to about it?
[11:29:07] I know some devs do it for features
[11:29:24] and well, there is some of that for k8s at the traffic level
[11:29:42] I would start by asking releng first
[11:29:58] MediaWiki Engineering would also be worth asking
[11:30:02] and they probably know best if it is possible and how
[11:30:06] aside from serviceops
[11:30:14] or how other teams do it
[11:30:50] for mw-on-k8s it's a simpler match; I would say we couldn't easily do it at the traffic layer. not impossible, but messier
[11:31:05] yeah, makes sense
[11:31:18] a feature flag for a percentage would be nice, although I'm not sure how you manage that consistently across anonymous users etc
[11:31:36] well, I would guess pure code
[11:31:56] unless someone already has support for it
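To make the sampling idea above concrete: a minimal sketch of what such a percentage flag could look like in CommonSettings.php, using the mt_rand approach that Amir1 confirms later in the log. $wmgThumb250SamplePercent is a hypothetical knob, not an existing WMF setting, and the thumbsize wiring is illustrative; the real patch would touch whatever setting actually drives the 220px default on enwiki.

```php
// Hypothetical sampling knob: start at 0.01% of requests and tune up.
$wmgThumb250SamplePercent = 0.01;

// Per-request sampling: 0.01% of 1,000,000 draws is a threshold of 100.
$idx = array_search( 250, $wgThumbLimits );
if ( $idx !== false && mt_rand( 1, 1000000 ) <= $wmgThumb250SamplePercent * 10000 ) {
	// Make 250px the default thumb size for this request only.
	$wgDefaultUserOptions['thumbsize'] = $idx;
}
```

Because the draw happens per request, a given reader can flip between 220 and 250 across page views, and it pollutes the parser cache - exactly the caveats raised at 11:31:18 above and at 15:17:53 below.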
[11:31:59] hnowlan: in your boundless free time, could you add something to https://wikitech.wikimedia.org/wiki/Thumbor on how to give more resource to thumbor if necessary?
[11:32:35] Hm, so that's releng and/or serviceops and/or mediawiki engineering
[11:32:55] I'd say more "and" than "or" :-D
[11:33:52] I (possibly incorrectly) don't want to bug everyone everywhere all at once :)
[11:34:08] well, start with serviceops as it is closer (?)
[11:34:26] Mmm
[11:34:43] but as you may know, they may not have the mw expertise
[11:35:01] but I think they should be in the loop
[11:35:04] Mmm
[11:35:10] we are already :)
[11:35:22] ah, sorry, hnowlan, I didn't put you there
[11:38:38] hnowlan: do you have a view on the next people to consult? I can leave it with you and serviceops for now if you like, but don't want to seagull...
[11:41:10] (nor do I want to tread on your toes!)
[11:41:40] I'd say someone in mediawiki engineering is the next port of call as regards what our options are
[11:41:49] the "who" of it all is a bit sticky :D
[11:43:20] I wonder if they have a team interface page
[11:44:21] no, it's WIP
[11:44:32] I'll ask a Wrong Person and they can point me in the right direction :)
[11:48:13] thanks!
[11:49:48] hnowlan: would you like to be CCd?
[11:50:13] arnaudb: is there anything I can help with on T356240 that makes sense (e.g. backup1 hosts or something like that)?
[11:53:47] Emperor: please!
[11:59:26] Hey keepers of the data, I'm going to raise traffic to mw-on-k8s to 35% today, and j.oe brought up that we should keep an eye on the number of connections to the DB due to us using a larger number of smaller worker pools. Have you seen anything that would warrant concern about this? For reference, the last 5% bump was on January 23rd
[12:04:24] hnowlan: YHM
[13:14:16] jynus: sorry, I was afk lunching! as for T356240: you can take any hosts you're OK dealing with; if you have some time to spare, it'll be appreciated!
[13:15:08] well, I don't usually have time; I was only asking in case it involves backup1 or something like that
[13:15:34] it looks like it does not, will ping you if I stumble upon an impacted host!
[13:15:36] I thought I would have to reboot mediabackups, but apparently I don't
[13:15:50] no problem
[15:15:59] Emperor: jynus hnowlan I know how to do this, I've done it before for other changes
[15:16:26] I can't file the patch, but you can just put an mt_rand in CommonSettings.php
[15:17:27] oh cool :)
[15:17:53] It'll cause cache pollution but whatever :D
[15:18:22] we did it when deploying the new skin on enwiki
[15:18:57] jynus: if you want to coordinate here for T316655 feel free to ping me when you want to do the manual run. And sure, we can rename/delete/move the files and do a restore. I'd be happy to try the restore as I'm rusty and it's good to know it ;)
[15:18:57] T316655: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655
[15:21:06] volans: I am creating the backup, thank you
[15:21:12] will ping you when done
[15:21:22] <3 thx
[15:22:37] Amir1: ah, useful to know, thanks
[15:32:03] volans: I haven't commented because I don't think it's my place, but one of the reasons I suggested the mariadb structure is that it makes that possible, and it is quite flexible
[15:32:28] maybe at some point we could support postgres and its monitoring (the most important part) in wmfbackups
[15:32:44] the same way we want to support gitlab exports
[15:34:48] volans: can you move or remove the original files now (or I can do it)?
[15:35:59] I will ask you for a hash of them, too
[15:36:51] (on the plus side, if the backup fails and we get a 0-byte backup, it will now complain)
[15:37:17] nice!
[15:37:33] but it is not as nice as the wmfbackup heuristics :-D
[15:38:29] ehehe, which host have you backed up?
[15:38:36] oh
[15:38:38] I've moved the files on netboxdb2002
[15:38:54] I had actually run it on eqiad
[15:38:56] as I saw
[15:39:04] it's the same
[15:39:10] psql-all-dbs-2024-01-31-14-55.sql.gz
[15:39:16] on the full backup
[15:39:23] 98a7c3851bfa471c48a02bcfacbbf1f5  psql-all-dbs-latest.sql.gz
[15:39:39] feel free to delete it right away - and trust your code!
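The hash above is what enables the "md5 matches" check later in the log: record the checksum before the originals are moved aside, then compare it against the restored copy. A sketch of that comparison, assuming the original's directory (only /srv/postgres-restores is named in the conversation); the helper is illustrative, not an existing script.

```php
<?php
// Compare the original dump (moved aside before the restore test) with
// the copy Bacula restored. The original's directory is an assumption.
$original = '/srv/postgres-backups/psql-all-dbs-latest.sql.gz.moved'; // assumed path
$restored = '/srv/postgres-restores/psql-all-dbs-latest.sql.gz';

if ( md5_file( $original ) === md5_file( $restored ) ) {
	echo "restore matches the original\n";
} else {
	fwrite( STDERR, "MISMATCH - do not delete anything yet\n" );
	exit( 1 );
}
```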
[15:39:57] is that codfw?
[15:40:02] I can run it too
[15:40:04] that's eqiad
[15:40:06] ok
[15:40:15] I've renamed them to .moved
[15:40:16] ready to restore when you tell me
[15:40:19] but we can delete too :D
[15:40:39] psql-all-dbs-2024-01-31-14-55.sql.gz.moved and psql-all-dbs-latest.sql.gz.moved
[15:42:09] uh, I got an error
[15:43:08] netbox1002.eqiad.wmnet-fd JobId 550808: Error: Missing private key required to decrypt encrypted backup data.
[15:43:39] oh oh...
[15:45:25] that's weird, because we tested restores after the puppet 7 migration
[15:45:35] and restores have happened since then
[15:46:12] ah, level 8 error
[15:46:19] layer 8
[15:46:26] one sec, wrong host
[15:46:40] lol :D
[15:46:41] ok
[15:46:54] I tried to recover to netbox, not netboxdb
[15:47:01] better. also, which command are you running for the restore?
[15:47:11] I think the confusion is understandable
[15:47:14] restore
[15:47:15] L8 issues are the easiest to fix :D
[15:47:27] it is actually well documented if you search for "panic mode" on wikitech
[15:47:48] let me share the screen for you on a meet
[15:47:58] nah, no need
[15:47:59] (I promise not to talk)
[15:48:00] I know the docs
[15:48:20] and nice shortcut to find it :D
[15:48:22] I like to do restore
[15:48:28] and then 3
[15:48:40] because I like using check_bacula.py
[15:48:48] and working with job ids
[15:48:50] but that is me
[15:50:05] 46.42 M OK
[15:50:30] let me delete some empty dirs on netbox1002
[15:51:12] md5 matches
[15:52:33] yeah, and clearly it is not a 1kb file, which is what would happen with softlinks
[15:52:41] (on storage, I mean)
[15:53:15] consider, if you like, attempting a restore on a staging db
[15:53:27] but otherwise, let's amend and send the other patch
[15:53:41] yes, I can restore that on netbox-next, but zless looks good
[15:54:12] I would worry if md5 matched but zless returned garbage!
[15:54:20] indeed
[15:54:37] not impossible (google did it with pdfs) but not trivial
[15:55:01] I will leave the restore there unless you tell me otherwise
[15:55:12] ok, renamed the files back to their original names
[15:55:17] I'll take care of that, no worries
[15:55:18] thx
[15:55:36] just to be 100% sure, you md5summed the file on /srv/postgres-restores
[15:55:38] right?
[15:55:39] last question: if the host is totally lost, could we restore it elsewhere?
[15:55:44] yep
[15:55:56] yes of course :)
[15:55:59] it just needs the private key; it's the section below panic mode
[15:56:24] that is why I didn't worry about the message - we have a backup for the process :-D
[15:56:33] thank alex for the good setup!
[15:57:04] and that is also, for context, why I am a pain to work with when changing certs
[15:57:36] lol
[15:57:42] we cannot just switch to a different one because it would take 3 months to rotate it on all older files
[15:58:07] possible, just painful :-D
[15:58:11] I took the liberty of adding a shortcut to the top, feel free to change it at will if you don't like it
[15:58:14] https://wikitech.wikimedia.org/wiki/Bacula
[15:58:27] yeah, makes sense
[15:58:48] normally I think the idea was to get there from the team's runbook portal, it has better accessibility
[15:59:10] I also have to separate runbooks from design, based on feedback
[15:59:24] for everything backups, but that will take some work
[15:59:31] todo never ends
[15:59:43] ok, let me send the patch and we are almost there
[16:00:48] do you have the profile of netboxdb handy?
[16:01:34] hieradata/common/profile/netbox/db.yaml ?
[16:02:30] thanks, it is just a small edit to modules/profile/manifests/netbox/db.pp
[16:10:47] see patchset 3
[16:12:48] mmh, does 'Daily-productionEqiad' work for codfw too?
[16:12:57] we do backups on both hosts
[16:13:00] yes
[16:13:02] ok
[16:13:39] eventually there will be a difference here, so non-default defaults may need a change, but this is ok for now
[16:14:02] ack, makes sense
[16:14:18] e.g. the idea is to back up eqiad on codfw and codfw on eqiad only
[16:14:34] but for now, backing up on the primary and cloning is the default
[16:14:41] ack
[16:15:40] at what time will bacula copy the file?
[16:16:10] given we do an ln -f it's almost atomic, but avoiding overlap might be good anyway :D
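On the "almost atomic" remark: with GNU ln, `ln -f` unlinks the existing destination and then creates the new hard link, so there is a brief window in which "latest" does not exist at all. The fully atomic variant creates the link under a temporary name and rename()s it over "latest", since rename() within one filesystem replaces the target atomically. A sketch under those assumptions; this is not the actual dump script.

```php
<?php
// File names follow the dump naming seen above; the temporary name is
// hypothetical, and it is assumed not to exist already.
$dump   = 'psql-all-dbs-2024-01-31-14-55.sql.gz';
$latest = 'psql-all-dbs-latest.sql.gz';
$tmp    = $latest . '.tmp';

link( $dump, $tmp );     // new hard link under the temporary name
rename( $tmp, $latest ); // atomic swap: readers always see a valid "latest"
```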
[16:17:53] these are the jobs on codfw (including their termination times): https://phabricator.wikimedia.org/P55975
[16:19:53] ok, so early morning but fairly distributed. I think we're good for the daily one on netboxdb1002; as for netboxdb2002, we do hourly there so it might overlap, but that shouldn't be a big deal
[16:20:04] here are the eqiad times: https://phabricator.wikimedia.org/P55975#226521
[16:20:38] currently it is scheduled for 04:05
[16:21:00] but it is not the only one at that time, ofc
[16:21:16] ok, I'd say we're good
[16:22:39] thanks a lot for the support on this!
[16:22:39] one thing that would probably be good is some kind of distributed locking mechanism
[16:23:00] not only for postgres, but for other processes
[16:23:02] I'll ping data-eng on the task about changing the config for their host
[16:23:26] on my side I will keep an eye on the first runs
[16:23:38] feel free to cc me on those changes
[16:23:53] if I see them, for sure
[16:24:33] as for https://gerrit.wikimedia.org/r/c/operations/puppet/+/994743, that can go
[16:24:36] I +1
[16:25:33] 'ed it
[16:26:00] thanks!
[16:26:02] thank you for working on it. I think I am happy with that, and longer-term improvements can come later on
[16:26:25] like a proper, integrated backup framework and monitoring beyond mariadb
[16:26:51] but being too ambitious at first == doing nothing, so I am happy with this
[16:28:11] :D
[20:42:23] (SessionStoreOnNonDedicatedHost) firing: (2) Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
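Returning to the distributed-locking aside at 16:22:39: a genuinely distributed lock would need a coordination service (etcd, zookeeper, a database row, ...), but for the same-host overlap discussed just before it - the scheduled dump versus a manual run - a plain flock() guard already gives mutual exclusion, provided every participant takes it. A minimal sketch; the lock path is hypothetical.

```php
<?php
// Take an exclusive, non-blocking advisory lock; bail out if another
// run already holds it. The lock path is hypothetical.
$fh = fopen( '/var/lock/netbox-dump.lock', 'c' );
if ( $fh === false || !flock( $fh, LOCK_EX | LOCK_NB ) ) {
	fwrite( STDERR, "another dump run is in progress, bailing out\n" );
	exit( 1 );
}

// ... run the dump, update the "latest" link, etc. ...

flock( $fh, LOCK_UN ); // released automatically on process exit anyway
fclose( $fh );
```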