[09:00:39] gehel: I'll be 10 minutes later
[09:00:49] iemarjay: ack
[09:01:12] Late*
[16:04:35] \o
[16:04:41] o/
[16:15:50] tgr|away: missed your question the other day, the -partitioned job queue should be the "real" jobs. What's happening here is eventgate is reading a 1-partition event stream and writing it out to a new stream, but with explicit partitioning by cluster. This mostly allows one cluster to slow down writes without slowing everyone else down. We do the same for links jobs and mediawiki
[16:15:52] sql dbs.
[16:19:19] I'm alone with the kids tonight. I'll skip the unmeeting and start the weekend a bit early
[16:19:26] enjoy!
[17:24:35] started up a restore swift->relforge, seems to be going ok. It's reading back at the same ~90MB/s it went in, should be done this afternoon and then I can try a few queries. Not sure how else to validate that the snapshot/restore worked, really
[18:44:57] In case anyone is interested, my attempt at munging the WCQS dump ended up crashing due to running out of heap, without producing any munged data.
[18:58:48] ebernhardson: heads up, I do need to do a rolling restart of relforge (but it is much less important than the restore of the commonswiki_file, obviously). So if it finishes up early enough today for me to be clear to restart before eod let me know, otherwise I can just do the restarts on monday
[18:59:15] ("it" meaning the restore of swift->relforge ofc)
[20:06:28] ryankemper: sure, you can see status with http://localhost:9200/_cat/recovery?active_only=true from one of the instances, it's currently doing shards 24-29 out of I think 34
[20:21:53] sweet, I'll check up on that
[20:34:39] one thing i wonder about for doing this for real is cross-dc traffic, this only does ~90MB/s, but it's also only two instances. Default limits are 20MB/s/partition, this is doing 6 at a time but prod will do all 32 which is more like 5 gigabits. Probably need to push the rate limit lower but not sure to where
[21:06:58] i suspect librenms.wikimedia.org has graphs, but i don't have access. I see mention elsewhere ~2 months ago that we typically only use 1.5gbit eqiad<->codfw. For a 10gbit+ link it should be fine and not cause any trouble then, but i guess i should put this in a ticket and ping someone to make sure
[21:54:42] ebernhardson: what graphs are you looking for specifically? graphs of historical throughput on the link or something else?
[21:57:55] ryankemper: mostly i was wondering, if it sends 5gbit from elastic codfw -> swift eqiad, does that push any limits?
[21:58:28] hmm, actually i bet that all has to run through the swift frontends so there might be a narrower part
[22:00:33] i'm imagining we should set a lower limit, but not really sure how to choose it
[22:04:14] I wonder which team owns swift here. I'd guess either service-ops or infrastructure foundations
[22:04:32] (assuming that it's even owned by a specific team, but I imagine someone must be keeping it from falling over :P)
[22:05:59] definitely will need some help from someone because thinking about all the pieces makes my head spin :P just to confirm: today we're just doing the xfer to relforge and verifying that works, but presumably not doing the xfer over to elastic eqiad and cutting back over until monday or so?
[22:14:12] ryankemper: re timing, yes.
I did relforge->swift->relforge (same-dc) to see if it would appear to work; to do it for real we need to do a plugins deploy and perhaps get some appropriate swift credentials (i re-used the analytics creds for data shipping to test with)
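A minimal sketch of how the 20:06 recovery check and the 17:24 "not sure how else to validate" question could be scripted together: poll the same _cat/recovery endpoint mentioned above until no shard recoveries are active, then compare the restored document count against the source cluster's count. The host, index name, and expected count below are assumptions, not from the log.

```python
#!/usr/bin/env python3
"""Sketch: watch a snapshot restore and sanity-check the result.

Assumptions (not from the log): Elasticsearch is reachable on
localhost:9200 and the restored index is named 'commonswiki_file'.
EXPECTED_DOCS would come from the source cluster's _count output.
"""
import time
import requests

ES = "http://localhost:9200"
INDEX = "commonswiki_file"        # assumed index name
EXPECTED_DOCS = 50_000_000        # placeholder: doc count taken from the source cluster

def active_recoveries():
    # Same endpoint mentioned in the log, with JSON output for parsing.
    r = requests.get(f"{ES}/_cat/recovery",
                     params={"active_only": "true", "format": "json"})
    r.raise_for_status()
    return r.json()

# Poll until no shard recoveries remain active.
while True:
    active = active_recoveries()
    if not active:
        break
    for rec in active:
        print(rec["index"], "shard", rec["shard"], rec["stage"], rec.get("bytes_percent"))
    time.sleep(60)

# Rough validation: compare the restored doc count against the source cluster.
count = requests.get(f"{ES}/{INDEX}/_count").json()["count"]
print(f"{INDEX}: {count} docs (expected ~{EXPECTED_DOCS})")
```

Running a few queries, as mentioned at 17:24, plus a doc-count comparison like this is probably the quickest sanity check short of a full content diff.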