[15:49:58] I'm setting up a data transfer that will go from 32 partitions (~= servers) in codfw -> swift in eqiad, by default this uses 20MB/s/partition of network, but 640MB/s (~5Gbit) seems like a lot to just kick off. Not turning it on just yet anyway, but what would be an appropriate way to decide on a network usage cap?
[15:53:35] well, the eqiad<->codfw physical links are only ~10Gbps each
[15:53:40] so that's half
[15:53:55] librenms could tell you our usage patterns
[15:53:58] (of the links)
[15:54:01] https://librenms.wikimedia.org/bill/bill_id=24/
[15:54:22] sadly i can't login to librenms :)
[15:54:47] I'll open a task for that then, you should be able to
[15:54:51] thanks!
[15:55:45] ebernhardson: we usually peak at 4Gbps in the codfw -> eqiad direction, and we have 2x10G links active/active
[15:56:16] XioNoX: ok, so if i limit this at something like 2Gbps, it should leave plenty of room for everything else
[15:56:22] ebernhardson: yep
[15:56:28] XioNoX: thanks
[15:56:29] so when planning a large transfer we ideally need to rate limit it so that if one link fails, the other doesn't saturate
[15:57:06] makes sense
[15:57:42] then don't hesitate to ping us so we can monitor it, or let you know if there is any ongoing maintenance
[15:58:58] then load balancing is done based on the row the source server is in, so the more diversity the better
[15:59:00] expecting to run it later this week, it's only ~1.5tb so won't run super long
[16:00:18] cool!
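The figures quoted in the transfer discussion (32 partitions at 20MB/s each, a proposed 2Gbps cap, ~1.5TB of data, a ~4Gbps usual peak on 10Gbps links) can be sanity-checked with a short sketch. This is only a rough illustration: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), ignores protocol overhead, and the function names are hypothetical helpers, not part of any transfer tool.

```python
# Back-of-the-envelope check of the transfer numbers quoted in the log.
# Assumptions: decimal units, no protocol overhead, steady transfer rate.

PARTITIONS = 32            # source partitions (~= servers) in codfw
MB_S_PER_PARTITION = 20    # default per-partition network rate
DATA_TB = 1.5              # approximate size of the transfer
CAP_GBPS = 2.0             # proposed overall cap
PEAK_GBPS = 4.0            # usual codfw -> eqiad peak
LINK_GBPS = 10.0           # capacity of one physical link


def aggregate_gbps(partitions: int, mb_per_sec: float) -> float:
    """Total rate in Gbit/s when every partition sends at mb_per_sec."""
    return partitions * mb_per_sec * 8 / 1000


def transfer_hours(data_tb: float, rate_gbps: float) -> float:
    """Hours needed to move data_tb terabytes at rate_gbps."""
    total_bits = data_tb * 1e12 * 8
    return total_bits / (rate_gbps * 1e9) / 3600


def per_partition_mb_s(cap_gbps: float, partitions: int) -> float:
    """Per-partition rate in MB/s that keeps the total at cap_gbps."""
    return cap_gbps * 1000 / 8 / partitions


def fits_on_one_link(peak_gbps: float, cap_gbps: float, link_gbps: float) -> bool:
    """True if usual peak plus the transfer still fits on a single link."""
    return peak_gbps + cap_gbps <= link_gbps


print(f"{aggregate_gbps(PARTITIONS, MB_S_PER_PARTITION):.2f} Gbit/s")  # 5.12 Gbit/s
print(f"{transfer_hours(DATA_TB, CAP_GBPS):.2f} h")                    # 1.67 h
print(f"{per_partition_mb_s(CAP_GBPS, PARTITIONS):.2f} MB/s")          # 7.81 MB/s
print(fits_on_one_link(PEAK_GBPS, CAP_GBPS, LINK_GBPS))                # True
```

Under these assumptions the default settings would use about 5.12Gbit/s, a 2Gbps cap works out to roughly 7.8MB/s per partition, and the usual 4Gbps peak plus the capped transfer would still fit on one 10Gbps link if the other failed.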
[16:00:21] yeah that might be ~2h or so in the real world, if my mental napkin-math is right, at 2Gbps
[16:00:42] https://github.com/wikimedia/puppet/blob/9d2eb790e0ac25e068a35d1bcec66bebcb3bd554/modules/profile/manifests/librenms.pp#L155 looks like we can set people as read only
[16:09:33] ebernhardson: opened https://phabricator.wikimedia.org/T295700
[18:03:14] slides from today: https://commons.wikimedia.org/wiki/File:Shellbox_overview.pdf
[21:29:45] kostajh, urbanecm: I'm not sure how, but it seems https://gerrit.wikimedia.org/r/c/operations/puppet/+/730752 is running from deployment-mwmaint02 against production?
[21:30:50] Hey cwhite! I can definitely look at that. Why do you think it runs against production?
[21:31:02] That...makes no sense (and...cloud shouldn't have access to prod?l
[21:31:09] *?)
[21:31:13] Hence my confusion.
[21:32:30] Production api-gateway reports that it is ratelimiting requests from 172.16.3.92 (deployment-mwmaint02) every hour around the 27 minute mark
[21:33:05] API url config wrong?
[21:33:08] https://logstash.wikimedia.org/goto/04da2240cc93a7faf9e91f02d35e6c90
[21:33:16] Oh, right. Sorry, confused that with another job we run.
[21:33:34] This is correct, AFAIK it is supposed to use production API
[21:34:25] Of course, it shouldn't hit the rate limits. Should we decrease the request rate at beta?
[21:37:18] If that's easy enough, I'd say sure. The problem I'm attempting to fix is rate related.
[21:37:46] What's that problem if I may ask cwhite?
[21:38:28] https://phabricator.wikimedia.org/T295717
[21:38:45] thanks
[21:39:23] I'll make sure someone from Growth has a look ASAP.
[21:39:59] thanks a bunch :)
[21:40:05] no problem
[21:43:06] cwhite: also just to help us prioritize: what's the urgency? Can it wait a week or two, or should we treat it as an UBN?
[21:50:57] urbanecm: I hope someone will look sooner rather than later. The alert that is firing is one of the main indicators of functioning logstash.
[21:51:18] got it -- thanks
[22:09:47] if I'm going from role:insetup to something else (thumbor::mediawiki in this case), I don't need to reimage in the middle right?
[22:14:44] no, you don't have to
[22:14:51] thanks
[22:16:11] if you reimage with insetup and switch the role separately, then it doesn't have to work on first run during cookbook run
[23:01:30] legoktm: just reading https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts and I'm trying not to be super sad that I was actually supposed to dig through the code and document how chunked uploads work in 2013 but never did.
[23:02:10] also kudos for being a coordinator to actually tracking down and fixing that sneaky problem
[23:52:00] bd808: :( and :) - I have documenting how it works saved for some Friday when I get bored, but happy to be beaten to it by someone else
[23:52:31] * bd808 calls not it!
[23:53:41] it was honestly the very first MediaWiki task that RobLa gave me when I was hired and I can't find any notes from it. I think I talked to Aaron and then went and found other things that were less intimidating to look at.
[23:54:13] heh