[09:08:27] rzl: just read your "Production Training Design Proposal". it's _superb_, and exactly what i as a still-pretty-new SRE need
[09:12:30] * elukey knows that he will never ever get a "superb" from kormat
[09:14:45] * kormat feels she's done a good job setting expectations
[09:20:29] * elukey can confirm
[09:20:51] :D
[09:25:33] <_joe_> oh yes my expectations were set appropriately during the interview, even
[09:26:22] hahah
[15:46:04] topranks: I have a question that might sound silly
[15:46:56] mm let me grab my graphs
[15:58:21] kormat: \o/ if you haven't seen it yet, you might also be interested in https://docs.google.com/document/d/1tjwg23lOMQncKTvjj2YYyFgeA-HNtlPybyTgcInuzVs/edit which is still being actively worked on
[15:59:42] effie: lol go ahead whenever you have all your presentation materials ready :)
[16:02:06] the really short version is that
[16:02:58] mw servers had the ability to stampede on a memcached shard, exhausting its available bw (we had 1G cards)
[16:03:29] mcrouter, the software on each mw server doing the sharding, would block a shard temporarily if it was unable to connect to it (due to the stampede)
[16:03:45] the infamous TKOs
[16:03:47] created 2 VMs in each DC and .. there were enough IP addresses for it to just work, seems like netops turned them all into /24s and they were smaller before? nice, thank you, it wasn't like that a little while ago
[16:04:00] so now in eqiad we have 10Gs
[16:04:33] and we were hoping never to see them again. a few TKOs have surfaced since the switchover yesterday, but apart from that
[16:05:16] I see some tcp retransmissions happening, my question is, is it ok to have eg this
[16:05:33] https://grafana-rw.wikimedia.org/d/A0xQ-EI7k/xxxx-effie-tcp-retransmissions?orgId=1&from=1631715269766&to=1631715864261
[16:05:54] now what we see here is:
[16:07:33] mw1422 registered a tko to 10.192.48.77 (which is in codfw)
[16:08:11] but at the same time I see a small burst of retransmissions in mw* and memcached servers
[16:08:22] Ideally we'd have no TCP retransmissions (at least caused by the network) at all.
[16:08:42] Certainly none within a DC, and given we have private WAN links we should have enough control to prevent them there.
[16:08:44] I know for a fact that there is no way that we exhausted 10.192.48.77's bw, because we send very little memcached traffic to codfw
[16:09:40] That said, the uplinks from switch rows to CRs (used for inter-row traffic and inter-site traffic) do have drops on them fairly regularly, despite the fact the links themselves are a good way from being saturated.
[16:10:15] The recent buffer changes we did in eqiad were aimed at that, but while they made a big difference the drops didn't go away completely.
[16:11:12] The majority of the TKOs you see (apologies if I get the terms wrong here), are they for servers in the same DC? Or remote ones (say eqiad registering a tko to codfw, as in the example)?
[16:15:09] let's leave the TKOs out of the problem for now, to simplify it
[16:15:49] if it makes sense to chase up the retransmission errors
[16:16:03] I can revisit the TKO problem after
[16:18:37] I'm trying to do just that here yeah. It does make sense to try to get to the bottom of them.
[16:18:57] If the cause is the discards on row A uplinks to the CRs then there is not much we can do (short term). However we're working longer-term on a new design that should prevent that (largely by adding BW to those links, and taking row->row traffic off them and onto a dedicated spine layer).
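For reference, a rough sketch of how the per-host numbers behind those graphs can be checked. These are standard iproute2/netcat commands and the shard IP is the one from the example above; mcrouter's listening port (11213) and the exact form of its "stats suspect_servers" admin query are assumptions about this setup, and nc flags vary between netcat variants:

    # kernel-wide TCP retransmission counters (the kind of counters a
    # node_exporter-based dashboard like the one linked above is built on)
    nstat -az TcpRetransSegs TcpExtTCPLostRetransmit

    # per-connection view towards the shard in question; sockets that have
    # retransmitted show a "retrans:<current>/<total>" field
    ss -ti dst 10.192.48.77

    # destinations the local mcrouter currently has marked TKO
    echo "stats suspect_servers" | nc -q 2 localhost 11213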
[16:19:48] If something else is causing the retransmissions we'd best know about that though, so it's good if we can try to verify what the root cause of them is.
[16:20:45] well, if you zoom out eg last 12h
[16:21:04] you will see that this is not a rare occurrence
[16:22:12] Yeah. Very roughly the pattern of those over time matches the rise/fall in discards we see towards the CRs. For instance row A discards over the past 24 hours:
[16:22:13] https://librenms.wikimedia.org/graphs/to=1631722200/id=15267/type=port_errors/from=1631635800/
[16:22:22] https://librenms.wikimedia.org/graphs/to=1631722200/id=15279/type=port_errors/from=1631635800/
[16:22:51] https://librenms.wikimedia.org/graphs/to=1631722200/id=15281/type=port_errors/from=1631635800/
[16:24:55] ah!
[16:25:11] so it is a known issue
[16:26:03] yes. sad face :(
[16:26:28] But it's a long-term fix, and the last thing we want to do is dismiss this as "part of the big problem" just in case there is something else going on.
[16:26:43] there is a chance that those TKOs are related, but I can't be sure unless those retransmissions clear up
[16:27:03] the pattern matches somewhat
[16:27:19] To troubleshoot another issue (backup speeds) I temporarily changed the GW IP one of the backup servers was using. I'm unsure if we could do the same here.
[16:27:48] Basically cr1 in eqiad is the VRRP master for row A private (for instance). So it has IP 10.64.0.1
[16:28:02] the blast radius would be considerably larger if we did this here
[16:28:21] in case something does not go as planned I mean
[16:28:21] If we change the GW mw1422 is using from that IP to 10.64.0.3, which CR2 is using, then packets it sends will use the currently idle links from switches to CR2.
[16:28:29] yes indeed.
[16:28:57] we can always limit it to a handful of servers and watch them
[16:29:09] I think the change is relatively safe, however I figure puppet would probably fix anything like that done on the next run (in the case of backups that didn't matter, as I could re-test immediately; here we'd want to leave it a while and see if the graphs are clean).
[16:29:10] we can live without 3-4 mw servers
[16:32:47] if you want to follow it, we can depool a few servers, change their gw and pool them back
[16:33:13] up to you, it is not something urgent anyway
[16:33:40] sounds like a plan. It'd be useful to us to confirm if it's the cause anyway.
[16:34:01] Only thing I'm unsure of is the "change the gw" bit... might have to ask someone with more knowledge how to accomplish that.
[16:34:13] so bear with me if you can.
[16:37:39] I can bear with you
[16:38:34] I would guess it should be a few changes in our dhcp config, but moritz or john would know better
[16:39:25] I never wondered how to change a default gw around here
[16:41:26] I've asked said experts in our channel there, it may be tomorrow till we can try, I know John is off for the week.
[16:42:08] The config on the servers is hard-coded after the initial install, so I don't think we can do it via DHCP (which would be handy actually, as we could use the MAC there).
[16:43:33] pff
[16:43:39] :p
[17:02:36] yes sorry, acronyms
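A rough sketch of what the test discussed above could look like on one of the handful of depooled mw hosts. The ip route commands are standard; depool/pool are assumed to be the usual conftool wrapper scripts on these servers, and the addresses are the ones from the conversation:

    # drain the server from the load balancers before touching its routing
    sudo depool

    # the default route currently points at the row A VRRP address that cr1 answers for
    ip route show default            # expect: default via 10.64.0.1 dev ...

    # point it at cr2's own address instead, so egress traffic uses the currently
    # idle switch->cr2 uplinks
    sudo ip route replace default via 10.64.0.3

    # leave it like this for a while and watch the retransmission/discard graphs

    # revert and repool
    sudo ip route replace default via 10.64.0.1
    sudo pool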
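On the "hard-coded after the initial install" point: on a typical Debian install the gateway ends up as a static line in /etc/network/interfaces, so a DHCP change (e.g. to `option routers`) would only be picked up at reinstall time, and a runtime `ip route replace` won't survive a networking restart either. Illustratively (interface name and address are placeholders, not the real host config):

    # /etc/network/interfaces fragment
    auto eno1
    iface eno1 inet static
        address <host-ip>/<prefix>
        gateway 10.64.0.1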
[18:33:10] urbanecm: do you want to do another export of sitereq-l@ now that there are 3 more mails not in the mbox export?
[18:33:17] good idea
[18:33:33] if you're able to do the import now, I'll redo the takeout
[18:33:46] yep
[18:33:49] ok, doing
[18:34:00] legoktm: in the meantime, is there anything else to finish on T287916?
[18:34:01] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916
[18:34:43] uhh, I need to run my checker again
[18:35:15] urbanecm: also, sitereq-l@ or sitereq@?
[18:35:22] what would you recommend?
[18:36:13] the -l suffix is deprecated, but that's the name of the list already so I'm OK keeping it
[18:37:55] I lean towards keeping it, to make it a domain-change only. But if we're able to alias lists (and make both names work), maybe that's the right way?
[18:38:00] (keeping it => the suffix, i mean)
[18:38:08] new takeout is at https://people.wikimedia.org/~urbanecm/tmp/sitereq-l@wikimedia.org_2.tar.gz, please check :)
[18:38:40] > Subject: Migration of this list to Mailman
[18:38:44] lgtm
[18:38:53] great
[18:42:09] https://lists.wikimedia.org/hyperkitty/list/sitereq-l@lists.wikimedia.org/
[18:42:33] thanks!
[18:43:08] am i good to say "migration finished" and disable the google group?
[18:43:41] I think so
[18:43:48] thanks for the help legoktm :)
[18:43:58] the hyperkitty_import is still at "Synchronizing properties with Mailman" but I assume that will eventually finish
[18:44:00] :D
[18:44:24] I'm fine with waiting a while, just wasn't sure :))
[18:44:36] the list itself should be usable now
[18:45:05] ok
[18:45:11] > 57295 emails left to refresh, checked 1000
[18:45:54] that's...a large number
[18:48:59] https://gitlab.com/mailman/hyperkitty/-/blob/master/hyperkitty/lib/mailman.py#L129 it's like, updating every sender to a Mailman list or something
[18:51:19] anyways, this is all background stuff, once it finishes I'll update the search cache too
[18:54:29] > Indexing 54 emails
[18:54:32] really all done now :)
[18:54:58] thanks :)
[18:56:15] https://wikitech.wikimedia.org/wiki/Mailman#Import_from_Google_Groups
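For reference, the import itself comes down to HyperKitty's Django management commands; the exact wrapper/settings used on the lists host and the mbox filename inside the takeout are assumptions here:

    # unpack the Google Takeout archive and feed the mbox to HyperKitty
    tar -xzf sitereq-l@wikimedia.org_2.tar.gz
    django-admin hyperkitty_import -l sitereq-l@lists.wikimedia.org <extracted>.mbox

    # once the import (including the "Synchronizing properties with Mailman" phase)
    # has finished, rebuild the full-text search index for the list
    django-admin update_index_one_list sitereq-l@lists.wikimedia.org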