[09:08:27] rzl: just read your "Production Training Design Proposal". it's _superb_, and exactly what i as a still-pretty-new SRE need
[09:12:30] * elukey knows that he will never ever get a "superb" from kormat
[09:14:45] * kormat feels she's done a good job setting expectations
[09:20:29] * elukey can confirm
[09:20:51] :D
[09:25:33] <_joe_> oh yes my expectations were set appropriately during the interview, even
[09:26:22] hahah
[15:46:04] topranks: I have a question that might sound silly
[15:46:56] mm let me grab my graphs
[15:58:21] kormat: \o/ if you haven't seen it yet, you might also be interested in https://docs.google.com/document/d/1tjwg23lOMQncKTvjj2YYyFgeA-HNtlPybyTgcInuzVs/edit which is still being actively worked on
[15:59:42] effie: lol go ahead whenever you have all your presentation materials ready :)
[16:02:06] the really short version is that
[16:02:58] mw servers had the ability to stampede on a memcached shard, exhausting its available bw (we had 1G cards)
[16:03:29] mcrouter, the software on each mw server doing the sharding, would block a shard temporarily if it was unable to connect to it (due to the stampede)
[16:03:45] the infamous TKOs
[16:03:47] created 2 VMs in each DC and .. there were enough IP addresses for it to just work, seems like netops turned them all into /24s and they were smaller before? nice, thank you, it wasn't like that a little while ago
[16:04:00] so now in eqiad we have 10Gs
[16:04:33] and we were hoping never to see them again. a few TKOs have surfaced since the switchover yesterday, but apart from that
[16:05:16] I see some tcp retransmissions happening, my question is, is it ok to have eg this
[16:05:33] https://grafana-rw.wikimedia.org/d/A0xQ-EI7k/xxxx-effie-tcp-retransmissions?orgId=1&from=1631715269766&to=1631715864261
[16:05:54] now what we see here is:
[16:07:33] mw1422 registered a tko to 10.192.48.77 (which is in codfw)
[16:08:11] but at the same time I see a small burst of retransmissions in mw* and memcached servers
[16:08:22] Ideally we'd have no TCP retransmissions (at least caused by the network) at all.
[16:08:42] Certainly none within a DC, and given we have private WAN links we should have enough control to prevent them there.
[16:08:44] I know for a fact that there is no way that we exhausted 10.192.48.77's bw, because we send very little memcached traffic to codfw
[16:09:40] That said, the uplinks from switch rows to CRs (used for inter-row traffic and inter-site traffic) do have drops on them fairly regularly, despite the fact the links themselves are a good way from being saturated.
[16:10:15] The recent buffer changes we did in eqiad were aimed at that, but while they made a big difference the drops didn't go away completely.
[16:11:12] The majority of the TKOs you see (apologies if I get the terms wrong here), are they for servers in the same DC? Or remote ones (say eqiad registering a tko to codfw, as in the example)?
[16:15:09] let's leave the TKOs out of the problem for now, to simplify it
[16:15:49] if it makes sense to chase up the retransmission errors
[16:16:03] I can revisit the TKO problem after
[16:18:37] I'm trying to do just that here yeah. It does make sense to try to get to the bottom of them.
[16:18:57] If the cause is the discards on row A uplinks to the CRs then there is not much we can do (short term). However we're working longer-term on a new design that should prevent that (largely by adding BW to those links, and taking row->row traffic off them and onto a dedicated spine layer).
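For reference, a rough sketch of how the per-host numbers behind those graphs can be checked. These are standard iproute2/netcat commands and the shard IP is the one from the example above; mcrouter's listening port (11213) and the exact form of its "stats suspect_servers" admin query are assumptions about this setup, and nc flags vary between netcat variants:

    # kernel-wide TCP retransmission counters (the kind of counters a
    # node_exporter-based dashboard like the one linked above is built on)
    nstat -az TcpRetransSegs TcpExtTCPLostRetransmit

    # per-connection view towards the shard in question; sockets that have
    # retransmitted show a "retrans:<current>/<total>" field
    ss -ti dst 10.192.48.77

    # destinations the local mcrouter currently has marked TKO
    echo "stats suspect_servers" | nc -q 2 localhost 11213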
[16:19:48] If something else is causing the retransmissions we'd best know about that though, so it's good if we can try to verify what the root cause of them is.
[16:20:45] well, if you zoom out eg last 12h
[16:21:04] you will see that this is not a rare occurrence
[16:22:12] Yeah. Very roughly the pattern of those over time matches the rise/fall in discards we see towards the CRs. For instance row A discards over the past 24 hours:
[16:22:13] https://librenms.wikimedia.org/graphs/to=1631722200/id=15267/type=port_errors/from=1631635800/
[16:22:22] https://librenms.wikimedia.org/graphs/to=1631722200/id=15279/type=port_errors/from=1631635800/
[16:22:51] https://librenms.wikimedia.org/graphs/to=1631722200/id=15281/type=port_errors/from=1631635800/
[16:24:55] ah!
[16:25:11] so it is a known issue
[16:26:03] yes. sad face :(
[16:26:28] But it's a long-term fix, and the last thing we want to do is dismiss this as "part of the big problem" just in case there is something else going on.
[16:26:43] there is a chance that those TKOs are related, but I can't be sure unless those retransmissions clear up
[16:27:03] the pattern matches somewhat
[16:27:19] To troubleshoot another issue (backup speeds) I temporarily changed the GW IP one of the backup servers was using. I'm unsure if we could do the same here.
[16:27:48] Basically cr1 in eqiad is the VRRP master for row A private (for instance). So it has IP 10.64.0.1
[16:28:02] the blast radius would be considerably larger if we did this here
[16:28:21] in case something does not go as planned I mean
[16:28:21] If we change the GW mw1422 is using from that IP to 10.64.0.3, which CR2 is using, then packets it sends will use the currently idle links from switches to CR2.
[16:28:29] yes indeed.
[16:28:57] we can always limit it to a handful of servers and watch them
[16:29:09] I think the change is relatively safe, however I figure puppet would probably fix anything like that done on the next run (in the case of backups that didn't matter, as I could re-test immediately; here we'd want to leave it a while and see if the graphs are clean).
[16:29:10] we can live without 3-4 mw servers
[16:32:47] if you want to follow it, we can depool a few servers, change their gw and pool them back
[16:33:13] up to you, it is not something urgent anyway
[16:33:40] sounds like a plan. It'd be useful to us to confirm if it's the cause anyway.
[16:34:01] Only thing I'm unsure of is the "change the gw" bit... might have to ask someone with more knowledge how to accomplish that.
[16:34:13] so bear with me if you can.
[16:37:39] I can bear with you
[16:38:34] I would guess it should be a few changes in our dhcp config, but moritz or john would know better
[16:39:25] I never wondered how to change a default gw around here
[16:41:26] I've asked said experts in our channel there, it may be tomorrow till we can try, I know John is off for the week.
[16:42:08] The config on the servers is hard-coded after the initial install, so I don't think we can do it via DHCP (which would be handy actually, as we could use the MAC there).
[16:43:33] pff
[16:43:39] :p
[17:02:36] yes sorry, acronyms
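A rough sketch of what the test discussed above could look like on one of the handful of depooled mw hosts. The ip route commands are standard; depool/pool are assumed to be the usual conftool wrapper scripts on these servers, and the addresses are the ones from the conversation:

    # drain the server from the load balancers before touching its routing
    sudo depool

    # the default route currently points at the row A VRRP address that cr1 answers for
    ip route show default            # expect: default via 10.64.0.1 dev ...

    # point it at cr2's own address instead, so egress traffic uses the currently
    # idle switch->cr2 uplinks
    sudo ip route replace default via 10.64.0.3

    # leave it like this for a while and watch the retransmission/discard graphs

    # revert and repool
    sudo ip route replace default via 10.64.0.1
    sudo pool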
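On the "hard-coded after the initial install" point: on a typical Debian install the gateway ends up as a static line in /etc/network/interfaces, so a DHCP change (e.g. to `option routers`) would only be picked up at reinstall time, and a runtime `ip route replace` won't survive a networking restart either. Illustratively (interface name and address are placeholders, not the real host config):

    # /etc/network/interfaces fragment
    auto eno1
    iface eno1 inet static
        address <host-ip>/<prefix>
        gateway 10.64.0.1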
[18:33:10] urbanecm: do you want to do another export of sitereq-l@ now that there are 3 more mails not in the mbox export?
[18:33:17] good idea
[18:33:33] if you're able to do the import now, I'll redo the takeout
[18:33:46] yep
[18:33:49] ok, doing
[18:34:00] legoktm: in the meantime, is there anything else to finish on T287916?
[18:34:01] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916
[18:34:43] uhh, I need to run my checker again
[18:35:15] urbanecm: also, sitereq-l@ or sitereq@?
[18:35:22] what would you recommend?
[18:36:13] the -l suffix is deprecated, but that's the name of the list already so I'm OK keeping it
[18:37:55] I lean towards keeping it, to make it a domain-change only. But if we're able to alias lists (and make both names work), maybe that's the right way?
[18:38:00] (keeping it => the suffix, i mean)
[18:38:08] new takeout is at https://people.wikimedia.org/~urbanecm/tmp/sitereq-l@wikimedia.org_2.tar.gz, please check :)
[18:38:40] > Subject: Migration of this list to Mailman
[18:38:44] lgtm
[18:38:53] great
[18:42:09] https://lists.wikimedia.org/hyperkitty/list/sitereq-l@lists.wikimedia.org/
[18:42:33] thanks!
[18:43:08] am i good to say "migration finished" and disable the google group?
[18:43:41] I think so
[18:43:48] thanks for the help legoktm :)
[18:43:58] the hyperkitty_import is still at "Synchronizing properties with Mailman" but I assume that will eventually finish
[18:44:00] :D
[18:44:24] I'm fine with waiting a while, just wasn't sure :))
[18:44:36] the list itself should be usable now
[18:45:05] ok
[18:45:11] > 57295 emails left to refresh, checked 1000
[18:45:54] that's...a large number
[18:48:59] https://gitlab.com/mailman/hyperkitty/-/blob/master/hyperkitty/lib/mailman.py#L129 it's like, updating every sender to a Mailman list or something
[18:51:19] anyways, this is all background stuff, once it finishes I'll update the search cache too
[18:54:29] > Indexing 54 emails
[18:54:32] really all done now :)
[18:54:58] thanks :)
[18:56:15] https://wikitech.wikimedia.org/wiki/Mailman#Import_from_Google_Groups
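For reference, the import itself comes down to HyperKitty's Django management commands; the exact wrapper/settings used on the lists host and the mbox filename inside the takeout are assumptions here:

    # unpack the Google Takeout archive and feed the mbox to HyperKitty
    tar -xzf sitereq-l@wikimedia.org_2.tar.gz
    django-admin hyperkitty_import -l sitereq-l@lists.wikimedia.org <extracted>.mbox

    # once the import (including the "Synchronizing properties with Mailman" phase)
    # has finished, rebuild the full-text search index for the list
    django-admin update_index_one_list sitereq-l@lists.wikimedia.org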