[14:45:20] hi! i'm the engineering manager for trust and safety tools and i'm wondering if someone can help me locate a private config value for https://www.mediawiki.org/wiki/Extension:MediaModeration
[14:51:04] mepps: sorry to redirect you but I think you'd be better off asking in #wikimedia-serviceops -- they're the sub-team that runs Mediawiki in production @ WMF
[14:51:13] thanks cdanis!
[15:56:50] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar let me know when this is offline so i can take over
[16:02:31] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) @Papaul the machine is shutting down. I am on IRC if you want t...
[16:04:33] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) > From: @Joe > I'm generally not a big fan of reformatting patches, because of how hard they make to reconstruct git history. However,...
[16:16:46] 10puppet-compiler, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 4 others: [pcc] Release the latest version - https://phabricator.wikimedia.org/T297356 (10DannyS712)
[16:43:04] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7625612, @jbond wrote: > Thanks for the work on this looks really good, in relation to linting vs automatic formatting i...
[16:46:49] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) reset IDRAC, upgrade BIOS and IDRAC.
[16:47:05] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7626447, @fgiunchedi wrote: > 100% agreed on consistency, I like the general idea and wanted to say +1 on not removing bl...
[16:54:30] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) 05Open→03Resolved a:03Papaul I have restarted ferm. Zuul...
[16:56:54] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar no problem you can close the task once all is back onli...
[17:03:05] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul)
[17:20:03] XioNoX: \o
[17:20:22] I am having bandwidth issues to bast1003.eqiad and wondered if it's a known issue
[17:20:27] ISP is init7 in CH
[17:21:00] fun
[17:21:11] klausman: can you post the results of mtr --report-wide --show-ips --aslookup --tcp --port 22 bast1003.wikimedia.org
[17:21:21] sec
[17:22:02] https://phabricator.wikimedia.org/P18783
[17:22:18] hmm
[17:22:34] That latency jump between hops 14 and 15 looks odd, but I'm no networker :)
[17:22:55] Also asked kormat about her view of the matter, since she's on the same ISP
[17:23:13] Things seem ok on her end.
[17:24:56] Hrmmm. bast2002 is also slow for me. Sounds like a local problem, then
[17:25:21] can you say more what you mean by bandwidth issues? are you trying to do a large copy?
[17:25:41] I tried a git pull on the pw repo and only got like 30-40KiB/s
[17:26:09] I then checked with `ssh bast2002.wikimedia.org cat /dev/zero|pv >/dev/null` (Compression off) and it's indeed maybe 60KiB/s
[17:26:45] Also tried assorted other hosts (cumin, ml-serve1001, ...) and they all are slow. (or my local conn is)
[17:26:53] right, those all go via the bastion host ofc
[17:27:04] Yarp
[17:27:18] how does bast300x look for you?
[17:27:35] fyi, init7 is one of our transits in esams, so overall you would have better perf using the bast over there
[17:28:11] then could you try the same thing with IPv4 ? ssh -4 ....
[17:28:11] Both 2004 and 3005 are slow as well
[17:28:25] I also tried from the same machine to my root server at Hetzner - blazing fast
[17:28:57] v4 tp 3005 is fast, but to 3004 is still slow
[17:29:01] s/tp/to/
[17:29:06] ok, at this point I'd blame your home internet :)
[17:29:16] I am also suspicious of some awfulness involving TCP window sizes or similar
[17:29:38] have you tweaked anything weird on your workstation? :)
[17:29:55] My first impulse was PMTUD being frazzled, but I wouldn't know why it suddenly started
[17:30:00] Not that I'm aware
[17:30:17] I mean I update shit as Debian sends it to me, that's about it
[17:30:21] ah okay
[17:30:31] basically my only thought from here for how to proceed is to spend a bunch of time squinting at wireshark
[17:31:16] Let me try and reboot the device. It has quite some uptime
[17:44:31] Nope, that didn't help. Time to get a wiresharkin'
[17:54:44] https://imgur.com/a/2IX9V4a Insights welcome
[17:58:07] There's a lot of out-of-order and missing segments
[18:13:20] Definitely not a machine with just the work machine. My private workstation has the same problem.
[18:13:31] s/not a machine/not a problem/
[18:13:51] But *only* with WMF hosts, so far. Weird.
[18:20:24] oh, I guess the other thing to check for is a return path issue
[18:20:34] get a mtr towards your own IP from a bastion host?
[18:21:28] sec
[18:22:25] `-bash: mtr: command not found`
[18:28:41] return path is definitely a good suggestion, but I would add that there are equal cost multipath scenarios between the hops that you're seeing
[18:28:53] hashed by l3 (IPs) and l4 (protocol or port)
[18:29:18] that would explain e.g. traceroute looking OK, but a tcp connection not, and maybe another tcp connection working ok
[18:29:43] mtr --tcp or tcptraceroute can help here, and repeated runs with different source ports as well
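A minimal sketch of the "repeated runs" idea above: each fresh mtr run picks a new ephemeral source port, so if equal-cost multipath is hashing flows onto different paths, loss or latency will differ from run to run. The target and the pv throughput check are the ones already quoted in this log; the loop count is arbitrary.

    # Repeat the TCP traceroute a few times; each run uses a new ephemeral
    # source port, so ECMP-dependent paths show up as run-to-run differences.
    for i in 1 2 3 4 5; do
        mtr --report-wide --show-ips --aslookup --tcp --port 22 bast1003.wikimedia.org
    done

    # Rough throughput check through a bastion, as used above (compression off):
    ssh -o Compression=no bast2002.wikimedia.org cat /dev/zero | pv > /dev/null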
[18:30:51] paravoid: yeah, I asked for mtr --tcp --port 22 :)
[18:30:58] klausman: try sudo mtr perhaps? it might live in sbin
[18:31:08] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) Thanks for looking into this! Automatic formatting would be great as long as the output is human-oriented. >>! In T236954#7624944, @jh...
[18:31:50] last I checked it was installed on bastion hosts
[18:33:46] So here's a thing
[18:34:07] I assumed all bastion hosts would be treated equally when it comes to can-be-directly-ssh'd to
[18:34:14] They are not
[18:34:24] Well, not according to the .ssh/config
[18:34:38] ah, are you always going via one bastion host?
[18:34:42] Fast hosts are those I can go to directly according to the config
[18:34:55] i.e. bast3005, 2002 and 1003
[18:35:11] bast3004 is hopped-to via 3005 (I presume) and is slow.
[18:35:41] So it's not a me issue \o/ But something that happens with proxy hosts /p\
[18:35:42] ssh -v will hopefully provide some enlightenment
[18:35:54] as to when a proxy is used and isn't
[18:36:37] debug1: Executing proxy command: exec ssh -a -W bast3004.wikimedia.org:22 bast1003.wikimedia.org
[18:36:47] So 3004 is connected to via 1003
[18:37:00] that will definitely be slow :)
[18:37:07] please don't use 3004 :) https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L266
[18:37:37] Still, the original problem was that git pulls from cumin1001 were super-slow
[18:38:02] where did your ~/.ssh/config originate from?
[18:38:04] And they still are
[18:38:11] wmf-update-thingamabob
[18:38:22] that's where your known hosts comes from, not your ssh config :)
[18:38:32] or did you install wmf-sre-laptop ?
[18:38:57] wmf-update-ssh-config
[18:39:09] I use both that and wmf-update-known-hosts-production
[18:39:13] ah okay
[18:40:24] I usually have my own config and then use an Include, but I ruled that out as a problem already (using only the config coming out of the update tool atm)
[18:40:37] yeah
[18:40:53] so, I don't use the now-standard scripts, and I am about to disagree with their config ;)
[18:41:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/configs/ssh-client-config#2
[18:41:21] my config has 'bast*.wikimedia.org' instead of fully listing the allowed ones
[18:41:43] I mean, I can always try stuff :)
[18:42:49] Nope, no difference
[18:43:14] debug1: Executing proxy command: exec ssh -a -W cumin1001.eqiad.wmnet:22 bast1003.wikimedia.org
[18:43:25] In this constellation stuff is slow
[18:43:34] well, that's expected given the rest of the file
[18:43:46] but that one change should have made bast3005 much faster
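A quick way to see which proxy the client actually resolves for a given destination, without reading the whole config by hand, is a small sketch like the following; ssh -G prints the fully resolved client options without connecting, and -v shows the proxy command at connect time. The hostnames are just the ones mentioned in this log.

    # A ProxyCommand/ProxyJump line in the resolved options means that host
    # will go via a bastion; no such line means a direct connection.
    ssh -G cumin1001.eqiad.wmnet | grep -Ei 'proxycommand|proxyjump|^hostname'
    ssh -G bast3005.wikimedia.org | grep -Ei 'proxycommand|proxyjump|^hostname'

    # Or watch it at connect time, as with the debug1 lines above:
    ssh -v cumin1001.eqiad.wmnet true 2>&1 | grep -i 'proxy command'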
[18:44:00] also it sounds like your real problem isn't the esams bastions, it's the eqiad ones
[18:44:02] wait, why would an SSH to cumin1001 via 1003 be slow?
[18:44:24] grab a mtr towards bast1003 ?
[18:44:37] (the other host I've tried is ml-serve1001, which is also in eqiad)
[18:44:39] sec.
[18:44:55] if you are tired of troubleshooting, and just want a quick fix
[18:45:36] then I suggest rewriting the config to a much simpler one that simply makes all traffic go via your closest bastion
[18:45:50] https://phabricator.wikimedia.org/P18783 (from earlier) https://phabricator.wikimedia.org/P18784 (right now)
[18:45:54] https://wikitech.wikimedia.org/wiki/SRE/Production_access#SSH_configuration
[18:45:56] like so
[18:46:22] bast3005 is capable of talking to cumin1001
[18:46:30] would going to cumin1001 via AMS be expected to be better regarding these matters?
[18:46:44] it shouldn't be drastically better
[18:46:52] I mean, I get the "but what if AMS is down" concern
[18:46:56] there is still some issue with your ssh traffic between your home and eqiad, which we haven't figured out yet
[18:47:11] but also I'm not sure we will
[18:47:29] I am also puzzled that the hop from xe-5-3-3-500.cr1-eqiad.wikimedia.org to bast1003.wikimedia.org adds 70ms all by itself
[18:47:56] yeah, I took a quick peek at librenms and a few other things and didn't see obvious signs of congestion there
[18:48:42] I can run a wireshark dump like above, but going to bast1003
[18:49:08] See if there's anything obvious. My earlier dump was clearly muddled by not understanding the SSH config
[18:52:39] Out of a total of 1443 packets (that cat /dev/zero command, running for 1m), 75 out of order, 80 missing segments (those two can muddy each other, I figure). 8 retransmits. 120 dupe ACKs. 20 window updates
[18:53:00] I can make the pcap downloadable somewhere if desired.
[18:55:22] Hm. Nothing I can derive any insight from
[18:55:44] I mean the go-via-AMS path will be a bandaid, but I dunno what's going on with eqiad
[18:57:58] Slow: 1003, 2002, 4003. Fast: 3005, 5002
[18:58:34] The Atlantic hates me
[19:02:40] Alright. Enough for today. Maybe sleeping on this will get me insights
[19:52:10] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar)
[19:52:46] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) CI had to be restarted after the machine went up due to some od...
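For reference, the "much simpler" config suggested at 18:45 amounts to sending everything through the nearest bastion. A rough sketch follows; the host patterns are illustrative and bast3005 stands in for whichever bastion is closest -- the wikitech page linked above has the maintained version.

    # ~/.ssh/config sketch: proxy all production hosts through one nearby
    # bastion; the negated pattern keeps the bastion itself reachable directly.
    Host *.wmnet *.wikimedia.org !bast3005.wikimedia.org
        ProxyJump bast3005.wikimedia.org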