[14:45:20] hi! i'm the engineering manager for trust and safety tools and i'm wondering if someone can help me locate a private config value for https://www.mediawiki.org/wiki/Extension:MediaModeration
[14:51:04] mepps: sorry to redirect you but I think you'd be better off asking in #wikimedia-serviceops -- they're the sub-team that runs Mediawiki in production @ WMF
[14:51:13] thanks cdanis!
[15:56:50] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar let me know when this is offline so i can take over
[16:02:31] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) @Papaul the machine is shutting down. I am on IRC if you want t...
[16:04:33] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) > From: @Joe > I'm generally not a big fan of reformatting patches, because of how hard they make to reconstruct git history. However,...
[16:16:46] 10puppet-compiler, 10Infrastructure-Foundations, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, and 4 others: [pcc] Release the latest version - https://phabricator.wikimedia.org/T297356 (10DannyS712)
[16:43:04] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7625612, @jbond wrote: > Thanks for the work on this looks really good, in relation to linting vs automatic formatting i...
[16:46:49] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) reset IDRAC, upgrade BIOS and IDRAC.
[16:47:05] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7626447, @fgiunchedi wrote: > 100% agreed on consistency, I like the general idea and wanted to say +1 on not removing bl...
[16:54:30] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) 05Open→03Resolved a:03Papaul I have restarted ferm. Zuul...
[16:56:54] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul) @hashar no problem you can close the task once all is back onli...
[17:03:05] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10Papaul)
[17:20:03] XioNoX: \o
[17:20:22] I am having bandwidth issues to bast1003.eqiad and wondered if it's a known issue
[17:20:27] ISP is init7 in CH
[17:21:00] fun
[17:21:11] klausman: can you post the results of mtr --report-wide --show-ips --aslookup --tcp --port 22 bast1003.wikimedia.org
[17:21:21] sec
[17:22:02] https://phabricator.wikimedia.org/P18783
[17:22:18] hmm
[17:22:34] That latency jump between hops 14 and 15 looks odd, but I'm no networker :)
[17:22:55] Also asked kormat about her view of the matter, since she's on the same ISP
[17:23:13] Things seem ok on her end.
[17:24:56] Hrmmm. bast2002 is also slow for me. Sounds like a local problem, then
[17:25:21] can you say more what you mean by bandwidth issues? are you trying to do a large copy?
[17:25:41] I tried a git pull on the pw repo and only got like 30-40KiB/s
[17:26:09] I then checked with `ssh bast2002.wikimedia.org cat /dev/zero|pv >/dev/null` (Compression off) and it's indeed maybe 60KiB/s
[17:26:45] Also tried assorted other hosts (cumin, ml-serve1001, ...) and they all are slow. (or my local conn is)
[17:26:53] right, those all go via the bastion host ofc
[17:27:04] Yarp
[17:27:18] how does bast300x look for you?
[17:27:35] fyi, init7 is one of our transits in esams, so overall you would have better perf using the bast over there
[17:28:11] then could you try the same thing with IPv4 ? ssh -4 ....
[17:28:11] Both 2004 and 3005 are slow as well
[17:28:25] I also tried from the same machine to my root server at Hetzner - blazing fast
[17:28:57] v4 tp 3005 is fast, but to 3004 is still slow
[17:29:01] s/tp/to/
[17:29:06] ok, at this point I'd blame your home internet :)
[17:29:16] I am also suspicious of some awfulness involving TCP window sizes or similar
[17:29:38] have you tweaked anything weird on your workstation? :)
[17:29:55] My first impulse was PMTUD being frazzled, but I wouldn't know why it suddenly started
[17:30:00] Not that I'm aware
[17:30:17] I mean I update shit as Debian sends it to me, that's about it
[17:30:21] ah okay
[17:30:31] basically my only thought from here for how to proceed is to spend a bunch of time squinting at wireshark
[17:31:16] Let me try and reboot the device. It has quite some uptime
[17:44:31] Nope, that didn't help. Time to get a wiresharkin'
[17:54:44] https://imgur.com/a/2IX9V4a Insights welcome
[17:58:07] There's a lot of out-of-order and missing segments
[18:13:20] Definitely not a machine with just the work machine. My private workstation has the same problem.
[18:13:31] s/not a machine/not a problem/
[18:13:51] But *only* with WMF hosts, so far. Weird.
[18:20:24] oh, I guess the other thing to check for is a return path issue
[18:20:34] get a mtr towards your own IP from a bastion host?
[18:21:28] sec
[18:22:25] `-bash: mtr: command not found`
[18:28:41] return path is definitely a good suggestion, but I would add that there are equal cost multipath scenarios between the hops that you're seeing
[18:28:53] hashed by l3 (IPs) and l4 (protocol or port)
[18:29:18] that would explain e.g. traceroute looking OK, but a tcp connection not, and maybe another tcp connection working ok
[18:29:43] mtr --tcp or tcptraceroute can help here, and repeated runs with different source ports as well
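A minimal sketch of the "repeated runs" idea above: each fresh mtr run picks a new ephemeral source port, so if equal-cost multipath is hashing flows onto different paths, loss or latency will differ from run to run. The target and the pv throughput check are the ones already quoted in this log; the loop count is arbitrary.

    # Repeat the TCP traceroute a few times; each run uses a new ephemeral
    # source port, so ECMP-dependent paths show up as run-to-run differences.
    for i in 1 2 3 4 5; do
        mtr --report-wide --show-ips --aslookup --tcp --port 22 bast1003.wikimedia.org
    done

    # Rough throughput check through a bastion, as used above (compression off):
    ssh -o Compression=no bast2002.wikimedia.org cat /dev/zero | pv > /dev/null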
[18:30:51] paravoid: yeah, I asked for mtr --tcp --port 22 :)
[18:30:58] klausman: try sudo mtr perhaps? it might live in sbin
[18:31:08] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) Thanks for looking into this! Automatic formatting would be great as long as the output is human-oriented. >>! In T236954#7624944, @jh...
[18:31:50] last I checked it was installed on bastion hosts
[18:33:46] So here's a thing
[18:34:07] I assumed all bastion hosts would be treated equally when it comes to can-be-directly-ssh'd to
[18:34:14] They are not
[18:34:24] Well, not according to the .ssh/config
[18:34:38] ah, are you always going via one bastion host?
[18:34:42] Fast hosts are those I can go to directly according to the config
[18:34:55] i.e. bast3005, 2002 and 1003
[18:35:11] bast3004 is hopped-to via 3005 (I presume) and is slow.
[18:35:41] So it's not a me issue \o/ But something that happens with proxy hosts /p\
[18:35:42] ssh -v will hopefully provide some enlightenment
[18:35:54] as to when a proxy is used and isn't
[18:36:37] debug1: Executing proxy command: exec ssh -a -W bast3004.wikimedia.org:22 bast1003.wikimedia.org
[18:36:47] So 3004 is connected to via 1003
[18:37:00] that will definitely be slow :)
[18:37:07] please don't use 3004 :) https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L266
[18:37:37] Still, the original problem was that git pulls from cumin1001 were super-slow
[18:38:02] where did your ~/.ssh/config originate from?
[18:38:04] And they still are
[18:38:11] wmf-update-thingamabob
[18:38:22] that's where your known hosts comes from, not your ssh config :)
[18:38:32] or did you install wmf-sre-laptop ?
[18:38:57] wmf-update-ssh-config
[18:39:09] I use both that and wmf-update-known-hosts-production
[18:39:13] ah okay
[18:40:24] I usually have my own config and then use an Include, but I ruled that out as a problem already (using only the config coming out of the update tool atm)
[18:40:37] yeah
[18:40:53] so, I don't use the now-standard scripts, and I am about to disagree with their config ;)
[18:41:07] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/configs/ssh-client-config#2
[18:41:21] my config has 'bast*.wikimedia.org' instead of fully listing the allowed ones
[18:41:43] I mean, I can always try stuff :)
[18:42:49] Nope, no difference
[18:43:14] debug1: Executing proxy command: exec ssh -a -W cumin1001.eqiad.wmnet:22 bast1003.wikimedia.org
[18:43:25] In this constellation stuff is slow
[18:43:34] well, that's expected given the rest of the file
[18:43:46] but that one change should have made bast3005 much faster
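A quick way to see which proxy the client actually resolves for a given destination, without reading the whole config by hand, is a small sketch like the following; ssh -G prints the fully resolved client options without connecting, and -v shows the proxy command at connect time. The hostnames are just the ones mentioned in this log.

    # A ProxyCommand/ProxyJump line in the resolved options means that host
    # will go via a bastion; no such line means a direct connection.
    ssh -G cumin1001.eqiad.wmnet | grep -Ei 'proxycommand|proxyjump|^hostname'
    ssh -G bast3005.wikimedia.org | grep -Ei 'proxycommand|proxyjump|^hostname'

    # Or watch it at connect time, as with the debug1 lines above:
    ssh -v cumin1001.eqiad.wmnet true 2>&1 | grep -i 'proxy command'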
[18:44:00] also it sounds like your real problem isn't the esams bastions, it's the eqiad ones
[18:44:02] wait, why would an SSH to cumin1001 via 1003 be slow?
[18:44:24] grab a mtr towards bast1003 ?
[18:44:37] (the other host I've tried is ml-serve1001, which is also in eqiad)
[18:44:39] sec.
[18:44:55] if you are tired of troubleshooting, and just want a quick fix
[18:45:36] then I suggest rewriting the config to a much simpler one that simply makes all traffic go via your closest bastion
[18:45:50] https://phabricator.wikimedia.org/P18783 (from earlier) https://phabricator.wikimedia.org/P18784 (right now)
[18:45:54] https://wikitech.wikimedia.org/wiki/SRE/Production_access#SSH_configuration
[18:45:56] like so
[18:46:22] bast3005 is capable of talking to cumin1001
[18:46:30] would going to cumin1001 via AMS be expected to be better regarding these matters?
[18:46:44] it shouldn't be drastically better
[18:46:52] I mean, I get the "but what if AMS is down" concern
[18:46:56] there is still some issue with your ssh traffic between your home and eqiad, which we haven't figured out yet
[18:47:11] but also I'm not sure we will
[18:47:29] I am also puzzled that the hop from xe-5-3-3-500.cr1-eqiad.wikimedia.org to bast1003.wikimedia.org adds 70ms all by itself
[18:47:56] yeah, I took a quick peek at librenms and a few other things and didn't see obvious signs of congestion there
[18:48:42] I can run a wireshark dump like above, but going to bast1003
[18:49:08] See if there's anything obvious. My earlier dump was clearly muddled by not understanding the SSH config
[18:52:39] Out of a total of 1443 packets (that cat /dev/zero command, running for 1m), 75 out of order, 80 missing segments (those two can muddy each other, I figure). 8 retransmits. 120 dupe ACKs. 20 window updates
[18:53:00] I can make the pcap downloadable somewhere if desired.
[18:55:22] Hm. Nothing I can derive any insight from
[18:55:44] I mean the go-via-AMS path will be a bandaid, but I dunno what's going on with eqiad
[18:57:58] Slow: 1003, 2002, 4003. Fast: 3005, 5002
[18:58:34] The Atlantic hates me
[19:02:40] Alright. Enough for today. Maybe sleeping on this will get me insights
[19:52:10] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar)
[19:52:46] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10SRE, 10ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) CI had to be restarted after the machine went up due to some od...
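For reference, the "much simpler" config suggested at 18:45 amounts to sending everything through the nearest bastion. A rough sketch follows; the host patterns are illustrative and bast3005 stands in for whichever bastion is closest -- the wikitech page linked above has the maintained version.

    # ~/.ssh/config sketch: proxy all production hosts through one nearby
    # bastion; the negated pattern keeps the bastion itself reachable directly.
    Host *.wmnet *.wikimedia.org !bast3005.wikimedia.org
        ProxyJump bast3005.wikimedia.org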