[08:10:55] klausman: still having issues?
[08:11:30] volans, topranks, any idea how to pass config options to paramiko through Junos-pyez?
[08:26:09] XioNoX: was looking through some old code there but doesn't look like I've done it in the past
[08:26:21] What do you need to pass through?
[08:34:33] XioNoX: yep. 3005 still fast, 1003 still slow
[08:38:54] topranks: opening a task, but we're hitting https://github.com/paramiko/paramiko/issues/1961 since I upgraded homer's libs
[08:39:08] klausman: slow through both v4 and v6?
[08:39:33] only v6
[08:40:27] 10netops, 10Infrastructure-Foundations: Paramiko > 2.8.1 incompatibility with some Juniper devices - https://phabricator.wikimedia.org/T299482 (10ayounsi) p:05Triage→03High
[08:40:46] klausman: I'd blame HE in that case
[08:40:49] topranks: https://phabricator.wikimedia.org/T299482
[08:41:03] klausman: they're not the most reliable
[08:41:18] Alright. :-/
[08:41:29] klausman: we can try dropping them in eqiad and testing it again
[08:41:50] What do you mean by "dropping them"?
[08:42:12] just using a different v6 peer?
[08:42:29] klausman: disabling the BGP session we have with them, and letting BGP pick another path
[08:43:17] I mean, doing that just for me, when I could also just disable v6 for SSH to WMF, seems a bit... "heavy"?
[08:43:29] Or do you mean for diagnostics only?
[08:45:31] klausman: yeah, just to know if it's really the issue
[08:45:50] We can try that. Just let me know when to test
[08:45:59] finishing up something first
[08:46:06] Sure, no rush
[08:47:01] klausman: I'd recommend using bast3005 though, instead of disabling v6
[08:47:11] (as a longer term solution)
[08:47:14] Ack
[08:52:04] klausman: (done) please try it again once there is no more "he.net" in your mtr (5min or so)
[08:52:11] aye
[09:02:52] I've a v6 tunnel through HE here myself. It's still handing off directly to us in eqord.
[09:03:02] I'm not having any trouble to bast1003 that I can tell, though
[09:03:57] I still have HE in my route to 1003 (v6)
[09:04:55] My PCAP is clean on an SSH file transfer. No out-of-order packets or heavy retransmits or anything similar.
[09:05:25] Yeah, I am not sure if the retrans/out-of-order stuff is a symptom of anything specific.
[09:06:48] It'll play havoc with performance no matter what, you'd expect
[09:06:52] https://phabricator.wikimedia.org/P18810
[09:07:54] ^^ this is my mtr. But that doesn't at all mean HE isn't the source of the issue, it just implies HE is ok where they hand off to us in Ashburn. They could have more local issues closer to you causing it.
[09:08:06] Ack
[09:08:25] Let me do a quick check of my ISP's status page, just in case they've added anything since yesterday
[09:08:46] XioNoX: never done that, but we can have a look
[09:09:02] the other option is to put an upper limit on the dependency
[09:09:04] volans: https://gerrit.wikimedia.org/r/c/operations/software/homer/+/755312
[09:09:05] and rebuild homer
[09:09:09] :)
[09:09:44] klausman: what's your mtr now?
[09:09:49] (done)
[09:10:04] https://phabricator.wikimedia.org/P18814
[09:10:27] klausman: ah yeah, now you land in Chicago instead of eqiad as before
[09:10:59] Things are still slow on v6
[09:12:37] You still have that hop in the trace in Geneva? Hop 9?
[09:12:51] Might be nothing, but the ICMPs were super slow coming back on that.
[09:13:48] checking...
[09:18:23] Yep, e0-35.core2.gva1.he.net is still there
[09:18:49] (and the ICMP from it is still slow)
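As an aside on the bastion comparison above ("3005 still fast, 1003 still slow", v6 only): a quick way to reproduce that kind of v4-vs-v6 split without a full mtr is to time TCP handshakes to both bastions over each address family. A minimal sketch; the FQDNs are assumptions based on the host numbers in the log, and connect time only reflects round-trip latency, so a throughput problem (the retransmit kind) still needs mtr or a pcap:

```python
# Time TCP handshakes to port 22 on both bastions over v4 and v6, to
# reproduce the "3005 fast, 1003 slow, v6 only" comparison from the log
# without a full mtr. The FQDNs are assumptions based on the host numbers
# mentioned; connect time only shows round-trip latency, so heavy
# retransmits would still need mtr or a pcap to spot.
import socket
import time

HOSTS = ("bast1003.wikimedia.org", "bast3005.wikimedia.org")  # assumed FQDNs

for host in HOSTS:
    for family, label in ((socket.AF_INET, "v4"), (socket.AF_INET6, "v6")):
        try:
            sockaddr = socket.getaddrinfo(host, 22, family, socket.SOCK_STREAM)[0][4]
            start = time.monotonic()
            with socket.create_connection(sockaddr[:2], timeout=5):
                elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{host} {label}: connect in {elapsed_ms:.1f} ms")
        except OSError as exc:
            print(f"{host} {label}: failed ({exc})")
```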
[09:18:50] Ok. I also notice in the trace back to you it's going via Telia, who send it from Ashburn to Reston I think
[09:18:55] https://phabricator.wikimedia.org/P18815
[09:20:26] These are fairly close by; however, the latency is very high on the hop between Ashburn and Reston
[09:20:29] https://usercontent.irccloud-cdn.com/file/yIUxQ7yH/image.png
[09:21:10] Not much I can do from here, I suspect
[09:24:56] We certainly aren't powerless, and to the extent your issue might be reflected in performance problems for users, it's probably worth us getting to the bottom of it.
[09:25:35] I'll see what XioNoX thinks in terms of actions we can take.
[09:25:43] I mean, it's obviously your call to make
[09:25:45] We could disable HE in Chicago potentially, or adjust the routing back to not go via Telia. To see how that affects things.
[09:26:19] At worst, I'll tweak my ssh config to always use bast3005 and hope I remember it if/when that breaks
[09:27:26] It makes sense to use bast3005 anyway, I think.
[09:28:10] Now I must resist the urge to craft a tool that will generate the best ssh config depending on local network conditions :D
[09:32:15] I'm re-enabling HE in eqiad
[09:42:58] XioNoX: I note Init7 are at Equinix, Ashburn, but we don't peer with them there.
[09:43:05] Any harm if I fire off a peering request to them?
[09:43:25] topranks: sure, go for it!
[09:45:29] Knowing Init7, they'll say yes
[09:50:07] 10netbox, 10DBA, 10Infrastructure-Foundations: Grants not working with DB hosts with ipv6 - https://phabricator.wikimedia.org/T270101 (10jcrespo) To expand marostegui's answer (as I also researched it at T271148#6735477): > Can we "just" add the following Not really, adding ipv6 means the extra grants a...
[09:51:23] klausman: yes, one would hope so.
[09:51:48] I've sent that now, so for now let's park this and, assuming we get that set up over the next few days, re-test when the path back is direct from us to them.
[09:52:13] Roger!
[10:45:21] volans: new homer released and seems to be working fine! thanks
[10:46:25] yay! thank you
[10:46:46] 10netops, 10Infrastructure-Foundations, 10SRE: Paramiko > 2.8.1 incompatibility with some Juniper devices - https://phabricator.wikimedia.org/T299482 (10ayounsi) 05Open→03Resolved a:03ayounsi Workaround pushed.
[10:50:47] 10netops, 10Infrastructure-Foundations, 10SRE: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (10ayounsi) Juniper bumped their recommended version to at least Junos 20 on a lot of platforms. * pfw: T295691 * cr: T295690 * mr: T278289
[11:20:30] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-ulsfo: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) 05Stalled→03Declined Not needed anymore.
[14:52:57] 10Packaging, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond) Thanks, this is not such a big issue for the os_reports as the '*' gets passed to the remotes rsync server...
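For the record on the paramiko task resolved above: issue 1961 concerns devices (including some Junos versions) that advertise the rsa-sha2-* public key algorithms but fail authentication when a client actually uses them, which paramiko >= 2.9 does by default. The workaround homer shipped appears to be the dependency cap volans suggested; for code that talks to paramiko directly, the upstream issue points at disabling those algorithms instead. A minimal sketch of that direct-paramiko approach; the hostname, username, and key path are placeholders, not WMF values:

```python
# Sketch of the workaround from https://github.com/paramiko/paramiko/issues/1961
# for clients using paramiko directly: tell paramiko not to offer the
# rsa-sha2-* public key algorithms, so authentication falls back to plain
# ssh-rsa, which the affected Juniper devices do support.
# All connection parameters below are placeholders.
import os
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.connect(
    "router.example.net",                           # placeholder device name
    username="automation",                          # placeholder user
    key_filename=os.path.expanduser("~/.ssh/id_rsa"),  # placeholder key path
    disabled_algorithms={"pubkeys": ["rsa-sha2-512", "rsa-sha2-256"]},
)
stdin, stdout, stderr = client.exec_command("show version")
print(stdout.read().decode())
client.close()
```

Whether Junos-pyez exposes a way to pass this option through to paramiko was the open question at the start of the log; capping the library version and rebuilding homer sidesteps it.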
[14:53:02] 10Packaging, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond)
[15:06:07] 10Packaging, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: microsites systemd job "Sync OS migration reports/overview" might be broken - https://phabricator.wikimedia.org/T299520 (10jbond) 05Open→03Resolved a:03jbond updated
[15:46:03] XioNoX: one thing to watch out for re: atlas_exporter is that the scrape time doesn't become too long -- the code isn't exactly the most scalable/robust
[16:45:32] cdanis: noted, 2 questions about that: why is -streaming set to false (the default is true), and I guess if something times out it will be shown in the logs?
[16:45:56] XioNoX: -streaming caused race conditions and crashes when I tested it
[16:46:32] and it may show up in the exporter logs, but it will definitely show up as scrape failures in prometheus (missing data points)
[16:46:51] cdanis: ok thanks!
[16:47:53] XioNoX: there's also prometheus-exported self-metrics about this https://w.wiki/4i3i
[16:58:38] yeah, what's the unit?
[16:58:43] seconds I guess?
[16:58:44] seconds
[16:58:48] a bit more background at https://www.omerlh.info/2019/03/04/keeping-prometheus-in-shape/
[16:58:53] right, it's in the name :)
[16:58:58] there are some metrics about byte size and such as well
[16:59:06] oh and the generic up{job="atlas_exporter"} target as well
[16:59:06] thx, will read
[16:59:18] literally "does prometheus think the job it is watching is up"
[17:40:39] definitely an increase in scrape_duration (and that's with only eqiad/codfw added), but it doesn't look too bad
[17:59:42] yeah, and the up{job="atlas_exporter"} metric just shows you restarting the jobs, which is good
[18:00:18] 10CFSSL-PKI, 10Infrastructure-Foundations: cfssl: cfssl signers should correctly inject default values to profiles - https://phabricator.wikimedia.org/T299562 (10jbond) p:05Triage→03Medium
[18:00:57] I'm going to try to add all the other sites to see how it goes
[18:01:03] will roll back if needed
[18:01:32] +1
[18:01:46] 10CFSSL-PKI, 10Infrastructure-Foundations: cfssl: cfssl signers should correctly inject default values to profiles - https://phabricator.wikimedia.org/T299562 (10jbond)
[19:16:57] 10Puppet, 10Infrastructure-Foundations, 10Project-Admins, 10PM: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (10Aklapper) @joanna_borun: ping
[19:46:44] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Znuny, 10fundraising-tech-ops: move donation, donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn can you please remove wikimania as an alias from the mail servers controlled by...
[23:37:05] 10Puppet, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jhathaway) >>! In T236954#7629565, @colewhite wrote: > Deviations are sometimes necessary to maintain human-readability. When deviations are neces...
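To close the loop on the atlas_exporter monitoring discussed above: both signals mentioned, scrape_duration_seconds (per-scrape duration, in seconds, as the name says) and the generic up{job="atlas_exporter"} target health, can be pulled from Prometheus's standard HTTP query API. A minimal sketch; the Prometheus base URL is a placeholder, not a real WMF endpoint:

```python
# Query the two atlas_exporter health signals from the standard Prometheus
# HTTP API: scrape_duration_seconds (how long each scrape took, in seconds)
# and up (1 if Prometheus thinks the target is up, 0 if the scrape failed).
# The Prometheus base URL below is a placeholder.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder URL

for query in (
    'scrape_duration_seconds{job="atlas_exporter"}',
    'up{job="atlas_exporter"}',
):
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "?")
        _ts, value = result["value"]
        print(f"{query} @ {instance} = {value}")
```

A slowly growing scrape_duration_seconds is the early warning XioNoX was told to watch for; gaps in up are the hard failure mode (scrapes timing out entirely).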