[10:29:32] hi there!
[10:29:41] I see this on PCC for an unrelated change
[10:29:48] Call, Class[Profile::Debdeploy::Client]: parameter 'ensure' expects an Array value, got String (file: /srv/jenkins-workspace/puppet-compiler/31263/change/src/modules/profile/manifests/base.pp, line: 112, column: 5)
[10:31:14] should be fixed with https://gerrit.wikimedia.org/r/723480
[10:33:17] isn't that patch introducing the array datatype that the compiler is complaining about?
[10:34:11] in other words, I think the failure is produced by that patch
[10:35:04] yeah, sorry I misread the error
[10:35:07] ^ jbond
[10:37:24] thanks for fixing
[10:39:51] phew
[10:39:58] I thought I broke the LVSs at first
[10:48:57] jbond: I am getting another error, are you already aware?
[10:50:48] effie: yes, thanks, just sending another patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/723493
[10:50:54] not my day today
[10:54:43] no worries, we still like you
[10:56:35] <_joe_> did we ever?
[10:56:42] :P
[11:00:56] fyi the fix is rolling out
[12:46:51] ryankemper: I think one of your wcqs changes broke icinga. See https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=Check+correctness+of+the+icinga+configuration
[12:47:02] It complains about the wcqs_codfw hostgroup not being defined
[12:47:19] Error: Could not find any hostgroup matching 'wcqs_codfw' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 50424)
[17:15:41] Thanks akosiaris, deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/723314/ to fix
[17:54:16] The page about the LE root expiring makes it sound like some of our Tier-1 sites would be impacted. https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry
[17:54:28] Is this accurate?
[17:54:32] yes
[17:54:43] " Wikipedia) make use of Let's Encrypt certificates at some of our edge servers"
[17:54:53] I thought we only used them for secondary services and as a fallback
[17:55:16] https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-traffic/20210923.txt has some good background/details from b.black if you haven't read it yet
[17:55:27] 14:29:53 we replaced one of the major commercial providers with LE some time back [I'd have to dig for the date]
[17:55:28] 14:30:22 in our current config, the US edge sites (ulsfo, codfw, and eqiad) all serve an LE certificate, and the non-US sites (eqsin, esams) use a commercial Digicert certificate
[18:00:01] Hm.. I'm worried about that potentially going too far. Wikipedia already seems like the odd one out among the major sites in that we don't serve older TLS clients. E.g. when using an old XP/Mac/iOS device, Apple.com/Google/Mozilla.org etc. typically load, but wikipedia.org doesn't connect.
[18:00:24] in the grand scheme of things, would it not be "cheap" to "just" get another non-LE cert as a backup for the next few years?
[18:00:50] (I'm mainly asking to learn why it doesn't make sense, not because I think I'm right)
[18:34:45] It looks like April 1, 2019 was when we started serving LE on the so-called unified wildcard cert: https://phabricator.wikimedia.org/T213705#5074087
[18:34:49] so, over two years ago
[18:42:15] although, maybe https://phabricator.wikimedia.org/T230687 is when it actually happened?
[18:57:32] Krinkle: yes, we had a 3x CA setup during the interim when we were first deploying LE in this capacity, but we dropped one of the commercial options because it also didn't perform as well (bigger chains, bigger OCSP staples, etc), and two seems like enough for operational redundancy.
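(A quick way to see which CA and chain a given edge actually serves, per the per-site split described above, is to walk the certificate chain the server presents. The following is only an illustrative Node/TypeScript sketch, not WMF tooling; the hostname is just an example.)

    // Walk and print the certificate chain presented by a server.
    import * as tls from "node:tls";

    const host = "en.wikipedia.org"; // example target; any TLS host works

    const socket = tls.connect({ host, port: 443, servername: host }, () => {
      let cert = socket.getPeerCertificate(true); // true => include issuer chain
      const seen = new Set<string>();
      // A self-signed root lists itself as its own issuer, so stop on repeats.
      while (cert && cert.fingerprint && !seen.has(cert.fingerprint)) {
        seen.add(cert.fingerprint);
        console.log(`${cert.subject?.CN ?? "?"}  <-- issued by: ${cert.issuer?.CN ?? "?"}`);
        cert = cert.issuerCertificate;
      }
      socket.end();
    });

(An LE-served edge should show a leaf issued by R3 chaining up to ISRG Root X1, while a Digicert-served edge shows a Digicert issuer instead.)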
[18:58:36] it is relatively cheap to pick up another commercial option (but not trivial - low 5 figures / year)
[18:59:06] Ack, dropping the third one seemed sensible back when the LE root wasn't yet expiring in this way and/or it was unknown whether some other creative workaround would come out of the woodwork for another 1-2 years.
[18:59:19] if the LE root issue ends up having large fallout, that's probably what we'll have to do. But our estimation is the fallout will be pretty minimal vs where our TLS standards are already at in terms of client cutoff.
[18:59:42] the announcement, etc, is just being abundantly cautious and informative
[19:00:37] The numbers are going to be small in any way that's measured as a percentage at our scale, for sure. Have we decided, before seeing the numbers, how many affected "device uniques" we'd consider acceptable to deny access to?
[19:01:08] 5 figures is larger than I expected. Is that for an EV or a regular wildcard of the main project domains?
[19:01:18] yes, many times over the years, and this one is expected to be in line with past decisions
[19:01:25] we don't use EV certs
[19:01:40] (but we do have a large SAN count, many of which are wildcards)
[19:03:07] half of which are m-dots
[19:03:28] yes, we're pretty much stuck with those for the foreseeable future, even if someone finally did get rid of them canonically :)
[19:03:37] ack
[19:03:38] (which I'd be in great support of!)
[19:06:34] calculating the actual users denied by any TLS changes (mostly speaking about our past, intentional ones, like dropping TLSv1.0, or dropping CBC ciphers, or dropping 3DES, etc, etc)... is quite tricky. Some have multiple devices and only lose 1/N old things around the house. Some lose access from $work due to a really bad office-controlled setup, but still have it on their phone and/or at home,
[19:06:40] etc.
[19:07:46] we don't really have the stats to infer the "real" impact, but historically we've aimed at estimations that our actions would affect ~0.1% or less of "requests" based on what we have in analytics about UAs and TLS negotiations.
[19:09:01] 0.1% is not a small number at our scale, but there has to be a cutoff point at which we can help move the world past insecure choices in a timely fashion, too. In general, clients that fail to meet our TLS standards are also horribly insecure and outdated in 10 other ways and probably not safe to rely on. It is a tricky debate, but we've been over this ground many times over many years.
[19:09:04] I suppose we don't have a way to detect from our edge that a client is affected by this, or do we have a partial redirect campaign this time around as well?
[19:09:30] certainly harder than last time, if at all possible, short of a UA sniff
[19:09:38] no, we don't have any easy way to know about this one, other than inferences from UA-string stats
[19:10:05] even those aren't reliable, as e.g. the same UA string may be sent by java8 code that either is or isn't up to the patchlevel needed to work around this.
[19:10:50] We could potentially do something like we did with IPv6, where we expose a hostname that's LE-only and then add a bit of JS to the WikimediaEvents payload that at random makes a JS background req there, and if it fails, put up a mw.notify() linking to our information page.
[19:11:23] but I'm not sure that'd even be stable, since a req can fail for all kinds of reasons.
[19:11:30] The failure rate could be higher and uncorrelated with what we want
[19:11:45] it can potentially also be fooled by caching additional chain data from other unrelated sites, etc.
[19:12:11] there's lots of trickiness to actually detecting this reliably
[19:12:17] other than user reports after the fact.
[19:12:32] although the JS could be limited to the <1% of UAs we know are likely affected.
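(The probe idea sketched above might look something like the following in browser TypeScript. This is a hypothetical illustration, not a deployed script: the LE-only hostname, sampling rate, and message text are all made up; mw.notify() is the real MediaWiki notification API.)

    // Provided by MediaWiki core on wiki pages; declared here so the sketch stands alone.
    declare const mw: { notify(msg: string, opts?: { autoHide?: boolean }): void };

    // Hypothetical endpoint serving a chain that terminates at the
    // un-cross-signed ISRG Root X1 (this hostname does not exist).
    const PROBE_URL = "https://le-only.example.wikimedia.org/probe.png";
    const SAMPLE_RATE = 0.001; // probe ~0.1% of pageviews

    if (Math.random() < SAMPLE_RATE) {
      const img = new Image();
      const timer = setTimeout(() => { img.src = ""; }, 10000); // give up after 10s
      img.onload = () => clearTimeout(timer);
      img.onerror = () => {
        // Caveat from the discussion above: this fires for network blips,
        // blockers, etc., not only for a missing root certificate.
        clearTimeout(timer);
        mw.notify("Your device may lose access to Wikipedia soon; see [info page].", { autoHide: false });
      };
      img.src = PROBE_URL; // background request; failure suggests a trust problem
    }

(As noted in the discussion, per-UA targeting and repeated sampling would be needed to separate a real trust-store failure from ordinary request noise.)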
[19:12:41] but predicting it for major client platforms is pretty simple, based on whether they've gotten root CA updates since LE released their "real" one ~5 years back.
[19:13:21] that's basically what it boils down to: if this root expiry breaks you, you're effectively on a platform that hasn't been security-patched in 5 years :/
[19:13:31] right
[19:13:33] (or more!)
[19:14:03] and Firefox remains afaik the only browser that (fortunately?) ships its own (mostly/fully?) OS-independent certificate handling.
[19:14:25] right. For Firefox, the cutoff is being on v50 or higher to have the cert, which is pretty old.
[19:14:34] even WinXP can run FF v53 and probably get through this
[19:15:08] right, you'd have to be on a very old Firefox, but more importantly, you can get a newer Firefox and unblock your device even if other aspects of it are stuck
[19:15:10] (and for all I know, the affected iPhone 4S users can install firefox somehow and get past this too. I'm not really an iUser)
[19:15:28] No, there are no alternate engines allowed on iOS devices.
[19:15:39] The "Firefox" app is just Safari with Mozilla Sync for bookmarks added on top
[19:15:45] heh, nice
[19:16:15] basically the same kind of webviews that you'd see in native apps when opening a web link on the side without leaving the app, but as the whole app.
[19:16:21] :q
[19:16:56] although apple is pretty good about supporting OS updates for a long time, especially on mobile.
[19:17:09] I still have an iPhone SE from 2016 with the latest everything.
[19:17:13] right, I did all the digging on the public info for that from apple and others
[19:17:18] and a MacBook Pro from 2015, same deal
[19:17:46] and an iPad from 2013 I think, which is now stuck on last year's iOS for the first time.
[19:18:16] but according to the best public info, you have to be on iOS 10 to have gotten this cert from Apple. The "iPhone 4" already doesn't work with us, the "iPhone 4S" currently does, but is stuck on iOS 9 and will fail at root expiry, and the iPhone 5 and beyond can get iOS 10 or higher and are fine, basically.
[19:19:29] iOS 9.3.6 was released in 2019.
[19:19:36] I wonder if that last push came with any cert updates
[19:19:45] (and for macs - basically anything made in circa 2010-ish or later can upgrade far enough)
[19:20:45] Is there a test site with an LE cert that doesn't include the cross-sign hack?
[19:20:51] https://letsencrypt.org/docs/certificate-compatibility/ is the source material on which platforms trust the newer root
[19:20:52] would be nice to link on https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry
[19:20:54] it claims only iOS 10
[19:21:41] Krinkle: I think Platonides set up a test site
[19:22:22] the LE info links to a "last updated in 2017" help page from apple, so might be out of date
[19:22:24] yeah, that could be valuable for pre-testing I guess (one that explicitly chains straight to the un-cross-signed X1)
[19:22:29] [23:22:05] I have prepared two sites with old and new root chains
[19:22:29] [23:22:32] old.le.wikilov.es & new.le.wikilov.es
[19:23:04] but that cert caching might interfere
[19:24:24] yeah it's possible
[19:24:34] this confirms the config is correct for new.le: https://www.ssllabs.com/ssltest/analyze.html?d=new.le.wikilov.es
[19:24:49] (it sends a chained R3, which is signed by the self-signed X1 root)
[19:25:21] so, yes, if you're on a device which will break on the date, that test site will break now to prove it
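(Given the old.le/new.le test sites above, one can also pre-test a chain from any machine by handing a TLS client only the new root and seeing whether validation succeeds, rather than hunting for an old device. A rough Node/TypeScript sketch; it assumes you've saved the self-signed ISRG Root X1 PEM from letsencrypt.org as isrg-root-x1.pem.)

    import { readFileSync } from "node:fs";
    import * as tls from "node:tls";

    const host = "new.le.wikilov.es";
    const socket = tls.connect(
      {
        host,
        port: 443,
        servername: host,
        // Trust *only* ISRG Root X1, ignoring the system store, so the
        // DST Root X3 cross-sign can't help the chain validate.
        ca: [readFileSync("isrg-root-x1.pem")],
      },
      () => {
        console.log(`${host}: chain validates against ISRG Root X1 alone`);
        socket.end();
      }
    );
    socket.on("error", (err) => console.error(`${host}: failed: ${err.message}`));

(The inverse test, trusting only the expiring DST Root X3, is roughly what an affected client does, and will fail once that root expires.)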
[19:28:54] but to give an example of the scale we're talking about with some of these known-affected UAs:
[19:28:56] so, testing on iOS 9.0.2 with browserstack definitely fails. In the default Safari browser, it gives a very mild permission prompt while the page remains in the "loading..." state, just asking if you want to continue, and then loads the page fine.
[19:29:35] In the Chrome app, it's a big scary red page, per the Chrome design style guide for TLS warnings to be something people will not pass unless they're very tech savvy
[19:29:51] (and you'd have to go to advanced / scroll down / accept risk and continue)
[19:30:03] in the past 24h of turnilo data: UA strings with "iPhone OS 9" in the string account for ~0.016% of all requests, and then within that set, 36% of those self-identify as simulations on behalf of googlebot and baidu
[19:31:48] and that's the one sub-case I'm most worried about in terms of real human user impact (the iOS9 iPhone 4S's, which are some subset of the above)
[19:32:04] this should be mostly a non-event in the big picture
[19:32:08] on iOS 8 it won't load and doesn't have a way to continue, presumably due to TLS version. apple/google/yahoo preloaded links do load with HTTPS, the preloaded wikipedia link does not.
[19:32:32] yes, we're stricter than other sites on TLS standards
[19:32:51] so yeah, I think iOS 9 will be fine. It predates Apple joining Chrome on the "TLS warnings are scary" bandwagon, so most people probably won't even read the prompt and will just continue without a problem.
[19:33:31] https://usercontent.irccloud-cdn.com/file/KsBdLELq/Screenshot%202021-09-24%20at%2020.33.16.png
[19:33:39] This is mild even for Safari, compared to e.g. their desktop versions in the past.
[19:33:49] yeah
[19:34:33] When I'm outside near a coffee shop, I press OK on at least 4 of those before I've made an order.
[19:34:43] (to reject the unsolicited invitations to join a hotspot)
[19:35:27] which is of course why they changed the design later, but oh well :)
[19:35:59] the good news is we're basically done with advancing standards on our end, for quite some time into the future (until we can kill TLSv1.2, which is so far out it's unknowable at this point, barring some unexpected new bug)
[19:37:05] Do we have any stats on attempted TLS connections that couldn't complete, with some high probability of relating to TLS versions?
[19:37:06] the last little bit we have left to do in the present is deprecate and remove RSA certificate support in the public termination, which we've been putting off for a while because it supports one last known use-case: users of Firefox 53 on WinXP.
[19:37:13] Krinkle: nope
[19:38:40] with a concerted effort and enough analysis, we could probably figure out something like that, but it would be a big project to do so, and the data interpretation would be very heuristic and human-analysis-filtered to make any sense.
[19:39:12] (vs network disruptions, and separating real human UAs from e.g. old scripts running on some forgotten server banging away at us with old software, etc)
[19:39:45] the failure happens before we get very much fingerprinting data to go on (but we do get some, in the form of the ciphersuite support set of the client and a few other minor bits)
[19:41:55] there were some old sniffer-based, horribly hacky scripts we used years ago to do such an analysis of incoming ciphersuite fingerprints, to guide some of our earliest improvements, which might still be in puppet somewhere
[19:42:55] ah yes, cipher_cap and cipher_sim here:
[19:42:58] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/tlsproxy/files/utils
[19:44:23] these allowed us to grab the ClientHello cipher data from the live traffic, and then run simulations that answer questions like "if we remove support for cipher X, which clients will fail and which will negotiate some other viable option?" etc
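(A toy illustration of the cipher_sim idea described above: replay observed ClientHello cipher lists against a reduced server config to see which clients would fail. This TypeScript sketch is not the linked scripts themselves; the client fingerprints and cipher lists are made up, and the real tools work on captured live-traffic data.)

    // Ciphers the server currently offers (illustrative set).
    const serverCiphers = new Set([
      "ECDHE-ECDSA-AES256-GCM-SHA384",
      "ECDHE-ECDSA-CHACHA20-POLY1305",
      "ECDHE-RSA-AES128-SHA", // candidate for removal
    ]);

    // Cipher lists seen in ClientHellos, keyed by a client fingerprint.
    const observedClients: Record<string, string[]> = {
      "modern-browser": ["ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-ECDSA-AES256-GCM-SHA384"],
      "old-xp-client": ["ECDHE-RSA-AES128-SHA", "DES-CBC3-SHA"],
    };

    // "If we remove support for cipher X, which clients will fail and
    // which will negotiate some other viable option?"
    function failsWithout(removed: string): string[] {
      const remaining = new Set([...serverCiphers].filter((c) => c !== removed));
      return Object.entries(observedClients)
        .filter(([, offered]) => !offered.some((c) => remaining.has(c)))
        .map(([name]) => name);
    }

    console.log(failsWithout("ECDHE-RSA-AES128-SHA")); // -> [ "old-xp-client" ]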
[19:48:14] (but that was able to operate just on the first packet from the client. To look at negotiation failures like you're talking about, we'd also have to observe the success or rejection or timeout/drop of the whole bidirectional session, and then analyze the ciphersuites to suss out likely client-types, and look at the client IP ranges to provide more inference on who/what they are, etc.)
[19:48:39] and realistically, if some device is being run by a human and hasn't been able to talk to us for years, how likely is it to connect to show us that, now in the present?
[20:28:34] Krinkle: bblack: as usual, thank you for the informative Q&A above.