[01:16:17] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Dzahn) caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/714624 ?
[01:16:26] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Dzahn) It definitely should not be 2 levels of subdomain. That won't be covered by the cert and would explain the error. That being said, none of the...
[05:02:50] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Marostegui) p:Triage→High
[11:55:49] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney) Ok well we're about a week after DC switchover back to eqiad so we can make some conclusions on the results of the changes in eqiad. Overall there definitel...
[12:01:13] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) @jcrespo thanks for the above comments. In terms of...
[12:02:47] netops, Infrastructure-Foundations: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney)
[12:03:18] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[12:03:24] netops, Infrastructure-Foundations: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney)
[12:03:34] netops, Infrastructure-Foundations: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney) p:Triage→High
[12:13:59] netops, Infrastructure-Foundations: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney) In terms of further mitigation one thing we could possibly do in the short-term is to change how we configure our VRRP states. Currently we configure VRRP primary/backup stat...
[12:18:25] netops, Infrastructure-Foundations: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney) Another change that could help here would be to move the L3 gateway for hosts to the virtual-chassis. i.e.: - Set up new, routed sub-interfaces between the ASWs and CRs. - A...
[13:37:51] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) @cmooney Please feel free to resolve this ticket and...
[14:15:31] Traffic, Analytics, Analytics-Kanban: Review use of realloc in varnishkafka - https://phabricator.wikimedia.org/T287561 (odimitrijevic) p:Triage→Low
[15:02:02] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Ladsgroup) The config I'm getting is query-commons.wikimedia.org https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/720072/2/sites/wcqs/cust...
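On Dzahn's point above about the extra subdomain level: a TLS wildcard matches exactly one DNS label, so a SAN like *.wikimedia.org would cover query-commons.wikimedia.org but not query.commons.wikimedia.org. A minimal, hedged way to check what names the served certificate actually covers (illustrative commands only, not something anyone ran in this log):

```
# Dump the SANs served for the canonical domain; each wildcard entry here only
# matches a single label (RFC 6125), so a two-level subdomain needs its own SAN.
echo | openssl s_client -connect wikimedia.org:443 -servername wikimedia.org 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```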
[15:32:51] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (EBernhardson) Commons query has not been deployed yet. No public DNS has been assigned. Nothing is configured to route traffic from the public interne...
[16:10:36] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Dzahn) It seems that the reported "currently giving a broken https cert" is basically impossible with this not being in DNS.
[16:34:45] netops, Infrastructure-Foundations, SRE: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (cmooney)
[16:34:51] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney)
[16:39:05] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Open→Resolved @jcrespo thanks. As you say i...
[16:50:38] hey traffic, just merged the `service_setup` part of the wcqs work, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959. I've now got the patch up to transition to `lvs_setup`: https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254
[16:50:45] per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers it sounds like I should get a sanity check that the change looks good, and then get your go-ahead before actually doing the pybal restarts
[16:51:03] so...does everything look right for https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254?
[17:28:05] HTTPS, Traffic, SRE, Wikidata, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (EBernhardson) Open→Invalid Seems to be a miscommunication, the service is not yet publicly available.
[17:31:57] (VarnishTrafficDrop) firing: 60% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:36:57] (VarnishTrafficDrop) resolved: 66% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:41:47] bblack, sukhe: would you be available to give a hand to ryankemper for a few LVS patches?
[18:06:24] I am available and around but unfortunately can't review LVS patches as I have never done that. so it has to be bblack as the other person who can do that is on vacation
[18:06:29] gehel: ^
[18:08:19] sukhe: thanks ! Let's see if Brandon is around. Otherwise, is there someone on a European timezone ? I can take this over from Ryan tomorrow if needed.
[18:14:25] gehel: the only consideration is that tomorrow's friday. although arguably the european friday is less of a friday than the american friday given we'll be around to clean up any (purely hypothetical) messes :P
[18:15:06] (we meaning america)
[18:23:23] Still not a super great idea :/
[18:24:00] I hope we can merge this today. Otherwise let's try to schedule some time with bblack on Monday
[18:25:51] gehel: if Brandon is not around today, you can try to ping em.a (CEST) but I am not sure if he can help. apologies for not being more useful :)
[18:26:26] (I meant ping em.a tomorrow)
[18:30:03] sukhe: all good, you've done everything you can!
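For context on the pybal restarts ryankemper is asking about, here is a rough sketch of the steps he lays out later in the log (following the Wikitech LVS page linked above); the cumin invocation, the service unit name, and the HTTPS port are assumptions on my part, not commands anyone confirmed running:

```
# Sketch only -- assumes cumin drives the fleet-wide puppet run and that the
# wcqs service answers HTTPS on 443; both are assumptions, not from the log.
sudo cumin 'O:lvs::balancer' 'run-puppet-agent'   # roll out the merged lvs_setup change

# On the backup LVSes first (lvs2010 / lvs1016), then the low-traffic primaries
# (lvs2009 / lvs1015): restart pybal and confirm the new service shows up.
sudo systemctl restart pybal.service
sudo ipvsadm -L -n        # the new wcqs VIP:port and its realservers should be listed

# Finally, test the service addresses directly (-k because the internal name
# may not match the certificate).
curl -vk https://wcqs.svc.eqiad.wmnet/
curl -vk https://wcqs.svc.codfw.wmnet/
```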
[19:42:33] gehel: I am around today, just been busy!
[19:43:06] bblack: thanks! I'm getting off work, but ryankemper should be around
[19:46:50] ryankemper: since this only affects low-traffic in the core DCs, basically the pybals to restart are going to be lvs2010 and lvs1016 (the backup LVSes), then the primary low-traffic ones at lvs2009 and lvs1015
[19:47:23] looks sane!
[19:53:54] bblack: great, I'd like to start the process in 10-15 mins if that doesn't conflict with anything
[19:55:24] sounds like the procedure is: merge, run puppet across all of `'O:lvs::balancer'`, ack any alerts, restart pybal on backups, sanity check `sudo ipvsadm -L -n`, then restart on the active lvs server, then finally do a test with curl on `wcqs.svc.eqiad.wmnet` / `wcqs.svc.codfw.wmnet`
[20:00:36] ryankemper: should work!
[20:45:44] bblack: great, proceeding. loose plan here: https://phabricator.wikimedia.org/T280001#7375321 see #wikimedia-operations for updates, but I'll come and shout here if anything catches on fire :)
[21:14:32] re: earlier discussion in the traffic meeting about the upcoming LetsEncrypt root expiry issue:
[21:16:42] we're planning to let this play out naturally (meaning no changes on our end), and I've summarized the issue on a Wikitech page that might help with support in case of complaints (I'll send this to sre@ in an email, and reach out to CommRel+TE in a separate email since they're likely to hear any complaints first):
[21:16:57] https://wikitech.wikimedia.org/wiki/HTTPS/Letsencrypt-Root-2021
[21:26:24] where are we using letsencrypt?
[21:26:48] I think they are mostly not-very-used domains, aren't they?
[21:27:12] Platonides: no, but it's tricky to observe depending on where you're located! :)
[21:27:20] to backtrack a bit:
[21:28:10] for many years now, for the main sites, we've had a policy of using redundant duplicate certificates issued by two independent certificate authorities, because this provides us some important insurance against some operational issues
[21:28:25] yes
[21:28:42] but I thought both were from "big players"
[21:28:45] and since a backup cert that isn't being actively used could develop issues that we aren't even aware of until we try to use it, we also use both of them live
[21:28:58] yep, I know
[21:29:23] [sorry if you know some of this, but I'm considering other readers may have similar questions and different background!] they used to be from two major commercial cert providers
[21:29:53] we replaced one of the major commercial providers with LE some time back [I'd have to dig for the date]
[21:30:22] in our current config, the US edge sites (ulsfo, codfw, and eqiad) all serve an LE certificate, and the non-US sites (eqsin, esams) use a commercial Digicert certificate
[21:30:28] ok, I wasn't aware of that switch of one of those
[21:30:44] iirc it was a pretty uneventful switch :)
[21:31:12] this arrangement actually helps reduce the compat issues we're talking about now, as users who face our US edges typically have more-updated clients, too.
[21:31:28] but that's a minor detail and we'd have to get into a lot of technical weeds and what ifs about it.
[21:32:43] we'll have a patch prepped in case the expiry fallout is much larger than everyone anticipated, but we're not planning to use it if the impact is the expected minor fallout.
[21:32:56] (a patch to switch all sites temporarily to Digicert while we figure out our next steps)
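As an aside, a hedged sketch of how one could confirm which of the two certificates (LE at the US sites, Digicert elsewhere) a given edge serves, and when the served leaf expires; the per-site text-lb names are my assumption about how to address a specific edge, and were not mentioned in the discussion above:

```
# Illustrative only; the issuer distinguishes a Let's Encrypt chain from a DigiCert one.
for edge in text-lb.eqiad.wikimedia.org text-lb.esams.wikimedia.org; do
  echo "== $edge =="
  echo | openssl s_client -connect "$edge:443" -servername en.wikipedia.org 2>/dev/null \
    | openssl x509 -noout -issuer -enddate
done
```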
[21:33:32] it'd be interesting to measure # of affected clients
[21:34:07] but I can't think of a way to do that which isn't intrusive and doesn't require a non-trivial amount of work
[21:34:09] it's very difficult to do so, as they'll just fail to connect to us at all, and they represent such a small statistical portion of our clients/requests that it falls below the noise floor of most statistical analysis
[21:34:41] (and yeah, there's always a way, but it might be quite difficult!)
[21:34:57] you could use an LE root on uploads.wm.o everywhere and digicert on text
[21:35:07] then people who see no images are the ones affected
[21:35:24] but that's intrusive
[21:35:41] you could load a JS file from a specific domain which will fail
[21:35:44] yes, and potentially raises other corner case issues, too
[21:35:54] similar to the things we did when changing tls versions
[21:36:02] but that requires work
[21:36:25] we do the cert split on a geographic (US vs non-US) basis to reduce the odds of one client ever really seeing both certs alternatingly/persistently
[21:36:58] hmm, would that really be a problem?
[21:37:00] (because there are some subtle bugs out there where requests for text or upload get misdirected to the other, and some client might be confused by the competing certs, etc, etc - there's lots of unknowns there)
[21:37:39] esp with HTTP/2 connection reuse in play, and both text and upload using the same wildcard SANs which cover the names of both
[21:38:07] are clients really allowed to do that? O_o
[21:38:22] they shouldn't, but we've seen odd logs at times!
[21:38:23] send a request for foo to a server on a different ip but valid cert?
[21:38:42] I'd expect that would break so much of the internet
[21:39:02] pretty much every server where you put a wildcard
[21:39:03] right - they're supposed to obey the IP distinction from DNS. But we have some VCL to catch this case because some clients do hit it (accidental reuse of the wrong connection, due to some bug or standards-interpretation issue or who knows what)
[21:40:00] there's even a special response code made for this case: 421 Misdirected Request
[21:40:43] wow
[21:40:48] https://datatracker.ietf.org/doc/html/rfc7540#section-9.1.1
[21:41:00] so many things that "should be impossible"
[21:41:34] it's bugs all the way down Platonides ;)
[21:41:41] :D
[21:42:51] well, "should be impossible" usually means more like "0.001%", and then when you get ~100K+ reqs/sec, those things really do happen eventually.
[22:40:56] bblack: curious if you know when the digicert root will expire
[22:47:40] legoktm: it's actually a complex thing to figure out for any given case. For the common case, clients which stay up to date get new replacement roots well ahead of time for most such cases.
[22:48:26] some X years ahead of an important root expiring, someone makes a new root to replace it, and goes around informing all the people who manage sets of roots (browser/OS/ssl-lib vendors, etc) to add it.
[22:48:56] and those parties take varying amounts of time going through their various processes, and hopefully get it into newer clients in plenty of time.
[22:50:29] the question you really want the answer to is: "For a given certificate X, which relies in some cases on root cert Y (of which there can be multiple overlapping), which clients that are no longer receiving root updates (because they're old/unmaintained) might have Y expire on them sometime before the last of these clients dies a natural death and disappears from the internet for other
[22:50:35] reasons?"
[22:51:08] and you have to run that down, per-root-cert, and per-client-OS/browser/library, for all the cases you care about, for all the roots that could be propping up your end-cert
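On legoktm's question, one hedged way to at least read the "Not After" dates of candidate roots as shipped on a given machine; the file names assume a Debian-style ca-certificates layout and are purely illustrative, and which roots actually anchor the production chains would still need to be confirmed per the caveats above:

```
# Paths assume Debian/Ubuntu ca-certificates naming; adjust for other systems.
for root in /etc/ssl/certs/DigiCert_Global_Root_CA.pem /etc/ssl/certs/ISRG_Root_X1.pem; do
  echo "== $root =="
  openssl x509 -in "$root" -noout -subject -enddate
done
```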
[22:51:58] Gotcha...I was mostly curious because swapping out LE for Digicert to mitigate fallout is just a temporary measure, since eventually Digicert would hit some expiry too
[22:53:00] probably if we had to go down that "temporarily all digicert" road, the followup action would be to put a high priority on bringing another commercial vendor back into the mix alongside Digicert first.
[22:53:14] (which might take a month+, given how these things go)
[22:53:38] * legoktm nods
[22:54:01] for the most part, most of the parties involved (major CAs, major browser/OS/lib vendors) act responsibly and this isn't often an issue (and root cert lifetimes are many many years)
[22:54:52] LE kind of got itself into a bind, as they had to bootstrap quickly using a much older root just to join in the game, and some of the old clients affected didn't completely die off as fast as projected. In the future (well even now, for modern clients) they'll be on their own root that's independently managed by them.
[22:56:06] (the root they started with was not their own, because starting from scratch with a fresh root would've meant waiting *years* longer before anyone could usefully use LE at all)
[22:56:31] Yeah, I remember reading about the cross-sign stuff
[23:00:19] anyways, life intrudes (kids, sports, dinner, etc). I'll send the emails about this either later tonight or tomorrow. Feel free to improve on https://wikitech.wikimedia.org/wiki/HTTPS/Letsencrypt-Root-2021 , anyone :)
[23:22:05] I have prepared two sites with old and new root chains
[23:22:32] old.le.wikilov.es & new.le.wikilov.es
[23:22:43] wanted to see how many clients connected to the former but not the latter
[23:23:11] although certificate caching may fiddle with it
[23:23:54] even though the second doesn't provide the legacy chain, maybe the client will figure it out by itself
[23:24:04] by having used the other one :/
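A small sketch of the kind of client-side comparison Platonides describes, against the two test hostnames given above; whether a given client accepts the new chain depends entirely on its local root store (and, as noted, on any cached intermediates):

```
# Compare how this particular client handles the old vs new Let's Encrypt chains.
for host in old.le.wikilov.es new.le.wikilov.es; do
  if curl -sS -o /dev/null "https://$host/"; then
    echo "$host: handshake OK"
  else
    echo "$host: TLS/connect failure (this client likely lacks the required root)"
  fi
done
```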