[06:46:56] (EdgeTrafficDrop) firing: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:01:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:45:57] Traffic, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (ayounsi) p:Triage→High
[07:46:25] FYI - https://phabricator.wikimedia.org/T300703
[07:49:09] XioNoX: btw alertmanager had already reported that as https://phabricator.wikimedia.org/T294896, but I guess it wasn't really visible with those tags?
[07:50:09] oh wow, I forgot we had this
[07:50:13] yeah indeed
[07:51:49] Traffic, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (ayounsi)
[08:54:27] oh lol, now it opened a new task since you closed the old one as a duplicate :D https://phabricator.wikimedia.org/T300709
[09:15:26] Traffic, SRE, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (RhinosF1)
[09:15:31] Traffic, SRE, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (RhinosF1) Can the alert be downtimed so it doesn't go off again?
[09:15:44] XioNoX: ^
[09:16:01] hahaha
[09:17:53] ?
[09:20:13] well, it's funny :)
[09:21:42] Possible?
[09:22:18] * RhinosF1 knows nothing about alertmanager or interface errors apart from they are making tasks
[09:26:13] the phabricator integration for sure needs to be improved
[11:11:27] Traffic, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (fgiunchedi)
[11:50:43] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster
[11:56:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[12:01:24] Traffic, SRE, envoy, serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (Joe) Thanks @rzl this looks like an excellent plan. I would suggest that when we move to 1.18, we might want to start from the `thanos-fe` cluster which would see fixing of a real iss...
[12:14:47] Traffic, SRE, envoy, serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (Vgutierrez) this looks great :) in traffic we're already using 1.18.3 from the envoy-future component, thanks @RLazarus
[12:26:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1087:9331 is unreachable - https://alerts.wikimedia.org
[12:43:58] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1087.eqiad.wmnet with OS buster completed: - cp1087 (**WARN*...
[13:12:49] Traffic, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (fgiunchedi)
[13:38:08] Traffic, SRE, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (RhinosF1)
[13:40:24] Traffic, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (fgiunchedi)
[13:45:06] did we end up depooling lvs1015?
[13:47:57] nope AFAIK
[13:48:38] pybal is up and running in lvs1015 :)
[13:52:16] netops, Infrastructure-Foundations, SRE: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (BBlack) Open→Resolved Just doing some cleanup here, we did end up on the 2B path (replacement LVSes with 3x dual NIC cards) and have the hardware already racked.
[13:54:01] vgutierrez: yeah I see that. I'm not sure what came of the ticket earlier yet. If we're actively having a worrying level of CRC errors though, it could normally warrant depooling it for now.
[13:54:58] on the other other hand, we have a whole replacement server for it that we should be swapping over to as well. And I guess technically it could be the cable rather than the optics, in which case the error may come across to the new server too heh.
[13:55:18] taking a peek at whatever the data is
[13:58:15] yeah the rate so far seems to be quite small
[13:58:44] over the past 6h average, we're losing something like 0.01% of packets
[13:59:00] probably tolerable to carry on normally for now
[14:00:03] (in absolute terms, we're losing ~5pps out of ~50kpps)
[14:14:03] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez)
[15:58:53] Traffic, SRE, envoy, serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (Joe) >>! In T300324#7670904, @Vgutierrez wrote: > this looks great :) in traffic we're already using 1.18.3 from the envoy-future component, thanks @RLazarus I think the question we...
[16:05:19] Traffic, SRE, envoy, serviceops: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (hnowlan) Based on the release notes I think the API gateway will most likely have no issue going straight to 1.21. If there are issues they will most likely be minor enough that we can...
[16:18:25] vgutierrez: hey, not sure if you saw https://phabricator.wikimedia.org/T300366 already, but apparently pooling envoy based nodes in cache_upload is causing some third-party mediawiki installs to fail loading commons images
[16:25:48] HTTPS, SRE, Traffic-Icebox, Upstream: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (Jaxonvilleder) I think our [[ https://www.morningtoncabinetmakers.com.au/ | cabinets ]] does now!
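Note on the 13:58-14:00 messages above: the two loss figures quoted there (roughly 0.01% of packets, and ~5pps out of ~50kpps) can be cross-checked with simple arithmetic. A minimal Python sketch, assuming only the approximate rates quoted in the channel; the function name `loss_percent` is hypothetical and not part of any WMF tooling.

```python
def loss_percent(lost_pps: float, total_pps: float) -> float:
    """Return the share of packets lost, as a percentage of total traffic."""
    if total_pps <= 0:
        raise ValueError("total_pps must be positive")
    return 100.0 * lost_pps / total_pps


if __name__ == "__main__":
    # Approximate figures from the channel: ~5 pps lost out of ~50 kpps.
    print(f"{loss_percent(5, 50_000):.2f}%")
```

With those inputs the result is about 0.01%, so the percentage and the absolute figures quoted in the channel agree with each other.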
[16:53:49] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (TheDJ) Here is another related Envoy ticket about 1.0 support that might be useful: https://github.com/envoyproxy/envoy/issues/170
[17:08:11] Traffic, SRE, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (RhinosF1)
[18:00:58] netops, Infrastructure-Foundations, SRE: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (cmooney) Just to add an update here the main VXLAN/EVPN configuration has been added to the devices, and using some test servers kindly installed by dc-ops I've been able...
[19:11:53] if we use "acme_chief::cert" in puppet, are we expected to normally also use "class { 'sslcert::dhparam': }" along with that? it installs a file /etc/ssl/dhparam.pem and says it's "needed for servers to use with DHE ciphersuites". I noticed in the past I have used that but in some other newer places it's not used.
[19:12:21] or will acme_chief::cert by itself work just fine and have everything needed also for DHE ciphersuites
[19:13:25] (or do we not care about DHE ciphersuites)
[19:16:14] vgutierrez: ^
[19:16:52] we still include that in lots of places so I am going to assume _not_ having it is not optimal, and add it
[19:17:12] We shouldn't care about dhe ciphers anymore
[19:17:34] vgutierrez: ok! also a good answer, then I will NOT add it for gitlab
[19:17:43] But if you're using acme-chief you're terminating TLS as well
[19:17:44] i noticed it is included on gerrit, lists, mirrors and so on
[19:17:53] So you're picking the ciphers
[19:18:15] We still carry it around yeah
[19:18:25] But it's probably legacy at this time
[19:18:32] ok, so if I use acme_chief and only set the puppet_src parameter and nothing else...
[19:18:36] Maybe mx servers still need it
[19:18:38] then it picks the best one for me, right
[19:18:49] ok
[19:18:54] thanks Valentin
[19:19:16] Answered from my bike.. double check the facts lol
[19:22:58] yeah we're supposed to not be using DHE anymore
[19:23:07] Thx Brandon
[19:23:08] ACK! well, I can confirm the cert for existing https://gitlab.wikimedia.org https://www.ssllabs.com/ssltest/analyze.html?d=gitlab.wikimedia.org&latest .running test
[19:23:14] and it does NOT include that extra class
[19:23:19] as opposed to gerrit and others
[19:23:19] there's some lingering bits of support for it here and there, but it should be gone from our puppet ciphersuite configs, unless someone has re-introduced it somewhere
[19:23:53] ok, I will stop using that class in new code. thanks
[19:23:58] basically, there aren't use-cases for which DHE should be useful anymore, from our perspective (but maybe I'm missing something!)
[19:24:30] IIRC Gerrit and hence gitlab should be using the strictest list of ciphersuites
[19:25:03] but you'll still see some odd references to DHE-$foo or specifically DHE-RSA-AES128-SHA in various code in the puppet repo
[19:25:16] but it's like, old stats support and testsuites, and apparently some desynced matching WMCS configs
[19:25:22] this is back from the "beast attack" I think
[19:25:27] look at this gem https://phabricator.wikimedia.org/T83768 :)
[19:25:29] it shouldn't be in the live config of anything
[19:26:13] I don't think we ever cleaned up all the uses of sslcert::dhparam, but yeah in practice I don't think it's useful anymore
[19:26:32] maybe an important corner case to check would be the MXes for their SMTP-over-TLS stuff
[19:26:45] maybe they still use/need DHE for some reason, because the SMTP TLS world is different/worse?
[19:26:50] yep, got it. so some day:
[19:26:51] ~/puppet$ git grep sslcert::dhparam
[19:27:00] that's what I started with..
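Note on the `git grep sslcert::dhparam` idea above: alongside grepping for the class, one could also check whether a given cipher list still names any finite-field DHE suite (such as the DHE-RSA-AES128-SHA mentioned at 19:25:03). A minimal Python sketch, assuming OpenSSL-style colon-separated cipher strings; the function name `dhe_suites` and the sample cipher string are hypothetical, not taken from the puppet repo.

```python
def dhe_suites(cipher_string: str) -> list[str]:
    """Return the finite-field DHE suites named in an OpenSSL-style cipher list.

    ECDHE suites are not matched, because their names start with "ECDHE-".
    Exclusion entries such as "!DHE" are ignored: they remove suites rather
    than enable them.
    """
    entries = [entry.strip() for entry in cipher_string.split(":")]
    return [e for e in entries if e.startswith(("DHE-", "EDH-"))]


if __name__ == "__main__":
    # Hypothetical cipher list, for illustration only.
    sample = (
        "ECDHE-ECDSA-AES128-GCM-SHA256:"
        "ECDHE-RSA-AES128-GCM-SHA256:"
        "DHE-RSA-AES128-SHA"
    )
    print(dhe_suites(sample))
```

A config for which this returns an empty list has no explicitly named DHE suite, which is the situation where, per the discussion, the dhparam file should no longer matter. It does not expand OpenSSL keyword aliases, so it is only a rough first pass.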
[19:27:20] this stuff self-replicates because we copy/paste things for new services
[19:32:47] yeah the risk is - if something still has a DHE cipher config (e.g. maybe those MXes), and we kill the dhparam file first, it will make that DHE even weaker (that specific dhparam was to strengthen DHE if we have to use it)
[19:38:44] ACK, what I take away from this is to not touch MX servers and not worry about this for gitlab and let it fade away when I migrate a service or so :)
[19:57:14] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Dzahn) 19:07 < taavi> fallout from their TLS termination experiments, envoy does not support http 1.0
[19:57:54] ^ is that true, btw?
[20:04:10] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Majavah) >>! In T300366#7672412, @Dzahn wrote: > 19:07 < taavi> fallout from their TLS termination experiments, envoy does not support http 1.0 "TLS termination experiments" re...
[20:05:17] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Dzahn) There are some user reports / IRC chatter and this ticket T300366 that seem like they are related to this.
[20:16:57] mutante: seems within the realm of reason that it's real, yeah
[20:18:22] ACK, thanks. we linked the tickets
[20:19:40] https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#config-core-v3-http1protocoloptions
[20:20:10] there is config for it, but it sounds a little scary
[20:20:40] requires a default host, and admits to being "not fully standards compliant", so I'm not gonna just turn it on randomly :)
[20:20:53] heh, yea, that makes sense
[20:20:54] this is probably something vgutierrez will have to take a deeper look at tomorrow
[20:21:20] will add some comments to that effect I guess! :)
[20:21:34] users of that "xtools" tool are talking about whether they can use "curl instead"
[20:21:42] great :)
[20:22:16] musikanimal: ^ adding your findings to the ticket at T300366 might be helpful
[20:22:16] T300366: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366
[20:22:21] cURL worked for XTools :)
[20:22:22] sure
[20:22:52] nice, everyone here
[20:22:57] and curl works, ok
[20:23:20] curl adds a host header
[20:23:24] Even for 1.0
[20:23:40] Where that header is optional instead of mandatory
[20:24:47] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (MusikAnimal) #xtools was affected by this, since it uses `file_get_contents` to fetch an on-wiki config. In case it helps others, I managed to get around it by changing the code...
[20:25:31] I suppose the main learning is that there are some unknown number of Wikimedia content consumers using the lowest common denominator of HTTP protocol support even if they are keeping up with TLS requirements changes.
[20:25:47] It's pretty weird
[20:26:00] Tlsv1.2 but http1.0 in 2022 :)
[20:26:26] I mean, HTTP/1.0 is old and strange, but it's also sort of the baseline, and it's simpler for some uses.
[20:26:43] I can't think of a good reason to not support it, and there could be some odd integrations here and there that would benefit from having it available.
[20:27:07] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (BBlack) summarizing about the link above: apparently we do have HTTP/1.0 clients, and it does work with our other terminators, but not envoy. Envoy does have s...
[20:27:37] file_get_contents() with an https URL is idiomatic PHP and apparently defaults to HTTP/1.0 unless you do a bunch of tuning with a stream_context_create() generated resource
[20:27:45] if a feature in supported stable mediawiki versions hitting our apis (instantcommons) is using http 1.0, our infrastructure should probably support it too
[20:28:03] on the other hand, I'm sure there are contexts in which I've said in the past something like "It would be awesome if all clients used conformant HTTP/1.1, it would fix all kinds of issues we have to deal with"
[20:28:28] but I don't think that's a realistic wish for our public edge (the sudden disappearance of 1.0 in the world)
[20:29:01] it is certainly not unreasonable to think that HTTP/1.1 is a widely available protocol
[20:29:25] Especially considering our TLS requirements
[20:29:46] I'll take a look tomorrow on the varnish side
[20:30:02] there's a sort of inbetween subset, where if you just take what you were doing as a 1.0 request, add Host: header, and don't close-delimit, you've effectively got basic 1.1 support.
[20:30:18] Cause at some point we should offer default vhosts for those requests
[20:30:24] although, I think even close-delimit is still technically allowed in 1.1?
[20:30:58] (meaning that the content body has no Content-Length and no chunking, and implicitly ends when the connection closes after all data, which is unclean in the sense that you never know if you got a partial transfer)
[20:31:38] if I'm remembering that 1.1 still allows (if discourages) close-delimit, then a host header and changing the number to 1.1 is all a 1.0 client has to do to be barely-conformant, at least for basic GET/HEAD.
[20:32:50] even if that's true, I doubt we can get every such client to suddenly fix it :)
[20:33:49] BTW.. we haven't seen this before cause even if MediaWiki uses envoy our ATS stack only talks http 1.1 with the applayer
[20:34:15] yeah but this is on the other end of things
[20:34:36] mediawiki at some 3rd party host is the client, talking 1.0 to our front edge to use our content
[21:32:02] Traffic, SRE, ops-eqiad: asw2-b-eqiad:xe-2/0/3 interface errors (lvs1015) - https://phabricator.wikimedia.org/T300703 (Zabe)
[22:22:30] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Ahecht) Toolforge is still running PHP 7.3, which defaults to HTTP/1.0, so any PHP toolforge tools that access the API are going to be broken by this unless the...
[22:22:34] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Ahecht) My [[ https://randomincategory.toolforge.org/ | RandomInCategory ]] tool on toolforge was affected by this as well since it's using the standard PHP 7.3 installation on...
[22:26:51] Traffic, SRE, WMF-Legal, Performance-Team (Radar), Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (JBennett) I'm not aware of prior management decisions or risk assessments to prohibi...
[23:07:57] Traffic, SRE: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Dzahn) If other users find their way here, it affects you if you are an HTTP/1.0 client in one way or another. see the summary at T271421#7672538
[23:11:22] Traffic, SRE: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Dzahn)
[23:13:07] Traffic, SRE, WMF-Legal, Performance-Team (Radar), Privacy: Consider disabling Chrome Lite pages for Wikipedia on Chrome on mobile with Cache-Control: no-transform - https://phabricator.wikimedia.org/T218618 (Krinkle) >>! From T298166: > Add the `no-transform` header […] and hope this deals w...
[23:14:27] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Dzahn)
[23:14:31] Traffic, SRE: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (Dzahn)
[23:37:17] Traffic, SRE, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (TheDJ) >>! In T271421#7672538, @BBlack wrote: > Envoy does have some config for this (cf https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/proto...
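Note on the 20:30-20:31 discussion above: the difference between a bare HTTP/1.0 GET and the "barely-conformant" HTTP/1.1 variant described there (add a Host header, change the version number) can be seen by building both request heads side by side. A minimal Python sketch that only constructs the bytes and sends nothing; the function name `build_get_request` and the host `example.org` are hypothetical, and `Connection: close` is added on the 1.1 request as an assumption, to mimic a one-shot 1.0-style client.

```python
from typing import Optional


def build_get_request(path: str, host: Optional[str] = None, version: str = "1.0") -> bytes:
    """Build the raw bytes of a minimal GET request head.

    In HTTP/1.0 the Host header is optional; in HTTP/1.1 it is mandatory,
    so a 1.1 request without a host is rejected here.
    """
    if version == "1.1" and not host:
        raise ValueError("HTTP/1.1 requests must carry a Host header")
    lines = [f"GET {path} HTTP/{version}"]
    if host:
        lines.append(f"Host: {host}")
    if version == "1.1":
        # Ask the server to close after the response, as a 1.0-style client would expect.
        lines.append("Connection: close")
    # A blank line terminates the request head.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    print(build_get_request("/wiki/Main_Page"))
    print(build_get_request("/wiki/Main_Page", host="example.org", version="1.1"))
```

The first request carries no Host header at all, which is why a terminator needs some notion of a default host to serve it; curl avoids that case by sending Host even on 1.0 requests, as noted at 20:23:20.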