[06:44:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:49:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:53:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:58:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:55:11] netops, Infrastructure-Foundations, serviceops: Kubernetes1018's eth negotiated speed is 10MB/s - https://phabricator.wikimedia.org/T296369 (elukey)
[08:20:15] <_joe_> is anyone around? I have a problem with trafficserver I don't understand.
[08:20:43] <_joe_> oh sigh wait, now I do
[08:22:24] <_joe_> uhm no, still seeing the problem
[08:22:43] <_joe_> so I changed the mapping for search.wikimedia.org to point to the new apple-search service
[08:22:53] <_joe_> now I ran puppet on a single server in codfw
[08:23:05] <_joe_> and then made the following curl request
[08:23:34] <_joe_> curl --resolve search.wikimedia.org:3128:$(dig +short cp2027.codfw.wmnet) http://search.wikimedia.org:3128/?search=test
[08:23:49] <_joe_> I still see the response coming from the mw servers
[08:24:01] <_joe_> what did I do wrong? vgutierrez / ema ?
[08:25:35] I'll check it in a few seconds
[08:37:49] * vgutierrez looking
[08:38:25] you got puppet disabled and ran it on cp2027?
[08:39:58] my bad
[08:40:00] Nov 24 08:35:31 cp2027 trafficserver[1111]: [Nov 24 08:35:31.295] [ET_NET 42] WARNING: SNI (search.wikimedia.org) not in certificate. Action=Terminate server=apple-search.discovery.wmnet(10.2.1.68)[Nov 24 08:35:31.295] [ET_NET 42] ERROR: SSL connection failed for 'search.wikimedia.org': error:1416F086:SSL routines:tls_process_server_certificate:certificate ve
[08:40:25] _joe_: the cert on apple-search.discovery.wmnet needs to have search.wm.o on the SAN list
[08:40:59] <_joe_> vgutierrez: sigh, that's one thing I didn't do and didn't check. My bad
[08:41:12] <_joe_> vgutierrez: can we depool cp2027 while I fix this?
[08:41:19] <_joe_> I'd rather not revert
[08:41:19] sure
[08:41:30] <_joe_> ok
[08:41:56] codfw/cache_text/ats-be/cp2027.codfw.wmnet: pooled changed yes => no
[08:41:57] done
[08:43:38] <_joe_> tbh I am not sure it makes sense for ATS to behave this way
[08:43:59] <_joe_> I give you a map, you should verify the cert on the backend is valid for the host I gave you as destination
[08:52:52] but ATS needs to guarantee that's connecting you to a valid search.wm.o backend server
[09:08:38] hey, what's up?
[09:10:01] _joe_: so is the issue that the Host value is not in SAN?
[09:10:11] <_joe_> ema: yes
[09:10:13] <_joe_> solved :)
[09:10:16] ack :)
[09:10:27] <_joe_> vgutierrez: meh, I'm not sure I agree with the logic
[09:10:33] yeah I agree that it's arguable whether ATS logic makes sense in that regard
[09:10:44] <_joe_> ats should verify the identity of the backend you're pointing it to
[09:10:58] <_joe_> you take care of being sure that backend is responding what you want
[09:11:17] <_joe_> it's a misunderstanding of what the identity verification stands for
[09:11:32] from one point of view you can see it like this: the identity of the backend is verified by the fact that it does have a certificate for a domain you own
[09:12:25] but I'll stop being a devil's advocate and just agree with you that it would make more sense to verify that the origin hostname is in SAN instead
[09:14:41] OTOH it is (was in the 90s?) a common scenario to use ATS for request routing to multiple backends, which may not even have a DNS entry altogether and just be accessed round-robin based on IP
[09:15:34] in that case you don't want to verify anything when it comes to hostnames but you do want to make sure that a random evil organization does not MITM you to steal all your private data
[09:35:49] <_joe_> ... which can be verified using IP-based SANs :P
[09:52:12] is puppet re-enabled on text?
[10:16:11] vgutierrez: it is
[10:16:25] cool
[13:57:46] FWIW, I like having the SAN/SNI stuff stay consistent throughout the stack as we traverse via HTTPS
[13:58:22] it's the cert-level equivalent of the problem of changing the Host: header as you traverse the stack.
[13:58:59] (or in other words - everything's conceptually simpler if the exact same client can make connections in the same way, whether they're connecting to the outermost edge, or connecting directly somewhere in the middle)
[14:01:27] or in other words: if at some future time, some other internal service wants to connect to https://search.wikimedia.org/ , but wants to do it "directly" instead of looping out through the cache, we should be able to make that with L3 routing and/or split-horizon DNS towards the internal IPs, but the client shouldn't have to know that it needs to SAN-match some other hostname in discovery.wmnet
[14:01:33] as well, which differs from the HTTP-level Host: value it will send. (none of this may apply to this particular case, but as a general rule I think it makes some things simpler)
[15:59:53] netops, Infrastructure-Foundations, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero)
[16:00:29] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) Moved debate into {T296411}
[16:41:39] netops, Infrastructure-Foundations, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (cmooney) Thanks @arturo, I think that sums up the options we discussed. A...
[17:07:54] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (ayounsi)
[17:20:21] vgutierrez: can I edit text_haproxy.yaml ?:)
[17:20:43] cleanup change https://gerrit.wikimedia.org/r/c/operations/puppet/+/740907
[17:21:09] Sure
[17:21:44] :) thx
[17:22:20] Now I'm wondering if search.wm.o needs the same
[17:22:20] merged on master. just letting puppet do its normal thing
[17:23:00] what happened is I had grepped through the repo just before the start of text_haproxy vs text
[17:23:22] or you copied it right before I had removed from text.. or something
[18:22:43] remap rules are on the trafficserver::backend hiera file not on the role one
[23:28:44] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[23:57:23] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[23:58:37] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)
[23:58:46] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (RobH)