[01:34:39] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm) >>! In T307610#7918787, @Mitar wrote: >> Because our edge traffic code enforces a stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity)...
[01:35:54] Traffic, RESTBase-API, SRE, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Legoktm)
[06:42:54] Traffic, RESTBase-API, SRE, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Mitar) > Most of that is controlled by the SRE team at a level in front of the REST API, since the frontend caching layer is a shared resource across everything....
[07:22:34] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye
[08:01:03] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4004.ulsfo.wmnet with OS bullseye completed: - ganeti4004 (**PASS**) - Down...
[08:01:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:06:56] (HAProxyEdgeTrafficDrop) resolved: (2) 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:09:22] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have now deployed the change to double the number of replica pods for eventgate-an...
[09:13:50] ^ btullis I'm starting to feel sorry about opening that task /o\
[09:16:52] vgutierrez: Hah, no worries. :-) It's the kind of thing that once you look at it, it doesn't feel right to sweep under the carpet. I'm determined to find out what's causing it (one day).
[09:41:17] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7916593, @cmooney wrote: >> If there is any kind of anycast with the k8s prefixes (same prefix adverti...
[10:24:51] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (akosiaris) >>! In T306181#7920083, @BTullis wrote: > I have now deployed the change to double...
[12:15:17] vgutierrez bblack please take a look at https://gerrit.wikimedia.org/r/c/operations/alerts/+/789575 (and https://gerrit.wikimedia.org/r/c/operations/puppet/+/790671 but that's just a followup) thank you !
[12:47:03] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) > Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from...
[12:55:24] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have stumbled upon this issue with HAProxy, which seems to fit some of the symptom...
[13:45:01] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) Beginning with HAProxy 2.1 HTX is the only way to go. On another issue (https://g...
[13:46:35] btullis: ^^ could you provide a valid payload that I could submit to the intake-analytics endpoint?
[13:48:16] even though, with my understanding of ?hasty=true, anything should return a 202
[13:51:24] vgutierrez: Yes, I will work on getting that for you. I've got lots of packet captures with valid content, but they're from after haproxy, varnish, ats, envoy have all messed with the headers etc. I'll try to get something like a canary event that we can submit.
[13:51:53] so... curl -d "{}" "https://intake-analytics.wikimedia.org/v1/events?hasty=true" gives a 202
[13:52:21] as soon as I set transfer-encoding: chunked I got a 400 from eventgate
[13:52:44] Hmm. Very interesting.
[13:52:50] even with --http1.1
[13:53:10] chunked TE isn't supported on HTTP/2
[13:54:47] https://github.com/curl/curl/commit/d4c5a917226ad6d5bee1b1d6deb099e1d9a895e6
[13:54:51] hmm that's... interesting
[13:55:06] * vgutierrez running curl 7.74.0 (debian bullseye)
[13:59:51] btullis: funny, curl -d "{}" -H transfer-encoding:chunked "https://intake-analytics.wikimedia.org/v1/events?hasty=true&vgutierrez=1" results in HAProxy dechunking the data and adding a transfer-encoding:chunked header for varnish
[14:00:55] vgutierrez: I'm heading into a meeting now, but will check back again afterwards and ask about a test payload.
[14:01:02] ack, thanks
[14:03:49] BTW, I just tested against https://eventgate-analytics-external.discovery.wmnet:4692/v1/events?hasty=1 and it looks like it's able to handle a transfer-encoding:chunked request generated by curl
[14:05:28] but if I do it via ats-be I get a 400
[15:04:47] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) > Even in the legacy setup (pre row e/f) adding new nodes requires manual error-prone gerrit changes like this one 35b0...
[15:21:02] vgutierrez: I'm using the following
[15:21:06] https://www.irccloud.com/pastebin/QMYU4KXk/
[15:21:38] `curl -v -H 'Content-Type: text/plain' -d@test2.json 'https://intake-analytics.wikimedia.org/v1/events?hasty=true&btullis=1' | jq .`
[15:22:30] Same as you, when I add this to the curl command I get a 400 `-H "transfer-encoding: chunked"`
[15:23:53] yep... in any case that's not the issue that we're seeing
[15:23:58] those return immediately
[15:24:12] and don't trigger a 5XX on any layer
[15:26:56] OK, thanks anyway. We're still seeing an unusually high number of POST requests without bodies, which I still can't explain.
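(For reference, the two curl invocations discussed above can be lined up side by side. This is only a sketch reconstructed from the commands quoted in the conversation; the -s/-o/-w options are added here purely to print the status code, and the 202/400 results are the ones reported above, not a guarantee of what the endpoint returns today.)

# Plain POST with a small JSON body: reported above as returning 202.
curl -s -o /dev/null -w '%{http_code}\n' -d '{}' \
  'https://intake-analytics.wikimedia.org/v1/events?hasty=true'

# Same request with an explicit chunked transfer-encoding, forcing HTTP/1.1
# (chunked TE does not exist in HTTP/2): reported above as a 400 from eventgate.
curl -s -o /dev/null -w '%{http_code}\n' -d '{}' --http1.1 \
  -H 'Transfer-Encoding: chunked' \
  'https://intake-analytics.wikimedia.org/v1/events?hasty=true'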
[15:28:18] btullis: hmmm but I assume those will trigger a 400 on your side, right?
[15:31:16] I'm not saying that we shouldn't fix that, just that those requests aren't the same ones that are triggering varnish fetch errors
[15:31:23] Right, eventgate produces 400s. I've seen them by capturing plaintext traffic between the envoy-tls container and the eventgate app and following the HTTP streams in Wireshark.
[15:31:54] btullis: can you confirm that -H transfer-encoding:chunked triggers that?
[15:32:01] But varnish never sees any 400s and I don't know why. Varnish is seeing 503s.
[15:32:17] hmmm varnish sees 400s for sure
[15:32:26] cause we got those 400s on turnilo.wm.o
[15:32:40] and varnish is the layer that sends that data
[15:33:09] Oh right, I didn't see any 400s here, but maybe I've got my search wrong: https://logstash.wikimedia.org/goto/d33422be4372dd2116d88c5f323ec77e
[15:34:10] right, cause that's the varnishfetcherr log feed
[15:34:35] by definition those are 503s
[15:35:01] in that log you will only find requests without a response from the backend (from varnish PoV)
[15:35:13] so that's translated as a 503 to the user
[15:49:15] I hope that helps with the 400s mystery 😅
[15:49:41] OK, thanks for that explanation. So that explains why we don't see the 400s - No closer to finding the cause of the 503s though, right? :-)
[15:50:51] indeed
[16:12:17] did we ever understand why e.g. _info GETs at robots.txt are so high sometimes?
[16:14:44] Hi traffic, we'd like to deploy this change to the wikireplicas hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
[16:14:44] It should only move some mysql replica query traffic, without interrupting anything, but figured we should communicate before we touch lvs1019.eqiad.wmnet / lvs1020 :)
[17:30:26] bblack: not sure if you are the right person to ping about ^ but razzi wants to get that moving. IIUC it should be relatively safe for him to do that now?
[18:02:18] I'm going to go ahead and merge that, following the instructions on https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service. I'll start with the inactive host (lvs1020) for now
[18:07:37] +1
[18:12:30] inactive went fine, going to go for the active host pybal restart (lvs1019)
[19:03:46] Traffic: Restarting pybal caused icinga error - https://phabricator.wikimedia.org/T308174 (razzi)
[19:05:19] Update: from the lvs1020 perspective pybal restarted fine, but I noticed a new alert, so I'll wait to hear back from traffic before proceeding
[19:54:43] razzi: there shouldn't have been any need for restarts for that change
[19:54:51] the restarts are for defining new services (or removing them)
[19:57:50] I guess "service change" is not a very clear way to state that
[19:59:57] Traffic: Restarting pybal caused icinga error - https://phabricator.wikimedia.org/T308174 (BBlack) So, the icinga check in question was already in a bad state before the pybal restart. This check activates on any BGP session issue for the whole router (cr2-eqiad), and there are already other ongoing session...
[20:00:08] ^ and the icinga alert is no big deal, the restart went fine
[20:00:21] (although it's very easy to see how that's also not very clear!)
:)
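(For context, the "inactive host first, then active host" sequence described above looks roughly like the sketch below. It assumes pybal runs as a standard systemd unit on the LVS hosts; the wikitech page linked above, LVS#Deploy_a_change_to_an_existing_service, is the authoritative runbook, and per bblack's note a pybal restart is only needed when services are added or removed, not for a change like this one.)

# Inactive/backup LVS host first: restart pybal and check it comes back cleanly.
ssh lvs1020.eqiad.wmnet 'sudo systemctl restart pybal.service'
ssh lvs1020.eqiad.wmnet 'sudo journalctl -u pybal --since "-10 min" --no-pager'   # look for BGP/IPVS errors
ssh lvs1020.eqiad.wmnet 'sudo ipvsadm -L -n | head -40'                           # services and realservers repopulated?

# Only once the inactive host looks healthy, repeat on the active host.
ssh lvs1019.eqiad.wmnet 'sudo systemctl restart pybal.service'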
[21:13:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3052:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[21:23:00] fyi: firmware flashing esams cp hosts via T243167, should have joined here before i started heh
[21:23:01] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[21:23:12] i want to close out the ancient task.
[21:23:57] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp3052:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[21:44:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3054:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[21:54:57] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp3054:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:15:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3056:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:20:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3056:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:25:57] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp3056:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:38:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3059:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:43:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3058:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:48:57] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp3058:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown