[11:43:34] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (MoritzMuehlenhoff) >>! In T319067#8314210, @BBlack wrote: > Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our rep...
[11:46:04] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (MoritzMuehlenhoff) >>! In T319067#8314213, @ssingh wrote: > On the Traffic side, the image + cookbook patch is working for us. The only issue being -- and...
[13:34:16] (VarnishTrafficDrop) firing: (2) Varnish traffic in eqiad has dropped 57.15798415498152% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[13:34:56] (HAProxyEdgeTrafficDrop) firing: 38% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:39:16] (VarnishTrafficDrop) resolved: (7) Varnish traffic in drmrs has dropped 56.187107527678776% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[13:39:56] (HAProxyEdgeTrafficDrop) resolved: (5) 49% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:55:50] gehel: if there are per-backend-service timeseries for that, they'll be exported by ATS-BE -- it's the only part of the traffic stack that has any notion of different backing services
[15:18:03] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (dcaro) > But still fairly comfortably within the 10G NIC capacity. What throughput limits were hit? Sorry if I missed them on the dashboard you linked, I d...
[15:52:26] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) > I did some tests in the past and that was more or less the maximum network throughput I got, so I was expecting for that to be the same (thinki...
[17:01:06] gehel: what kind of metrics do you need?
[17:02:02] gehel: right now ATS tracks TTFB per backend service and HTTP method
[17:02:15] it's available as trafficserver_backend_requests_seconds_bucket
[17:08:36] ah I was hoping we had status code in there for those
[17:09:34] I suppose the other way to get at this data indirectly would be through analytics
[17:10:07] (basically look at webrequest status codes for wdqs, only for reqs whose cache status doesn't end up as int or hit, meaning it reached the backend)
[17:22:41] per status code we got it for accumulated TTFB
[17:22:48] trafficserver_backend_requests_seconds_sum[$status][$method][$backend] += $origin_ttfb / 1000.0
[17:23:08] gehel is out now on vacation, but probably inflatador or ryankemper knows more
[17:29:05] vgutierrez: I think what they're trying to find an [existing] data source for, is actual percentages of status codes on a per-response basis. Like an SLI about what percentage of requests to the backend were 2xx.
[17:29:30] if it's being aggregated as ttfb sums rather than req counts, I think we've already lost that in this prom data.
[17:29:35] trafficserver_backend_requests_seconds_count
[17:29:46] oh ok
[17:29:50] that's what they need
[17:30:17] that's split by status, method and backend
[17:30:30] should be perfect for what they're asking then!
[17:32:55] something like https://grafana.wikimedia.org/goto/ND_GJhS4z?orgId=1
[18:02:38] vgutierrez: bblack: yes that's perfect, thanks
[19:31:38] (LVSHighCPU) firing: (2) The host lvs3005:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs3005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[19:33:16] (VarnishTrafficDrop) firing: Varnish traffic in esams has dropped 56.58308674702148% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[19:33:56] (HAProxyEdgeTrafficDrop) firing: 37% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:36:38] (LVSHighCPU) resolved: (18) The host lvs3005:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[19:38:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in esams has dropped 59.723267352576855% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[19:38:56] (HAProxyEdgeTrafficDrop) resolved: (2) 64% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:40:56] (HAProxyEdgeTrafficDrop) firing: (2) 65% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:45:56] (HAProxyEdgeTrafficDrop) resolved: (2) 66% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
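
Note on the 17:29 exchange: the SLI discussed there (what share of backend responses were 2xx) can be derived from request counts split by status, method and backend, which is how trafficserver_backend_requests_seconds_count is described in the log. A minimal sketch of that arithmetic follows. The sample values, the tuple layout and the "wdqs" backend label value are assumptions made up for illustration; none of them is taken from the real metric.

```python
from collections import defaultdict

# Hypothetical samples: (status, method, backend, request count in some window).
# Real values would come from the counter's increase over that window.
SAMPLES = [
    ("200", "GET", "wdqs", 940),
    ("200", "POST", "wdqs", 20),
    ("429", "GET", "wdqs", 30),
    ("503", "GET", "wdqs", 10),
    ("200", "GET", "other", 500),
]


def success_ratio(samples, backend):
    """Return the share of 2xx responses for one backend, or None if it saw no requests."""
    totals = defaultdict(int)
    for status, _method, sample_backend, count in samples:
        if sample_backend != backend:
            continue
        totals["all"] += count
        if status.startswith("2"):
            totals["2xx"] += count
    if totals["all"] == 0:
        return None
    return totals["2xx"] / totals["all"]


if __name__ == "__main__":
    print(success_ratio(SAMPLES, "wdqs"))
```

The per-status TTFB sums mentioned at 17:22 cannot give this ratio on their own, since they accumulate seconds rather than request counts; the counts are what make a per-response percentage possible.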