[05:52:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:57:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:49:18] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) @Andrew I think it's wise to proceed cautiously alright. And I've no objection to us keeping the Ceph host "public" and "cluster" NICs separate...
[09:52:23] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (MoritzMuehlenhoff) >, @MoritzMuehlenhoff wrote: > Currently it doesn't run fully non-interactive yet, there's a dialogue being prompted: //No kernel module...
[09:53:03] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) Open→Resolved a:cmooney Gonna close this one device still showing ok and no alarms for FPC errors. We can re-open if problem happens again. ` cmooney@re0.cr2-esams>...
[10:25:40] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (dcaro) > Perhaps something like cluster re-syncing when nodes are added/remove or in failure etc?
Yep, that is when we hit the throughput limit yes, if...
[10:50:28] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (Volans) Just as an additional datapoint, if you connect to the console and answer the question while the cookbook is running, it will happily continue once...
[12:17:48] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (cmooney) Ok. Yeah there are some larger spikes around that time. Biggest on cloudcephosd1026 on Aug 18th. {F35565980} But still fairly comfortably wit...
[13:22:37] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (BBlack) Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our repo? Or does it have to come straight from debian?
[13:23:09] Traffic, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (ssingh) >>! In T319067#8313691, @MoritzMuehlenhoff wrote: >>, @MoritzMuehlenhoff wrote: >> Currently it doesn't run fully non-interactive yet, there's a di...
[14:54:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 66.26389481038862% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[14:54:56] (HAProxyEdgeTrafficDrop) firing: (3) 24% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[14:59:15] Traffic, SRE, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG)
[14:59:16] (VarnishTrafficDrop) resolved: (7) Varnish traffic in drmrs has dropped 54.39356773455335% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[14:59:56] (HAProxyEdgeTrafficDrop) resolved: (6) 44% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:29:42] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Jclark-ctr) @Vgutierrez Would like to move connections can You depooled Servers so connection can move this afternoon??
[15:37:59] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) @Jclark-ctr we need to do lvs1020 first, and after it's done we can depool lvs1017, but we cannot depool both lvs instances at th...
[16:05:41] Traffic, SRE: Implement SLI measurement for ATS - https://phabricator.wikimedia.org/T316921 (Vgutierrez)
[16:05:55] Traffic, SRE, Patch-For-Review, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) Open→Resolved After partitioning the ATS cache in the whole fleet c...
[16:07:55] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Just FYI I've adjusted one of the links on the row E/F switches now. Quick run-down of process: # Drain link by changing OSPF interface cost both sides: ** `set protocols ospf area 0.0....
[17:24:18] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Ok I've fixed the MTUs for all the underlay / switch to switch links in the new cage now. All that remains on those are the uplink sub-ints to the CRs, which for some reason are at 9174...
[17:26:40] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Jclark-ctr) Moved two connections cableid stayed the same lvs1017 is connected to asw2-b4-eqiad on old port xe-4/0/15 Cableid 4801, New asw...
[17:26:48] Traffic, DC-Ops, SRE, ops-eqiad, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Jclark-ctr) a:Jclark-ctr
[17:28:19] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Jclark-ctr) @Papaul I can make myself available if @Cmjohnson cant
[17:39:56] (HAProxyEdgeTrafficDrop) firing: 53% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:40:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 56.02751868517339% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[17:49:56] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:50:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqsin has dropped 50.46244378646911% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[18:24:21] netops, Infrastructure-Foundations, SRE: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (cmooney) Actually I've discovered something odd on those sub-interfaces between switches and cr's.
Firstly the value I was seeing was the protocol mtu (i.e. payload mtu) as I was looking at the...
[19:15:31] bblack: we're still thinking about those WDQS SLO. We're thinking at percentage of successful (HTTP/200) requests. Looking at Turnilo, this seems somewhat reasonable as a metric.
[19:16:16] Do you know if we collect metrics about HTTP traffic in prometheus? That would be way easier to work with. And this seems like a generic enough problem that there should already be something in place.
[19:17:06] We did some digging into the insane number of time series available in Prometheus. But there are too many to find anything if you don't know exactly what you're looking for!
[21:38:56] (HAProxyEdgeTrafficDrop) firing: 31% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:39:16] (VarnishTrafficDrop) firing: (4) Varnish traffic in eqsin has dropped 32.848534566978095% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:43:56] (HAProxyEdgeTrafficDrop) resolved: (5) 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:44:16] (VarnishTrafficDrop) resolved: (7) Varnish traffic in drmrs has dropped 58.64393795998353% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[23:07:16] (VarnishTrafficDrop) firing: Varnish traffic in esams has dropped 60.67001798436592% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[23:07:56] (HAProxyEdgeTrafficDrop) firing: 40% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[23:12:16] (VarnishTrafficDrop) firing: (4) Varnish traffic in eqsin has dropped 47.7558350771591% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[23:12:56] (HAProxyEdgeTrafficDrop) resolved: (5) 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[23:14:40] ^ those aren't "real", they're just comparing against earlier spikes in a fixed 30 minute interval
[23:17:16] (VarnishTrafficDrop) resolved: (4) Varnish traffic in eqsin has dropped 47.7558350771591% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop