[01:34:56] (HAProxyEdgeTrafficDrop) firing: 62% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:39:56] (HAProxyEdgeTrafficDrop) resolved: 62% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:41:56] (HAProxyEdgeTrafficDrop) firing: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:46:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[02:03:38] (LVSHighCPU) firing: (6) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[02:08:38] (LVSHighCPU) firing: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[02:18:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[05:49:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:54:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:06:09] Traffic, SRE, Patch-For-Review: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (Vgutierrez) Open→Stalled I'm waiting for a while after merging https://gerrit.wikimedia.org/r/831528, next steps aren't feasible in the short term
[08:22:56] (HAProxyEdgeTrafficDrop) firing: 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:56:57] (PyBalBGPUnstable) firing: (4) PyBal BGP sessions on instance lvs2007 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:03:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=927fadc1-f5b2-478f-95ce-98bfc47881a9) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and th...
[09:13:03] Traffic, Data-Persistence, SRE: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez)
[09:13:18] Traffic, Data-Persistence, SRE: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez) p:Triage→Medium
[09:21:26] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:21:42] Traffic, Data-Persistence, SRE: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez)
[10:14:00] Traffic, Data-Persistence, SRE: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez) @MatthewVernon we would like to get your input here. Before tuning Swift's current TLS termination we'd like to know what are your plans regarding it. Is a migration to envoy in...
[11:21:57] (PyBalBGPUnstable) resolved: (4) PyBal BGP sessions on instance lvs2007 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[11:39:06] Traffic, Data-Persistence, SRE: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon) Thanks for asking! No, we don't currently have a move to envoy on our roadmap (I'm afraid there is too much higher-priority stuff there right now), though I'm not opposed to...
[14:40:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) cr1-codfw and cr2-codfw successfully upgraded today. Took a while with the firmware upgrades too, I've added some notes [[https://wikitech.wikimedia.o...
[20:46:29] hey traffic, search team is working on developing an SLI for wdqs uptime, and we could use some guidance from you all as far as the best way to do that
[20:46:51] I tried to lay out some initial questions here: https://phabricator.wikimedia.org/T313751#8234409, could someone take a look when they get a chance?
[20:50:41] ryankemper: nice! just want to make sure you're also talking to rzl as well :)
[20:56:54] cdanis: appreciate it! and out of curiosity, is that because this is somewhat of a service-opsy thing in addition to being traffic-y, or is it about r.zl being oncall this week?
[20:57:21] ryankemper: rzl is driving an SRE-wide OKR about getting SLOs in place :) https://wikitech.wikimedia.org/wiki/SLO
[20:59:11] ah yes, that does seem a wee bit relevant :D
[21:10:30] ryankemper: yes hi! I'll check out the task and maybe book some time to chat if you like
[21:10:33] (thanks cdanis)
[21:18:41] rzl: ack, we've got our search team weekly weds meeting from 8-10 am PT, so if you're available you could drop by tomorrow after the wmde&wmf design research readout if you're free
[21:18:53] otherwise any open block on my calendar works