[06:37:57] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[06:42:56] (EdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:46:42] 10netops, 10Infrastructure-Foundations, 10SRE: Upgrade Fastnetmon to 1.2.0 - https://phabricator.wikimedia.org/T271228 (10ayounsi) It's back! https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.2.0 :)
[08:11:30] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster
[08:31:42] 10netops, 10Infrastructure-Foundations: Finalise design extentison of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10Aklapper)
[08:40:50] 10netops, 10Infrastructure-Foundations: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10Aklapper)
[08:53:17] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster com...
[09:02:34] 10netops, 10Infrastructure-Foundations: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) p:05Triage→03Medium
[09:16:03] 10netops, 10Infrastructure-Foundations: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) I couldn't find any mention of `accept_ra` in Puppet or cookbooks. Some more digging shows that it might have been added manually in T265607#6547365, but maybe the scri...
[09:40:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster
[10:11:34] 10netops, 10Infrastructure-Foundations, 10SRE: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) p:05Medium→03Low a:03MoritzMuehlenhoff After chatting with Moritz I pushed a manual fix and confirmed that the route was gone after the expiring timer. T...
[10:26:51] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster com...
[11:24:45] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster
[11:30:38] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10ayounsi) 05Open→03Resolved a:03ayounsi All done here!
[11:39:39] 10netops, 10Infrastructure-Foundations, 10SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney)
[12:05:17] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster com...
[13:04:02] Hi traffic, I'm going to remove an LVS service (https://gerrit.wikimedia.org/r/c/operations/puppet/+/770504)
[13:07:17] hi :)
[13:07:21] taking a look
[13:09:24] I can't remember a case quite exactly like this, but I think it should work!
[13:19:30] I did something like that a year ago, moving back to the lvs_setup stage. But in that case I wanted to get rid of the complete service, not just the LVS part of it
[13:41:18] bblack: thanks for the review of the varnish cookbook, what would be the safest way to test it post-merge? is there some test host I could use?
[13:50:01] 10Traffic, 10Performance-Team, 10SRE, 10Performance-Team-publish, 10Sustainability (Incident Followup): Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Krinkle)
[13:56:53] volans: not really, but we can just do some artificial job on a single host for no good reason
[13:57:10] I assume the query/selection stuff we can test with dry-runs, and we just need one host to confirm the action part works
[13:57:13] volans: ^
[13:57:30] yes
[13:57:50] drmrs is probably the least-impactful place to try
[13:58:00] I'll merge and do some dry-runs first and ping back when ready to do something real
[13:58:06] ok
[13:58:29] thanks
[14:00:53] someday, it would be nice if we had a mechanism to reduce the remaining low-impact disruption (which we generally just accept)
[14:01:09] it doesn't cause an SLO failure though, so there's that argument for ignoring it :)
[14:01:41] which part of it? ensuring all in-flight requests are completed after the depool, before the restart?
[14:01:49] the depools stop new connections, but don't actively try to cleanly drain/close the remainder of the open ones, right
[14:02:14] they'll reconnect immediately, but some of them will be mid-flight and cause a failed http transaction, which shows up as small noise in our failure stats
[14:02:41] the timer between depool and restart helps a lot, but it's far from perfect
[14:03:08] indeed
[14:04:24] is there any easy metric/command to run on the host to poll for # of in-flight connections, and consider it safe enough to restart below some threshold?
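[editor's note] For the question above, a minimal sketch of what such a check could look like, assuming the terminator listens on :443 and iproute2's `ss` is available on the host; `count_established` and `poll_host` are hypothetical helper names, not an existing cookbook API:

```python
import subprocess

def count_established(ss_output: str, local_port: int = 443) -> int:
    """Count ESTAB connections on local_port, given plain `ss -Htn` output.

    `ss -Htn` (no state filter) prints: State Recv-Q Send-Q Local:Port Peer:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "ESTAB":
            # rsplit handles both 10.0.0.1:443 and [2620:0:861::1]:443
            if fields[3].rsplit(":", 1)[-1] == str(local_port):
                count += 1
    return count

def poll_host(local_port: int = 443) -> int:
    """Ask the local host directly, letting ss do the filtering.

    With a `state established` filter ss omits the State column,
    so every non-empty row is one established connection.
    """
    out = subprocess.run(
        ["ss", "-Htn", "state", "established", f"sport = :{local_port}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())
```

A cookbook could poll this between depool and restart, and proceed once the count falls below a threshold.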
[14:04:57] yeah, we could attack it that way
[14:05:10] ofc we would have some cli args to bypass this check for emergency restarts, or stuck varnishes where that check will never return
[14:05:14] but tuning that is tricky
[14:05:16] and that
[14:05:30] but we could do a combination approach of some kind
[14:06:01] check connection count before the depool, and wait for either 90% drain or X seconds, whichever comes first, or something
[14:06:24] but... this kind of reminds me of the VLA thinking trap in C99
[14:07:08] at the end of the day, there will commonly be that one connection that lingers forever. If you go for 100%, you'll commonly go the whole X seconds, and if X seconds is acceptable, then just a timeout will do. We clearly don't want an infinite timeout just because of one stuck conn.
[14:07:28] ehehe
[14:08:26] what would help a lot would be some kind of new varnishadm command that asks it to actively drain conns
[14:09:19] there are drain strategies that work well without causing a high risk of failures
[14:09:36] it doesn't have one now afaik
[14:10:08] maybe that's easier to attack on haproxy though, which in practice holds all the connections that matter anyways
[14:11:18] anyways, something for the future
[14:12:45] we also don't need to be perfect here -- in practice, connections get closed all the time, even ones with in-flight requests, and both user-agents and bots will retry, usually transparently
[14:13:15] well, they retry, but there's impact - both in how we measure it for SLOs, and in how it affects users (extra latency / retry)
[14:13:46] it's transparent, but we end up doing lots of restarts every week one way or another, and those are one of the factors in not getting more 9s
[14:13:50] fair :)
[14:14:58] you can make the case that we're not failing our SLO today, so no point worrying about it
[14:15:30] but on another level, 99.9 is not an ideal SLO for the front edge, it's just what we thought was achievable for the first iteration, to record the status quo.
[14:18:15] sometimes perspective is nice for feeling *good* about the state of things, though - it likely used to be much worse before many improvements over the years!
[14:19:07] bblack: btw, we generally ignore these because they're so noisy, but we do track them -- NEL has data about `tcp.reset` and `h2.ping_failed` which would be useful for tracking the impact of restarts over time (if you look at just the relevant time windows)
[14:19:14] there's a lot of background noise there because of NAT devices and the like
[14:19:25] yeah
[14:21:17] ideally, we'd flip a "drain this daemon" switch, and varnish would start tacking "connection: close" onto all further requests it receives (which is the cleanest drain), and then server-close the rest of the idle ones after a certain timeout (cleanly, without a RST). There's still a window where our FIN crosses in-flight with a new request from the idle conn, but the odds are decently low after
[14:21:23] you wait a while.
[15:19:27] 10Traffic, 10SRE, 10ops-codfw: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10herron) p:05Triage→03High
[15:35:47] when you get a chance please take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/775251
[15:41:40] godog: ugh yeah.
we're going to have some related naming problems to clear up later too, like all of our haproxy services being called ats-tls in confd :)
[15:42:24] bblack: yeah :
[15:42:31] ooops, new keyboard :|
[15:42:45] probably we'll also rename the classes back at that point (after the haproxy transition is done), but then we'll remove these prometheus stanzas, too
[15:43:28] a few assumptions around how the whole thing is structured there for sure, I had a glimpse of it when putting the patch together and quickly backed out to only the necessary changes :)
[15:43:59] :)
[15:44:31] but yeah the alert itself was real for sure, in the sense that prometheus was expecting to be able to poll metrics but it wasn't able to
[15:44:48] s/was/is/
[15:45:14] 10Traffic, 10SRE, 10ops-codfw: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10MMandere) 05Open→03Invalid The problem later resolved on Icinga as the check succeeded, after the reimage of the instance was complete.
[16:51:35] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) For esams failover testing: we're planning to attempt this on Thursday. The idea is to merge the outstanding patches and then depool esa...
[18:43:25] FYI I've done some fixes to the varnish roll restart cookbook after the dry-run tests, and the last dry-runs seem correct, I'll ping back tomorrow for some real test
[19:05:06] volans: ack, thanks
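[editor's note] The "wait for either 90% drain or X seconds, whichever comes first" idea discussed in this log could be sketched roughly as follows; this is a hypothetical helper, not the actual cookbook code, and `count_conns` stands in for any callable returning the current open-connection count (e.g. something like the `ss`-based polling mentioned earlier):

```python
import time

def wait_for_drain(count_conns, drain_fraction=0.9, timeout=30.0,
                   interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Block until the open-connection count has dropped by drain_fraction
    of its starting value, or until timeout elapses, whichever comes first.

    count_conns: callable returning the current number of open connections.
    clock/sleep are injectable for testing.
    Returns True if the drain target was reached, False on timeout.
    """
    start = count_conns()
    if start == 0:
        return True
    # e.g. with drain_fraction=0.9, wait until <= 10% of the initial count
    target = start * (1.0 - drain_fraction)
    deadline = clock() + timeout
    while clock() < deadline:
        if count_conns() <= target:
            return True
        sleep(interval)
    # deadline hit: report whether we happened to reach the target anyway
    return count_conns() <= target
```

For the haproxy angle raised above: haproxy's runtime API does have `set server <backend>/<server> state drain`, which stops new sessions to a server while letting existing ones complete, so a future version of this could pair that switch with a wait loop along these lines.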