[06:37:57] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[06:42:56] (EdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[07:46:42] 10netops, 10Infrastructure-Foundations, 10SRE: Upgrade Fastnetmon to 1.2.0 - https://phabricator.wikimedia.org/T271228 (10ayounsi) It's back! https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.2.0 :)
[08:11:30] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster
[08:31:42] 10netops, 10Infrastructure-Foundations: Finalise design extentison of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10Aklapper)
[08:40:50] 10netops, 10Infrastructure-Foundations: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10Aklapper)
[08:53:17] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2032.codfw.wmnet with OS buster com...
[09:02:34] 10netops, 10Infrastructure-Foundations: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) p:05Triage→03Medium
[09:16:03] 10netops, 10Infrastructure-Foundations: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) I couldn't find any mention of `accept_ra` in Puppet or cookbooks. Some more digging shows that it might have been added manually in T265607#6547365, but maybe the scri...
[09:40:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster
[10:11:34] 10netops, 10Infrastructure-Foundations, 10SRE: Ganeti hosts use analytics vlan as v6 getaway - https://phabricator.wikimedia.org/T305034 (10ayounsi) p:05Medium→03Low a:03MoritzMuehlenhoff After chatting with Moritz I pushed a manual fix and confirmed that the route was gone after the expiring timer. T...
[10:26:51] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2030.codfw.wmnet with OS buster com...
[11:24:45] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster
[11:30:38] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 (10ayounsi) 05Open→03Resolved a:03ayounsi All done here!
[11:39:39] 10netops, 10Infrastructure-Foundations, 10SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney)
[12:05:17] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2028.codfw.wmnet with OS buster com...
[13:04:02] Hi traffic, I'm going to remove an LVS service (https://gerrit.wikimedia.org/r/c/operations/puppet/+/770504)
[13:07:17] hi :)
[13:07:21] taking a look
[13:09:24] I can't remember a case quite exactly like this, but I think it should work!
[13:19:30] I did something like that a year ago, moving back to the lvs_setup stage. But in that case I wanted to get rid of the complete service, not just the LVS part of it
[13:41:18] bblack: thanks for the review of the varnish cookbook, what would be the safest way to test it post-merge? is there some test host I could use?
[13:50:01] 10Traffic, 10Performance-Team, 10SRE, 10Performance-Team-publish, 10Sustainability (Incident Followup): Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Krinkle)
[13:56:53] volans: not really, but we can just do some artificial job on a single host for no good reason
[13:57:10] I assume the query/selection stuff we can test with dry-runs, and we just need one host to confirm the action part works
[13:57:13] volans: ^
[13:57:30] yes
[13:57:50] drmrs is probably the least-impactful place to try
[13:58:00] I'll merge and do some dry-runs first and ping back when ready to do something real
[13:58:06] ok
[13:58:29] thanks
[14:00:53] someday, it would be nice if we had a mechanism to reduce the remaining low-impact disruption (which we generally just accept)
[14:01:09] it doesn't cause an SLO failure though, so there's that argument for ignoring it :)
[14:01:41] which part of it? ensuring all in-flight requests are completed after the depool, before the restart?
[14:01:49] the depools stop new connections, but don't actively try to cleanly drain/close the remainder of the open ones, right
[14:02:14] they'll reconnect immediately, but some of them will be mid-flight and cause a failed http transaction, which shows up as small noise in our failure stats
[14:02:41] the timer between depool and restart helps a lot, but it's far from perfect
[14:03:08] indeed
[14:04:24] is there any easy metric/command to run on the host to poll for # of in-flight connections, and consider it safe enough to restart below some threshold?
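[editor's note] For the question above, a minimal sketch of what such a check could look like, assuming the terminator listens on :443 and iproute2's `ss` is available on the host; `count_established` and `poll_host` are hypothetical helper names, not an existing cookbook API:

```python
import subprocess

def count_established(ss_output: str, local_port: int = 443) -> int:
    """Count ESTAB connections on local_port, given plain `ss -Htn` output.

    `ss -Htn` (no state filter) prints: State Recv-Q Send-Q Local:Port Peer:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "ESTAB":
            # rsplit handles both 10.0.0.1:443 and [2620:0:861::1]:443
            if fields[3].rsplit(":", 1)[-1] == str(local_port):
                count += 1
    return count

def poll_host(local_port: int = 443) -> int:
    """Ask the local host directly, letting ss do the filtering.

    With a `state established` filter ss omits the State column,
    so every non-empty row is one established connection.
    """
    out = subprocess.run(
        ["ss", "-Htn", "state", "established", f"sport = :{local_port}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())
```

A cookbook could poll this between depool and restart, and proceed once the count falls below a threshold.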
[14:04:57] yeah, we could attack it that way
[14:05:10] ofc we would have some cli args to bypass this check for emergency restarts, or stuck varnishes where that check will never return
[14:05:14] but tuning that is tricky
[14:05:16] and that
[14:05:30] but we could do a combination approach of some kind
[14:06:01] check connection count before the depool, and wait for either 90% drain or X seconds, whichever comes first, or something
[14:06:24] but... this kind of reminds me of the VLA thinking trap in C99
[14:07:08] at the end of the day, there will commonly be that one connection that lingers forever. If you go for 100%, you'll commonly go the whole X seconds, and if X seconds is acceptable, then just a timeout will do. We clearly don't want an infinite timeout just because of one stuck conn.
[14:07:28] ehehe
[14:08:26] what would help a lot would be some kind of new varnishadm command that asks it to actively drain conns
[14:09:19] there are drain strategies that work well without causing a high risk of failures
[14:09:36] it doesn't have one now afaik
[14:10:08] maybe that's easier to attack on haproxy though, which in practice holds all the connections that matter anyways
[14:11:18] anyways, something for the future
[14:12:45] we also don't need to be perfect here -- in practice, connections get closed all the time, even ones with in-flight requests, and both user-agents and bots will retry, usually transparently
[14:13:15] well, they retry, but there's impact - both in how we measure it for SLOs, and in how it affects users (extra latency / retry)
[14:13:46] it's transparent, but we end up doing lots of restarts every week one way or another, and those are one of the factors in not getting more 9s
[14:13:50] fair :)
[14:14:58] you can make the case that we're not failing our SLO today, so no point worrying about it
[14:15:30] but on another level, 99.9 is not an ideal SLO for the front edge, it's just what we thought was achievable for the first iteration, to record the status quo.
[14:18:15] sometimes perspective is nice for feeling *good* about the state of things, though - it likely used to be much worse before many improvements over the years!
[14:19:07] bblack: btw, we generally ignore these because they're so noisy, but we do track them -- NEL has data about `tcp.reset` and `h2.ping_failed` which would be useful for tracking the impact of restarts over time (if you look at just the relevant time windows)
[14:19:14] there's a lot of background noise there because of NAT devices and the like
[14:19:25] yeah
[14:21:17] ideally, we'd flip a "drain this daemon" switch, and varnish would start tacking "connection: close" onto all further requests it receives (which is the cleanest drain), and then server-close the rest of the idle ones after a certain timeout (cleanly, without a RST). There's still a window where our FIN crosses in-flight with a new request from the idle conn, but the odds are decently low after
[14:21:23] you wait a while.
[15:19:27] 10Traffic, 10SRE, 10ops-codfw: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10herron) p:05Triage→03High
[15:35:47] when you get a chance please take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/775251
[15:41:40] godog: ugh yeah.
we're going to have some related naming problems to clear up later too, like all of our haproxy services being called ats-tls in confd :)
[15:42:24] bblack: yeah :
[15:42:31] ooops, new keyboard :|
[15:42:45] probably we'll also rename the classes back at that point (after the haproxy transition is done), but then we'll remove these prometheus stanzas, too
[15:43:28] a few assumptions around how the whole thing is structured there for sure, I had a glimpse of it when putting the patch together and quickly backed out to only the necessary changes :)
[15:43:59] :)
[15:44:31] but yeah the alert itself was real for sure, in the sense that prometheus was expecting to be able to poll metrics but it wasn't able to
[15:44:48] s/was/is/
[15:45:14] 10Traffic, 10SRE, 10ops-codfw: Degraded RAID on cp2028 - https://phabricator.wikimedia.org/T305047 (10MMandere) 05Open→03Invalid The problem later resolved on Icinga as the check succeeded, after the reimage of the instance was complete.
[16:51:35] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) For esams failover testing: we're planning to attempt this on Thursday. The idea is to merge the outstanding patches and then depool esa...
[18:43:25] FYI I've done some fixes to the varnish roll restart cookbook after the dry-run tests, and the last dry-runs seem correct, I'll ping back tomorrow for some real test
[19:05:06] volans: ack, thanks
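[editor's note] The "wait for either 90% drain or X seconds, whichever comes first" idea discussed in this log could be sketched roughly as follows; this is a hypothetical helper, not the actual cookbook code, and `count_conns` stands in for any callable returning the current open-connection count (e.g. something like the `ss`-based polling mentioned earlier):

```python
import time

def wait_for_drain(count_conns, drain_fraction=0.9, timeout=30.0,
                   interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Block until the open-connection count has dropped by drain_fraction
    of its starting value, or until timeout elapses, whichever comes first.

    count_conns: callable returning the current number of open connections.
    clock/sleep are injectable for testing.
    Returns True if the drain target was reached, False on timeout.
    """
    start = count_conns()
    if start == 0:
        return True
    # e.g. with drain_fraction=0.9, wait until <= 10% of the initial count
    target = start * (1.0 - drain_fraction)
    deadline = clock() + timeout
    while clock() < deadline:
        if count_conns() <= target:
            return True
        sleep(interval)
    # deadline hit: report whether we happened to reach the target anyway
    return count_conns() <= target
```

For the haproxy angle raised above: haproxy's runtime API does have `set server <backend>/<server> state drain`, which stops new sessions to a server while letting existing ones complete, so a future version of this could pair that switch with a wait loop along these lines.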