[09:19:32] <_joe_> vgutierrez: so I wanted to add v3 support in puppet to the envoy service proxy configuration
[09:20:25] righjt
[09:20:26] *right
[09:20:34] <_joe_> I don't see an explicit feature flag in the tls terminator part though
[09:20:52] <_joe_> I was thinking of adding a parameter to profile::envoy
[09:21:01] <_joe_> switching which version of the config it would use
[09:21:08] <_joe_> how did you implement it?
[09:21:23] <_joe_> just based on the use of Tlsconfig or TlsconfigV3 ?
[09:21:55] indeed
[09:29:42] <_joe_> uhm. I might tweak that a bit, I'll send patches your way
[09:29:54] <_joe_> your use of envoy is via profile::cache::envoy, correct?
[09:33:54] yep
[09:45:00] <_joe_> vgutierrez: specifically, I don't see a real difference between Tlsconfig and TlsconfigV3, so the check purely depends on the upstream class declaring a hash as v2 or v3
[09:45:16] <_joe_> which seems suboptimal to me, tbh
[09:47:21] <_joe_> so I'm making that an explicit parameter, also because when we have more listeners / clusters than just tls termination, we need to ensure the api version is the same
[09:47:57] ack
[10:03:06] <_joe_> vgutierrez: still no text nodes with envoy?
[10:03:14] that's right
[12:21:08] can I pass more than one host to sre.hosts.reimage ?
[12:21:11] I don't see that on the doc
[12:27:26] marostegui: I don't think so, the code seems not to handle multiple hosts (parser.add_argument('host', help='Short hostname of the host to be reimaged, not FQDN'), no nargs='*' or similar)
[12:28:11] right, thanks dcaro :)
[12:28:27] I recall we used to have a script for multiple hosts, but I think that was a long time ago
[12:28:34] Anyways, thanks!
[12:28:41] 👍
[13:26:40] marostegui: a shell loop to fire off a bunch of reimages, then. WC*P*GOW? :)
[13:26:55] WCPGW, even, other than my typing
[13:27:13] Emperor: I have no idea what those two things mean, but yeah, that's basically what I will do :)
[13:46:47] What Could Possibly Go Wrong
[13:47:05] (one day I will learn to explain my acronyms. Today is not that day. Tomorrow's looking doubtful too)
[13:47:15] haha
[13:50:33] as penance, I have taught the slack abbrev bot about WCPGW :)
[14:45:08] we have an open SRE session slot this coming Monday (Jan 10th) - do we have any takers?
[14:50:22] o/ I have something regarding a proposal to use corosync/pacemaker within the analytics realm. I've already shared it with my team, but it might be good to present it and get wider SRE feedback as to whether it's worth taking forward. Happy to cede to someone else though, if there is anything more pressing.
[14:51:40] <_joe_> uhm I would be interested, also is there a task about this?
[14:52:28] <_joe_> I'm usually not a big fan of HA using corosync/pacemaker, in particular without hardware STONITH
[14:52:40] <_joe_> but it's sometimes the only option indeed
[14:56:42] _joe_: Sure. The original task where I made the suggestion was here: https://phabricator.wikimedia.org/T287967
[14:56:42] The use case for this was presto-server failover.
[14:57:25] More recently, I thought that it might be suitable for a MariaDB use-case: https://phabricator.wikimedia.org/T284150
[14:57:47] ...so I built a demonstration in WMCS.
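
An aside on the sre.hosts.reimage limitation discussed above: the cookbook's parser takes exactly one positional host. Below is a minimal sketch of what multi-host support could look like using argparse's nargs, assuming the rest of the cookbook simply loops over the result; this is illustrative only, not the actual cookbook code, and the hostnames are made up.

    # Hypothetical sketch, not the real sre.hosts.reimage cookbook: it only shows
    # how argparse's nargs could accept several short hostnames instead of one.
    import argparse

    parser = argparse.ArgumentParser(description="multi-host reimage sketch")
    parser.add_argument(
        "host",
        nargs="+",  # one or more hostnames; the real cookbook has no nargs at all
        help="Short hostname(s) of the host(s) to be reimaged, not FQDN",
    )
    args = parser.parse_args(["db1001", "db1002"])  # example invocation
    for host in args.host:
        # each host would be handed to the existing single-host reimage logic
        print(f"would reimage {host}")

Until something like that exists, the shell loop Emperor suggests amounts to the same thing done outside the cookbook.
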
[14:57:49] <_joe_> heh ok mariadb is explicitly a case where I'd be quite decidedly against using it
[14:58:09] <_joe_> because of my experience, but I'd let the data persistence folks comment
[14:58:53] If you would like to look at the presentation it is here: https://docs.google.com/presentation/d/1EVptdaP68LiAx92k7qLb0m_-iqPGt1kzJwKYrkmyq5c/edit#slide=id.g15105b408d_0_287
[15:01:04] Bear in mind that this is completely separate from the Mediawiki MariaDB databases and servers. It's only those in use by analytics services like Hive, Presto, Airflow, Superset, Oozie etc.
[15:01:42] <_joe_> btullis: to be clear, I think that corosync is a good option for applications that don't have more tailored failover tools (as is the case with mariadb/mysql for instance) and in general I would always evaluate the need for uptime (so: your SLOs), the probability of failure, and the cost of a manually-triggered failover automation (via a cookbook)
[15:02:06] <_joe_> vs the added complexity and failure scenarios you add by putting pacemaker in the middle
[15:03:20] <_joe_> btullis: for mysql specifically, what happened at github is quite emblematic of what I mean - https://github.blog/2012-09-14-github-availability-this-week/
[15:04:50] <_joe_> again, I'm not saying "hell no", I'm saying "think about it twice - is it worth the hassle?"
[15:05:56] <_joe_> but: for stateless services, there are better options; for mysql, I'd talk to the data persistence team - maybe you want to experiment with using orchestrator for managing failovers!
[15:08:11] _joe_: Understood. I'm not advocating for shared storage (as in the GitHub MySQL/DRBD example) - nor am I advocating for an "all in" approach. I'm just testing the waters in terms of making corosync/pacemaker available to us as a framework that we /can/ use, starting with relatively simple scenarios. From my own experience it's great, both for stateless services and where STONITH is required. I have built systems with it on Debian for many years.
[15:10:10] btullis: not sure if you're aware, but we're using keepalived (which I understand is quite similar) for some things on the cloud services infra
[15:10:38] _joe_: For MariaDB, rest assured that I did talk to the data persistence team before heading down this avenue at all. We discussed running our own orchestrator instance for modifying the replication topology, but nothing was available in terms of HA.
[15:12:57] taavi: Thanks. Yes I happened upon this page: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived#Neutron_configuration - I requested a similar Neutron setup for the VIP in WMCS.
[15:13:14] <_joe_> btullis: yeah sorry, I said orchestrator, I meant proxysql
[15:17:40] _joe_: There are lots of tools. :-) I've also run HA proxysql with read/write query routing, which is nice. And Galera, which is another option.
[15:18:18] Anyway, I thought it was worth discussing, but I'm not expecting an easy ride.
[15:35:35] We ran Pacemaker + Postgres at my former company as the primary HA story. We definitely hit our fair share of sharp corners which caused outages. But, overall I think we benefitted from the setup. Pacemaker can be great when it works correctly, but understanding why it made a decision you did not want can be quite difficult.
[15:36:37] And keeping your stonith config operational requires careful monitoring and testing
[15:36:52] <_joe_> jhathaway: more or less my experience, yes.
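
_joe_'s criteria above (SLOs, probability of failure, cost of a cookbook-driven failover vs. the complexity pacemaker adds) boil down to comparing expected downtime. Here is a toy back-of-envelope version of that comparison; every number is invented purely to show the shape of the trade-off, not drawn from any real system.

    # Toy comparison of expected yearly downtime for a cookbook-driven failover
    # versus an automated pacemaker failover that occasionally misfires.
    # All figures below are assumptions made up for illustration.
    HOURS_PER_YEAR = 24 * 365

    failures_per_year = 1.0           # assumed primary failures per year
    manual_mttr_h = 0.5               # page a human, run the failover cookbook
    auto_mttr_h = 0.02                # pacemaker fails over in about a minute
    cluster_incidents_per_year = 0.5  # assumed pacemaker-induced incidents
    cluster_incident_h = 1.0          # assumed cost of each such incident

    manual_downtime = failures_per_year * manual_mttr_h
    auto_downtime = (failures_per_year * auto_mttr_h
                     + cluster_incidents_per_year * cluster_incident_h)

    for name, downtime in (("cookbook", manual_downtime),
                           ("pacemaker", auto_downtime)):
        availability = 1 - downtime / HOURS_PER_YEAR
        print(f"{name}: ~{downtime:.2f} h/yr downtime, ~{availability:.5%} available")

With these made-up numbers pacemaker actually comes out slightly worse, which is the kind of outcome _joe_ is warning about; the conclusion flips if failures are frequent or the manual response is slow.
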
[15:42:53] <_joe_> there are two levels of my doubts, and I think the more immediately relevant is that I'm not sure it's worth it given the failure rate / uptime expectations we have on these systems
[15:43:37] <_joe_> but there is a deeper point, and forgive me for bblacking for a second - I think shared-resources HA is the wrong way of doing HA for almost everything
[15:44:15] shared-resources being the pacemaker cluster itself, or something else, e.g. DRBD?
[15:44:22] <_joe_> both really
[15:44:41] we never ran clusters with shared storage
[15:44:41] <_joe_> let me give you a simple example: LVS HA
[15:45:07] Similarly, I'm not proposing shared storage here.
[15:45:57] <_joe_> so, a typical way to put your load balancers in HA is
[15:46:03] <_joe_> pacemaker to manage a floating IP
[15:46:49] <_joe_> that's quite inferior to the solution used by pybal or google's seesaw, where you use bgp with priorities to have almost-flawless HA
[15:47:00] <_joe_> it's simpler, it's more stable, it gives you less downtime
[15:47:35] <_joe_> so for anything that *could* run in multiple copies (e.g. it's stateless) I tend to prefer using bgp announcements for managing paths to the service if we want active/passive
[15:48:39] <_joe_> for anything that needs to run strictly one copy across a full cluster, we have kubernetes :)
[15:48:55] <_joe_> but this ofc doesn't hold up for e.g. databases
[15:49:31] <_joe_> where the other objection still holds
[15:53:48] _joe_: how does pybal deal with TCP connection state?
[15:55:24] <_joe_> that can mean many things, but if I got it right, we do LVS-DR, which means that incoming connections go through the LVS server, but responses go from the backend back to the client directly
[15:55:28] New hire here, can anyone add me to ops@lists.wikimedia.org ? email is bking@wikimeda.org
[15:55:47] err, bking@wikimedia.org
[15:58:00] <_joe_> jhathaway: typically pybal stops announcing bgp when the server crashes, or we turn pybal off. In both cases, LVS rules are still there, so if the server didn't crash it keeps handling the connections that are established to it unless the routers kill them
[15:59:00] interesting, I was wondering about failover; we used conntrackd to sync tcp state so we could fail over between nodes without dropping inflight connections
[15:59:46] _joe_: We don't currently have access to any LVS servers in the analytics vlan. If we did I would happily use them. There has been discussion about creating some LVS servers for our use and/or removing the analytics vlan altogether.
[16:00:04] <_joe_> btullis: sigh yes now i remember
[16:00:35] <_joe_> btullis: we use bird for bgp based ha elsewhere though
[16:02:53] inflatador: I was just added to that list yesterday, but I am not sure how I was added :(
[16:04:21] jhathaway cool, I'm on the search team, can bug my boss about it ;). Mainly wanted to see if we could/should use XFS instead of ext4 on secondary disks ( https://phabricator.wikimedia.org/T298570 )
[16:06:35] jhathaway: yeah, we just drop inflight connections :)
[16:07:14] Amir1: nicely done, re: query timeouts on wikitech. Very well written and addressing user needs.
[16:07:20] <_joe_> cdanis: do we, if we just kill pybal? LVS stays up and I don't think it loses all connections immediately
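
A toy illustration of the BGP-based pattern _joe_ describes above for pybal: announce the service route only while health checks pass, withdraw it when they fail, and let the routers move traffic to the other balancer. The announce_route/withdraw_route/backends_healthy names are hypothetical placeholders, not pybal's or bird's actual API.

    # Hypothetical sketch of announce-on-healthy / withdraw-on-failure HA.
    # None of these functions correspond to real pybal or bird interfaces.
    import time

    def backends_healthy() -> bool:
        # stand-in for real health checks against the backend servers
        return True

    def announce_route(prefix: str) -> None:
        print(f"announcing {prefix} via BGP")

    def withdraw_route(prefix: str) -> None:
        print(f"withdrawing {prefix}")

    def run(prefix: str = "198.51.100.1/32", interval: float = 5.0) -> None:
        announced = False
        while True:
            healthy = backends_healthy()
            if healthy and not announced:
                announce_route(prefix)
                announced = True
            elif not healthy and announced:
                withdraw_route(prefix)
                announced = False
            time.sleep(interval)
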
[16:07:48] cdanis: thanks for confirmation, metallb, which is a network load balancer for kubernetes, suffers the same woes
[16:07:57] _joe_: if we kill pybal, then we failover to the backup LVS
[16:08:02] which then doesn't have state :)
[16:08:40] <_joe_> cdanis: not for responses to, say, an HTTP request that has already been sent to the backend, is my point
[16:09:20] <_joe_> but sure, then if the client is actively sending a request, that gets killed
[16:09:37] _joe_: wouldn't the ACK / FIN / FINACK still need to flow correctly?
[16:09:38] <_joe_> we don't consider that to be a problem as *almost everything* has retry logic
[16:10:13] but yes
[16:10:21] including most modern web browsers
[16:10:25] <_joe_> cdanis: I'm not sure how the linux kernel reacts to a packet for an established tcp connection that it doesn't know
[16:10:38] Probably RST
[16:10:39] It'll just send an RST
[16:10:41] however it does, it certainly won't make it to the correct realserver, is my point
[16:10:42] <_joe_> but I would expect it to send an RST
[16:10:51] <_joe_> oh sure
[16:11:12] <_joe_> but if your client can't survive a broken tcp connection, you should really fix your client
[16:12:12] <_joe_> because this isn't, by far, the easiest way in which a tcp connection will fail/break
[16:12:13] yeah, and with all the NATs, wireless roaming etc. out there clients are generally built to recover very quickly from these things
[16:12:19] https://logstash.wikimedia.org/goto/8525c7f722c0e9abccb9d0e7b2f24a15
[16:12:23] https://sal.toolforge.org/production?p=0&q=pybal&d=
[16:12:28] Krinkle: thanks. It was heavily influenced by discussions we had on this topic!
[16:12:33] I agree that LVS sends RST
[16:12:36] very apparent :)
[16:13:14] but yes, I agree this also isn't a big deal
[16:13:24] Chrome will auto-retry the first time or two without showing an error page, even
[16:13:26] <_joe_> cdanis: oh yes at the edge ofc it's even more evident
[16:13:44] I'll do more work on this front later. Now we can check slow queries in logstash and fix the rest (either timeout or fix the query)
[16:13:58] <_joe_> cdanis: I was mostly talking about internal-lvs, where I expect the impact to be smaller
[16:14:48] yes
[16:16:19] I have also used conntrackd in the past to sync TCP state between LVS nodes, as jhathaway mentioned.
[16:18:54] I just used the NEL data as an easy way to find out what happens in practice :)
[16:19:30] I've no idea if we've ever considered conntrackd or not -- I'll leave that one to bblack
[16:19:55] <_joe_> cdanis: I'm not convinced it adds any significant value for our use-case of LVS
[16:20:09] <_joe_> and it makes things like syn floods even harder to defend against, btw
[16:21:11] <_joe_> I wouldn't recommend trying something like conntrackd on our edge load-balancers
[16:22:44] Yeah I think it depends on the type of service you are running and whether severed TCP connections are okay during maintenance or hardware failures
[16:23:10] If occasionally severing connections is okay, than I don't think the added complexity is worthwhile
[16:23:16] *then
[16:23:46] Firewall/connection sync is a very tricky thing to get right and scale well.
[16:24:00] If we can avoid the complexity, better to take the occasional RST storm I think.
[16:24:01] <_joe_> jhathaway: my point is that any well-designed distributed system should be able to cope with severed tcp connections heh
[16:24:27] yeah indeed. it needs to be resilient to that at some layer.
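
The thread keeps coming back to the same conclusion: clients should tolerate a severed TCP connection. A minimal sketch of that client-side retry pattern, using only the Python standard library; the URL, timeout, and retry counts are arbitrary examples rather than anything from production config.

    # Minimal retry-on-reset client sketch; an RST from a failed-over load
    # balancer surfaces here as ConnectionResetError/URLError and is retried.
    import time
    import urllib.error
    import urllib.request

    def fetch_with_retries(url: str, attempts: int = 3, backoff: float = 0.5) -> bytes:
        for attempt in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except (ConnectionResetError, urllib.error.URLError):
                if attempt == attempts:
                    raise
                time.sleep(backoff * attempt)  # simple linear backoff
        raise AssertionError("unreachable")

    # fetch_with_retries("http://example.org/")

Browsers do the same thing implicitly, which is why, as noted above, Chrome retries once or twice before showing an error page.
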
[16:24:27] <_joe_> topranks: that's also my point. Most of the time the most technologically sound solution is fixing the client
[16:24:41] I'd like to apologise for starting this whole flame war of a conversation about HA in the analytics vlan 🙂
[16:24:55] for sure, though you don't always have that control
[16:25:08] <_joe_> btullis: it's not a flame war though, it's a few old people yelling at the cloud
[16:25:28] btullis: no apologies needed, it is always an interesting discussion!
[16:31:06] Agreed
[17:41:44] elukey: which partman recipe should i use for a host that has one hw raid for OS and another hw raid for storage? 'flat' seems to assume only one volume
[17:42:50] (maybe that's not a standard case although it seems like it would be)
[20:11:09] Thanks to whoever signed me up for the ops list!
[20:18:02] inflatador: wasn't me but I just added you to another "list". the google groups group called "ops maintenance"
[20:18:17] that is probably a checkbox you can remove on your onboarding list
[20:18:32] it will mean more mail though :p
[21:05:06] More email? (pumps fist)
[21:10:06] inflatador: :) it's the mail that data centers send about maintenance and things like that. a special shared google inbox
[21:10:18] so that part was a google group. while the ops thing was a mailman list
[21:13:39] Ah, weird. I def got added to the ops email list
[21:16:01] inflatador: https://groups.google.com/a/wikimedia.org/forum/#!forum/ops-maintenance
[21:16:29] I guess the right term is "forum" now, heh
[21:16:41] but it's Google, so it's changing :)
[21:20:31] oh interesting, looks like we probably use the same DC as my old gig
[21:23:37] inflatador: yw :)
[21:24:33] my mysterious benefactor uncloaked!
[21:25:15] haha
[21:48:29] no more ops@lists.wikimedia?
[21:49:40] hauskatze: the mailman list exists as well. the google groups are "internal" lists
[21:50:15] sometimes that's done so that we can (ab)use the group as a shared ACL for other GOOG things like gdrive and calendar invites
[21:50:31] bd808: sorry, my bad. The question should've been, what do you use nowadays? :)
[21:51:12] I remember sending stuff to ops@lists sometimes
[21:51:40] hauskatze: all of them! You've been around long enough to know that we never actually stop using an existing comms channel when we add a new one ;)
[21:51:53] Aye sir
[21:51:56] :)
[21:52:15] One day we'll re-enable Conpherence, for fun
[21:52:35] the ops@lists.wikimedia.org list has had multiple messages since January 1
[22:05:02] needless to say, there are several sre-something google aliases / ACLs as well
[22:07:27] similar to IRC channels there is "really just sre" and "sre plus other engineers" and all that