[13:03:41] urandom: o/
[13:04:18] lemme know if you have time for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571, it should be a no-op, but it would unblock all the tests in AQS :)
[13:04:31] going to update the task with my idea
[14:06:17] Left a note in https://phabricator.wikimedia.org/T352647#9692816 related to the AQS clients, I am a bit puzzled
[14:32:34] elukey: glad you pinged, I will look right now!
[14:35:59] ugh
[14:37:03] elukey: so are the aqs 2.0 services really using the cassandra-http-gateway chart? Because they are not cassandra-http-gateway -based (not sure what the consequences of that would be). /cc hnowlan
[14:37:34] most of them are using it
[14:37:39] how come?
[14:37:44] some of them are using druid-http-gateway
[14:37:54] how do you mean?
[14:38:34] again, I don't know the consequences of reusing the chart (other than confusion like this), but they aren't cassandra-http-gateway services
[14:39:26] aren't in what sense?
[14:40:05] for example https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/media-analytics/helmfile.yaml#18
[14:40:45] I was under the impression the drop-in config behaviour we get for aqs2 services are exactly what the chart was built for
[14:41:12] cassandra-http-gateway is a (go) framework for creating a (very limited) http shim on top of a cassandra table: https://gitlab.wikimedia.org/repos/generated-data-platform/cassandra-http-gateway
[14:41:26] we have exactly one such service, image-suggestions
[14:41:48] the idea was that others would follow, and that chart was created to bang those out quickly when the time came
[14:42:00] I thought it came with "other stuff", envoy configuration and whatnot
[14:42:07] but I don't recall all of the deets
[14:42:49] if it's useful for other stuff, and can be repurposed to do them as well, then it's at least a misnomer at this point
[14:43:20] I'm pretty sure it wasn't created with that in mind though
[14:43:33] heh
[14:43:38] I was completely unaware of that leg of the table
[14:44:09] I even wrote the chart but I had zero idea the name corresponded to another software component
[14:44:57] oh, this chart was created by you?
[14:45:00] yep
[14:45:03] what is the other chart called...
[14:45:14] * urandom goes looking...
[14:46:05] wmf-stable/cassandra-http-gateway ?
[14:46:53] so you created this chart for the aqs 2.0 services?
[14:47:07] I thought j.ayme created the one I'm thinking of
[14:48:18] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/cassandra-http-gateway is the one I created
[14:48:40] which was used to roll out image-suggestion
[14:49:24] oh, so you did write it!
[14:49:27] the OG chart
[14:50:04] so what are the distinguishing properties of this chart?
[14:50:28] what makes it different to other charts, I mean
[14:50:56] "description: A generic helm chart for cassandra-based HTTP gateway applications"
[15:01:42] hnowlan: so this was a couple of years ago, I remember a meeting that included (at least) a.kosiaris and j.ayme about what was being proposed. They talked about what this would look like vis-a-vis k8s including, as I mentioned, something about an envoy configuration that wouldn't have been warranted for a single service, but since we anticipated more, there was perceived to be a payoff later.
[15:02:18] there was terminology that I didn't recognize (and that I can't remember), and I asked and got a long-story-short explanation (that I can't remember).
[15:02:39] ahem the conversation derailed a little, I'd steer it to PKI if possible :D
[15:03:30] I'm not sure how that meeting connected to your implementation, but (to elukey's point) I'm wondering if there is any longer-term problem (the confusing name notwithstanding) with using that chart for these services
[15:03:51] I assume not, but...
[15:04:58] elukey: and, we should really be doing cert verification, shouldn't we? I guess it makes things easier to migrate to if we're not, but that feels like an action item for later, no?
[15:05:20] or maybe not...?
[15:05:23] urandom: yep yep I think the same, it simplifies a lot the work on our side
[15:05:32] and we can do it later, right after all nodes have PKI certs
[15:05:43] I mean, poorer security does usually make things easier :)
[15:06:09] otherwise we'd have needed to use a specific bundle on clients, containing the PKI Root Cert and the ca-manager's Root CA
[15:06:14] ahahah yes :)
[15:06:51] so yeah now the main idea would be to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571 and the next in the chain, then extend it to all AQS nodes
[15:07:12] at that point we'll have the truststore deployed everywhere, and we'll be ready to migrate one/two cassandra instances to PKI
[15:12:12] urandom: funnily enough the first mention of cassandra-http-gateway I have on file is me saying I missed the meeting where it was discussed :D Since then we've modularised and standardised a lot of stuff so things like envoy config etc aren't specific to the chart
[15:12:34] personally apart from the name collision I only see benefits to this approach, in theory
[15:13:14] what does this chart do, what makes it special?
[15:13:23] I'm mostly just curious at this point
[15:13:57] I'd jump in and try to figure this out (and might still), but I find these charts to be impenetrable
[15:14:21] so much (templated) yaml
[15:14:27] it's like Spring for Java
[15:14:34] like Spring for k8s
[15:16:57] it's a codified set of assumptions about a binary that uses a standard config to connect to cassandra. An easier way to understand it might be to look at a values.yaml file for it
[15:17:01] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/media-analytics/values.yaml
[15:17:06] apart from these values, nothing changes about the chart
[15:18:11] I see, so it's The Cassandra Chart™
[15:18:16] ?
[15:18:33] generic-cassandra-connecting-service chart?
[15:18:54] pretty much
[15:18:55] s/cassandra-http-gateway/cassandra/g ?
[15:18:57] heh
[15:22:09] elukey: ok, so then next steps would be r1013566, and a smoke test for tls errors connecting to aqs1010?
[15:23:03] and then remove the temporary `profile::base::certificates::trusted_certs` setting and check again?
[15:24:18] urandom: so in theory after the test on aqs1010 it will run a new truststore, if the instances are able to connect to the other ones it should be a good validation test
[15:24:22] (in addition to `tls_use_pki_keep_old_ca` I guess)
[15:25:25] oh, right, we can't remove the trusted_certs settings until the cluster has been migrated entirely
[15:25:55] and then we can validate that the services can connect
[15:26:11] yes exactly, I'd do this
[15:26:29] 1: we rollout the new truststore to aqs1010
[15:26:33] 2: to the whole cluster
[15:26:47] 3: we force pki for aqs1010 and check
[15:26:54] 4: we rollout pki to the rest
[15:26:57] does it make sense?
[15:27:27] 3: is where we find out of the clients have issues, yes?
[15:27:40] s/of/if/g
[15:28:02] (in addition to the other cluster instances)
[15:28:22] oh, we have the cqlsh config too!
[15:28:43] re: 3, yes exactly!
[15:42:36] urandom: for cqlsh, do we explicitly use TLS and set the ca cert?
[15:55:26] yes
[15:56:30] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/cassandra/templates/cqlshrc-4.x.erb#8
[15:56:30] do you have a pointer where this happens?
[15:56:37] super thanks
[15:56:59] I think the answer is "more conditionals" :)
[15:57:21] which hopefully we can clean up once all of the clusters are using the PKI
[15:57:41] I am wondering if in there we could just put the same value that ends up in the cassandra's config
[15:57:46] basically the truststore path
[15:58:01] ah wait this one wants a crt
[15:58:05] yes
[15:58:12] uff
[15:58:12] cqlsh is python
[15:59:14] okok thanks for the pointer, I'll try to think about something and I'll send a patch next week
[15:59:34] it shouldn't prevent us to move forward with the new truststore, ok if I start with aqs1010 next week?
[16:00:43] yes, ofc
[16:01:15] I can work on that too, and I'm still trying to think of all the corner cases
[16:01:29] the restbase cluster is ...more concerning, I think
[16:02:10] but I understand if you want to divest yourself from this project after the aqs cluster :)
[16:04:49] nono I can help for session store and restbase too
[16:05:35] I am wondering if we validate TLS certs for session store now
[16:06:12] if not, it's a bug
[16:07:46] I see a cassandra CA blurb inline, so I guess it is the self-signed ca's root cert
[16:08:04] but looking at it, I think the code is similar to the aqs 2.0 code, so maybe it defaults to "not"
[16:08:16] the self-signed, yeah
[16:10:03] yep yep
[16:10:51] wait, I guess I don't know where those certs are coming from
[16:10:58] more helm chart magic
[16:11:39] the self signed root CA is listed under helmfile.d/services/sessionstore/values.yaml
[16:11:51] but that's being overridden in production, no?
[16:12:03] along with the password?
[16:12:15] so that one is the prod's override, not the chart's one
[16:12:24] the password is probably stored in puppet's private
[16:12:38] the root CA is the public cert, so nothing to protect
[16:12:55] I'm provisioning a vm, and it wants to set db2214 as active in netbox, is that okay? https://phabricator.wikimedia.org/P59690
[16:12:57] it should just be for kask to validate the cassandra's TLS certs in theory
[16:13:37] what ca is that? if it's not overridden, are we replicating it?
[16:13:59] https://gitlab.wikimedia.org/repos/mediawiki/services/kask/-/blob/main/storage.go?ref_type=heads#L74 - I think that we don't check the certs though
[16:14:22] urandom: it is the self signed CA from ca-cassandra/manager (don't recall the exact name)
[16:14:56] it is set to allow prod's pods (kask) to validate the TLS certs provided by cassandra's session store
[16:15:00] afaict
[16:15:03] so we are replicating it
[16:15:24] copypasta
[16:15:28] replicating in the sense copy/pasting it from cassandra? If so yes
[16:15:34] yes
[16:16:59] but we use gocql.SslOptions as we do in aqs' code, without the extra option..
[16:17:45] but now I am wondering why we have the override for the self-signed CA in session store, was it a problem (like when testing TLS didn't work) or not?
[16:20:11] I'm not sure I follow
[16:21:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/templates/_config.yaml
[16:22:56] I'll try to explain my thoughts
[16:23:04] I checked /etc/kask/config.yaml on a kask pod, and we have
[16:23:09] tls:
[16:23:09] ca: /etc/cassandra-certs/ca.crt
[16:23:38] and this is fine, it is the self signed CA that is listed in deployment-charts (copy/pasted and deployed as separate file via helm)
[16:24:28] helm is writing that verbatim (copy/paste) output to that file then?
[16:24:43] yes I think it writes /etc/cassandra-certs/ca.crt
[16:24:45] on the pod
[16:25:38] but if all the rumbling that I added in https://phabricator.wikimedia.org/T352647#9692816 is true (big if), we don't set InsecureSkipVerify or EnableHostVerification and gocql.SslOptions seems to default to not verify a cert
[16:26:41] so I am wondering why on aqs we didn't add ca.crt, and why we have it on kask's config
[16:26:59] since, in theory, the code is the same and they should not be validating certs
[16:27:04] does it make sense?
[16:28:01] like, say that in kask we'd set the same ca bundle that aqs uses (so puppet ca and pki ca only) - would the code complain and start failing TLS connections to cassandra's session store?
[16:28:23] or would it keep working ?
[16:28:54] I don't know (which is concerning).
[16:29:35] one experiment that we could do is to override the ca bundle in sessionstore's values-staging.yaml
[16:29:41] deploy and see what kask does
[16:29:51] if it keeps working we know
[16:32:50] actually, for sessionstore we can (and should) work out the whole thing in staging
[16:33:07] we have that luxury
[16:36:01] ok, pretty sure we are not verifying
[16:36:07] * urandom sighs
[16:36:46] I'm also pretty sure the situation was different when kask was written, gocql has since changed, and I guess the aqs 2.0 services just copied what kask was doing
[16:39:34] elukey: I'm going to fix this in kask by adding a config option to enable/disable verification, we can test in staging, and then open bug(s) for aqs to follow suit
[16:40:00] or to otherwise refactor to use a *tls.Config from the crypto package
[16:41:06] urandom: I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017320/ if you want to quickly test staging
[16:41:52] I mean, we could, but I'm quite sure it's not verifying
[16:42:29] ah okok :D
[16:42:43] https://usercontent.irccloud-cdn.com/file/7pJZmVUx/image.png
[16:43:03] yes yes it is in the commit
[16:43:15] buuut I wanted to be sure :)
[16:43:30] in that table, Config is nil, and EnableHostVerification is nil (i.e. false)
[16:43:33] anyway, it means that migrating session store will be easier too :D
[16:43:33] ok
[16:43:42] haha
[16:44:28] I didn't have enough context from that commit message to understand what they meant, I do now
[16:45:55] probably because I stopped at the commit message :)
[16:46:09] all right I am going to log off for today, thanks a lot for the brainbounce and the chat
[16:46:22] I'll keep working on aqs next week and report back on the truststore
[16:46:36] I can review the go fixes if you want, feel free to add me
[16:46:54] sure, and thanks for leading this!
[16:46:55] have a nice rest of the day and weekend folks! o/
[16:46:58] <3
[16:47:01] you too!
[23:38:40] PROBLEM - MariaDB sustained replica lag on s3 on db2194 is CRITICAL: 4.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2194&var-port=9104
[23:38:42] PROBLEM - MariaDB sustained replica lag on s3 on db2190 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[23:39:40] RECOVERY - MariaDB sustained replica lag on s3 on db2194 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2194&var-port=9104
[23:39:44] RECOVERY - MariaDB sustained replica lag on s3 on db2190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
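
The fix floated at 16:39 and 16:40 (a config option to enable or disable verification, or building a *tls.Config from the crypto package) could look roughly like the following. This is a minimal sketch using only the Go standard library, not the actual kask or AQS change: the function name `cassandraTLSConfig`, its `verify` flag, and the placeholder server name are assumptions made for illustration. Only the CA path /etc/cassandra-certs/ca.crt comes from the log.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// cassandraTLSConfig is a hypothetical helper (not taken from kask or AQS).
// With verify=true it trusts only the CAs found in caPEM and leaves Go's
// default certificate and hostname verification switched on.
// With verify=false it reproduces the "not verifying" behaviour explicitly.
func cassandraTLSConfig(caPEM []byte, serverName string, verify bool) (*tls.Config, error) {
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		ServerName: serverName,
	}
	if !verify {
		// Explicit opt-out, so the insecure mode is visible in config.
		cfg.InsecureSkipVerify = true
		return cfg, nil
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("CA bundle contains no usable PEM certificates")
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	// CA path as quoted from /etc/kask/config.yaml in the log.
	const caPath = "/etc/cassandra-certs/ca.crt"
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		fmt.Println("could not read CA bundle:", err)
		return
	}
	// The server name is a placeholder, not a real host.
	cfg, err := cassandraTLSConfig(caPEM, "cassandra.example.internal", true)
	if err != nil {
		fmt.Println("TLS config error:", err)
		return
	}
	fmt.Println("verification enabled:", !cfg.InsecureSkipVerify)
}
```

The resulting config would then be handed to the driver, presumably through the Config field of gocql.SslOptions mentioned at 16:43. That wiring is left out here because it depends on each service's own code.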