[13:03:41] <elukey>	 urandom: o/
[13:04:18] <elukey>	 lemme know if you have time for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571, it should be a no-op, but it would unblock all the tests in AQS :)
[13:04:31] <elukey>	 going to update the task with my idea
[14:06:17] <elukey>	 Left a note in https://phabricator.wikimedia.org/T352647#9692816 related to the AQS clients, I am a bit puzzled
[14:32:34] <urandom>	 elukey: glad you pinged, I will look right now!
[14:35:59] <urandom>	 ugh
[14:37:03] <urandom>	 elukey: so are the aqs 2.0 services really using the cassandra-http-gateway chart?  Because they are not cassandra-http-gateway -based (not sure what the consequences of that would be).  /cc hnowlan 
[14:37:34] <hnowlan>	 most of them are using it 
[14:37:39] <urandom>	 how come?
[14:37:44] <hnowlan>	 some of them are using druid-http-gateway
[14:37:54] <hnowlan>	 how do you mean? 
[14:38:34] <urandom>	 again, I don't know the consequences of reusing the chart (other than confusion like this), but they aren't cassandra-http-gateway services
[14:39:26] <hnowlan>	 aren't in what sense? 
[14:40:05] <hnowlan>	 for example https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/media-analytics/helmfile.yaml#18
[14:40:45] <hnowlan>	 I was under the impression the drop-in config behaviour we get for aqs2 services are exactly what the chart was built for
[14:41:12] <urandom>	 cassandra-http-gateway is a (go) framework for creating a (very limited) http shim on top of a cassandra table: https://gitlab.wikimedia.org/repos/generated-data-platform/cassandra-http-gateway
[14:41:26] <urandom>	 we have exactly one such service, image-suggestions
[14:41:48] <urandom>	 the idea was that others would follow, and that chart was created to bang those out quickly when the time came
[14:42:00] <urandom>	 I thought it came with "other stuff", envoy configuration and whatnot
[14:42:07] <urandom>	 but I don't recall all of the deets
[14:42:49] <urandom>	 if it's useful for other stuff, and can be repurposed to do them as well, then it's at least a misnomer at this point
[14:43:20] <urandom>	 I'm pretty sure it wasn't created with that in mind though
[14:43:33] <hnowlan>	 heh
[14:43:38] <hnowlan>	 I was completely unaware of that leg of the table 
[14:44:09] <hnowlan>	 I even wrote the chart but I had zero idea the name corresponded to another software component
[14:44:57] <urandom>	 oh, this chart was created by you?
[14:45:00] <hnowlan>	 yep
[14:45:03] <urandom>	 what is the other chart called...
[14:45:14] * urandom goes looking...
[14:46:05] <urandom>	 wmf-stable/cassandra-http-gateway ?
[14:46:53] <urandom>	 so you created this chart for the aqs 2.0 services?
[14:47:07] <urandom>	 I thought j.ayme created the one I'm thinking of
[14:48:18] <hnowlan>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/cassandra-http-gateway is the one I created
[14:48:40] <hnowlan>	 which was used to roll out image-suggestion
[14:49:24] <urandom>	 oh, so you did write it!
[14:49:27] <urandom>	 the OG chart
[14:50:04] <urandom>	 so what are the distinguishing properties of this chart?
[14:50:28] <urandom>	 what makes it different to other charts, I mean
[14:50:56] <urandom>	 "description: A generic helm chart for cassandra-based HTTP gateway applications"
[15:01:42] <urandom>	 hnowlan: so this was a couple of years ago, I remember a meeting that included (at least) a.kosiaris  and j.ayme about what was being proposed.  They talked about what this would look like vis-a-vis k8s including —as I mentioned— something about an envoy configuration that wouldn't have been warranted for a single service, but since we anticipated more, there was perceived to be a payoff later.
[15:02:18] <urandom>	 there was terminology that I didn't recognize (and that I can't remember), and I asked and got a long-story-short explanation (that I can't remember).
[15:02:39] <elukey>	 ahem the conversation derailed a little, I'd steer it to PKI if possible :D
[15:03:30] <urandom>	 I'm not sure how that meeting connected to your implementation, but (to elukey's point) I'm wondering if there is any longer-term problem (the confusing name notwithstanding) with using that chart for these services
[15:03:51] <urandom>	 I assume not, but...
[15:04:58] <urandom>	 elukey: and, we should really be doing cert verification, shouldn't we?  I guess it makes things easier to migrate to if we're not, but that feels like an action item for later, no?
[15:05:20] <urandom>	 or maybe not...?
[15:05:23] <elukey>	 urandom: yep yep I think the same, it simplifies the work on our side a lot
[15:05:32] <elukey>	 and we can do it later, right after all nodes have PKI certs
[15:05:43] <urandom>	 I mean, poorer security does usually make things easier :)
[15:06:09] <elukey>	 otherwise we'd have needed to use a specific bundle on clients, containing the PKI Root Cert and the ca-manager's Root CA
[15:06:14] <elukey>	 ahahah yes :)
[15:06:51] <elukey>	 so yeah now the main idea would be to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571 and the next in the chain, then extend it to all AQS nodes
[15:07:12] <elukey>	 at that point we'll have the truststore deployed everywhere, and we'll be ready to migrate one/two cassandra instances to PKI
[15:12:12] <hnowlan>	 urandom: funnily enough the first mention of cassandra-http-gateway I have on file is me saying I missed the meeting where it was discussed :D Since then we've modularised and standardised a lot of stuff so things like envoy config etc aren't specific to the chart
[15:12:34] <hnowlan>	 personally apart from the name collision I only see benefits to this approach, in theory
[15:13:14] <urandom>	 what does this chart do, what makes it special?
[15:13:23] <urandom>	 I'm mostly just curious at this point
[15:13:57] <urandom>	 I'd jump in and try to figure this out (and might still), but I find these charts to be impenetrable
[15:14:21] <urandom>	 so much (templated) yaml
[15:14:27] <urandom>	 it's like Spring for Java
[15:14:34] <urandom>	 like Spring for k8s
[15:16:57] <hnowlan>	 it's a codified set of assumptions about a binary that uses a standard config to connect to cassandra. An easier way to understand it might be to look at a values.yaml file for it 
[15:17:01] <hnowlan>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/media-analytics/values.yaml
[15:17:06] <hnowlan>	 apart from these values, nothing changes about the chart
[15:18:11] <urandom>	 I see, so it's The Cassandra Chart™
[15:18:16] <urandom>	 ?
[15:18:33] <urandom>	 generic-cassandra-connecting-service chart?
[15:18:54] <hnowlan>	 pretty much
[15:18:55] <urandom>	 s/cassandra-http-gateway/cassandra/g ?
[15:18:57] <urandom>	 heh
[15:22:09] <urandom>	 elukey: ok, so then next steps would be r1013566, and a smoke test for tls errors connecting to aqs1010?
[15:23:03] <urandom>	 and then remove the temporary `profile::base::certificates::trusted_certs` setting and check again?
[15:24:18] <elukey>	 urandom: so in theory after the test on aqs1010 it will run a new truststore, if the instances are able to connect to the other ones it should be a good validation test
[15:24:22] <urandom>	 (in addition to `tls_use_pki_keep_old_ca` I guess)
[15:25:25] <urandom>	 oh, right, we can't remove the trusted_certs settings until the cluster has been migrated entirely
[15:25:55] <urandom>	 and then we can validate that the services can connect
[15:26:11] <elukey>	 yes exactly, I'd do this
[15:26:29] <elukey>	 1: we rollout the new truststore to aqs1010
[15:26:33] <elukey>	 2: to the whole cluster
[15:26:47] <elukey>	 3: we force pki for aqs1010 and check
[15:26:54] <elukey>	 4: we rollout pki to the rest
[15:26:57] <elukey>	 does it make sense?
[15:27:27] <urandom>	 3: is where we find out of the clients have issues, yes?
[15:27:40] <urandom>	 s/of/if/g
[15:28:02] <urandom>	 (in addition to the other cluster instances)
[15:28:22] <urandom>	 oh, we have the cqlsh config too!
[15:28:43] <elukey>	 re: 3, yes exactly!
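
The check in step 3 (do clients and peer instances still complete a TLS handshake once aqs1010 presents a PKI cert?) could be smoke-tested roughly as follows. This is a minimal sketch using only Python's standard `ssl` and `socket` modules; the function names, and whatever host, port and CA file are passed in, are hypothetical and not taken from the puppet patches discussed here.

```python
import socket
import ssl


def build_context(cafile=None, verify=True):
    """Return an SSLContext that either verifies the server cert or skips it."""
    # cafile=None falls back to the system trust store.
    ctx = ssl.create_default_context(cafile=cafile)
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def tls_handshake_ok(host, port, ctx, timeout=5.0):
    """Attempt a TLS handshake; return (True, "") or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True, ""
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return False, str(exc)
```

Running it once with `verify=False` and once with `verify=True` against the same instance would separate "TLS is broken" from "TLS works but the CA bundle does not cover the new cert".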
[15:42:36] <elukey>	 urandom: for cqlsh, do we explicitly use TLS and set the ca cert?
[15:55:26] <urandom>	 yes
[15:56:30] <urandom>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/cassandra/templates/cqlshrc-4.x.erb#8
[15:56:30] <elukey>	 do you have a pointer to where this happens?
[15:56:37] <elukey>	 super thanks
[15:56:59] <urandom>	 I think the answer is "more conditionals" :)
[15:57:21] <urandom>	 which hopefully we can clean up once all of the clusters are using the PKI
[15:57:41] <elukey>	 I am wondering if in there we could just put the same value that ends up in the cassandra's config
[15:57:46] <elukey>	 basically the truststore path
[15:58:01] <elukey>	 ah wait this one wants a crt
[15:58:05] <urandom>	 yes
[15:58:12] <elukey>	 uff
[15:58:12] <urandom>	 cqlsh is python
[15:59:14] <elukey>	 okok thanks for the pointer, I'll try to think about something and I'll send a patch next week
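
Since cqlsh wants a PEM certificate file rather than a Java truststore, one option during the migration is a single bundle holding both root CAs (the old self-signed one and the PKI one). A minimal sketch of building such a bundle, assuming the inputs are PEM text; `build_ca_bundle` is a hypothetical helper, not something that exists in the puppet module:

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def build_ca_bundle(pem_texts):
    """Concatenate every certificate block found in the given PEM strings."""
    blocks = []
    for text in pem_texts:
        for block in PEM_CERT.findall(text):
            if block not in blocks:  # skip exact duplicates
                blocks.append(block)
    # One block per certificate, newline-terminated; empty input gives "".
    return "\n".join(blocks) + "\n" if blocks else ""
```

The output can be written to a file and referenced from the cqlshrc in place of the single-CA file, so the same config keeps working before and after an instance switches to PKI.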
[15:59:34] <elukey>	 it shouldn't prevent us from moving forward with the new truststore, ok if I start with aqs1010 next week?
[16:00:43] <urandom>	 yes, ofc
[16:01:15] <urandom>	 I can work on that too, and I'm still trying to think of all the corner cases
[16:01:29] <urandom>	 the restbase cluster is ...more concerning, I think
[16:02:10] <urandom>	 but I understand if you want to divest yourself from this project after the aqs cluster :)
[16:04:49] <elukey>	 nono I can help for session store and restbase too
[16:05:35] <elukey>	 I am wondering if we validate TLS certs for session store now
[16:06:12] <urandom>	 if not, it's a bug
[16:07:46] <elukey>	 I see a cassandra CA blurb inline, so I guess it is the self-signed ca's root cert
[16:08:04] <urandom>	 but looking at it, I think the code is similar to the aqs 2.0 code, so maybe it defaults to "not"
[16:08:16] <urandom>	 the self-signed, yeah
[16:10:03] <elukey>	 yep yep
[16:10:51] <urandom>	 wait, I guess I don't know where those certs are coming from
[16:10:58] <urandom>	 more helm chart magic
[16:11:39] <elukey>	 the self signed root CA is listed under helmfile.d/services/sessionstore/values.yaml
[16:11:51] <urandom>	 but that's being overridden in production, no?
[16:12:03] <urandom>	 along with the password?
[16:12:15] <elukey>	 so that one is the prod's override, not the chart's one
[16:12:24] <elukey>	 the password is probably stored in puppet's private
[16:12:38] <elukey>	 the root CA is the public cert, so nothing to protect
[16:12:55] <jhathaway>	 I'm provisioning a vm, and it wants to set db2214 as active in netbox, is that okay? https://phabricator.wikimedia.org/P59690
[16:12:57] <elukey>	 it should just be for kask to validate the cassandra's TLS certs in theory
[16:13:37] <urandom>	 what ca is that?  if it's not overridden, are we replicating it?
[16:13:59] <elukey>	 https://gitlab.wikimedia.org/repos/mediawiki/services/kask/-/blob/main/storage.go?ref_type=heads#L74 - I think that we don't check the certs though
[16:14:22] <elukey>	 urandom: it is the self signed CA from ca-cassandra/manager (don't recall the exact name)
[16:14:56] <elukey>	 it is set to allow prod's pods (kask) to validate the TLS certs provided by cassandra's session store
[16:15:00] <elukey>	 afaict
[16:15:03] <urandom>	 so we are replicating it
[16:15:24] <urandom>	 copypasta
[16:15:28] <elukey>	 replicating in the sense copy/pasting it from cassandra? If so yes
[16:15:34] <urandom>	 yes
[16:16:59] <elukey>	 but we use gocql.SslOptions as we do in aqs' code, without the extra option..
[16:17:45] <elukey>	 but now I am wondering why we have the override for the self-signed CA in session store, was it a problem (like when testing TLS didn't work) or not?
[16:20:11] <urandom>	 I'm not sure I follow
[16:21:02] <urandom>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/templates/_config.yaml
[16:22:56] <elukey>	 I'll try to explain my thoughts
[16:23:04] <elukey>	 I checked /etc/kask/config.yaml on a kask pod, and we have
[16:23:09] <elukey>	   tls:
[16:23:09] <elukey>	     ca: /etc/cassandra-certs/ca.crt
[16:23:38] <elukey>	 and this is fine, it is the self signed CA that is listed in deployment-charts (copy/pasted and deployed as separate file via helm)
[16:24:28] <urandom>	 helm is writing that verbatim (copy/paste) output to that file then?
[16:24:43] <elukey>	 yes I think it writes /etc/cassandra-certs/ca.crt
[16:24:45] <elukey>	 on the pod
[16:25:38] <elukey>	 but if all the rambling that I added in https://phabricator.wikimedia.org/T352647#9692816 is true (big if), we don't set InsecureSkipVerify or EnableHostVerification and gocql.SslOptions seems to default to not verifying a cert
[16:26:41] <elukey>	 so I am wondering why on aqs we didn't add ca.crt, and why we have it on kask's config
[16:26:59] <elukey>	 since, in theory, the code is the same and they should not be validating certs
[16:27:04] <elukey>	 does it make sense?
[16:28:01] <elukey>	 like, say that in kask we'd set the same ca bundle that aqs uses (so puppet ca and pki ca only) - would the code complain and start failing TLS connections to cassandra's session store?
[16:28:23] <elukey>	 or would it keep working ?
[16:28:54] <urandom>	 I don't know (which is concerning).
[16:29:35] <elukey>	 one experiment that we could do is to override the ca bundle in sessionstore's values-staging.yaml
[16:29:41] <elukey>	 deploy and see what kask does
[16:29:51] <elukey>	 if it keeps working we know 
[16:32:50] <urandom>	 actually, for sessionstore we can (and should) work out the whole thing in staging
[16:33:07] <urandom>	 we have that luxury
[16:36:01] <urandom>	 ok, pretty sure we are not verifying
[16:36:07] * urandom sighs
[16:36:46] <urandom>	 I'm also pretty sure the situation was different when kask was written, gocql has since changed, and I guess the aqs 2.0 services just copied what kask was doing
[16:39:34] <urandom>	 elukey: I'm going to fix this in kask by adding a config option to enable/disable verification, we can test in staging, and then open bug(s) for aqs to follow suit
[16:40:00] <urandom>	 or to otherwise refactor to use a *tls.Config from the crypto package
[16:41:06] <elukey>	 urandom: I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017320/ if you want to quickly test staging
[16:41:52] <urandom>	 I mean, we could, but I'm quite sure it's not verifying
[16:42:29] <elukey>	 ah okok :D
[16:42:43] <urandom>	 https://usercontent.irccloud-cdn.com/file/7pJZmVUx/image.png
[16:43:03] <elukey>	 yes yes it is in the commit 
[16:43:15] <elukey>	 buuut I wanted to be sure :)
[16:43:30] <urandom>	 in that table, Config is nil, and EnableHostVerification is unset (i.e. false)
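
The table being referred to can be modelled as a small truth function. This is a Python model, not gocql code: only the row discussed here (Config nil, EnableHostVerification false, so no verification) is confirmed by the conversation, and the remaining rows are an assumption about how gocql combines `Config.InsecureSkipVerify` with `EnableHostVerification`. The function name is hypothetical.

```python
def gocql_verifies(config_insecure_skip_verify, enable_host_verification):
    """Model whether the server cert ends up verified.

    config_insecure_skip_verify is None when SslOptions.Config is nil,
    otherwise the value of Config.InsecureSkipVerify.
    """
    if config_insecure_skip_verify is None:
        # No *tls.Config supplied: verification follows EnableHostVerification.
        return enable_host_verification
    # A supplied config verifies unless it opts out, and (assumed)
    # EnableHostVerification=True overrides that opt-out.
    return enable_host_verification or not config_insecure_skip_verify
```

Under this model the kask and aqs 2.0 setup is `gocql_verifies(None, False)`, which is `False`: the CA file is loaded but never used to reject a certificate.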
[16:43:33] <elukey>	 anyway, it means that migrating session store will be easier too :D
[16:43:33] <urandom>	 ok
[16:43:42] <urandom>	 haha
[16:44:28] <urandom>	 I didn't have enough context from that commit message to understand what they meant, I do now
[16:45:55] <urandom>	 probably because I stopped at the commit message :)
[16:46:09] <elukey>	 all right I am going to log off for today, thanks a lot for the brainbounce and the chat
[16:46:22] <elukey>	 I'll keep working on aqs next week and report back on the truststore
[16:46:36] <elukey>	 I can review the go fixes if you want, feel free to add me
[16:46:54] <urandom>	 sure, and thanks for leading this!
[16:46:55] <elukey>	 have a nice rest of the day and weekend folks! o/
[16:46:58] <elukey>	 <3
[16:47:01] <urandom>	 you too!
[23:38:40] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s3 on db2194 is CRITICAL: 4.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2194&var-port=9104
[23:38:42] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s3 on db2190 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104
[23:39:40] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s3 on db2194 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2194&var-port=9104
[23:39:44] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s3 on db2190 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2190&var-port=9104