[08:57:01] Dear SREs, I have deployed a change that is not working on admin_ng
[08:57:07] on kubernetes
[08:57:13] please do not deploy there
[09:00:55] I'll be upgrading seaborgium (LDAP server in codfw) to bullseye in a bit, it should not be noticeable in practice since practically all LDAP requests go against the replicas
[09:10:42] VO question - have we stopped doing the working-hours thing where, if the on-callers don't ack a p.age within 5 minutes, it gets escalated to batphone?
[09:20:12] (I know the batphone escalation when no-one's on-call is done by a systemd timer outside VO)
[09:23:50] seaborgium update is complete
[10:48:35] klausman: are you the one I should ping if I want to update the flink-operator on dse-k8s-eqiad?
[10:48:57] Nope, not a DSE man :) I think you want btullis
[10:50:19] then we need to update that entry on wikitech
[10:50:25] "The dse-k8s cluster group currently comprises only the dse-k8s-eqiad cluster. This cluster is jointly managed by the Machine Learning and the Data Engineering teams"
[10:50:37] yeah, I just spotted that. Will do an edit
[10:51:04] brouberol, btullis: are you the ones I should ask for an admin change on dse-k8s-eqiad?
[12:06:43] yep, what do you need?
[12:59:00] brouberol: sorry, I was at lunch :)
[12:59:57] I have deployed this change on eqiad and codfw, and I would like it to be deployed on dse-k8s-eqiad too: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029573
[13:02:21] sure, I can do that
[13:02:57] I can see that this _should_ allow the flink operator to have access to the k8s API via calico network policies instead of hardcoded IPs, is that right?
[13:03:04] the only thing needed is to kill the flink-operator pods
[13:03:07] yes
[13:03:40] I am not sure what flink stuff you lot have there, and I didn't want to cause any trouble
[13:04:00] as an aside, we could even remove the hardcoded zk-flink IPs and rely on external-services, but that'd be for later
[13:04:05] brouberol: the general testing we did was: kill the operator pods (one first, to see if it comes back up)
[13:04:26] oh, you mean kill them after the diff is applied?
[13:04:34] yes, there is a patch ready to go afterwards: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031811
[13:04:48] yes, you kill one, make sure it comes up and is ready
[13:04:52] kill the other one
[13:05:05] and then you can redeploy a service that is using it
[13:07:43] alright then, let me find what service I'll be restarting, and then I'll proceed
[13:08:38] dcausse: o/
[13:08:45] o/
[13:09:04] if you have a moment, I'd like to verify one thing about the usage of thanos swift by the rdf-streaming-updater
[13:09:15] sure
[13:09:38] the only thing I can see that uses it is the rdf-streaming-updater
[13:09:42] I am doing some checks for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031439, since we'll soon move one of the thanos swift frontends to a new TLS cert, issued by PKI and no longer by the Puppet CA
[13:10:41] actually, it seems that it was never deployed to dse-k8s-eqiad at all
[13:10:48] so redeploying it makes little sense
[13:10:57] I am trying to verify whether the flink config can validate TLS certs from both the Puppet CA and PKI; I see that the docker image carries wmf-certificates and I don't see any specific cabundle being set (i.e. something that would make flink validate only TLS certs coming from the Puppet CA)
[13:11:25] dcausse: is that the right understanding?
I am trying to prevent outages basically :)
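For reference, a minimal shell sketch of the "kill one operator pod, wait for it to come back Ready, then kill the other" procedure described above, against dse-k8s-eqiad. The namespace, label selector and pod names are assumptions for illustration, not the actual chart values:

    # List the operator pods, delete one, wait for its replacement to become
    # Ready, then delete the second one (namespace/labels/pod names below are
    # placeholders, not confirmed values from the chart).
    kubectl -n flink-operator get pods -l app.kubernetes.io/name=flink-kubernetes-operator
    kubectl -n flink-operator delete pod <first-operator-pod>
    kubectl -n flink-operator wait pod -l app.kubernetes.io/name=flink-kubernetes-operator \
        --for=condition=Ready --timeout=120s
    kubectl -n flink-operator delete pod <second-operator-pod>
    # Afterwards, redeploy (or restart) one service that uses the operator to
    # confirm reconciliation still works.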
[13:11:34] effie: I'm applying the change
[13:12:13] are we using flink-operator in dse-k8s at all anymore? I think we put it there for testing
[13:12:33] I don't think we are, inflatador
[13:12:48] elukey: I can perhaps test from the container directly?
[13:12:55] I killed flink-kubernetes-operator-6cd94cdb7f-b8scl and the new pod came back 1/1 Ready
[13:13:35] brouberol: it seems like you can kill this bit altogether
[13:14:27] if no one is using the operator, then no need to have it
[13:14:32] yep, agreed
[13:15:30] dcausse: the plan is to depool ms-fe1001, change the cert, validate and then repool (so a single node rather than all the thanos swift ones) - can I ping you when we do it so we can verify?
[13:18:33] elukey: I don't really understand what is going to happen
[13:19:40] inflatador, effie: I've opened https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031900
[13:20:20] elukey, brouberol: are these changes only happening in dse-k8s-eqiad, or do we need to address this in prod too? B/c we def use thanos-swift in prod
[13:20:33] currently thanos-swift.discovery.wmnet gets verified via the Puppet CA, which comes from the wmf-certificates deb package IIUC
[13:21:03] Understood, guessing that means we need to build a new image if ours doesn't have the new CAs?
[13:21:09] dcausse: sure, sorry, lemme explain - IIUC Flink contacts Thanos Swift for storage, and at the moment it should be using TLS. At the moment the Thanos swift nodes (thanos-swift.discovery.wmnet) are running with a TLS cert issued by the Puppet CA, and we want to move to another one issued by PKI (basically we want to stop using the Puppet CA for these certs). PKI is the new infra for TLS
[13:21:15] certs that SRE offers. Flink should be able to validate both types of TLS certs in theory, since it has the wmf-certificates package deployed on the docker image, and afaict we don't configure the Flink config to trust only Puppet CA certs.
[13:22:18] will I need to deploy a new version of the wmf-certificates package?
[13:22:31] I wonder if the containers have curl, maybe we can just exec in and try to hit the new endpoint
[13:22:33] inflatador: we are going to move ms-fe1001 in a bit to PKI TLS certs, so it will affect all prod use cases
[13:22:54] it has curl and that is what I use for testing
[13:22:55] although I guess flink will use the java keystore
[13:23:24] is there a way to check the flink config for s3/swift?
[13:23:38] I mean, in what file it is rendered.. I can check the chart as well
[13:24:36] elukey: you mean, I think, thanos-fe1001 (sorry, I don't mean to be pedantic, but ms-fe* are a different swift cluster, already migrated)
[13:25:00] Emperor: yes sorry sorry, pebkac
[13:25:06] the code review is right
[13:25:34] but I mentioned the wrong node :D
[13:25:35] we point flink at thanos-swift.discovery.wmnet
[13:27:17] dcausse: I see in flinkdeployment.yaml that we only set s3.endpoint and not any cabundle or TLS-specific thing, so it just relies on what's available on the node (in theory). So we should be good :)
[13:28:08] yes, nothing particular is configured, so if the system is supposed to validate these certs properly we should be good
[13:28:15] I can monitor the jobs while you do it
[13:28:27] dcausse: yes yes, thanks for checking, I'll tell you once done so you can keep an eye <3
[13:30:45] Emperor: ok if I depool and upgrade thanos-fe1001? (this time I should have the correct node, sorry)
[13:31:15] elukey: I'm happy if godog is happy
[13:31:36] is godog happy?
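A minimal sketch of the curl-from-the-container test mentioned above, assuming the rdf-streaming-updater pods ship curl and rely on the system CA bundle provided by wmf-certificates; the pod name is a placeholder:

    # Exec into a flink pod and verify that TLS validation against the thanos
    # swift endpoint succeeds with the CA bundle already present in the image.
    kubectl -n rdf-streaming-updater exec -it <flink-taskmanager-pod> -- \
        curl -sv https://thanos-swift.discovery.wmnet/ -o /dev/null
    # "SSL certificate verify ok" in the verbose output means the PKI-issued
    # cert chains to a CA the image already trusts; a verification error would
    # mean a new image (or an explicit cabundle setting) is needed.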
[13:31:44] yep, olly is happy, Filippo not sure
[13:31:59] :D
[13:32:06] yep SGTM
[13:32:46] :)
[13:33:05] flink-operator and associated CRDs are gone from dse-k8s-eqiad
[13:40:15] thanos-fe1001 is running the new cert, openssl seems fine (SANs etc..)
[13:40:18] repooling!
[13:40:38] effie: pooled!
[13:42:30] dcausse: new node is up and running serving traffic, lemme know if you see anything strange
[13:42:52] elukey: how many nodes are behind this lvs?
[13:43:17] dcausse: 4 afaics (in eqiad)
[13:44:17] things seem to be working, restarting a job to force a reconnection
[13:47:27] elukey: well, everything looks fine but I can't tell for sure if it used thanos-fe1001
[13:49:16] dcausse: I checked via netstat, I see taskmanagers and flink-app containers connected (from the rdf-streaming-updater namespace)
[13:49:19] so all good so far
[13:49:27] thanks!
[13:50:24] elukey: yeah, I already see the cpu working harder
[13:50:59] effie: can you share a link? Didn't find it
[13:51:07] yes, one moment please
[13:51:39] https://grafana.wikimedia.org/goto/WYQaRkPSg?orgId=1 -> throttled
[13:51:54] which then went down though
[13:51:55] and
[13:52:26] https://grafana.wikimedia.org/goto/_PDYRzESg?orgId=1
[13:53:45] weird, I checked https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details and I don't see anything for the tegola-vector-tiles ns
[13:54:01] ah now wait, the graphs fooled me
[13:54:01] okok
[13:54:29] the limit is 4s and it is difficult to see
[13:55:21] so ~+50ms of cpu usage on the pod that I am checking, sigh
[13:55:50] how many frontends do we have?
[13:55:57] 4 afaics
[13:57:25] I see almost 2x, it is even worse
[13:57:57] https://grafana.wikimedia.org/goto/aA5eRzPSg?orgId=1
[13:58:54] total usage goes from ~150/160 to ~210, I think it is roughly +50ms
[13:59:14] that is not great for a TLS cert change, yes
[13:59:47] elukey: I just selected a random pod, but even so, the problem is still there
[14:00:36] effie: we can keep this setting in my opinion and verify what's wrong, it affects perf a little but the service looks healthy now
[14:01:02] and there is room for an increase in cpu usage, not that we absolutely need to use it, but..
[14:01:22] so the only difference I had spotted at the time is this one: Cipher Suites: TLS_AES_128_GCM_SHA256 is selected by envoy with the puppet cert vs TLS_AES_256_GCM_SHA384 with cfssl
[14:02:08] elukey: we can beef things up for 1 cluster to handle the traffic of both DCs
[14:06:15] effie: have we ever tried to contact upstream about this issue?
[14:06:51] we have an old version, first of all, and secondly it is not a tegola problem but rather a go problem
[14:07:29] our chances of any of those upstreams looking into this are not good
[14:09:20] effie: the issue is with the golang ssl libs? is this a case where an envoy upgrade would help?
[14:09:48] cdanis: from tegola's end
[14:10:15] sorry, don't understand
[14:10:45] cdanis: tegola is a go app
[14:11:09] ah ok
[14:11:25] effie: it seems more like an issue with the aws sdk that tegola uses.. In https://github.com/go-spatial/tegola/releases/tag/v0.20.0 I see that they bumped it, maybe that is sufficient to make things better
[14:12:44] pretty sure that we aren't the first ones experiencing problems
[14:13:33] now I am already seeing the next problem: who upgrades tegola?
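A sketch of the kind of openssl check referenced above ("openssl seems fine (SANs etc..)"), assuming the frontend is reachable directly on port 443 and that the discovery name is the right SNI to send; the hostname and port are inferred from the conversation, not confirmed:

    # Check issuer and SANs of the cert now served by the repooled frontend.
    echo | openssl s_client -connect thanos-fe1001.eqiad.wmnet:443 \
        -servername thanos-swift.discovery.wmnet 2>/dev/null \
        | openssl x509 -noout -issuer -ext subjectAltName
    # And the negotiated cipher suite (the TLS_AES_128 vs TLS_AES_256
    # difference discussed above).
    echo | openssl s_client -connect thanos-fe1001.eqiad.wmnet:443 \
        -servername thanos-swift.discovery.wmnet 2>/dev/null | grep 'Cipher is'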
[14:13:50] :) [14:15:59] just four years of commits [14:19:46] I have updated the task, I'll try to see if I can find more about the TLS usage [14:19:52] thanks all for the support for now :) [14:20:42] I think that it all comes down to crypto/tls in the end [14:27:16] I am pretty sure that tegola doesn't do any connection pooling etc.. to thanos, so opening a connection every time + tls handshake + etc.. a different cipher suite can lead to increase in usage, it may be an explanation [14:27:38] if so using the envoy sidecar would surely alleviate the problem [14:28:03] but we'd need a patch to our Tegola repo, to fix the host header for request signing [14:28:11] (to be proxy aware) [14:28:27] we are jumping from a 128bit key to a 256 one, and switching from SHA256 to SHA384, so we know this is one problem on its won [14:29:37] pretty sure it wouldn't be an issue with the sidecar [14:29:42] conn pooling etc.. [14:29:59] sure sure, a proxy would def solve all our problems [14:30:12] btw, the AWS envoy filter had anything interesting ? [14:30:21] I am using netstat on thanos-fe1001, filtering for one tegola ip, and it keeps creating conns [14:30:56] I found only https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/aws_request_signing_filter and it is listed as experimental [14:31:37] if it is not a lot of work to have a go, we could do so [14:32:08] I understand that experimental is generally a nogo (pun intended), but what if it works ok for us? [14:32:40] I have to shoot off, happy to help in any way [14:33:33] elukey: thanks for putting effort on ths [14:39:50] effie: np! thanks for the brainbounce.. I left a comment in the task, I'd lean towards fixing request signing on tegola's end and test it, should be a matter of a small go patch. Not very ideal but it could be a start, then we could experiment with the envoy filter in the future when it is more stable. [15:46:56] is zuul/gerrit ok? [15:47:04] vgutierrez: see -ops, gerrit is not [16:26:57] we had an incident today where gerrit was down between 15:42 and 15:55 UTC [16:27:04] 11:47:31 <+jinxer-wm> FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - [16:27:16] we were _not_ paged for it. any thoughts on making this page? [16:40:28] more-broadly, what's our standard for paging? is it only directly-user-impacting? or also developer/SRE-workflows-impacting? [16:41:01] (I suspect as with most such questions, the answer at present is fuzzy) [16:43:00] phab is considered slightly more important because then you cant report/follow other bugs.. I guess [16:50:43] I would say Gerrit should be included because of the developer workflow impact and its effects on other things [16:52:26] example scenario: if gerrit is down and we were not aware and there was a need to depool the site, we can be caught off guard [16:58:44] bblack: I don't think there is a real standard [16:58:56] some of that, SLOs will solve tventually [17:38:49] topranks: sorry, stupid question: when are you planning to re-assign IPs to the dbs (what we discussed regarding dbctl), are the new IPs going to be on 10.64.% and 10.192.% or it can be other IPs too? I'm working on some grants and wondering if I should make them 10.% instead [17:39:42] Amir1: no specific timescale, we need to pick that back up [17:40:14] But to answer the particular question, eqiad will stay within 10.64 and codfw within 10.192 so you can probably leave the grants as they are [17:40:22] ah awesome [17:40:25] thank you! 
[17:41:04] we need to get the cookbooks ready to make it easier for you guys, then we can tackle it at a pace that suits everyone