[12:16:09] serviceops, docker-pkg: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (hashar) The python3 image got removed as part of T335282 by https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/91176...
[12:16:41] serviceops, docker-pkg: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (hashar)
[12:17:14] serviceops: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (hashar) Removing #docker-pkg tag which is for the building software itself rather than the images.
[12:58:55] serviceops, Prod-Kubernetes: Fix naming confusion around main/wikikube kubernetes clusters - https://phabricator.wikimedia.org/T336861 (JMeybohm) >>! In T336861#8864785, @Ottomata wrote: >> appending -k8s which I personally find redundant > I also find it redundant. > > @JMeybohm > What will the intend...
[13:04:26] serviceops, Prod-Kubernetes: Fix naming confusion around main/wikikube kubernetes clusters - https://phabricator.wikimedia.org/T336861 (Ottomata) * 'wikikube-main-eqiad' * 'wikikube-main-codfw' * 'wikikube-staging-eqiad' * 'wikikube-staging-codfw' Then?
[13:16:43] serviceops, Prod-Kubernetes: Fix naming confusion around main/wikikube kubernetes clusters - https://phabricator.wikimedia.org/T336861 (JMeybohm) That makes less sense to me than the prior proposal. But as said, this is undecided and needs to be looked at in detail with all places of use in mind. As I've...
[13:21:05] serviceops, Prod-Kubernetes: Fix naming confusion around main/wikikube kubernetes clusters - https://phabricator.wikimedia.org/T336861 (Ottomata) Okay, we are mostly trying to guess/anticipate what you all will do so we can move forward with using these names in T336656.
If we guess wrong, we'll deal wi...
[13:21:43] serviceops, Prod-Kubernetes: Fix naming confusion around main/wikikube kubernetes clusters - https://phabricator.wikimedia.org/T336861 (JMeybohm)
[13:28:06] hello, akosiaris, can you remind me? We set up staging-codfw cluster the same as staging-eqiad, but generally we don't deploy services there? is that correct?
[13:29:13] I ask because we want to set a cluster specific value, and I think I want to create two environments in the helmfile. so instead of just staging, I'd have staging-eqiad and staging-codfw
[13:37:34] this is another thing that we need to look at with the renaming :)
[13:40:19] it is correct that we usually don't deploy to staging-codfw. Although we should be able to. This makes it a bit unclear on how to deal with your usecase as flink will probably start with an unexpected state in case somebody deploys to staging-codfw, right?
[13:40:42] as an old tombstone will be used ...most likely
[13:41:35] but maybe you can hack your way out of that by using the same zookeeper path for all staging clusters, as leader election will then take care of only running the stuff on one cluster
[13:41:49] well..not true because of one zookeeper per DC
[13:42:37] is there maybe an understanding of how this is handled by rdf-streaming-updater?
[13:53:15] jayme: gonna do it like this:
[13:53:15] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920268/11/helmfile.d/services/mw-page-content-change-enrich/helmfile.yaml
[13:53:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920268/12/helmfile.d/services/mw-page-content-change-enrich/values-staging-codfw.yaml
[13:53:37] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920268/12/helmfile.d/services/mw-page-content-change-enrich/values-staging-eqiad.yaml
[13:55:30] ottomata: that makes you a snowflake as the deploy command for this service is different from all others...
[13:55:54] hm, okay.
[13:55:54] in other words: please don't :)
[13:55:56] haha okay
[13:56:06] got another suggestion?
[13:56:26] i guess we could make staging always target eqiad things, e.g. eqiad zookeeper
[13:56:33] even if deployed in staging-codfw?
[13:58:08] we could, in that case we would have them fight for leader cross-cluster
[13:58:24] oh if they were deployed in both clusters at the same time, that would be a problem.
[13:58:24] which might be good - or not - I don't know
[13:58:46] that would break.
[13:58:49] they will be at some point, I can assure
[13:59:03] they can be if we can use different ZK and checkpoint paths
[13:59:11] treating them as distinctly different app deployments
[13:59:38] but if they use the same ZK cluster and path, and same swift cluster / checkpoint paths, they will break for sure
[13:59:55] why would they break?
[14:00:33] I would have assumed that leader election will determine there is one leader (somewhere) and not do anything
[14:00:50] (not saying that I think this is a good idea :))
[14:02:16] we don't want to run a flink cluster across DCs, nor do I think we really can? the JobManager is responsible for talking to the flink operator in the current cluster to ask it to make TaskManager pods
[14:02:36] ottomata: my generic answer is to avoid tailoring anything to staging-codfw. Try to not reference it in any kind of way, try to not deploy things to it, try to ignore that it exists
[14:03:11] akosiaris: but, in admin_ng, we should? right? e.g. we do want to deploy the flink-operator and create service namespaces there, but we just don't want the helmfile.d/service stuff to every refer to staging-codfw?
[14:03:26] ever*
[14:03:48] > the JobManager is responsible for talking to the flink operator in the current cluster to ask it to make TaskManager pods
[14:04:22] actually that isn't quite right. The flink-operator spawns the JobManager(s). The JobManager asks k8s to create TaskManager pods.
[14:04:52] it might be better to not dive down that rabbithole and just treat staging-eqiad as staging then for your use cases
[14:05:16] akosiaris: should we not deploy flink-operator in staging-codfw? and should we not create service namespaces in staging-codfw?
[14:05:38] ottomata: yeah, generally treat it as something that does not exist. It's the safest thing for you to do
[14:06:07] that means we'll have to adapt some of our workflows when we need to mess with staging-eqiad, but that's probably a better ux for you
[14:06:48] okay, will have to revert some stuff.
[14:09:52] hm, yeah i can see that other things, not just flink, won't work as is with the values-staging.yaml files. they all refer to eqiad specific resources, e.g. test-eqiad kafka cluster.
[14:10:05] i suppose it's staging so we don't really care if we have to wipe flink state later if/when we rename
[14:13:42] Hi folks! Did you notice "Puppet CA certificate videoscaler.svc.eqiad.wmnet is about to expire"?
[14:13:51] (same thing for codfw)
[14:22:24] elukey: Hmm that's strange since I regen'd the mediawiki certs recently, although I may have forgotten to copy over videoscaler
[14:24:00] claime: the alert seems to come from the puppet CA, so I think that the videoscaler cert was not regenerated
[14:24:10] * claime grumbles
[14:24:24] there is another one also for mirror maker, that we should take care of
[14:24:32] (Kafka mirror maker I mean)
[14:25:13] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) Based on a convo with @akosiaris, we need to undeploy flink-operator in staging-codfw, as well as mw-pag...
[14:26:20] claime: but something is weird though
[14:26:20] notBefore=May 1 13:55:21 2023 GMT
[14:26:20] notAfter=Apr 30 13:55:21 2028 GMT
[14:27:25] (I am hitting a cert with CN: jobrunner.discovery.wmnet)
[14:27:42] Yes because videoscalers and jobrunners are the same machines
[14:28:19] ack ack, do they have the same cert with SANs then? Don't recall
[14:28:24] Yeah they should
[14:28:25] because the expire date looks ok
[14:28:57] yeah I see they should
[14:29:13] In mediawiki.certs.yaml videoscaler.discovery.wmnet is an alt_name of jobrunner
[14:29:43] maybe the one that is alerting is an old one
[14:29:57] Same for the .svc names
[14:43:16] jayme: i think https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922138 looks right? undeploying flink-operator and mw-page-content-change-namespace from staging-codfw
[14:44:09] elukey: Yeah, it checks the certs in /var/lib/puppet/server/ssl/ca and I guess the videoscaler ones weren't removed, they're from 2018 (jobrunner one is ok, from May 2 so checks out)
[14:44:46] (it being the prometheus exporter for puppet)
[14:45:06] claime: super
[14:45:38] I'll make a note to check with the team at our meeting if they can be safely removed
[14:45:45] I think so but I'd like a sanity check
[14:46:46] makes sense yes!
[14:48:45] serviceops, Data-Engineering, SRE, Shared-Data-Infrastructure: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (Ottomata)
[14:52:43] the kafka mirror maker one looks "standard", I'll make a note and refresh it this week
[14:55:25] serviceops, Data-Engineering, SRE, Shared-Data-Infrastructure: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (elukey) Some info - the kafka mirror maker's cergen TLS cert has `kafka_mirror_maker` as CN, that is used as "username" in Kafka ACLs: `...
[14:55:38] claime: Andrew opened --^
[14:55:59] it would be super nice to use PKI for that one
[14:56:05] it would remove some toil
[14:56:16] Ah, then I am *not* up to speed on this :D
[14:58:47] it is a use case that we didn't have (yet I think), but if we manage to have puppet generate those (including say varnishkafka etc..) then it should be way easier
[15:04:08] elukey: I'll need to restart ORES rdbs for a kernel upgrade. Is it ok to do it tomorrow? It will create a ~5m ORES outage.
[15:04:25] akosiaris: sure no problem!
[15:04:34] thanks
[16:46:32] serviceops, Data-Engineering, SRE, Shared-Data-Infrastructure: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (jbond) > maybe @jbond can chime in with some suggestions Happy to but i may need some more context :) specifically what is the endpoint...
[17:09:55] serviceops, Data-Engineering, SRE, Shared-Data-Infrastructure: kafka_mirror_maker TLS cert about to expire - 2023 - https://phabricator.wikimedia.org/T337248 (Ottomata) > specifically what is the endpoint that these certs authenticate to Kafka brokers > is that already managed by pki yup! Th...
[21:13:46] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (Ottomata) Working a bit on flink-kubernetes-operator dashboard; I think [[ https://lists.apache.org/thread/ccovbfl...