[09:16:51] dcausse: jayme so current status is
[09:17:16] dcausse: o/ you have proof the wall of text references connection issues to the api? :D
[09:17:59] ahm, well... it says "at org.apache.flink.kubernetes.operator.metrics.KubernetesClientMetrics.intercept(KubernetesClientMetrics.java:130)" somewhere... so maybe that's a hint
[09:18:08] jayme: the stack suggests that it's happening somewhere in io.fabric8.kubernetes.client.informers.impl so most likely k8s
[09:18:37] we have selector: app.kubernetes.io/name = "flink-kubernetes-operator"
[09:19:39] according to kubectl -n flink-operator get po -l app.kubernetes.io/name=flink-kubernetes-operator - that matches
[09:20:03] I was wondering if there is some silliness there with the double quotes, I know it sounds silly
[09:22:21] effie: nah
[09:22:27] it's the operator
[09:22:46] go on
[09:22:52] selector: app.kubernetes.io/name = "flink-kubernetes-operator"
[09:23:01] vs selector: app.kubernetes.io/name == "flink-kubernetes-operator"
[09:23:04] 🤦
[09:23:18] oh
[09:23:24] I should have spotted it in review, sorry
[09:24:22] I should have added it in the first place, we can keep doing it, but it will point back to me
[09:24:26] let me fix it
[09:24:28] sigh
[09:24:30] effie: when you update the chart, could you please also remove kubernetesMasters from the fixtures? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031811/1/charts/flink-kubernetes-operator/.fixtures/wmf.yaml
[09:24:36] sure
[09:24:47] dcausse: give me 2'
[09:24:54] sure
[09:25:11] I've manually edited the policy in staging to verify
[09:25:36] at least the operator comes up now
[09:25:45] jayme: I think I will use the selector: "istio == '{{ $istio_gateway }}'" syntax like we did with istio, to keep things consistent
[09:25:59] I don't understand
[09:26:03] you mean the quotes?
[09:26:06] yes
[09:26:09] in the template
[09:26:17] wait for the patch
[09:26:19] seeing reconciliation happening so seems to be working
[09:31:34] I think we need to do some cleanup on the flink job policy, we might have duplicated ones
[09:34:51] dcausse: there is one more outstanding patch by janis, but it is just data
[09:34:59] jayme: dcausse you can +1
[09:35:07] and we can proceed
[09:36:55] I was too fast, then too slow
[09:37:03] effie: you're missing the chart version bump
[09:37:12] hehe
[09:37:14] k
[09:37:43] so if I understand the zk and k8s policies are owned by the flink-operator config, so there should be no need to redeclare these 2 networkpolicies on the job itself right?
[09:39:08] my understanding is that the flink-operator as well as the task/job/?-managers in the workload namespaces do need access to the k8s api and zookeeper
[09:39:26] yes
[09:39:47] that's why we have 2 policies in the flink-operator namespace and two per workload namespace
[09:43:21] so this: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031824 should work IIUC
[09:44:27] ah, that's what you mean. I had this prepared at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031810
[09:44:49] oh ok, thanks missed it
[09:45:19] alright, jenkins is finally done
[09:45:51] hmm...yeah, it probably does not need the zookeeper rule there dcausse - I did not remove it
[09:46:34] but it should be fine without. You might want to double check with other flink-apps
[09:46:54] ...and adapt my/your patch. I think this is also a relic of pre flink-operator times
[09:47:09] jayme: yes, I can test all this in staging
[09:47:16] cool
[09:47:22] we might want to clean up the flinkapp chart as well I suppose
[09:47:51] dcausse: you may undeploy and deploy now
[09:48:09] though janis had already applied the change manually
[09:48:45] effie: going to apply https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031824 too if that works for you?
[09:49:06] sure sure no problem
[09:49:09] ok
[09:50:34] dcausse: don't you want to bundle this in janis' change?
[09:50:45] in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031810 ?
[09:52:00] effie: I'd prefer to test staging only first
[09:52:26] we will deploy on staging anyway, but whatever you think is best
[09:54:02] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Co-locate kube-apiserver and etcd on new staging control plane nodes - https://phabricator.wikimedia.org/T363307#9798776 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kubestagemaster[2001-20...
[10:34:37] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9799039 (akosiaris) >>! In T363212#9797805, @Jclark-ctr wrote: > @akosiaris could you please update preseed.yaml file? Done. Note t...
[10:41:02] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Allow to address Kubernets API servers from NetworkPolicy - https://phabricator.wikimedia.org/T287491#9799054 (BTullis) I will have a look at the spark-operator too. Thanks all for working on this.
[11:12:31] effie: what is the funny part of the screenshots? :)
[11:15:05] on staging-codfw it complained about rolling back
[11:15:15] sorry, I am on my phone
[11:15:43] I will attempt to deploy again
[11:16:11] I will take a look later or tomorrow
[11:16:45] I'll take a look if you don't mind.
I just decommed the 3 old api-servers - so there might be something interesting there
[11:17:10] cool, I should have asked you before doing codfw-staging actually
[11:17:19] not really
[11:17:27] rn there is one pod crashing and one working
[11:17:28] I didn't expect anything to go wrong there
[11:17:41] ok, keep me in the loop, I am breaking for lunch
[11:17:48] ack
[11:18:01] hey - during the hackathon I was asked about https://gerrit.wikimedia.org/r/c/operations/puppet/+/527915 (and the patches below it on the stack). on a quick glance those seem fine to me, can I go ahead and merge/deploy them or would you prefer to do that yourselves?
[11:33:00] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Allow to address Kubernetes API servers from NetworkPolicy - https://phabricator.wikimedia.org/T287491#9799219 (Aklapper)
[11:46:21] just need to add a couple tests to httpbb so we're sure it does what we want (we can remove them after), I'll go ahead and do that
[11:59:11] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9799310 (MoritzMuehlenhoff)
[13:01:57] effie: I deployed flink-operator to staging-codfw and it just works
[13:02:21] jayme: you just re-applied and it worked?
[13:02:25] yeah
[13:02:40] cool, then something just borked while I applied it, cool
[13:02:58] maybe it failed for you because one of the replicas was in a crashloop already because of me removing the masters
[13:03:06] I wouldn't bother
[13:11:34] effie: will you take care of the remaining patches on the task?
I'm currently running a rake full on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031811 - but I don't expect anything really
[13:12:35] yes, I already rebased it
[13:12:49] ack
[13:13:59] will cont tomorrow though
[13:16:07] okay...I will split the patch then and merge removal of kubernetesMasters in staging-codfw
[13:20:25] taavi: I'll put those changes in for tomorrow's mediawiki infrastructure mid-day window
[13:23:37] jayme: no need, I will do them all tomorrow, I just want to shift to something else for today
[13:24:29] effie: already done. I've decommed the masters that were listed in staging-codfw and I'd like to have a clean state there now
[13:24:39] fair enough
[17:10:41] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801259 (VRiley-WMF) a:VRiley-WMF
[17:17:30] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801312 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye
[17:26:45] serviceops, SRE, Data Products (Data Products Sprint 13), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9801327 (Scott_French) Hi @SGupta-WMF and @mforns - Any updates on the timel...
[17:38:32] serviceops: docker-reporter-base-images.service failed on build2001 - https://phabricator.wikimedia.org/T364931#9801371 (ssingh) This has been alerting for a while and in general has many alerts on the -ops channel. If this expected in some way (I have no idea and I haven't looked!) can it at least be silenc...
[18:03:33] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed...
[20:24:11] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801879 (Eevans)
[20:30:02] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801884 (Eevans) A Docker image is now published: ` docker pull docker-registry.wikimedia.org/repos/sre...
[20:36:46] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801918 (Eevans)
[21:01:27] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801985 (Eevans) p:Triage→High
[21:03:50] serviceops, Cassandra, SRE, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801987 (Eevans) p:High→Triage