[07:24:16] serviceops, Kubernetes: Integrate kube-metrics-server into our infrastructure - https://phabricator.wikimedia.org/T249929 (JMeybohm) p:Triage→Low
[07:52:39] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[08:58:10] Morning. I'm hoping to merge the spark and spark-operator images to production-images in the next day or two. Please do let me know if you have any concerns. Thanks.
[08:58:10] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/838151 and https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/838858
[09:27:02] I had some 😇
[09:27:51] <_joe_> btullis: why did you create a separate spark root directory?
[09:28:06] <_joe_> all images should really be inside images/ and not use a separate configuration
[09:29:26] AIUI those two images are tightly coupled as one would have to rebuild/update the operator whenever spark image changes
[09:29:53] <_joe_> jayme: that's why we have Depends: in the control file
[09:30:03] <_joe_> so I don't get why having a separate root makes any difference
[09:30:07] <_joe_> it won't make any
[09:30:12] <_joe_> it will just complicate our work
[09:30:19] <_joe_> (our == everyone)
[09:31:32] <_joe_> is the goal just to have a namespace for the images on the registry? do we need it?
[09:32:04] <_joe_> spark seems like something that could run more or less anywhere and isn't really an external image we import or a component of a base k8s component
[09:32:16] I don't think we actually need it
[09:32:27] it's probably just "optics"
[09:32:33] Thanks both. Hugely appreciated. I'm somewhat confused by the back and forth on whether a top-level namespace/directory is used. elukey suggested I look at the images/knative for something similar to copy, but as far as I understand it this doesn't use a namespace, it just adds a `knative-` prefix to each image.
[09:32:36] or keeping stuff together
[09:32:54] https://usercontent.irccloud-cdn.com/file/nHW9CgPG/image.png
[09:33:49] <_joe_> btullis: so, spark is simply an image we build on top of our java ones
[09:34:01] <_joe_> I think it should be under images/
[09:34:20] OK, can do.
[09:34:28] <_joe_> it also makes your life simpler I think
[09:34:53] <_joe_> as you don't need to go around and add this new thing to all of our build/check/clean processes
[09:34:53] no further patches needed to build scripts at least :)
[09:35:09] <_joe_> jayme: I'm not a fan of those either
[09:35:19] I know ;)
[09:35:20] Then the spark operator is just an image that we build from a go image, then we copy the compiled binary into the spark image.
[09:36:24] <_joe_> btullis: your goal is to have spark and the spark operator in the same image?
[09:36:45] <_joe_> and, do you need to have a separate image with spark alone?
[09:37:10] _joe_: yes. the operator uses/needs a spark install in its container as well
[09:37:21] Yes to both of those questions. That is how upstream does it. spark-operator is a go binary added to a Java based spark base image.
[09:37:43] and then AIUI it will schedule pods using just the spark image (without the operator binary)
[09:37:45] <_joe_> ok, my question was if we have any use for the base image without the operator
[09:37:51] <_joe_> ok
[09:38:03] btullis: please also add control files for those images
[09:38:12] <_joe_> yeah I was about to say
[09:39:21] elukey suggested that it might be useful to publish a combined build image, containing both the binary spark distribution and the operator. I hadn't previously considered that this would be useful, but maybe?
[09:39:58] OK, will add control files and a Depends: field between the two images.
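A rough sketch of why keeping the Depends: chain intact matters for rebuilds. This is not how the production-images tooling is implemented; the `DEPENDS` map, the bare `golang` name and the `images_to_rebuild` function are hypothetical and only illustrate the idea that a build tool can walk declared dependencies to find every image affected by a base image upgrade.

```python
from collections import deque

# Hypothetical dependency map (image -> images it depends on), loosely
# modelled on the images named in this conversation. Not real tooling data.
DEPENDS = {
    "openjdk-11-jre": [],
    "golang": [],
    "spark": ["openjdk-11-jre"],
    "spark-operator": ["spark", "golang"],
}


def images_to_rebuild(changed, depends):
    """Return every image that depends on `changed`, directly or
    transitively, in breadth-first order."""
    # Invert the map: parent image -> images that declare it as a dependency.
    dependents = {name: [] for name in depends}
    for image, parents in depends.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(image)

    found = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in found:
                found.append(child)
                queue.append(child)
    return found


print(images_to_rebuild("openjdk-11-jre", DEPENDS))
```

If spark did not declare its dependency on the JRE image, the walk from `openjdk-11-jre` would find nothing, and an upgrade of the base image would not trigger a rebuild of spark or the operator.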
[09:40:47] <_joe_> btullis: also in the spark image
[09:40:51] <_joe_> Depends: openjdk-11-jre
[09:40:57] <_joe_> and then in the dockerfile
[09:41:18] <_joe_> FROM {{ "openjdk-11-jre" | image_tag }}
[09:41:49] <_joe_> the reason why I am contrary to breaking the dependency chains is that if we do, we won't do dependent upgrades
[09:42:01] <_joe_> so if we upgrade openjdk-11-jre we won't rebuild spark
[09:43:48] :+1 Thanks. Should I do the same with `Depends: golang:1.15` for the operator?
[09:44:14] <_joe_> if it's a build dependency you can also use Build-Depends, but yes
[09:45:10] Oh yes, I see. Thanks. OK, I've got enough to be getting on with. Many thanks both for your time.
[09:45:50] <_joe_> btullis: np, and if you need to pair a bit, just drop me a private message
[09:46:04] Awesome, thanks.
[11:26:19] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez)
[11:26:30] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) p:Triage→Medium
[12:25:48] serviceops, SRE, Znuny, serviceops-collab, Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (LSobanski) p:Triage→Medium
[12:26:26] serviceops, DBA, Phabricator, serviceops-collab, and 2 others: sort out mysql privileges for phab1004/phab2002 - https://phabricator.wikimedia.org/T315713 (LSobanski) p:Triage→Medium
[12:49:14] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Checking the client implementation for `go.etcd.io/etcd/client/v2 v2.305.4` it looks like the SRV discoverer share code with v3: https://github.com/etcd-io/etcd/blob...
[12:52:34] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/...
[12:53:45] serviceops, Security-Team, serviceops-collab, GitLab (CI & Job Runners), and 3 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (Jelto)
[12:57:26] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) you're right in that regard: ` vgutierrez@lvs6001:~$ ./l4lb etcd --domain conftool.eqiad.wmnet 2022/10/10 12:55:44 dns lookup errors: lookup _etcd-client-ssl._tcp.co...
[12:59:01] serviceops, Security-Team, serviceops-collab, GitLab (CI & Job Runners), and 3 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (Jelto) Some more explanation to the above edit: Further security hardening of Docker daemon got a dedicated task T320411....
[12:59:26] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) yeah this changed with v3. The problem is that AIUI confd uses an older version of the library and expects the simpler form we have now. We can either add a new set of rec...
[13:04:18] _joe_: RE T320397 it looks like either way the new _etcd-client-ssl record needs to be added to the DNS zone, right?
[13:15:17] serviceops, Discovery-Search (Current work): Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (Gehel) Open→Resolved Kick off meeting done. Further collaboration is expected to happen on specific subtasks of T317045 or as disc...
[13:28:54] <_joe_> vgutierrez: correct
[13:33:52] should I submit a CR to the dns repo?
[13:36:35] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Volans) >>! In T320397#8304869, @Joe wrote: > The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/oper...
[13:38:29] <_joe_> vgutierrez: let me check the situation with confd first
[13:38:32] ack
[13:57:55] serviceops, SRE, Traffic: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) hmm from the mentioned documentation on the task description: ` If etcd is using TLS, the discovery SRV record (e.g. example.com) must be included in the SSL certifi...
[14:04:57] <_joe_> vgutierrez: ok good news
[14:05:09] <_joe_> we can switch confd too, preparing a patch for dns
[14:05:20] nice
[14:05:24] <_joe_> we need to also check the other clients, like e.g. mediawiki
[14:05:50] <_joe_> but we can surely for now add the new record, and I'd add it to .eqiad.wmnet rather than conftool.eqiad.wmnet
[14:06:21] ack
[14:06:45] from my point of view the -ssl one should be enough
[14:07:16] etcd code seems to agree with me: https://github.com/etcd-io/etcd/blob/08407ff7600eb16c4445d5f21c4fafaf19412e24/client/pkg/srv/srv.go#L121
[14:28:18] <_joe_> vgutierrez: I submitted a few patches
[14:28:25] <_joe_> we can sync tomorrow
[14:28:38] I was doublechecking https://gerrit.wikimedia.org/r/c/operations/dns/+/841138
[14:28:48] it looks like ulsfo is missing
[14:36:46] <_joe_> vgutierrez: yeah perfectly possible, I wanted to write down the patches for now
[15:55:06] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (Clement_Goubert)
[16:03:22] quick workout, back in ~40
[16:11:10] serviceops, SRE: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe) p:Triage→Medium
[16:11:14] serviceops, SRE: eqiad (2) memcached host for wikifunctions service implementation tracking - https://phabricator.wikimedia.org/T313965 (Joe) a:Joe→None
[16:30:44] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[16:32:32] serviceops, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes, Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (Clement_Goubert)
[16:51:46] back
[17:20:42] lunch/doctor appointment, back in ~2h
[19:46:11] back
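
For reference, the two SRV record names discussed in T320397 follow the usual `_service._proto.domain` pattern. The snippet below is only a sketch of how those names are formed for a given domain; `etcd_srv_names` is a hypothetical helper, it performs no DNS lookup, and it is not taken from the etcd client code.

```python
def etcd_srv_names(domain):
    """Build the SRV record names mentioned in the log for `domain`.

    The TLS ("-ssl") name is listed first, followed by the plain name.
    This only formats strings; it does not query DNS.
    """
    domain = domain.rstrip(".")  # tolerate a fully qualified trailing dot
    return [
        f"_etcd-client-ssl._tcp.{domain}",
        f"_etcd-client._tcp.{domain}",
    ]


for name in etcd_srv_names("conftool.eqiad.wmnet"):
    print(name)
```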