[08:34:04] good morning folks :)
[08:34:26] Hugh and I worked on a new version of Changeprop that in theory is ready to be deployed
[08:34:35] the changes are:
[08:34:44] 1) supports the configuration of lift wing rules (by default disabled)
[08:35:02] 2) supports calling TLS endpoints with PKI/Puppet-based TLS certs
[08:35:36] It seems to be working fine in staging, would it be ok to proceed with one of the "wikikube" prod clusters this week?
[09:16:57] <_joe_> elukey: sure
[09:17:08] <_joe_> I mean worst case scenario we rollback
[09:22:32] okok perfect, I'll talk with Hugh to set up a rollout date/time and I'll sync again in here :)
[09:52:21] <_joe_> btullis: uh I launched build-production-images and I see three spark images to build; is that expected?
[09:52:57] I also just launched it to build spark images. I can cancel mine.
[09:53:29] Yes, it is expected. See scrollback in this channel from Friday.
[09:54:03] I've cancelled mine, so you're free to go ahead. spark image build is not time critical.
[09:57:04] <_joe_> ah neither was my image
[10:07:59] <_joe_> btullis: the builds all failed
[10:29:03] serviceops, Platform Engineering, SRE, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki)
[10:29:48] serviceops, SRE, Patch-For-Review, Performance-Team (Radar): Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 (jijiki) Open→Resolved This work is done
[10:47:06] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) >>! In T306995#8128358, @Michael wrote: > Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata...
[10:51:48] _joe_: apologies for the delay in replying. Workstation issues.
Just the three spark images failed, or did I break yours too?
[10:52:24] <_joe_> no just the spark ones
[10:53:06] Ack, thanks. Will take a look asap.
[12:26:48] serviceops, CommRel-Specialists-Support, SRE, Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (Clement_Goubert)
[13:13:30] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[13:14:29] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[13:14:33] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (JMeybohm)
[13:14:35] serviceops, Prod-Kubernetes, Kubernetes: Re-enable seccomProfile in cert-manager chart after k8s 1.23 migration completed - https://phabricator.wikimedia.org/T325620 (JMeybohm)
[13:14:37] serviceops, Observability-Metrics, Prod-Kubernetes, Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (JMeybohm)
[13:14:39] serviceops, Prod-Kubernetes, Kubernetes: Migrate charts away from deprecated typology annotations - https://phabricator.wikimedia.org/T325066 (JMeybohm)
[13:14:42] serviceops, Prod-Kubernetes, Kubernetes: Drop the use of nonexisting groups in kubernetes infrastructure_users - https://phabricator.wikimedia.org/T290963 (JMeybohm)
[13:15:25] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (JMeybohm)
[13:15:32] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2
others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:15:41] serviceops, Prod-Kubernetes, Kubernetes: Drop the use of nonexisting groups in kubernetes infrastructure_users - https://phabricator.wikimedia.org/T290963 (JMeybohm)
[13:15:47] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:16:17] serviceops, Prod-Kubernetes, Kubernetes: Migrate charts away from deprecated typology annotations - https://phabricator.wikimedia.org/T325066 (JMeybohm)
[13:16:23] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:16:28] serviceops, Prod-Kubernetes, Kubernetes: Re-enable seccomProfile in cert-manager chart after k8s 1.23 migration completed - https://phabricator.wikimedia.org/T325620 (JMeybohm)
[13:16:35] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:16:37] serviceops, Observability-Metrics, Prod-Kubernetes, Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (JMeybohm)
[13:16:43] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:26:40] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Metrics changes with Kubernetes v1.23 - https://phabricator.wikimedia.org/T322919 (JMeybohm) >>!
In T322919#8556540, @Jelto wrote: > I used the above list to grep through `operations/alerts` a...
[13:42:15] serviceops, SRE, wdwb-tech: Migrate wikibase/termbox to newer Node.js version - https://phabricator.wikimedia.org/T328295 (Lucas_Werkmeister_WMDE)
[13:43:33] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Lucas_Werkmeister_WMDE) Hm, I notice there’s no corresponding `nodejs14-devel` image in the [Docker registry](https://docker-registry.wikimedia.org/), only `nodejs14-slim` (and same for `node...
[14:08:48] serviceops, Prod-Kubernetes, Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm)
[14:10:41] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[14:10:51] serviceops, Prod-Kubernetes, Kubernetes: Update staging-codfw to k8s 1.23 - https://phabricator.wikimedia.org/T326340 (JMeybohm) Open→Resolved Moved the rest of the open action items to {T328291}. The "failed to update managedFields" I've not seen again. This seems to happen only once during...
[14:14:18] ottomata if you want to work on the flink stuff today LMK, I'm just settling in
[15:16:34] serviceops, MW-on-K8s, SRE Observability: Index orchestrator object fields from ECS 1.11.0 in OpenSearch - https://phabricator.wikimedia.org/T328318 (Clement_Goubert)
[15:19:41] serviceops, DBA, Data-Persistence, Discovery-Search, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Marostegui)
[15:34:43] serviceops, MW-on-K8s, SRE Observability: Index orchestrator object fields from ECS 1.11.0 in OpenSearch - https://phabricator.wikimedia.org/T328318 (colewhite) Open→Resolved p:Triage→Medium Refreshed the field list in Dashboards.
@Clement_Goubert confirmed in IRC they are functionin...
[15:34:53] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (colewhite)
[16:14:59] inflatador: o/ got some reviews to take care of first. this afternoon gmodena and I are hoping to attempt to deploy the mediawiki-event-enrichment...assuming image build works! it's building now!
[16:15:00] https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/pipelines/10407
[16:15:19] to do that i'm going to remove the flink-app-example helmfile and deployment, and replace it with mediawiki-event-enrichment.
[16:15:39] actually inflatador.
[16:15:42] think you could do this one now?
[16:15:43] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/884351
[16:15:58] ^ just bumps java jre version for a bugfix
[16:16:51] so, that would be merging, build-production-images, possible image version bump in deployment-charts (don't worry about the flink-example-app one, since we'll do that later today). and a redeploy of the flink-kubernetes-operator in DSE.
[16:17:14] since it is a simple change, and we are still experimenting in DSE, i wouldn't worry about testing in minikube locally
[16:24:09] Cool, will check out when current mtg is done
[16:31:10] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (MPhamWMF)
[16:31:23] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MPhamWMF)
[16:36:45] I think I should start asking sooner rather than later how the DC switch (T327920) will affect Toolhub which currently only exists in eqiad largely because of a lack of an active-active master database for it (T288685).
[16:41:08] serviceops, Data-Persistence, SRE, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (bd808) #Toolhub does not have a working Kubernetes deployment outside of eqiad ({T288685}). Who should I work with to try and preve...
[16:48:56] serviceops, DBA, Data-Engineering, Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (LSobanski)
[16:53:44] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:54:02] bd808: we just talked about that it probably needs to move to the aux cluster finally
[16:54:18] cc claime
[16:54:53] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[16:55:12] yep, but we may want to upgrade the cluster to 1.23 before any service is on it though
[16:55:42] cool. I didn't even know there was an aux cluster :)
[16:56:05] <_joe_> bd808: nobody expects the aux cluster!
[16:56:22] serviceops, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (LSobanski)
[16:56:48] the aux cluster is also currently eqiad-only but that will very likely change at some point in the future
[16:57:20] <_joe_> I just want to go on record saying it's possible to connect to eqiad databases from codfw
[16:57:31] <_joe_> I don't think the traffic toolhub does makes it impossible to do
[16:57:55] <_joe_> but the latency might be killer
[16:58:39] bd808: It's new-ish https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#aux
[16:59:39] The app actually doesn't talk to the db much outside of when the crawler job runs. Most of the data is fetched at runtime from the Elasticsearch cluster.
But that would also need attention to become functional active-active.
[17:02:43] _j.oe_ is correct though that it could be made to work with the db connection backhauled to eqiad if that actually has value.
[17:08:01] serviceops, Infrastructure-Foundations, netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (ayounsi)
[17:14:46] OK, images are rebuilding
[17:18:07] <3
[17:24:21] Images rebuilt, LMK if anything looks amiss
[17:26:46] inflatador: if they built, i'm sure they are good. i think it is safe to just do the same helmfile -e dse-k8s-eqiad -l name=flink-operator -i apply we did for admin_ng
[17:27:31] btw, FYI, fingers crossed later today we'll be able to undeploy flink-app-example, merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/884972, and then try deploying the page-content-change enrichment job!
[17:27:50] we need the gitlab image publishing to work first though. I'll cc you to convo in slack
[17:32:14] {◕ ◡ ◕}
[17:57:52] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (akosiaris) It is intentional indeed. `-devel` became obsolete. More information in T306996#7912881 and overall that task.
[18:08:36] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/884983 PR for the new chart w/updated image tag
[18:57:04] +1 inflatador merge away!
[19:43:55] serviceops, Prod-Kubernetes, PyBal, SRE, Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (ayounsi)
[19:44:03] serviceops, Infrastructure-Foundations, SRE, netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (ayounsi)
[19:54:14] ottomata gotcha, merged and operator redeployed
[20:03:49] inflatador: gabriele and i are going to try to deploy
[20:03:52] we are in a huddle
[20:03:53] want to join us
[20:03:55] also: THANK YOU!
[20:04:00] that is awesome
[20:04:12] lemme know if you want to join and we'll invite you
[20:04:38] ottomata sure np. I would indeed like to join
[20:04:48] okay cool
[20:24:37] _joe_, or whoever, pls ignore if you are offline, but. I've set kafka.allowed_clusters and egress.enabled in a values file, and helmfile linter in jenkins is giving a nil error: error calling include: template: flink-app/templates/vendor/base/networkpolicy_1.0.0.tpl:37:14: executing "base.networkpolicy.egress.kafka" at : error calling index: index of untyped nil
[20:24:51] is there somewhere I need to provide a fixture for kafka_brokers?
[20:25:18] i have a fixture in the chart already that was working with this before, but now that I am trying to use it from a helmfile (with a real kafka cluster) it is failing.
[20:31:21] FYI, am merging skipping the lint check. want to attempt deployment to resolve other issues.
[20:31:37] we will revert our change when we are done so that the linter doesn't fail for others while we are gone.
[20:40:55] (this is the patch that is failing: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/884972 )
[21:17:37] k, lint breaking change un broken. still to figure out why enabling kafka egress breaks it