[08:38:46] serviceops, Prod-Kubernetes, Kubernetes: Write a wrapper function combining pki::get_cert and k8s::kubeconfig - https://phabricator.wikimedia.org/T337826 (JMeybohm)
[11:57:26] jayme: o/ okay so, with https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change_Enrichment#Troubleshooting, would you feel okay with us deploying on eqiad + codfw?
[11:57:34] > As of 2023-05, No support SLA is provided. File a Bug at https://phabricator.wikimedia.org/project/view/1474/ and the Event Platform team will follow up within 24 hours (on work days). In case of outage, deleting and re-applying the deployment is considered within SLO targets.
[12:08:26] ottomata: +1ed
[12:14:24] thanks!
[12:15:49] ottomata: would you mind waiting another ~15min with the deployment?
[12:30:56] ya can wait, i'm still checking morning emails, etc.
[12:31:07] and have some meetings...might try to do it in about an hour
[12:34:35] ack
[13:19:20] jayme: okay if i deploy?
[13:19:29] ottomata: yeah, go ahead
[13:36:44] FYI we've been debugging and fixing a bad relationship between cadvisor and kubelet at T337836
[13:46:48] jayme: encountering an unexpected error on deployment:
[13:46:49] Error: rendered manifests contain a resource that already exists. Unable to continue with install: could not get information about the resource FlinkDeployment "flink-app-main" in namespace "mw-page-content-change-enrich": flinkdeployments.flink.apache.org "flink-app-main" is forbidden: User "mw-page-content-change-enrich-deploy" cannot get resource "flinkdeployments" in API group "flink.apache.org" in the namespace
[13:46:49] "mw-page-content-change-enrich"
[13:47:10] this works fine in staging, I followed https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service and added I think all the right stuff to the main clusters
[13:47:39] which cluster?
[13:47:42] codfw
[13:47:47] haven't tried eqiad
[13:48:11] what did you call exactly?
[13:48:25] because there is a lot of change with this, in rbac rules for users etc
[13:48:49] helmfile -e codfw diff will do it, but so does apply
[13:49:27] i deployed the operator stuff yesterday, hm let me double check the watchNamespace stuff
[13:52:34] ottomata: the admin_ng stuff is not completely deployed, I'll take care of it
[13:53:05] wha?
[13:53:54] did -l name=flink-operator diff miss something?
[13:54:07] yes, because it's way more than that
[13:54:21] oh ho, certmanager hm
[13:54:30] it needs a change to global rbac rules as well for example: helmfile -e codfw -l name=rbac-rules apply
[13:54:46] and -l name=namespace-certificates
[13:55:12] i see the namespace-certificates in the diff
[13:55:12] the latter is needed for all new namespaces
[13:55:15] but not rbac-rules
[13:55:32] because I applied them
[13:55:34] but ya, i guess this is why it's generally good to admin_ng apply without -l if we can?
[13:55:37] oh okay :)
[13:56:16] yes, I should not have recommended to scope your deploy. Sorry about that
[13:56:32] (tbh I just did not want you to roll restart calico at that time)
[13:56:40] should be all good now
[13:57:11] no prob, i also did not want to roll restart calico :o
[13:57:18] ok thanks, eqiad too?
[13:57:35] (proceeding in codfw)
[14:00:54] deployment looks like it works in codfw, got an app issue I gotta fix now though...
[14:02:15] eqiad too, yes
[14:02:57] thank you!
[14:04:59] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Jhancock.wm) @Clement_Goubert It's been a week and I'm not seeing any errors from the lifecycle controller. Do you think this could be resolved now?
[14:13:59] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) Open→Resolved Yes, thank you, resolving.
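Editor's note: the fix discussed above (applying the rbac-rules and namespace-certificates releases that a scoped `-l name=flink-operator` run skipped) can be sketched roughly as follows. This is a minimal dry-run sketch, not the team's actual procedure. The helper names `run`, `sync_namespace_prereqs` and `check_deploy_access`, and the `RUN` variable, are hypothetical; only the helmfile selectors, the user, the namespace and the resource come from the log. The `kubectl auth can-i` check is an added suggestion for confirming the "forbidden" error is gone and assumes the caller is allowed to impersonate the deploy user.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set RUN=1 to execute them (assumes helmfile and kubectl are installed and
# the current directory holds the admin_ng helmfile; both are assumptions).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Apply the two releases named in the log as missing after a scoped deploy.
sync_namespace_prereqs() {
    env="$1"
    for release in rbac-rules namespace-certificates; do
        run helmfile -e "$env" -l "name=$release" apply
    done
}

# Ask the API server whether the deploy user may read FlinkDeployments.
check_deploy_access() {
    user="$1"
    namespace="$2"
    run kubectl auth can-i get flinkdeployments.flink.apache.org \
        --as "$user" -n "$namespace"
}

sync_namespace_prereqs codfw
check_deploy_access mw-page-content-change-enrich-deploy mw-page-content-change-enrich
```

As noted in the conversation, running the admin_ng apply without any `-l` selector avoids missing releases like these, at the cost of touching everything else in the diff (calico, in this case).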
[14:29:35] serviceops, SRE-OnFire, Traffic, conftool, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (BBlack) We've got a pair of patches to review now which configure this on the pybal and safe-service-...
[14:33:54] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[14:34:01] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[14:34:21] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[14:36:12] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm) Open→Resolved I'm going to resolve this as the update is done. There are a couple of tasks be...
[14:36:19] serviceops, Prod-Kubernetes, Kubernetes: Selected IPv6 service-cluster-up ranges are to big - https://phabricator.wikimedia.org/T335285 (JMeybohm)
[14:36:24] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[16:14:15] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:15:05] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) We are live in wikikube eqiad and codfw!