[08:38:46] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Write a wrapper function combining pki::get_cert and k8s::kubeconfig - https://phabricator.wikimedia.org/T337826 (JMeybohm)
[11:57:26] <ottomata>	 jayme:  o/ okay so, with https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment/SLO/Mediawiki_Page_Content_Change_Enrichment#Troubleshooting, would you feel okay with us deploying on eqiad + codfw?  
[11:57:34] <ottomata>	 > As of 2023-05, No support SLA is provided. File a Bug at https://phabricator.wikimedia.org/project/view/1474/ and the Event Platform team will follow up within 24 hours (on work days). In case of outage, deleting and re-applying the deployment is considered within SLO targets.
[12:08:26] <jayme>	 ottomata: +1ed
[12:14:24] <ottomata>	 thanks!
[12:15:49] <jayme>	 ottomata: would you mind waiting another ~15min with the deployment?
[12:30:56] <ottomata>	 ya can wait, i'm still checking morning emails, etc.
[12:31:07] <ottomata>	 and have some meetings...might try to do it in about an hour
[12:34:35] <jayme>	 ack
[13:19:20] <ottomata>	 jayme:  okay if i deploy?
[13:19:29] <jayme>	 ottomata: yeah, go ahead
[13:36:44] <godog>	 FYI we've been debugging and fixing a bad relationship between cadvisor and kubelet at T337836
[13:46:48] <ottomata>	 jayme:  encountering an unexpected error on deployment: 
[13:46:49] <ottomata>	 Error: rendered manifests contain a resource that already exists. Unable to continue with install: could not get information about the resource FlinkDeployment "flink-app-main" in namespace "mw-page-content-change-enrich": flinkdeployments.flink.apache.org "flink-app-main" is forbidden: User "mw-page-content-change-enrich-deploy" cannot get resource "flinkdeployments" in API group "flink.apache.org" in the namespace 
[13:46:49] <ottomata>	 "mw-page-content-change-enrich"
[13:47:10] <ottomata>	 this works fine in staging, I followed https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service and added I think all the right stuff to the main clusters
[13:47:39] <jayme>	 which cluster?
[13:47:42] <ottomata>	 codfw
[13:47:47] <ottomata>	 haven't tried eqiad
[13:48:11] <jayme>	 what did you call exactly?
[13:48:25] <jayme>	 because there is a lot of change with this, in rbac rules for users etc
[13:48:49] <ottomata>	 helmfile -e codfw diff will do it, but so does apply
[13:49:27] <ottomata>	 i deployed the operator stuff yesterday, hm let me double check the watchNamespace stuff
[13:52:34] <jayme>	 ottomata: the admin_ng stuff is not completely deployed, I'll take care of it
[13:53:05] <ottomata>	 wha?
[13:53:54] <ottomata>	 did -l name=flink-operator diff miss something?
[13:54:07] <jayme>	 yes, because it's way more than that
[13:54:21] <ottomata>	 oh ho, certmanager hm
[13:54:30] <jayme>	 it needs a change to global rbac rules as well for example: helmfile -e codfw -l name=rbac-rules apply
[13:54:46] <jayme>	 and -l name=namespace-certificates
[13:55:12] <ottomata>	 i see the namespace-certificates in the diff
[13:55:12] <jayme>	 the latter is needed for all new namespaces
[13:55:15] <ottomata>	 but not rbac-rules
[13:55:32] <jayme>	 because I applied them
[13:55:34] <ottomata>	 but ya, i guess this is why it's generally good to admin_ng apply without -l if we can?
[13:55:37] <ottomata>	 oh okay :)
[13:56:16] <jayme>	 yes, I should not have recommended to scope your deploy. Sorry about that
[13:56:32] <jayme>	 (tbh I just did not want you to roll restart calico at that time)
[13:56:40] <jayme>	 should be all good now
[13:57:11] <ottomata>	 no prob, i also did not want to roll restart calico :o 
[13:57:18] <ottomata>	 ok thanks, eqiad too?
[13:57:35] <ottomata>	 (proceeding in codfw)
[14:00:54] <ottomata>	 deployment looks like it works in codfw, got an app issue I gotta fix now though...
[14:02:15] <jayme>	 eqiad too, yes
[14:02:57] <ottomata>	 thank you!
[14:04:59] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Jhancock.wm) @Clement_Goubert It's been a week and I'm not seeing any errors from the lifecycle controller. Do you think this could be resolved now?
[14:13:59] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (Clement_Goubert) Open→Resolved Yes, thank you, resolving.
[14:29:35] <wikibugs>	 serviceops, SRE-OnFire, Traffic, conftool, and 2 others: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (BBlack) We've got a pair of patches to review now which configure this on the pybal and safe-service-...
[14:33:54] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[14:34:01] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[14:34:21] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[14:36:12] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm) Open→Resolved I'm going to resolve this as the update is done. There are a couple of tasks be...
[14:36:19] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Selected IPv6 service-cluster-up ranges are to big - https://phabricator.wikimedia.org/T335285 (JMeybohm)
[14:36:24] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[16:14:15] <wikibugs>	 serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata)
[16:15:05] <wikibugs>	 serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 14 A), Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) We are live in wikikube eqiad and codfw!