[11:22:33] 10serviceops, 10Release-Engineering-Team, 10serviceops-collab: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (10jijiki)
[11:22:56] 10serviceops, 10Release-Engineering-Team, 10serviceops-collab: Gitlab downtime blocking scap backport - https://phabricator.wikimedia.org/T336162 (10jijiki)
[11:26:24] 10serviceops: Add security-api to operations/deployment-charts - https://phabricator.wikimedia.org/T336163 (10STran)
[11:26:37] 10serviceops: Add security-api to operations/deployment-charts - https://phabricator.wikimedia.org/T336163 (10STran)
[11:26:40] 10serviceops, 10Service-deployment-requests: New Service Request 'security-api' - https://phabricator.wikimedia.org/T325147 (10STran)
[11:31:18] hi, re T325147 above, is there someone from SRE who could help with creating the helm chart? Or is that something you want engineers from product teams to attempt first?
[11:31:37] er, T336163 rather
[11:34:24] 10serviceops, 10Security-API-Service, 10Kubernetes: Create helm chart for security-api in operations/deployment-charts - https://phabricator.wikimedia.org/T336163 (10kostajh)
[11:37:31] second question: is there a template and/or shared location one should use for documenting k8s services on wikitech? I see https://wikitech.wikimedia.org/wiki/Service/Etcd but that is the only example AFAICT using that format, and the only one in that URL structure
[11:42:15] 10serviceops, 10Shellbox, 10SyntaxHighlight, 10User-brennen, 10Wikimedia-production-error: Pages with Pygments or Timeline intermittently fail to render (Shellbox server returned status code 503) - https://phabricator.wikimedia.org/T292663 (10Krinkle) @legoktm No, not per se. I retitled to signify impact...
[12:08:04] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10jijiki) 05Resolved→03Open Hello, I am afraid `mw2448` was not feeling any better today, so for the time being it is marked again as `inactive`. I am terribly...
[13:25:38] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10daniel) Jobs are working, confirmation at https://w.wiki/6gDJ
[13:29:08] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10jijiki) Adding the graph {F36988754}
[13:30:51] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10jijiki)
[14:02:40] jayme: i'm going to deploy the flink operator in wikikube staging today, unless you think I shouldn't, or would prefer to do it
[14:02:40] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/904226
[14:10:39] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10Ottomata)
[14:11:06] hm actually I think i need some help, we have to create the namespace...
[14:12:58] i can't recall, but I think there was more to it than just kubectl create namespace?
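(For reference, the answer that follows is the declarative approach: namespaces are defined in the admin_ng helmfile values and created on apply. A minimal sketch of the kind of entry involved; the key names below are assumptions for illustration, not copied from the real main.yaml.)

    # helmfile.d/admin_ng/values/main.yaml -- hypothetical sketch, schema assumed
    namespaces:
      flink-operator: {}   # applying admin_ng creates this namespace; no manual kubectl step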
[14:16:36] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10Ottomata) @JMeybohm I'd like to proceed, but first we need to create the flink-operator namespace in staging-eqiad a...
[14:20:01] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 12), 10Patch-For-Review: New Service Request: flink-kubernetes-operator - https://phabricator.wikimedia.org/T333464 (10JMeybohm) You will need to add the namespace like you did in DSE (https://gerrit.wikimedia.org/r/c/operations/deploy...
[15:11:06] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Jhancock.wm) The recommended fix for this one (according to Dell) is a reboot, then see if the error comes back. I've done a full power cycle. Right now there's n...
[15:28:12] jayme: o/ oh does https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/904226/6/helmfile.d/admin_ng/values/main.yaml cause helm to create the namespace?
[15:28:22] so I don't need to manually create it? I just apply?
[15:30:50] ottomata: yes, exactly
[15:32:46] great, going for it then, ty
[15:34:49] 10serviceops, 10DBA, 10Data-Engineering, 10Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10LSobanski)
[15:49:00] ottomata: {"@timestamp":"2023-05-08T15:41:19.777Z","log.level": "INFO","message":"Configuring operator to watch the following namespaces: [JOSDK_ALL_NAMESPACES]." . that does not look right
[15:50:26] ohhhh i guess empty list means all namespaces. indeed.
[15:50:46] looking at apply output, lemme make sure it didn't create any unwanted resources...
[15:51:27] it also says {"@timestamp":"2023-05-08T15:41:18.876Z","log.level": "INFO","message":"Operator leader election is disabled." - that will be an issue when running multiple replicas in prod
[15:52:17] yes, that's true, we haven't really messed with operator HA in dse. if the operator is off it just means running flink apps can't be redeployed / re-submitted (if a job manager dies)
[15:52:24] in normal operation it's fine to restart the operator at will
[15:53:17] https://phabricator.wikimedia.org/T324576#8454404
[15:53:38] okay
[15:54:26] but, we should probably do that. adding a ticket.
[15:54:41] I don't notice any unwanted resources in the apply output
[15:54:52] i was worried it was going to create roles or something for all namespaces
[15:56:09] hmm
[15:56:09] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/#watching-only-specific-namespaces
[15:56:10] although
[15:56:18] > When this is enabled role-based access control is only created specifically for these namespaces for the operator and the jobmanagers, otherwise it defaults to cluster scope.
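(The scoping the quoted docs describe is driven by the chart's watchNamespaces value, the same value discussed below. A minimal sketch; the namespace name is a hypothetical example, not from this log.)

    # flink-kubernetes-operator chart values (see the linked upstream docs):
    # a non-empty list scopes the operator and its RBAC to these namespaces;
    # leaving it empty/unset makes the operator watch all namespaces and the
    # chart fall back to cluster-scoped roles and bindings.
    watchNamespaces:
      - flink-app-staging   # hypothetical application namespace, for illustration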
[15:56:55] hmmm
[15:56:57] actually
[15:57:19] https://www.irccloud.com/pastebin/zTu9tLcL/
[15:58:18] that's the one you create in the chart that should not be referenced by any clusterrolebinding but only rolebindings
[15:59:25] yeah but i think now
[15:59:26] kube_env admin staging-eqiad
[15:59:32] kubectl get clusterrole flink-operator
[15:59:32] NAME             CREATED AT
[15:59:32] flink-operator   2023-05-08T15:46:29Z
[15:59:32] vs
[15:59:38] kube_env admin dse-k8s-eqiad
[15:59:46] kubectl get clusterrole flink-operator
[15:59:46] Error from server (NotFound): clusterroles.rbac.authorization.k8s.io "flink-operator" not found
[15:59:56] I think it would not have created that if I had a value for watchNamespaces
[16:01:27] yeah it created a clusterrolebinding
[16:01:30] sorry about that.
[16:02:05] kubectl get clusterrolebinding flink-operator-role-binding
[16:02:21] full output of apply in staging-eqiad here: https://gist.github.com/ottomata/7a4fe04026513926a58bab83f6de64f7
[16:04:15] jayme: what about: we always include the flink-operator namespace in watchNamespaces? that way it will never create these cluster scope roles?
[16:04:33] but it does not make any sense, right?
[16:04:41] to have flink-operator in that list?
[16:04:45] yes
[16:04:55] I would say we should fix the bug instead
[16:04:59] hm
[16:05:23] i could imagine wanting to have a place to deploy/run a test deployment, to verify things like operator upgrades.
[16:05:24] but yeah
[16:05:38] i think it's not a bug, just that I forgot this would happen.
[16:05:45] this is intentional from the upstream helm chart's point of view
[16:05:52] that should definitely not happen in a namespace dedicated to the operator
[16:05:53] there is a .Values.rbac.operatorRole.create that can be set
[16:05:59] yeah okay
[16:06:14] maybe, to prevent this from happening to someone
[16:06:34] we can set that to false in the flink-operator/values.yaml
[16:06:49] and add comments that it should be set to true in env-specific helmfiles only when watchNamespaces is defined?
[16:07:08] or sorry
[16:07:11] values.rbac.create
[16:07:26] didn't you add the watchNamespaces value?
[16:08:28] no, because we don't have a namespace in which we are deploying yet! i was going to do it after we got the operator running as part of https://phabricator.wikimedia.org/T330507
[16:08:47] I meant the functionality in the operator chart
[16:09:33] hm, not sure I understand the question, but yes, if we set watchNamespaces, the correct thing (namespace-scoped RBAC) would happen.
[16:10:08] i hadn't set it yet, because I was just trying to deploy the operator before the flink app.
[16:11:01] my question was: didn't you add the code implementing the watchNamespaces config option to the operator chart?
[16:13:02] I didn't add it, it is built into the upstream chart
[16:13:21] the problem here is that I did not set watchNamespaces in the helm values.
[16:13:30] helmfile* values
[16:13:40] yes, yes. I do understand that :)
[16:13:59] But I was under the impression that the implementation was ours
[16:14:11] ah no, it's upstream. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/#watching-only-specific-namespaces
[16:15:06] you might be remembering the dynamic config stuff, which I implemented but which was then adopted upstream. https://github.com/apache/flink-kubernetes-operator/pull/478
[16:17:53] ack, understood. In that case please patch with '.Values.rbac.create: false' for now
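(A sketch of roughly what that override could look like; the exact file and its placement in the admin_ng values hierarchy are assumptions here. The actual change is the gerrit patch linked just below.)

    # env-specific helmfile values for the flink-operator release (placement assumed)
    rbac:
      create: false   # skip cluster-scoped RBAC until watchNamespaces is defined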
[16:18:36] yup, doing that now
[16:19:01] a proper safeguard might be to bump the chart and change the default of .Values.rbac.create here
[16:19:12] so that one has to explicitly enable it
[16:20:39] Hm, yeah. I think I'd prefer to do it in the admin_ng helmfile values, because as is we don't make any modifications to the upstream chart except for adding some new template files. if we change the default chart values.yaml, we'll have to make sure anyone upgrading the chart remembers to override it again
[16:22:59] jayme: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/917373
[16:23:42] jayme: after merging ^, I suppose I should delete all resources created by this conditional in the chart: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-kubernetes-operator/templates/rbac.yaml#174
[16:23:56] and then diff and make sure no RBAC is created
[16:23:57] ?
[16:25:36] ottomata: syncing the release should remove them
[16:25:43] oh, really!
[16:25:45] okay
[16:25:56] 10serviceops, 10Arc-Lamp, 10Performance-Team, 10WikimediaDebug, 10Patch-For-Review: Add per-request flamegraph option to WikimediaDebug - https://phabricator.wikimedia.org/T291015 (10Krinkle) 05Open→03Stalled
[16:26:15] I hope so :-)
[16:26:17] +1'ed
[16:26:48] oh i need to set that to true for dse :)
[16:33:18] okay great, that worked.
[16:47:13] ottomata: the objects being removed, you mean?
[16:58:59] yup
[16:59:28] so the flink operator is deployed in staging, but basically doing nothing cuz we don't have an app there yet.
[16:59:46] working on that.
[17:00:56] HA operator ticket: https://phabricator.wikimedia.org/T336185
[17:45:15] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Dzahn) >>! In T334429#8833688, @Jhancock.wm wrote: > The recommended fix for this one (according to Dell) is a reboot and see if the error comes back. For the r...
[18:18:41] 10serviceops, 10MW-on-K8s, 10Performance-Team (Radar), 10Wikimedia-production-error: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes - https://phabricator.wikimedia.org/T336025 (10larissagaulia)
[18:19:04] 10serviceops, 10MW-on-K8s, 10Performance-Team (Radar), 10Wikimedia-production-error: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes - https://phabricator.wikimedia.org/T336025 (10Krinkle) I suspect there's a missing PHP extension or some other Debian package.
[20:31:44] Is anyone aware of issues with wikikube or kafka in CODFW? All the wdqs hosts in CODFW are about 2 days behind on rdf updates, and rebooting doesn't seem to help
[20:59:31] jayme: this is ready for review: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/895241
[20:59:31] ty
[23:10:19] 10serviceops, 10Observability-Alerting, 10observability, 10Patch-For-Review: Port openapi/swagger checks/alerts to Prometheus - https://phabricator.wikimedia.org/T320620 (10colewhite) Will need a grafana dashboard and a runbook to define the alerts.