[09:39:27] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JMeybohm) >>! In T324576#8451544, @Ottomata wrote: > Is this possible to do with helm, or will that require manual e.g. `kubectl ed...
[09:42:31] Hmm we seem to have taken a hard perf hit on backend since last night 2300UTC
[09:42:52] https://grafana.wikimedia.org/goto/UYaWOuF4z?orgId=1
[09:43:08] https://grafana.wikimedia.org/goto/_VSMdXK4z?orgId=1
[09:44:55] Mostly visible in aggregate https://grafana.wikimedia.org/goto/3BCVOuKVk?orgId=1
[09:50:36] Most probably related to yesterday's late UTC backport window that made parsoid alert for a bit
[10:24:37] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10hashar) contint1001 can be decommissioned
[10:43:18] claime: probably best to comment with that on the task for the backport window I guess
[10:44:00] I wanted to check in with y'all if it was something to worry about or if I'm seeing things :p
[10:47:01] to me it also looks as if we gained ~100ms in backend response time
[10:47:17] mean latency...but still
[10:52:59] otoh if you extend the time period, we were in that range a couple of days ago as well, so it might be "normal"?
[10:54:33] yeah, it seems way less significant zooming back out
[11:17:45] 10serviceops, 10Patch-For-Review: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) 05In progress→03Resolved Merged alarm with 6 retries, 10m interval. Marking as resolved, we can reopen if it causes issues/doesn't alert us when we want.
[11:39:07] Just gonna use the swift discovery endpoints for the time being with thumbor if that's cool https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865595
[11:44:39] <3 claime
[12:44:37] 10serviceops, 10Phabricator, 10serviceops-collab, 10Patch-For-Review, 10Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (10Nikerabbit)
[13:40:16] 10serviceops, 10Observability-Tracing: Package OpenTelemetry Collector atop our own base Docker images - https://phabricator.wikimedia.org/T320552 (10Clement_Goubert) 05In progress→03Resolved
[13:40:46] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) 05Open→03In progress
[13:40:48] 10serviceops, 10Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (10Clement_Goubert)
[13:57:18] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > But editing via a values change deployed by helmfile.d would be fine. Actually, this works just fine, we don't have to...
[14:37:10] question about how we might do a mixed pooling of k8s and metal instances of thumbor - for something like sessionstore we define `service: kubesvc` in service.yaml, but we both want to reuse the old service definition and also not define a new one for thumbor-on-k8s. Is the only/best solution to just manually add all kubesvc hosts in conftool-data?
Either directly or adding a &kubernetes
[14:37:16] alias
[14:47:49] I'm not very sure, but that is what I understood from the conversation we had about this some time ago
[15:34:03] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > I will test and see what happens to a running Flink app when I take the operator offline... # Installed flink-kubernet...
[15:34:32] Getting a big latency spike on POST api_appserver (around x2)
[15:34:40] Looks like it's coming down now
[15:35:39] Yeah, it just flapped
[16:22:10] jayme: how do we expect metrics to work from an upstream helmchart? The prometheus metrics are exposed on a containerPort, but I believe a lot of the Service stuff to get that available and auto-monitored by prometheus is automated somehow...perhaps via our usual chart templates? Trying to look around, but I'm not totally sure.
[16:22:24] Should/will I have to augment the upstream helm chart to add the right Services?
[16:22:37] what about all the label injection stuff we do?
[16:23:13] ottomata: are we talking about the metrics of the operator or the metrics of the FlinkThings the operator operates :D
[16:23:21] the operator
[16:23:40] haven't started on the Flink app chart yet; that one will be within our usual template stuff though.
[16:24:52] for the operator you can add the prometheus.io/scrape annotations via .Values.operatorPod.annotations AIUI
[16:25:22] prometheus.io/scrape: "true"
[16:25:35] prometheus.io/port: X
[16:26:46] prometheus will just pick the operator pod up and scrape it then
[16:27:14] I see...hm. No Service needed for that then?
[16:27:30] nono
[16:27:56] OH, yes I see.
[16:28:06] prometheus discovers the pod (ip) directly via the k8s api and connects to that for metrics
[16:28:13] .Values.operatorPod.annotations, nice, thought I was going to have to modify the deployment spec.
[16:28:13] nice
[16:28:16] COOOL
[16:28:53] that is in our production k8s setup right? I won't see any such magic inside of minikube, I'll just see the annotations on the deployment
[16:29:49] yes correct. That's part of the prometheus setup from o11y. You'll just observe the annotations
[16:29:53] got it
[16:29:54] very cool
[16:30:35] by default PodIP:Port/metrics will be scraped. If you need a different path, set prometheus.io/path: /foo
[16:31:03] nope that'll work fine.
[16:31:10] sweet
[16:31:11] that was easier than I thought!
[16:31:21] some things are :-)
[16:53:47] 10serviceops, 10RESTBase, 10Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Performance, and 2 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10Dbrant)
[17:01:42] inflatador: making progress on flink helm stuff, about to start on a flink-app chart that can be deployed by the flink operator. wanna sync up today and follow a bit?
[17:14:22] here's a cr for the mixed metal/k8s thumbor - I assume there are no issues with having two definitions for a single host under two different services, but ehh, not sure? https://gerrit.wikimedia.org/r/c/operations/puppet/+/866445
[17:14:35] obviously will not be doing anything with this until next week
[17:23:29] _joe_, jayme, btullis, I'm starting to work on a flink-app chart that uses the FlinkDeployment CRD. I think it would help me if I could get a walkthrough (or docs?) on the new modules/mesh/vendor templates stuff, to see what I need. I think I don't need a lot of it, but I don't fully understand it all (especially the mesh stuff).
[17:41:21] ottomata sounds good! I have about an hr now, or we can do after 3:30 EST if that works for you
[17:42:48] inflatador: gimme 5 mins, will huddle you in slack
[17:43:12] ACK
[17:44:42] ottomata: tbh I think in this very special case you won't gain very much from the chart modules as you won't have all the usual things like deployments, services etc.
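[Editor's sketch] Pulling the scrape-annotation discussion above together, the values override for the upstream flink-kubernetes-operator chart might look roughly like this. The `operatorPod.annotations` key is the one named in the chat; the port number is a placeholder for whatever containerPort the operator actually exposes metrics on:

```yaml
# Hypothetical helmfile values override for the upstream
# flink-kubernetes-operator chart. Port 9999 is a placeholder.
operatorPod:
  annotations:
    prometheus.io/scrape: "true"   # prometheus discovers the pod via the k8s api
    prometheus.io/port: "9999"     # containerPort the operator serves metrics on
    # By default PodIP:Port/metrics is scraped; only set a path if it differs:
    # prometheus.io/path: "/foo"
```

No Service object is needed: per the chat, prometheus connects to the pod IP directly, so these pod annotations are the whole integration.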
[17:45:24] right, most of it I can do away with. what about mesh and tls and envoy stuff?
[17:45:38] I do want some routing to jobmanagers, and in HA there will be multiple jobmanagers, some standby
[17:45:48] some of the base stuff might be usable, though
[17:47:09] I haven't looked at the CRDs yet tbh so I'm not sure what you can define there and what you can't. So it's very possible that you can't simply "drop-in" the tls-proxy sidecar
[17:48:01] what do you mean by "routing to jobmanagers"? Is that the actual flink api that you want to talk to from the outside?
[17:48:41] just the Flink UI for admin purposes
[17:49:16] ah, I see. Could maybe be done with ingress...
[17:49:27] there is a podTemplate thing where I can augment the job and task manager pod spec
[17:49:28] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/
[17:50:35] regarding docs I don't think there is more than the README in the modules folder unfortunately
[17:51:44] okay, which is more about writing modules.
[17:53:16] I think I just don't know what all the bits are for. Okay, I'll see what I come up with and we can add in the modules I need after I get something working, if you all say I should.
[18:00:48] go for something simple in the first iteration (check the modules/base stuff) and maybe mesh/deployment_1.0.0.tpl for the sidecars
[18:00:53] tls-proxy
[18:01:52] ingress we can/should check in a second iteration I guess
[18:01:59] gtg, sorry
[18:02:10] or maybe? https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/ingress/
[18:02:13] ok, thanks!
[18:06:18] ottomata hit me up on Slack if you're ready, otherwise I'll probably go to lunch soon
[18:07:07] inflatador: I DID!
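[Editor's sketch] The podTemplate mechanism linked above (merged into both jobmanager and taskmanager pod specs by the operator) could in principle carry the same prometheus annotations into a FlinkDeployment. This is an illustrative, untested manifest, not the chart that was actually written; the name, image, ports, and resource values are all placeholders:

```yaml
# Illustrative FlinkDeployment using the operator's spec.podTemplate,
# which is merged into both jobmanager and taskmanager pods.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-app        # placeholder name
spec:
  image: flink:1.15              # placeholder image
  flinkVersion: v1_15
  serviceAccount: flink          # assumed service account
  podTemplate:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9999"   # placeholder metrics port
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
```

Exposing the Flink UI on the jobmanager would still be a separate concern (the ingress route discussed for a second iteration); the podTemplate only augments the pod specs themselves.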
[18:59:32] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) Oh, re webhook again: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes...
[22:10:16] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) Got a WIP first draft of a flink-app helm chart [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510...
[23:11:50] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10colewhite) I'm not a kafka expert, but this seems like a reasonable place to start. Pre-creating the topics is definitely the way...