[09:39:27] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10JMeybohm) >>! In T324576#8451544, @Ottomata wrote: > Is this possible to do with helm, or will that require manual e.g. `kubectl ed...
[09:42:31] Hmm we seem to have taken a hard perf hit on backend since last night 2300UTC
[09:42:52] https://grafana.wikimedia.org/goto/UYaWOuF4z?orgId=1
[09:43:08] https://grafana.wikimedia.org/goto/_VSMdXK4z?orgId=1
[09:44:55] Mostly visible in aggregate https://grafana.wikimedia.org/goto/3BCVOuKVk?orgId=1
[09:50:36] Most probably related to yesterday's late UTC backport window that made parsoid alert for a bit
[10:24:37] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Seen): contint1001 hardware failures (remove contint1001 from production) - https://phabricator.wikimedia.org/T324698 (10hashar) contint1001 can be decommissioned
[10:43:18] claime: probably best to comment with that on the task for the backport window I guess
[10:44:00] I wanted to check in with y'all if it was something to worry about or if I'm seeing things :p
[10:47:01] to me it also looks as if we gained ~100ms in backend response time
[10:47:17] mean latency...but still
[10:52:59] otoh if you extend the time period, we were in that range a couple of days ago as well, so it might be "normal"?
[10:54:33] yeah, it seems way less significant zooming back out
[11:17:45] 10serviceops, 10Patch-For-Review: Revisit PHP opcache health alarm - https://phabricator.wikimedia.org/T324649 (10Clement_Goubert) 05In progress→03Resolved Merged alarm with 6 retries, 10m interval. Marking as resolved, we can reopen if it causes issues/doesn't alert us when we want.
[11:39:07] Just gonna use the swift discovery endpoints for the time being with thumbor if that's cool https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865595
[11:44:39] <3 claime
[12:44:37] 10serviceops, 10Phabricator, 10serviceops-collab, 10Patch-For-Review, 10Release-Engineering-Team (Bonus Level 🕹️): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (10Nikerabbit)
[13:40:16] 10serviceops, 10Observability-Tracing: Package OpenTelemetry Collector atop our own base Docker images - https://phabricator.wikimedia.org/T320552 (10Clement_Goubert) 05In progress→03Resolved
[13:40:46] 10serviceops, 10Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (10Clement_Goubert) 05Open→03In progress
[13:40:48] 10serviceops, 10Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (10Clement_Goubert)
[13:57:18] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > But editing via a values change deployed by helmfile.d would be fine. Actually, this works just fine, we don't have to...
[14:37:10] question about how we might do a mixed pooling of k8s and metal instances of thumbor - for something like sessionstore we define `service: kubesvc` in service.yaml, but we both want to reuse the old service definition and also not define a new one for thumbor-on-k8s. Is the only/best solution to just manually add all kubesvc hosts in conftool-data?
Either directly or adding a &kubernetes
[14:37:16] alias
[14:47:49] I'm not very sure, but that is what I understood from the conversation we had about this some time ago
[15:34:03] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) > I will test and see what happens to a running Flink app when I take the operator offline... # Installed flink-kubernet...
[15:34:32] Getting a big latency spike on POST api_appserver (around x2)
[15:34:40] Looks like it's coming down now
[15:35:39] Yeah, it just flapped
[16:22:10] jayme: how do we expect metrics to work from an upstream helmchart? The prometheus metrics are exposed on a containerPort, but I believe a lot of the Service stuff to get that available and auto-monitored by prometheus is automated somehow...perhaps via our usual chart templates? Trying to look around, but I'm not totally sure.
[16:22:24] Should/will I have to augment the upstream helm chart to add the right Services?
[16:22:37] what about all the label injection stuff we do?
[16:23:13] ottomata: are we talking about the metrics of the operator or the metrics of the FlinkThings the operator operates :D
[16:23:21] the operator
[16:23:40] haven't started on the Flink app chart yet; that one will be within our usual template stuff though.
[16:24:52] for the operator you can add the prometheus.io/scrape annotations via .Values.operatorPod.annotations AIUI
[16:25:22] prometheus.io/scrape: "true"
[16:25:35] prometheus.io/port: X
[16:26:46] prometheus will just pick the operator pod up and scrape it then
[16:27:14] I see...hm. No Service needed for that then?
[16:27:30] nono
[16:27:56] OH, yes I see.
[16:28:06] prometheus discovers the pod (ip) directly via the k8s api and connects to that for metrics
[16:28:13] .Values.operatorPod.annotations, nice, thought I was going to have to modify the deployment spec.
[16:28:13] nice
[16:28:16] COOOL
[16:28:53] that is in our production k8s setup right? I won't see any such magic inside of minikube, I'll just see the annotations on the deployment
[16:29:49] yes correct. That's part of the prometheus setup from o11y. You'll just observe the annotations
[16:29:53] got it
[16:29:54] very cool
[16:30:35] by default PodIP:Port/metrics will be scraped. If you need a different path, set prometheus.io/path: /foo
[16:31:03] nope that'll work fine.
[16:31:10] sweet
[16:31:11] that was easier than I thought!
[16:31:21] some things are :-)
[16:53:47] 10serviceops, 10RESTBase, 10Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Performance, and 2 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10Dbrant)
[17:01:42] inflatador: making progress on flink helm stuff, about to start on a flink-app chart that can be deployed by the flink operator. wanna sync up today and follow a bit?
[17:14:22] here's a cr for the mixed metal/k8s thumbor - I assume there are no issues with having two definitions for a single host under two different services, but ehh, not sure? https://gerrit.wikimedia.org/r/c/operations/puppet/+/866445
[17:14:35] obviously will not be doing anything with this until next week
[17:23:29] _joe_, jayme, btullis, I'm starting to work on a flink-app chart that uses the FlinkDeployment CRD. I think it would help me if I could get a walkthrough (or docs?) on the new modules/mesh/vendor templates stuff, to see what I need. I think I don't need a lot of it, but I don't fully understand it all (especially the mesh stuff).
[17:41:21] ottomata sounds good! I have about an hr now, or we can do after 3:30 EST if that works for you
[17:42:48] inflatador: gimme 5 mins, will huddle you in slack
[17:43:12] ACK
[17:44:42] ottomata: tbh I think in this very special case you won't gain very much from the chart modules as you won't have all the usual things like deployments, services etc.
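[Editor's sketch] Pulling the scrape-annotation discussion above together, the values override for the upstream flink-kubernetes-operator chart might look roughly like this. The `operatorPod.annotations` key is the one named in the chat; the port number is a placeholder for whatever containerPort the operator actually exposes metrics on:

```yaml
# Hypothetical helmfile values override for the upstream
# flink-kubernetes-operator chart. Port 9999 is a placeholder.
operatorPod:
  annotations:
    prometheus.io/scrape: "true"   # prometheus discovers the pod via the k8s api
    prometheus.io/port: "9999"     # containerPort the operator serves metrics on
    # By default PodIP:Port/metrics is scraped; only set a path if it differs:
    # prometheus.io/path: "/foo"
```

No Service object is needed: per the chat, prometheus connects to the pod IP directly, so these pod annotations are the whole integration.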
[17:45:24] right, most of it I can do away with. what about mesh and tls and envoy stuff?
[17:45:38] I do want some routing to jobmanagers, and in HA there will be multiple jobmanagers, some standby
[17:45:48] some of the base stuff might be usable, though
[17:47:09] I haven't looked at the CRDs yet tbh so I'm not sure what you can define there and what you can't. So it's very possible that you can't simply "drop-in" the tls-proxy sidecar
[17:48:01] what do you mean by "routing to jobmanagers"? Is that the actual flink api that you want to talk to from the outside?
[17:48:41] just the Flink UI for admin purposes
[17:49:16] ah, I see. Could maybe be done with ingress...
[17:49:27] there is a podTemplate thing where I can augment the job and task manager pod spec
[17:49:28] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/
[17:50:35] regarding docs I don't think there is more than the README in the modules folder unfortunately
[17:51:44] okay, which is more about writing modules.
[17:53:16] I think I just don't know what all the bits are for. Okay, I'll see what I come up with and we can add in the modules I need after I get something working, if you all say I should.
[18:00:48] go for something simple in the first iteration (check the modules/base stuff) and maybe mesh/deployment_1.0.0.tpl for the sidecars
[18:00:53] tls-proxy
[18:01:52] ingress we can/should check in a second iteration I guess
[18:01:59] gtg, sorry
[18:02:10] or maybe? https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/ingress/
[18:02:13] ok, thanks!
[18:06:18] ottomata hit me up on Slack if you're ready, otherwise I'll probably go to lunch soon
[18:07:07] inflatador: I DID!
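[Editor's sketch] The podTemplate mechanism linked above (merged into both jobmanager and taskmanager pod specs by the operator) could in principle carry the same prometheus annotations into a FlinkDeployment. This is an illustrative, untested manifest, not the chart that was actually written; the name, image, ports, and resource values are all placeholders:

```yaml
# Illustrative FlinkDeployment using the operator's spec.podTemplate,
# which is merged into both jobmanager and taskmanager pods.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-flink-app        # placeholder name
spec:
  image: flink:1.15              # placeholder image
  flinkVersion: v1_15
  serviceAccount: flink          # assumed service account
  podTemplate:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9999"   # placeholder metrics port
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
```

Exposing the Flink UI on the jobmanager would still be a separate concern (the ingress route discussed for a second iteration); the podTemplate only augments the pod specs themselves.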
[18:59:32] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) Oh, re webhook again: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes...
[22:10:16] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream (Sprint 05), 10Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (10Ottomata) Got a WIP first draft of a flink-app helm chart [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510...
[23:11:50] 10serviceops, 10MW-on-K8s, 10SRE, 10observability: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10colewhite) I'm not a kafka expert, but this seems like a reasonable place to start. Pre-creating the topics is definitely the way...