[08:59:01] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (JMeybohm) >>! In T324576#8454404, @Ottomata wrote: >> I will test and see what happens to a running Flink app when I take the opera...
[11:10:13] serviceops, Content-Transform-Team-WIP, Maps: Enable traffic mirroring from codfw to eqiad - https://phabricator.wikimedia.org/T324459 (Jgiannelos) Open→Resolved
[11:10:16] serviceops, Content-Transform-Team-WIP, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (Jgiannelos)
[12:28:20] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (JMeybohm)
[13:52:20] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > cert-manager in our cluster Ah! Do we have a cert-manager that will work with this webhook as is? Meaning we could ins...
[14:18:53] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Jclark-ctr) Open→Resolved removed power supply and reseated; error has cleared
[14:20:12] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Clement_Goubert) Thanks so much!
[14:21:54] do the .fixtures in a chart get tested, or do they just get evaluated for valid yaml?
[14:23:19] ottomata: the .fixtures are used to actually render the chart and validate it in CI
[14:23:37] is there a way I can provide expected template output?
[14:23:51] no
[14:23:55] rats :)
[14:24:29] we just use the fixtures to have all if-guards rendered, and all the resulting yaml is validated via kubeconform
[14:24:43] i've got a local values.yaml file i'm using for testing and developing, but it's not necessarily testing a feature flag. any reason not to put it in the chart's .fixtures?
[14:27:29] Not really if it renders to a valid chart I think
[14:27:33] No. Not really. But every .fixture will result in one helm template | kubeconform run
[14:27:59] so if there is no need for it, you would just extend CI runtime :)
[14:37:00] hm, okay.
[14:43:21] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) > what the webhook actually does Responses from Flink mailing list: > webhooks in general are optional components of the...
[14:49:22] ottomata: interesting answer from the mailing list. They don't even mention the mutation the webhook does :D
[14:50:35] since we're going to validate the FlinkDeployment against the CRD spec prior to submitting it to the cluster anyway (in CI), I guess we don't gain anything from the validation phase
[14:56:46] jayme: okay cool. i'll leave in support for it then, but disable installing it in the admin_ng helmfile values
[14:57:25] sounds good
[15:01:19] jayme: would it be better to use the built-in FlinkDeployment ingress support https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/ingress/
[15:01:22] or to use our templates somehow?
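For reference, the operator's built-in ingress support mentioned above is driven by an `ingress` block on the FlinkDeployment spec. A minimal sketch along the lines of the linked docs might look like the following; the name, hostname, ingress class and annotation are placeholders, not anything configured in our clusters:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-app                # placeholder name
spec:
  ingress:
    # {{name}} and {{namespace}} are substituted by the operator per deployment
    template: "flink-ui.example.org/{{namespace}}/{{name}}(/|$)(.*)"
    className: "nginx"              # placeholder ingress class
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: "/$2"
```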
[15:01:51] btw. they cut 1.3 today :-p
[15:01:57] oh nice!
[15:02:16] i saw the release branch in github when I was poking around in there, and thought one might be coming up soon :)
[15:09:57] regarding ingress: we (as in wikikube) support the default ingress API, but I have not tested it very much as we're using the istio ingress CRDs. Might be possible to just use it but I would need to check in detail
[15:11:54] but AIUI there is no full templating support but more or less only this "template:" string that gets converted to an ingress object.
[15:12:32] it might be easier to leverage the ingress stuff - plus we would stick with what we have already
[15:14:20] okay, will try our stuff first then...any hints? it's just for admin access to the Flink UI, i don't think we need e.g. a discovery endpoint (although, maybe that would be nice?) or...do we?
[15:15:08] and we could craft a generic/shared hostname for all flinks so new flink deployments would not need DNS changes to be reachable (see https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#More_complex_setups)
[15:15:16] oo yeah
[15:15:18] oh boy docs...reading
[15:15:38] not many docs, I kept it sparse so you can stick to guessing :D
[15:16:03] Hmm did we do anything that could cause a sharp increase in api_appserver calls? https://grafana.wikimedia.org/goto/kuJxo3F4z?orgId=1
[15:16:13] (I may be paranoid because 1st oncall)
[15:17:20] I'm not finding anything particular in RED
[15:18:07] hmm
[15:18:22] did you check SAL?
[15:19:11] Spike's almost an hour after the end of the train deploy
[15:19:32] So I don't think it's that
[15:19:33] promising :)
[15:19:47] Ah wait
[15:22:08] * jayme's superpower
[15:24:48] I see an increase in GET requests but it doesn't seem linked to the train actually, it's coincidentally ramping up
[15:25:19] https://grafana.wikimedia.org/goto/R50Z0qFVk?orgId=1
[15:25:29] It's quite a sharp increase if we look at 7 days
[15:28:29] true indeed
[15:33:14] https://superset.wikimedia.org/r/2187
[15:34:59] the surge in req/s is rather small compared to the load this seems to put on the workers
[15:35:07] Yeah
[15:35:26] I think it's a red herring
[15:35:44] eheh, that url can't be opened. "Request Line is too large (4186 > 4094) "
[15:36:21] if the req/s is a red herring I would still look towards the deployment as load might creep up slowly
[15:39:21] Yeah, but I'd expect to see a ramp from around 13h10/13h15 until now
[15:39:30] (in active/idle)
[15:39:43] https://grafana.wikimedia.org/goto/XZujA3FVk?orgId=1
[15:40:24] but it's stable for 1h and then dips suddenly
[15:46:34] jayme: recommendations for testing this ingress/istio stuff in minikube?
[15:46:39] is that even possible?
[15:47:34] claime: indeed...or maybe some caches running out after ~60min
[15:47:50] i guess i have to install istio?
[15:48:32] ottomata: you can install istio in minikube. Replicating our exact setup might be a bit tricky, though
[15:49:05] i just want something that will work with our ingress templates
[15:49:24] our istio setup is in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/custom_deploy.d/istio/
[15:49:28] reading custom_deploy yeah...
[15:49:29] okay
[15:49:36] yeah.
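For local testing, something along these lines might be a starting point: an IstioOperator manifest that enables only the ingress gateway. This is a guess at a minimal minikube setup, not the configuration from custom_deploy.d/istio/ linked above:

```yaml
# Hypothetical minikube starting point - the authoritative config lives in
# custom_deploy.d/istio/ (linked above); the istioctl version should match the
# istio version pinned there.
# Install with something like: istioctl install -y -f local-istio.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: empty                    # no sidecar injection, no egress gateway
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true               # only the ingress gateway
```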
[15:50:09] I think the tricky part is not to enable all the istio stuff as we only have the istio-ingressgateway enabled
[15:50:18] plus the cert-manager things ofc
[15:50:36] i will put that advice into my pocket until I know what it means :p
[15:51:18] istio is a full blown service mesh which we completely disable apart from the parts that do ingress
[15:51:52] right so in main/config.yaml egress things are disabled
[15:52:33] yes - and all the meshy sidecar things are disabled
[15:56:18] claime: I think the situation gets worse - and unfortunately I'm not super experienced with mw debugging :/ but I would guess Amir1 is :)
[15:56:30] oof, okay, i need to build istioctl somehow...figuring out which version
[15:56:31] then install it
[15:56:35] but i'm on macos... hmm
[15:56:37] jayme: Yeah, I'm starting to not like this either
[15:56:41] what terrible thing have I done
[15:56:44] * Amir1 reads up
[15:56:52] I don't think it's your fault
[15:56:55] [09.12.22 16:50] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:57:05] We're getting worker exhaustion on api_appserver
[15:57:06] claime: everything is my fault even if proven otherwise
[15:57:25] ottomata: uh...I was about to say "we have debian packages" :D
[15:57:28] okay, let's look at the flamegraph
[15:57:49] https://performance.wikimedia.org/arclamp/svgs/hourly/
[15:58:11] likely some perf regression
[15:58:26] hehe, i'm reading the how to deploy section of custom_deploy.d/istio/README.md
[15:59:40] it seems eventbus is at fault here
[15:59:42] https://performance.wikimedia.org/arclamp/svgs/hourly/2022-12-09_14.excimer-wall.api.svgz
[15:59:49] compare with https://performance.wikimedia.org/arclamp/svgs/hourly/2022-12-09_10.excimer-wall.api.svgz
[16:00:25] eventbus you say? :)
[16:00:41] sorry, also reading a little backscroll then. is this since a deploy?
[16:00:49] i'm not aware of any recent eventbus changes
[16:00:52] it seems wmf.13
[16:00:58] It started ~1h after the train for wmf.13
[16:00:59] oo via monologhandler
[16:01:20] now 30% of run time is just trying to talk to eventbus it seems
[16:01:30] i think the monolog handler eventbus is only really used for sending api request logs, and maybe some cirrussearch request logs, as events
[16:01:40] if there are a LOT more of those types of requests, there will be more events
[16:04:26] we only got a bump of ~10% in requests in eqiad, stable in codfw
[16:04:46] ok maybe a little more
[16:04:56] the throttled logs are back as well https://grafana-rw.wikimedia.org/d/000000561/logstash?from=1670529261000&orgId=1&to=now&viewPanel=45
[16:05:46] i don't see a spike in events for those streams i suspected
[16:05:57] https://grafana.wikimedia.org/goto/2njpx3F4z?orgId=1
[16:06:09] that seems unrelated at least, the target modules logs should show up in load.php requests not api.php
[16:07:19] unless load.php logs so much it overwhelms eventgate and then api.php logs can't reach?
[16:08:27] I can make it sample the logs but can't promise it would fix this issue
[16:11:38] Amir1: is there any way to see the flame stack that queued the DeferredUpdate? I guess not eh?
[16:12:40] if we can figure out what request type it is, we can replay it in mwdebug with xhgui and you'd have the full callgraph with details
[16:13:14] request type meaning which api action?
[16:14:31] yeah
[16:15:04] if it's a simple GET, then something like just the request url would be enough
[16:19:13] i don't see anything suspicious on the eventgate end, no extra requests, no weird latencies, no kafka issues. Amir1, reading the flame graph, the fact that the EventBus::send method is so wide means that the time is specifically spent there or below it (in curl), not above it, right?
[16:19:41] Yup
[16:20:05] The curl is wide
[16:20:16] So it's stuck in the http call
[16:21:17] Amir1: ottomata: this is interesting: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=All&from=now-6h&to=now
[16:21:58] cdanis: retries?
[16:22:27] retries, connection fail rate as well
[16:22:28] timeouts too
[16:23:19] "query processing would load too many samples into memory in query execution" has to be one of my least favorite messages
[16:23:31] oof
[16:24:57] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All
[16:25:09] eqiad p99 on POSTs went from ~10ms to ~46ms
[16:25:28] ya looking there too, it does seem like there has been a gradual increase in throughput here over time...is it possible we just hit some threshold?
[16:25:30] and need more pods?
[16:25:40] dunno what would cause this to spike like this
[16:27:32] pod cpu throttling? https://grafana.wikimedia.org/goto/v8OYaqK4k?orgId=1
[16:28:09] considering bumping replicas, objections?
[16:28:10] Oh, yep, good find
[16:28:25] sgtm
[16:28:30] same, go
[16:29:15] it is interesting it is only some of the pods
[16:29:24] the increase in logs caused by mobile load.php (T324723) might have been a contributing factor I think
[16:29:49] but that's going to be reverted soon
[16:29:52] Look at the network usage cdanis
[16:30:03] They all take a bump before starting throttling
[16:30:16] claime: yeah, but I was expecting it to be more uniformly distributed
[16:30:23] I imagine if it keeps going, as it load balances, it'd throttle more evenly
[16:30:34] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866612
[16:30:40] But it's been balancing for 2 hours and that didn't happen
[16:31:15] I am guessing the imbalance is just connection reuse and bad luck
[16:31:19] C'mon jenkins
[16:31:30] Ah no I didn't need jenkins lol
[16:31:47] And here's the page
[16:31:51] given the phpfpm saturation page let's get that rolled out expeditiously :D
[16:32:25] skipping jenkins
[16:33:01] deploying shortly...
[16:33:19] ottomata: You taking the helmfile apply?
[16:34:26] yes
[16:34:29] doing now
[16:34:54] hm FYI there is a chart version bump being applied too! the template looks mostly the same, tls config checksum changed tho.
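Background on the throttling/replicas reasoning above: a pod gets CFS-throttled once it tries to use more CPU than its configured limit, so adding replicas spreads the same request volume across more pods and keeps each one under its limit. A hypothetical values sketch; the key names and numbers are illustrative, not the actual eventgate-analytics settings or the change in the linked patch:

```yaml
# Illustrative only - not the real eventgate-analytics values or key names.
replicas: 10          # bumped so per-pod CPU usage stays below the limit
resources:
  requests:
    cpu: 500m
  limits:
    cpu: "1"          # CFS throttling kicks in when a pod tries to exceed this
```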
[16:35:12] applied in eqiad
[16:36:33] https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1&from=now-30m&to=now&refresh=30s
[16:36:43] 📉
[16:36:55] sweet
[16:37:05] Noice
[16:37:09] Thanks guys <3
[16:37:27] well, hm
[16:37:32] a bunch of the older pods are still pretty heavily throttled
[16:38:04] the uneven distribution might be a result of envoy<->envoy connections being kept open
[16:38:21] where's our horizontal auto scaling?!
[16:38:25] ;)
[16:38:26] jayme: yeah, in traffic land there's a maximum number of requests allowed through the same connection (order of 1000 i think)
[16:38:32] not sure if the same is done in envoy
[16:38:43] that *should* be the case there as well IIRC
[16:39:03] we can kill the pods and let the deployments recreate them?
[16:39:17] It would respread the connections
[16:39:27] ottomata: tbqh, having used both before, I'd much rather have per-instance utilization-based load balancing rather than autoscaling :)
[16:39:45] claime: let's wait and see first
[16:39:49] the old pods should be terminated
[16:39:53] yeah, np
[16:39:55] i think that is just old grafana stuff
[16:39:57] at least 2 of the pods are no longer throttling
[16:40:12] old pods should disappear from the dashboard, I think...
[16:40:14] ah
[16:40:23] OH yeah
[16:40:29] Because of the chart bump
[16:40:34] yeah looks like ottomata is right
[16:40:35] yup all pods have been recreated
[16:40:57] oh, claime if it weren't for the chart version bump, increasing replicas would have just spun up 10 new pods?
[16:41:01] without taking down the old ones?
[16:41:03] I think
[16:41:04] Not sure
[16:41:07] jayme: ?
[16:41:33] yes, absolutely
[16:42:11] cool
[16:42:25] hmm "max_requests_per_connection: 1000" we configure per default for the service-proxy stuff ( cdanis )
[16:42:28] (deploying in staging for consistency...)
[16:42:37] It would scale up the replicaset right?
[16:42:57] jayme: hmmm :/
[16:43:45] https://grafana.wikimedia.org/goto/PSZ-fqKVk?orgId=1
[16:43:46] claime: yep
[16:43:50] I think it worked
[16:44:06] phew!
[16:44:38] ah...yeah. We have a one time retry for eventgate-analytics in case of 5xx configured
[16:44:50] %worker in active state still a little high compared to baseline, but not by much
[16:44:59] so that probably ends up in a maximum of 2k requests per connection
[16:45:05] so, eventgate-analytics is def the busiest of the eventgates. i know the search team uses the cirrussearch request events to improve the search indexes, but i don't really know what the api-request logging is used for.
[16:45:12] every once in a while it's useful i guess...?
[16:45:15] (~22% baseline, 30% rn)
[16:46:06] So I managed to actually get a page during my 1st oncall week
[16:46:28] nice one!
[16:46:29] That I could have avoided if I'd called on Amir for help debugging earlier when I started seeing contention
[16:46:48] Lesson learned :P
[16:47:09] you noticed the contention because you were unpromptedly looking at dashboards, though
[16:47:13] you'll always find something weird that way ;)
[16:47:16] cdanis: Nah
[16:47:36] I'd had the crit flapping but not paging for an hour or so
[16:47:45] ahhh
[16:47:50] strange, I didn't get the page
[16:47:54] And I was trying to find out what happened
[16:48:09] Amir1: I acked it immediately so maybe with push latency?
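To make the envoy settings discussed above concrete (the per-default max_requests_per_connection and the single 5xx retry for eventgate-analytics), a heavily simplified sketch, not the actual service-proxy template, might look like:

```yaml
# Heavily simplified - field placement in the real service-proxy config differs.
clusters:
  - name: eventgate-analytics
    # each upstream connection is recycled after 1000 requests; the one-time
    # retry below roughly doubles the effective count, per the discussion above
    max_requests_per_connection: 1000
routes:
  - match: { prefix: "/" }
    route:
      cluster: eventgate-analytics
      retry_policy:
        retry_on: "5xx"   # a failing request is sent at most twice
        num_retries: 1
```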
[16:48:26] but using flamegraph was literally one of two/three slides of the debugging chapter in my mw presentation in Prague
[16:48:35] I have to start taking quizzes from now on
[16:48:40] yes but I couldn't attend
[16:48:47] Amir1: you weren't supposed to
[16:48:53] And flamegraph isn't in the checklist :P
[16:49:00] it was just arnold and cole scheduled for right now
[16:49:21] cdanis: aah, ok, makes sense, I am oncall for the EU time but this one was a bit late
[16:50:20] ottomata: you mind logging the replica bump numbers (and reason) to SAL?
[16:50:37] claime: fair, both of my sessions are actually being added to the onboarding videos though
[16:50:45] Amir1: Cool!
[16:51:54] and we probably need to follow up somewhere about the increased message volume - or did I miss something?
[16:52:35] jayme: https://phabricator.wikimedia.org/T320518#8455510
[16:54:36] Amir1: ah, I understood from the following comments that that is expected to be already fixed. That's why I pointed that graph out again earlier :)
[16:55:20] It'll get better but not be fixed, the revert is not deployed yet
[16:55:28] went from 3x to 2x :D
[16:55:43] ok. I see. That wasn't clear to me :)
[16:56:04] "drop rate has settled to a 2x increase since the deployment." .. is pretty clear, though :)
[16:58:14] So the chain of causality is: more mobile load.php logging => eventgate-analytics congestion => api_appservers spend too long trying/retrying to log, so active% shoots up, leading to worker starvation
[16:58:49] I gtg - have a nice rest of the day/weekend all
[16:58:58] jayme: have a nice weekend!
[16:58:59] claime: (that's what I understood fwiw)
[17:00:05] FWIW, I just +2'ed the revert so it'll settle by the next train, don't forget to reduce the number of pods next week
[17:00:17] (next Friday to be exact)
[17:05:24] Welp, I'll be off then
[17:39:46] ah, just saw the note about SAL, sorry, thanks claime
[17:42:36] i wonder if there is a way to make eventgate configurably throttleable, especially for the hasty=true endpoint
[17:42:56] just fail the http request early if too many requests?
[17:43:02] could the envoy stuff handle that maybe?
[17:43:21] eventgate-main is probably the only one we'd care about not doing that for.
[17:43:28] all the others can drop if they are busy
[18:26:45] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Q for @dcausse and @gmodena. I've thus far been making Flink logs go only to the console in ECS format. The console log...
[18:31:33] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) @JMeybohm it turns out that Flink's native k8s integration [[ https://nightlies.apache.org/flink/flink-docs-master/docs/de...
[19:07:37] serviceops, Performance-Team, Wikimedia Enterprise, affects-Kiwix-and-openZIM: large amount of traffic to the action=parse API from MWOffliner - https://phabricator.wikimedia.org/T324866 (daniel)
[22:50:34] serviceops, Data-Engineering, Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (Ottomata) Okay @JMeybohm, I'm ready for a first pass review of the [[ https://gerrit.wikimedia.org/r/c/operations/deployment-charts...
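On the 17:42 question about making eventgate shed load early: envoy ships a local rate limit HTTP filter that rejects requests with a 429 once a token bucket is exhausted, which is one way the "envoy stuff" could handle it. A rough sketch follows; the numbers and placement are purely illustrative, and this is not something currently configured for eventgate or the service-proxy:

```yaml
# Illustrative sketch only - not part of the current eventgate/service-proxy config.
http_filters:
  - name: envoy.filters.http.local_ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
      stat_prefix: eventgate_local_ratelimit
      token_bucket:
        max_tokens: 1000          # burst size (placeholder)
        tokens_per_fill: 1000     # sustained rate per fill_interval (placeholder)
        fill_interval: 1s
      filter_enabled:
        default_value: { numerator: 100, denominator: HUNDRED }
      filter_enforced:
        default_value: { numerator: 100, denominator: HUNDRED }
```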