[07:09:53] <elukey>	 Morning folks
[07:10:27] <elukey>	 I wrote an update in https://phabricator.wikimedia.org/T338357#9003992, afaics the eventgate's envoy p99 latencies are a little better after yesterday's change
[07:12:36] <elukey>	 if so we may want to keep tuning/refining the eventgate's batching value, IIUC the eventbus extension uses deferred http calls so we shouldn't impact much external clients
[07:13:02] <elukey>	 (of course we should not increase the value too much, but maybe 10/20ms could give us even more benefits)
[07:13:21] <elukey>	 then we have to rebalance kafka main partitions, the brokers are clearly under pressure
[07:15:41] <_joe_>	 Amir1: we need to add the new proxies, rather
[07:16:32] <_joe_>	 elukey: go on with your best judgement re:eventgate
[07:16:44] <_joe_>	 as for kafka-main, we need to embargo adding new stuff to it
[07:19:42] <_joe_>	 and you're underselling the results of your change
[07:19:57] <_joe_>	 the p99 was reduced by 50% on average and has fewer spikes
[07:22:59] <elukey>	 _joe_ thanks! I'd just need some follow ups about the eventbus extension, 10ms seems another good test, but I'll wait for other folks to comment (not a real rush atm). For kafka main I'd say that after a rebalance we should be good, but yeah at the moment we are not in a great spot (at least for main eqiad)
[07:23:40] <elukey>	 I'll prep a task to rebalance partitions, if folks are ok we could just use the procedure that we followed last time (topicmappr -> json -> kafka reassign)
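For readers unfamiliar with the last step of that procedure: Kafka's partition reassignment tooling consumes a JSON document listing, per topic partition, the ordered broker IDs that should hold its replicas (the first one being the preferred leader). The sketch below only illustrates that JSON shape with a naive round-robin placement. It is not what topicmappr computes, and the topic names, broker IDs, and helper functions are hypothetical.

```python
import json
from collections import Counter


def round_robin_reassignment(topic_partitions, brokers, replication_factor):
    """Build a reassignment plan that spreads replicas round-robin.

    topic_partitions: dict of topic name -> number of partitions (hypothetical input)
    brokers: list of broker IDs (hypothetical values)
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds the number of brokers")

    partitions = []
    slot = 0  # advances once per partition so leaders rotate across brokers
    for topic, count in sorted(topic_partitions.items()):
        for partition in range(count):
            replicas = [
                brokers[(slot + i) % len(brokers)]
                for i in range(replication_factor)
            ]
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": replicas}
            )
            slot += 1
    return {"version": 1, "partitions": partitions}


def leaders_per_broker(plan):
    """Count how many partitions each broker would lead (first replica)."""
    return Counter(p["replicas"][0] for p in plan["partitions"])


# Hypothetical topics and broker IDs, for illustration only.
plan = round_robin_reassignment(
    {"topic-a": 3, "topic-b": 3}, [1001, 1002, 1003], replication_factor=3
)
print(json.dumps(plan, indent=2))
print(dict(leaders_per_broker(plan)))
```

A real plan would also weigh partition size and throughput per broker, which is the part a dedicated tool handles; the leader count here is just a quick sanity check of how even the placement is.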
[07:26:28] <elukey>	 _joe_ this is also very nice https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-2d&to=now&viewPanel=44
[07:27:26] <_joe_>	 elukey: yeah and we're not getting new hardware for kafka main anytime soon
[07:32:47] <elukey>	 sigh
[07:32:57] <elukey>	 then the rebalance is even more important
[08:02:07] <akosiaris>	 that's not entirely accurate
[08:02:22] <akosiaris>	 kafka-main clusters are scheduled for a refresh this year
[08:02:53] <akosiaris>	 I have tentatively scheduled them for Q4, but we can move that around
[08:03:05] <akosiaris>	 and e.g. move it to Q2, maybe even Q1 
[08:03:31] <akosiaris>	 that being said, I doubt they are going to be more powerful hardware
[08:04:09] <akosiaris>	 in fact, it's probably going for a minor downsize. e.g. kafka-main hosts have 128GB RAM, but 110GB is fs cache
[08:04:19] <akosiaris>	 that is most probably a waste of RAM
[08:07:30] <elukey>	 akosiaris: o/ IIRC kafka makes extensive use of the page cache, so a downsize may not be ideal, but not dramatic
[08:07:55] <elukey>	 if we have a good balancing for traffic between brokers we should be ok
[08:19:48] <_joe_>	 akosiaris: oh I thought it was dropped for budget reduction
[08:24:21] <akosiaris>	 _joe_: no, just downsized.
[08:24:25] <wikibugs>	 10serviceops, 10envoy: Update envoy to > 1.23 - https://phabricator.wikimedia.org/T341549 (10JMeybohm)
[08:24:41] <akosiaris>	 but your point stands. New, more powerful hardware isn't coming any time soon
[08:24:47] <wikibugs>	 10serviceops, 10envoy: Update envoy to > 1.23 - https://phabricator.wikimedia.org/T341549 (10JMeybohm)
[08:25:21] <wikibugs>	 10serviceops, 10SRE, 10envoy, 10Patch-For-Review: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (10JMeybohm) 05Open→03Resolved
[08:25:26] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:34] <wikibugs>	 10serviceops, 10SRE, 10envoy, 10Patch-For-Review: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10JMeybohm) 05Open→03Resolved
[08:25:40] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:25:48] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm) 05Open→03Resolved
[08:25:56] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[08:26:20] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) 05Open→03Resolved This is done from our end.
[08:35:56] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937041
[08:36:20] <elukey>	 this is the bump to 10ms for eventgate, if any kind soul review it I'll deploy it today :)
[08:41:35] <hashar>	 jelto: I am updating the checkboxes :)
[08:42:19] <jelto>	 hashar: ack
[08:42:48] <jelto>	 hashar: let me know when https://gerrit.wikimedia.org/r/c/operations/puppet/+/867712 should be merged
[08:58:53] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10Joe)
[08:59:29] <elukey>	 deploying eventgate :)
[09:02:13] <wikibugs>	 10serviceops, 10Observability-Metrics, 10observability: Scrape envoy runtime metrics in ops prometheus - https://phabricator.wikimedia.org/T341554 (10JMeybohm)
[09:07:09] <elukey>	 {{done}}
[09:10:29] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555 (10Joe)
[09:11:36] <wikibugs>	 10serviceops, 10Observability-Metrics, 10observability: Scrape envoy runtime metrics in ops & k8s prometheus - https://phabricator.wikimedia.org/T341554 (10JMeybohm)
[09:20:39] <wikibugs>	 10serviceops, 10Observability-Metrics, 10observability, 10Patch-For-Review: Scrape envoy runtime metrics in ops & k8s prometheus - https://phabricator.wikimedia.org/T341554 (10JMeybohm) a:03JMeybohm
[09:21:19] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10Clement_Goubert)
[09:27:34] <wikibugs>	 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10elukey)
[09:34:12] <Amir1>	 _joe_: I think they were added before
[09:34:52] <_joe_>	 Amir1: then it shouldn't be an issue if we still have old ones in k8s when you remove them
[09:35:21] <Amir1>	 yeah yeah, I mean for clean ups later
[09:35:29] <_joe_>	 but we should def remove them later
[09:57:00] <wikibugs>	 10serviceops, 10MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (10jijiki)
[09:57:59] <wikibugs>	 10serviceops, 10MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (10jijiki)
[09:58:01] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555 (10jijiki)
[09:58:03] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10jijiki)
[10:14:07] <wikibugs>	 10serviceops, 10iPoid-Service: Deploy ipoid to staging on Kuberenetes - https://phabricator.wikimedia.org/T341326 (10jijiki)
[10:25:37] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (10Joe) One update:   it should be possible to use helmfile to support arbitrary release names, with something like  ` releases:  - name: job-{{ requiredEnv "NAME_TOKEN" }}    <<: *default `  and thus...
[13:17:39] <elukey>	 hi folks
[13:18:03] <elukey>	 at around 11 UTC envoy telemetry metrics disappeared from the dashboard, see
[13:18:09] <elukey>	 https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=cxserver&var-destination=All&from=now-6h&to=now
[13:26:49] <elukey>	 Do we need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/937074 ?
[13:27:55] <jayme>	 yeah, I think so
[13:53:10] <jayme>	 elukey: metrics are back now. Sorry for that :/
[13:54:09] <elukey>	 nice thanks!
[14:00:48] <godog>	 FYI alerts.git will be getting deployed to all k8s-monitoring promethei, there might be additional alerts firing (cfr https://gerrit.wikimedia.org/r/c/operations/puppet/+/937079)
[15:51:35] <wikibugs>	 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10Jdforrester-WMF)
[16:04:22] <volans>	 claime et al.: quick question wrt the increase of traffic to the k8s infrastructure... was the irc.w.o problem solved? (udp broadcast from mw hosts)
[16:06:43] <_joe_>	 volans: what problem sorry?
[16:07:40] <volans>	 the udp stream of events to irc.w.o, I don't recall if they work at all from k8s
[16:08:27] <volans>	 I remember talking about that in here a long time ago but totally forgot the details
[16:08:52] <volans>	 and I was chatting about irc.w.o in a different context when this question came back to my mind, so here I am :)
[16:16:41] <_joe_>	 I see, indeed I have no recollection of how that works
[16:16:57] <_joe_>	 akosiaris: maybe you?
[16:25:32] <_joe_>	 volans: so it's a UDP message sent to various IP addresses, and we explicitly allow it in mediawiki on k8s' networkpolicy https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/_mediawiki-common_/global.yaml#122
[16:25:54] <_joe_>	 volans: so it should work correctly
[16:29:12] <volans>	 ok, great! Thanks for checking it