[07:09:53] Morning folks
[07:10:27] I wrote an update in https://phabricator.wikimedia.org/T338357#9003992, afaics the eventgate's envoy p99 latencies are a little better after yesterday's change
[07:12:36] if so we may want to keep tuning/refining the eventgate's batching value, IIUC the eventbus extension uses deferred http calls so we shouldn't impact much external clients
[07:13:02] (of course we should not increase the value too much, but maybe 10/20ms could give us even more benefits)
[07:13:21] then we have to rebalance kafka main partitions, the brokers are clearly under pressure
[07:15:41] <_joe_> Amir1: we need to add the new proxies, rather
[07:16:32] <_joe_> elukey: go on with your best judgement re:eventgate
[07:16:44] <_joe_> as for kafka-main, we need to embargo adding new stuff to it
[07:19:42] <_joe_> and you're underselling the results of your change
[07:19:57] <_joe_> the p99 was reduced by 50% on average and has less spikes
[07:22:59] _joe_ thanks! I'd just need some follow ups about the eventbus extension, 10ms seems another good test, but I'll wait for other folks to comment (not a real rush atm).
For kafka main I'd say that after a rebalance we should be good, but yeah at the moment we are not in a great spot (at least for main eqiad)
[07:23:40] I'll prep a task to rebalance partitions, if folks are ok we could just use the procedure that we followed last time (topicmappr -> json -> kafka reassign)
[07:26:28] _joe_ this is also very nice https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-2d&to=now&viewPanel=44
[07:27:26] <_joe_> elukey: yeah and we're not getting new hardware for kafka main anytime soon
[07:32:47] sigh
[07:32:57] then the rebalance is even more important
[08:02:07] that's not entirely accurate
[08:02:22] kafka-main clusters are scheduled for a refresh this year
[08:02:53] I have tentatively scheduled them for Q4, but we can move that around
[08:03:05] and e.g. move it to Q2, maybe even Q1
[08:03:31] that being said, I doubt they are going to be more powerful hardware
[08:04:09] in fact, it's probably going for a minor downsize. e.g. kafka-main hosts have 128GB RAM, but 110GB is fs cache
[08:04:19] that is most probably a waste of RAM
[08:07:30] akosiaris: o/ IIRC kafka makes extensive use of the page cache, so a downsize may not be ideal, but not dramatic
[08:07:55] if we have a good balancing for traffic between brokers we should be ok
[08:19:48] <_joe_> akosiaris: oh I thought it was dropped for budget reduction
[08:24:21] _joe_: no, just downsized.
[08:24:25] serviceops, envoy: Update envoy to > 1.23 - https://phabricator.wikimedia.org/T341549 (JMeybohm)
[08:24:41] but your point stands.
New, more powerful hardware isn't coming any time soon
[08:24:47] serviceops, envoy: Update envoy to > 1.23 - https://phabricator.wikimedia.org/T341549 (JMeybohm)
[08:25:21] serviceops, SRE, envoy, Patch-For-Review: Refactor envoy max_requests_per_connection from Cluster to HttpProtocolOptions - https://phabricator.wikimedia.org/T304124 (JMeybohm) Open→Resolved
[08:25:26] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:25:34] serviceops, SRE, envoy, Patch-For-Review: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (JMeybohm) Open→Resolved
[08:25:40] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:25:48] serviceops, SRE, Traffic, envoy, Patch-For-Review: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (JMeybohm) Open→Resolved
[08:25:56] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm)
[08:26:20] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm) Open→Resolved This is done from our end.
[08:35:56] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/937041
[08:36:20] this is the bump to 10ms for eventgate, if any kind soul review it I'll deploy it today :)
[08:41:35] jelto: I am updating the checkboxes :)
[08:42:19] hashar: ack
[08:42:48] hashar: let me know when https://gerrit.wikimedia.org/r/c/operations/puppet/+/867712 should be merged
[08:58:53] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (Joe)
[08:59:29] deploying eventgate :)
[09:02:13] serviceops, Observability-Metrics, observability: Scrape envoy runtime metrics in ops prometheus - https://phabricator.wikimedia.org/T341554 (JMeybohm)
[09:07:09] {{done}}
[09:10:29] serviceops, MW-on-K8s: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555 (Joe)
[09:11:36] serviceops, Observability-Metrics, observability: Scrape envoy runtime metrics in ops & k8s prometheus - https://phabricator.wikimedia.org/T341554 (JMeybohm)
[09:20:39] serviceops, Observability-Metrics, observability, Patch-For-Review: Scrape envoy runtime metrics in ops & k8s prometheus - https://phabricator.wikimedia.org/T341554 (JMeybohm) a:JMeybohm
[09:21:19] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (Clement_Goubert)
[09:27:34] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey)
[09:34:12] _joe_: I think they were added before
[09:34:52] <_joe_> Amir1: then it shouldn't be an issue if we still have old ones in k8s when you remove them
[09:35:21] yeah yeah, I mean for clean ups later
[09:35:29] <_joe_> but we should def remove them later
[09:57:00] serviceops, MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (jijiki)
[09:57:59] serviceops, MW-on-K8s: Migrate mwmaint 
server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (jijiki)
[09:58:01] serviceops, MW-on-K8s: Allow running periodic jobs for mw on k8s - https://phabricator.wikimedia.org/T341555 (jijiki)
[09:58:03] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (jijiki)
[10:14:07] serviceops, iPoid-Service: Deploy ipoid to staging on Kuberenetes - https://phabricator.wikimedia.org/T341326 (jijiki)
[10:25:37] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (Joe) One update: it should be possible to use helmfile to support arbitrary release names, with something like ` releases: - name: job-{{ requiredEnv "NAME_TOKEN" }} <<: *default ` and thus...
[13:17:39] hi folks
[13:18:03] at around 11 UTC envoy telemetry metrics disappeared from the dashboard, see
[13:18:09] https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=cxserver&var-destination=All&from=now-6h&to=now
[13:26:49] Do we need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/937074 ?
[13:27:55] yeah, I think so
[13:53:10] elukey: metrics are back now. Sorry for that :/
[13:54:09] nice thanks!
[14:00:48] FYI alerts.git will be getting deployed to all k8s-monitoring promethei, there might be additional alerts firing (cfr https://gerrit.wikimedia.org/r/c/operations/puppet/+/937079)
[15:51:35] serviceops, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (Jdforrester-WMF)
[16:04:22] claime et al.: quick question wrt the increase of traffic to the k8s infrastructure... was the irc.w.o problem solved? (udp broadcast from mw hosts)
[16:06:43] <_joe_> volans: what problem sorry?
[16:07:40] the udp stream of events to irc.w.o, I don't recall if they work at all from k8s
[16:08:27] I remember to have talked about that in here long time ago but totally forgot the details
[16:08:52] and I was chatting about irc.w.o in a different context that this question came back to my mind and here I am :)
[16:16:41] <_joe_> I see, indeed I have no recollection of how that works
[16:16:57] <_joe_> akosiaris: maybe you?
[16:25:32] <_joe_> volans: so it's an UDP message sent to various IP addresses, and we explicitly allow it in mediawiki on k8s' networkpolicy https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/_mediawiki-common_/global.yaml#122
[16:25:54] <_joe_> volans: so it should work correctly
[16:29:12] ok, great! Thanks for checking it
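
Editor's note: for readers unfamiliar with the mechanism described at 16:25:32 ("an UDP message sent to various IP addresses"), here is a minimal Python sketch of sending the same datagram to several receivers. It is not MediaWiki's code. The function name `send_udp_to_all`, the message text, and the localhost address are hypothetical and chosen only for illustration; the real targets and ports live in the linked global.yaml and are not reproduced here.

```python
import socket


def send_udp_to_all(message: str, targets: list[tuple[str, int]]) -> int:
    """Send one UDP datagram carrying `message` to every (host, port) target.

    Returns the number of datagrams handed to the kernel. UDP is
    fire-and-forget, so a successful send does not prove delivery.
    """
    payload = message.encode("utf-8")
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in targets:
            sock.sendto(payload, (host, port))
            sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical stand-in for a receiver: a local socket on an
    # OS-assigned port, so the example runs without external hosts.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver:
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(2.0)
        port = receiver.getsockname()[1]
        send_udp_to_all("example event line", [("127.0.0.1", port)])
        data, _ = receiver.recvfrom(4096)
        print(data.decode("utf-8"))
```

Because the sender gets no acknowledgement, egress that is silently dropped (for example by a network policy that does not allow the destination) looks the same to the sender as a successful send, which is why checking the allow rule, as done above, is the useful verification.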