[07:17:03] serviceops, MW-on-K8s: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 (Joe) a:Joe Please note this is now working as intended on mw-debug (just select 'k8s-experimental' from the wikimedia-debug extension, then visit https://en.wikipedia.org/w/debug/ini_get.php?value=
serviceops, CFSSL-PKI, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm) a:JMeybohm
[10:48:00] serviceops, MW-on-K8s: Move noc.wikimedia.org to kubernetes. - https://phabricator.wikimedia.org/T341859 (Joe)
[11:34:49] serviceops, MW-on-K8s: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 (Reedy) Open→In progress
[11:35:01] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Reedy)
[12:47:52] serviceops, MW-on-K8s: Move noc.wikimedia.org to kubernetes. - https://phabricator.wikimedia.org/T341859 (Joe)
[13:06:55] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey) Instead of using `rebalance`, I tried `rebuild` (that was the same command used for the last batch of partition moves) in this way: ` ./topicmappr rebuild --force-rebuild...
[13:07:05] Hi folks!
[13:07:19] I have a plan --^ for rebalancing partitions in kafka main codfw, stated above
[13:07:46] basically this time I'd concentrate only on moves of partitions to 2004 or 2005 (the brokers with less partitions)
[13:07:57] instead of apply all of them
[13:08:15] then we can see how main codfw looks afterwards, and proceed with eqiad
[13:08:21] if you are ok I'll start with those on monday
[13:09:25] <_joe_> It seems like a good course of action
[13:09:27] <_joe_> <3
[13:10:27] ack thanks :)
[13:11:14] <_joe_> I'll look at the details in a few
[13:14:45] in theory this gap should be reduced: https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=20
[13:15:02] the rest of the moves are related to the first three brokers, that in theory don't need much
[13:15:08] so I discarded them
[13:16:41] now that we have all the new metrics it will be interesting to understand how brokers behave after a switchover
[13:16:55] since now in main-codfw things are relatively good
[13:17:10] (even if partitions are heavily unbalanced)
[13:18:51] and I think that we should make an alarm out of https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=75, if any broker falls under 20% we should get notified
[13:19:06] (not now of course since we are already in a not-great-situation :D)
[13:24:15] <_joe_> yeah :P
[13:25:00] <_joe_> and yes, things are probably good in codfw because we don't have the same volume of events when mediawiki is primary in the other DC
[14:35:54] serviceops, MW-on-K8s: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (Reedy)
[15:10:38] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (hnowlan)
[15:19:18] serviceops, Content-Transform-Team-WIP, RESTbase Sunsetting, Wikifeeds, and 2 others: Switchover plan from restbase to api gateway for wikifeeds - https://phabricator.wikimedia.org/T339119 (hnowlan) Just to note that wikifeeds sends a different `access-control-allow-headers` header - restbase ove...
[23:23:34] serviceops, Data-Engineering, Data-Platform-SRE, SRE, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (BTullis)
[23:48:40] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 3 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (BTullis)
[23:59:21] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 4 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (BTullis)