[07:56:34] hi folks
[07:57:30] I'd like to start some of the partitions increase proposed in https://phabricator.wikimedia.org/T341558#9009529
[07:57:46] maybe 2/3 topics and then a roll restart of changeprop-jobqueue
[08:03:41] I could just start with mediawiki.job.cirrusSearchLinksUpdate, seems good enough to see if we can improve things (~300 events/s, 1 partition)
[08:04:51] You can do parsoidCachePrewarm with it
[08:05:56] okok seems a good compromise
[08:06:12] the aim is to spread the load to multiple brokers
[08:06:23] and to reduce the request handling queue on the first three
[08:06:45] yep
[08:21:01] elukey: 👍
[08:41:41] sorry I was afk, will start in a bit!
[09:04:03] starting!
[09:08:44] increased partitions, now I am roll restarting changeprop-jobqueue
[09:14:12] all right done
[09:19:07] I really hope that the need for a restart will be fixed when Hugh updates the node-rdkafka version, really annoying
[09:32:22] no visible improvement, but it makes sense, we need to move more partitions
[09:38:14] Did you just increase partitions or change leader too?
[09:39:11] just increased partitions
[09:39:25] Yeah so I'm not really surprised
[09:40:03] well the old single leader now has 1/3 of the traffic to handle
[09:40:46] maybe some improvement is registered but not enough to be visible in the avg metric
[10:26:33] elukey: I've been thinking about the changeprop version/image bumps and I think I'm okay with trying to just roll ahead with it and see if it works. Rolling back is quick, and all of the nodejs components are going to be the same as we're using the same nodejs version between both images
[10:27:07] hnowlan: yeah it makes sense, I am only a little worried about the kafka client consumer group state
[10:27:12] if it changes etc..
[10:27:21] but it shouldn't happen given the changelog
[10:27:24] (to avoid losing events)
[10:27:36] ahhh hmm
[10:27:50] a partial rollout would cause the same issues though I guess
[10:42:07] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:42:27] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05In progress→03Resolved Done, all looks ok. We'll now start preparing for 5%
[10:43:56] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Ladsgroup) >>! In T341463#9012011, @Clement_Goubert wrote: > Done, all looks ok. We'll now start preparing for 5% {meme, src=itshappening}
[10:57:39] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert)
[10:58:39] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (10Clement_Goubert) p:05Triage→03High
[11:04:05] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:04:34] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Migrate group1 to Kubernetes - https://phabricator.wikimedia.org/T340549 (10Clement_Goubert) 05In progress→03Resolved After moving a couple group1 wikis, we have decided to go with a global traffic percentage to roll forward. Marking Resolved.
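
A hedged sketch of the partition increase step discussed above, using confluent-kafka-python (a librdkafka wrapper, the same family as changeprop's node-rdkafka client); the broker address and the new partition count are placeholders, not the values used on kafka-main:

    # Hedged sketch: grow a Kafka topic's partition count with the admin API.
    # Broker address and the target partition count are placeholders.
    from confluent_kafka.admin import AdminClient, NewPartitions

    admin = AdminClient({"bootstrap.servers": "kafka-main-placeholder:9092"})

    # NewPartitions takes the new *total* count; existing partitions stay put,
    # the added ones land on other brokers, spreading the produce/fetch load.
    futures = admin.create_partitions(
        [NewPartitions("mediawiki.job.cirrusSearchLinksUpdate", 3)]  # 3 is a placeholder
    )

    for topic, future in futures.items():
        try:
            future.result()  # raises if the broker rejected the request
            print(f"partition count increased for {topic}")
        except Exception as exc:
            print(f"partition increase failed for {topic}: {exc}")

As the log notes, the extra partitions only take effect for changeprop once its consumers rejoin the group, hence the roll restart right after the change.
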
[11:06:13] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:12:42] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:48:13] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Clement_Goubert) 05Resolved→03In progress I'm wondering if in the vein of >>! In T290536#8466377, @Ladsgroup wrote: > This is not really user-impacting, spec...
[11:48:25] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:58:13] hnowlan: yes yes exactly, this is my only worry, namely if anything changes with the current status of the consumer groups
[11:58:18] but we can test in staging and see
[13:09:48] 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10elukey) Tested the topicmappr's rebalance command (in the previous task we used `rebuild` since we had two new brokers to add) on kafka-test to see how it worked. The rebalance co...
[13:09:56] claime: --^
[13:11:51] 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10elukey) Proposal - we could use the topics listed in [[ https://thanos.wikimedia.org/graph?g0.expr=sort_desc(topk(1000%2C%20sum(irate(kafka_server_BrokerTopicMetrics_MessagesIn_to...
[13:12:27] if what I did sounds right I can try to generate a plan for kafka main codfw
[13:24:59] From my limited understanding it does make sense
[13:53:39] ok so lemme try to generate a plan for kafka main codfw
[14:11:27] 10serviceops, 10Content-Transform-Team-WIP, 10RESTbase Sunsetting, 10Wikifeeds, and 2 others: Switchover plan from restbase to api gateway - https://phabricator.wikimedia.org/T339119 (10Jgiannelos)
[14:41:58] 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10elukey) Generation of the plan for main-codfw: ` elukey@kafka-main2001:~/T341558$ ./metrics-fetcher --prometheus-url http://prometheus.svc.codfw.wmnet/ops/ --zk-addr conf2004.cod...
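
A hedged sketch of the topic-selection idea in the proposal above: rank topics by message-in rate via the Prometheus HTTP API, roughly what the Thanos query in the task does. The query string and metric name below are placeholders (the real query is truncated above); only the /api/v1/query endpoint and response shape are standard Prometheus:

    # Rank Kafka topics by recent message rate to pick rebalance candidates.
    # The metric name inside QUERY is a placeholder, not the production one.
    import requests

    PROM_URL = "http://prometheus.svc.codfw.wmnet/ops"  # as in the command above
    QUERY = "sort_desc(topk(20, sum by (topic) (irate(kafka_topic_messages_in_placeholder[5m]))))"

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        topic = series["metric"].get("topic", "<unknown>")
        rate = float(series["value"][1])
        print(f"{topic}\t{rate:.1f} msg/s")

The busiest topics from such a query would then be fed to topicmappr to generate the actual reassignment plan, as described in the task.
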
[14:44:25] I am not 100% convinced about the generated plan, will do some iterations before attempting anything
[14:46:57] ack
[15:03:52] another unrelated question - the ML team is going to deploy a new version of the recommendation-api, written in python years ago, but long term Research, CTX and other folks prefer to invest time on it
[15:04:28] since we still don't know how general this api will be, and since we already have the recommendation-api (nodejs) discovery endpoint, I thought to call it 'recommendation-api-ng' to avoid naming clashes
[15:04:37] it is really horrible but I don't have other ideas
[15:04:57] I know that you'll hate me when checking deployment-charts, so if you have any suggestion I am all for it :)
[15:05:48] ugh, thanks, I hate it :') but I don't have an idea either
[15:06:43] I know, totally agree
[15:07:09] recommendation-apython (no don't)
[15:07:42] we may call it 'ctx-recommendation-api' since it will be probably focused on content translation related things, but if other clients of the nodejs version (like the Android app) will migrate to it we'll be confusing
[15:08:01] *it will be
[15:08:15] I assume we'll decom the old recommendation-api at some point?
[15:08:43] this is a very complex point that I am trying to work on :D
[15:09:49] Let's see the rest of the team bikeshed on this :p
[15:10:37] Copy of recommendation-api-v2-final-really
[15:10:55] recommandation-api-mix2-premaster_final
[15:11:17] I should have seen it coming, well deserved
[15:11:26] :D
[15:33:03] <_joe_> elukey: call another thing -ng and the icecaps will melt
[15:33:14] <_joe_> elukey: actually, I am gonna offer you a deal
[15:33:34] <_joe_> 1) Ensure that the old recommendation-api will be dismissed within a specific date
[15:33:48] <_joe_> 2) Rename admin_ng in deployment-charts to admin
[15:33:53] <_joe_> then we can talk :D
[15:39:12] _joe_ the deals usually should be good for both parties :D
[15:40:37] <_joe_> elukey: you get to name your service like you want
[15:40:39] <_joe_> maybe
[15:40:50] <_joe_> and also, I only look out for the common good
[15:40:56] <_joe_> the common good is also good for you
[15:41:39] sure
[15:47:51] o/ for some reason I assumed that eventgate-main.discovery.wmnet was being forced to eqiad so that events & jobs are flowing only to kafka-main@eqiad -> changeprop@eqiad... but I see that eventgate-main.discovery.wmnet can actually resolve to codfw when hit from a codfw server so now I'm wondering how we ensure that jobs and events are flowing mainly to eventgate@eqiad?
[15:54:49] <_joe_> dcausse: we don't
[15:55:07] <_joe_> wherever events are produced, they get consumed by changeprop
[15:55:16] <_joe_> and sent to the correct backend
[15:55:25] <_joe_> so for instance, let's take the jobqueue
[15:55:41] <_joe_> jobs will be created in both DCs (as we generate some jobs on GET requests)
[15:56:11] <_joe_> changeprop-jq will pick the ones in its own dc up and submit them to jobrunners.discovery.wmnet
[15:56:22] <_joe_> which points to the current main dc for mediawiki
[15:56:27] <_joe_> (right now, eqiad)
[15:57:21] <_joe_> dcausse: what is your current problem?
[15:57:37] no problem just trying to understand :)
[15:58:16] do we route POST requests the mw app servers to eqiad?
[15:58:25] *to
[15:59:46] <_joe_> yes
[15:59:55] <_joe_> and then some other stuff :)
[16:00:01] ok :)
[16:00:16] _joe_: makes sense thanks for explaining!
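
An illustrative sketch (not the actual changeprop code, which is Node.js) of the flow _joe_ describes: changeprop-jobqueue consumes jobs from its own DC's kafka-main cluster and submits them to jobrunners.discovery.wmnet, which resolves to the current primary MediaWiki DC. Broker, group and topic names and the jobrunner endpoint path are assumptions for illustration only:

    # Illustrative only: conceptual shape of changeprop-jobqueue's job path.
    import json
    import requests
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "kafka-main-local-dc-placeholder:9092",  # local-DC brokers (placeholder)
        "group.id": "changeprop-jobqueue-sketch",                     # placeholder group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["mediawiki.job.cirrusSearchLinksUpdate"])  # placeholder topic list

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            job = json.loads(msg.value())
            # jobrunners.discovery.wmnet points at the active MW DC, so jobs
            # produced in either DC end up executed in the primary one.
            # The endpoint path below is an assumption, not a confirmed URL.
            requests.post(
                "https://jobrunners.discovery.wmnet/rpc/RunSingleJob.php",
                json=job,
                timeout=30,
            )
    finally:
        consumer.close()
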
[16:53:10] elukey: rule execution looks normal in staging anyway with buster+newer node-rdkafka, haven't checked the groups
[19:13:40] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Reedy)
[21:29:41] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 3 others: Direct 1% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341463 (10Quiddity) Thanks for the draft, appreciated! I've [[https://meta.wikimedia.org/wiki/Tech/News/2023/29#Tech_News:_2023-29|added this to Tech News]]. (The only major...
[21:41:43] 10serviceops, 10MW-on-K8s: Max upload size on k8s is 2M - https://phabricator.wikimedia.org/T341825 (10jijiki)
[21:42:35] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki)
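
For the "haven't checked the groups" follow-up, a hedged sketch of reading a consumer group's committed offsets (for example before and after the image bump, to confirm nothing gets reset and no events would be lost or replayed); group, topic, broker and partition-count values are placeholders:

    # Read-only check of a consumer group's committed offsets.
    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "kafka-main-placeholder:9092",  # placeholder brokers
        "group.id": "changeprop-jobqueue-sketch",            # group to inspect (placeholder)
        "enable.auto.commit": False,                         # never commit from this script
    })

    topic = "mediawiki.job.cirrusSearchLinksUpdate"
    partitions = [TopicPartition(topic, p) for p in range(3)]  # placeholder partition count

    # committed() fetches the group's stored offsets without joining the group.
    for tp in consumer.committed(partitions, timeout=10):
        print(f"{tp.topic}[{tp.partition}] committed offset: {tp.offset}")

    consumer.close()
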