[00:34:50] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[01:16:16] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Rem...
[01:17:40] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye
[01:30:16] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**)   - Rem...
[01:49:19] <wikibugs>	 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) Thank you @akosiaris  We can only run client requests in the production URL, I guess it'll do for now until...
[05:42:21] <hashar>	 *
[05:42:39] <hashar>	 ^ wrong window :]
[07:29:10] <wikibugs>	 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10JMeybohm) >>! In T326785#9039367, @Jdforrester-WMF wrote: > OK, so situation as I understand it right now at 2023-07-24Z20:55 i...
[07:52:54] <elukey>	 hello folks 
[07:53:09] <elukey>	 back to kafka brainstorm, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/941315
[07:54:01] <elukey>	 after the last round of rebalance I didn't notice any big movement in https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=75, that measures how busy the kafka threads processing requests are
[07:54:25] <elukey>	 we deliberately set only 4 threads for the job, that is a bottleneck in my opinion
[07:54:43] <elukey>	 so I'd like to test more threads on kafka-main1001, and see if its metrics improve
[07:55:55] <elukey>	 load and cpu usage on the node is really low
[07:56:28] <elukey>	 thoughts?
[07:57:16] <jayme>	 can you share some links to documentation and/or where the original idea of 4 disks means 4 threads comes from?
[07:57:44] <elukey>	 this is what we have in hiera
[07:57:44] <elukey>	 # (4 disks in the broker HW RAID array)
[07:57:45] <elukey>	 # Bump this up to get a little more parallelism between replicas.
[07:58:08] <jayme>	 (not at all against experimenting with it, but I'm totally clueless about actual kafaka config etc.)
[07:58:14] <jayme>	 yeah, that I read :)
[07:58:16] <elukey>	 yes yes denitely
[07:59:14] <elukey>	 I think that the original idea was to have a single thread for each disk, some docs suggest to multiply the disks times processors etc..
[07:59:52] <elukey>	 upstream doesn't really say much, but my thinking is that 4 threads for the amount of requests handled by the kafka main brokers are not a lot
[08:00:13] <elukey>	 and this explains, in my opinion, why their queues are so backlogged
[08:00:38] <elukey>	 with a RAID array the threads == disks doens't make a lot of sense though
[08:00:59] <elukey>	 8 is the default IIUC, at least for the io threads
[08:02:08] <elukey>	 the test that I have in mind is simple - we have idleness metrics, queue sizes, etc.. so if more threads help, we should see an improvement for kafka1001
[08:02:15] <elukey>	 if not I am wrong :)
[08:07:01] <elukey>	 for example, for kafka jumbo we have 12, and a hw raid with 12 disks
[08:07:21] <elukey>	 and its metrics are really nice, considering that it handles a ton more traffic than main
[08:08:09] <elukey>	 (and kafka jumbo is unbalanced too, maybe less pronounced than main)
[08:10:35] <elukey>	 (true that jumbo has a hw raid and main a sw raid, but for this purpose I don't see a real difference)
[08:23:45] <_joe_>	 yeah let's try it
[08:23:50] <_joe_>	 it seems like a good idea
[08:23:56] <jayme>	 all sounds very sensible indeed
[08:25:49] <jayme>	 maybe edit the comment while you're at it, given main has an SW array of 8 dists and not a HW array of 4 :)
[08:26:43] <_joe_>	 lol
[08:29:31] <elukey>	 maybe the first kafka main node had 4
[08:29:33] <elukey>	 who knows
[08:29:36] <elukey>	 mistery :)
[08:29:52] <jayme>	 probably...might be worth it to remove the reference alltogether :)
[08:30:27] <elukey>	 I didn't get what reference you are pointing at, in my change's msg
[08:30:30] <elukey>	 ?
[08:30:51] <elukey>	 ahhh sorry I am stupid
[08:30:57] <elukey>	 lemme fix the code change
[08:31:10] <jayme>	 ;)
[08:31:22] <elukey>	 it was not what I intended to send
[08:32:17] <elukey>	 jayme: done, I applied an override only to kafka-main1001
[08:32:29] <elukey>	 I had the other change staged and forgot to reset
[08:33:07] <jayme>	 elukey: you could still drop the line "# (4 disks in the broker HW RAID array)" from hieradata/role/common/kafka/main.yaml as we now know it's no longer true :)
[08:33:30] <elukey>	 jayme: sure sure but we can do it when we apply it to all the brokers
[08:33:35] <elukey>	 I am pretty sure we'll have to
[08:33:38] <jayme>	 ack
[08:33:59] <jayme>	 +1ed
[08:34:10] <elukey>	 <3 merging and restarting kafka, let's see
[08:40:03] <elukey>	 oh wow something really weird is happening
[08:40:03] <elukey>	 # The number of threads doing disk I/O                                                                                                      
[08:40:07] <elukey>	 num.io.threads=1  
[08:40:25] <elukey>	 this is on all brokers, I checked because from pcc it should have been changed from 4 to 8
[08:40:43] <elukey>	 lol we run a single thread?
[08:41:50] <elukey>	 yeah we don't pass it to the class
[08:42:15] <elukey>	 this means that we do the same on all nodes?
[08:43:11] <elukey>	 yes
[08:44:51] <jayme>	 that's a really nice find I suppose
[08:45:30] <elukey>	 sort of, I am puzzled that kafka jumbo runs so smooth with a single thread
[08:45:48] <jayme>	 num_recovery_threads_per_data_dir is not passed as well probably
[08:45:48] <elukey>	 maybe it is the varnishkafka's batching that helps
[08:45:55] <elukey>	 exactly yes
[08:46:05] <elukey>	 so I am going to revery my last change
[08:46:17] <elukey>	 and fix this, but it will update settings to all clusters
[08:46:25] <elukey>	 so we'll have to gradually roll it out
[08:47:08] <jayme>	 or you set it to 1 globally
[08:47:22] <jayme>	 to not change jumbo for now I mean
[08:47:47] <jayme>	 from there we can then see if 4 is an improvement on main
[08:51:00] <elukey>	 could be an option yes
[09:01:42] <elukey>	 jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/941362 lemme know if it works when you have a moment
[09:12:48] <jayme>	 elukey: +1 with the comment of maybe not rolling out the changes at all to jumbo/logging if we don't have issues there
[09:27:12] <kamila_>	 I'm a bit stuck on trying to figure out kafka. Problem I'd like to solve: pick a topic I can use for testing benthos that doesn't have ridiculous volume but has some volume. How I tried to go about it: `kamila@stat1007:~$ kafkacat -C -G "kamila-test-$RANDOM" -b kafka-main1001.eqiad.wmnet:9092 -t 'eqiad.mediawiki.page-create' -c 10 `.
[09:27:38] <kamila_>	 The computer's opinion: `% ERROR: Failed to subscribe to 0 topics: Local: Invalid argument or configuration`. Any clue what I'm doing wrong?
[09:27:48] <kamila_>	 elukey: you have been implicated as someone who might know
[09:30:31] <elukey>	 kamila_: in a meeting but I'll answer asap :)
[09:30:45] <kamila_>	 np, thank you!
[09:36:46] <kamila_>	 elukey: solved by j.oe, I need SSL and no group
[09:40:09] <elukey>	 ack :)
[09:40:47] <elukey>	 jayme: yep yep, I have zero idea why it is not causing issues, but num.io.threads set to one is really dangerous
[09:40:58] <elukey>	 I'll defer to other teams but..
[09:41:06] <jayme>	 okidoke
[09:41:08] <elukey>	 let's see how it goes with kafka-main
[09:41:10] <elukey>	 thanks!
[09:42:01] <jayme>	 maybe there is a safety net changing the setting at runtime or something. Is there a way to read the config from the running process?
[09:42:15] <elukey>	 not that I know
[09:46:46] * akosiaris just read backlog
[09:46:50] <akosiaris>	 I am sorry, 1 thread? 
[09:47:09] * akosiaris searches for wide eyes emoji
[09:48:00] <elukey>	 akosiaris: you can imagine my face as well :D
[09:48:19] <jayme>	 I think it's that one 😱 :-p
[09:49:21] <akosiaris>	 jayme: close enough
[09:49:26] <akosiaris>	 I 'll take it
[09:49:31] <akosiaris>	 😱
[09:49:40] <elukey>	 I am truly puzzled though, not really sure how kafka jumbo survived so long, maybe there is something that I am missing
[09:54:19] <akosiaris>	 batching probably 
[09:58:53] <elukey>	 https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-1h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[09:58:57] <elukey>	 lol
[09:59:10] <claime>	 bonk
[09:59:14] <jayme>	 :D
[10:00:04] <elukey>	 wow kafka is really amazing
[10:00:19] <claime>	 http://i0.kym-cdn.com/entries/icons/facebook/000/008/189/demotivational-posters-well-theres-your-problem6.jpg
[10:02:04] <elukey>	 let's see at what value it stabilize, if it doesn't go down (hopefully not) I'll apply the change to all main brokers
[10:03:18] <claime>	 Amazing what happens when config values are actually applied
[10:04:49] <elukey>	 ahahah yes
[10:04:54] <akosiaris>	 look at https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-1h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=71 being halved...
[10:07:39] <jayme>	 elukey: you should write a well elaborated two pager blod post on how you single-handedly cut produce times in halv with a one-line change :-)
[10:08:23] <elukey>	 jayme: <3
[10:09:06] <akosiaris>	 hmm
[10:09:17] <akosiaris>	 so... is https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-method=GET&var-site=eqiad&refresh=1m related? 
[10:09:25] <akosiaris>	 probably just a bad coincidence?
[10:09:25] <elukey>	 it was worse 30d ago though, before the batching in eventgate
[10:10:05] <akosiaris>	 request rates aren't increasing so I am right now thinking coincidence
[10:10:31] <elukey>	 latency is horrible though
[10:10:42] <elukey>	 it increased at the same time
[10:10:53] <akosiaris>	 almost, I think there is a 5 minute gap?
[10:11:22] <elukey>	 yeah
[10:11:34] <elukey>	 aaand a page :)
[10:12:05] <jayme>	 should be easy enough to verify if we roll back to 1 thread, no? Just to be sure?
[10:12:43] <elukey>	 latency is climbing down, and traffic is recovering
[10:12:51] <elukey>	 I think it is just a volume of slow requests
[10:14:49] <claime>	 Lots of timeouts coming from /w/rest.php/commons.wikimedia.org/v3/page/pagebundle/User%3ATriplec85%2FTauberbischofsheim_by_year/780188172
[10:15:27] <claime>	 As well as another userpage
[10:18:13] <wikibugs>	 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MatthewVernon) 05Resolved→03Open We got paged again for this today, and it looks like [[ https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers...
[10:23:41] <godog>	 FWIW kafka.service is definitely a little happier on main1001, in the sense that it is using more cpu now
[10:23:48] <godog>	 I'm looking at https://thanos.wikimedia.org/graph?g0.expr=rate(container_cpu_usage_seconds_total%7Bcluster%3D%22kafka_main%22%2Cid%3D~%22.*kafka%5C%5C.service.*%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[10:24:35] <elukey>	 nice :)
[10:24:57] <elukey>	 the avg idle percent is still going down, it was less that 10% before, maybe it will set at around 40/50%
[10:36:55] <wikibugs>	 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MatthewVernon) 05Open→03Resolved [never mind, I gather this is due to a particular tempate change causing slow parsing]
[11:39:03] <wikibugs>	 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10akosiaris) And...  ` akosiaris@kubernetes1007:~$ sudo apparmor_status  apparmor module is loaded. 10 profiles are loaded. 10 pr...
[12:21:32] <wikibugs>	 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10akosiaris) 05Open→03Resolved The apparmor changes have been merged. I think the goal of this task is done. I 'll resolve, b...
[12:31:12] <elukey>	 created https://gerrit.wikimedia.org/r/c/operations/puppet/+/941396
[12:31:35] <elukey>	 afaics the avg idle time seems around 40% now, may vary a bit but it seems more stable
[12:31:51] <elukey>	 the change above is to roll out the change to all main brokers :)
[12:32:05] <elukey>	 we could go up to 8 in my opinion, but maybe later on
[12:44:02] <wikibugs>	 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10Clement_Goubert) >>! In T340087#9040652, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.or...
[12:46:03] <claime>	 elukey: +1'd, great find
[12:52:09] <elukey>	 thanks!
[13:05:00] <godog>	 hi folks, what's parse1002' status? I'd need a puppet run on it to unblock another deployment
[13:07:23] <_joe_>	 godog: I think we need to ask dcops
[13:08:11] <godog>	 thank you _joe_, do you reckon it'll be up enough time to run puppet if I reboot it ?
[13:08:20] <godog>	 or in other words, will it reboot?
[13:08:24] <_joe_>	 godog: https://phabricator.wikimedia.org/T339340
[13:08:31] <_joe_>	 yes it should
[13:08:43] <_joe_>	 if you just need to run puppet there
[13:09:04] <godog>	 yeah I do, ok will check with dcops
[13:15:51] <elukey>	 restarting kafka main codfw with the new settings
[13:16:06] <claime>	 ack
[13:21:54] <wikibugs>	 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe) 05Open→03Resolved
[13:21:59] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Joe)
[13:30:31] <claime>	 elukey: seeing the exact same idle pattern on 2001, smells good :)
[13:36:37] <claime>	 godog: downtiming parse1002 so it won´t bother anybody
[13:38:48] <wikibugs>	 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel)
[13:39:04] <wikibugs>	 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1486.eqiad.wmnet with OS buster completed: - mw1486 (**WARN**)...
[13:44:15] <godog>	 claime: thank you <3
[13:47:54] <elukey>	 claime: fingers crossed
[13:55:43] <wikibugs>	 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10bking)
[13:55:45] <wikibugs>	 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking)
[14:29:08] <claime>	 elukey: https://grafana.wikimedia.org/goto/qxn0ov34z?orgId=1 pretty waves
[14:48:10] <elukey>	 claime: codfw done, if you are ok I'd proceed with eqiad
[14:48:41] <claime>	 elukey: honestly seeing the results on codfw, you can proceed with my total blessing
[14:48:53] <elukey>	 ack! proceeding!
[14:52:34] <_joe_>	 yep
[16:11:06] <claime>	 elukey: idle time on eqiad looks good too \o/
[16:11:22] <elukey>	 claime: yep! almost done :)
[16:35:45] <elukey>	 claime: restarts done, so far metrics look good
[16:35:57] <elukey>	 latency is also better, but we'll see tomorrow after some hours of work
[16:41:05] <_joe_>	 elukey: <3 that's great
[16:42:32] <elukey>	 _joe_ Some rebalance is still needed but we can reschedule it down the road
[16:42:47] <elukey>	 tomorrow I'll wrap up graphs etc.. and see if the improvements are stable
[16:49:04] <akosiaris>	 elukey: ❤️♾️
[16:49:52] <elukey>	 we could probably add some more threads, but not this week
[16:49:54] <elukey>	 :)