[00:34:50] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[01:16:16] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem...
[01:17:40] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye
[01:30:16] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Rem...
[01:49:19] 10serviceops, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) Thank you @akosiaris We can only run client requests in the production URL, I guess it'll do for now until...
[05:42:21] *
[05:42:39] ^ wrong window :]
[07:29:10] 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10JMeybohm) >>! In T326785#9039367, @Jdforrester-WMF wrote: > OK, so situation as I understand it right now at 2023-07-24Z20:55 i...
[07:52:54] hello folks
[07:53:09] back to the kafka brainstorm, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/941315
[07:54:01] after the last round of rebalance I didn't notice any big movement in https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&viewPanel=75, which measures how busy the kafka threads processing requests are
[07:54:25] we deliberately set only 4 threads for the job, which is a bottleneck in my opinion
[07:54:43] so I'd like to test more threads on kafka-main1001, and see if its metrics improve
[07:55:55] load and cpu usage on the node is really low
[07:56:28] thoughts?
[07:57:16] can you share some links to documentation and/or where the original idea of 4 disks means 4 threads comes from?
[07:57:44] this is what we have in hiera
[07:57:44] # (4 disks in the broker HW RAID array)
[07:57:45] # Bump this up to get a little more parallelism between replicas.
[07:58:08] (not at all against experimenting with it, but I'm totally clueless about actual kafka config etc.)
[07:58:14] yeah, that I read :)
[07:58:16] yes yes definitely
[07:59:14] I think that the original idea was to have a single thread for each disk; some docs suggest multiplying the number of disks by the number of processors etc..
[07:59:52] upstream doesn't really say much, but my thinking is that 4 threads for the amount of requests handled by the kafka main brokers is not a lot
[08:00:13] and this explains, in my opinion, why their queues are so backlogged
[08:00:38] with a RAID array the threads == disks rule doesn't make a lot of sense though
[08:00:59] 8 is the default IIUC, at least for the io threads
[08:02:08] the test that I have in mind is simple - we have idleness metrics, queue sizes, etc.. so if more threads help, we should see an improvement for kafka-main1001
[08:02:15] if not I am wrong :)
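For context, the knobs being discussed map to standard Kafka broker settings. A minimal sketch of inspecting them on one broker follows; the /etc/kafka/server.properties path is an assumption (the usual package location), not a detail taken from the log. Upstream Kafka defaults are num.io.threads=8, num.network.threads=3 and num.recovery.threads.per.data.dir=1.

```bash
# Sketch: show the thread-pool settings discussed above as written on disk.
# Assumes the broker config lives at /etc/kafka/server.properties.
grep -E '^num\.(io\.threads|network\.threads|recovery\.threads\.per\.data\.dir)=' \
  /etc/kafka/server.properties
```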
[08:07:01] for example, for kafka jumbo we have 12, and a hw raid with 12 disks
[08:07:21] and its metrics are really nice, considering that it handles a ton more traffic than main
[08:08:09] (and kafka jumbo is unbalanced too, maybe less pronounced than main)
[08:10:35] (true that jumbo has a hw raid and main a sw raid, but for this purpose I don't see a real difference)
[08:23:45] <_joe_> yeah let's try it
[08:23:50] <_joe_> it seems like a good idea
[08:23:56] all sounds very sensible indeed
[08:25:49] maybe edit the comment while you're at it, given main has an SW array of 8 disks and not a HW array of 4 :)
[08:26:43] <_joe_> lol
[08:29:31] maybe the first kafka main node had 4
[08:29:33] who knows
[08:29:36] mystery :)
[08:29:52] probably... might be worth it to remove the reference altogether :)
[08:30:27] I didn't get which reference you are pointing at in my change's msg
[08:30:30] ?
[08:30:51] ahhh sorry I am stupid
[08:30:57] lemme fix the code change
[08:31:10] ;)
[08:31:22] it was not what I intended to send
[08:32:17] jayme: done, I applied an override only to kafka-main1001
[08:32:29] I had the other change staged and forgot to reset
[08:33:07] elukey: you could still drop the line "# (4 disks in the broker HW RAID array)" from hieradata/role/common/kafka/main.yaml as we now know it's no longer true :)
[08:33:30] jayme: sure sure, but we can do it when we apply it to all the brokers
[08:33:35] I am pretty sure we'll have to
[08:33:38] ack
[08:33:59] +1ed
[08:34:10] <3 merging and restarting kafka, let's see
[08:40:03] oh wow something really weird is happening
[08:40:03] # The number of threads doing disk I/O
[08:40:07] num.io.threads=1
[08:40:25] this is on all brokers, I checked because from pcc it should have been changed from 4 to 8
[08:40:43] lol we run a single thread?
[08:41:50] yeah we don't pass it to the class
[08:42:15] this means that we do the same on all nodes?
[08:43:11] yes
[08:44:51] that's a really nice find I suppose
[08:45:30] sort of, I am puzzled that kafka jumbo runs so smoothly with a single thread
[08:45:48] num_recovery_threads_per_data_dir is probably not passed either
[08:45:48] maybe it is varnishkafka's batching that helps
[08:45:55] exactly yes
[08:46:05] so I am going to revert my last change
[08:46:17] and fix this, but it will update the setting on all clusters
[08:46:25] so we'll have to gradually roll it out
[08:47:08] or you set it to 1 globally
[08:47:22] to not change jumbo for now I mean
[08:47:47] from there we can then see if 4 is an improvement on main
[08:51:00] could be an option yes
[09:01:42] jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/941362 lemme know if it works when you have a moment
[09:12:48] elukey: +1 with the comment of maybe not rolling out the changes at all to jumbo/logging if we don't have issues there
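As a hedged sketch of how the "this is on all brokers, I checked" verification could look from a cumin host: the `A:kafka-main` alias, the config path and the broker id are assumptions for illustration, not details from the log. On recent Kafka versions (>= 2.5) the value the running broker is actually using can also be read back with kafka-configs.sh.

```bash
# Fleet-wide check of what is written on disk (alias and path are assumptions).
sudo cumin 'A:kafka-main' "grep '^num.io.threads' /etc/kafka/server.properties"

# Read the effective value back from a running broker; --all on --describe is
# available on Kafka >= 2.5, and the broker id 1001 here is only illustrative.
kafka-configs.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --entity-type brokers --entity-name 1001 --describe --all | grep 'num.io.threads'
```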
[09:27:12] I'm a bit stuck on trying to figure out kafka. Problem I'd like to solve: pick a topic I can use for testing benthos that doesn't have ridiculous volume but has some volume. How I tried to go about it: `kamila@stat1007:~$ kafkacat -C -G "kamila-test-$RANDOM" -b kafka-main1001.eqiad.wmnet:9092 -t 'eqiad.mediawiki.page-create' -c 10`.
[09:27:38] The computer's opinion: `% ERROR: Failed to subscribe to 0 topics: Local: Invalid argument or configuration`. Any clue what I'm doing wrong?
[09:27:48] elukey: you have been implicated as someone who might know
[09:30:31] kamila_: in a meeting but I'll answer asap :)
[09:30:45] np, thank you!
[09:36:46] elukey: solved by j.oe, I need SSL and no group
[09:40:09] ack :)
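A sketch of the command implied by "SSL and no group": plain consumer mode (no `-G`, so no consumer group needs to be created) pointed at the TLS listener. The 9093 port and the CA bundle path are assumptions, not details from the log.

```bash
# Consume 10 messages from the topic over TLS, without joining a consumer group.
# Port 9093 and the CA location are assumptions.
kafkacat -C \
  -b kafka-main1001.eqiad.wmnet:9093 \
  -X security.protocol=ssl \
  -X ssl.ca.location=/etc/ssl/certs/ca-certificates.crt \
  -t eqiad.mediawiki.page-create \
  -c 10
```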
[09:40:47] jayme: yep yep, I have zero idea why it is not causing issues, but num.io.threads set to one is really dangerous
[09:40:58] I'll defer to other teams but..
[09:41:06] okidoke
[09:41:08] let's see how it goes with kafka-main
[09:41:10] thanks!
[09:42:01] maybe there is a safety net that changes the setting at runtime or something. Is there a way to read the config from the running process?
[09:42:15] not that I know
[09:46:46] * akosiaris just read backlog
[09:46:50] I am sorry, 1 thread?
[09:47:09] * akosiaris searches for wide eyes emoji
[09:48:00] akosiaris: you can imagine my face as well :D
[09:48:19] I think it's that one 😱 :-p
[09:49:21] jayme: close enough
[09:49:26] I'll take it
[09:49:31] 😱
[09:49:40] I am truly puzzled though, not really sure how kafka jumbo survived so long, maybe there is something that I am missing
[09:54:19] batching probably
[09:58:53] https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-1h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[09:58:57] lol
[09:59:10] bonk
[09:59:14] :D
[10:00:04] wow kafka is really amazing
[10:00:19] http://i0.kym-cdn.com/entries/icons/facebook/000/008/189/demotivational-posters-well-theres-your-problem6.jpg
[10:02:04] let's see at what value it stabilizes; if it doesn't go down (hopefully not) I'll apply the change to all main brokers
[10:03:18] Amazing what happens when config values are actually applied
[10:04:49] ahahah yes
[10:04:54] look at https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-1h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=main-eqiad&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All&viewPanel=71 being halved...
[10:07:39] elukey: you should write a well-elaborated two-pager blog post on how you single-handedly cut produce times in half with a one-line change :-)
[10:08:23] jayme: <3
[10:09:06] hmm
[10:09:17] so... is https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-method=GET&var-site=eqiad&refresh=1m related?
[10:09:25] probably just a bad coincidence?
[10:09:25] it was worse 30d ago though, before the batching in eventgate
[10:10:05] request rates aren't increasing so I am right now thinking coincidence
[10:10:31] latency is horrible though
[10:10:42] it increased at the same time
[10:10:53] almost, I think there is a 5 minute gap?
[10:11:22] yeah
[10:11:34] aaand a page :)
[10:12:05] should be easy enough to verify if we roll back to 1 thread, no? Just to be sure?
[10:12:43] latency is climbing down, and traffic is recovering
[10:12:51] I think it is just a volume of slow requests
[10:14:49] Lots of timeouts coming from /w/rest.php/commons.wikimedia.org/v3/page/pagebundle/User%3ATriplec85%2FTauberbischofsheim_by_year/780188172
[10:15:27] As well as another userpage
[10:18:13] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MatthewVernon) 05Resolved→03Open We got paged again for this today, and it looks like [[ https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers...
[10:23:41] FWIW kafka.service is definitely a little happier on main1001, in the sense that it is using more cpu now
[10:23:48] I'm looking at https://thanos.wikimedia.org/graph?g0.expr=rate(container_cpu_usage_seconds_total%7Bcluster%3D%22kafka_main%22%2Cid%3D~%22.*kafka%5C%5C.service.*%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[10:24:35] nice :)
[10:24:57] the avg idle percent is still going down, it was less than 10% before, maybe it will settle at around 40/50%
[10:36:55] 10serviceops, 10Parsoid (Tracking): Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (10MatthewVernon) 05Open→03Resolved [never mind, I gather this is due to a particular template change causing slow parsing]
[11:39:03] 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10akosiaris) And... ` akosiaris@kubernetes1007:~$ sudo apparmor_status apparmor module is loaded. 10 profiles are loaded. 10 pr...
[12:21:32] 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review: Kubernetes Wikifunctions security and control measures - https://phabricator.wikimedia.org/T326785 (10akosiaris) 05Open→03Resolved The apparmor changes have been merged. I think the goal of this task is done. I'll resolve, b...
[12:31:12] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/941396
[12:31:35] afaics the avg idle time seems around 40% now, may vary a bit but it seems more stable
[12:31:51] the change above rolls out the new setting to all main brokers :)
[12:32:05] we could go up to 8 in my opinion, but maybe later on
[12:44:02] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10Clement_Goubert) >>! In T340087#9040652, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.or...
[12:46:03] elukey: +1'd, great find
[12:52:09] thanks!
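For the rollout itself, the usual pattern is one broker at a time, waiting for under-replicated partitions to drain before touching the next host. In practice this is driven by the SRE cookbooks rather than a hand-rolled loop; the sketch below is illustrative only, and the broker list, unit name and ports are assumptions.

```bash
# Illustrative only: restart brokers one at a time, then block until no
# partition in the cluster is under-replicated before moving on.
BOOTSTRAP='kafka-main1001.eqiad.wmnet:9092,kafka-main1002.eqiad.wmnet:9092'
for broker in kafka-main1001 kafka-main1002 kafka-main1003; do
  ssh "${broker}.eqiad.wmnet" 'sudo systemctl restart kafka'
  sleep 60   # give the broker time to rejoin before polling replication state
  while kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
        --describe --under-replicated-partitions | grep -q .; do
    sleep 30
  done
done
```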
[13:05:00] hi folks, what's parse1002's status? I'd need a puppet run on it to unblock another deployment
[13:07:23] <_joe_> godog: I think we need to ask dcops
[13:08:11] thank you _joe_, do you reckon it'll stay up long enough to run puppet if I reboot it?
[13:08:20] or in other words, will it reboot?
[13:08:24] <_joe_> godog: https://phabricator.wikimedia.org/T339340
[13:08:31] <_joe_> yes it should
[13:08:43] <_joe_> if you just need to run puppet there
[13:09:04] yeah I do, ok will check with dcops
[13:15:51] restarting kafka main codfw with the new settings
[13:16:06] ack
[13:21:54] 10serviceops, 10MW-on-K8s: Allow mediawiki on k8s to support ingress - https://phabricator.wikimedia.org/T342356 (10Joe) 05Open→03Resolved
[13:21:59] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Joe)
[13:30:31] elukey: seeing the exact same idle pattern on 2001, smells good :)
[13:36:37] godog: downtiming parse1002 so it won't bother anybody
[13:38:48] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10Gehel)
[13:39:04] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1486.eqiad.wmnet with OS buster completed: - mw1486 (**WARN**)...
[13:44:15] claime: thank you <3
[13:47:54] claime: fingers crossed
[13:55:43] 10serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (10bking)
[13:55:45] 10serviceops, 10Data-Platform-SRE, 10Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (10bking)
[14:29:08] elukey: https://grafana.wikimedia.org/goto/qxn0ov34z?orgId=1 pretty waves
[14:48:10] claime: codfw done, if you are ok I'd proceed with eqiad
[14:48:41] elukey: honestly, seeing the results on codfw, you can proceed with my total blessing
[14:48:53] ack! proceeding!
[14:52:34] <_joe_> yep
[16:11:06] elukey: idle time on eqiad looks good too \o/
[16:11:22] claime: yep! almost done :)
[16:35:45] claime: restarts done, so far metrics look good
[16:35:57] latency is also better, but we'll see tomorrow after some hours of work
[16:41:05] <_joe_> elukey: <3 that's great
[16:42:32] _joe_ Some rebalance is still needed but we can reschedule it down the road
[16:42:47] tomorrow I'll wrap up graphs etc.. and see if the improvements are stable
[16:49:04] elukey: ❤️♾️
[16:49:52] we could probably add some more threads, but not this week
[16:49:54] :)
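On the rebalance that is still pending: a hedged sketch of what generating a reassignment plan with the stock tooling could look like on recent Kafka versions, using the page-create topic from earlier in the log purely as an example. The broker ids are placeholders, and in practice such a move would be planned and throttled far more carefully.

```bash
# Illustrative only: ask the reassignment tool to propose a balanced layout
# for one topic across a set of broker ids (topic and ids are placeholders).
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "eqiad.mediawiki.page-create"}]}
EOF

kafka-reassign-partitions.sh \
  --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1001,1002,1003,1004,1005" \
  --generate
```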