[08:41:43] serviceops, SRE, ops-codfw: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (akosiaris) The machine isn't pooled yet into traffic. There is an alert for frequent changes due to puppet run. Indeed the following happens at every puppet run `Notice: /Stage[main]/Cpufrequtils/Exec...
[09:47:30] serviceops, Discovery-Search, SRE, Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[09:48:39] serviceops, SRE, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[09:49:33] serviceops, SRE, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) Open→In progress p:Triage→High
[10:18:13] serviceops, CirrusSearch, Discovery-Search, SRE: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:36:14] serviceops, CirrusSearch, Discovery-Search, SRE: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) In progress→Open a:pfischer→None
[10:39:53] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) a:brouberol
[10:46:23] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:50:26] serviceops, Data-Platform-SRE, SRE, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer→None
[10:56:51] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.me...
[10:58:24] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:58:32] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) We can see the impact on the overall topic size {F41648651}
[11:13:37] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.me...
[11:17:55] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) 22% of the topic segments were compacted and deleted: {F41648664}
[11:18:32] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[12:17:56] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[12:18:03] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) Open→Resolved The change has been applied an hour ago (at the line). We don't obs...
[13:54:44] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Interesting! Curious, so the reason for using compaction here is just to save space, not...
[14:34:55] serviceops, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (MoritzMuehlenhoff)
[15:39:24] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, yes, this was intended to a) save disk space and b) reduce the number of record...
[15:40:58] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Are you sure you want `delete` in the policy then? Perhaps you want to keep all the lates...
[16:10:20] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (hashar) That host also broke during the MediaWiki train: ` 04:55:49 Started sync_wikiversions 04:55:49 sync_wikiversions: 0% (ok: 0; fail: 0; left: 374) 04:58:04 sudo -u mwdeploy -n --...
[16:52:38] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, we considered this but decided against it since a) page_rerender is only o...
[18:18:48] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall)
[18:18:59] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall) p:Triage→Unbreak!
[18:19:53] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall) This is a blocker until the host is removed from the dsh targets.
[18:30:49] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Dzahn) depooled 2394 - per https://sal.toolforge.org/log/vbyWyowBxE1_1c7szGCe previously 2396 was depooled
[18:35:15] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall) Thanks, @Dzahn. After looking a bit more, I don't think the presence in `scap_targets` should affect train, so I'm deescalating this. Whether or not depooled hosts should still be present in `scap_t...
[18:35:27] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall) p:Unbreak!→Medium
[18:35:43] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (dduvall)
[18:36:45] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Dzahn) p:Medium→High I agree the train should be unblocked and lowering it from UBN to High seems correct. Also that scap_targets should only influence scap deployment. edit: well, High or Medium :)
[18:36:51] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Dzahn) p:High→Medium
[21:22:42] serviceops, CirrusSearch, Discovery-Search, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1): Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) +1 k!
[22:29:41] serviceops, Prod-Kubernetes, Kubernetes: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 (bking) > Do we track the IOPS bottlenecks we witnessed in some task? I'm also curious about the IOPS issues, since I assume the majority of etcd instances out in th...
[22:36:18] serviceops, MW-on-K8s, SRE, WMF-JobQueue: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (Urbanecm_WMF)
[22:37:16] serviceops, MW-on-K8s, SRE, WMF-JobQueue: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (Urbanecm_WMF)
[22:41:20] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Urbanecm_WMF) I think the k8s migration work as part of this ticket caused {T354229}.
[22:59:48] serviceops, MW-on-K8s, SRE, WMF-JobQueue: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s - https://phabricator.wikimedia.org/T354229 (Urbanecm_WMF)