[09:04:37] 06serviceops, 10Language-Technical Support, 06SRE, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9845479 (10Fuzzy)
[09:14:43] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9845525 (10Clement_Goubert) As these servers are up for decom, they won't be migrated to k8s, and they are in the current secondary datacenter. It doesn't rea...
[10:47:21] 06serviceops, 06DC-Ops, 10Prod-Kubernetes, 07Kubernetes: update idrac firmware on wikikube-ctrl1001 - https://phabricator.wikimedia.org/T365499#9845935 (10jijiki) 05Open→03Resolved Thank you papaul!
[10:47:42] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9845938 (10jijiki)
[10:47:44] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9845939 (10jijiki)
[10:47:53] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9845940 (10jijiki)
[10:47:58] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9845941 (10jijiki)
[10:51:56] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9845947 (10jijiki) **Current status:** eqiad and codfw: * `(baremetal) wikikube-ctrl` hosts are in production as stacked masters * have joined...
[10:52:51] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9845950 (10jijiki) 05Open→03Stalled
[10:53:23] 06serviceops, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9845957 (10jijiki) a:05JMeybohm→03hnowlan
[10:59:57] 06serviceops, 06MediaWiki-Engineering, 10MediaWiki-libs-BagOStuff, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): Cache mw-mcrouter service ClusterIP in apcu cache - https://phabricator.wikimedia.org/T363186#9845967 (10Clement_Goubert) As we move to using more services with a daemonset-...
[12:29:47] good afternoon akosiaris, I'm just getting online now
[12:35:15] 06serviceops, 06Infrastructure-Foundations, 06Release-Engineering-Team: Deprecate buster-backports - https://phabricator.wikimedia.org/T362518#9846191 (10Clement_Goubert) >>! In T362518#9838656, @dancy wrote: >>>! In T362518#9834634, @Clement_Goubert wrote: >> Just to be completely sure before deleting >> >...
[12:48:42] cdanis: o/. I think we don't have much to add to yesterday's situation other than it looks like we'll be able to get 10G for those nodes. eff.ie just finished adding the 3rd wikikube-ctrl node, so we got a bit more capacity.
[12:49:03] I've started looking into the incident review ritual docs though
[12:49:23] akosiaris: codfw with otelcol enabled is only +10% bytes during scap as compared to eqiad, I'd like to try turning otelcol back on in eqiad before some MW deploys today
[12:50:39] if a quick Meet call would be helpful I'm down for that
[12:50:52] so the one that starts in 10m?
[12:51:15] well there's the evening train, but I could finish the deploy of the daemonset if I started right now
[12:51:52] if you feel confident enough that it's only going to add +10% total traffic, I don't see why not.
[12:52:04] +1 on my side
[12:52:08] I mean, these numbers are all noisy
[12:52:12] it might be +20% or so
[12:52:16] "if it dies it dies" :p
[12:52:18] but with the 5 apiservers, two of them 10G, I feel okay about that
[12:52:58] +20% on what though?
[12:53:05] 06serviceops, 06SRE: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846286 (10jijiki) >>! In T366094#9844588, @CDanis wrote: >>>! In T366094#9842327, @akosiaris wrote: >> I am gonna disagree on this one. [This](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t36...
[12:53:06] the 90MB/s my tests saw yesterday?
[12:53:10] cause that's pretty fine
[12:53:18] 06serviceops, 06SRE: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846288 (10jijiki) 05Open→03In progress
[12:54:05] akosiaris: not sure if you saw my post from last night but I did a lot of work on otelcol byte usage https://phabricator.wikimedia.org/T366094#9844588
[12:54:06] or something that will approach the totals seen here: https://grafana.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716971493322&to=1717017967022&viewPanel=59?
[12:54:12] 06serviceops, 06SRE: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846289 (10jijiki)
[12:54:14] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9846291 (10jijiki)
[12:54:15] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9846290 (10jijiki)
[12:54:53] ah, looking into https://grafana.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716971493322&to=1717017967022&viewPanel=70
[12:55:00] I see what you mean by 10-20% more
[12:55:08] yeah
[12:55:31] it's kind of not apples-to-apples to compare the peak rates when one cluster has 5 apiservers and the other has 4
[12:55:35] of course you burst higher on the other one, that way
[12:56:19] ok, the other way to look at this is ... worst thing that can happen is a forced rollback and a botched deploy
[12:56:25] let's warn the deployer and try it out
[12:56:34] 👍
[12:58:44] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9846328 (10Clement_Goubert) `mw2282` is a kubernetes server, so would need to be drained and cordoned as well. However since they are to be decommed and in th...
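For context on the +10-20% figure debated above: an overhead estimate like this can be sanity-checked by comparing control-plane TX bytes between the two clusters over the same deploy window. Below is a minimal sketch against the Prometheus HTTP API; the endpoint URL, instance label patterns, and device name are hypothetical stand-ins for illustration, not the actual queries used here.

```python
#!/usr/bin/env python3
"""Rough otelcol overhead estimate: compare control-plane TX bytes
between two clusters over a scap window via the Prometheus HTTP API.
The endpoint, instance patterns, and device label are assumptions."""
import requests

PROM = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint

def tx_bytes(instance_re: str, window: str = "10m") -> float:
    """Total bytes transmitted by matching hosts over the window."""
    query = (
        f'sum(increase(node_network_transmit_bytes_total'
        f'{{instance=~"{instance_re}", device="eno1"}}[{window}]))'
    )
    r = requests.get(PROM, params={"query": query}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Hypothetical instance patterns for the two control planes.
eqiad = tx_bytes(r"wikikube-ctrl1\d+.*")  # otelcol off
codfw = tx_bytes(r"wikikube-ctrl2\d+.*")  # otelcol on

if eqiad > 0:
    print(f"codfw vs eqiad TX overhead: {100 * (codfw - eqiad) / eqiad:+.1f}%")
```

As noted in the conversation, the comparison is noisy and not quite apples-to-apples when the clusters have different apiserver counts, so a total over the whole window is a fairer basis than peak rates.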
[13:04:27] 06serviceops, 13Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9846350 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2044.codfw.wmnet with OS bookworm
[13:04:34] 06serviceops, 13Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9846351 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1044.eqiad.wmnet with OS bookworm
[13:11:15] helmfile invocations are beginning
[13:11:38] total control plane traffic spiking
[13:11:43] 200MB/s and increasing
[13:11:58] akosiaris: Mb(it)
[13:12:03] * akosiaris feels like that guy in Speed 2 saying every 5 seconds "N knots"
[13:12:13] indeed Mb/s, sorry about that
[13:13:46] ok now we're at 2Gbit/s+
[13:13:53] yeah...
[13:15:34] some small number of TCP retransmits on all hosts
[13:15:52] but on the single to double digit scale. Nothing super concerning but there's probably some minor saturation
[13:16:40] nic-saturation-exporter can probably tell us
[13:16:48] scap has finished with k8s btw
[13:17:38] Finished sync-prod-k8s (duration: 03m 04s)
[13:17:49] that itself is a nice improvement
[13:17:53] yeah, so successful test apparently?
[13:18:00] I'd say so
[13:18:10] I also believe we have two backports for this window
[13:18:13] so we're about to have a second
[13:18:16] dcausse: anything you experienced out of the ordinary (minor possible speed improvement aside)?
[13:18:19] That's way quicker than it used to be
[13:18:35] sigh, why did I say minor?
[13:18:39] I meant major :P
[13:18:43] Yeah, 2x isn't minor :D
[13:18:49] akosiaris: claime: it's been that much faster all this week
[13:18:50] akosiaris: no, fpm is restarting at the moment, everything looks normal to me
[13:19:00] basically once we added the two metal nodes to eqiad
[13:19:05] and the three metal nodes were in codfw
[13:19:13] done
[13:19:19] literally scap being blocked by kube-apiserver cpu
[13:20:29] cdanis: I think that wikikube-ctrl1002 saturated a bit
[13:20:43] but it's the only one
[13:21:10] the rest are ~40-50MB/s, this one was at 105 TX
[13:21:21] the exporter should verify that (or not)
[13:22:29] hmm
[13:23:57] akosiaris: no saturation indicated
[13:24:52] oh
[13:24:57] no, I was looking at ctrl1003, sorry
[13:26:21] akosiaris: https://grafana.wikimedia.org/goto/N6rZ-IsSg?orgId=1
[13:26:52] so yes, ctrl1002 did saturate in microbursts, but not nearly as badly as last week
[13:38:06] 06serviceops, 13Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9846559 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1044.eqiad.wmnet with OS bookworm completed: - mc1044 (**PASS**) - Dow...
[13:43:26] 06serviceops, 13Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9846579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2044.codfw.wmnet with OS bookworm completed: - mc2044 (**PASS**) - Dow...
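The microburst saturation discussed above is exactly what per-second traffic averages smooth away, which is why sub-second NIC sampling matters here. The following is a rough sketch of the underlying idea, not nic-saturation-exporter's actual code; the interface name, sample interval, and line rate are assumptions.

```python
#!/usr/bin/env python3
"""Detect NIC TX microbursts by sampling the kernel byte counter at a
sub-second interval -- roughly the idea behind nic-saturation-exporter.
Interface name, sample interval, and line rate are illustrative."""
import time

IFACE = "eno1"        # hypothetical interface name
LINE_RATE = 1e9 / 8   # assume a 1 Gbit/s link, expressed in bytes/second
INTERVAL = 0.1        # 100 ms samples catch bursts a 1 s average hides

def read_tx_bytes() -> int:
    """Cumulative bytes transmitted, from the kernel's per-NIC counter."""
    with open(f"/sys/class/net/{IFACE}/statistics/tx_bytes") as f:
        return int(f.read())

prev = read_tx_bytes()
while True:
    time.sleep(INTERVAL)
    cur = read_tx_bytes()
    rate = (cur - prev) / INTERVAL  # bytes/second over this short sample
    prev = cur
    if rate > 0.9 * LINE_RATE:      # >90% of line rate: likely saturating
        print(f"{time.strftime('%H:%M:%S')} TX burst: {rate * 8 / 1e6:.0f} Mbit/s")
```

A host averaging 50 MB/s over a second can still spend 100 ms pinned at line rate, dropping or delaying packets; that is the pattern ctrl1002 showed above.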
[13:45:56] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[13:47:01] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye executed...
[13:48:03] o/ effie
[13:55:02] cdanis: that matches what I expected. Thanks!
[14:05:55] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9846759 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[14:08:33] second run Finished sync-prod-k8s (duration: 03m 02s), previous run was equivalent at 3m04s
[14:21:32] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9846804 (10Papaul) @Clement_Goubert thanks for the update. Since I can not edit your comment, I'm updating it here. The move should be something like: serviceop...
[14:29:42] 06serviceops, 10Parsoid (Tracking), 13Patch-For-Review: parsoidtest1001 implementation tracking - https://phabricator.wikimedia.org/T363402#9846850 (10ihurbain)
[14:53:33] 06serviceops, 06SRE, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9846950 (10CDanis) https://grafana.wikimedia.org/goto/1rSRUSsSg?orgId=1 As we expected/hoped, the increase in eqiad TX bytes was only about 10-15%.
[14:55:42] heh, I just realized ofc we don't have nic-saturation-exporter on the old kubemasters because they're ganeti VMs
[14:56:46] I guess TCP retransmits really are the best way to measure
[15:00:47] https://grafana.wikimedia.org/goto/uwHJ8SsSg?orgId=1
[15:38:22] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[16:15:08] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847436 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye comple...
[16:18:24] 06serviceops, 10MW-on-K8s, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9847443 (10colewhite) Thanks for your help! As we discussed on IRC, namespace isn't t...
[16:32:17] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9847549 (10VRiley-WMF) @kamila is there a preferred time for this activity? I'm more than happy to schedule this at any time.
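Since the Ganeti-hosted kubemasters lack nic-saturation-exporter, retransmit counts become the fallback saturation signal, as noted above. A minimal sketch of watching the kernel's cumulative TCP retransmit counter locally follows; the 5-second polling interval is an arbitrary choice.

```python
#!/usr/bin/env python3
"""Watch TCP retransmissions as a saturation proxy on hosts without
nic-saturation-exporter (e.g. Ganeti VMs): read RetransSegs from
/proc/net/snmp and report the per-interval delta."""
import time

def retrans_segs() -> int:
    """Return the cumulative TCP RetransSegs counter from /proc/net/snmp."""
    with open("/proc/net/snmp") as f:
        lines = [l.split() for l in f if l.startswith("Tcp:")]
    header, values = lines[0], lines[1]  # first Tcp: line holds field names
    return int(values[header.index("RetransSegs")])

prev = retrans_segs()
while True:
    time.sleep(5)
    cur = retrans_segs()
    # Single-to-double-digit deltas, as seen above, suggest minor bursty
    # saturation; sustained large deltas would indicate a real problem.
    print(f"retransmits in last 5s: {cur - prev}")
    prev = cur
```

Retransmits are an indirect signal (loss anywhere on the path counts), but they have the advantage of needing nothing beyond the kernel's own counters, which already feed the Grafana view linked above.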
[16:43:06] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9847589 (10akosiaris) kafka-main1009 is successfully imaged
[18:09:55] 06serviceops, 06MW-Interfaces-Team, 06Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9848020 (10daniel)
[18:59:58] 06serviceops, 10Prod-Kubernetes, 07Kubernetes: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571#9848407 (10CDanis) `18:58:42 <+jinxer-wm> RESOLVED: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/...
[19:30:07] 06serviceops, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9848500 (10Jclark-ctr) @akosiaris kafka-main1010 has imaged but is still failing the cookbook for me, would you be able to try that one for me?