[02:42:47] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9840665 (CDanis) = tldr: * Adding the new control plane workers in eqiad turned what was a CPU saturation issue (causing blackbox probes to be slow but still within timeouts), into a simultaneous...
[08:20:16] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9841008 (akosiaris) >>! In T363212#9839469, @Dzahn wrote: > @akosiaris re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035769/1/modules/profile/da...
[10:05:13] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841272 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b65d2df8-871b-4064-b329-026af4d7ec1d) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:05:34] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841277 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8fa8366a-d3f2-4a77-8e2b-45de66551026) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:17:16] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841311 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2f1b90d9-2cd4-4705-bbf1-70fdacf169cd) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:30:01] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841411 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1049.eqiad.wmnet with OS bookworm
[11:03:29] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841540 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1049.eqiad.wmnet with OS bookworm completed: - mc1049 (**PASS**) - Dow...
[12:04:34] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841742 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2048.codfw.wmnet with OS bookworm
[12:05:35] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841757 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1048.eqiad.wmnet with OS bookworm
[12:40:33] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841894 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1048.eqiad.wmnet with OS bookworm completed: - mc1048 (**PASS**) - Dow...
[12:45:38] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2048.codfw.wmnet with OS bookworm completed: - mc2048 (**PASS**) - Dow...
[12:55:08] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841973 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1047.eqiad.wmnet with OS bookworm
[12:55:14] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9841975 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2047.codfw.wmnet with OS bookworm
[12:59:42] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841984 (CDanis) >>! In T366094#9841558, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/OwwWxI8BGiVuUzOd3n4x} [2024-05-29T11:23:04Z]...
[13:05:34] serviceops, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9841998 (MoritzMuehlenhoff)
[13:14:05] effie: o/
[13:14:33] thanos-swift is running PKI, all good on tegola side
[13:15:30] that is great news !!!
[13:15:30] I found something interesting though - do you recall the rise in CPU usage that we were seeing the other day? It is back, and after messing a bit with the tegola dashboard (I've tweaked a little the cpu saturation panel) I found out that it is the pregeneration container that uses the CPU, and gets throttled
[13:16:02] I am very ignorant about its role and why/when it is created though
[13:16:24] see https://w.wiki/ADs2
[13:22:46] elukey: this is a cron job that runs daily
[13:22:56] so it is ok if it is slowed down or whatever
[13:24:09] super, then I think we are good
[13:24:18] going to write a summary in the task and then I'll close it
[13:27:45] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842070 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1047.eqiad.wmnet with OS bookworm completed: - mc1047 (**PASS**) - Dow...
[13:29:53] elukey: thank you, this has been a really long staring contest
[13:29:57] and I am happy you blinked :p
[13:30:31] :)
[13:30:39] I was able to write some Go so I am happy
[13:31:12] serviceops, Content-Transform-Team, Essential-Work, Patch-For-Review, Wikimedia-Incident: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324#9842089 (elukey) Open→Resolved Thanos Swift TLS certs migrated to CFSSL/PKI, n...
[13:36:26] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842115 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2047.codfw.wmnet with OS bookworm completed: - mc2047 (**PASS**) - Dow...
[14:00:16] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842253 (VRiley-WMF) Worked with Dell on kafka-main1009, we were able to replace some of the parts (Power Interface Board, and Right Control Panel) Which go...
[14:01:35] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842257 (akosiaris) I 've gone ahead and created the following dashboard today [T366094](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716974133223...
[14:09:15] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842288 (akosiaris) >>! In T366094#9840665, @CDanis wrote: Thanks for writing down all of this. > ===== This was a capacity crunch triggered by expensive operations > * For the past few months...
[14:13:37] serviceops, SRE, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842327 (akosiaris) >>! In T366094#9841984, @CDanis wrote: >>>! In T366094#9841558, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/...
[14:26:45] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842415 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1046.eqiad.wmnet with OS bookworm
[14:26:48] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842416 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2046.codfw.wmnet with OS bookworm
[14:27:38] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842432 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[14:58:18] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842496 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1046.eqiad.wmnet with OS bookworm completed: - mc1046 (**PASS**) - Dow...
[15:07:28] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842518 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2046.codfw.wmnet with OS bookworm completed: - mc2046 (**PASS**) - Dow...
[15:09:14] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[15:29:36] serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9842704 (Jdforrester-WMF)
[15:55:18] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye executed...
[15:55:54] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842874 (akosiaris) The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick look.
[15:57:26] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842877 (akosiaris) >>! In T363212#9842874, @akosiaris wrote: > The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick loo...
[16:01:18] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin2002 for host mc2045.codfw.wmnet with OS bookworm
[16:01:21] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9842894 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1002 for host mc1045.eqiad.wmnet with OS bookworm
[16:24:35] serviceops, Structured-Data-Backlog, Thumbor: Thumbor's use of poolcounter is rate limiting Kubernetes IPs - https://phabricator.wikimedia.org/T339863#9843047 (MarkTraceur)
[16:28:21] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[16:30:11] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843083 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[16:33:18] effie: hnowlan: jayme: claime: akosiaris: hello all anyone will be willing to work with me on this task https://phabricator.wikimedia.org/T361856 before row C/D switch migration ?
[16:34:38] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9843110 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1002 for host mc1045.eqiad.wmnet with OS bookworm completed: - mc1045 (**PASS**) - Dow...
[16:40:46] serviceops, Patch-For-Review: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9843171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin2002 for host mc2045.codfw.wmnet with OS bookworm completed: - mc2045 (**PASS**) - Dow...
[17:05:25] serviceops, Cassandra, Data Products, SRE, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9843292 (Scott_French)
[18:47:34] serviceops, DC-Ops, ops-eqiad, SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843636 (wiki_willy) a:VRiley-WMF
[18:48:29] serviceops, DC-Ops, ops-eqiad, SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843643 (wiki_willy)
[18:49:00] serviceops, DC-Ops, ops-codfw, SRE: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843647 (wiki_willy)
[18:49:28] serviceops, DC-Ops, ops-codfw, SRE: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843655 (wiki_willy)
[18:49:35] serviceops, DC-Ops, ops-eqiad, SRE: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843656 (wiki_willy)
[22:00:47] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844395 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[22:01:54] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844410 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[22:05:08] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844420 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed...
[22:17:02] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed...
[22:18:55] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844468 (Jclark-ctr)
[23:02:32] serviceops, SRE: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9844588 (CDanis) >>! In T366094#9842327, @akosiaris wrote: > I am gonna disagree on this one. [This](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716910376624&to=171691...