[07:17:13] serviceops, PoolCounter, Performance-Team (Radar), Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Joe) a:Joe
[07:18:06] serviceops, PoolCounter, Performance-Team (Radar), Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Joe) I frankly prefer to have an alert when a component isn't working, not when it's perceived as not working from one of its c...
[08:01:37] serviceops, SRE, Patch-For-Review, Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (Joe) I think there is a larger topic of moving etcd to use the new PKI certs. There has been some work in that direction but I think t...
[09:22:24] serviceops, DC-Ops, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Clement_Goubert) Thanks !
[09:53:34] serviceops, observability, Sustainability (Incident Followup), User-Joe, User-jijiki: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (Joe) We're already alerting on disk space for all servers, not sure why this would be differe...
[10:44:07] serviceops, PoolCounter, SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, and 2 others: Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Joe)
[10:45:06] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup), and 2 others: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (Joe) a:Joe
[10:51:28] serviceops, Maps: Upgrade maps servers to bullseye - https://phabricator.wikimedia.org/T327513 (MoritzMuehlenhoff) Also related (since it makes sense to remove those before moving to new servers): https://phabricator.wikimedia.org/T298246 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/760619
[11:03:06] serviceops, SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (Joe)
[11:07:20] serviceops, SRE-Sprint-Week-Sustainability-March2023, Continuous-Integration-Config, Regression, Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (Joe) Open→Invalid a:Joe We've dismissed...
[11:09:08] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (Joe) a:Joe
[11:25:46] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (akosiaris)
[11:32:28] serviceops, Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (jijiki) In progress→Resolved Closing, I will add the URL of the relevant documentation when I finish writing it
[11:48:05] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE-Sprint-Week-Sustainability-March2023, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (Volans) I've spoken with the people involved, and the original request has been m...
[12:27:13] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (Joe)
[14:21:28] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Had a chat with Joe, the idea is to have one node reimaged (so that we can confirm that everything works etc..) leaving the rest of the cluster(s) untouched. I think that moving to PKI is not doable, there are sti...
[15:11:42] serviceops, Maps: Upgrade maps servers to bullseye - https://phabricator.wikimedia.org/T327513 (jhathaway) a:jhathaway
[15:22:36] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (Joe) Removing the sustainability tag as it doesn't seem like there is any related actionable here. @Clement_Goubert if...
[15:24:38] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (Joe) I guess this task is surely in the "serviceops" area, but probably @Eevans has the most experience being one of the o...
[15:25:33] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (Volans)
[15:44:28] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Add failure rate triggered rollback to scap - https://phabricator.wikimedia.org/T317405 (Volans)
[16:17:38] serviceops, Commons, MediaWiki-File-management, SRE, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) >>! In T266155#8707579, @doctaxon wrote: > @TheDJ thanks for your comment. These 429 errors "n...
[17:03:44] serviceops, MW-on-K8s, Scap, Patch-For-Review: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (jnuche) a:jnuche
[17:22:54] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Scap, Sustainability (Incident Followup): Add etcdmirror status check to scap - https://phabricator.wikimedia.org/T317403 (Joe)
[17:26:08] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Wikimedia-Incident: Add etcdmirror connection retry on etcd-tls-proxy unavailability - https://phabricator.wikimedia.org/T317535 (Joe) >What I don't understand is why the python etcd lib client would fail on connection to only one...
[17:26:44] serviceops, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Wikimedia-Incident: Update Etcd/Main cluster#Replication documentation with safe restart conditions and information - https://phabricator.wikimedia.org/T317537 (Joe)
[18:08:12] serviceops, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe) Either solution proposed in this task is not currently supported by Extension:TimeMediaHandler, see https://github.com/wikimedia/mediawiki...
[20:26:51] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (Eevans) >>! In T320398#8710536, @Joe wrote: > I guess this task is surely in the "serviceops" area, but probably @Eevans h...