[07:57:56] serviceops, Data-Persistence (work done), SRE-Sprint-Week-Sustainability-March2023, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Joe) I think there might be valid reasons to have one datacenter read-only gl...
[07:59:45] serviceops, Performance-Team (Radar): Incident: 2022-09-08 codfw appservers degradation - https://phabricator.wikimedia.org/T317340 (Joe) Is there anything specific about this task that is actionable?
[08:00:34] serviceops, SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe)
[08:02:22] serviceops, Patch-For-Review: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Next steps: 1) Try the new reuse partman recipe for one of kafka-main[12]00[45]. It uses the test.cfg scheme for the moment, so it will need a manual confirmation in d-i (to verify that the...
[08:04:05] serviceops, ChangeProp, SRE-Sprint-Week-Sustainability-March2023, envoy, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (Joe) I'm not sure this is really actionable without any number attached. We alr...
[08:13:39] serviceops, SRE-Sprint-Week-Sustainability-March2023, Kubernetes, Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (Joe) @akosiaris anything left to do for this task? I would assume you...
[08:18:27] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) I see...
[08:21:51] serviceops, Observability-Logging, WMF-General-or-Unknown, Performance-Team (Radar): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (Joe) Not sure why this task was marked "Incident followup".
[08:22:42] serviceops, ChangeProp, Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (Joe) Open→Resolved a:Joe Nothing actionable left on this task.
[08:22:56] serviceops, ChangeProp, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (Joe)
[08:23:48] serviceops, SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Joe)
[08:24:26] serviceops, SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Joe) Open→Resolved a:Joe
[08:29:51] serviceops, SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (Joe) Given videoscaling happens asynchronously o...
[08:41:38] serviceops, Platform Engineering Roadmap Decision Making, SRE-Sprint-Week-Sustainability-March2023, Traffic, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Joe) Open→Declined The task was more or less refused by the owners of the subs...
[08:41:40] eoghan: thanks so much for the Apache/PHP error logging for the doc hosts :]
[08:42:18] I think I remember something about having the Apache logs to respect the ECS logging format, but I could not find any trace of that in my notes. I guess we will see what happens when it is deployed
[09:04:45] serviceops, SRE-Sprint-Week-Sustainability-March2023, Thumbor, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Joe)
[09:05:41] serviceops, SRE-Sprint-Week-Sustainability-March2023, Thumbor, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Joe) Open→Declined While this task is definitely too big for spr...
[09:08:35] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (Joe) I think that with the new structure we've put in place for mcrouter we don't...
[09:08:45] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (Joe) Open→Declined
[09:08:53] serviceops, SRE, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe)
[09:11:36] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (Joe) Open→Resolved a:Joe I think this task was completed. Feel free to reopen if that's not the case.
[09:11:40] serviceops, SRE, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe)
[09:13:28] serviceops, ChangeProp, SRE-Sprint-Week-Sustainability-March2023, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (Joe)
[09:20:35] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (Joe)
[09:26:20] serviceops, Prod-Kubernetes, Kubernetes: Scrape controller-manager and scheduler metrics - https://phabricator.wikimedia.org/T324959 (jijiki) a:jijiki
[09:31:42] hello folks
[09:31:50] o/
[09:32:01] I had a chat with Joe earlier on, I'd like to attempt a reimage of kafka-main1005 if people are ok
[09:32:21] I created a new recipe for that node, that should keep /srv etc..
[09:32:37] kafka-main[12]00[45] have a more modern layout so I cannot use the old reuse recipe
[09:33:00] I may need to keep the node down for a couple of tries, since the new recipe may not work at first attempt
[09:33:28] I set everything to use the reuse-test.cfg script, so d-i waits for confirmation before applying the partitioning
[09:33:46] (so worst case we abort and reboot)
[09:34:43] the other kafka main eqiad nodes should be ok in running without 1005 for a while
[09:34:50] if you are ok I'll proceed in a bit
[09:34:54] serviceops, MW-on-K8s, SRE, Shellbox, Platform Team Workboards (Purple): Make Shellbox actually do streaming - https://phabricator.wikimedia.org/T268427 (Aklapper) a:tstarling→None @tstarling: Removing task assignee as this open task has been assigned for more than two years - See the...
[09:36:29] sounds well thought out to me
[09:37:55] thanks! Going to start in a few then, I'll report progess here
[09:40:04] serviceops, SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (jbond) >>! In T307382#8708705, @Joe wrote: > I think there is a larger topic of moving etcd to us...
[09:43:23] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye
[09:47:19] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (cmooney)
[09:56:46] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Vgutierrez)...
[10:05:40] hashar: Ah cool, I didn't see we had a custom formatter for ecs available, that's great. I'll update the change to include that later on.
[10:09:40] eoghan: yeah I couldn't find it yesterday and eventually I found it via a `git grep --all-matches --grep ecs --grep gerrit` on Puppet which gave me cwhite changes to convert the Gerrit Apache log to ECS logging :]
[10:09:54] (he also added some magic for the Gerrit java logs to be ecs aware \o/)
[10:14:18] I am restoring kafka-main1005 back into service, upgrade failed, there seems to be an issue with the NIC and DHCP (sigh).
[10:18:24] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) The upgrade of 1005 failed due to a DHCP issue in d-i. The task is blocked on T304483
[10:20:37] serviceops, MW-on-K8s, Scap: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (jnuche) @Clement_Goubert I will deploy the a new release with these changes this afternoon. There are two things: * `--stop-before-syn...
[10:34:06] serviceops, MW-on-K8s, Scap: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (Clement_Goubert) >>! In T332187#8713994, @jnuche wrote: > @Clement_Goubert I will deploy a new release with these changes this afternoon....
[10:37:05] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (Joe) Open→Resolved
[10:37:39] serviceops, MW-on-K8s, Scap: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (Clement_Goubert) Documentation at https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#How_to_deploy_MediaWiki_on_Kubernetes updated.
[10:52:41] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (Joe) a:Joe This task is so sparse, and so much time has passed, that I'm not sure what the point is h...
[11:42:42] serviceops, PoolCounter, SRE-Sprint-Week-Sustainability-March2023, Performance-Team (Radar), Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Joe) Open→Resolved
[11:58:56] serviceops, MW-on-K8s, Scap: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (jnuche) > Documentation at https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#How_to_deploy_MediaWiki_on_Kubernetes updated. Th...
[12:06:46] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (eoghan) a:eoghan
[12:07:19] serviceops, SRE-Sprint-Week-Sustainability-March2023, Kubernetes, Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (akosiaris) Open→Resolved a:akosiaris Nope, resolving it.
[12:11:26] serviceops, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (akosiaris) Note that we also have taints on the dedicated to sessionstore nodes (albeit marked as kask, to avoid having other thing...
[13:09:45] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (Joe) Open→Resolved We already added such an alert (porting it from check_prometheus) that is also...
[13:13:56] hello folks, I have stopped kafka on kafka1005, working with Rob to update bios/idrac/nic/etc..
[13:14:08] hopefully this should unblock the reimage
[13:27:52] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (Papaul) @elukey yes there is a nic upgrade/downgrade to do if the server is using 10G nic. so if you have the firmware of the the nic at any version below 21.85, you need to upgrade it to 21.85. There is a cookbook to do...
[13:35:21] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[14:23:17] serviceops, MW-on-K8s, Scap: Add a flag to scap to force updating the /etc/helmfile-defaults/mediawiki/release/* files - https://phabricator.wikimedia.org/T332187 (jnuche) Open→Resolved Deployed now.
[14:30:29] serviceops, SRE-Sprint-Week-Sustainability-March2023, observability, Sustainability (Incident Followup), and 2 others: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (Joe) Open→Resolved
[14:31:49] serviceops, SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (Joe) a:Joe
[14:32:25] serviceops, SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe) a:Joe
[14:38:00] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye executed with errors: - kafka-main1005 (**FAIL**) - Downtimed on Icinga/Alertmana...
[14:38:16] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye
[15:02:34] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye executed with errors: - kafka-main1005 (**FAIL**) - Downtimed on Icinga/Alertmana...
[15:04:42] serviceops, Patch-For-Review: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) >>! In T332013#8714660, @Papaul wrote: > @elukey yes there is a nic upgrade/downgrade to do if the server is using 10G nic. so if you have the firmware of the the nic at any version below 21....
[15:08:31] kafka-main1005 is still down, with the help of Rob and Papaul I fixed the dhcp issue, now I am testing the partman recipe
[15:10:49] serviceops, Patch-For-Review: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye
[15:18:01] finally 1005 is being reimaged
[15:22:44] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan)
[15:22:57] serviceops, Maps, Product-Infrastructure-Team-Backlog-Deprecated, Patch-For-Review, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (hnowlan)
[15:23:05] serviceops, Maps, Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), and 2 others: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (hnowlan) In progress→Resolved
[15:49:58] kafka-main1005 up and running with bullseye!
[15:50:10] it is very painful to upgrade a node, but there is a way :D
[15:50:17] I'll do 1004 tomorrow
[15:52:06] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye completed: - kafka-main1005 (**PASS**) - Removed from Puppet and PuppetDB if pres...
[15:53:38] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Status: * kafka-main1005 on bullseye * kafka-main[12]00[45] need idrac+nic+bios upgrades via cookbook before reimaging. Next inline: kafka-main1004
[15:57:25] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) >>! In...
[16:04:48] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Vgutierrez)...
[16:15:14] elukey: ❤️
[17:23:07] serviceops, Commons, MediaWiki-File-management, SRE, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (doctaxon) @TheDJ thanks a lot