[11:32:29] <wikibugs>	 serviceops, SRE: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (hnowlan) >>! In T355117#9462317, @jnuche wrote: > Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It also takes care of deleting the older vers...
[11:38:20] <wikibugs>	 serviceops, SRE: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (jnuche) >>! In T355117#9465167, @hnowlan wrote: >>>! In T355117#9462317, @jnuche wrote: >> Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It a...
[11:39:32] <wikibugs>	 serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:06] <wikibugs>	 serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:00:23] <wikibugs>	 serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM ` mw...
[12:01:16] <wikibugs>	 serviceops, SRE, Patch-For-Review: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/204  prune old inactive branches as first step of staging a train
[12:01:18] <wikibugs>	 serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Clement_Goubert) Resolved→Open Reopening this task since hardware failures for this server happened very close to each other. `mw2394` crashed this morning due to a DIMM error  `---------------------...
[12:38:27] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye
[13:19:25] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye completed: - mw2267 (**PASS**)   - Downt...
[13:28:16] <wikibugs>	 serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Jhancock.wm) I will check on this this morning. thank you for depooling
[13:39:48] <wikibugs>	 serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) As of today, all non-private wikis featuring the cirrussearch extension publish page_reren...
[14:23:49] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[14:25:37] <wikibugs>	 serviceops, CirrusSearch, Discovery-Search, Data-Platform-SRE (2023/24 Q3 Milestone 1): Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[15:13:30] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)   - Downtimed on Ici...
[15:16:58] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[15:26:15] <wikibugs>	 serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:49:34] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)   - Removed from Pup...
[15:50:01] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:04:59] <wikibugs>	 serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Jhancock.wm) I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell. SR183504113
[16:08:11] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert)
[16:08:45] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Clement_Goubert)
[16:09:11] <wikibugs>	 serviceops, MW-on-K8s, SRE: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) Open→In progress p:Triage→High
[16:09:20] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) Reverting to bare metal in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991377
[16:23:32] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)   - Removed from Pup...
[16:23:56] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:39:25] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)   - Removed from Pup...
[16:39:39] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:48:08] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**)   - Removed from Pup...
[16:48:55] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye
[16:56:40] <wikibugs>	 serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (kostajh)
[17:00:16] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye
[17:00:25] <wikibugs>	 serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (kostajh) Maybe related to {T289766}?
[17:19:08] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Jdforrester-WMF)
[17:19:13] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Jdforrester-WMF) p:High→Unbreak!
[17:20:43] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) As we can see on a [[ https://logstash.wikimedia.org/goto/aa282fd4b9efeb635c3767593fb2f58c | wider log view ]] errors coincide with tr...
[17:25:45] <wikibugs>	 serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Clement_Goubert) Thank you @Jhancock.wm
[17:29:13] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye completed: - mw2357 (**PASS**)   - Downtimed on Icinga/Alertma...
[17:30:05] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (kamila)
[17:30:15] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (kamila) Open→Resolved All traffic is now going to k8s \o/  I will keep an eye on php workers saturation, but it should be fine, so I'm calling it resolved.
[17:39:14] <wikibugs>	 serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye completed: - mw2395 (**PASS**)   - Downtimed on Icinga/Alertma...
[18:16:59] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) I see the errors still showing up in prod. By looking at Scap's code, I think redeploying the train to group1 should make this change go to pro...
[18:26:59] <wikibugs>	 serviceops, conftool: requestctl should fail with error if fails parsing yaml file - https://phabricator.wikimedia.org/T355256 (Fabfur)
[18:37:20] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) After digging a bit more, I think a simpler `sync-world` with a few flags will be enough to deploy this Helm config change. I'll try it in a bi...
[19:04:43] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) Error rate seems unaffected after deploying the configuration change: https://logstash.wikimedia.org/goto/5f6d40ce5bbf6313e2fcec6ccc28ea51 :( P...
[19:14:35] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) I have a commitment soon and need to stop for the day. I've asked the backup conductor @jeena to follow up on this.
[19:22:59] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jeena) isn't a deployment of changeprop in kubernetes needed here? I don't think scap does this.
[20:42:56] <wikibugs>	 serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 3 others: Create new NetworkSession mediawiki extension - https://phabricator.wikimedia.org/T354976 (EBernhardson) Went through https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment to make sure we've done what's ne...
[21:15:06] <kostajh>	 anyone around to help with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991416/2/helmfile.d/services/ipoid/values.yaml?
[21:15:26] <kostajh>	 the updated cron schedule doesn't show up in helmfile output
[21:49:20] <kamila_>	 kostajh: helmfile seems to think it's already deployed with `schedule: "0 8,13,18 * * *"`, is that possible?
[21:50:20] <kostajh>	 kamila_: it’s definitely possible… I just didn’t see any diff in the helmfile apply command
[21:50:47] <kamila_>	 I see a chart version update but no diff for the schedule because that's already the same
[21:50:49] <kamila_>	 which is interesting
[21:51:25] <kamila_>	 wait
[21:54:10] <kamila_>	 there was a deployment at 21:06UTC, which is a few minutes after that change was merged
[21:55:10] <kamila_>	 if you didn't deploy it, it's possible it went for a ride with someone else's deploy...
[21:55:57] <kostajh>	 I did a chart version update but didn’t deploy it because it is apparently a no op
[21:56:01] <kamila_>	 no, SAL thinks that was you
[21:56:14] <kamila_>	 mhm
[21:56:28] <kostajh>	 If you ctrl-c out of the deployment, it still looks like you deployed in SAL
[21:56:39] <kamila_>	 oh, right
[21:56:42] <kamila_>	 weird
[21:56:51] <kamila_>	 because helm history thinks it did get deployed '^^
[21:57:09] <kamila_>	 `25      	Wed Jan 17 21:06:30 2024	deployed  	ipoid-0.2.3	           	Upgrade complete`
[21:57:10] <kostajh>	 I need to sign off and will pick up this thread tomorrow. Thanks for your help!
[21:57:21] <kostajh>	 Ah
[21:57:29] <kostajh>	 I deployed the version bump of the image
[21:57:49] <kostajh>	 Then saw there was no update to the cron in the diff
[21:58:19] <kostajh>	 Then made a chart.yaml version (0.2.4) update patch, thinking that might update the cron schedule
[21:58:31] <kostajh>	 Then I canceled on deploying that
[21:59:15] <kamila_>	 then I'm guessing that the schedule might have gotten updated together with the image
[21:59:48] <kamila_>	 checking...
[22:00:24] <kamila_>	 yep
[22:00:36] <kamila_>	 the previous deployment had the old schedule, the 21:06 one the new