[11:32:29] serviceops, SRE: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (hnowlan) >>! In T355117#9462317, @jnuche wrote: > Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It also takes care of deleting the older vers...
[11:38:20] serviceops, SRE: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (jnuche) >>! In T355117#9465167, @hnowlan wrote: >>>! In T355117#9462317, @jnuche wrote: >> Every Tuesday morning, an automated process runs to presync new MW versions to hosts. It a...
[11:39:32] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:44:06] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:00:23] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=87150827-7740-4075-ada6-a08469c8b7f6) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bad DIMM ` mw...
[12:01:16] serviceops, SRE, Patch-For-Review: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (CodeReviewBot) jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/204 prune old inactive branches as first step of staging a train
[12:01:18] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Clement_Goubert) Resolved→Open Reopening this task since hardware failures for this server happened very close to each other. `mw2394` crashed this morning due to a DIMM error `---------------------...
[12:38:27] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye
[13:19:25] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2267.codfw.wmnet with OS bullseye completed: - mw2267 (**PASS**) - Downt...
[13:28:16] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Jhancock.wm) I will check on this this morning. thank you for depooling
[13:39:48] serviceops, Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) As of today, all non-private wikis featuring the cirrussearch extension publish page_reren...
[14:23:49] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[14:25:37] serviceops, CirrusSearch, Discovery-Search, Data-Platform-SRE (2023/24 Q3 Milestone 1): Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 (bking)
[15:13:30] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Downtimed on Ici...
[15:16:58] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[15:26:15] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:49:34] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[15:50:01] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:04:59] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Jhancock.wm) I moved the DIMM to a different slot and the error moved with it. I've put in a dispatch with Dell.
SR183504113
[16:08:11] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert)
[16:08:45] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Clement_Goubert)
[16:09:11] serviceops, MW-on-K8s, SRE: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) Open→In progress p:Triage→High
[16:09:20] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) Reverting to bare metal in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991377
[16:23:32] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:23:56] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:39:25] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:39:39] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:48:08] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:48:55] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye
[16:56:40] serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (kostajh)
[17:00:16] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye
[17:00:25] serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (kostajh) Maybe related to {T289766}?
[17:19:08] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Jdforrester-WMF)
[17:19:13] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Jdforrester-WMF) p:High→Unbreak!
[17:20:43] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Clement_Goubert) As we can see on a [[ https://logstash.wikimedia.org/goto/aa282fd4b9efeb635c3767593fb2f58c | wider log view ]] errors coincide with tr...
[17:25:45] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Clement_Goubert) Thank you @Jhancock.wm
[17:29:13] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2357.codfw.wmnet with OS bullseye completed: - mw2357 (**PASS**) - Downtimed on Icinga/Alertma...
[17:30:05] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (kamila)
[17:30:15] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (kamila) Open→Resolved All traffic is now going to k8s \o/ I will keep an eye on php workers saturation, but it should be fine, so I'm calling it resolved.
[17:39:14] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2395.codfw.wmnet with OS bullseye completed: - mw2395 (**PASS**) - Downtimed on Icinga/Alertma...
[18:16:59] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) I see the errors still showing up in prod. By looking at Scap's code, I think redeploying the train to group1 should make this change go to pro...
[18:26:59] serviceops, conftool: requestctl should fail with error if fails parsing yaml file - https://phabricator.wikimedia.org/T355256 (Fabfur)
[18:37:20] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) After digging a bit more, I think a simpler `sync-world` with a few flags will be enough to deploy this Helm config change. I'll try it in a bi...
[19:04:43] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) Error rate seems unaffected after deploying the configuration change: https://logstash.wikimedia.org/goto/5f6d40ce5bbf6313e2fcec6ccc28ea51 :( P...
[19:14:35] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) I have a commitment soon and need to stop for the day. I've asked the backup conductor @jeena to follow up on this.
[19:22:59] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jeena) isn't a deployment of changeprop in kubernetes needed here? I don't think scap does this.
[20:42:56] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 3 others: Create new NetworkSession mediawiki extension - https://phabricator.wikimedia.org/T354976 (EBernhardson) Went through https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment to make sure we've done what's ne...
[21:15:06] anyone around to help with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991416/2/helmfile.d/services/ipoid/values.yaml?
[21:15:26] the updated cron schedule doesn't show up in helmfile output
[21:49:20] kostajh: helmfile seems to think it's already deployed with `schedule: "0 8,13,18 * * *"`, is that possible?
[21:50:20] kamila_: it’s definitely possible… I just didn’t see any diff in the helmfile apply command
[21:50:47] I see a chart version update but no diff for the schedule because that's already the same
[21:50:49] which is interesting
[21:51:25] wait
[21:54:10] there was a deployment at 21:06UTC, which is a few minutes after that change was merged
[21:55:10] if you didn't deploy it, it's possible it went for a ride with someone else's deploy...
[21:55:57] I did a chart version update but didn’t deploy it because it is apparently a no op
[21:56:01] no, SAL thinks that was you
[21:56:14] mhm
[21:56:28] If you ctrl-c out of the deployment, it still looks like you deployed in SAL
[21:56:39] oh, right
[21:56:42] weird
[21:56:51] because helm history thinks it did get deployed '^^
[21:57:09] `25 Wed Jan 17 21:06:30 2024 deployed ipoid-0.2.3 Upgrade complete`
[21:57:10] I need to sign off and will pick up this thread tomorrow. Thanks for your help!
[21:57:21] Ah
[21:57:29] I deployed the version bump of the image
[21:57:49] Then saw there was no update to the cron in the diff
[21:58:19] Then made a chart.yaml version (0.2.4) update patch, thinking that might update the cron schedule
[21:58:31] Then I canceled on deploying that
[21:59:15] then I'm guessing that the schedule might have gotten updated together with the image
[21:59:48] checking...
[22:00:24] yep
[22:00:36] the previous deployment had the old schedule, the 21:06 one the new