[02:04:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:04:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:13:58] I don't think this alert should be here, reported at https://phabricator.wikimedia.org/T350694#9358223 FYI
[08:16:41] thx!
[08:44:28] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (Volans) Interesting, I can confirm that on netbox-next admin the user `taavi` doesn't have any groups associated and as such doesn't have the additional privileges. But looking at the `op...
[08:44:48] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (Volans) a:Volans→None
[10:03:26] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[10:48:26] (SystemdUnitFailed) firing: (2) netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:01] moritzm: is ganeti codfw-test cluster still WIP? ^^^
[10:54:12] yeah, currently in a meeting, will do the missing parts when done
[10:55:04] ack no prob was just to be sure it was still wip
[10:56:17] volans, topranks, I'd be interested in your feedback on https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/976749 (as well as traffic and serviceops) :)
[11:01:14] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1-Q2): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) Thanks @jcrespo -- I'm not sure who did the upgrade, but I checked in debmonitor and 0.8.3 is now installed on all cloud hosts.
[11:02:17] XioNoX: idea is great! I am on leave today I’ll take a closer look at the patch tomorrow
[11:02:46] no pb, enjoy your break
[11:10:57] XioNoX: {done}
[11:11:20] volans: thanks! any preference on the approach? (what to store in netbox)
[11:12:41] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1-Q2): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) I'm afraid you don't have the latest version, https://debmonitor.wikimedia.org/packages/python3-wmfbackups you should upgrade t...
[11:13:17] I don't like the hardcoded mapping but that's ok for now, I guess we'll move it out once we'll have the right place where to put it
[11:14:03] as for the bgp setting, I guess it depends on how much flexibility and plan to expand that you have
[11:14:48] yeah, that's why it also depends on ServiceOps
[11:18:26] (SystemdUnitFailed) firing: (2) netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:26] (SystemdUnitFailed) resolved: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:53] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:59] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (jbond) I'll leave this to @SLyngshede-WMF as I'm guessing they have been experimenting with migrating netbox to OIDC {T308002}
[12:52:11] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (SLyngshede-WMF) a:SLyngshede-WMF
[12:52:38] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (taavi) >>! In T351950#9358295, @Volans wrote: > @taavi could you please check if going to `idp-test.wikimedia.org` and checking the attributes they look the same of idp and have the ops g...
[14:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[14:26:15] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1-Q2): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) Good catch, I only checked if there was anything on <0.8.3 and didn't notice the `u1` vs `u2` difference! I have now upgraded a...
[14:50:13] Packaging, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1-Q2): wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) Thank you, and sorry for the urgency- normally these kind of packages always keep backwards compatibility (and they did here to...
[14:56:58] SRE-tools, Infrastructure-Foundations, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (Volans) @JMeybohm could you confirm the above or give me more context?
[15:32:49] SRE-tools, Observability-Logging: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929 (joanna_borun)
[15:33:41] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (MoritzMuehlenhoff) a:MoritzMuehlenhoff I'm taking this one, for coordination and partly implementing myself.
[15:35:07] moritzm: ack thanks!
[15:35:13] (from the meeting)
[15:35:39] :-)
[15:35:56] perfect Google Chat -> IRC bridge
[15:37:51] rotfl
[15:49:20] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (jbond)
[15:55:14] SRE-tools, Infrastructure-Foundations, Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (joanna_borun) p:Triage→Medium
[15:58:18] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: don't write logs to disk - https://phabricator.wikimedia.org/T342079 (joanna_borun) p:Triage→Low
[15:59:14] SRE-tools, Infrastructure-Foundations, Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (joanna_borun) p:Triage→High
[16:01:04] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:04:20] SRE-tools, Infrastructure-Foundations, Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (joanna_borun) Open→Declined
[16:07:36] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (joanna_borun) Open→Declined
[16:15:24] SRE-tools, Data-Persistence, Infrastructure-Foundations, Spicerack, and 3 others: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (jbond)
[16:20:48] was running agent on A:netbox and saw:
[16:20:49] Warning: The directory '/srv/reposync/netbox-hiera' contains 2734 entries, which exceeds the default soft limit 1000 and may cause excessive resource consumption and degraded performance. To remove this warning set a value for `max_files` parameter or consider using an alternate method to manage large directory trees
[16:20:54] just as an FYI
[16:23:28] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (joanna_borun) p:High→Low
[16:25:30] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr)
[16:27:00] sukhe: i think moritzm has a task for that tl;dr for now you can ignore it
[16:27:08] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr)
[16:27:34] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr) Stalled→Resolved
[16:29:13] haven't opened a task yet, but will do sometime this week (max_files, this also affects the number of facts on some hosts)
[16:29:35] np, just wanted to share. if you want, I can put this in a task as well
[16:29:41] [it didn't block anything on my end]
[16:29:55] I'll take care of it, thanks
[16:34:53] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[20:38:26] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[22:14:29] Packaging, Diffusion-Repository-Administrators, Infrastructure-Foundations, Performance-Team, and 2 others: Consider archiving Gerrit repository "operations/software/sentry" (20150926) - https://phabricator.wikimedia.org/T352108 (Aklapper)
[22:20:15] Packaging, Diffusion-Repository-Administrators, Infrastructure-Foundations, Performance-Team, and 2 others: Consider archiving Gerrit repository "operations/software/sentry" (20150926) - https://phabricator.wikimedia.org/T352108 (Tgr) Thanks. It's unused, and won't be needed in the future since S...