[00:08:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:05] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9582344 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1003.wikimedia.org with OS bookworm executed with errors: - idp...
[04:18:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:29] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:54] slyngs: FYI there are pending changes in the dns from netbox for idp-test2003, is that you?
[08:50:19] volans: Yes sorry
[08:51:37] Better?
[08:52:08] icinga takes a bit to detect that :D
[08:52:43] I'll only take the blame for starting makevm and forgetting about it :-)
[08:53:08] :D
[08:54:47] slyngs: let us know when you're done so we can do a spicerack release
[08:55:19] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9582674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test2003.wikimedia.org with OS bookworm
[08:55:37] Will do
[09:23:43] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9582714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test2003.wikimedia.org with OS bookworm executed with errors: - idp...
[09:24:54] XioNoX: Done
[09:25:00] thx!
[09:37:08] slyngs: do you have any more vms to create ? :)
[09:37:20] Not today
[09:37:48] testvm it is then :)
[09:45:44] slyngs: I pushed the hiera/netbox change for idp-test (it asked me during my vm provisioning) not sure if there was some kind of race condition ? (cc volans )
[09:46:21] not race condition but someone not showing up during the proper step
[09:46:21] Thanks, maybe because I took like an hour to type "Go" in the cookbook
[09:46:32] haha :)
[09:48:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:30] gosh alertmanager, 22 issues (reported 44 because of the duplicate bug), 20 of those are on pki1001, and it reports netbox as the failing one :/
[09:51:27] A patch is on the way for the 20/40
[09:51:52] slyngs: is there any benefit on aggregating the SystemdUnitFailed? I understand aggregating the same unit across multiple hosts
[09:51:55] volans: is there a task for that ? I kind of stopped looking at the alerts because of it :(
[09:52:01] but aggregating various units across various hosts
[09:52:02] seems weird
[09:52:13] it's on us or o11y?
[09:52:18] this specific one
[09:52:45] The bug is mine, the aggregation is something we need to have o11y help with
[09:53:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:58:36] ok
[09:58:49] let me report it
[10:00:26] So just looking briefly at how you can aggregate alerts in AlertManager, I think we might be able to group the alerts by name, that would aggregate cross hosts, but split them up so we don't get 44 firing because it's the same overall alert. I'll take a look a little later
[10:02:44] volans: could you get me the phab ticket when created?
[10:02:57] sure
[10:12:14] slyngs: https://phabricator.wikimedia.org/T358648
[10:13:03] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:30] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:43] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9582848 (MoritzMuehlenhoff) I tried a testbuild of CAS, but it seems CAS 6.6 still needs Java 11 to build (not run). So for now we'll need to keep one of the old idp-test* hosts...
[10:18:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:29:07] Okay, those alert messages are just not helpful. That last one is basically to inform us that we're down to somewhere between 4 and 8 failed systemd units
[10:43:47] Good morning. We would like to create a new LDAP group called `superset-admin`- See: https://phabricator.wikimedia.org/T358650
[10:44:37] Have you any concerns about this request? Are we OK just to create it with `ldapvi` or are there any other considerations that we would need to take into account? Thanks.
[11:19:19] looks good, also left a note on task
[11:21:08] instead of ldapvi you can also simply create an LDIF like the one below and use ldapmodify
[11:21:10] dn: cn=superset-admins,ou=groups,dc=wikimedia,dc=org
[11:21:11] cn: superset-admins
[11:21:13] objectClass: groupOfNames
[11:21:14] member: uid=foo01,ou=people,dc=wikimedia,dc=org
[11:21:16] member: uid=foo02,ou=people,dc=wikimedia,dc=org
[11:22:54] moritzm: Many thanks.
[11:25:22] -D "cn=admin,dc=wikimedia,dc=org" has the necessary permissions, password is in pwstore under openldap-labs
[11:41:37] Perfect, cheers.
[12:05:48] (PuppetZeroResources) firing: Puppet has failed generate resources on testvm2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:47:57] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, Patch-For-Review: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9583540 (MoritzMuehlenhoff) Open→Resolved I added a new Hiera option for this: profi...
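Editor's sketch for the LDIF suggested at 11:21: the entry could be generated from a list of uids instead of being typed by hand. This is only an illustration; the helper name `build_group_ldif` and its parameters are hypothetical, and the base DNs and placeholder uids are copied from the messages above.

```python
def build_group_ldif(cn, member_uids,
                     groups_base="ou=groups,dc=wikimedia,dc=org",
                     people_base="ou=people,dc=wikimedia,dc=org"):
    """Return LDIF text for a new groupOfNames entry (hypothetical helper)."""
    if not member_uids:
        # The standard groupOfNames object class requires at least one member.
        raise ValueError("groupOfNames needs at least one member")
    lines = [
        f"dn: cn={cn},{groups_base}",
        f"cn: {cn}",
        "objectClass: groupOfNames",
    ]
    # One member attribute per user, using the same DN layout as in the chat.
    lines += [f"member: uid={uid},{people_base}" for uid in member_uids]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # foo01/foo02 are the placeholder uids from the chat, not real accounts.
    print(build_group_ldif("superset-admins", ["foo01", "foo02"]), end="")
```

The output could be saved to a file and loaded with something like `ldapmodify -a -D "cn=admin,dc=wikimedia,dc=org" -W -f group.ldif`; the `-a`, `-W` and `-f` flags and the file name are assumptions about the invocation, since the chat only mentions `ldapmodify` and the `-D` bind DN.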
[14:28:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:56:06] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9583876 (JMeybohm)
[15:40:37] topranks: I've drained/depooled the serviceops stuff for T355871 - not sure if I should note that down somewhere
[15:40:47] T355871: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871
[15:41:08] jayme: all good thanks I'll mark it on the sheet I'm using to track thanks!
[15:41:17] thanks
[15:46:00] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9584222 (cmooney) Open→Resolved a:cmooney
[15:46:10] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9584224 (cmooney)
[15:49:37] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584239 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1f99f40e-0648-48d6-a40a-a3ebae9e7b2b) set by cmoon...
[16:04:39] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584297 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=691919af-8b8a-4f2d-b390-eea3c6a54f5c) set by cmoon...
[16:12:28] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584311 (cmooney) Works completed, all servers moved to the new switch and back responding to ping now. No issues.
[18:10:48] (PuppetFailure) firing: Puppet has failed on install1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:15:48] (PuppetFailure) firing: (3) Puppet has failed on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:17:49] (PuppetFailure) firing: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:20:48] (PuppetFailure) firing: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:22:48] (PuppetFailure) firing: (2) Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:25:48] (PuppetFailure) firing: (5) Puppet has failed on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:28:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:30:48] (PuppetFailure) firing: (6) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:35:48] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:35:53] (PuppetFailure) firing: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:40:48] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:45:49] (PuppetFailure) firing: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:50:48] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:55:49] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:48] (PuppetFailure) resolved: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:49] (PuppetFailure) firing: (6) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:02:49] (PuppetFailure) resolved: (2) Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:05:49] (PuppetFailure) firing: (6) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:15:48] (PuppetFailure) resolved: (3) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
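Editor's sketch for the grouping idea floated at 10:00:26: one reading of it is that notifications would be keyed on the alert name plus the unit name, so the same unit failing on several hosts is reported together while unrelated units are reported separately. The toy model below only imitates what Alertmanager's `group_by` route setting does. The label names (`alertname`, `name`, `instance`), the `example_*.service` units and the host `examplehost1001` are assumptions for illustration, not taken from the real alert rules.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the labels in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value in the group key.
        key = tuple(labels.get(label, "") for label in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts; only the netbox unit and host names come from the log.
ALERTS = [
    {"alertname": "SystemdUnitFailed",
     "name": "netbox_report_accounting_run.service",
     "instance": "netbox1002:9100"},
    {"alertname": "SystemdUnitFailed",
     "name": "example_a.service", "instance": "pki1001:9100"},
    {"alertname": "SystemdUnitFailed",
     "name": "example_b.service", "instance": "pki1001:9100"},
    {"alertname": "SystemdUnitFailed",
     "name": "example_a.service", "instance": "examplehost1001:9100"},
]

if __name__ == "__main__":
    for group_by in (("alertname",), ("alertname", "name")):
        print("group_by =", group_by)
        for key, members in group_alerts(ALERTS, group_by).items():
            hosts = sorted(m["instance"] for m in members)
            print("  ", key, "count:", len(members), "hosts:", hosts)
```

Grouping on `alertname` alone puts every failed unit into a single notification headed by whichever alert happens to come first, which matches the complaint at 09:50:30. Adding the unit label yields one notification per unit that still spans hosts.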