[01:25:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:17] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:25:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:26:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:55:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:01] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:48] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) iirc it has been useful a few times to detect faulty cables leading to 10/100M links. Lo...
[07:15:01] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:12] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (MoritzMuehlenhoff) >>! In T273026#8733992, @cmooney wrote: > Must be a race condition of some kind I'm guessing but not sure what it might be. Pro...
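The SystemdUnitFailed alert above fires whenever a host reports systemd units in the failed state. As a rough illustration of what that kind of check boils down to (a sketch only, not the actual check_systemd_state / node_exporter code behind the alert):

```python
#!/usr/bin/env python3
"""Rough sketch of a "failed systemd units" check.

Illustrative only: this is NOT the real check_systemd_state implementation,
just the general idea behind the SystemdUnitFailed alerts above.
"""
import subprocess


def failed_units():
    # `systemctl --failed --no-legend` lists units in the failed state,
    # one per line, without the header/footer decoration.
    out = subprocess.run(
        ["systemctl", "--failed", "--no-legend"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    units = []
    for line in out.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Newer systemctl versions prefix each row with a bullet marker column.
        if parts[0] in ("●", "*", "x"):
            parts = parts[1:]
        units.append(parts[0])
    return units


if __name__ == "__main__":
    units = failed_units()
    if units:
        print(f"FAILED ({len(units)}): {', '.join(units)}")
    else:
        print("OK: no failed units")
```

On a host like lists1003 above, the failed list would presumably have contained discard_held_messages.service at the time the alert fired.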
[07:55:01] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:16] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:17] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:01] (SystemdUnitFailed) resolved: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:53] jbond: why are the alerts for aux-k8s-ctrl1001 sent here?
[08:06:54] I/F owns the aux cluster
[08:21:56] do we? I mostly see other people taking care of it who would be more interested in this alert than us... but maybe I'm wrong
[08:22:48] in general I don't think that siloing those alerts into each team's channel is a great idea, it could make us miss a lot of patterns in what's happening
[08:24:32] +1
[08:24:35] fully agreed, we should at least broadcast all alerts to -operations as well
[08:25:01] we have the same issue with Phab: some teams have started to drop the SRE tag when a task gets tagged with the team's tag
[08:25:22] and then it also disappears from updates in -operations (where I frequently notice relevant things)
[08:31:16] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:02] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:49] volans: the routing is based on the role_owner. also worth noting that the routing to this channel is in addition to the normal routing, not instead of it, i.e. alerts tagged with I/F are routed exactly the same as if they were routed to SRE, but they *also* appear in this channel
[08:36:23] also worth noting that warnings (which this alert now is) don't get routed to #wikimedia-operations at all
[08:37:05] when is the next SRE summit so we can all talk about alerting?
[08:38:23] I'm not sure the ownership of a host works well for alerting in all cases, for example if the debmonitor client fails we want to know, and the owner might just ignore it or not realize it's a widespread issue
[08:41:27] again, there has been no change to the default behaviour, except that team owners can also choose to route them in a different manner. As for the debmonitor client failing, I'm not sure I agree: I'd say we want to know if the debmonitor client fails everywhere, but if it's just a one-off then it's normally something the team owner should fix, i.e. puppet is failing or disabled.
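The team-based routing described above can also be inspected directly against the Alertmanager API; the sketch below lists the alerts currently carrying a given team label. The endpoint and the `team` label name/value used here are assumptions for illustration, not taken from the production routing config:

```python
#!/usr/bin/env python3
"""List currently firing alerts that carry a given team label.

Minimal sketch against the standard Alertmanager v2 API. The API endpoint
and the `team` label name/value below are assumptions for illustration.
"""
import requests

ALERTMANAGER = "https://alerts.wikimedia.org"     # assumed API endpoint
TEAM = "infrastructure-foundations"               # assumed label value


def firing_alerts_for_team(team):
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={
            "filter": f'team="{team}"',  # Alertmanager matcher syntax
            "active": "true",
            "silenced": "false",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    for alert in firing_alerts_for_team(TEAM):
        labels = alert.get("labels", {})
        print(labels.get("alertname"), labels.get("instance"), alert["status"]["state"])
```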
[08:41:44] however, again, there has been no change here; we will still receive the email when the systemd timer fails
[08:42:29] of course I'm not saying this won't change, I have a feeling that o11y want to start utilising this a bit more, but we should check with them, I don't want to misrepresent
[10:04:51] jobo: ##T323484
[10:04:52] T323484: Fine tune the SSHd config of the restricted bastion for better performances - https://phabricator.wikimedia.org/T323484
[10:05:26] * jbond ignore that
[11:03:38] XioNoX, topranks: anything to do for alertname = Storage /var over 50% instance = cloudsw1-b1-codfw.mgmt.codfw.wmnet
[11:03:41] ?
[11:07:53] Social Auth now supports OIDC via CAS: https://github.com/python-social-auth/social-core/pull/743
[11:10:08] nice! does the warning in the README still stand? project looking for maintainers...
[11:10:55] Sadly yes
[11:16:54] very nice!
[11:17:35] good work
[11:19:49] Next time social-core does a release, I think we can start work on migrating NetBox
[11:21:26] can we rely on this library if it's officially without a maintainer? :D
[11:28:21] That's a fair question, but it is the integration for OAuth2 and OIDC that ships with NetBox
[11:37:18] ok then, not much choice :) I meant more for the other Python projects
[11:49:12] If we stick to OIDC, then we could use Mozilla's module
[11:51:16] Assuming that it works with the CAS implementation of OIDC. The social-core patch adds a new OIDC backend that takes into account that CAS doesn't place information where it's "supposed" to go.
[11:55:50] ack
[11:58:25] volans: should clear now
[11:58:31] hx
[11:58:33] *thx
[11:59:26] back to the alerts: (SystemdUnitFailed) used to be a critical in Icinga (so alerting everybody in -operations), then it got moved to AM as a warning (so not alerting anymore), and now it only alerts teams?
[12:00:22] Did I miss a larger conversation about alerts and what should notify whom and how to prevent silos?
[12:20:03] netbox, Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (ayounsi)
[12:36:21] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @wiki_willy this is now deployed and should exclude the devices listed in the accounting spreadsheet, Recycled sheet. FYI from line 340 on that sheet there are a bunch...
[12:39:06] XioNoX: https://wm-bot.wmcloud.org/browser/index.php?start=03%2F28%2F2023+09%3A48&end=03%2F28%2F2023+10%3A31&display=%23wikimedia-sre
[12:39:29] note that since that discussion the refire time has been changed to 3 days, so it would be less spammy to add them back to -operations
[12:42:04] XioNoX: I don't think you missed any discussion. As the Icinga alert still exists, I think the decision was to change the new alert so we could decrease the amount of spam it was sending
[12:43:51] I think that any real alerts will not change where they report without wider discussion, and I would expect that this specific change would either be re-evaluated or communicated more widely when we come to drop the corresponding Icinga alert. But better to talk to o11y (cc godog)
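Going back to the NetBox OIDC thread from earlier in the morning: a minimal sketch of what pointing NetBox at the generic OIDC backend that ships with social-core could look like in configuration.py. The IdP endpoint, client credentials, and even the exact setting names are assumptions to be verified against the NetBox and social-core documentation; the CAS-aware backend added in PR #743 would be slotted in the same way once it ships in a release:

```python
# Sketch for NetBox's configuration.py -- illustrative only, not a tested config.
# NetBox delegates SSO to python-social-auth (social-core); the backend below is
# the generic OIDC backend available in social-core today. The CAS-aware OIDC
# backend from social-core PR #743 would replace it once released. Exact setting
# names should be double-checked against the NetBox and social-core docs.

REMOTE_AUTH_ENABLED = True
REMOTE_AUTH_BACKEND = 'social_core.backends.open_id_connect.OpenIdConnectAuth'

# The generic OIDC backend is namespaced "OIDC", so its settings are prefixed
# SOCIAL_AUTH_OIDC_*. All values below are placeholders for a hypothetical IdP.
SOCIAL_AUTH_OIDC_OIDC_ENDPOINT = 'https://idp.example.org/oidc'  # issuer URL
SOCIAL_AUTH_OIDC_KEY = 'netbox-client-id'      # placeholder OAuth client ID
SOCIAL_AUTH_OIDC_SECRET = 'change-me'          # placeholder client secret
```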
[12:50:34] I can confirm what jbond said, you didn't miss any discussion XioNoX, and the Icinga alerts for systemd unit failed are still in place and critical
[12:51:46] it definitely warrants wider discussion on what to do collectively with alert notifications, I want to write down at least some points to be discussed and haven't found the time yet
[12:52:02] but yes, any significant change will be communicated
[13:21:02] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:22] volans: ^
[13:22:21] XioNoX: nothing to do there, the report works fine, there are outstanding errors besides the ones skipped
[13:22:49] shouldn't alert here then :)
[13:22:50] also those were supposed to alert on the dcops chan, not here
[13:25:34] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:14] FWIW I can't recommend having warnings on team channels
[13:28:26] (of course that's one of the things to at least call out explicitly that I was referring to above)
[13:31:33] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:35:03] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:37:23] XioNoX: volans: as I said in the meeting and here before, feel free to change it so warnings don't come here
[13:37:43] jbond: not blaming you :)
[13:38:21] sure sure (and thanks), but still, if it's not desired that's also fine with me
[13:38:30] trying to find a consistent way to alert, and prevent drift between teams
[13:38:47] if we remove warnings here, then we don't see the systemd alerts anymore
[13:38:56] well, not on IRC I mean
[13:39:06] if we remove warnings here then it goes back to how it was before sprint week
[13:39:12] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:32] i.e. we never got warnings in #wikimedia-operations (AFAIK)
[13:40:01] so we get Icinga systemd alerts in -operations, but not AM systemd alerts
[13:40:15] XioNoX: correct
[13:40:37] the main reason for that is that AM re-fires every N hours, while Icinga never re-fires
[13:40:58] when we first introduced the systemd timer warnings they re-fired every 4 hours, so the AM alert was very spammy
[13:41:14] and thus it got downgraded to a warning (the Icinga one remains critical)
[13:42:36] FYI here is an example of the Icinga one still being triggered (checked for my own benefit)
[13:42:36] we're big enough that we need some kind of framework on what should be warnings and what should be criticals, where they alert, etc.
[13:42:39] https://wm-bot.wmcloud.org/browser/index.php?start=04%2F04%2F2023+01%3A30&end=04%2F04%2F2023+01%3A31&display=%23wikimedia-operations
[13:46:39] XioNoX: I agree, I think this is still very much a moving target
[13:48:33] I see that the refresh for cumin/sretest has 10G... do we really need that? is that because we're going 10G in most places?
[13:49:12] (SystemdUnitFailed) resolved: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:34] can't cumin be a VM and sretest any older host?
[13:51:05] I think we already asked for sretest to be an old host and got a negative reply
[13:51:22] as for cumin... back then we said to use physical to avoid the dependency on Ganeti entirely
[13:51:31] volans: sorry, I thought I had mentioned this, but re sretest I spoke with dc-ops (rob.h); the reason for the 10G card is that we asked for a 10G box for doing testing. I did say we could use an old one or put an old card into a different server, but they prefer to have servers in warranty
[13:53:09] ah right I forgot that bit, sorry
[13:54:51] is the dependency on Ganeti still a problem with an instance in each DC?
[13:56:04] for cumin runs and most cookbooks probably not, for backups and the DBAs' work I'm not sure
[14:28:08] Sorry folks for missing the meeting yesterday, I got stuck in Denver as we were flying standby and all the flights were full, so we ended up taking the train back to Chicago and arrived yesterday afternoon!
[14:30:28] California Zephyr!
[14:31:32] XioNoX: indeed, coach sleeping was a masochistic affair, but we had a good time regardless!
[14:31:52] I took it from SF to Denver, it was absolutely beautiful
[14:33:54] great views indeed, especially from SF to Denver; Denver to Chicago is pretty flat, but still pretty
[15:19:25] jhathaway: I've wanted to do that train trip for a while, but not without a sleeper car 😅
[15:20:29] Yeah, I definitely recommend the sleeper car, unless you love sleeping on an incline
[15:49:12] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:01] netbox, Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (jbond) for hosts I think we may be able to get the parent, which would be a much better way of doing this? (we still need to talk about how we do that for net devices)
[16:19:12] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:16] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Accounting Spreadsheet that are still repor...
[17:13:22] (SystemdUnitFailed) firing: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:43:21] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) >>! In T320955#8755710, @wiki_willy wrote: > Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Ac...
[18:13:22] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:22] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) thanks, @RobH wonder if you had anything additional to add here?
[18:29:15] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (RobH) Cathal and Arzhel summarize things pretty clearly, and cover all the things I would. Det...
[22:17:10] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - yup, that's correct. They've all been recycled. For any equipment that we've already sent out for recycling, we've been adding the asset tags and...
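For reference on the T320955 thread, the general shape of a NetBox custom report that leaves removed hosts out of accounting data could look like the sketch below. It is illustrative only and not the actual Wikimedia report; which statuses count as "removed" and what gets checked per device are assumptions here:

```python
# Illustrative sketch only, not the actual Wikimedia Netbox accounting report.
# It shows the general shape of a NetBox custom report that skips devices
# which are no longer in service. Which statuses to exclude is an assumption.
from dcim.choices import DeviceStatusChoices
from dcim.models import Device
from extras.reports import Report

# Assumed set of "no longer in service" statuses to leave out of accounting.
EXCLUDED_STATUSES = (
    DeviceStatusChoices.STATUS_OFFLINE,
    DeviceStatusChoices.STATUS_DECOMMISSIONING,
)


class Accounting(Report):
    description = "Check accounting-relevant data on in-service devices only."

    def test_asset_tag_and_serial(self):
        # Skip devices that have already been removed/recycled.
        for device in Device.objects.exclude(status__in=EXCLUDED_STATUSES):
            if device.asset_tag and device.serial:
                self.log_success(device)
            else:
                self.log_failure(device, "missing asset tag and/or serial")
```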