[01:25:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:17] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:25:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:26:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:55:01] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:16] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:01] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:57:48] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) iirc it has been useful a few times to detect faulty cables leading to 10/100M links. Lo...
[07:15:01] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:24:12] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (MoritzMuehlenhoff) >>! In T273026#8733992, @cmooney wrote: > Must be a race condition of some kind I'm guessing but not sure what it might be. Pro...
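The SystemdUnitFailed alert above fires whenever a host reports systemd units in the failed state. As a rough illustration of what that kind of check boils down to (a sketch only, not the actual check_systemd_state / node_exporter code behind the alert):

```python
#!/usr/bin/env python3
"""Rough sketch of a "failed systemd units" check.

Illustrative only: this is NOT the real check_systemd_state implementation,
just the general idea behind the SystemdUnitFailed alerts above.
"""
import subprocess


def failed_units():
    # `systemctl --failed --no-legend` lists units in the failed state,
    # one per line, without the header/footer decoration.
    out = subprocess.run(
        ["systemctl", "--failed", "--no-legend"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    units = []
    for line in out.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Newer systemctl versions prefix each row with a bullet marker column.
        if parts[0] in ("●", "*", "x"):
            parts = parts[1:]
        units.append(parts[0])
    return units


if __name__ == "__main__":
    units = failed_units()
    if units:
        print(f"FAILED ({len(units)}): {', '.join(units)}")
    else:
        print("OK: no failed units")
```

On a host like lists1003 above, the failed list would presumably have contained discard_held_messages.service at the time the alert fired.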
[07:55:01] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:56:16] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:17] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:01] (SystemdUnitFailed) resolved: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:53] jbond: why are the alerts for aux-k8s-ctrl1001 sent here?
[08:06:54] I/F owns the aux cluster
[08:21:56] do we? I mostly see other people taking care of it who would be more interested in this alert than us... but maybe I'm wrong
[08:22:48] in general I don't think that siloing those alerts into each team's channel is a great idea, it could make us miss a lot of patterns in what's happening
[08:24:32] +1
[08:24:35] fully agreed, we should at least broadcast all alerts to -operations as well
[08:25:01] we have the same issue with Phab: some teams have started to drop the SRE tag when a task gets tagged with the team's tag
[08:25:22] and then it also disappears from updates in -operations (where I frequently notice relevant things)
[08:31:16] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:02] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:49] volans: the routing is based on the role_owner. also worth noting that the routing to this channel is in addition to the normal routing, not instead of it, i.e. alerts tagged with I/F are routed exactly the same as if they were routed to SRE, but they *also* appear in this channel
[08:36:23] also worth noting that warnings (which this alert now is) don't get routed to #wikimedia-operations at all
[08:37:05] when is the next SRE summit so we can all talk about alerting?
[08:38:23] I'm not sure the ownership of a host works well for alerting in all cases, for example if the debmonitor client fails we want to know, and the owner might just ignore it or not realize it's a widespread issue
[08:41:27] again, there has been no change to the default behaviour, except that team owners can also choose to route them in a different manner. As for the debmonitor client failing, I'm not sure I agree: I'd say we want to know if the debmonitor client fails everywhere, but if it's just a one-off then it's normally something the team owner should fix, i.e. puppet is failing or disabled.
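The team-based routing described above can also be inspected directly against the Alertmanager API; the sketch below lists the alerts currently carrying a given team label. The endpoint and the `team` label name/value used here are assumptions for illustration, not taken from the production routing config:

```python
#!/usr/bin/env python3
"""List currently firing alerts that carry a given team label.

Minimal sketch against the standard Alertmanager v2 API. The API endpoint
and the `team` label name/value below are assumptions for illustration.
"""
import requests

ALERTMANAGER = "https://alerts.wikimedia.org"     # assumed API endpoint
TEAM = "infrastructure-foundations"               # assumed label value


def firing_alerts_for_team(team):
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={
            "filter": f'team="{team}"',  # Alertmanager matcher syntax
            "active": "true",
            "silenced": "false",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    for alert in firing_alerts_for_team(TEAM):
        labels = alert.get("labels", {})
        print(labels.get("alertname"), labels.get("instance"), alert["status"]["state"])
```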
[08:41:44] however, again, there has been no change here; we will still receive the email when the systemd timer fails
[08:42:29] of course I'm not saying this won't change, I have a feeling that o11y want to start utilising this a bit more, but we should check with them, I don't want to misrepresent
[10:04:51] jobo: ##T323484
[10:04:52] T323484: Fine tune the SSHd config of the restricted bastion for better performances - https://phabricator.wikimedia.org/T323484
[10:05:26] * jbond ignore that
[11:03:38] XioNoX, topranks: anything to do for alertname = Storage /var over 50% instance = cloudsw1-b1-codfw.mgmt.codfw.wmnet
[11:03:41] ?
[11:07:53] Social Auth now supports OIDC via CAS: https://github.com/python-social-auth/social-core/pull/743
[11:10:08] nice! does the warning in the README still stand? project looking for maintainers...
[11:10:55] Sadly yes
[11:16:54] very nice!
[11:17:35] good work
[11:19:49] Next time social-core does a release, I think we can start work on migrating NetBox
[11:21:26] can we rely on this library if it's officially without a maintainer? :D
[11:28:21] That's a fair question, but it is the integration for OAuth2 and OIDC that ships with NetBox
[11:37:18] ok then, not much choice :) I meant more for the other Python projects
[11:49:12] If we stick to OIDC, then we could use Mozilla's module
[11:51:16] Assuming that it works with the CAS implementation of OIDC. The social-core patch adds a new OIDC backend that takes into account that CAS doesn't place information where it's "supposed" to go.
[11:55:50] ack
[11:58:25] volans: should clear now
[11:58:31] hx
[11:58:33] *thx
[11:59:26] back to the alerts: (SystemdUnitFailed) used to be a critical in Icinga (so alerting everybody in -operations), then it got moved to AM as a warning (so not alerting anymore), and now it only alerts teams?
[12:00:22] Did I miss a larger conversation about alerts and what should notify whom and how to prevent silos?
[12:20:03] netbox, Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (ayounsi)
[12:36:21] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) @wiki_willy this is now deployed and should exclude the devices listed in the accounting spreadsheet, Recycled sheet. FYI from line 340 on that sheet there are a bunch...
[12:39:06] XioNoX: https://wm-bot.wmcloud.org/browser/index.php?start=03%2F28%2F2023+09%3A48&end=03%2F28%2F2023+10%3A31&display=%23wikimedia-sre
[12:39:29] note that since that discussion the refire time has been changed to 3 days, so it would be less spammy to add them back to -operations
[12:42:04] XioNoX: I don't think you missed any discussion. As the Icinga alert still exists, I think the decision was to change the new alert so we could decrease the amount of spam it was sending
[12:43:51] I think that any real alerts will not change where they report without wider discussion, and I would expect that this specific change would either be re-evaluated or communicated more widely when we come to drop the corresponding Icinga alert. But better to talk to o11y (cc godog)
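Going back to the NetBox OIDC thread from earlier in the morning: a minimal sketch of what pointing NetBox at the generic OIDC backend that ships with social-core could look like in configuration.py. The IdP endpoint, client credentials, and even the exact setting names are assumptions to be verified against the NetBox and social-core documentation; the CAS-aware backend added in PR #743 would be slotted in the same way once it ships in a release:

```python
# Sketch for NetBox's configuration.py -- illustrative only, not a tested config.
# NetBox delegates SSO to python-social-auth (social-core); the backend below is
# the generic OIDC backend available in social-core today. The CAS-aware OIDC
# backend from social-core PR #743 would replace it once released. Exact setting
# names should be double-checked against the NetBox and social-core docs.

REMOTE_AUTH_ENABLED = True
REMOTE_AUTH_BACKEND = 'social_core.backends.open_id_connect.OpenIdConnectAuth'

# The generic OIDC backend is namespaced "OIDC", so its settings are prefixed
# SOCIAL_AUTH_OIDC_*. All values below are placeholders for a hypothetical IdP.
SOCIAL_AUTH_OIDC_OIDC_ENDPOINT = 'https://idp.example.org/oidc'  # issuer URL
SOCIAL_AUTH_OIDC_KEY = 'netbox-client-id'      # placeholder OAuth client ID
SOCIAL_AUTH_OIDC_SECRET = 'change-me'          # placeholder client secret
```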
[12:50:34] I can confirm what jbond said, you didn't miss any discussion XioNoX, and the Icinga alerts for systemd unit failed are still in place and critical
[12:51:46] it definitely warrants wider discussion on what to do collectively with alert notifications, I want to write down at least some points to be discussed and haven't found the time yet
[12:52:02] but yes, any significant change will be communicated
[13:21:02] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:22] volans: ^
[13:22:21] XioNoX: nothing to do there, the report works fine, there are outstanding errors besides the ones skipped
[13:22:49] shouldn't alert here then :)
[13:22:50] also those were supposed to alert on the dcops chan, not here
[13:25:34] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:14] FWIW I can't recommend having warnings on team channels
[13:28:26] (of course that's one of the things to at least call out explicitly that I was referring to above)
[13:31:33] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:35:03] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:37:23] XioNoX: volans: as I said in the meeting and here before, feel free to change it so warnings don't come here
[13:37:43] jbond: not blaming you :)
[13:38:21] sure sure (and thanks), but still, if it's not desired that's also fine with me
[13:38:30] trying to find a consistent way to alert, and prevent drift between teams
[13:38:47] if we remove warnings here, then we don't see the systemd alerts anymore
[13:38:56] well, not on IRC I mean
[13:39:06] if we remove warnings here then it goes back to how it was before sprint week
[13:39:12] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:32] i.e. we never got warnings in #wikimedia-operations (AFAIK)
[13:40:01] so we get Icinga systemd alerts in -operations, but not AM systemd alerts
[13:40:15] XioNoX: correct
[13:40:37] the main reason for that is that AM re-fires every N hours, while Icinga never re-fires
[13:40:58] when we first introduced the systemd timer warnings they re-fired every 4 hours, so the AM alert was very spammy
[13:41:14] and thus it got downgraded to a warning (the Icinga one remains critical)
[13:42:36] FYI here is an example of the Icinga one still being triggered (checked for my own benefit)
[13:42:36] we're big enough that we need some kind of framework on what should be warnings and what should be criticals, where they alert, etc.
[13:42:39] https://wm-bot.wmcloud.org/browser/index.php?start=04%2F04%2F2023+01%3A30&end=04%2F04%2F2023+01%3A31&display=%23wikimedia-operations
[13:46:39] XioNoX: I agree, I think this is still very much a moving target
[13:48:33] I see that the refresh for cumin/sretest has 10G... do we really need that? is that because we're going 10G in most places?
[13:49:12] (SystemdUnitFailed) resolved: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:34] can't cumin be a VM and sretest any older host?
[13:51:05] I think we already asked for sretest to be an old host and got a negative reply
[13:51:22] as for cumin... back then we said to use physical to avoid the dependency on Ganeti entirely
[13:51:31] volans: sorry, I thought I had mentioned this, but re sretest I spoke with dc-ops (rob.h); the reason for the 10G card is that we asked for a 10G box for doing testing. I did say we could use an old one or put an old card into a different server, but they prefer to have servers in warranty
[13:53:09] ah right I forgot that bit, sorry
[13:54:51] is the dependency on Ganeti still a problem with an instance in each DC?
[13:56:04] for cumin runs and most cookbooks probably not, for backups and the DBAs' work I'm not sure
[14:28:08] Sorry folks for missing the meeting yesterday, I got stuck in Denver as we were flying standby and all the flights were full, so we ended up taking the train back to Chicago and arrived yesterday afternoon!
[14:30:28] California Zephyr!
[14:31:32] XioNoX: indeed, coach sleeping was a masochistic affair, but we had a good time regardless!
[14:31:52] I took it from SF to Denver, it was absolutely beautiful
[14:33:54] great views indeed, especially from SF to Denver; Denver to Chicago is pretty flat, but still pretty
[15:19:25] jhathaway: I've wanted to do that train trip for a while, but not without a sleeper car 😅
[15:20:29] Yeah, I definitely recommend the sleeper car, unless you love sleeping on an incline
[15:49:12] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:01] netbox, Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (jbond) for hosts I think we may be able to get the parent, which would be a much better way of doing this? (we still need to talk about how we do that for net devices)
[16:19:12] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:16] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Accounting Spreadsheet that are still repor...
[17:13:22] (SystemdUnitFailed) firing: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:43:21] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (Volans) >>! In T320955#8755710, @wiki_willy wrote: > Thanks @Volans, we'll get the additional info for lines 340 and onwards. I'm still seeing the following S/N's on the Ac...
[18:13:22] (SystemdUnitFailed) resolved: httpbb_kubernetes_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:22] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) thanks, @RobH wonder if you had anything additional to add here?
[18:29:15] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2022/2023-Q3): validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (RobH) Cathal and Arzhel summarize things pretty clearly, and cover all the things I would. Det...
[22:17:10] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - yup, that's correct. They've all been recycled. For any equipment that we've already sent out for recycling, we've been adding the asset tags and...
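For reference on the T320955 thread, the general shape of a NetBox custom report that leaves removed hosts out of accounting data could look like the sketch below. It is illustrative only and not the actual Wikimedia report; which statuses count as "removed" and what gets checked per device are assumptions here:

```python
# Illustrative sketch only, not the actual Wikimedia Netbox accounting report.
# It shows the general shape of a NetBox custom report that skips devices
# which are no longer in service. Which statuses to exclude is an assumption.
from dcim.choices import DeviceStatusChoices
from dcim.models import Device
from extras.reports import Report

# Assumed set of "no longer in service" statuses to leave out of accounting.
EXCLUDED_STATUSES = (
    DeviceStatusChoices.STATUS_OFFLINE,
    DeviceStatusChoices.STATUS_DECOMMISSIONING,
)


class Accounting(Report):
    description = "Check accounting-relevant data on in-service devices only."

    def test_asset_tag_and_serial(self):
        # Skip devices that have already been removed/recycled.
        for device in Device.objects.exclude(status__in=EXCLUDED_STATUSES):
            if device.asset_tag and device.serial:
                self.log_success(device)
            else:
                self.log_failure(device, "missing asset tag and/or serial")
```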