[00:23:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:18:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:03] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:08] Mail, Infrastructure-Foundations, User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9595530 (Ladsgroup) Now I can explain about this a bit more (cleared by the security team). Yes. When I said the discussion happened privately, I...
[12:05:03] XioNoX: Can I try out something on Netbox-Next, or would that make you sad?
[12:05:17] slyngs: go for it !
[12:05:38] slyngs: we should also discuss when to switch prod netbox to the new auth system
[12:06:08] Yes, because we either need to add a new service to cas or break logins briefly
[12:32:04] Needs a little tweaking with the naming of metrics, but it does work:
[12:32:07] # HELP netbox_report_accounting_Accounting Report generation failed
[12:32:07] # TYPE netbox_report_accounting_Accounting gauge
[12:32:07] netbox_report_accounting_Accounting 0.0
[12:32:26] And it's forward compatible :-)
[12:37:17] slyngs: what is it?
[12:37:36] Failed netbox reports, via the Prometheus metrics endpoint
[12:37:48] nice !
[12:38:09] Do we not have any plugins installed in Netbox?
[12:38:17] we have 1, yes
[12:38:35] https://phabricator.wikimedia.org/T311052
[12:39:26] the python lib is installed, but it's not enabled in the config (with the PLUGIN config option)
[12:39:43] few years have passed, call me old, but I still don't really see the use of a time series for a boolean check with info (for which we don't even care about its history) as "the right tool for the job"
[12:39:55] although that ship has sailed...
[12:40:55] we could care in some cases
[12:41:00] about history
[12:41:20] but yeah, not critical
[12:41:35] we care about the failures, not that it failed
[12:41:49] and we can't have what failed in prometheus ;)
[12:41:53] just that it failed
[12:42:26] dunno, we currently have a bug where some reports crash randomly
[12:42:43] having data on the crash pattern could be useful
[12:43:28] we do have that, it's in netbox's db
[12:43:46] slyngs: btw, will the thing you're working on replace the "GetStats" netbox script hack?
[12:43:57] Maybe :-)
[12:44:04] awesome, thx!
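For reference, the sample pasted at 12:32 follows the Prometheus text exposition format (a HELP line, a TYPE line, then the sample). The sketch below reproduces it for a single report. It is only an illustration: the function name, its parameters, and the reading of 1.0 as "failed" are assumptions, not taken from the actual Netbox code.

```python
# Minimal sketch, not the real implementation.
# Assumptions: the function name and arguments are hypothetical, and
# 1.0 is taken to mean "report failed" (the log only shows a 0.0 sample).

def render_report_gauge(module: str, report: str, failed: bool) -> str:
    """Render one report status in the Prometheus text exposition format."""
    # Metric name pattern copied from the sample in the log:
    # netbox_report_<module>_<report>
    name = f"netbox_report_{module}_{report}"
    value = 1.0 if failed else 0.0
    lines = [
        f"# HELP {name} Report generation failed",
        f"# TYPE {name} gauge",
        f"{name} {value}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the same three lines as the 12:32:07 messages.
    print(render_report_gauge("accounting", "Accounting", failed=False), end="")
```

A gauge like this only records that a report failed, not what failed, which is the limitation raised at 12:41.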
[12:44:34] The functionality can be lifted out and put into the ntc-netbox-plugin-metrics-ext plugin, depending on how we want to go about it
[12:45:36] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9595718 (fnegri) @bking thanks for having a look! No rush really, I was...
[12:57:23] XioNoX: It's the getstats.py you want to replace, correct? It seems fairly trivial to move to the built-in /metrics endpoint
[12:58:21] yeah if we can get rid of it it would be awesome, it's currently a hack where we run a netbox script over and over again
[12:58:32] flooding the DB, etc
[12:59:11] you mean the per frontend /metrics, or a generic netbox.wikimedia.org/metrics ?
[12:59:14] The metrics endpoint does the same though, it also just relies on the database for information
[13:00:21] I suppose we could do either per frontend or the generic
[13:00:35] generic is better here
[13:00:44] to not have duplicated data
[13:01:04] Do we currently scrape both? I haven't checked
[13:02:03] yeah, the per frontend is used for metrics like django health
[13:02:25] generic should be for netbox data stats (reports, object count, etc)
[13:11:14] We don't scrape the generic endpoint, it seems; data is currently collected from both netbox1002 and netbox2002 via the node exporter
[13:23:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:48:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:32] Puppet, Observability-Alerting, Puppet-Infrastructure, SRE: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720#9595942 (fgiunchedi) Open→Resolved a:fgiunchedi Optimistically resolving since we've moved to prometheus-based alert...
[14:04:57] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9596015 (bking) > @bking what if we release spicerack with the change...
[14:25:06] XioNoX: topranks: volans: https://phabricator.wikimedia.org/T359054
[14:25:22] [not public because of the data]
[14:32:36] thx
[14:34:41] sukhe: not sure why the data needs to be secret ?
[14:35:00] XioNoX: as I mention, Turnilo graphs, I have been told in the past that they should not be made public
[14:35:08] but why?
[14:35:14] don't ask me :)
[14:35:32] some of this data is not public, at different levels of granularity
[14:35:34] I understand that turnilo is private because it has IPs
[14:35:49] but those screenshots don't have any private data on them
[14:35:53] I think this particular case is fine but I am going to err on the side of caution, ask for approval and then open this
[14:35:56] yep, don't disagree
[14:37:23] sukhe: cool, thanks !
[15:54:13] volans: i think tracing fits best under automation, in that it is kinda a kind of toil reduction
[15:56:27] wfm!
[16:00:53] volans: I tried to make some bullet points for the network updates
[16:00:54] https://etherpad.wikimedia.org/p/state_of_union_-_if_-network_bullet_points
[16:01:05] not sure if it helps much, might still be too verbose
[16:01:17] thanks!
[16:01:25] Arzhel may also want to change the last two around, perhaps I didn't focus best on what he'd like
[16:02:18] topranks: link is correct? empty for me
[16:02:34] yeah weird, I clicked and got same, despite in other tab it had all the text
[16:02:42] ahahaha
[16:04:01] very odd, it's doing the same thing to me again
[16:04:39] maybe it'll let you paste
[16:04:39] https://phabricator.wikimedia.org/P58364
[16:11:47] thanks
[16:20:48] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9596886 (cmooney)
[16:26:40] volans: I added a note about the dcl tooling to the config section, https://sites.google.com/wikimedia.org/sreinfrastructurefoundationsou/projects/configuration-management
[16:28:35] thx
[17:37:40] nice work volans!
[17:38:00] volans: <3
[17:38:45] fiuuu, thanks, I thought I did a horrible job, in particular for some of them, I was mixing things up
[17:38:48] :/
[17:39:15] not at all, I thought you did a really good job of covering all the ground
[17:46:31] +1 on that!
[17:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:32] thanks all for the help on the content! (and you're too kind :) )
[17:56:19] Mail: Create user preference to receive change notification emails for bot edits - https://phabricator.wikimedia.org/T358087#9597535 (matmarex) I understand that this is not the solution you're asking for, but just in case it helps, let me note that there's an Atom feed (like RSS) available for your watchlis...
[21:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed