[00:33:02] FIRING: [3x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:02] FIRING: [3x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:05] jayme: I can try to have a look today (re rename cookbook)
[08:22:26] volans: cool, thanks!
[08:44:30] volans and XioNoX, did we make a decision on how to do the Netbox report monitoring? There were some add-on Prometheus exporters, which require a little work, and then there is the minimalist one I wrote.
[08:46:18] I didn't :)
[08:46:29] I can have another look, forgot about it since
[08:47:12] Do you remember what the one you found was called?
[08:48:59] slyngs: there was https://phabricator.wikimedia.org/T243928
[08:49:12] but I think there is another newer plugin
[08:49:33] for prometheus we need 2/3 kinds of exporters
[08:50:37] one about Netbox's data (device count per site, reports, etc.) so one for all of netbox. And another one per frontend server, about django health
[08:50:39] Yeah, there was the one with the UI, but that was kinda broken
[08:51:09] The Django health ships with Netbox
[08:51:26] yeah, and we're using it
[08:51:41] I was listing them for sake of completeness
[08:51:53] we already have that one https://wikitech.wikimedia.org/wiki/Netbox#Prometheus_2 for the netbox data, but it's not great
[08:52:06] so whatever solution we use, it needs to replace that
[08:52:37] eh, and the doc lists the proper task https://phabricator.wikimedia.org/T311052
[08:53:19] so there is ntc-netbox-plugin-metrics-ext, and netbox-more-metrics
[08:53:29] And: https://gitlab.wikimedia.org/slyngshede/netbox_prometheus_metrics
[08:53:34] :-)
[08:54:43] slyngs: of course :)
[08:56:20] slyngs: you know the topic better than me, so if your tool checks all the boxes and the alternatives are not as good, it's fine for me. Just want to make sure we really need it if we start having to maintain a new tool
[08:56:49] indeed
[08:56:52] Very much agree
[08:58:27] The more-metrics one looks like it might be the best option, even if it is Netbox 3.4+
[09:00:18] I'll see if I can make it work on a local Netbox 4.0 - 1 release
[09:01:39] All of the options are going to require us (me) doing some work, but probably best if we can use the effort on keeping something existing running
[09:02:50] +1 :)
[09:04:00] if the picked one works with 3.X up to the latest, that's fine; if it's a maintained project they will surely make it work for 4.X before we upgrade
[09:05:52] last commit is from last year
[09:06:00] https://github.com/TheDJVG/netbox-more-metrics/issues/23#issue-1789910903
[10:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:52] None of the Prometheus metrics plugins I've found seem to be actively developed, but more-metrics seems like it's the one with a newish release
[12:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:29:40] hello folks
[13:30:42] yesterday we noticed that after moving thanos-fe1001 to PKI, Tegola (maps tiles on k8s) showed an increase in cpu usage across all containers. It wasn't a heavy one, but it may increase if we move the other 3 nodes to PKI
[13:31:39] my impression is that it is due to a change in the cipher suite used, and sadly the fact that it opens a TCP/TLS connection every time to Thanos (no conn pooling etc..). In order to use the envoy sidecar, request signing needs to be fixed on the client, I filed https://gerrit.wikimedia.org/r/c/operations/software/tegola/+/1032482
[13:32:06] the alternative is to use something like https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/aws_request_signing_filter, more generic but I am not 100% sure if it is prod-ready.
[13:33:19] lemme know your thoughts, of course maps is not really owned by anybody so the best course of action is to work on it
[13:33:27] this would unblock the cergen deprecation etc..
[14:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:20] netops, Infrastructure-Foundations, SRE: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169 (cmooney) NEW p:Triage→Low
[15:04:02] netops, Infrastructure-Foundations, ops-codfw, SRE: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9805245 (cmooney)
[15:04:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805246 (cmooney)
[15:08:31] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805265 (cmooney)
[15:13:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805297 (cmooney)
[15:15:52] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805309 (cmooney) p:Low→Medium
[16:33:48] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9805776 (ssingh) Thanks to @cmooney for rolling the above out. For further context, we (Traffic and netops) decided to try out the anycast range in magru for the Wikidough service before doing it...
[16:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:45] netops, Infrastructure-Foundations, SRE: magru network setup - https://phabricator.wikimedia.org/T362421#9806067 (cmooney) And fwiw announcement looks good, all 3 of our transits are learning it ok, and I see it on other carriers from those sources as well. We also see live requests on the doh servers.
[16:51:56] Mail, Infrastructure-Foundations, SRE: Evaluate whether and how to route abuse@ emails to Legal - https://phabricator.wikimedia.org/T302549#9806065 (Aklapper) a:RLazarus→None @RLazarus: Removing task assignee as this open task has been assigned for more than two years - see the email sent to...
[16:54:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9806104 (cmooney)
[17:08:12] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#9806267 (Aklapper) a:cmooney→None @cmooney: Removing task assignee as this open task has been assigned for more than two years - see the e...
[18:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:11] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#9806679 (cmooney) p:Medium→Low Thanks. It is very much something we wish to do but unfortunately other priorities have always trumped it...
[19:05:30] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#9806681 (cmooney)
[19:54:22] netops, Cloud Services Proposals, Infrastructure-Foundations, SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847#9806824 (cmooney) Open→Resolved This has been implemented and the new vlan setup is recorded [[ https://wikitech.wikimedia....
[20:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:23:59] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204 (cmooney) NEW p:Triage→High
[21:31:15] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807207 (cmooney)
[21:31:39] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807214 (cmooney)
[21:45:24] netops, Infrastructure-Foundations, ops-codfw, SRE: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807226 (cmooney)
[22:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed