[03:10:18] Traffic, Okapi [Wikimedia Enterprise], SRE: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (RBrounley_WMF) a:Eugene.chernov
[04:18:05] netops, DC-Ops, SRE: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (Papaul) @RobH I have only one question for now. what is or will be your approach on keeping the TFTP server up to date with the latest firmware.
[07:25:01] netops, DC-Ops, SRE: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (ayounsi) As you said it would be a good idea to see how it fits in the big automation picture. First by detailing precisely the current workflows, identifying the pain poin...
[09:34:00] (EmaTestingAlertManager) firing: EmaTestingAlertManager - https://alerts.wikimedia.org
[09:34:09] \o/
[09:38:56] Traffic, observability, User-fgiunchedi: Port traffic/netops grafana alerts to AM - https://phabricator.wikimedia.org/T282806 (ema) I've added a always-firing test alert on Grafana with the following tags: `team: traffic`, `severity: critical`. Shortly after I did so, we received an alert both via em...
[09:38:56] cool
[09:44:00] (EmaTestingAlertManager) resolved: EmaTestingAlertManager - https://alerts.wikimedia.org
[09:47:00] \o/ \o/ \o/
[09:47:06] the jinxer jinxes
[09:48:20] godog: I've updated the alert defined on varnish-http-requests (https://grafana-rw.wikimedia.org/d/000000180/varnish-http-requests?tab=alert&editPanel=6&orgId=1) to also send notifications to AlertManager
[09:48:49] my understanding though is that with the current settings there will be no indication of which DC the alert refers to?
[09:49:09] right now the rule name is "70% GET drop in 30min alert", and the message is "GET requests percentage difference compared to 30 minutes ago"
[09:49:58] I'm checking the dashboard but yeah I think so too
[09:50:07] would adding {{ $labels.site }} be the right thing to do to add the DC name?
[09:51:34] unfortunately I think that's a grafana limitation, there's no templating in that sense
[09:52:31] essentially this https://github.com/grafana/grafana/issues/6557
[09:53:08] oh no actually I think I'm wrong, that's slightly different
[09:54:05] ema: try with ${site} e.g. in the alert message
[09:54:26] https://grafana.com/docs/grafana/latest/alerting/add-notification-template/ that is
[09:54:31] godog: will do this afternoon, now lunch! thanks :)
[09:54:44] cheers
[10:17:24] (70% GET drop in 30min alert) firing: 70% GET drop in 30min alert - https://alerts.wikimedia.org
[10:18:45] ema: can we get a common tag for those alerts?
[10:18:58] [traffic-alert] or something like that
[10:37:24] (70% GET drop in 30min alert) resolved: (2) 70% GET drop in 30min alert - https://alerts.wikimedia.org
[12:18:39] vgutierrez: do you mean that all our alerts should have [traffic-alert] as a prefix, or this one specifically?
[12:20:08] it would be nice to have a common prefix to highlight it
[12:20:37] I don't care if it is [traffic-alert] or #thingsarebroken or whatever you like :)
[12:23:22] vgutierrez: how about /hilight jinxer-wm ?
[12:24:11] all of them are equally bad? I was thinking in something analogue to the #p.a.g.e tag used in -operations
[12:24:52] good point, right now we've got only one alert so yes they're all equally bad :)
[12:25:33] grafana allows to specify tags though, so maybe we could work with that?
[12:26:20] right now I've used only the team, severity, and dashboard tags, but I imagine we could define another one (eg: prefix?) and use that for IRC notifications
[12:26:29] godog: thoughts?
^
[12:31:00] meanwhile I've changed the alert Name to: "70% GET drop in text@${site} during the past 30 minutes"
[12:31:23] (changed the name and not the message as it's the name that ends up on IRC apparently)
[12:37:42] Traffic, observability, Patch-For-Review, User-fgiunchedi: Port traffic/netops grafana alerts to AlertManager - https://phabricator.wikimedia.org/T282806 (ema)
[12:42:37] yeah you can add whichever text you'd like to the alert for sure, and the tags from grafana
[12:43:10] I'd also argue that a non-urgent alert shouldn't be on irc, but YMMV
[12:49:09] netops, SRE: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (ayounsi) Open→Resolved Thanks all set and updated. noc@gtt.net is a valid email according do their website (so maybe it was a temporary issue?), and I added their 2nd level escalation email as we...
[12:59:17] netops, SRE: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (ayounsi) Actually looks like they don't want emails. So I left a note in Netbox saying that it's phone or portal only.
[14:06:02] Traffic, observability, Patch-For-Review, User-fgiunchedi: Port traffic/netops grafana alerts to AlertManager - https://phabricator.wikimedia.org/T282806 (ema) OK so it turns out that defining the alerts in Grafana is possible but not recommended, and the right thing to do is adding them to the [...
[15:05:20] Traffic, Analytics, Analytics-Kanban, SRE: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (mforns)
[15:06:08] Traffic, Analytics-Radar, Privacy Engineering, SRE: Publishing project anomaly data for censorship researchers.
Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (mforns)
[15:36:36] netops, DC-Ops, SRE: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH) >>! In T283771#7118281, @Papaul wrote: > @RobH > I have only one question for now. what is or will be your approach on keeping the TFTP server up to date with the la...
[16:31:18] Traffic, SRE, User-jbond: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031 (jbond) Open→Resolved Closing we now have https://wikitech.wikimedia.org/wiki/PKI/
[17:06:13] Traffic, netops, SRE, User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (cmooney) There is this script for AWS that @ema pointed me towards: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/pro...
[18:27:53] Traffic, SRE, vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283852 (ssingh)
[18:28:52] Traffic, SRE, vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283852 (ssingh) p:Triage→Medium
[18:43:23] netops, DC-Ops, SRE: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH) I think either A or C, B seems problematic and allows for one person to serve as a blocker for updates being timely. Also I fear that B would make me the single poin...
[18:44:50] Traffic, SRE, vm-requests, Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (ssingh)
[18:45:08] netops, DC-Ops, SRE: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH) >>!
In T283771#7118462, @ayounsi wrote: > As you said it would be a good idea to see how it fits in the big automation picture. > First by detailing precisely the cur...