[01:39:55] netbox, Infrastructure-Foundations, Traffic: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Peachey88)
[02:18:22] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:41] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:17] so for Netbox, we do have hourly backups, we're just a bit unlucky this time that the backup is at :37 and the last changes before the unwanted deletion were at :39 and :52 of the previous hour and just one afterwards.
[06:54:35] so either way we have to replay some changes, I'll check the current status to make a more informed decision on which way to go
[06:55:22] netbox, Infrastructure-Foundations: Should we have two versions of the Juniper QFX5120-48Y in Netbox? - https://phabricator.wikimedia.org/T331519 (ayounsi) I'm more and more on the side of using a single device type here.
[06:55:57] cc sukhe for backlog reading later
[06:56:05] some clarification first
[06:56:29] 1) the hourly backups are directly on the secondary db host, no involvement of bacula or anything else
[06:57:08] 2) the process is documented at https://wikitech.wikimedia.org/wiki/Netbox#Restore_the_DB_dump
[06:58:52] 3) netbox-next.w.o always has a stale DB of production that can be a useful place to look for previously long-existing data. Ofc there is a small chance that the data in netbox-next could have been modified for testing or is older than the latest modification in production. But for example usually cables don't change and it might have been a good place to look for the missing info
[07:01:36] 4) In pages with multiple selection available, Netbox usually has 2 delete buttons, one at the top-right that is for the whole object and is present in all pages and the one at the bottom that is specific to the selection made. The two confirmation pages are totally different (the former is a popup, the latter a page). That said it can be confusing and if not reading carefully the
[07:01:42] confirmation message it's easy to make a mistake
[07:01:49] has happened already a few times to different people.
[07:07:47] I'm restoring the missing bits manually from netbox-next in this case, the cable is
[07:07:50] https://netbox-next.wikimedia.org/dcim/cables/1070/
[07:08:14] hey, I've deleted a whole switch in the past, good thing we had backups
[07:10:14] I've also set the device's platform to Linux as it was missing, I'm now checking all the IPs' details
[07:12:16] ok, I think it's all back now, the only difference, which I guess is expected, is the virtual iface that changed name to vlan100 instead of $iface.100
[07:13:46] the sre.dns.netbox cookbook keeps showing a noop, so we should be good to go AFAICT
[07:16:06] thanks also to Willy for adding the serial number
[07:21:53] ah, I've also run the import from puppetdb script in dry-run mode, noop
[07:23:11] netbox, Infrastructure-Foundations, SRE, Traffic: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) p:Triage→High a:Volans I've updated on IRC while I was working on this, this is my backlog :) > So for Netbox, we do have hourly backup, we're jus...
[07:24:35] netops, Infrastructure-Foundations, SRE: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (ayounsi) FYI, the above patch caused the following outstanding diff on some cloudsw switches: `lang=diff Changes for 1 devices: ['cloudsw1-c8-eqiad.mgmt.eqiad.wmn...
[08:18:22] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:09:53] the good old netbox delete button :)
[09:10:33] at least it has the warning now that's saved me a few times
[09:12:02] in theory we could only grant device delete permissions to DCops
[09:12:10] but it's not trivial
[09:12:35] wouldn't be a bad idea, rest of us can set them to "decom" or whatever
[09:12:42] but not sure how much effort it's worth
[09:13:30] btw XioNoX did you see https://phabricator.wikimedia.org/T334180
[09:14:04] 200 characters license ?!
[09:14:37] yep
[09:16:00] not a fan of the device custom field, as it will apply to all the devices, and will be a mess for devices that need more than 1 licence (like VCs)
[09:16:32] yeah if things need multiple then the inventory items maybe work better
[09:17:22] I guess custom field on inventory item is the least bad option :)
[09:18:04] cool yeah that'll work fine
[09:18:10] annoying little snag
[09:20:52] XioNoX: another thing I hit, after the 'routing-options' change, is what caused the config diffs for cloudsw :(
[09:21:15] the cloudsw in c8/d5 have a few static routes for the loopback IP of the cloudsw2 in those cabs
[09:21:32] which the 'replace' is trying to remove (should have realised)
[09:21:33] yeah, I didn't notice that the file was included from the cloudsw templates
[09:21:37] yeah
[09:21:38] https://www.irccloud.com/pastebin/CVPAqzyI/
[09:22:09] So I'm wondering if I should just implement some simple YAML-based model under a device for static routes?
[09:22:43] could also maybe help if we have the ones for LVS or whatever, although in general we want to avoid them
[09:23:26] I was thinking pretty generic, like { "ipv4_routes": { "route": "x.x.x/x", next-hop: "y.y.y.y"} }
[09:23:44] buying the BGP license on the cloudsw2 for the loopback might not be a good way to spend money :)
[09:24:11] haha yeah it would be BGP but no licence exactly
[09:24:46] it makes sense to me to have some basic static routes support yeah
[09:25:45] Ok I'll take a look. Don't want to move backward and separate out routing-options for cloudsw
[09:25:56] { "static_routes": { "route": "x.x.x/x", next-hop: "y.y.y.y"} } and then have Homer be smart about v4 vs v6
[09:26:17] a description field too
[09:27:23] similar to the `{% for prefix in anycast_prefixes|selectattr("version", "eq", 6) %}` etc
[09:28:37] smart thinking nice :)
[09:29:10] for the description field, one we'll add on the juniper as a /* comment */ above the route?
[09:29:31] yeah
[09:29:47] makes sense yeah
[11:25:56] XioNoX: why is it the simplest things always turn out so complex :P
[11:26:10] I was working on the static route thing, ok for the most part
[11:26:30] worked fine for the /128 and /32 routes on cloudsw1*
[11:26:47] however hit a bit of a snag with the defaults going the other way on cloudsw2*
[11:26:54] the v4 is fine, but the v6 causes an issue
[11:27:03] YAML doesn't like ::/0 as a value
[11:27:31] if you put it in quotes it becomes a string, not interpreted as an IP, and the "selectattr()" filtering breaks
[11:28:05] While messy, about the best I could come up with is to put the route as 0::/0 in the YAML
[11:28:17] Which the Juniper is ok with: https://phabricator.wikimedia.org/P46155
[11:38:12] actually ignore that suggestion - 0::/0 makes the selectattr() function go crazy and eventually crash :)
[11:41:21] which part has problems with the quotes? homer?
[11:42:47] volans: the quotes themselves aren't a problem
[11:43:11] just that makes the value a string, without the quotes it is recognized as an IP address
[11:43:20] and the resulting value is an ipaddress object instead
[11:43:37] but in homer that's done by our filter
[11:43:53] ah yes I was trying to dig into where/how that happens
[11:43:58] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/+/refs/heads/master/homer/config.py#19
[11:44:09] and the quotes shouldn't make a difference
[11:44:24] we convert it to string in the first line of the function
[11:44:26] yeah just looking at the code, quotes shouldn't matter
[11:44:51] hmm ok let me try again maybe I mixed myself up
[11:44:58] that's why I was asking :D
[11:45:23] heh
[11:45:42] I also have found that in the YAML putting the "::/0" on a new line makes it happy
[11:45:55] so instead of { "route": ::/0 }
[11:46:06] "route":
[11:46:08] ::/0
[11:46:27] but let me re-try the quotes cos that seems cleaner
[11:46:31] rotfl, win for the parser
[11:49:23] the quotes make it throw an error alright
[11:49:25] jinja2.exceptions.UndefinedError: 'str object' has no attribute 'version'
[11:49:45] which is caused by this in the template:
[11:49:46] {% for network in static.route|selectattr("version", "eq", 6) %}
[11:50:16] The input it's working on is
[11:50:19] - { route: 10.64.146.250/32, next_hop: 10.64.147.9, descr: "cloudsw2-c8-eqiad loopback" }
[11:50:34] I guess it doesn't match the regex
[11:50:35] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/homer/+/refs/heads/master/homer/config.py#78
[11:50:36] sorry that's a bad example, I didn't put the quotes back around the IPs there
[11:50:55] without the quotes it works, with them it ends up as a string
[11:51:02] volans: thanks let me look at that
[11:51:09] I'm not sure if the constructors are given the quotes or not
[11:51:17] we should check yaml documentation
[11:55:02] ok yep, tbh that code is confusing me :)
[11:55:13] which part?
[11:55:19] the purpose of the regex is to identify what values to try and parse as IPs?
[11:55:38] is to tell yaml for which values to call the custom constructor
[11:55:46] that will return ipaddress objects
[11:55:51] instead of strings
[11:56:18] ok yep
[11:56:18] I think I got the regexes from some places, weird I didn't add a comment with a link
[11:56:21] I usually d
[11:56:22] *do
[11:56:26] so potentially the quotes are making it not match that
[11:56:41] haha no worries :)
[11:56:44] yeah although I would expect yaml would pass the value already unquoted
[11:56:47] but who knows
[11:56:55] I'm checking
[11:57:15] thanks, I'm guessing it somehow must NOT strip the quotes, otherwise it'd behave the same with or without?
[12:02:30] yep
[12:02:32] it's that
[12:02:33] >>> load_yaml_config('foo.yaml')
[12:02:33] {'route': ['10.64.12.11']}
[12:02:33] >>> load_yaml_config('foo.yaml')
[12:02:35] {'route': [IPv4Address('10.64.12.11')]}
[12:03:17] sending patch
[12:04:05] volans: only if it's simple
[12:04:12] I've hit another issue it seems
[12:04:45] looks like the operation I was doing on the address, getting the IP version, doesn't work with the v6 default
[12:04:59] I hit this when I'd tried to use 0::/0 as a workaround and thought it was that syntax
[12:05:13] but using the "new line" trick in the YAML it also happens with ::/0
[12:05:40] not sure where it's going wrong, cpu thread hits 100% for a few mins then kernel kills the homer process on me
[12:06:10] so I suspect I may need to just deal with this another way - one where it won't matter if it's a string or ipaddress object :(
[12:08:03] https://www.irccloud.com/pastebin/oPgpSwAq/
[12:08:17] of course that works fine.
[12:08:23] and the regex too
[12:08:32] >>> re.match(network_re, '0::/0')
[12:08:32]
[12:08:55] and doesn't have backtracking issues, is immediate
[12:10:22] hmm
[12:31:25] netops, Infrastructure-Foundations, SRE: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney) p:Triage→Low
[12:33:31] volans: we can fold the fix in the homer release I'm trying to make and failing at the Makefile :)
[12:33:39] at building wheels
[12:33:52] ehehe yes, but not sure I can keep working on this today
[12:34:08] netops, Infrastructure-Foundations, SRE: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:34:11] there are some subtleties that make it a bit more complex than expanding the regex
[12:34:19] I'm reading pyyaml source code
[12:41:14] fwiw I've been able to work around things ok now
[12:41:37] I created a task and submitted a patch that works https://phabricator.wikimedia.org/T334281
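The static-route model discussed from 09:22 onwards, and the per-version filtering that broke when a route arrived as a plain string, can be sketched in standard-library Python. This is only an illustration under stated assumptions, not the Homer template or the patch on T334281: the names STATIC_ROUTES, parse_routes, routes_for_version and render are hypothetical, the IPv6 next hop uses the 2001:db8::/32 documentation prefix, and the rendered lines only loosely imitate Junos static-route syntax with the description as a /* comment */ above each route.

```python
import ipaddress

# Hypothetical per-device data, shaped like the YAML entries quoted in the chat.
STATIC_ROUTES = [
    {"route": "10.64.146.250/32", "next_hop": "10.64.147.9",
     "descr": "cloudsw2-c8-eqiad loopback"},
    # Example-only v6 default route; the next hop is a documentation address.
    {"route": "::/0", "next_hop": "2001:db8::1", "descr": "v6 default"},
]


def parse_routes(raw_routes):
    """Convert string values to ipaddress objects so .version is always available."""
    parsed = []
    for entry in raw_routes:
        parsed.append({
            "route": ipaddress.ip_network(entry["route"]),
            "next_hop": ipaddress.ip_address(entry["next_hop"]),
            "descr": entry.get("descr", ""),
        })
    return parsed


def routes_for_version(routes, version):
    """Keep the routes of one IP version, reading .version from the network object."""
    return [r for r in routes if r["route"].version == version]


def render(routes):
    """Render routes as illustrative Junos-like lines, description first."""
    lines = []
    for r in routes:
        if r["descr"]:
            lines.append(f"/* {r['descr']} */")
        lines.append(f"route {r['route']} next-hop {r['next_hop']};")
    return "\n".join(lines)


if __name__ == "__main__":
    routes = parse_routes(STATIC_ROUTES)
    print(render(routes_for_version(routes, 4)))
    print(render(routes_for_version(routes, 6)))
```

Two properties of the ipaddress module are worth keeping in mind with this shape of data: ip_network("0::/0") and ip_network("::/0") compare equal, and iterating over a network object yields its individual addresses, so the sketch reads .version from the object directly rather than looping over a network such as ::/0.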
[12:41:43] so the problem is https://github.com/yaml/pyyaml/issues/457
[12:44:27] topranks: you have to change your value from:
[12:44:40] foo: "10.0.0.1"
[12:44:41] to
[12:44:44] foo: ! "10.0.0.1"
[12:44:53] or explicitly tagging it with
[12:45:03] foo: !ip_address "10.0.0.1"
[12:45:49] nice work / find!!
[12:46:02] XioNoX: sorry, no need to patch homer :D
[12:47:39] actually, no need to release, I'm sending a patch to add those to the tests
[12:47:58] better the !
[12:48:08] that just tells it to not treat that quoted string as quoted
[12:51:54] volans: I can confirm it works with ! "::/0"
[12:54:14] the issue says that it's probably not 100% adherent to the yaml specs, but it's the pyyaml way of allowing you to do that
[12:54:22] I'm adding the test to ensure it will keep working in the future
[12:55:00] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add generic mechanism to add static routes on switches - https://phabricator.wikimedia.org/T334281 (cmooney)
[12:55:21] volans: great, thanks for all the help on this one :)
[12:55:29] np
[12:55:33] I've modified the patch to use that syntax, I think it's probably the cleanest option
[12:56:17] you have a gift for finding these kinds of corner cases :D
[12:56:50] haha :P
[13:00:25] https://gerrit.wikimedia.org/r/c/operations/software/homer/+/906729/1
[13:17:16] netbox, Infrastructure-Foundations, SRE, Traffic: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (ssingh) >>! In T334253#8764695, @Volans wrote: > I've updated on IRC while I was working on this, this is my backlog :) Many thanks for taking care of this, @volans!...
[13:25:05] netops, Infrastructure-Foundations, SRE: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Open→Resolved Indeed Arzhel thanks, my bad I had forgot those were present. I'll close this one, the static's have been (rather laboriously) deal...
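The quoting behaviour worked out between 11:55 and 12:48 can be reproduced with a small PyYAML sketch. It is not Homer's config.py: IPLoader, IPV4_RE and construct_ip are hypothetical names, the regex is a deliberately simple IPv4-only pattern, and the !ip_address tag is the one quoted in the chat. The sketch registers an implicit resolver plus a constructor, then loads the same address written plain, quoted, quoted with a bare ! and explicitly tagged.

```python
import ipaddress
import re

import yaml  # PyYAML


class IPLoader(yaml.SafeLoader):
    """SafeLoader subclass so the custom rules stay out of yaml.SafeLoader itself."""


# Hypothetical, deliberately simple IPv4-only pattern for the example.
IPV4_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def construct_ip(loader, node):
    """Build an ipaddress object from a scalar node resolved to !ip_address."""
    return ipaddress.ip_address(loader.construct_scalar(node))


# Implicit resolver: untagged plain scalars matching the regex get the tag.
IPLoader.add_implicit_resolver("!ip_address", IPV4_RE, None)
# Constructor: how a scalar carrying that tag becomes a Python object.
IPLoader.add_constructor("!ip_address", construct_ip)

DOC = """
plain: 10.0.0.1
quoted: "10.0.0.1"
bang: ! "10.0.0.1"
tagged: !ip_address "10.0.0.1"
"""


def load(text):
    """Load YAML text with the custom loader."""
    return yaml.load(text, Loader=IPLoader)


if __name__ == "__main__":
    for key, value in load(DOC).items():
        print(key, type(value).__name__)
```

In this sketch the quoted value stays a str because PyYAML applies implicit resolvers only to plain scalars, while the bare ! asks the loader to resolve the quoted scalar as if it were plain, and the explicit tag bypasses the regex altogether.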
[14:03:22] topranks: cloudsw2-c8 only has 2 server-facing ports left, so we can probably move them over to cloudsw1 and decom that switch
[14:05:48] and for d5 we're 2 free ports away from being able to decom it too
[14:31:48] XioNoX: ok didn't realise we'd got that low on the second switch
[14:31:58] yeah with e/f taking some servers, and move to 1 nic that's great
[14:32:14] motivates us to get the ceph nodes to 1 nic for d5
[15:30:51] netbox, Infrastructure-Foundations, SRE, Traffic: Restoring Netbox data for lvs3007 - https://phabricator.wikimedia.org/T334253 (Volans) Open→Resolved Perfect, resolving.