[10:28:20] Puppet, Infrastructure-Foundations, SRE: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (jbond)
[10:28:28] Puppet, Infrastructure-Foundations, SRE: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (jbond) p:Triage→Low
[10:29:02] Puppet, Infrastructure-Foundations, SRE: Where to Put puppetlabs Core Mudules - https://phabricator.wikimedia.org/T302481 (jbond) As all the packages we need have already been packaged by debian, my view is we just go with the debian packages and close this ticket down.
[10:30:19] Puppet, Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (jbond) >>! In T302423#7733445, @jbond wrote: >> in comparison to say the cron module which is still shipped by Puppet as part of their agent package. How the cron module should be packaged in...
[10:35:19] SRE-tools, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2021/2022-Q3): Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (fgiunchedi) >>! In T293209#7698301, @Volans wrote: > Today @jbond and I joined the office hours of #sre_observabi...
[10:57:14] Puppet, Infrastructure-Foundations, SRE: Where to Put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (Aklapper)
[11:10:34] Puppet, Infrastructure-Foundations, SRE: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (RhinosF1)
[13:39:03] Puppet, Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (CDanis) Have you also considered [[ https://www.atlassian.com/git/tutorials/git-subtree | git subtree ]] instead of git submodules?
[14:46:32] godog: quick follow up on your replies, as it might be easier here than in gerrit. Are alerts replicated in any way between the two alertmanagers?
[14:47:38] to understand what we should do if the request to one of them fails
[14:51:27] volans: sure, my understanding is that silences once received are broadcast to peers in the cluster
[14:55:03] ah misread silences with alerts; prometheus does send alerts to all peers individually, IIRC the individual alerts are not part of the cluster state but I might be misremembering
[14:56:26] * volans is now more confused than earlier :)
[14:57:55] so if what you're saying is correct, there is no guarantee that firing alerts are the same at any given time in the two endpoints?
[14:59:19] With the current implementation we stop at the first successful endpoint when silencing, but those are then replicated by AM itself? And so this should guarantee that no matter the diffs in alerts we'll silence them all anyway?
[15:01:18] for the first question yes I think in theory there could be inconsistencies, though clients do re-send firing alerts so I think things would eventually converge. For the second question also yes my understanding is that the silences are replicated by alertmanager to its cluster peers
[15:01:58] let me clarify the last bit: I'm reading the code and silences are broadcast to the cluster
[15:02:05] ok
[15:03:26] so unless there is a split brain between the two clusters and for some reason eqiad gets some alerts that codfw doesn't and we hit codfw for silences but codfw can't broadcast the silence to eqiad we should be good
[15:04:21] yeah I think so too, seems good enough to start with at least
[15:05:14] agree, seems unlikely enough
[15:08:59] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q3), User-fgiunchedi: blackbox-exporter no icmp replies on prometheus1006 for a few services - https://phabricator.wikimedia.org/T302265 (fgiunchedi) Thank you @cmooney and @BBlack for the explanations and for digging int...
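Editorial sketch of the silencing behaviour discussed between 14:59 and 15:05: try each Alertmanager endpoint in turn, stop at the first one that accepts the silence, and rely on the cluster to replicate it to its peers. This is not the Spicerack implementation. The function name `create_silence`, the injected `post` callable, and the endpoint names are hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable


def create_silence(
    endpoints: Iterable[str],
    silence: dict,
    post: Callable[[str, dict], str],
) -> str:
    """Send the silence to the first endpoint that accepts it.

    `post` is a hypothetical callable that submits the silence to one
    endpoint and returns the silence ID, raising an exception on failure.
    """
    errors = []
    for endpoint in endpoints:
        try:
            # First success wins; replication to the other peers is
            # assumed to be handled by the Alertmanager cluster itself.
            return post(endpoint, silence)
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))


def fake_post(endpoint: str, silence: dict) -> str:
    """Stand-in transport: pretend the eqiad endpoint is unreachable."""
    if "eqiad" in endpoint:
        raise ConnectionError("unreachable")
    return "silence-id-from-" + endpoint


result = create_silence(
    ["am-eqiad.example", "am-codfw.example"], {"comment": "test"}, fake_post
)
```

The split-brain case raised at 15:03:26 is not handled here: if the endpoint that accepted the silence cannot reach its peers, the other side never learns about it.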
[15:31:51] Puppet, Infrastructure-Foundations, SRE: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (jhathaway) >>! In T302481#7734582, @jbond wrote: > As all the packages we need have already been packaged by debian, my view is we just go with the debian packages and close this...
[15:34:45] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (jbond)
[15:34:47] Puppet, Infrastructure-Foundations, SRE: Where to put puppetlabs Core Modules - https://phabricator.wikimedia.org/T302481 (jbond) Open→Resolved ack I'll resolve this in that case, the task is still around if anyone wants to object they can re-open
[16:35:38] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (RobH) Opened inbound ticket 00765408 to track down both of these shipments that arrived last week.
[17:01:03] XioNoX, sukhe: what's the rationale behind peering with the ASW link-local IP and not the global unicast one in drmrs?
[17:01:29] just curious
[17:02:47] topranks: it's the default gateway, learned via RAs
[17:04:15] ah ok, and that then goes back to the Bird default behaviour of using the default gw IP if no specific one is listed?
[17:04:35] Ideally we would not need RA, but instead have everything set in stone by provisioning (IP, gateway, etc)
[17:05:13] Yeah I'm inclined to agree, although RA doesn't seem particularly troublesome to support.
[17:07:00] true!
[17:49:01] topranks, XioNoX: did you make changes to the wmf homer plugin?
[17:49:04] File "/srv/deployment/homer/venv/lib/python3.7/site-packages/homer_plugins/wmf-netbox.py", line 164, in _get_junos_router_interfaces jri[interface_name] = {**jri[parent], **interface_config}
[17:49:08] UnboundLocalError: local variable 'parent' referenced before assignment
[17:49:18] see https://www.irccloud.com/pastebin/5zLugcWO/ for full stacktrace
[17:52:10] hi all I wonder if we should have the vm-request tag added to the sre-foundations section here https://phabricator.wikimedia.org/project/view/1025/ and the herald rule that automatically tags sre-foundations?
[17:52:25] why not
[17:54:46] * jbond realises he knows nothing about phabricator
[17:56:12] jbond: added to the project view
[17:57:01] ahh thanks I'll try and dig out jo.bo's original ticket for the herald rule
[17:57:18] no need
[18:00:06] jbond: I've changed the link that generates the form in:
[18:00:06] https://wikitech.wikimedia.org/wiki/SRE/SRE_Team_requests#Virtual_machine_requests_%28Production%29
[18:00:14] so might not need the herald rule probably
[18:00:26] ahh great thanks
[18:07:16] volans: No I have some patches in but they aren't merged.
[18:08:01] that code is failing for asw2-b-eqiad but not for asw2-c-eqiad
[18:08:09] any difference that might come to mind
[18:08:09] ?
[18:08:35] Yeah I'd noticed, just having a look now I'll see if I can spot anything.
[18:09:22] topranks: the code is definitely wrong
[18:09:26] uses parent before it's defined
[18:09:27] line 164
[18:10:10] it seems that we need the parent, sub = interface_name.split('.', 1)
[18:10:17] in 3 cases of that if/elif that has 4 cases
[18:14:35] just looking at it there.
[18:14:57] that's not appropriate in that case, the "elif" without it is for interface names with no "." in them
[18:15:12] so .split('.') isn't going to do it.
[18:15:21] so what should parent be?
[18:16:31] That elif should be triggered when the current interface is the parent.
[18:16:36] ok all I'm signing off, will be around this evening but from tomorrow I'll be mostly unreachable (checking in once a day or so) until tuesday
[18:16:52] ok enjoy your time off John :)
[18:17:01] thanks :)
[18:17:27] jbond: enjoy your time off!
[18:17:39] volans: I'm not familiar enough to know exactly what we should do here.
[18:17:50] will do cheers :)
[18:18:05] Comparing say to the first "if" in that sequence of statements.
[18:18:31] That's when it's found a sub-interface, say ge-0/0/0.100, and it already processed the parent, 'ge-0/0/0'
[18:18:42] so the elif at line 162 means that our interface_name is not a subinterface, so it's already the parent?
[18:19:06] so it should be jri[interface_name] = {**jri[interface_name], **interface_config}
[18:19:13] that basically is equivalent to
[18:19:19] yeah, so in the first statement it's finding a sub, and adding those details to an existing key (parent) of the jri dict.
[18:19:24] jri[interface_name].update(interface_config)
[18:19:59] yeah I think that makes sense.
[18:20:19] jri[interface_name] should already exist from the previous iteration when it found a sub-int of it.
[18:20:41] ok let me try real quick a hot fix on cumin1001 to see the diff
[18:20:44] and then I'll make the patch
[18:20:44] The other question is why have we only seen this now
[18:21:07] the elastic host ryan was decommissioning has been down for a while
[18:21:09] But probably easiest if you try the hot fix, and then we look and see if the diff looks ok
[18:21:15] maybe its iface was already removed?
[18:21:32] elastic[1039,1043].eqiad.wmnet
[18:21:36] check the one in row b
[18:22:59] it's still configured on the switch
[18:23:28] and netbox matches.
[18:23:53] k
[18:23:55] anyway I suggest hot patch and we can look at the diff, in netbox I don't spot any odd-looking interface names or similar but it's a long list.
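Editorial sketch of the failure and the proposed fix discussed between 18:09 and 18:20. This is not the actual homer plugin code: the function name `merge_interfaces`, the three-branch loop, and the sample data are hypothetical. Only the `jri` dict, the `parent, sub = interface_name.split('.', 1)` split, and the `jri[interface_name].update(interface_config)` fix are taken from the conversation.

```python
def merge_interfaces(interfaces):
    """Merge (name, config) pairs into a dict keyed by interface name.

    Hypothetical reconstruction; only the branch that reuses an existing
    key mirrors the line discussed in the chat.
    """
    jri = {}
    for interface_name, interface_config in interfaces:
        if "." in interface_name:
            # Sub-interface such as ge-0/0/0.100: parent is bound here,
            # so its details can be merged into the parent's entry.
            parent, _sub = interface_name.split(".", 1)
            jri[parent] = {**jri.get(parent, {}), **interface_config}
        elif interface_name in jri:
            # Name already present (parent seen after its sub-interface,
            # or a duplicated name). The failing version read jri[parent]
            # in this branch, where parent was never assigned, hence the
            # UnboundLocalError. Updating the existing entry is the fix.
            jri[interface_name].update(interface_config)
        else:
            # First time this interface name is seen.
            jri[interface_name] = dict(interface_config)
    return jri


# Duplicated interface name: the second entry lands in the "elif" branch.
duplicated = [
    ("ge-3/0/24", {"enabled": False}),
    ("ge-3/0/24", {"description": "example"}),
]
merged = merge_interfaces(duplicated)
```

With the `update()` call the second occurrence of a name is merged into the first instead of raising, which is why the branch stays harmless even when the input contains unexpected repeats.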
[18:24:13] thx for checking
[18:24:16] I'm running the diff
[18:26:43] topranks: so it adds to disabled
[18:26:43] + member ge-3/0/23;
[18:27:03] actually
[18:27:25] https://phabricator.wikimedia.org/P21509
[18:27:43] so seems correct AFAICT
[18:27:57] should I commit?
[18:28:09] not sure about the ge-3/0/24
[18:28:42] that's restbase1032 and maybe it's being provisioned right now, is in planned status
[18:28:59] yeah the ge-3/0/24 is odd
[18:29:05] https://www.irccloud.com/pastebin/Y2LyVqvZ/
[18:29:22] ^^ that is current config on switch.
[18:29:48] Something odd happened here, as homer would normally either add the 1st line, or the last 2, but it should never produce a config with all 3 of these lines
[18:30:22] i.e. port should either be part of interface-range disabled or interface-range , never both.
[18:30:30] I agree
[18:30:35] so does my diff fix it?
[18:30:36] I am still scratching my head about your issue though.
[18:30:52] None of the interfaces on asw2-b-eqiad have a "." in them / there are no sub-ints.
[18:30:54] it removes 24 from the disabled block
[18:31:13] yes your diff fixes it, and it's the correct thing to do so you can say yes.
[18:31:32] ok committing, I'll test it's a noop on asw-c
[18:31:34] and then send the patch
[18:31:53] going back to the interface names, cos none have a dot in them, I've no idea how the code ever evaluated that "if" on line 162 as true.
[18:32:10] have you checked the iface names in the switch?
[18:32:17] IIRC we don't represent them with .0 in netbox
[18:32:21] but they are .0 in junos
[18:32:26] "interface_name" should never be in jri already.
[18:32:44] then I have no idea
[18:32:52] there's some subtleties there but you're largely correct.
[18:33:11] I don't think the .0's matter though, this code is purely parsing netbox data, so doesn't matter the switch adds the .0
[18:33:46] topranks: we do have 2 ge-3/0/24
[18:33:54] https://netbox.wikimedia.org/dcim/interfaces/24297/
[18:33:58] ah :)
[18:34:02] https://netbox.wikimedia.org/dcim/interfaces/16255/
[18:34:25] the one on b2 is clearly bogus
[18:35:26] Well either the switch should be asw2-b3, or the interface should be ge-2/0/24
[18:35:42] give me a sec I should be able to work out which one is the case
[18:37:58] ack, thx
[18:40:03] so my thought right now is that we discovered the bug just because of the wrong data in Netbox
[18:40:08] in reality we never enter that if
[18:40:18] patch is: https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/765581
[18:41:27] topranks: https://netbox.wikimedia.org/extras/changelog/79683/
[18:41:43] Ok so restbase1032 is 100% connected to port ge-3/0/4.
[18:42:43] Port 0/4 on switch 2 has a 10G SFP+ in it, and xe-2/0/4 is in use (cloudelastic1002)
[18:42:56] ryankemper: all details above, but TL;DR there was some wrong data in Netbox (duplicated interface name), that caused the code to hit that line with the bug, that usually should never be hit
[18:43:12] I've committed the change for the decom of elastic1039
[18:43:32] volans: yes, I suspect Netbox is always returning these in an order that we always hit the sub-int "parent" first, and then the sub-interfaces
[18:43:43] (normal sorted list would result in that for instance)
[18:44:04] so we've never hit that line. But the fix is right we should do it.
[18:44:36] We did hit that statement this time cos of duplicate interface name (I think this is a virtual-chassis quirk, on a regular stand-alone device netbox wouldn't allow it)
[18:44:53] yep
[18:45:26] feel free to delete the bogus iface
[18:45:44] I'm not sure why you were referring to ge-3/0/4.
and xe-2/0/4
[18:46:01] aren't the duplicate ones
[18:46:01] asw2-b3-eqiad / ge-3/0/24
[18:46:06] asw2-b2-eqiad / ge-3/0/24
[18:46:07] ?
[18:46:31] Because I'm an idiot is the answer
[18:46:33] ffs!
[18:47:44] I'd convinced myself there were even more levels of the mess due to that mistake.
[18:47:59] Ok that's fine, I've removed the bogus 3/0/24 from asw2-b2-eqiad now.
[18:48:32] * volans running diff without hotfix
[18:48:36] got it
[18:54:09] topranks: I've reverted the hotpatch, the diff shows me a change in order, but apart from that nothing
[18:54:14] ! member ge-3/0/1 { ... }
[18:54:15] ! member ge-3/0/3 { ... }
[18:54:16] ...
[18:54:29] yep I actually ran the same.
[18:54:41] I think that's good to commit.
[18:55:01] order may have got mixed up with the bad name in there.
[18:55:11] ack are you committing/
[18:55:12] ?
[18:55:28] ok yes running now.
[18:55:34] thx
[18:55:46] I'm in doubt whether to deploy the fix or not
[18:55:51] is not really needed or urgent
[18:55:54] as long as the data is correct
[18:55:59] so we can wait for Ar.zhel too
[18:56:13] but no strong opinion, either works
[18:56:16] No need to merge today, so yeah we can wait to get Arzhel's thoughts.
[18:56:22] k
[18:56:38] We should merge it though, even though I suspect the order Netbox supplies the data means it's never triggered.
[18:56:54] thanks for the support!
[18:56:57] I don't think we're certain enough it'll never happen to remove that "elif" completely
[18:56:58] yes I've merged
[18:57:09] so it will go if we do a netbox release
[18:57:29] cool, yeah doesn't merit a release I would think.
[18:57:38] commit went fine.
[18:58:50] great, thx
[23:59:50] Mail, Infrastructure-Foundations, SRE, Znuny, vrts: Not receiving VRT notification emails - https://phabricator.wikimedia.org/T302139 (Dzahn) I gave people on #wikmedia-vrt the summary of the incident report basically. Since some were wondering why they got mails at once etc. There is a doc...