[00:59:44] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9844770 (Papaul)
[02:47:25] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9844864 (Papaul)
[09:05:45] I'm trying to figure out why LDAP access would be blocked from 10.192.0.0/22 (cloudidm2001-dev) to cloudservices2004-dev. Ferm rules look correct and the IP range should be covered by ($PRODUCTION_NETWORKS
[09:10:32] slyngs: there are specific ACLs for the cloud vlan
[09:10:38] Aaah
[09:11:11] slyngs: maybe there https://github.com/wikimedia/operations-homer-public/blob/master/policies/cr-cloud.yaml#L48
[09:11:22] slyngs: it's applied to traffic leaving the cloud vlan
[09:15:16] slyngs: you can probably add `cloudidm-dev_group` to that list
[09:15:32] the full host list is there https://netbox.wikimedia.org/extras/scripts/results/5819958/
[09:15:40] Hmm, or maybe cloudservices2004-dev.codfw.wmnet should have been on the cloud VLAN
[09:17:11] slyngs: it is on the cloud vlan https://netbox.wikimedia.org/dcim/interfaces/29944/
[09:18:10] Sorry, I meant: cloudidm2001-dev.codfw.wmnet
[09:19:34] It only needs to communicate with the other cloud-dev stuff
[09:20:32] slyngs: that's a good question! it depends on the overall data flows
[09:20:59] I guess cloudidm has to communicate with something outside of cloud?
[09:21:29] Well, yes, mariadb
[09:23:44] slyngs: and who will be managing this service/server?
[09:24:32] Realistically me :-)
[09:25:02] It's for WMCS labtest
[13:07:39] slyngs: are you working on netbox-next? I see netbox broken there
[13:11:19] Nope..but I'll check
[13:11:36] Seems fine
[13:14:15] Something tried to do some weird login that failed: GET /oauth/login/oidc/?
[13:14:15] next=/dcim/devices/3423/ => generated 0 bytes in 147 msecs (HTTP/1.1 302) 11 headers in 828 bytes (1 switches on core 0)
[13:14:15] [2024-05-30T13:07:14] Internal Server Error: /oauth/complete/oidc/
[13:32:16] Okay, is this what happened: You're signed out, then get redirected to the idp-test signin page. Wait there for some number of minutes and then authenticate?
[13:33:10] Apparently that will trigger something in the OIDC plugin for netbox; presumably some timing is off. This causes the bug I found in the log from when I assume you tried to log in
[13:38:46] that was me
[13:39:16] but I tried to log in a few seconds after loading the page
[13:39:26] I got
[13:39:27] Missing needed parameter state
[13:39:32] Exactly
[13:39:49] if I hit the get to the homepage button it actually works
[13:39:51] weird
[13:40:02] I can reproduce that fairly easily. I'm looking at the social-auth plugin to see what the timing is
[13:40:27] I'm not sure yet, but I think the encrypted secret from Netbox to IDP times out pretty fast
[13:40:47] thanks
[13:41:01] not sure if related, puppet is disabled
[13:41:04] (3145 minutes ago). Puppet is disabled. test-swift - ayounsi
[13:41:09] When you then hit the Homepage the whole OIDC thing kicks in again that just pulls the token from IDP-test and you're authenticated
[13:41:26] I don't believe so
[13:42:12] It's an interesting little "bug", would be nice if you had a few minutes to sign in, but I can't tell if that would be a security risk
[13:51:51] lmk if you can't repro and I should retry to repro
[13:52:23] No no, it's rather trivial to do
[13:54:11] I'm trying to figure out how social-auth loads the data. It's in "request_data()" but that's wrapped in a number of classes and inheritance
[13:55:26] raise NotImplementedError('Implement in subclass') , great, now where are the subclasses :-)
[13:56:58] lol
[13:59:41] For those following at home, it is in https://github.com/python-social-auth/social-app-django/blob/5.0.0/social_django/strategy.py#L49
[14:10:26] I wonder if it's CAS that fails to ship the variable
[14:16:34] I'll do a bit more digging tomorrow. I'd like to understand the issue before switching to OIDC for production
[15:42:11] XioNoX: topranks: given IXbr is live, I was thinking of turning on ns2. any concerns from your end on doing that today? if not today then maybe Monday?
[15:45:27] sukhe: +1 from me yeah, can't think of any reason why not ??
[15:48:22] topranks: are you around for that or too late for you? please don't feel compelled to do it today :)
[15:48:29] checking it by bblack once but I am up for it
[15:48:43] we will need to turn on the ns2 adverts from dns7x, but that's a single command + the homer merge I think should be it?
[15:53:43] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:25] sukhe: hey yep I'm around for it
[15:55:35] indeed should only be a matter of doing both those things
[15:56:02] let's do it, I have bblack's +1 as well
[15:56:58] ok
[15:57:11] fire away with announcing the IP, I'll check that, run a bunch of test queries etc.
[15:57:18] thanks, on it
[15:57:25] if we're good then we can change the aggregate config on the CRs to announce the /24 upstream
[16:01:04] sorry, I pasted in -ops
[16:01:11] good to go, we are advertising ns2
[16:21:45] topranks: there was a pending change I think
[16:21:45] [edit interfaces ae0 aggregated-ether-options lacp]
[16:21:45] - periodic slow;
[16:21:45] + periodic fast;
[16:22:02] hold on
[16:22:06] er ops
[16:22:07] hmm
[16:22:17] XioNoX: that's for ix.br we have a LAG going do we?
[16:22:23] that was only cr2 yes
[16:22:37] sorry, I merged it as it was hidden in the scrollback and I realized only later. should I revert it?
[16:24:26] no it seems ok for now
[16:24:36] The port to IX.br is still up
[16:27:00] I'll change the config for the LACP settings at IX.br back to what they were manually, so as not to revert the ns2 range / aggregate bit
[16:27:04] I'll look at the automation shortly
[16:27:11] ok thanks
[16:27:15] I see IX.br are set for "slow" mode (30 seconds I think)
[16:27:25] we've just changed to fast, I would have expected that to break it but it didn't
[16:27:27] which is nice
[16:27:30] sorry about that! but in cases of conflict such as this in homer, there is no other way to selectively apply changes right?
[16:28:40] ok changed back and it's still stable, now showing matching as 'slow' either side
[16:28:49] no worries
[16:28:59] there ought not to be changes in Homer outstanding.
[16:29:01] thanks, can I ask where you changed it? for my own knowledge
[16:29:09] but humans do human (I am possibly most guilty)
[16:29:23] cmooney@cr2-magru# show | compare
[16:29:23] [edit interfaces ae0 aggregated-ether-options lacp]
[16:29:23] - periodic fast;
[16:29:23] + periodic slow;
[16:29:33] ^^ that's the change applied on the router manually
[16:29:42] right, so I guess you manually applied it back?
[16:29:55] I need to check our automation, it may be that IX.br only does slow for LACP and we don't have a toggle for that in our automation yet
[16:30:16] I manually changed it back as other option is an atomic rollback, which would have undone the aggregate route stuff added for ns2
[16:30:30] https://phabricator.wikimedia.org/T351505#9815280 says "slow" here as well
[16:31:26] yeah so after the change the "actor" (us) showed fast on that output, the "partner" (ix.br) showed as slow
[16:31:40] a mis-match I'd expect to bring the connection down, but it didn't
[16:31:54] they are both back on 'slow' now, leave it to us to fix the automation
[16:32:04] yeah
[16:34:37] https://grafana.wikimedia.org/d/Jj8MztfZz/authoritative-dns?orgId=1&refresh=30s&viewPanel=202
[16:41:47] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9847581 (cmooney) >>! In T362421#9808055, @cmooney wrote: > Cogent are picking the magru announcement as best globally from Novvacore it seems also. We could add `28189:8094...
[17:08:14] topranks: I wanted to check if it was like BFD, if it uses the lowest configured setting on both sides
[17:08:39] to not have to add a new config knob for just that one link..
[17:10:30] netops, Infrastructure-Foundations, SRE, Patch-For-Review: magru network setup - https://phabricator.wikimedia.org/T362421#9847731 (cmooney) Ok after change Cogent are going to Telia, which we have at all the other POPs, so I think a better result. Novvacore are still announcing it at IX.br so d...
[17:11:33] XioNoX: yep makes sense, I actually don't know myself
[17:11:45] I was sure I saw a mis-match on this making things not work in the past
[17:11:59] I agree though, let's dig in a little and avoid the knob if we can
[17:12:19] btw: https://phabricator.wikimedia.org/T362421#9847581
[17:13:03] topranks: sounds good!
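On the open question of whether LACP behaves like BFD here: a minimal sketch, assuming standard IEEE 802.1AX behaviour, in which each side advertises its own timeout preference and sends LACPDUs at the rate its peer asked for. Under that assumption a fast/slow mismatch leaves the link up with asymmetric timers instead of settling on one shared value. The function names, and the idea that the `periodic fast` / `periodic slow` setting maps directly to the advertised preference, are illustrative assumptions and were not checked against the routers in this log.

```python
# Timer values defined by IEEE 802.1AX, in seconds.
FAST_PERIODIC_TIME = 1
SLOW_PERIODIC_TIME = 30
TIMEOUT_MULTIPLIER = 3  # a peer is timed out after three missed intervals


def periodic_interval(mode: str) -> int:
    """Map a configured mode ("fast" or "slow") to its LACPDU interval."""
    if mode == "fast":
        return FAST_PERIODIC_TIME
    if mode == "slow":
        return SLOW_PERIODIC_TIME
    raise ValueError(f"unknown LACP periodic mode: {mode!r}")


def lacp_timers(actor_mode: str, partner_mode: str) -> dict:
    """Hypothetical helper: timers each side ends up using.

    Assumption: each side transmits at the rate the *other* side
    requested, and expires the peer after three of its own intervals.
    """
    actor = periodic_interval(actor_mode)
    partner = periodic_interval(partner_mode)
    return {
        "actor_tx_interval": partner,   # we send as often as the partner asked
        "partner_tx_interval": actor,   # the partner sends as often as we asked
        "actor_rx_timeout": actor * TIMEOUT_MULTIPLIER,
        "partner_rx_timeout": partner * TIMEOUT_MULTIPLIER,
    }


if __name__ == "__main__":
    # The mismatch discussed above: actor (us) fast, partner (IX.br) slow.
    for name, value in lacp_timers("fast", "slow").items():
        print(f"{name}: {value}s")
```

If the assumption holds, neither side would time the other out in the mismatched state, since each receives LACPDUs at least as often as it expects, which would be consistent with the port staying up.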
[17:15:24] looking at the actual stats for dns7xxx it doesn't seem hugely significant (i.e. small drop since change made), but still I think probably for the best
[17:16:34] sukhe: to confirm from a netops point-of-view the ns2 range announcement looks ok
[17:16:47] it's been picked up by our transits etc.
[17:18:37] reasonable uptick in traffic over ix.br too
[17:18:40] https://usercontent.irccloud-cdn.com/file/jsz0Yryl/image.png
[17:18:44] nice :D
[17:19:29] thanks for the help as always!
[17:23:09] np... great to get it done!
[17:25:50] nice :)
[18:21:45] Is there an established way to set SSH keys via puppet? I'd like to create a Gerrit bot and set an SSH key via puppet on the vm's filesystem
[18:23:52] More specifically, the way the bot works requires checking out/pushing rather than using the API
[18:24:34] I've been looking for prior art but seem to not see much
[18:45:23] brett: I think you'd probably just do that using file resources and keep the private key file in a secrets repo
[18:45:45] all the built-in ssh keys support in Puppet I know of is for managing known hosts or authorized keys files
[18:45:59] and there's a community package for managing ssh-keygen via
[18:46:41] I was thinking using the secrets repo was the way to go. I just didn't want to go this route if there was a more established way. Thanks for confirming
[18:46:47] yeah not that I've seen
[18:47:12] usually when we want to do something like this in production, we use either the host key of the machine, or we make a tls cert, etc
[18:47:30] I suppose using the host key of the machine would also be a possibility actually
[18:49:24] e.g. /etc/ssh/ssh_host_ed25519_key?
[18:58:13] Does someone here have admin access that can create a user account for me? (I assume we can/should do that directly rather than with ldap?)
[19:06:43] that's a good question, I think I would ask hashar
[19:44:18] Since the host key is owned by root and the application is running unprivileged it's probably best to just create one
[19:45:15] hashar, I'd love to create a gerrit that has access to clone/create a CR to operations/dns and operations/puppet, please!
[20:02:03] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9848597 (Papaul)
[20:02:36] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9848600 (Papaul)