[01:19:45] (SystemdUnitFailed) firing: (7) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:45] (SystemdUnitFailed) firing: (7) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:45] (SystemdUnitFailed) firing: (8) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:45] (SystemdUnitFailed) firing: (8) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:38] netbox, Infrastructure-Foundations, SRE, decommission-hardware, and 2 others: decommission gerrit1001.wikimedia.org (dcops, netbox) - https://phabricator.wikimedia.org/T340077 (Volans) @Dzahn I've deleted both IPs, nothing to sync as their DNS was managed manually and not via netbox: https://netb...
[08:09:45] (SystemdUnitFailed) firing: (8) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:45] (SystemdUnitFailed) firing: (8) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:45:48] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (jbond) Before we implement this it would be useful to understand further why this needs to be adjusted at run time, this feels inherently wrong to me...
[11:32:12] FYI I've added the test-cookbook to the docs: https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging (cc moritzm)
[11:46:30] I will attempt to switch Netbox Next to OIDC
[11:47:52] via puppet?
[11:47:58] YES
[11:48:38] volans: looks good
[11:49:38] thx
[11:49:52] slyngs: ack, go ahead for me whenever you want if it's only netbox-next
[11:50:23] Which is good, because I made a small mistake :-)
[11:51:35] :D
[11:51:45] test/canary/next hosts are there for this
[11:59:48] Can I trouble someone for a quick review, mostly because it touches CAS as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/932223
[12:03:12] Nevermind, I'm fairly sure it will work ... At least not break anything :-)
[12:24:45] (SystemdUnitFailed) firing: (7) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:56] Apparently I'm not a member of the NDA groups ...
[12:34:06] slyngs: you should be a member of the ops group
[12:35:25] Yeah, but CAS complained that I wasn't in NDA. I've tried removing that requirement, but now it just complains that I'm also not in ops and wmf, which isn't true
[12:42:50] is there a simple way to silence IRC notifications for cookbooks in general, i.e. so that test-cookbook could provide an option for it? 99% of all tests with test-cookbook will be against harmless things not needing any IRC update and currently it's quite noisy
[12:44:48] netops, Infrastructure-Foundations: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (ayounsi)
[12:46:52] https://phabricator.wikimedia.org/T324655
[12:50:59] moritzm: no, currently not if you run it for real, because it's doing changes in production. If you run it with dry-run it does not !log
[12:51:37] slyngs: so the logic should be that to access you have to be wmf or nda and then for authz you should be in ops to be superuser/staff
[12:51:46] (I don't recall which one)
[12:52:03] ok
[12:52:53] You have to be wmf, nda or ops, and ops is superuser/staff, but it seems like CAS is doing the checking wrong. It lists all my memberOf correctly and then tells me that I'm not in ops and wmf
[12:53:09] moritzm: and what arzhel linked ofc, I'm not saying we'll never do something about it, but it's a slippery slope so I'd like to do it "right" (for some definition of right)
[12:53:39] slyngs: ack, I said wmf or nda because ops is a subset of them
[12:54:20] slyngs: when you say CAS, are you saying you are getting the error at idp.wikimedia.org and all websites
[12:54:27] or is this specific to netbox-next
[12:54:32] and oidc
[12:54:57] yeah, I think this is best addressed under the bigger T324655 umbrella eventually. there are definitely use cases where an explicit opt-in not to log would be useful.
but it's also only a cornercase as well
[12:54:57] T324655: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655
[12:55:12] I get redirected to idp just fine, and then I get the error that would indicate that I configured it wrong, but looking at the logs it seems like it's my account that doesn't want to validate
[12:55:57] slyngs: do you get the error on the idp.wikimedia.org domain or netbox-next.wikimedia.org?
[12:56:07] can you send a screenshot
[12:56:08] jbond: idp.wikimedia.org
[12:56:12] Sure
[12:57:11] slyngs: never mind i see it now
[12:57:41] My assumption would be that the secret or service name is work, but that doesn't seem to be the case
[12:58:54] "is wrong" ... not work
[12:59:45] volans: btw why did you want netbox-next going to idp and not idp-test? (they both use the same ldap db)
[13:00:45] isn't idp-test there only to test the IDP and can be down randomly for tests and upgrades and such?
[13:01:03] currently with CAS it was going to idp if I read that correctly
[13:01:21] so I just asked why we were changing it in the CR and if that was intentional
[13:01:23] well it's there to test the idp and onboarding new services etc
[13:02:12] if you feel strongly about using the test one and if in general -next -dev -test hosts should use the test one feel free to change it
[13:02:20] in this case as we are changing from cas to OIDC i think it would be better to point netbox-next at idp-test so we can more easily see the correct logs, make tweaks etc until we get it right
[13:02:35] ah if it's temporary sure absolutely
[13:02:40] no prob at all
[13:02:54] Cool, I'll just move over to test then
[13:03:04] whether it should be permanent i'm not sure but yes until we get things sorted it probably makes sense to switch it to -dev
[13:03:30] sorry i think i saw the cr in the morning and this didn't really click
[13:03:39] +1
[13:07:06] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/932239
[13:08:28] slyngs:
FYI we should also disable the standard netbox loging page. i'm guessing we had something to do this in cas.settings.py
[13:08:39] *login page
[13:12:45] jbond: Based on the debate on the netbox github page, I'm not totally sure that you can do that for anything social_auth related, without paying for Netbox
[13:13:32] hmm that seems strange, but one issue at a time :)
[13:13:48] Someone suggested a workaround
[13:14:34] Okay, so that actually worked fine
[13:14:35] slyngs: of course that has worked straight away
[13:15:06] But WHY
[13:15:21] * jbond looking
[13:16:14] slyngs: ahh yes i think it might be that in production we have entries for cas and oidc
[13:16:26] i wonder if we need to be a bit more specific with the uris
[13:16:31] so cas knows which one to pick
[13:16:44] Maybe it's because the group limit is missing on idp-test
[13:16:57] add it
[13:18:04] slyngs: it does have the restricted groups
[13:18:22] Oh okay, yeah it's in Puppet as well
[13:19:09] I don't think it's deployed yet
[13:19:17] ?
[13:19:33] I got it, just a sec
[13:21:22] Puppet complains that it can't find the client secret
[13:21:31] are you adding the client ytes just checking one sec
[13:25:10] Hmm, not see the bug
[13:25:58] jbond: did you fix it?
[13:26:23] slyngs: i added the secret and we see the same issue on idp-test now
[13:26:58] So it does seem related to the groups
[13:27:47] yes
[13:28:53] That's really weird given that it seems to work fine for gitlab
[13:29:01] well, gitlab-replica
[13:29:27] you got to love java errors https://phabricator.wikimedia.org/P49470
[13:29:50] i think that the error message on the page is perhaps a bit misleading
[13:31:47] I think that's just generic for all things error
[13:33:09] doesn't it look like the java error is CAS complaining that it doesn't have a Danish error message, or am I reading that wrong
[13:34:22] possibly but i feel that would be strange as i don't have a danish lang header
[13:34:38] Reasonable logic
[13:38:12] slyngs: fyi if you didn't find them the logs are in /var/log/cas/cas.log
[13:38:15] they are verbose
[13:38:33] Yes, that's where I'm looking
[13:39:26] It's really weird that it would claim I'm not in the groups, unless it tries an exact match, but again, working for gitlab-replica-oidc
[13:39:41] yes it's strange
[13:39:57] btw are you sure it's working for gitlab-replica. did you test
[13:40:03] Yes
[13:40:38] But hard to tell because you don't really see CAS before you're signed in
[13:41:53] gitlab replica is still using cas protocol and it's still hitting idp.w.o afaict
[13:42:03] Oh okay...
[13:43:46] and idm doesn't have any restriction
[13:45:14] ahh we can turn up the logging one sec
[13:45:33] No, everyone is supposed to be able to login to the IDM...
We can try to add a restriction temporarily
[13:45:57] no no i think we can be confident it's this bit
[13:47:43] Oh, is this it, my groups are in "group", but we're checking the "memberOf" attribute
[13:48:21] Maybe we can "just" add memberOf as a scope
[13:48:23] ahh yes that would make sense
[13:48:55] hmm i think it should be easy to just check groups one sec let me send something
[13:52:09] slyngs: do you have a task id
[13:52:35] https://phabricator.wikimedia.org/T308002
[13:53:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/932247
[13:57:47] CI isn't happy about the "
[13:58:26] that's black's fault :(
[13:59:48] updated
[14:01:22] +1
[14:06:33] ok that gives me a new error
[14:06:41] But prettier
[14:08:10] Now my groups are in memberOf....
[14:12:55] For some reason that renamed the "groups" in SimplePrincipal to memberOf, and the memberOf in requiredAttributes to groups
[14:13:58] working theory but i'm wondering if things are checked twice. once at the start and then again once all attributes have been re-mapped
[14:14:26] it looks like it's failing much earlier in the auth process
[14:14:30] Yeah, so cas.authn.oidc.core.claims-map.groups=memberOf needs to be cas.authn.oidc.core.claims-map.groups=groups, or just removed
[14:14:54] perhaps just removed do you want to try that and restart
[14:15:00] Sure
[14:16:14] in fact i'm not sure that will work. memberOf is coming from the identity provider i.e. ldap and we need to map it into the groups scope
[14:16:41] of course lets test and see what happens
[14:18:11] perhaps cas.authn.oidc.core.user-defined-scopes.groups=memberOf
[14:18:14] Okay, maybe a reload of tomcat would have been enough
[14:20:48] just modified cas.authn.oidc.core.user-defined-scopes.groups to be memberOf
[14:21:06] ack lets see
[14:22:08] Still no
[14:23:44] Just trying one last thing, then I have to run.
Got a four year old who wants to pretend it's halloween
[14:24:05] oh one sec i just changed the access strategy back to member of
[14:24:14] Okay :-)
[14:24:36] okay go ahead
[14:25:13] I just tried setting both cas.authn.oidc.core.claims-map.groups=memberOf and cas.authn.oidc.core.user-defined-scopes.groups=memberOf
[14:26:39] just re-reading https://apereo.github.io/cas/6.6.x/authentication/OIDC-Authentication-Claims.html#casauthnoidccoreskewPropertyConfig
[14:26:45] oh sorry i got a meeting
[14:27:02] Yeah, I have to run as well, I'll take a look again tomorrow
[14:29:44] I'll re-read the docs and try again tomorrow
[16:24:45] (SystemdUnitFailed) firing: (7) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:02:28] jbond, volans: https://gerrit.wikimedia.org/r/c/operations/software/homer/+/928795/10#message-98bd124849fe9ed0dffea7dcbfdaeaf2d6462140 :)
[18:03:03] lol
[20:24:45] (SystemdUnitFailed) firing: (7) debian-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
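Editor's note on the group check discussed between 12:51 and 13:47: the rough Python sketch below illustrates how an access rule that looks up one attribute name can reject a user whose groups are stored under a different name. It is not CAS code and does not reflect how CAS evaluates requiredAttributes. The function names, the dictionary shape, and the short group names are assumptions made for illustration; only the group names wmf, nda and ops and the attribute names groups and memberOf come from the conversation.

```python
# Hypothetical illustration only: not CAS, Netbox, or Puppet code.
ALLOWED_GROUPS = {"wmf", "nda", "ops"}  # group names mentioned in the log
SUPERUSER_GROUP = "ops"                 # ops is described as superuser/staff


def has_access(attributes, required_attr="memberOf"):
    """Return True if any allowed group is listed under required_attr."""
    groups = set(attributes.get(required_attr, []))
    return bool(groups & ALLOWED_GROUPS)


def is_superuser(attributes, required_attr="memberOf"):
    """Return True if the superuser group is listed under required_attr."""
    return SUPERUSER_GROUP in attributes.get(required_attr, [])


# Assumed principal shape: the groups arrive under "groups", while the
# rule looks for "memberOf", so the check fails despite valid membership.
principal = {"groups": ["ops", "wmf"]}

print(has_access(principal))                          # False
print(has_access(principal, required_attr="groups"))  # True
print(is_superuser(principal, required_attr="groups"))  # True
```

With the rule and the principal agreeing on the attribute name, the same membership passes, which matches the working theory in the log that the failure came from a naming mismatch and not from missing group membership.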