[01:12:37] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:12:37] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-web_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:37] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:02:37] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:37] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:36] moritzm: it seems that the os report is still trying to connect to puppetdb1003:443 instead of the discovery record and port for the proxy
[08:07:10] yes, given that I didn't change the config yet, that isn't really surprising either :-)
[08:07:20] probably later today
[08:10:16] lol, sorry, I somehow thought it was already merged :D
[08:10:18] my bad
[08:31:02] I'll add you to CC when I make a patch :-9
[08:36:06] SRE-tools, Infrastructure-Foundations, SRE, User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (aborrero) We are about to run the procedure again for {T346892} in case you want to test/observe/re...
[08:42:37] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:37] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:27:22] SRE-tools, Infrastructure-Foundations, SRE, User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (Volans) Which partman recipe do you use? Does it include `modules/install_server/files/autoinstall/...
[09:28:19] SRE-tools, Infrastructure-Foundations, SRE, User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (Volans) And we don't see the same issue on plain reimages, where we don't even run wipefs.
[09:30:59] SRE-tools, Infrastructure-Foundations, SRE, User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (aborrero) Yes, apparently, we do! `lang=shell-session $ git grep cloudservices modules/install_ser...
[09:46:23] volans: merging https://gerrit.wikimedia.org/r/c/operations/dns/+/959691 requires https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change right?
[09:48:21] because the forward will be still included right?
[09:48:25] yes I'd say so
[09:48:39] netops, Infrastructure-Foundations, SRE: Include Netbox Anycast IPs in Capirca host definitions - https://phabricator.wikimedia.org/T347016 (cmooney) p:Triage→Low
[09:49:02] volans: cool, do you mind a quick review as well?
[09:59:00] SRE-tools, Infrastructure-Foundations, SRE, User-aborrero: cookbook: sre.hosts.decommission: also remove logical volumes (to allow rename while reimage) - https://phabricator.wikimedia.org/T346875 (aborrero) Could the problem be related to the rename? Just a theory The renames we have been condu...
[10:05:13] volans: when you have a moment, https://gerrit.wikimedia.org/r/c/operations/puppet/+/959696/ is the patch to switch os-reports to the puppetdb-api record
[10:09:44] XioNoX: sorry was in a call
[10:11:10] moritzm: done
[10:11:13] with a comment
[10:11:37] no rush :-)
[10:14:02] XioNoX: done
[10:14:11] thx
[10:20:34] netops, Infrastructure-Foundations, SRE: Include Netbox Anycast IPs in Capirca host definitions - https://phabricator.wikimedia.org/T347016 (cmooney) Open→Resolved Script updated and re-run, seems fine.
[10:38:21] ganeti-test is using nftables now, let me know if you see anything odd (but did test a few things and all looks fine)
[11:48:32] netops, Infrastructure-Foundations, SRE: Audit cloud filters on CR in respect of new cloud-private and public VIP networks - https://phabricator.wikimedia.org/T347030 (cmooney) p:Triage→Medium
[11:57:55] netops, Infrastructure-Foundations, Observability-Metrics, SRE: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (ayounsi) a:ayounsi
[11:58:03] netops, Infrastructure-Foundations, Observability-Metrics, SRE: Investigate and deploy 'max-repeaters = 20' to all librenms devices - https://phabricator.wikimedia.org/T346759 (ayounsi) Open→Declined Thanks, I spent a bit more time on that. Bumping `max-repeaters` to 20 didn't change a t...
[12:16:57] netops, Infrastructure-Foundations, SRE: Users management on SONiC - https://phabricator.wikimedia.org/T338028 (ayounsi) Open→Resolved a:ayounsi This is done for now, more improvements to come from Dell, tracked in T342673.
[12:17:02] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi)
[12:18:43] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) @cmooney I think this can be closed?
[12:36:56] SRE-tools, Infrastructure-Foundations, SRE, Spicerack, Patch-For-Review: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (MoritzMuehlenhoff)
[13:03:52] netops, Infrastructure-Foundations, SRE: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) @ayounsi yeah I think so, the RMA is complete as far as Juniper is concerned and we are no longer using the old card. It's unclear to me if the new card has been received in cod...
[13:13:10] (SystemdUnitFailed) firing: generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:38] slyngs: "In the set up the team asked for a couple more items. Can you also share the “aud” (audience) & cid (clientId)values from the ID token?" I thought we didn't need to give Juniper the IDtoken?! do you know what they're talking about (/cc jbond)
[14:21:10] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) > In the set up the team asked for a couple more items. Can you also share the “aud” (audience) & cid (clientId)values from the ID token?
[14:21:13] Not really sure, no, we haven't used the idtokens before
[14:22:16] ClientID might just be uid
[14:23:41] XioNoX: this i think is similar to what i wrote here https://phabricator.wikimedia.org/T306238#8637017
[14:24:07] basically it looks like they are asking for a time limited token which is normally limited to a few hours but they are expecting it to last forever.
[14:24:13] but i could also be missing something
[14:26:31] Adding you to the thread :)
[14:27:24] They request was really weird: Either IDToken, or the URL
[14:27:28] Their
[14:29:22] But I think the client_id is just "juniper" and the aud is the https://idp.wikimedia.org/oidc/oidcAccessToken
[14:29:49] REQUIRED. Audience(s) that this ID Token is intended for. It MUST contain the OAuth 2.0 client_id of the Relying Party as an audience value. It MAY also contain identifiers for other audiences. In the general case, the aud value is an array of case sensitive strings. In the common special case when there is one audience, the aud value MAY be a single case sensitive string.
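The `aud` rule quoted just above (it MUST contain the Relying Party's `client_id`, and may be either a single string or an array of strings) can be checked with a few lines of Python. This is only a rough sketch for inspecting a token by hand: it reads the claims without verifying the signature, so it is not a substitute for real ID token validation. The sample token, its issuer URL, and the helper names are made up for illustration, and "juniper" is just the client ID guessed at in this conversation.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the claims segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def audience_contains(claims: dict, client_id: str) -> bool:
    """True if the aud claim (a string or a list of strings) includes client_id."""
    aud = claims.get("aud")
    if isinstance(aud, str):
        return aud == client_id
    if isinstance(aud, list):
        return client_id in aud
    return False


def _b64url(data: dict) -> str:
    """Helper for building the hypothetical sample token below."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical, unsigned sample token; the issuer and claim values are assumptions.
sample_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"iss": "https://idp.example.org", "aud": "juniper"}),
    "",
])

claims = decode_jwt_payload(sample_token)
print(claims["aud"], audience_contains(claims, "juniper"))
```

Handling both the string and the list form matters because the spec allows either; a check that only compares strings would reject a valid multi-audience token.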
[14:29:55] from https://openid.net/specs/openid-connect-core-1_0.html#IDToken
[14:30:33] so aud must contain cid?
[14:31:47] and it's part of the id token, maybe in that case they're asking us what aud and cid to use when they generate the temporary IDtoken to send to our endpoint?
[14:32:23] But they already got those
[14:32:49] FYI i decoded that base64 blob and it is a JWT
[14:33:37] also here is a more human explanation https://developer.okta.com/docs/guides/validate-id-tokens/main/
[14:34:14] It also makes a little more sense when reading the docs for the next version of CAS https://apereo.github.io/cas/development/authentication/OAuth-Authentication-Clients.html
[14:34:29] There is a new audience field
[14:35:22] Based on that description I'd assume that the current aud is just the client ID
[14:35:41] The client ID just being "juniper" in this case
[14:36:19] +1
[14:38:15] should I tell them both aud and cid are "juniper" ?
[14:38:19] and see what happens?
[14:38:20] :)
[14:38:59] XioNoX: have they already read https://phabricator.wikimedia.org/T306238#8637017
[14:39:19] specifically
[14:39:21] This bit is a bit confusing to me. ID tokens have a specific meaning in oauth/OIDC. it is a token that an authorized client application can request which is then used to make further requests. e.g.
[14:39:25] you authenticate to a client application e.g. netbox (lets say) by default we don't release any attributes but we support the authorise code grant type netbox the application requests an ID token (via browser redirects with potential for the user to authorise) netbox uses the received ID Token to ask CAS for more attributes about the user e.g. name, email, groups
[14:39:28] dunno if they read it but I emailed it to them :)
[14:39:35] ack
[14:40:41] Puppet, Infrastructure-Foundations, Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (Lucas_Werkmeister_WMDE)
[14:40:43] i think we should say that manually generating an ID token and sending it via email is not how OpenID works, can they send us the part of the spec they have implemented or are referring to
[14:40:56] slyngs: what's your thought?
[14:41:13] Puppet, Infrastructure-Foundations, Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (Lucas_Werkmeister_WMDE) (Not sure which tags `logspam-watch` belongs to, I grabbed some relevant-seeming ones from older tasks.)
[14:41:36] jbond: they asked for the idtoken OR the endpoint URL
[14:41:44] we provided them with the endpoint and they're ok with it
[14:41:50] and now asking for the aud/cid
[14:42:10] then yes tell them it's juniper
[14:42:22] cool
[14:42:24] or we can set it to whatever they like
[14:42:44] Puppet, Infrastructure-Foundations, Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (Lucas_Werkmeister_WMDE) The message seems to be normalized correctly in Logstash, at least: {F37746900}
[14:44:42] replied to their email, thx
[14:49:49] Puppet, Infrastructure-Foundations, Wikimedia-production-error: logspam-watch doesn’t handle normalized exceptions well - https://phabricator.wikimedia.org/T347064 (dancy) `logspam-watch` works by reading the /srv/mw-log/{exception,error}.log files which only have the final error message (no template...
[16:18:49] SRE-tools, Spicerack, cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (fnegri)
[16:19:46] jbond: git-sync-upstream seems to be crashing with 'NameError: name 'old_environ' is not defined
[16:19:46] '. the actual rename works but updating the prometheus metrics does not
[16:20:25] taavi: ack let me see if i can send a quick patch otherwise i'll revert
[16:29:20] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (Volans) You can find in the logs the command and the exit code, but that's correct the output of a remote command is not automatica...
[16:31:25] taavi: are you free to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/959808
[16:32:52] sure, one moment
[16:32:59] thanks <3
[16:34:07] it works. that was simple :D
[16:34:17] yes silly mistake :)
[16:50:04] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (fnegri) I see your point of avoiding to spam the logs, but I still think it can be useful in some situations. Maybe the output coul...
[17:13:09] (SystemdUnitFailed) firing: generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:10] (SystemdUnitFailed) firing: generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed