[11:15:13] <_joe_> just seen on alert1002: puppet-agent[3360430]: The directory '/etc/acmecerts/icinga' contains 1007 entries, which exceeds the default soft limit 1000 and may cause excessive resource consumption and degraded performance. To remove this warning set a value for `max_files` parameter or consider using an alternate method to manage large directory
[11:15:14] <_joe_> trees
[11:22:50] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11082227 (MoritzMuehlenhoff)
[13:08:25] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:23] ^ that's me building OpenJDK...
[13:23:25] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:25] FIRING: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:25] thanks _joe_ I'll take a look
[14:15:16] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:20:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:25] RESOLVED: SystemdUnitFailed: user@11984.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:42] Puppet, Infrastructure-Foundations: alert1002.wikimedia.org: Puppet warning of too many entries in /etc/acmecerts/icinga - https://phabricator.wikimedia.org/T401858#11083678 (jhathaway) p:Triage→Low a:jhathaway
[19:06:42] hi, I have a Puppet change to allow releng to `systemctl status` a few MediaWiki train deployment units
[19:06:42] IIRC as it touches sudo rules that has to pass through your team https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177958
[20:04:07] hashar: does it not work if you invoke systemctl without sudo?
[20:04:29] we don't have sudo :]
[20:04:45] hence that patch
[20:05:05] sorry, I must have been unclear
[20:05:55] hashar: https://phabricator.wikimedia.org/P81282
[20:06:28] I mean there's no problem with the patch, but, systemctl status doesn't need any perms
[20:06:59] ahh I see what you mean
[20:07:44] on Tuesday I wanted to quick check the status of `pretrain`
[20:08:00] so I went with `sudo systemctl status pretrain`
[20:08:14] which should give the last few lines of the unit, that is often sufficient to pinpoint the breakage
[20:08:39] without sudo, I don't have access to the journal and thus:
[20:08:39] Warning: some journal files were not opened due to insufficient permissions.
[20:08:59] I have mentioned that in the commit message: "AND the last few lines of the journal."
[20:09:10] I will amend and indicate without sudo I get a permission denied
[20:09:29] (if as non root we could be granted access to the journal files, that would be even better, but I don't know whether systemd supports that)
[20:10:40] ahh okay
[20:11:08] I think there's a system group `systemd-journald` it allows journal access to
[20:11:44] ah good to know
[20:11:49] ahh but we allow journalctl only for specific units right now
[20:12:03] then I imagine that would grant us access to everything
[20:12:06] I'm not sure how big a deal that is, I'd have to ask M.oritz
[20:12:08] yeah
[20:12:28] so I think the process is that the patch has to pass through some SRE weekly meeting for discussion/approval
[20:13:00] cause it is rather easy to grant too many permissions. At least this one is small enough :]
[20:13:05] it is merely for convenience
[20:13:22] in this case I'm happy to just +2 and move on, the 'new' permissions requested are equivalent to what's already granted
[20:13:28] yup
[20:13:39] if you want to be in the journald group, feel free to submit a patch or task though
[20:13:52] that one sounds more scary
[20:14:44] what I thought about is to move those tasks to a user systemd
[20:14:56] also a fine option IMO although some other things to double-check there
[20:15:30] and I am not sure we have any other cases of using `systemctl --user` or having the unit started under a user
[20:15:32] (like, I'm not sure offhand about the puppetization side of that, or if the common monitoring for failed systemd units cares about user units, etc)
[20:15:32] so hmm
[20:15:49] it is probably better to keep the same model as everything else
[20:15:57] fwiw, admin/data.yaml has a couple: 'ALL = NOPASSWD: /bin/journalctl *', for example for mw-maint (maintenance-log-readers). but it might be evil because the * allows too much.
[20:16:19] would indeed run it past m.oritz
[20:16:42] for now, the sudo systemctl status patch is merged and I'm running puppet on the deploy hosts now
[20:16:47] \o/
[20:17:08] mutante: and there is a "journalctl*" (with no space separator)
[20:17:43] 3+ minutes to load facts sheesh
[20:17:56] but yeah sudo rules are terrible. I wanted to allow `journalctl -u pretrain` to optionally be passed `-f` (to follow) but you can't build an allow list of arguments. Or maybe that can be done using some magic regex so I gave up
[20:18:12] just saying that we did this thing in the past to solve "let non-roots read all journals" and "add existing user to existing group" used to be a problem in puppet.. so we had to execute usermod -G
[20:19:18] yea, puppet is always extra slow just on deploy*
[20:20:14] that is surely debuggable
[20:20:39] I bet it is doing a bunch of git commands across all of /srv/deployment/*/*
[20:20:42] done
[20:22:06] Warning: journal has been rotated since unit was started, output may be incomplete.
[20:22:10] which hmm looks good
[20:22:14] thank you so much cdanis!
[20:22:41] (why has the journal been rotated for just a few lines is some other rabbit hole I am not going to jump into)
[20:22:54] it's a systemwide journal :)
[20:23:28] which leaves me to wonder whether systemd has per unit journals
[20:23:28] :b
[20:24:07] I'll check next week, I am sure those sudo rules will be handy cause I did use them a few times according to my bash history
[20:24:13] so I must have made the same mistake over and over
[20:24:23] and now it is fixed :party:
[20:28:59] yea, journals can be per unit. journalctl -u
[20:51:53] that is still hitting the systemd wide journal and hence requires sudo
[20:52:10] `-u` filters for a unit/pattern
[20:52:22] but still requires access to the journal
[20:54:15] * hashar sleeps
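
Editor's note on the sudo wildcard discussion above: the sudoers manual warns that command-line arguments are matched as one concatenated string, so a `*` can match across word boundaries. The sketch below only approximates that behaviour with Python's `fnmatch` module; it is not how sudo itself is implemented, and the two patterns (`TIGHT`, `LOOSE`) and the helper `args_allowed` are hypothetical, not taken from admin/data.yaml. It illustrates why a rule loosened to accept an optional `-f` also accepts anything else appended after the unit name.

```python
from fnmatch import fnmatchcase

# Hypothetical sudoers-style argument patterns (illustration only).
TIGHT = "-u pretrain"    # exactly these arguments, nothing more
LOOSE = "-u pretrain*"   # loosened so that a trailing "-f" is accepted


def args_allowed(pattern: str, argv: list[str]) -> bool:
    """Approximate sudoers argument matching: join the arguments into a
    single string, then glob-match that whole string against the pattern."""
    return fnmatchcase(" ".join(argv), pattern)


if __name__ == "__main__":
    candidates = [
        ["-u", "pretrain"],
        ["-u", "pretrain", "-f"],
        ["-u", "pretrain", "-u", "ssh"],
    ]
    for argv in candidates:
        # Print which of the two hypothetical rules would accept each call.
        print(argv, "tight:", args_allowed(TIGHT, argv),
              "loose:", args_allowed(LOOSE, argv))
```

Under these assumptions the tight pattern rejects `-f`, while the loose one accepts `-f` and also a second `-u` for an unrelated unit, which is the "allows too much" concern raised in the log.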