[15:09:20] moritzm Have you ever seen an alert for a "user@499.service" systemd unit failure? uid 499 is the debmonitor user and this seems to happen every night on a few of the wdqs servers
[15:13:38] systemd-timedated.service apparently fails at the same time... still investigating but just wondering if you've seen that
[15:16:26] inflatador: we have seen it before, https://phabricator.wikimedia.org/T199911
[15:16:32] do you have an example server, so I can have a look?
[15:16:48] wdqs1022, see https://grafana.wikimedia.org/d/000000342/node-exporter-server-metrics?orgId=1&var-node=wdqs1022:9100&from=now-12h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-disk_device=All&var-net_dev=All
[15:18:26] these hosts run a slightly different stack than the other wdqs1022, so it's possible our puppet code is part of the problem
[15:18:33] than the other wdqs hosts, that is
[15:18:55] more context in https://phabricator.wikimedia.org/T352878
[15:21:06] the failure of debmonitor seems rather like natural fallout of the high load which happens at the time: debmonitor runs a daily systemd timer to ingest package data, and if the host is under high load at that time, the systemd session will fail
[15:21:14] as in the task that Jesse linked
[15:21:33] and we do have an automatic cleanup of those
[15:21:42] that can be opt-in IIRC
[15:23:53] yeah, there's a toil class which gets applied to the swift hosts (which run into high load from time to time)
[15:23:57] ACK, looks like it's this class https://gerrit.wikimedia.org/r/c/operations/puppet/+/636633/6/modules/profile/manifests/mariadb/dbstore_multiinstance.pp
[15:24:29] I don't see high load in our case, but there could be something else triggering it. It's definitely recurring
[15:25:06] Anyway, I'll get a patch up for adding this class. Thanks for y'all's help!
[16:06:16] Can someone help me understand why https://gerrit.wikimedia.org/r/c/operations/puppet/+/984620/8/modules/role/manifests/wdqs/test.pp#14 is a style violation?
I based this on https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/role/manifests/swift/storage.pp which I presume passed CI?
[16:08:05] I don't want to be too shady, but not all of the swift puppetry is best current practice :-/
[16:08:19] it is a violation :)
[16:08:21] (also, what does the CI say?)
[16:08:32] as that is not a profile
[17:22:30] OK, I think I have the toil class imported correctly... if anyone has time to look it's here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/984620
[19:20:26] inflatador: looks good, +1d
[19:21:17] moritzm excellent, thank you
[22:54:47] I ran the decom cookbook on wdqs100[6-8], but it failed in the middle due to failing to acquire a lock for one of the netbox-change cookbooks (accidentally lost the exact log line). Anyway, diffing https://phabricator.wikimedia.org/T351671#9420135 w/ https://phabricator.wikimedia.org/T351671#9407888, seems like it missed the steps to remove from puppetdb/debmonitor & configure linked switch interfaces
[22:55:49] What's the best way to proceed? Do I need to manually run `sre.network.configure-switch-interfaces` and manually remove from puppetdb/debmonitor or is there a better approach?
[22:57:22] ryankemper: try to rerun the decom
[22:57:50] if it didn't go too far it should be able to do its job
[22:58:19] it's currently not fully idempotent as it should be
[22:58:40] volans: should have mentioned that, it refuses to run due to `spicerack.netbox.NetboxError: Server wdqs1008 does not have any primary IP with a DNS name set.`
[23:04:44] ryankemper: in this case yes you can run the switch cookbook
[23:05:19] as for the rest add me to the task and I'll check it tomorrow
[23:05:41] I'd like to fix the cookbook itself
[23:14:44] volans: excellent. as always, thanks for the help!
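
For reference, the "automatic cleanup" of failed `user@499.service` units mentioned around 15:21 could be approximated by hand roughly as below. This is only a sketch and not the toil class from the puppet repo, whose actual implementation is not shown in this log. The function names `failed_user_units` and `reset_failed_user_units` are hypothetical; the only external commands assumed are `systemctl list-units --state=failed --plain --no-legend` and `systemctl reset-failed`.

```python
import re
import subprocess

# Matches per-user systemd manager units such as "user@499.service".
USER_UNIT = re.compile(r"^user@(\d+)\.service$")


def failed_user_units(listing: str) -> list[str]:
    """Return user@<uid>.service names found in `systemctl list-units` output.

    Expects the --plain --no-legend format, where the unit name is the
    first whitespace-separated field on each line.
    """
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields and USER_UNIT.match(fields[0]):
            units.append(fields[0])
    return units


def reset_failed_user_units(dry_run: bool = True) -> list[str]:
    """List failed user@ units and, unless dry_run is True, clear their failed state."""
    listing = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    units = failed_user_units(listing)
    if units and not dry_run:
        # reset-failed only clears the failed state; it does not restart anything.
        subprocess.run(["systemctl", "reset-failed", *units], check=True)
    return units
```

Calling `reset_failed_user_units()` with the default `dry_run=True` only reports which units would be reset, which seems the safer default on a production host. Note that this clears the alert condition without addressing whatever makes the session fail in the first place.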