[08:06:09] checking pawsnfsdown
[08:14:20] to me it seems like a false alarm so far, the nfs-server.service is active on paws-nfs-1
[08:28:08] got to go out for an errand, will resume later
[09:37:29] mmhh metricsinfra-prometheus-3 has a full disk, -2 does not
[09:43:09] not sure yet why -3 has pulled in 3.5G blocks while -2 1.9G blocks
[09:43:33] at any rate I'm thinking about switching metricsinfra prometheus to space-based retention rather than time
[09:46:03] morning
[09:47:05] thanks godog for looking at pawsnfsdown, I was wondering what it was about, it moved to "resolved" 5 mins ago
[09:47:43] dhinus: sure np, indeed once I freed up some space on -3 then the alert resolved
[09:48:51] right
[09:49:21] I remember we had some diff between the two prom servers in the past, not sure if it was metricsinfra or another pair
[09:49:50] maybe one of the hosts is failing to fetch some of the metrics/targets?
[09:50:26] I'm not against moving to space-based, but I would also consider expanding the volumes...
[09:51:32] indeed I'm trying to see if there are obvious differences
[09:53:44] which ftr there are not in terms of samples ingested/s afaics https://prometheus.wmcloud.org/graph?g0.expr=sum%20by%20(instance)%20(rate(prometheus_tsdb_head_samples_appended_total%5B5m%5D))&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=
[09:54:04] and now I'm wondering about allowing toolforge.org and wmcloud.org on w.wiki
[09:55:41] godog: that was declined in T231518 :)
[09:55:41] T231518: Add *.wmflabs.org to w.wiki shortener - https://phabricator.wikimedia.org/T231518
[09:56:15] but there's T232240!
[09:56:15] T232240: New service to shorten wmflabs URLs - https://phabricator.wikimedia.org/T232240
[09:56:47] /o\ thank you for fishing out context dhinus
[09:59:49] anyways back to prometheus: given the upcoming holidays I'm for going with space-based retention and revisit in jan
[09:59:56] sgtm
[10:01:31] ok opening a task since it'll require some puppet-fu
[10:08:08] T412926 and T412927
[10:08:09] T412926: metricsinfra-prometheus-3 using more space than metricsinfra-prometheus-2 - https://phabricator.wikimedia.org/T412926
[10:08:09] T412927: Allow prometheus space-based retention on metricsinfra - https://phabricator.wikimedia.org/T412927
[10:27:29] sth like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219132
[10:29:35] modules/profile/manifests/wmcs/metricsinfra/prometheus.pp:73: 'allowmethods',
[10:29:38] Typo found!
[10:29:39] ...
[10:32:28] but anyways, you get the idea
[10:38:24] lgtm, not sure what the typo is about
[10:39:10] 'wmet' in typos file, I remember taavi running into the same
[10:39:46] I get it why, though I'm not sure I want it
[10:41:41] could the typo entry be
[10:41:45] ".wmet" instead?
[10:46:07] good point yeah, I'll submit a patch for that
[11:16:06] does toolforge/alerts.git require any action post-merge ?
[11:16:28] I'm assuming not since it is based on production alerts.git
[11:23:37] no
[11:24:00] ack thank you
[14:03:58] anything to consider when rolling out pdns-recursor updates on cloudservices*?
the update will incur a brief service restart
[14:06:16] I'm not aware of any moritzm
[14:06:22] also checking https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate nothing stands out to me
[14:07:51] ok, I'll start shortly and will wait a bit before upgrading the other node
[14:09:46] ack
[14:11:22] moritzm: yeah, one at the time is the main thing, if you want to be extra nice you can stop bird beforehand to ensure no traffic is hitting that node
[14:13:34] 1005 is upgraded, but for 1006 I can do that; so a) stop bird b) upgrade pdns-rec c) start bird ?
[14:14:36] yep
[14:14:52] ok
[14:15:47] both upgraded now
[14:20:52] thanks!
[15:05:03] I'm looking at the "puppet has failed" alert on cloudcontrol2010-dev, looks like another instance of T373815
[15:05:03] T373815: Puppet fails on cloudcontrol when updating /srv/tofu-infra - https://phabricator.wikimedia.org/T373815
[15:10:26] fixed manually, let me see if I can find a better fix
[15:43:07] this might fix it https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219170
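
Note on the 'wmet' typo check discussed at 10:29 to 10:46: a rough sketch of why a bare 'wmet' entry flags 'allowmethods', and what a '.wmet' entry would do instead. How the typos file is actually matched (fixed string or regex) is not stated in the log, so both readings are shown. That 'wmet' is meant to catch a misspelt '.wmnet' domain is an assumption, and the hostname below is made up for illustration.

```python
import re

word = "allowmethods"

# A bare 'wmet' entry is a substring of the word: allo-wmet-hods.
bare_hit = "wmet" in word

# '.wmet' taken as a literal string: the word contains no dot, so no hit.
literal_hit = ".wmet" in word

# '.wmet' taken as a regex with the dot left unescaped: '.' matches any
# character, so 'owmet' inside 'allowmethods' would still be flagged.
unescaped_hit = re.search(r".wmet", word) is not None

# '\.wmet' with the dot escaped only matches a real dot before 'wmet'.
escaped_hit = re.search(r"\.wmet", word) is not None

# Hypothetical misspelt hostname (assumed intent of the entry).
domain_hit = re.search(r"\.wmet", "host.eqiad.wmet") is not None

print(bare_hit, literal_hit, unescaped_hit, escaped_hit, domain_hit)
```

If the typos file turns out to be regex-based, the dot would need escaping for the ".wmet" suggestion to have the intended effect.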
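
Note on the space-based retention discussed at 09:43 and 09:59: a simplified sketch of how size-based retention behaves, assuming the upstream Prometheus `--storage.tsdb.retention.size` semantics where the oldest persistent blocks are removed first once the total exceeds the limit. The model ignores the WAL and head chunks, which real Prometheus also counts toward the total. The `Block` type, the `blocks_to_drop` helper and the sizes are hypothetical and not taken from the metricsinfra hosts.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    name: str        # hypothetical block label
    min_time: int    # start of the block's time range (any increasing unit)
    size_bytes: int  # on-disk size of the block


def blocks_to_drop(blocks: List[Block], max_bytes: int) -> List[str]:
    """Return names of the oldest blocks to delete until the total fits."""
    remaining = sum(b.size_bytes for b in blocks)
    dropped = []
    # Walk from oldest to newest, dropping blocks while over the limit.
    for block in sorted(blocks, key=lambda b: b.min_time):
        if remaining <= max_bytes:
            break
        dropped.append(block.name)
        remaining -= block.size_bytes
    return dropped


# Illustrative values only: 1600 bytes in total against a 1200 byte limit.
example = [Block("old", 1, 500), Block("mid", 2, 700), Block("new", 3, 400)]
print(blocks_to_drop(example, 1200))  # prints ['old']
```

Under this model a host that accumulates larger blocks simply keeps a shorter time window, instead of filling the disk as -3 did.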