[00:37:15] I have a puppet resource of firewall::service on my machine, can see it in the compiler and that it's ensure: present, but on the actual host the rule it adds isn't in the iptables or nftables config, and I already tried both providers.. hhmmm
[00:39:45] the ferm/conf.d snippet doesn't get created even though the resource is there and I switched (back) to provider iptables
[01:57:17] solved, profile::firewall wasn't included, only firewall::service, and the provider was none
[02:15:26] I wonder if this swift metric is actually working
[02:15:27] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22increase%28swift_container_stats_objects_total%7Bclass%3D%5C%22originals%5C%22%7D%5B15m%5D%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from
[02:15:27] %22:%22now-7d%22,%22to%22:%22now%22%7D%7D
[02:15:45] https://w.wiki/9NLe
[02:16:29] The increase is basically ~1000 per hour most of the time, but then there are peaks of supposedly 125M uploads in a 15min period
[02:16:39] Seems to me like some kind of metric rollover is not being picked up correctly
[02:17:01] so prometheus starts counting from 0 or something and it hits the same peak/rollover every time
[02:18:23] https://w.wiki/9NLj
[02:18:32] > (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:18:37] ^ is the reason I'm looking, out of curiosity
[03:22:08] seems like prometheus's fault
[03:23:25] "Any decrease in the value between two consecutive float samples is interpreted as a counter reset."
[03:25:11] it decreases by about 50, which counts as a reset to zero, so you get a data point which is the full value of the counter at that time, i.e. 116M
[03:27:14] Krinkle: if you use deriv() instead of increase() then it's allowed to go negative instead of going crazy
[03:27:38] the quote is from https://prometheus.io/docs/prometheus/latest/querying/functions/#resets
[03:36:59] and it's not a counter -- the number can go down when objects are deleted
[05:51:33] I made https://gerrit.wikimedia.org/r/c/operations/alerts/+/1008590 for this
[07:41:25] jelto, eoghan: when draining ganeti2019/ganeti2020 for today's switch maintenance it complained about etherpad2001 being powered down. the hostname is not referenced in site.pp, is that some leftover of the recent migration?
[07:43:36] I think so, etherpad2002 is the current replica/failover host.
[07:44:26] I'll let jelto say for sure whether it can be deleted/decommed or not though
[07:44:31] ack, thx
[07:59:11] Yep, etherpad2002 is the replica. I guess 2001 is a leftover from previous testing which was not decommed properly, but definitely not from the current migration
[08:07:33] looking at Netbox history it was only created on 2024-02-12, though
[08:19:55] according to SAL it was from a failed attempt to create a codfw etherpad VM: https://sal.toolforge.org/production?p=0&q=etherpad2001&d=. But it's not in use
[08:38:18] ok, I'll ping Daniel to clean it out
[09:08:37] ack thanks!
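A minimal sketch of the two query styles discussed above, using the metric and label from the Grafana link. Because swift_container_stats_objects_total is effectively a gauge that can decrease when objects are deleted, increase() misreads any dip as a counter reset, while deriv() is simply allowed to go negative:

```promql
# increase() treats the ~50-object dip as a counter reset, so the next sample
# contributes the full gauge value (~116M) to the result, producing the spikes:
increase(swift_container_stats_objects_total{class="originals"}[15m])

# deriv() fits a per-second slope over the window instead, so a small decrease
# yields a small negative value rather than a huge spike:
deriv(swift_container_stats_objects_total{class="originals"}[15m])
```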
[10:31:08] TimStarling: I see, it's recorded as a gauge by Swift/statsd, not as some kind of observed file creation counter
[10:31:18] Makes sense
[13:55:21] jelto, eoghan: in my tests of the new apt servers I noticed that our gitlab repo import key expired on March 1st, could you please import the latest one?
[14:18:33] moritzm: like in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008868?
[14:29:35] lgtm, +1d
[14:46:59] FYI: I'm running some tests with the new apt servers, notification mails coming from apt1002 are from the new host and don't impact the current apt repo state
[14:47:52] k
[15:41:15] are jenkins PCC jobs working for you? I don't see them being scheduled at all
[15:41:29] example: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1584/console
[15:43:14] nevermind
[15:43:18] arturo: as almost every day around this time there is a spike
[15:43:18] https://integration.wikimedia.org/zuul/
[15:43:26] check if it's in the queue
[15:44:02] ok, thanks
[15:44:24] a spike in the number of submitted jobs
[15:59:39] Jeff_Green: moving here since -operations is busy now, how's it looking on your end re: alerts?
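On the expired gitlab repo import key mentioned above: a minimal sketch for checking a key file's expiry before and after importing the replacement; the file path is hypothetical and depends on where the key is kept on the apt hosts.

```sh
# Hypothetical path; adjust to wherever the repo import key actually lives.
# --show-keys (GnuPG >= 2.1.23) prints the key's fingerprint and expiry date
# without importing it into any keyring.
gpg --show-keys /srv/keys/gitlab-repo-import.gpg
```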