[00:37:15] I have a puppet resource of firewall::service on my machine, can see it in the compiler and that it's ensure: present, but on the actual host the rule it adds isn't in the iptables or nftables config, and I already tried both providers.. hhmmm
[00:39:45] the ferm/conf.d snippet doesn't get created even though the resource is there and I switched (back) to provider iptables
[01:57:17] solved, profile::firewall wasn't included, only firewall::service, and the provider was none
[02:15:26] I wonder if this swift metric is actually working
[02:15:27] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22increase%28swift_container_stats_objects_total%7Bclass%3D%5C%22originals%5C%22%7D%5B15m%5D%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from
[02:15:27] %22:%22now-7d%22,%22to%22:%22now%22%7D%7D
[02:15:45] https://w.wiki/9NLe
[02:16:29] The increase is basically ~1000 per hour most of the time, but then there are peaks of supposedly 125M uploads in a 15min period
[02:16:39] Seems to me like some kind of metric rollover is not being picked up correctly
[02:17:01] so prometheus starts counting from 0 or something and it hits the same peak/rollover every time
[02:18:23] https://w.wiki/9NLj
[02:18:32] > (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:18:37] ^ is the reason I'm looking, out of curiosity
[03:22:08] seems like prometheus's fault
[03:23:25] "Any decrease in the value between two consecutive float samples is interpreted as a counter reset."
[03:25:11] it decreases by about 50, which counts as a reset to zero, so you get a data point which is the full value of the counter at that time, i.e. 116M
[03:27:14] Krinkle: if you use deriv() instead of increase() then it's allowed to go negative instead of going crazy
[03:27:38] the quote is from https://prometheus.io/docs/prometheus/latest/querying/functions/#resets
[03:36:59] and it's not a counter -- the number can go down when objects are deleted
[05:51:33] I made https://gerrit.wikimedia.org/r/c/operations/alerts/+/1008590 for this
[07:41:25] jelto, eoghan: when draining ganeti2019/ganeti2020 for today's switch maintenance it complained about etherpad2001 being powered down. the hostname is not referenced in site.pp, is that some leftover of the recent migration?
[07:43:36] I think so, etherpad2002 is the current replica/failover host.
[07:44:26] I'll let jelto say for sure whether it can be deleted/decommed or not though
[07:44:31] ack, thx
[07:59:11] Yep, etherpad2002 is the replica. I guess 2001 is a leftover from previous testing which was not decommed properly, but definitely not from the current migration
[08:07:33] looking at Netbox history it was only created on 2024-02-12, though
[08:19:55] according to SAL it was from a failed attempt to create a codfw etherpad VM: https://sal.toolforge.org/production?p=0&q=etherpad2001&d=. But it's not in use
[08:38:18] ok, I'll ping Daniel to clean it out
[09:08:37] ack thanks!
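A minimal sketch of the two query styles discussed above, using the metric and label from the Grafana link. Because swift_container_stats_objects_total is effectively a gauge that can decrease when objects are deleted, increase() misreads any dip as a counter reset, while deriv() is simply allowed to go negative:

```promql
# increase() treats the ~50-object dip as a counter reset, so the next sample
# contributes the full gauge value (~116M) to the result, producing the spikes:
increase(swift_container_stats_objects_total{class="originals"}[15m])

# deriv() fits a per-second slope over the window instead, so a small decrease
# yields a small negative value rather than a huge spike:
deriv(swift_container_stats_objects_total{class="originals"}[15m])
```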
[10:31:08] TimStarling: I see, it's recorded as a gauge by Swift/statsd, not as some kind of observed file creation counter
[10:31:18] Makes sense
[13:55:21] jelto, eoghan: in my tests of the new apt servers I noticed that our gitlab repo import key expired on March 1st, could you please import the latest one?
[14:18:33] moritzm: like in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008868?
[14:29:35] lgtm, +1d
[14:46:59] FYI: I'm running some tests with the new apt servers, notification mails coming from apt1002 are from the new host and don't impact the current apt repo state
[14:47:52] k
[15:41:15] are jenkins PCC jobs working for you? I don't see them being scheduled at all
[15:41:29] example: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1584/console
[15:43:14] nevermind
[15:43:18] arturo: as almost every day around this time there is a spike
[15:43:18] https://integration.wikimedia.org/zuul/
[15:43:26] check if it's in the queue
[15:44:02] ok, thanks
[15:44:24] a spike in the number of submitted jobs
[15:59:39] Jeff_Green: moving here since -operations is busy now, how's it looking on your end re: alerts?
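On the expired gitlab repo import key mentioned above: a minimal sketch for checking a key file's expiry before and after importing the replacement; the file path is hypothetical and depends on where the key is kept on the apt hosts.

```sh
# Hypothetical path; adjust to wherever the repo import key actually lives.
# --show-keys (GnuPG >= 2.1.23) prints the key's fingerprint and expiry date
# without importing it into any keyring.
gpg --show-keys /srv/keys/gitlab-repo-import.gpg
```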