[11:35:48] hi folks!
[11:36:22] there seems to be something weird happening on kafka-logging100[4,5], some ISRs are failing
[11:37:07] the errors are all like
[11:37:08] "org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [Topic authorization failed."
[11:37:27] that seems related to ACLs, but not sure if you added/removed anything recently
[13:05:49] elukey: https://phabricator.wikimedia.org/T334733#8823904 comes to mind
[13:31:10] herron: ah wow that's definitely a possible culprit!
[13:31:26] I was wondering why we had only that rule in kafka logging, and the time matches perfectly
[13:31:42] I just added a comment on that task, yeah timing really is perfect
[13:32:01] ottomata: o// around?
[13:32:42] I think we can just
[13:32:48] kafka acls --delete --allow-principal User:ANONYMOUS --cluster --operation IdempotentWrite
[13:32:54] and test if it solves the issue
[13:33:03] logging is the only one not using ACLs I think
[13:33:11] so probably adding one triggers some defaults
[13:33:41] ok, yeah makes sense sgtm
[13:33:56] the alternative is to come up with ACLs (maybe matching the ones on kafka-main)
[13:34:00] but it could be risky
[13:34:17] I was having a look around to see if other clusters had ACL entries that lined up with the error from kafka-authorizer.log like this
[13:34:19] User:CN=kafka-logging1005.eqiad.wmnet is Denied Operation = ClusterAction from host = 10.64.135.13 on resource = Cluster:kafka-cluster
[13:34:30] but didn't see anything for that operation offhand
[13:34:40] yeah revert seems best for now
[13:35:14] herron: ok to proceed?
[13:35:18] +1
[13:36:42] done
[13:36:45] let's see
[13:36:57] annnd the errors immediately stopped
[13:37:30] yeah replicas catching up already
[13:38:48] good catch elukey thank you, now we know to tail the kafka journal for a bit after acl change
[13:39:18] np! Very weird state of ACLs across clusters though, I didn't know we had some for main for example
[13:40:24] same I'm wondering which of the ACLs in main is addressing this issue there
[14:28:13] elukey: o/
[14:28:35] OHH NOOO
[14:28:42] yeah :(
[14:28:57] i'm sorry i didn't even check that possibility! gahhhh
[14:29:12] it's been so long since we had to change ACLs that i thought how could it hurt adding one for anon
[14:29:28] but of course adding one for anon meant it started thinking about anons!
[14:29:52] I think so yes, I wasn't aware of that behavior either :(
[14:30:17] we should remove it from logging codfw too
[14:30:20] we can probably think about a meaningful set of ACLs to add for a cluster, kafka main could be used as baseline (didn't know it had ACLs)
[14:30:30] ah right yes, doing it
[14:30:42] i mean, if we had something like https://phabricator.wikimedia.org/T276088 config management for kafka stuff
[14:31:25] wow we don't have the problem in logging-codfw
[14:32:04] I am wondering if it triggers when one creates topics or similar
[14:33:00] like if the anon perms get applied? yea dunno
[14:33:20] I created test-elukey now, let's see
[14:33:31] k will you delete acl when you ready?
[14:33:33] or should I?
[14:33:57] I will no problem, but I can't repro the issue
[14:34:09] mmm maybe I need to produce to it
[14:34:55] weird
[14:35:31] nope, all good
[14:43:16] strange
[14:46:14] before removing it I'd love to repro the issue, not sure how to trigger it though
[15:18:45] herron: lemme know what you prefer, shall we simply remove from logging-codfw too and figure out what ACLs to rollout next?
[15:22:24] elukey: sgtm, strange indeed that it isn't happening in codfw
[15:22:36] but good I think to bring back to known good state
[15:23:04] done!
[15:23:09] ty ty
[18:10:20] herron: hey, do these grr preview dashboards expire automatically? or do I need to be conscientious about cleaning up after myself with the delete links?
[18:46:51] there is a flag to expire them, but it's nothing much to worry about, they can be cleaned up via the grafana ui too
[19:13:20] ah okay, cool
[20:00:24] hrm, I was wrong about that recording rule -- it turns out we were dividing a percentage by a percentage, so the existing math is correct -- it's just that one of the percentages is named "ratio," a little misleadingly
[20:01:10] so I do want to get them both to unit scale but (a) we'll have to figure out what to do about the naming, and (b) there's no rush to change it, since it isn't wrong
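
Note on the ACL incident earlier in this log (11:36 to 15:23): the guess that "adding one triggers some defaults" is consistent with how Kafka's ACL authorizer is documented to behave when the broker setting allow.everyone.if.no.acl.found is true. A resource with no ACLs is open to everyone, but once any ACL exists on that resource, only principals matching an Allow entry get through. The log does not confirm that kafka-logging runs with that setting, so treat it as an assumption. The Python below is a deliberately simplified model, not Kafka's implementation: the names Acl and is_authorized are hypothetical, and it ignores super users, wildcard principals, host matching, and implied operations. It also does not explain why logging-codfw was unaffected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    """One ACL entry (hypothetical, simplified shape)."""
    principal: str
    operation: str
    resource: str
    permission: str = "Allow"  # "Allow" or "Deny"


def is_authorized(acls, principal, operation, resource,
                  allow_everyone_if_no_acl_found=True):
    """Simplified authorization check for one request.

    Assumption: mirrors the documented effect of
    allow.everyone.if.no.acl.found, nothing more.
    """
    on_resource = [a for a in acls if a.resource == resource]

    # No ACLs at all on this resource: fall back to the broker default.
    if not on_resource:
        return allow_everyone_if_no_acl_found

    matching = [a for a in on_resource
                if a.principal == principal and a.operation == operation]

    # An explicit Deny wins over any Allow.
    if any(a.permission == "Deny" for a in matching):
        return False

    # ACLs exist on the resource, so a matching Allow is now required.
    return any(a.permission == "Allow" for a in matching)


broker = "User:CN=kafka-logging1005.eqiad.wmnet"
cluster = "Cluster:kafka-cluster"

# Before the change: no ACLs on the cluster resource, brokers are allowed.
before = is_authorized([], broker, "ClusterAction", cluster)

# After adding a single ACL for ANONYMOUS on the cluster resource,
# the broker principal no longer matches anything and is denied.
acls = [Acl("User:ANONYMOUS", "IdempotentWrite", cluster)]
after = is_authorized(acls, broker, "ClusterAction", cluster)

print(before, after)
```

Under this model, deleting the lone ANONYMOUS entry leaves the cluster resource with no ACLs again, which matches the errors stopping right after the revert at 13:36.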