[07:00:21] it seems we lost the IRC notification bot?
[07:30:37] o/
[07:30:43] do you mean sirenbot?
[07:32:01] not sure which of the bots, but at least the bot used for sending notifications from cookbooks is inactive
[07:32:14] ah okok
[07:33:11] I think there was a bit of a netsplit at the weekend, bots probably need a kick
[07:33:21] same for sirenbot I think
[07:33:26] the bot should be logmsgbot
[07:34:02] !incidents
[07:34:02] 4007 (RESOLVED) db1137 (paged)/MariaDB Replica SQL: x1 (paged)
[07:34:03] 4006 (RESOLVED) db1137 (paged)/mysqld processes (paged)
[07:34:03] 4005 (RESOLVED) db1137 (paged)/MariaDB Replica IO: x1 (paged)
[07:34:03] 4004 (RESOLVED) Host db1137 (paged) - PING - Packet loss = 100%
[07:34:09] OK, I've kicked sirenbot
[07:34:26] Emperor: nice! Where does it run? I didn't see any info in https://wikitech.wikimedia.org/wiki/Vopsbot
[07:34:33] (maybe I am missing some docs)
[07:34:48] I'll do logmsgbot and then answer that
[07:35:23] ack thanks
[07:36:06] ahh sirenbot is on the alert nodes
[07:36:42] https://wikitech.wikimedia.org/wiki/Logmsgbot explains the latter; as to the former, I think I went grobbling round with git grep inside puppet to work out where to find it the last time it was sad
[07:37:10] let me update the Vopsbot page
[07:38:03] Emperor: already done it!
[07:38:42] I think it's more "log onto the active alert node" than the alert nodes plural
[07:39:57] sure fixing
[07:40:35] already did :)
[07:40:50] super thanks
[10:46:39] hnowlan: Hey, could you take care of deploying https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/949487 ?
[10:47:00] zabe: yep, sure
[10:47:16] Thanks!
[12:01:18] the above caused some issues with restbase, would appreciate a hand if anyone is about
[12:01:35] what kind of issues?
[12:01:58] ah, I see in -operations
[12:02:19] I also see recoveries?
[12:02:39] it appears to be flapping a bit
[12:02:45] transient or did you also deploy the revert?
[12:02:53] I reverted yeah
[12:03:06] ah, that would explain the recoveries then
[12:03:32] I'm seeing new failures since the revert
[12:03:50] restbase2018 for example just flapped again
[12:04:31] it's also not clear to me how that patch may have caused those issues
[12:04:37] it's just adding 3 wikis ...
[12:04:38] yeah, I don't understand it either
[12:04:52] there's a lot of noise in the logs because restbase1030 is out of action
[12:04:59] but even with those filtered it just seems like random timeouts
[12:09:55] I wonder if this is something to do with the rate limiter panicking https://logstash.wikimedia.org/goto/0507daac4fbb9318e27ec55ebf928aec
[12:09:59] hmmm, from the deploy hosts, swagger always returns All endpoints are healthy, which at least means we aren't serving that many errors
[12:10:08] restbase1030 is listed as a seed for it
[12:10:33] maybe taking it down while everything was running was fine but then redeploying made it fail shut or something
[12:10:44] which is not great but that rate limiter has been broken forever
[12:11:19] appservers are at least seeing increased errors to restbase https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=All
[12:11:36] I was about to say that the rate limiter hasn't worked ok in a looong while
[12:11:38] if ever
[12:12:09] it seems like a bit much for it to break this little bit more at this late juncture
[12:12:22] true
[12:17:00] the warn/table/cassandra/driver errors also increased, now back at a lower level but 20% more than baseline. However, apparently heading back to baseline
[12:17:34] so, the entire set of extra events is indeed the rate limiter
[12:17:54] I am inclined to say, it's definitely adjacent to the root cause, if not it.
[12:18:29] now if only someone understood the rate limiter >_>
[12:18:46] do we need to?
[12:19:03] if restbase1030 is a seed and has issues, maybe just removing it from the seed list will suffice?
[12:19:55] however, the gossiping protocol.. should account for that
[12:20:10] ah dammit, no wait, the rate limiter is implemented using a UDP based DHT
[12:20:18] nothing to do with cassandra seeds
[12:20:30] in *theory* those hosts can't even talk to each other because that UDP port is closed
[12:20:42] https://phabricator.wikimedia.org/T249699
[12:20:58] but that's no help now. Annoyingly the seed list is just all the cassandra hosts for a DC
[12:22:37] I was wrong, that list is only used by the ratelimiter https://gerrit.wikimedia.org/r/c/operations/puppet/+/954672
[12:23:01] main thing bugging me is why hasn't this happened before? we've had lots of downed hosts for periods of time
[12:23:02] can't hurt to try
[12:28:16] is it me or is it recovering?
[12:28:55] also, how is codfw affected?
[12:29:18] I mean, the rate limiting would explain eqiad, but why do we see an error rate in codfw too?
[12:29:36] take "explain" with many many many grains of salt
[12:29:45] yeah that makes no sense
[12:30:15] I keep getting faked out that it's recovering
[12:30:24] yeah, me too
[12:32:06] almost all of the ratelimit events are for wikifeeds https://logstash.wikimedia.org/goto/ece4e5592f3923bab9870198aad15f4c
[12:33:14] I don't think restbase restarts on the ratelimit seed change, but tbh given that it's happening in codfw too that change won't have much impact
[12:35:18] I can't think of anything else than "have you tried turning it off and on again?" right now
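
For context on the limiter being debugged above: per the discussion, it is implemented as a UDP-based DHT, and with that UDP port closed between hosts each node ends up enforcing limits against only its own local counts. The snippet below is a minimal illustrative sketch of what per-node local limiting looks like (a plain token bucket); it is not RESTBase's actual implementation, and the class name, rate, and burst values are invented for the example.

    # Illustrative per-node token bucket; not RESTBase's actual limiter, which the
    # discussion above describes as a UDP-based DHT. Names and limits are made up.
    import time


    class LocalTokenBucket:
        """Rate limiter that only sees this node's own traffic."""

        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec      # tokens refilled per second
            self.capacity = burst         # maximum bucket size
            self.tokens = float(burst)
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            """Return True if a request may proceed on this node."""
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


    # With N nodes each enforcing the limit independently, the effective
    # cluster-wide limit is roughly N times the per-node limit, and it shifts
    # whenever traffic is rebalanced (e.g. when a host drops out or is redeployed).
    limiter = LocalTokenBucket(rate_per_sec=100, burst=200)
    if not limiter.allow():
        print("429: locally rate limited")
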
[12:36:11] it's pretty clearly related to ratelimiting, which isn't working cross cluster for sure
[12:36:23] so apparently each node is rate limiting locally anyway
[12:37:05] one more thing to try https://gerrit.wikimedia.org/r/c/operations/puppet/+/954676
[12:37:25] restbase1030 is listed as a seed for table specs - dunno why that would cause a failure but it's not good either way
[12:37:48] +1ed
[12:37:51] ty
[12:37:52] i'll try this then a roll restart
[12:38:59] external traffic patterns are at baseline btw. Elevated at this time of the day ofc, but not more than other days of the week
[12:49:51] no joy on the restarts
[12:50:32] rate limit messages are down at least but nothing improved
[12:54:11] I am not sure nothing improved. I see only 3 alerts in alerts.w.o now
[12:54:38] there is a huge spike in https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=All&from=now-1h&to=now&viewPanel=16
[12:54:46] but it is also subsiding?
[12:55:07] codfw unfortunately paints a very weird picture
[12:55:12] https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=All&from=now-1h&to=now&viewPanel=16
[12:55:16] ok... what on earth?
[12:56:35] ah, dropping again
[12:57:03] and now all restbase alerts resolved
[12:57:45] sigh
[12:57:50] so... now what? what fixed it? the removal of restbase1030 from the ratelimiter? Doubtful. The removal from seeds? 🤷. Just the restarts? 🙈
[12:59:30] removal from seeds wouldn't be it either because codfw uses local seeds only
[12:59:44] I suspect those config changes wouldn't take effect until a restart anyway
[13:00:25] quite possibly
[13:00:32] logs from this whole thing on rb are largely useless
[13:00:49] wonder if it'd be useful to disable the ratelimiter in general
[13:03:22] all kinds of weird stuff here (particularly citoid related) https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=All&from=1693831210164&to=1693832312767
[13:04:09] (not that it would be a cause here)
[13:05:29] I'll keep digging after lunch
[13:07:57] hnowlan: I wonder the same thing, but also, is it worth it?
[13:38:43] I would like to run the DC switchover cookbook live test tomorrow at 14:00 UTC. This should be non-disruptive, but I don't want to run it at the same time as something else, so I'm asking. OK with y'all?
[13:54:22] kamila_: Don't forget to put it in the Deployments calendar (https://wikitech.wikimedia.org/wiki/Deployments) so that deployers are aware. As you point out it should be non-disruptive, but it won't hurt if they know.
[13:54:45] can I assume they were already run in dry-run mode?
[13:55:19] sorry if I missed it if it was already mentioned ;)
[13:56:17] volans: no, I haven't done that yet, planning to do that today
[13:56:24] akosiaris: yup, thanks!
[14:00:58] kamila_: ack, I was mentioning it because if you find something wrong you might need a bit of time to fix them, just that
[14:01:22] fair point, will move if I need to, thanks volans
[14:01:37] (yes I'm not doing fantastic wrt time management '^^)
[14:01:46] (but that's exactly why I want to run the live test ASAP)
[14:08:51] :D
[14:08:56] no prob
[14:14:48] volans: it's no prob until I actually run it and give you work, right? :D
[14:14:58] why me? :D
[14:15:01] * volans hides
[14:15:17] XD
[14:15:35] I don't own all the cookbooks ;)
[14:16:08] but I will likely be pestering you with questions even if I'm nominally the one fixing stuff :D
[14:16:50] absolutely, I was joking, feel free to ping anytime. I'm also happy to attach to the tmux or tail the logs for the live test if needed
[14:17:40] thanks ^_^
[14:20:24] I just deployed something something service.yaml to cumin1001
[14:20:34] I needed to run puppet for backup reasons
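
On the dry-run question raised in the switchover discussion above: a dry run (or live test) is expected to be non-disruptive because the tooling only reports the actions it would take instead of executing them. The sketch below illustrates that general pattern only; it is not the actual spicerack/cookbook API, and the step names and flag handling are placeholders.

    # Generic sketch of the dry-run pattern discussed above. NOT the real
    # spicerack/cookbook API; the steps listed are placeholders for illustration.
    import argparse
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("switchover-sketch")


    def run(dry_run: bool) -> None:
        steps = [
            "set origin DC read-only",
            "wait for replication to catch up",
            "flip the active DC in configuration",
            "set the new active DC read-write",
        ]
        for step in steps:
            if dry_run:
                # Log only; nothing is changed, which is what makes it safe to run.
                logger.info("DRY-RUN: would %s", step)
            else:
                logger.info("executing: %s", step)
                # ... actual execution would happen here ...


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--dry-run", action="store_true")
        args = parser.parse_args()
        run(dry_run=args.dry_run)
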