[14:37:35] Hi all! is anyone around to back me up for a high-priority (but not SUPER urgent) deployment for the rest gateway? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1259956
[14:38:14] This would help fix some issues that the wikipedia app is having with hitting rate limits. claime, Raine, are you around?
[14:39:34] duesen: After the service switchover
[14:39:50] oh right, that's today... never mind then
[14:40:04] We should be ok to do it in an hour or so
[14:40:18] It's just services + traffic, not the mediawiki switchover, so it's less involved and risky
[14:40:31] yeah, that's when my block of meetings starts, and then it's dinner time... I may come back to it in the evening, but tomorrow works as well.
[14:40:51] Tomorrow would be morning, as we do the mediawiki switchover in the afternoon :)
[14:41:29] I'll ping you if we're available earlier than in an hour though, we may be able to squeeze it in
[14:53:51] duesen: I'll be around in the evening
[18:07:34] Raine, claime: walked the dog, had a coffee, wrote an email... feeling better now :)
[18:07:59] \o/
[18:08:17] I rebased the docs patch. I'll hit +2 on it, then go and test it together with the patch we merged earlier, on staging.
[18:08:32] Version bump is merged too
[18:08:59] saw it, thank you!
[18:13:51] ...running make check...
[18:17:32] hm, two tests flaked out, running again. the more tests I add, the more can flake out... but as long as they don't fail consistently, it should be fine. rate limiting *is* timing sensitive...
[18:18:47] is it because sometimes you're going to be at the minute boundary?
[18:20:10] yes... I tried to account for that, but... I guess I have to look into that issue again.
[18:21:02] on the second try, two different tests flaked out. I'll do a third run just for the charm, but if nothing is failing consistently, it's just a timing issue
[18:21:15] ack
[18:21:36] I added a lot more tests recently. that increases the chance of flakes.
I should look into counter-measures, though
[18:24:22] two different failures again! I'll take that
[18:25:32] applying to codfw
[18:28:22] Raine: uh, something doesn't look right. The ratelimiter metrics vanished entirely. Traffic still looks good, so it's probably a prometheus issue. But I have no visibility into the effects of my change.
[18:28:37] Raine: is it possible that prometheus is not finding the new pods?
[18:28:53] doesn't seem likely
[18:29:58] the stats vanished retroactively. I can't see older metrics anymore either...
[18:30:15] https://grafana-rw.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway
[18:30:27] the ratelimit section. you need to pick a policy at the top
[18:30:29] grafana logged you out, maybe?
[18:30:51] works for me
[18:32:04] works for eqiad, broken for codfw
[18:32:24] huh right, sorry
[18:32:43] some tag value changed?
[18:32:47] eh, but that's just the DC switchover?
[18:33:03] hm? how so?
[18:33:06] we're in single-DC mode for 1 week
[18:33:51] oh...! ok, I didn't know that
[18:34:03] the traffic stats are still there, but everything routes to eqiad?
[18:34:29] yeah
[18:34:42] so... I guess I can deploy to eqiad then.
[18:34:57] got me scared for a minute :D
[18:35:06] :D
[18:43:16] Raine: looking good, thank you!
[18:48:00] \o/
[19:09:04] apergos: --^
[19:10:47] lol
[19:11:06] finally some time zone overlap and I miss it!
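(Editor's note: the minute-boundary flakiness discussed above can be illustrated with a minimal sketch. This assumes a fixed-window limiter keyed on the current wall-clock minute; the class and test names here are hypothetical and are not the gateway's actual implementation.)

```python
import time


class FixedWindowLimiter:
    """Hypothetical fixed-window rate limiter: allows `limit` requests
    per wall-clock minute; the counter resets at each minute boundary."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window = int(now // 60)          # which minute we are in
        if window != self.window:        # crossed a minute boundary
            self.window = window
            self.count = 0               # ...so the counter resets
        self.count += 1
        return self.count <= self.limit


def flaky_test():
    # Flakes: if the wall clock crosses a minute boundary between the
    # two calls, the counter resets and the second request is allowed.
    limiter = FixedWindowLimiter(limit=1)
    assert limiter.allow()
    assert not limiter.allow()           # may fail near a boundary


def deterministic_test():
    # Counter-measure: inject a fixed clock so the test never
    # straddles a window boundary.
    limiter = FixedWindowLimiter(limit=1)
    t = 120.0                            # exactly the start of a minute
    assert limiter.allow(now=t)
    assert not limiter.allow(now=t + 1)  # same window, deterministic
```

Pinning the clock (or starting each test right after a fresh window begins) is the usual fix for this class of flake, since the test outcome then no longer depends on when CI happens to run it.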