[08:33:31] FYI, at 10:00 UTC I'm going to upgrade ulsfo switches, that means a hard downtime for that site. It will be depooled ahead of time.
[09:27:13] !incidents
[09:27:13] 3212 (RESOLVED) Host cr3-eqsin (paged) - PING - Packet loss = 100%
[09:27:14] 3210 (RESOLVED) [FIRING:4] ProbeDown (text-https:443 probes/service ops page eqsin prometheus sre)
[09:27:14] 3211 (RESOLVED) [FIRING:1] FrontendUnavailable cache_text (page thanos sre)
[09:27:14] 3209 (RESOLVED) [FIRING:1] FrontendUnavailable (varnish-text page thanos sre)
[10:13:09] getting slightly delayed on the upgrade due to some errors, checking if they can be fixed
[10:20:03] thank you a lot for the updates!
[10:41:59] I'm going to cancel today's maintenance and hopefully push it to tomorrow
[10:42:09] and will follow up with Juniper's support
[10:52:12] actually, I might have found the issue
[10:52:22] doing another try
[10:55:55] assuming service ops permission, I am going to file a ticket regarding the alert "High average POST latency for mw requests on api_appserver in codfw on alert1001". I will add observability too, as probably it is mostly about tuning the triggering metrics
[10:56:26] it never hurts to open a task :)
[10:57:14] I want to justify it to also get attention, as it is not technically wrong, but it has been spamming alerts for a long time now
[10:57:23] jynus: Yes, I've been meaning to get to it but haven't yet
[10:57:30] Please file a task <3
[10:57:34] no prob, just want to start a process here :-D
[10:58:17] We need to alert on latency only if we get more than X rps, mostly
[10:58:52] yes, although I think it may not be that easy, so I will file a ticket so more people can input thoughts
[10:59:11] the right way to "disable it/enable it"
[10:59:39] let me rephrase it: there could be a lot of ways to do it wrongly :-D
[11:03:25] well, different error for my upgrade, I won't push it too much :)
[11:09:26] I created https://phabricator.wikimedia.org/T326544 and included obs in case a similar fix could be applied to other latency-related metrics
[11:10:00] thanks :)
[11:10:45] no super high priority, but after some weeks at least worth tracking for oncall people/general SRE knowledge
[11:22:12] Yes, I completely agree
[18:09:42] jbond: we made the types for UIDs, up to 499 and up from 1000.. but we actually use 9xx UIDs.. do they fall between ranges for users and system users?
[18:13:25] mutante: I'll double check tomorrow; there are a few outliers with users with UIDs lower than 1000 as well, I think
[18:13:44] ACK, thank you:)
[18:14:59] mutante: https://wikitech.wikimedia.org/wiki/UID#UID_ranges should be up to date
[18:16:48] taavi: yea, this does not match https://gerrit.wikimedia.org/r/c/operations/puppet/+/875446
[18:17:31] and the example right below where we used 920
[18:17:45] to reserve a daemon user
[18:25:09] mutante: iirc 100-499 is what can be allocated dynamically when it doesn't matter if they match between hosts
[18:26:58] thx, ACK, we should change the range to 100-999 for the data type
[18:29:00] yep, and maybe also include 60000-64999 depending on its use case
[19:20:24] rzl: wondering if you agree that https://phabricator.wikimedia.org/T308952 is resolved or not quite
[19:22:35] mutante: I think he's OOO for the next week and a half
[19:24:20] cdanis: oh, thank you. ack
[21:06:58] Hello SRE. I was preparing to do the UTC late backport, but #wikimedia-operations is filled with all kinds of alerts. Is anyone aware of any ongoing issue?
[21:08:35] kindrobot: yes, we got paged just now.. also seeing recoveries
[21:09:40] OK. Should I cancel the window, postpone it for a bit, or something else?
[21:10:14] kindrobot: yes, please hold for now
[21:10:16] kindrobot: yeah, wait a bit
[21:12:40] OK thank you, could you ping me if/when it's safe to deploy?
[21:12:47] kindrobot: will do
[21:13:11] Thanks. :)
[21:19:40] kindrobot: you can proceed
[21:20:26] kindrobot: we'll watch
[21:20:57] Great, thanks!
[21:28:29] thanks for checking in <3
[21:30:34] so deployment happened and things still look ok
[21:30:54] incident not active anymore
[21:31:19] thanks mutante