[03:49:28] !oncall-now
[03:49:29] You're not allowed to perform this action.
[03:49:40] !oncall-now sre
[03:49:40] You're not allowed to perform this action.
[07:22:46] !incidents
[07:22:46] No incidents occurred in the past 24 hours for team SRE
[07:22:50] Lovely
[07:55:59] Krinkle: yeah, can't say for sure, but either removing the hotfix or changing to openlog('php-fpm') seems okay to me from a quick look
[08:04:45] <_joe_> lmata: you don't have permissions, I need to add you to the access list
[08:05:04] <_joe_> godog: or using PHP_VERSION
[08:08:38] even better
[08:09:10] <_joe_> volans: CI for the cookbooks fails again *surprise surprise*
[08:09:16] <_joe_> because of pydocstyle
[08:09:37] <_joe_> can I suggest we freeze all dependencies for the linters for a year?
[08:39:58] _joe_: fixed. We can try to freeze prospector but it will break in different ways (such as pip backtracking) as their deps tree is quite complex in terms of allowed versions
[08:54:09] <_joe_> yeah :/
[08:55:17] <_joe_> volans: the only way to do it would be to bake the tox env into the docker image
[08:55:57] <_joe_> or maybe to download the dependencies once, then freeze them somehow via frozen-requirements.txt
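A minimal sketch of the "download the dependencies once, then freeze them" idea discussed above, assuming a tox-managed linter environment; the env name "linters" is hypothetical, and frozen-requirements.txt is simply the file name floated in the conversation:

    # Build the tox env once without running tests, then record the resolved versions
    tox -e linters --notest
    .tox/linters/bin/pip freeze > frozen-requirements.txt
    # Later CI runs install the frozen set instead of re-resolving (and backtracking)
    pip install -r frozen-requirements.txt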
[10:17:19] _joe_, volans, are any of you on Telecom Italia? https://phabricator.wikimedia.org/T325965
[10:17:34] not me
[10:17:36] <_joe_> XioNoX: I have the mobile connection with them
[10:18:30] I'm checking NEL but could you let me know if you can reproduce the issue?
[10:18:34] <_joe_> the mobile app has always worked, let me try to tether to my laptop
[10:19:29] <_joe_> I can't reproduce at all
[10:20:08] ok, thanks!
[10:20:33] NEL shows an increased error rate on the 26th/27th
[10:20:46] but seems better now
[10:21:14] <_joe_> yeah I saw the task on the 28th I think, and I did check the mobile app
[10:21:21] <_joe_> and it was working from telecom
[10:26:04] replied on the task
[12:15:01] Thanks _joe_
[13:26:11] _joe_: note that the php version is already logged separately by both Mw messages and afaik php-wmerror as well, so not sure we need to make it a dynamic string. Easy to do though if it's useful for something!
[13:26:21] And yeah, php8.0 fixes it, it seems
[13:26:40] <_joe_> Krinkle: not really indeed
[17:03:04] Emperor: what changed on swift @ codfw around 13:00 UTC today?
[17:04:05] ats-be is reporting an increased number of 502s from swift in all DCs that reach codfw (codfw + ulsfo + eqsin), see eqsin as an example: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=upload&var-origin=swift.discovery.wmnet&from=now-24h&to=now&viewPanel=12
[17:19:02] <_joe_> vgutierrez: I'd bet it's the usual frontend issue? cc Emperor
[17:19:03] hi, related? - a user is reporting a repeatable `An unknown error occurred in storage backend "local-swift-codfw"` when attempting to upload a file. Don't currently have any further information
[17:19:24] <_joe_> yeah, I would say we probably need to roll-restart the swift proxies in codfw
[17:19:33] <_joe_> cwhite / sukhe ^^
[17:19:43] hi
[17:19:51] ok
[17:20:28] <_joe_> check https://wikitech.wikimedia.org/wiki/Incidents/2022-11-04_Swift_issues if things look similar
[17:23:47] <_joe_> hm, the numbers are much smaller than on that occasion, it seems
[17:25:38] <_joe_> but that's probably a traffic thing
[17:26:17] <_joe_> sukhe: the old incident doc suggests sudo cumin -b 1 -s 5 O:swift::proxy 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'
[17:26:20] yeah, smaller bumps compared to 2022-11-04
[17:26:25] I am caught up and ready to proceed
[17:26:37] <_joe_> but take a look at the swift proxy logs
[17:27:03] _joe_: thanks, also adding to the service restarts page (the 2022-11-04 doc talks about the same thing)
[17:27:33] Emperor: ^
[17:30:03] ok then, given that they don't seem to be around, I am going to run the above for codfw
[17:33:44] There was a spike in latency on the ms-fe boxes around 1300. It feels similar to prior incidents that restarting the proxies resolved. It may be too small to register on the swift graphs?
[17:34:00] and done. thanks _joe_! Emperor: please note, ran the above but for A:codfw
[17:35:00] <_joe_> errors are going down
[17:37:11] yeah, looks like it did help
[17:37:17] the timing matches
[17:41:01] sukhe: thanks (sorry, I'd finished for today)
[17:41:22] np, I figured, but thought you should be aware
[17:42:10] We have new swift front-ends in the being-ordered process
[17:49:37] <_joe_> should we add an alert on the backend errors from ATS to swift? This error was user-visible
[18:23:17] (incidentally, the user who mentioned the upload error has now managed to successfully upload their file and passes on their thanks)
[19:41:06] is CI working for anyone else? I submitted some puppet patches an hour ago, still no test results.
[19:41:21] andrewbogott: 14:20:25 < dduvall> well, the good news is that zuul/gearman are back to processing jobs. bad news is that everything that was queued must be resubmitted
[19:41:35] thanks sukhe !
[19:44:33] andrewbogott: long chains of patches tend to get zuul stuck for whatever reason, and I suspect that happened here too
[19:44:55] probably my fault then!
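For reference, a plausible codfw-scoped form of the roll-restart run at 17:30–17:34, assuming Cumin's A:codfw datacenter alias is combined with the role query; the exact query sukhe used is not shown in the log:

    sudo cumin -b 1 -s 5 'O:swift::proxy and A:codfw' 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'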