[03:49:28] !oncall-now
[03:49:29] You're not allowed to perform this action.
[03:49:40] !oncall-now sre
[03:49:40] You're not allowed to perform this action.
[07:22:46] !incidents
[07:22:46] No incidents occurred in the past 24 hours for team SRE
[07:22:50] Lovely
[07:55:59] Krinkle: yeah, can't say for sure, but either removing the hotfix or changing to openlog('php-fpm') seems okay to me from a quick look
[08:04:45] <_joe_> lmata: you don't have permissions, I need to add you to the access list
[08:05:04] <_joe_> godog: or using PHP_VERSION
[08:08:38] even better
[08:09:10] <_joe_> volans: CI for the cookbooks fails again *surprise surprise*
[08:09:16] <_joe_> because of pydocstyle
[08:09:37] <_joe_> can I suggest we freeze all dependencies for the linters for a year?
[08:39:58] _joe_: fixed. We can try to freeze prospector but it will break in different ways (such as pip backtracking) as their deps tree is quite complex in terms of allowed versions
[08:54:09] <_joe_> yeah :/
[08:55:17] <_joe_> volans: the only way to do it would be to bake the tox env into the docker image
[08:55:57] <_joe_> or maybe to download the dependencies once, then freeze them somehow via frozen-requirements.txt
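A minimal sketch of the "download the dependencies once, then freeze them" idea discussed above, assuming a tox-managed linter environment; the env name "linters" is hypothetical, and frozen-requirements.txt is simply the file name floated in the conversation:

    # Build the tox env once without running tests, then record the resolved versions
    tox -e linters --notest
    .tox/linters/bin/pip freeze > frozen-requirements.txt
    # Later CI runs install the frozen set instead of re-resolving (and backtracking)
    pip install -r frozen-requirements.txt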
[10:17:19] _joe_, volans, are any of you on Telecom Italia? https://phabricator.wikimedia.org/T325965
[10:17:34] not me
[10:17:36] <_joe_> XioNoX: I have the mobile connection with them
[10:18:30] I'm checking NEL but could you let me know if you can reproduce the issue?
[10:18:34] <_joe_> the mobile app has always worked, let me try to tether to my laptop
[10:19:29] <_joe_> I can't reproduce at all
[10:20:08] ok, thanks!
[10:20:33] NEL shows an increased error rate on the 26th/27th
[10:20:46] but seems better now
[10:21:14] <_joe_> yeah I saw the task on the 28th I think, and I did check the mobile app
[10:21:21] <_joe_> and it was working from telecom
[10:26:04] replied on the task
[12:15:01] Thanks _joe_
[13:26:11] _joe_: note that the php version is already logged separately by both Mw messages and afaik php-wmerror as well, so not sure we need to make it a dynamic string. Easy to do though if it's useful for something!
[13:26:21] And yeah, php8.0 fixes it, it seems
[13:26:40] <_joe_> Krinkle: not really indeed
[17:03:04] Emperor: what changed on swift @ codfw around 13:00 UTC today?
[17:04:05] ats-be is reporting an increased number of 502s from swift in all DCs that reach codfw (codfw + ulsfo + eqsin), see eqsin as an example: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=upload&var-origin=swift.discovery.wmnet&from=now-24h&to=now&viewPanel=12
[17:19:02] <_joe_> vgutierrez: I'd bet it's the usual frontend issue? cc Emperor
[17:19:03] hi, related? - a user is reporting a repeatable `An unknown error occurred in storage backend "local-swift-codfw"` when attempting to upload a file. Don't currently have any further information
[17:19:24] <_joe_> yeah, I would say we probably need to roll-restart the swift proxies in codfw
[17:19:33] <_joe_> cwhite / sukhe ^^
[17:19:43] hi
[17:19:51] ok
[17:20:28] <_joe_> check https://wikitech.wikimedia.org/wiki/Incidents/2022-11-04_Swift_issues if things look similar
[17:23:47] <_joe_> hm, the numbers are much smaller than on that occasion, it seems
[17:25:38] <_joe_> but that's probably a traffic thing
[17:26:17] <_joe_> sukhe: the old incident doc suggests sudo cumin -b 1 -s 5 O:swift::proxy 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'
[17:26:20] yeah, smaller bumps compared to 2022-11-04
[17:26:25] I am caught up and ready to proceed
[17:26:37] <_joe_> but take a look at the swift proxy logs
[17:27:03] _joe_: thanks, also adding to the service restarts page (the 2022-11-04 doc talks about the same thing)
[17:27:33] Emperor: ^
[17:30:03] ok then, given that they don't seem to be around, I am going to run the above for codfw
[17:33:44] There was a spike in latency on the ms-fe boxes around 1300. It feels similar to prior incidents that restarting the proxies resolved. It may be too small to register on the swift graphs?
[17:34:00] and done. thanks _joe_! Emperor: please note, ran the above but for A:codfw
[17:35:00] <_joe_> errors are going down
[17:37:11] yeah, looks like it did help
[17:37:17] the timing matches
[17:41:01] sukhe: thanks (sorry, I'd finished for today)
[17:41:22] np, I figured, but thought you should be aware
[17:42:10] We have new swift front-ends in the being-ordered process
[17:49:37] <_joe_> should we add an alert on the backend errors from ATS to swift? This error was user-visible
[18:23:17] (incidentally, the user who mentioned the upload error has now managed to successfully upload their file and passes on their thanks)
[19:41:06] is CI working for anyone else? I submitted some puppet patches an hour ago, still no test results.
[19:41:21] andrewbogott: 14:20:25 < dduvall> well, the good news is that zuul/gearman are back to processing jobs. bad news is that everything that was queued must be resubmitted
[19:41:35] thanks sukhe !
[19:44:33] andrewbogott: long chains of patches tend to get zuul stuck for whatever reason, and I suspect that happened here too
[19:44:55] probably my fault then!
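For reference, a plausible codfw-scoped form of the roll-restart run at 17:30–17:34, assuming Cumin's A:codfw datacenter alias is combined with the role query; the exact query sukhe used is not shown in the log:

    sudo cumin -b 1 -s 5 'O:swift::proxy and A:codfw' 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'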