[18:01:24] TheresNoTime: php fpm busy has been flapping all weekend
[18:01:29] Just not enough to page
[18:01:47] it paged
[18:02:00] as in, we got an alert, not sure if it's what you were talking about
[18:02:08] high latency
[18:02:13] here, can I help somehow?
[18:02:15] jynus: yes only just paged now
[18:02:38] The high latency & busy alerts have been on and off all weekend. Just never paging until a few minutes ago
[18:02:42] the world cup final just finished......maybe related?
[18:02:47] for sure
[18:02:56] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=All&var-cluster=api_appserver&var-method=GET&var-code=200&var-php_version=All
[18:03:33] https://grafana.wikimedia.org/d/3u6RLsL7k/status-page?orgId=1&viewPanel=2
[18:03:38] it is eqiad
[18:04:07] https://phabricator.wikimedia.org/T325477#8476614 got created
[18:04:14] if available, marostegui can you check dbs while I check anything else
[18:04:22] here as well
[18:04:31] here, getting to a laptop
[18:04:50] yep
[18:04:52] on it
[18:05:11] I am going to notify editing issues so far
[18:05:16] even if it may be wider
[18:05:36] hello
[18:05:51] also here
[18:05:53] who wants to take IC?
[18:05:58] https://www.wikimediastatus.net/
[18:06:04] I will take it now that there are more people
[18:06:09] creating doc
[18:07:02] DBs seem to be fine for now
[18:07:15] I am checking a bit more
[18:07:21] claime: ^^
[18:07:35] Ok
[18:07:39] the 50x logstash dashboard seems to indicate that only API is affected?
[18:07:45] there was a recovery
[18:08:03] Yeah, it seems mostly API
[18:08:07] taavi: sadly the panel only has editing or reading granularity, editing is the best fit
[18:08:07] pages acked
[18:08:12] thanks claime
[18:08:18] https://docs.google.com/document/d/1zhFUcPxxp7C5uHERgKFH6hvVhUy1g4S0jM2lGOXgwSM/edit
[18:08:33] someone please check current health
[18:08:40] while others check app servers and request logs
[18:08:45] We are recovering
[18:09:03] jynus: I know, that was more for 'figure out what's wrong so it can be fixed', not 'make the status page more accurate'
[18:09:37] edits still not recovering
[18:10:14] We're starved of fpm workers
[18:10:25] check eventgate analytics
[18:10:26] DBs mostly loaded in s7 (where Spanish Wikipedia lives)
[18:10:40] makes sense
[18:10:55] but are they saturated or just higher than usual?
[18:11:00] Higher than usual
[18:11:04] They are coping fine
[18:11:08] No lag and no errors for now
[18:11:24] The obvious reasoning makes sense - but ofc it shouldn't happen
[18:12:07] I am going to move to monitoring
[18:12:15] as I think there are no active incidents?
[18:12:15] Writes still not recovered https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&from=now-1h&to=now&viewPanel=10
[18:12:22] ah, then not doing that
[18:13:06] yeah, things are flapping, not as bad, but we are not ok yet based on graphs
[18:13:10] Seeing the same pattern as the incident the other day with eventgate analytics
[18:13:24] claime: what graph are you looking at?
[18:13:26] still high latencies
[18:13:26] This is looking better https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&from=now-1h&to=now&viewPanel=8
[18:13:45] jhathaway: the wrong one.
[18:13:04] :), just trying to understand the edit issue
[18:14:09] regular app servers (non-api) look ok
[18:14:20] jhathaway: not edit itself, api app server issue
[18:14:30] Yeah sorry I jumped the gun on eventgate
[18:14:30] jynus: thanks
[18:14:38] although we have low edits
[18:14:46] probably those coming from bots?
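A minimal sketch of the kind of check behind the flapping "php fpm busy" alerts above: pull the busy/total worker ratio per API app server from Prometheus and flag hosts near saturation. The Prometheus endpoint and the metric/label names below are assumptions for illustration only, not the real Wikimedia names.

    # Rough sketch: check how close php-fpm worker pools are to saturation
    # via the Prometheus HTTP API. Endpoint and metric names are hypothetical.
    import requests

    PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint

    def busy_worker_ratio(cluster: str, site: str) -> list[tuple[str, float]]:
        """Return (instance, busy/total worker ratio) for each host in a cluster."""
        # Hypothetical metric names; substitute whatever your php-fpm exporter provides.
        query = (
            f'phpfpm_busy_workers{{cluster="{cluster}",site="{site}"}}'
            f' / phpfpm_total_workers{{cluster="{cluster}",site="{site}"}}'
        )
        resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        return [(r["metric"].get("instance", "?"), float(r["value"][1])) for r in results]

    if __name__ == "__main__":
        for instance, ratio in sorted(busy_worker_ratio("api_appserver", "eqiad"),
                                      key=lambda x: -x[1]):
            flag = "SATURATED" if ratio > 0.9 else ""
            print(f"{instance:40s} {ratio:5.2f} {flag}")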
[18:15:22] can someone possibly look at rolling restarting api app servers on eqiad?
[18:15:39] someone from service ops, more familiar with how to do it nicely, I think it is clogged
[18:15:55] claime: ^
[18:16:00] ack
[18:16:13] claime: look at it, communicate if about to do it
[18:16:23] I think sometimes it helps things recover faster
[18:16:40] but I am not 100% sure
[18:16:57] as a reset > all workers busy
[18:17:09] https://grafana.wikimedia.org/goto/ybaA6Lc4k?orgId=1
[18:17:39] I can do a rolling 5% restart
[18:17:40] my guess is they got "stuck" like last time, except a different set
[18:17:44] it's worth a try as we have seen in the past, unless someone has a better idea
[18:17:44] Via the cookbook
[18:17:55] I am ok with that, does anyone disagree?
[18:17:59] It's just api_appserver?
[18:18:03] just on eqiad
[18:18:07] api indeed
[18:18:25] codfw is all healthy
[18:18:27] ok
[18:18:32] sudo cookbook -d sre.mediawiki.restart-appservers --datacenter eqiad --cluster appserver_api --percentage 5
[18:18:39] claime: I say go ahead
[18:18:44] Running
[18:18:45] +1 from me too fwiw
[18:18:59] they are stuck anyway
[18:19:13] not sure if the recoveries are related to the command already run or just a coincidence :)
[18:19:31] what does 5% mean, is that how much is restarted at a time?
[18:19:32] claime: confirm when it starts?
[18:19:35] Ok that needs a service
[18:19:40] Haven't started
[18:19:54] jhathaway: how much at the same time (batching if I understand correctly)
[18:20:33] we want just the apache, not the whole server
[18:20:48] or whatever web service is being used
[18:20:50] jynus: thanks
[18:21:08] someone else please give updates about impact
[18:21:20] jynus: php-fpm + apache are probably more likely to be needed unless you know which one is locked
[18:21:27] busy threads are going down now
[18:21:31] claime: wait a sec
[18:21:36] I'm heading out now though
[18:21:48] yeah recoveries are happening without the restart
[18:21:52] probably best to leave it alone?
[18:21:52] jynus: holding
[18:22:50] are edits going through now?
[18:23:03] yep https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&from=now-1h&to=now&viewPanel=10
[18:23:14] at least some
[18:23:15] not yet to the same level
[18:23:25] Let me check the appservers graphs
[18:23:51] latency still high
[18:24:01] but 5XX look ok
[18:24:05] Still starved but less
[18:24:54] but it is not fully recovering
[18:25:04] No
[18:25:13] open for suggestions
[18:26:29] the recoveries seem to have stopped too
[18:26:45] I can still restart php-fpm but slower
[18:27:01] Like 2% rate
[18:27:11] worth a shot to see if it helps
[18:27:14] so +1 from me
[18:27:27] why is the 5xx rate still low?
[18:27:40] edits going up
[18:28:27] jynus: I think probably a bit early to say but yeah
[18:28:39] I am unsure what to do, we still have a high number of requests
[18:28:52] appserver latency is also going down, but not recovered to baseline
[18:29:09] if the issue comes from load, depooling some may make things worse
[18:29:27] depooling eqiad during all of this might also not be a good idea
[18:29:35] what are the outstanding alerts?
[18:30:01] I say that because when we depooled a db last time, it didn't help
[18:30:21] if the issue comes from the high load
[18:30:41] jelto do you have the appserver restart command handy?
[18:30:48] with the cookbook?
[18:30:49] no active alerts right now?
[18:31:25] the latency one
[18:31:28] is still ongoing
[18:31:48] workers are also not great, but not terrible
[18:32:09] parsoid suffered too, mobile maybe?
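For readers wondering what --percentage 5 above amounts to, here is an illustrative sketch (not the cookbook's actual code) of a percentage-batched rolling restart: act on a small slice of the cluster, pause, then continue, so most workers keep serving traffic.

    # Illustration only: a "5% rolling restart" restarts the cluster in small
    # batches with a pause between batches. restart_host() is a placeholder;
    # the real cookbook depools, restarts php-fpm, and repools via conftool.
    import math
    import time

    def restart_host(host: str) -> None:
        print(f"restarting php-fpm on {host}")   # placeholder: depool + restart + repool

    def rolling_restart(hosts: list[str], percentage: int = 5, sleep_s: int = 30) -> None:
        batch_size = max(1, math.ceil(len(hosts) * percentage / 100))
        for i in range(0, len(hosts), batch_size):
            for host in hosts[i:i + batch_size]:
                restart_host(host)
            if i + batch_size < len(hosts):
                time.sleep(sleep_s)              # let the batch warm up before continuing

    if __name__ == "__main__":
        rolling_restart([f"mw{n}.eqiad.wmnet" for n in range(1300, 1362)], percentage=5)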
[18:32:34] parsoid took a hit but seems ok now
[18:32:41] yes, not a worry now
[18:33:11] I guess latency https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=66&var-site=eqiad&var-cluster=api_appserver&var-method=GET&var-code=200&var-php_version=All&from=now-2d&to=now is probably the main concern now
[18:33:55] human edits seem ok on enwiki, eswiki
[18:33:59] claime: no, reading cookbook -h and wikitech https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_(also_image/video_scalers_and_job_runners_and_parsoid) atm
[18:35:04] ^ not an expert on this, so probably we can do A:mw-api and A:eqiad and use the same batch and sleep options?
[18:35:10] so we think the higher latency is not due to the current request volume, but due to some after-effects of the initial traffic burst?
[18:36:00] I'm trying to get a sense from logs but...
[18:36:43] Lots of "No stashed content found"
[18:37:31] jhathaway: I think we aren't sure what the smoking gun is yet
[18:37:47] nod
[18:38:08] yeah, I didn't even try to check that - other than generic "load"
[18:38:42] the latency graph isn't getting better fwiw. so I think we should continue to investigate to see if there is any other ongoing issue/impact that might be causing this or be caused by this
[18:38:57] the latency is due to the busy threads
[18:39:24] I thought that was mostly stable now? let me see the graph again
[18:39:25] https://grafana.wikimedia.org/goto/G-KwgEcVz?orgId=1
[18:39:41] they are not saturated, but they are higher than usual
[18:39:46] it correlates with latency
[18:39:51] Yeah
[18:39:56] right, I zoomed out and now I get a better view
[18:40:04] so probably the restart option? :)
[18:40:16] according to turnilo and grafana, we are behind the traffic peak but traffic is still high: https://w.wiki/682S and https://grafana.wikimedia.org/d/3u6RLsL7k/status-page?orgId=1&viewPanel=2&from=1671305747000&to=1671390450000
[18:40:17] claime: why did that cookbook fail? all I saw was the non-zero exit code
[18:40:22] it needed a service?
[18:40:45] I am having an issue with my connection sorry
[18:40:51] I say go ahead, I think we should have the availability now to have 1/2 app servers down
[18:40:54] if that helps
[18:40:59] I am happy running that
[18:41:02] It needs a service but won't take php7.4-fpm or php-fpm
[18:41:18] DRY-RUN: Machines cache initialised to ['https://conf1008.eqiad.wmnet:4001', 'https://conf1009.eqiad.wmnet:4001']
[18:41:20] DRY-RUN: Now acting on: appserver_api/eqiad
[18:41:22] DRY-RUN: Issuing read for key /conftool/v1/pools with args {'recursive': True}
[18:41:24] DRY-RUN: Exception raised while executing cookbook sre.mediawiki.restart-appservers:
[18:41:26] Traceback (most recent call last):
[18:41:28]   File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 357, in query_confctl
[18:41:30]     hosts_conftool = [obj.name for obj in conftool.get(**tags)]
[18:41:32]   File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 357, in <listcomp>
[18:41:34]     hosts_conftool = [obj.name for obj in conftool.get(**tags)]
[18:41:36]   File "/usr/lib/python3/dist-packages/spicerack/confctl.py", line 118, in get
[18:41:38]     for obj in self._select(tags):
[18:41:40]   File "/usr/lib/python3/dist-packages/spicerack/confctl.py", line 89, in _select
[18:41:42]     raise ConfctlError("No match found")
[18:41:44] spicerack.confctl.ConfctlError: No match found
[18:41:46] Looks like borked aliases
[18:42:05] claime: try doing 1 manually
[18:42:13] as in, depool; restart; pool
[18:42:24] (as an idea)
[18:43:45] cgoubert@mw1356:~$ sudo depool
[18:43:47] Depooling all services on mw1356.eqiad.wmnet
[18:43:49] eqiad/api_appserver/nginx/mw1356.eqiad.wmnet: pooled changed yes => no
[18:43:51] cgoubert@mw1356:~$ sudo -i /usr/local/sbin/restart-php7.4-fpm
[18:43:53] 2022-12-18 18:43:37,798 [INFO] The server is depooled from all services. Restarting the service directly
[18:44:13] cool, now let's see the effect
[18:44:28] if it works, we can do that a bunch of times manually if needed, if not, we try something else
[18:44:31] 0 :-D
[18:44:33] 13:20 <+icinga-wm> RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 2.769 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:44:42] ok
[18:44:42] we already had a recovery for this
[18:44:51] pool now
[18:44:56] It's repooled
[18:44:59] ah, sorry
[18:45:07] this is ~14 minutes ago sorry, in my local time
[18:45:10] I'll go find one that's still alerting
[18:45:47] claime: you do it but ask for help if needed
[18:48:31] not sure that works, they go to busy instantly?
[18:48:43] https://grafana.wikimedia.org/goto/AyFdzEc4z?orgId=1
[18:49:31] we have another page
[18:49:55] I think my vote is for cumin -b 1 -s 30 'A:mw-api and A:eqiad' 'restart-php7.4-fpm'
[18:49:55] Yeah, it's whack-a-mole
[18:50:03] Agreed
[18:50:15] Please run it as my connection is not stable at all
[18:50:20] happy to
[18:50:21] jynus: ?
[18:50:23] And I'll do more harm than good
[18:50:23] what do you think?
[18:50:30] let's get another +1 I guess
[18:50:37] acking the alarms
[18:50:40] thanks
[18:50:40] I don't think restarting is working, but if you want to do some?
[18:51:15] +1, I don't think there is much harm in trying
[18:51:22] go ahead
[18:51:24] then
[18:51:30] sukhe: does it
[18:52:17] ok.
[18:52:22] should I do it for all of A:eqiad
[18:52:29] or should I try with a restricted subset first?
[18:52:38] I guess we have already seen a restricted subset recover and it made no impact
[18:52:42] so perhaps a cluster-wide restart is in order?
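A sketch of the depool → restart → repool sequence claime ran by hand on mw1356, generalized to a list of hosts. It assumes the depool/pool wrappers and /usr/local/sbin/restart-php7.4-fpm shown in the log plus passwordless ssh, and it skips health checks; the actual cluster-wide run below used cumin instead.

    # Sketch only: per-host drain, restart, and repool over ssh.
    import subprocess
    import time

    def ssh(host: str, command: str) -> None:
        subprocess.run(["ssh", host, command], check=True)

    def restart_one(host: str, settle_s: int = 30) -> None:
        ssh(host, "sudo depool")                                 # drain traffic first
        ssh(host, "sudo -i /usr/local/sbin/restart-php7.4-fpm")  # restart the workers
        time.sleep(settle_s)                                     # give opcache a moment to warm
        ssh(host, "sudo pool")                                   # put it back in rotation

    if __name__ == "__main__":
        for h in ["mw1356.eqiad.wmnet"]:   # extend the list once one host looks healthy
            restart_one(h)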
[18:54:03] go ahead with 1 first and then all?
[18:54:12] I think cluster-wide is ok as long as it's slow, so with batch 1 we should be good
[18:54:16] it is only api that has the issue
[18:54:32] ok going ahead with -b 1 and -s 30
[18:54:45] we can do 60 too if that's more desirable (I have not done this before and I don't understand this fully)
[18:54:49] but not to delay it further
[18:54:53] going with 30
[18:54:55] and we can see
[18:55:38] please go ahead
[18:55:43] running
[18:55:49] logged in #operations
[18:57:59] Latency seems to be decreasing
[19:07:15] discussiontoolspageinfo is taking 25% of requests
[19:07:22] query is taking 34%
[19:09:23] maybe it is normal, but looks weird
[19:09:57] edits have recovered
[19:10:06] latency graph looks better but not fully there yet
[19:11:01] I think it may take a little while to show the restart because of opcache warmup
[19:11:21] I wonder if we have an issue with edits being heavier than usual
[19:11:28] as in, code-wise
[19:11:34] that might also be organic traffic
[19:12:10] see the graph, it shouldn't dominate over query
[19:13:27] as in there is some optimization needed, but it was only triggered when many people started discussing
[19:15:20] As I said at the start, the alerts didn't start with the end of the match. They have been flapping all weekend. Might be worth checking what happened when latency + workers first started alerting.
[19:15:23] we had a 75% increase in traffic, we held higher spikes
[19:15:39] End of the match was just when it got bad enough to page
[19:15:44] so I think load contributed, but there is something else going on
[19:16:11] sukhe: yeah not saying it is not organic, just that it looks strange performance-wise
[19:16:19] yeah I hear you
[19:16:21] I am thinking lack of optimization, but have no proof
[19:17:12] as in, some huge increase in specific code paths related to visual editor or something
[19:19:33] things are getting better now, but I think it could also be organically caused
[19:19:56] 43/62
[19:20:10] we can try to match up the timings of the restart to see how much things helped for the future
[19:20:12] users confirm api is reachable, just slow
[19:20:17] but I am also not exactly convinced how much it did
[19:20:20] can you do that?
[19:20:26] I trust you to make a call
[19:20:56] I think it only helps if requests are stuck on a backend
[19:21:02] if it is slow code, we are not helping at all
[19:21:35] we started the restart at 18:55:12
[19:21:59] ^ in case we want to associate the "things getting better" with the restarts
[19:22:08] what is your take?
[19:22:10] jynus: yes happy to. I am also not entirely convinced
[19:22:14] 18:55 to me, on RED, is the last spike before the long tail
[19:22:16] need to dig in better
[19:22:29] It is trending downwards, just not fast
[19:22:40] faster from :55 or as before?
[19:22:43] that is the key
[19:23:23] https://grafana.wikimedia.org/goto/yQyvZPcVk?orgId=1
[19:23:55] More or less centered around 18:55 to now to try to see if it trends downwards faster
[19:24:31] I vote to pause
[19:24:44] 51/62 if we decide to pause
[19:24:51] ok, then it doesn't matter
[19:24:55] If there's only ten left, pausing doesn't matter
[19:25:02] I am quite sure it didn't help
[19:25:03] we're in agreement
[19:25:09] It didn't make it worse either
[19:25:25] my vote is on discussiontools creating too much load
[19:25:40] load will be organically low from now on
[19:25:54] tomorrow people can debug for performance
[19:26:27] can you make sure no temporary measure is left in place in preparation for that to finish?
[19:26:56] unless of course you have another signal telling you the trigger or root cause?
[19:27:15] I have nothing
[19:27:25] from what I see here (manual restart of mw1356 claime did): the cpu usage just went back up to pre-restart levels, memory saturation a bit better than before: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mw1356&var-datasource=thanos&var-cluster=api_appserver&from=1671387917340&to=1671390518521
[19:27:29] jynus: how were you able to zero in on discussiontools?
[19:27:45] jhathaway: I don't know yet
[19:27:56] nod
[19:28:04] but knowing it was api and "cpu" (as in, not an external service)
[19:28:21] not literally cpu, but hope you understand, load of the workers/saturation
[19:28:35] I looked at the api usage graph
[19:29:11] "query" is the module that is used in 80% of the api calls (give me a page)
[19:29:44] jhathaway: graph on doc
[19:29:50] jynus: thanks
[19:30:29] so this is a guess but: new functionality that needs optimization and was only triggered when we almost doubled the traffic
[19:30:40] but needs research
[19:31:01] jhathaway: https://grafana.wikimedia.org/d/000000002/api-backend-summary
[19:31:17] honestly, I am not an app server person, so I am the wrong person to know about this :-(
[19:31:28] !log [FINISHED] running sudo cumin -b 1 -s 30 'A:mw-api and A:eqiad' 'restart-php7.4-fpm'
[19:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:36] er twice, but that's ok
[19:31:38] we should also involve the performance team tomorrow
[19:31:51] in addition to service ops
[19:32:30] any outstanding alert?
[19:32:37] yeah personally (not that I know much about app servers), I am not convinced that we found the smoking gun today
[19:33:43] idle workers still recovering jynus
[19:34:04] latency decreased a lot now
[19:34:06] 0.3548 gt 0.3
[19:34:08] Close though
[19:34:21] ignore that
[19:34:31] it is lowering at a normal pace
[19:34:43] Yep
[19:35:13] I was just answering "any outstanding alert?", I thought you meant "anything still critical", my bad
[19:35:34] that is consistent with some functions being slower and spiking, reaching outage
[19:35:38] but not still being slow
[19:36:20] thoughts about this?
[19:36:45] idle threads are finally higher than busy
[19:37:39] I am going to reduce the ticket from unbreak now to high
[19:37:50] +1
[19:37:53] jynus: makes sense, thanks
[19:38:02] also outage to monitoring
[19:38:41] agreed
[19:39:06] I will add my suspicion of the cause if you are ok with that
[19:39:17] mostly for the readers to start from somewhere
[19:39:34] please do
[19:40:23] thanks everyone, and jynus for ICing, I am going to step away for a bit
[19:41:10] yes I think we can do that
[19:41:21] thanks claime and jynus. I will be around if needed (since it's late for you folks anyway)
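A minimal sketch of the before/after comparison discussed around 19:22-19:24 (did latency trend downwards faster once the rolling restart started at 18:55:12?), assuming latency samples exported as (unix timestamp, seconds) pairs; the example numbers are made up.

    # Sketch only: compare the average rate of change of latency before and
    # after a cutoff timestamp (e.g. the 18:55:12 restart start).
    def slope(samples: list[tuple[float, float]]) -> float:
        """Average rate of change (latency units per second) over the samples."""
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        return (v1 - v0) / (t1 - t0)

    def compare_trend(samples: list[tuple[float, float]], cutoff_ts: float) -> None:
        before = [s for s in samples if s[0] < cutoff_ts]
        after = [s for s in samples if s[0] >= cutoff_ts]
        if len(before) > 1 and len(after) > 1:
            print(f"slope before: {slope(before):+.6f}/s, after: {slope(after):+.6f}/s")

    # Made-up example: latency roughly flat before the cutoff, falling after it.
    compare_trend([(0, 2.0), (600, 2.1), (1200, 2.0), (1800, 1.6), (2400, 1.2)], cutoff_ts=1200)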
[19:42:10] declaring it resolved then
[19:42:15] ty
[19:42:17] There's something quite strange in the API backend graphs though, the request rate for discussiontoolspageinfo didn't rise that much (2.76 -> 3.2k rps) but load increased 10-fold
[19:42:43] load in general, or for that?
[19:42:52] jynus: for that
[19:43:06] ok, so still a possible root cause?
[19:43:13] Possible yeah
[19:43:14] to note it on the doc
[19:43:18] ok doing
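A back-of-the-envelope check of that last observation, assuming "2.76 -> 3.2k rps" means 2.76k → 3.2k requests/s and the 10-fold figure is the load ratio over the same window: the per-request cost would have risen roughly 8-9x, which supports the "module got slower per call, not just busier" suspicion.

    # Assumes both readings cover the same time window.
    rate_ratio = 3.2 / 2.76          # ~1.16x more requests
    load_ratio = 10.0                # stated 10-fold load increase
    per_request_cost = load_ratio / rate_ratio
    print(f"rate x{rate_ratio:.2f}, load x{load_ratio:.0f} -> ~x{per_request_cost:.1f} cost per request")
    # ~x8.6 per request, if both figures are accurate.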