[12:11:43] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert)
[12:12:01] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert)
[12:12:34] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert) Open→In progress p:Triage→High
[13:57:41] _joe_: do you have a minute to talk about shellbox-(syntaxhighlight)?
[13:57:57] <_joe_> jayme: what about it?
[13:58:16] https://wikitech.wikimedia.org/wiki/Incidents/2022-12-21_shellbox-syntaxhighlight
[13:58:37] serviceops, Release Pipeline, User-MoritzMuehlenhoff: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? - https://phabricator.wikimedia.org/T307797 (MoritzMuehlenhoff)
[13:59:20] <_joe_> we had a similar outage in the past due to some bug
[13:59:24] <_joe_> in mediawiki
[13:59:33] <_joe_> search the incident report folder
[14:00:01] yeah, I recall. I'm looking into envoy telemetry data currently and wonder if shellbox is only ever "frequently" called by jobrunners/apiservers, and if/which of those calls are async
[14:00:25] <_joe_> so most calls would be async
[14:00:35] <_joe_> because that is called when rendering wikitext
[14:00:46] <_joe_> and that mostly happens asynchronously
[14:00:57] This time it was suggested that maybe some template got changed that required a bunch of pages to be re-parsed (leading to a backlog and exhaustion of syntaxhighlight)
[14:01:01] <_joe_> some calls will come from the api (typically for Visual Editor)
[14:01:31] <_joe_> the backlog doesn't mean we increase the consumption rate we set as maximum
[14:01:52] <_joe_> have you checked refreshlinksUpdate job rate of execution during the incident?
[14:01:54] hmm...so in theory that should prevent such a spike
[14:01:57] no
[14:02:10] <_joe_> so either a bunch of pages with a lot of code were edited
[14:02:20] <_joe_> or there's a bug :)
[14:02:57] is that https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=refreshLinks ?
[14:04:37] <_joe_> yes
[14:04:45] that does not look very suspicious to me
[14:04:47] <_joe_> uhm wait a template edit
[14:04:48] <_joe_> sorry
[14:04:51] <_joe_> not that job
[14:05:23] yeah, that was the idea. Can I verify that somehow?
[14:05:35] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=htmlCacheUpdate&from=1671606475304&to=1671623687804
[14:05:42] <_joe_> there
[14:06:36] ah, this looks more like something happened indeed
[14:07:43] but that could still be a bunch of pages edited "manually", right? It does not 100% prove that a template was involved?
[14:09:09] and it does not look like it would justify an increase from 5 to 70 req/s to shellbox. But IIRC every syntaxhighlight block triggers one call, so one htmlCacheUpdate job could lead to N requests
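To make that fan-out concrete, here is a minimal back-of-the-envelope sketch in Python. The job rate and blocks-per-page figures are purely illustrative assumptions, not measurements from the incident:

```python
# Back-of-the-envelope fan-out estimate (hypothetical numbers, not measured):
# a reparse job touches one page, and every <syntaxhighlight> block on that page
# turns into one request to shellbox-syntaxhighlight.

jobs_per_second = 2.0        # assumed rate of reparse jobs hitting pages with code
avg_highlight_blocks = 35    # assumed <syntaxhighlight> blocks per affected page

shellbox_rps = jobs_per_second * avg_highlight_blocks
print(f"estimated shellbox-syntaxhighlight load: {shellbox_rps:.0f} req/s")
# With numbers in this ballpark, a jump from a ~5 req/s baseline to ~70 req/s is
# plausible even without an unusual job rate -- the multiplier is blocks per page,
# not jobs per second.
```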
[14:32:22] <_joe_> yes, depending on the page, that can happen
[14:32:41] <_joe_> but then, why would this justify a starvation of workers on the api cluster
[14:32:50] serviceops, Patch-For-Review: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert) Added a bit more documentation at https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_(also_image/video_scalers_and_job_runners_and_parsoi...
[14:34:43] <_joe_> that I can't explain tbh
[14:38:06] I thought it (syntaxhighlight) was maybe responding so slowly that requests were piling up?
[14:38:29] but AIUI that should (in a template change situation) all be limited to the jobrunners, right?
[14:40:08] There has to be something else, we've been having a lot more api worker starvation recently
[14:40:17] Look at the last 30 days https://grafana.wikimedia.org/goto/puUHdC5Vk?orgId=1
[14:41:28] <_joe_> claime: let's call most of it "traffic patterns"
[14:41:53] <_joe_> jayme: my point is - api calls should only sometimes hang on syntaxhighlight, but what was the total RPS at peak?
[14:43:40] <_joe_> claime: but the *average* utilization increased as well
[14:44:23] <_joe_> now not sure if the traffic did increase accordingly
[14:44:33] <_joe_> anyways, for the current problem
[14:45:06] <_joe_> the api servers have what, 4k workers?
[14:45:22] <_joe_> I can't imagine we'd have 4k workers stuck waiting for syntaxhighlight
[14:46:14] <_joe_> something definitely changed on 12/09
[14:46:32] that's what I just spotted too
[14:46:55] <_joe_> Amir1: do you remember when we enabled rendering of the parsoid html from jobs?
[14:47:26] _joe_: total RPS to syntaxhighlight?
[14:47:39] <_joe_> claime: ah no it's traffic, see https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1670165993790&orgId=1&to=1671633938345&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=&var-site=eqiad&var-code=200&var-php_version=All&forceLogin=&viewPanel=17
[14:47:45] <_joe_> now why we make so many more api calls
[14:47:48] <_joe_> no idea
[14:47:53] It was enabled in different ways at different times. I honestly lost track
[14:47:59] <_joe_> jayme: yeah let me check
[14:48:05] <_joe_> Amir1: yeah not it, don't worry
[14:48:16] Awesome
[14:48:18] _joe_: IIRC ~70 req/s
[14:48:38] Eating lunch. Afterwards I can come and help debugging
[14:49:12] <_joe_> Amir1: there is 0 rush, eat your lunch :)
[14:49:25] <_joe_> jayme: so how can that occupy 4k workers?
[14:50:21] yeah it can't. that's why I started asking questions :)
[14:50:54] (it was more like 80 req/s but still)
[14:51:07] <_joe_> interesting https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1671612181828&orgId=1&to=1671615094511&var-cluster=parsoid&var-datasource=codfw%20prometheus%2Fops&var-method=GET&var-site=eqiad&var-code=200&var-php_version=All&forceLogin=&viewPanel=17
[14:51:39] <_joe_> so parsoid saw an increase in requests around the same time
[14:52:54] <_joe_> what's the timeout we set in the service proxy for a call to shellbox-syntaxhighlight?
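Why the timeout matters here: worker occupancy is roughly request rate multiplied by how long each request holds a worker (Little's law). A minimal sketch with the ~80 req/s and ~4k-worker figures from the discussion; the per-request hold times below are illustrative, with 60 s standing in for a long proxy timeout (the actual envoy value is confirmed a bit further down):

```python
# Little's law: workers held ~= request rate (req/s) * time each request holds a worker (s).
# The rate and worker-count figures come from the discussion; hold times are illustrative.

def workers_held(rps: float, hold_seconds: float) -> float:
    return rps * hold_seconds

rate = 80.0                    # peak shellbox-syntaxhighlight call rate from the api cluster
for hold in (0.5, 5, 60):      # fast response, degraded, and worst case: waiting out a 60 s timeout
    print(f"{rate:.0f} req/s * {hold:>4} s -> ~{workers_held(rate, hold):.0f} workers busy")

# For 80 req/s to pin ~4000 workers, each call would have to hold its worker for
# roughly 4000 / 80 = 50 s, i.e. essentially every call running into the proxy timeout.
print(f"hold time needed to pin 4000 workers: {4000 / rate:.0f} s per request")
```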
[14:54:08] _joe_: There's a scap that coincides with the start of the rise of api calls
[14:54:32] <_joe_> claime: what is the justification?
[14:54:40] not there yet
[14:54:43] <_joe_> I mean why would someone scap today?
[14:54:48] I don't mean today
[14:55:01] I mean looking at the traffic graph rise
[14:55:02] <_joe_> ahh no then disregard
[14:55:06] <_joe_> it's traffic
[14:55:08] From 12/09
[14:55:13] ack
[14:55:14] <_joe_> I doubt it's self-inflicted
[14:55:21] <_joe_> oh wait, lol
[14:55:32] <_joe_> did the traffic go back to normal on monday this week?
[14:55:46] shellbox timeout in envoy is 60s
[14:55:55] _joe_: looks like it...
[14:55:57] <_joe_> in that case then yes, it's the discussiontools issue that almost killed us during the world cup final
[14:56:05] bit soon to tell, but yeah
[14:56:06] <_joe_> claime: you might remember that outage
[14:56:13] NO I DON'T
[14:56:18] <_joe_> no I'm definitely sure if the traffic went down
[14:56:33] <_joe_> jayme: uhhh
[14:56:42] <_joe_> well yeah we need to revisit those timeouts heh
[14:56:45] <_joe_> for each service
[14:57:13] for syntaxhighlight it might be on purpose. there are tasks about it being pretty slow
[14:57:15] <_joe_> so 70 rps * 2 min * 60s of timeout
[14:58:18] <_joe_> jayme: not supported by data
[14:58:20] <_joe_> https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&viewPanel=36&from=now-2d&to=now
[14:58:38] <_joe_> p99 rn is 500 ms
[14:58:46] hmm, indeed
[14:58:53] <_joe_> so I would say if we set a timeout of 2 seconds it's more reasonable
[14:59:13] <_joe_> and for stuff like shellbox it might make sense to start thinking about autoscaling
[15:02:11] _joe_ │ so 70 rps * 2 min * 60s of timeout < Where do the 2 min come from?
[15:08:24] <_joe_> claime: a coarse integration of https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=shellbox-syntaxhighlight&from=1671611077534&to=1671615934553
[15:09:11] <_joe_> at the peak, which is a couple of minutes, it's even 100 rps or more
[15:09:21] <_joe_> many unsuccessful
[15:09:33] <_joe_> so yeah something definitely happened re:api
[15:09:41] <_joe_> I would look at traffic patterns tbh
[15:10:12] <_joe_> but yeah basically we need: better circuit breaking, lower timeouts
[15:59:15] I probably need to read the discussiontools incident again to understand how that is related
[15:59:43] <_joe_> it caused an increase in api requests for certain types of pages
[16:00:04] ah, so the relation is that apiservers are under more pressure in general
[16:07:16] looking at the request data there is one UA standing out (Java/1.8.0_312) which started hitting w/api.php this morning
[16:08:13] https://superset.wikimedia.org/r/2248
[16:09:27] but no exceptional amount of requests AIUI, and it came around way before the incident
[16:29:30] does this help in any way? https://grafana.wikimedia.org/d/000000002/api-backend-summary?orgId=1&refresh=5m&from=now-7d&to=now&viewPanel=36
[16:29:38] time-wise especially
[16:31:51] the spike in parse load goes nicely with the timing of today, yes. the other one was already recorded in the discussiontools incident doc
[16:33:26] jayme: yeah but that's not just any parse, it's calls to action=parse in the API, that would narrow down the search a bit
[16:33:45] (in the action API)
[16:54:56] <_joe_> Amir1: we should seriously consider changing the behaviour of action=parse
[16:55:10] <_joe_> or narrowly restricting its usage
[17:01:52] Yeah. Totally agreed
[17:02:05] Possibly via pool counter?
[17:35:12] <_joe_> I was thinking more of requiring authentication, and throttling per-user, but yes poolcounter is probably a good first block
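For context on the suggestion: PoolCounter caps how many clients may concurrently perform the same expensive operation, so a burst fails fast instead of tying up every worker. Below is a minimal conceptual sketch of that mechanism in Python; it is not MediaWiki's actual PoolCounter protocol or configuration, and the class, key names, and slot counts are made up for illustration:

```python
import threading
from collections import defaultdict

# Conceptual illustration of PoolCounter-style limiting: at most `slots` workers may
# concurrently run the same expensive operation (e.g. parsing a given page); everyone
# else fails fast instead of piling up and starving the worker pool.
# This is NOT MediaWiki's PoolCounter implementation, just the idea being proposed.

class ConcurrencyPool:
    def __init__(self, slots: int):
        self._locks = defaultdict(lambda: threading.Semaphore(slots))
        self._guard = threading.Lock()

    def try_acquire(self, key: str, timeout: float = 0.0) -> bool:
        with self._guard:
            sem = self._locks[key]
        return sem.acquire(timeout=timeout)

    def release(self, key: str) -> None:
        with self._guard:
            self._locks[key].release()

pool = ConcurrencyPool(slots=4)

def handle_parse_request(page: str) -> str:
    if not pool.try_acquire(f"parse:{page}", timeout=0.5):
        return "429: too many concurrent parses for this page, try again later"
    try:
        return f"parsed {page}"   # stand-in for the expensive parse/shellbox call
    finally:
        pool.release(f"parse:{page}")

print(handle_parse_request("Module:Citation/CS1"))
```

As the next messages point out, the trade-off is fairness: heavy clients can monopolize the available slots, and everyone else just sees the fast-fail errors.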
[17:35:52] <_joe_> at some point we'll have to deal with the reality that we either add a lot of servers or we start limiting how our apis can be accessed
[17:36:55] <_joe_> the reason why I mostly dislike measures like poolcounter is that while they are fair to *us* (preventing a disaster so we don't get paged) they're not fair to our users - typically the abusers will hoard all available parsing spots
[17:37:14] <_joe_> leaving other users with poolcounter errors
[17:37:30] <_joe_> but more specifically, why do we even have action=parse?
[17:38:03] <_joe_> what is the use-case? why should forcing a reparse be something we allow a client to make a decision about?
[17:38:29] <_joe_> I get why it could be useful for administrative reasons, but then we protect it with authn/authz
[17:50:16] apart from getting distracted by the UA "MWOffliner" being a pun, I got nothing really from the webrequests
[17:59:50] <_joe_> mwoffliner is a known (ab)user with a historically legit reason to use action=parse
[18:00:01] <_joe_> I think they have better options now though
[18:00:24] yeah, I figured :)
[18:10:50] action=parse&text=... is the only way to find out what html a snippet of wikitext turns into
[18:11:35] * bd808 has used that in multiple non-wiki but wiki-related tools to render user input
[19:55:14] action=parse is used by MW for live previews, etc.
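To illustrate the use case described above: a small sketch of rendering a wikitext snippet through the action API with action=parse&text=.... The endpoint and the snippet are arbitrary examples chosen for illustration; a <syntaxhighlight> block in the submitted text is exactly the kind of input that ends up as a shellbox-syntaxhighlight call on the server side:

```python
import json
import urllib.parse
import urllib.request

# Render a snippet of wikitext via the MediaWiki action API (the use case above).
# The target wiki and the wikitext are example inputs, not anything from the incident.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "format": "json",
    "formatversion": "2",
    "contentmodel": "wikitext",
    "prop": "text",
    "text": '<syntaxhighlight lang="python">print("hello")</syntaxhighlight>',
}

req = urllib.request.Request(
    API,
    data=urllib.parse.urlencode(params).encode(),
    headers={"User-Agent": "parse-example/0.1 (demo)"},
)
with urllib.request.urlopen(req) as resp:
    html = json.load(resp)["parse"]["text"]
print(html[:200])  # rendered HTML; the highlighting itself is the work shellbox does server-side
```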