[12:11:43] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert)
[12:12:01] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert)
[12:12:34] serviceops: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert) Open→In progress p:Triage→High
[13:57:41] _joe_: do you have a minute to talk about shellbox-(syntaxhighlight)?
[13:57:57] <_joe_> jayme: what about it?
[13:58:16] https://wikitech.wikimedia.org/wiki/Incidents/2022-12-21_shellbox-syntaxhighlight
[13:58:37] serviceops, Release Pipeline, User-MoritzMuehlenhoff: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? - https://phabricator.wikimedia.org/T307797 (MoritzMuehlenhoff)
[13:59:20] <_joe_> we had a similar outage in the past due to some bug
[13:59:24] <_joe_> in mediawiki
[13:59:33] <_joe_> search the incident report folder
[14:00:01] yeah, I recall. I'm looking into envoy telemetry data currently and wonder if shellbox is only ever "frequently" called by jobrunners/apiservers, and if/which of those calls are async
[14:00:25] <_joe_> so most calls would be async
[14:00:35] <_joe_> because that is called when rendering wikitext
[14:00:46] <_joe_> and that mostly happens asynchronously
[14:00:57] This time it was suggested that maybe some template got changed that required a bunch of pages to be re-parsed (leading to a backlog and exhaustion of syntaxhighlight)
[14:01:01] <_joe_> some calls will come from the api (typically for Visual Editor)
[14:01:31] <_joe_> the backlog doesn't mean we increase the consumption rate we set as maximum
[14:01:52] <_joe_> have you checked refreshlinksUpdate job rate of execution during the incident?
[14:01:54] hmm...so in theory that should prevent such a spike
[14:01:57] no
[14:02:10] <_joe_> so either a bunch of pages with a lot of code were edited
[14:02:20] <_joe_> or there's a bug :)
[14:02:57] is that https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=refreshLinks ?
[14:04:37] <_joe_> yes
[14:04:45] that does not look very suspicious to me
[14:04:47] <_joe_> uhm wait a template edit
[14:04:48] <_joe_> sorry
[14:04:51] <_joe_> not that job
[14:05:23] yeah, that was the idea. Can I verify that somehow?
[14:05:35] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=htmlCacheUpdate&from=1671606475304&to=1671623687804
[14:05:42] <_joe_> there
[14:06:36] ah, this looks more like something happened indeed
[14:07:43] but that could still be a bunch of pages edited "manually", right? It does not 100% prove that a template was involved?
[14:09:09] and it does not look like it would justify an increase from 5 to 70 req/s to shellbox. But IIRC every syntaxhighlight block triggers one call, so one htmlCacheUpdate job could lead to N requests
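To make that fan-out concrete, here is a minimal back-of-the-envelope sketch in Python. The job rate and blocks-per-page figures are purely illustrative assumptions, not measurements from the incident:

```python
# Back-of-the-envelope fan-out estimate (hypothetical numbers, not measured):
# a reparse job touches one page, and every <syntaxhighlight> block on that page
# turns into one request to shellbox-syntaxhighlight.

jobs_per_second = 2.0        # assumed rate of reparse jobs hitting pages with code
avg_highlight_blocks = 35    # assumed <syntaxhighlight> blocks per affected page

shellbox_rps = jobs_per_second * avg_highlight_blocks
print(f"estimated shellbox-syntaxhighlight load: {shellbox_rps:.0f} req/s")
# With numbers in this ballpark, a jump from a ~5 req/s baseline to ~70 req/s is
# plausible even without an unusual job rate -- the multiplier is blocks per page,
# not jobs per second.
```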
[14:32:22] <_joe_> yes, depending on the page, that can happen
[14:32:41] <_joe_> but then, why would this justify a starvation of workers on the api cluster
[14:32:50] serviceops, Patch-For-Review: Fix sre.mediawiki.restart-appservers cookbook and doc - https://phabricator.wikimedia.org/T325739 (Clement_Goubert) Added a bit more documentation at https://wikitech.wikimedia.org/wiki/Service_restarts#Application_servers_(also_image/video_scalers_and_job_runners_and_parsoi...
[14:34:43] <_joe_> that I can't explain tbh
[14:38:06] I thought it (syntaxhighlight) was maybe responding so slowly that requests were piling up?
[14:38:29] but AIUI that should (in a template change situation) all be limited to the jobrunners, right?
[14:40:08] There has to be something else, we've been having a lot more api worker starvation recently
[14:40:17] Look at the last 30 days https://grafana.wikimedia.org/goto/puUHdC5Vk?orgId=1
[14:41:28] <_joe_> claime: let's call most of it "traffic patterns"
[14:41:53] <_joe_> jayme: my point is - api calls should only sometimes hang on syntaxhighlight, but what was the total RPS at peak?
[14:43:40] <_joe_> claime: but the *average* utilization increased as well
[14:44:23] <_joe_> now not sure if the traffic did increase accordingly
[14:44:33] <_joe_> anyways, for the current problem
[14:45:06] <_joe_> the api servers have what, 4k workers?
[14:45:22] <_joe_> I can't imagine we'd have 4k workers stuck waiting for syntaxhighlight
[14:46:14] <_joe_> something definitely changed on 12/09
[14:46:32] that's what I just spotted too
[14:46:55] <_joe_> Amir1: do you remember when we enabled rendering of the parsoid html from jobs?
[14:47:26] _joe_: total RPS to syntaxhighlight?
[14:47:39] <_joe_> claime: ah no it's traffic, see https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1670165993790&orgId=1&to=1671633938345&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=&var-site=eqiad&var-code=200&var-php_version=All&forceLogin=&viewPanel=17
[14:47:45] <_joe_> now why we make so many more api calls
[14:47:48] <_joe_> no idea
[14:47:53] It was enabled in different ways at different times. I honestly lost track
[14:47:59] <_joe_> jayme: yeah let me check
[14:48:05] <_joe_> Amir1: yeah not it, don't worry
[14:48:16] Awesome
[14:48:18] _joe_: IIRC ~70 req/s
[14:48:38] Eating lunch. Afterwards I can come and help debugging
[14:49:12] <_joe_> Amir1: there is 0 rush, eat your lunch :)
[14:49:25] <_joe_> jayme: so how can that occupy 4k workers?
[14:50:21] yeah it can't. that's why I started asking questions :)
[14:50:54] (it was more like 80 req/s but still)
[14:51:07] <_joe_> interesting https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1671612181828&orgId=1&to=1671615094511&var-cluster=parsoid&var-datasource=codfw%20prometheus%2Fops&var-method=GET&var-site=eqiad&var-code=200&var-php_version=All&forceLogin=&viewPanel=17
[14:51:39] <_joe_> so parsoid saw an increase in requests around the same time
[14:52:54] <_joe_> what's the timeout we set in the service proxy for a call to shellbox-syntaxhighlight?
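Why the timeout matters here: worker occupancy is roughly request rate multiplied by how long each request holds a worker (Little's law). A minimal sketch with the ~80 req/s and ~4k-worker figures from the discussion; the per-request hold times below are illustrative, with 60 s standing in for a long proxy timeout (the actual envoy value is confirmed a bit further down):

```python
# Little's law: workers held ~= request rate (req/s) * time each request holds a worker (s).
# The rate and worker-count figures come from the discussion; hold times are illustrative.

def workers_held(rps: float, hold_seconds: float) -> float:
    return rps * hold_seconds

rate = 80.0                    # peak shellbox-syntaxhighlight call rate from the api cluster
for hold in (0.5, 5, 60):      # fast response, degraded, and worst case: waiting out a 60 s timeout
    print(f"{rate:.0f} req/s * {hold:>4} s -> ~{workers_held(rate, hold):.0f} workers busy")

# For 80 req/s to pin ~4000 workers, each call would have to hold its worker for
# roughly 4000 / 80 = 50 s, i.e. essentially every call running into the proxy timeout.
print(f"hold time needed to pin 4000 workers: {4000 / rate:.0f} s per request")
```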
[14:54:08] _joe_: There's a scap that coincides with the start of the rise of api calls
[14:54:32] <_joe_> claime: what is the justification?
[14:54:40] not there yet
[14:54:43] <_joe_> I mean why would someone scap today?
[14:54:48] I don't mean today
[14:55:01] I mean looking at the traffic graph rise
[14:55:02] <_joe_> ahh no then disregard
[14:55:06] <_joe_> it's traffic
[14:55:08] From 12/09
[14:55:13] ack
[14:55:14] <_joe_> I doubt it's self-inflicted
[14:55:21] <_joe_> oh wait, lol
[14:55:32] <_joe_> did the traffic go back to normal on monday this week?
[14:55:46] shellbox timeout in envoy is 60s
[14:55:55] _joe_: looks like it...
[14:55:57] <_joe_> in that case then yes, it's the discussiontools issue that almost killed us during the world cup final
[14:56:05] bit soon to tell, but yeah
[14:56:06] <_joe_> claime: you might remember that outage
[14:56:13] NO I DON'T
[14:56:18] <_joe_> no I'm definitely sure if the traffic went down
[14:56:33] <_joe_> jayme: uhhh
[14:56:42] <_joe_> well yeah we need to revisit those timeouts heh
[14:56:45] <_joe_> for each service
[14:57:13] for syntaxhighlight it might be on purpose. there are tasks about it being pretty slow
[14:57:15] <_joe_> so 70 rps * 2 min * 60s of timeout
[14:58:18] <_joe_> jayme: not supported by data
[14:58:20] <_joe_> https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&viewPanel=36&from=now-2d&to=now
[14:58:38] <_joe_> p99 rn is 500 ms
[14:58:46] hmm, indeed
[14:58:53] <_joe_> so I would say if we set a timeout of 2 seconds it's more reasonable
[14:59:13] <_joe_> and for stuff like shellbox it might make sense to start thinking about autoscaling
[15:02:11] _joe_ │ so 70 rps * 2 min * 60s of timeout < Where do the 2 min come from?
[15:08:24] <_joe_> claime: a coarse integration of https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-origin_instance=All&var-destination=shellbox-syntaxhighlight&from=1671611077534&to=1671615934553
[15:09:11] <_joe_> at the peak, which is a couple of minutes, it's even 100 rps or more
[15:09:21] <_joe_> many unsuccessful
[15:09:33] <_joe_> so yeah something definitely happened re:api
[15:09:41] <_joe_> I would look at traffic patterns tbh
[15:10:12] <_joe_> but yeah basically we need: better circuit breaking, lower timeouts
[15:59:15] I probably need to read the discussiontools incident again to understand how that is related
[15:59:43] <_joe_> it caused an increase in api requests for certain types of pages
[16:00:04] ah, so the relation is that apiservers are under more pressure in general
[16:07:16] looking at the request data there is one UA standing out (Java/1.8.0_312) which started hitting w/api.php this morning
[16:08:13] https://superset.wikimedia.org/r/2248
[16:09:27] but no exceptional amount of requests AIUI, and it came around way before the incident
[16:29:30] does this help in any way? https://grafana.wikimedia.org/d/000000002/api-backend-summary?orgId=1&refresh=5m&from=now-7d&to=now&viewPanel=36
[16:29:38] time-wise especially
[16:31:51] the spike in parse load goes nicely with the timing of today, yes. the other one was already recorded in the discussiontools incident doc
[16:33:26] jayme: yeah but that's not just any parse, it's calls to action=parse in the API, that would narrow down the search a bit
[16:33:45] (in the action API)
[16:54:56] <_joe_> Amir1: we should seriously consider changing the behaviour of action=parse
[16:55:10] <_joe_> or narrowly restricting its usage
[17:01:52] Yeah. Totally agreed
[17:02:05] Possibly via pool counter?
[17:35:12] <_joe_> I was thinking more of requiring authentication, and throttling per-user, but yes poolcounter is probably a good first block
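For context on the suggestion: PoolCounter caps how many clients may concurrently perform the same expensive operation, so a burst fails fast instead of tying up every worker. Below is a minimal conceptual sketch of that mechanism in Python; it is not MediaWiki's actual PoolCounter protocol or configuration, and the class, key names, and slot counts are made up for illustration:

```python
import threading
from collections import defaultdict

# Conceptual illustration of PoolCounter-style limiting: at most `slots` workers may
# concurrently run the same expensive operation (e.g. parsing a given page); everyone
# else fails fast instead of piling up and starving the worker pool.
# This is NOT MediaWiki's PoolCounter implementation, just the idea being proposed.

class ConcurrencyPool:
    def __init__(self, slots: int):
        self._locks = defaultdict(lambda: threading.Semaphore(slots))
        self._guard = threading.Lock()

    def try_acquire(self, key: str, timeout: float = 0.0) -> bool:
        with self._guard:
            sem = self._locks[key]
        return sem.acquire(timeout=timeout)

    def release(self, key: str) -> None:
        with self._guard:
            self._locks[key].release()

pool = ConcurrencyPool(slots=4)

def handle_parse_request(page: str) -> str:
    if not pool.try_acquire(f"parse:{page}", timeout=0.5):
        return "429: too many concurrent parses for this page, try again later"
    try:
        return f"parsed {page}"   # stand-in for the expensive parse/shellbox call
    finally:
        pool.release(f"parse:{page}")

print(handle_parse_request("Module:Citation/CS1"))
```

As the next messages point out, the trade-off is fairness: heavy clients can monopolize the available slots, and everyone else just sees the fast-fail errors.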
[17:35:52] <_joe_> at some point we'll have to deal with the reality that we either add a lot of servers or we start limiting how our apis can be accessed
[17:36:55] <_joe_> the reason why I mostly dislike measures like poolcounter is that while they are fair to *us* (preventing a disaster so we don't get paged) they're not fair to our users - typically the abusers will hoard all available parsing spots
[17:37:14] <_joe_> leaving other users with poolcounter errors
[17:37:30] <_joe_> but more specifically, why do we even have action=parse?
[17:38:03] <_joe_> what is the use-case? why should forcing a reparse be something we allow a client to make a decision about?
[17:38:29] <_joe_> I get why it could be useful for administrative reasons, but then we protect it with authn/authz
[17:50:16] apart from getting distracted by the UA "MWOffliner" being a pun, I got nothing really from the webrequests
[17:59:50] <_joe_> mwoffliner is a known (ab)user with a historically legit reason to use action=parse
[18:00:01] <_joe_> I think they have better options now though
[18:00:24] yeah, I figured :)
[18:10:50] action=parse&text=... is the only way to find out what html a snippet of wikitext turns into
[18:11:35] * bd808 has used that in multiple non-wiki but wiki-related tools to render user input
[19:55:14] action=parse is used by MW for live previews, etc.
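To illustrate the use case described above: a small sketch of rendering a wikitext snippet through the action API with action=parse&text=.... The endpoint and the snippet are arbitrary examples chosen for illustration; a <syntaxhighlight> block in the submitted text is exactly the kind of input that ends up as a shellbox-syntaxhighlight call on the server side:

```python
import json
import urllib.parse
import urllib.request

# Render a snippet of wikitext via the MediaWiki action API (the use case above).
# The target wiki and the wikitext are example inputs, not anything from the incident.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "format": "json",
    "formatversion": "2",
    "contentmodel": "wikitext",
    "prop": "text",
    "text": '<syntaxhighlight lang="python">print("hello")</syntaxhighlight>',
}

req = urllib.request.Request(
    API,
    data=urllib.parse.urlencode(params).encode(),
    headers={"User-Agent": "parse-example/0.1 (demo)"},
)
with urllib.request.urlopen(req) as resp:
    html = json.load(resp)["parse"]["text"]
print(html[:200])  # rendered HTML; the highlighting itself is the work shellbox does server-side
```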