[02:06:58] serviceops, Observability-Logging, WMF-General-or-Unknown, Performance-Team (Radar): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (Tgr)
[06:41:43] serviceops, Data-Platform-SRE, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) Can someone please elaborate on what the "generic content store" would need to look like for this project...
[07:08:59] elukey: buon giorno! I've added you to https://phabricator.wikimedia.org/T341625 as a subscriber. Serviceops has been asked whether the new cirrus update events can make it to kafka-main (eventually replacing the old system that also uses kafka-main) and there is some reluctance on our side given the ongoing work on kafka-main.
[07:08:59] Your input would be much appreciated.
[07:10:17] akosiaris: o/ kalimera! Yep yep, already discussing it with Janis. I think we need to do the rebalances first, and then we can decide.. today I should finish main-codfw, then I'll do eqiad; ETA ~a week
[07:14:58] ok, thanks
[08:33:22] hi folks, the avg latency for parsoid get/200 alert has been firing ~constantly since the weekend, is it known?
[08:41:34] hmm https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&from=now-7d&to=now&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=9
[08:41:57] what happened on Saturday at 12pm?
[08:43:32] excellent question, and it looks like parsoid basically never recovered
[08:47:29] the interesting thing is that there isn't a change in requests
[08:48:00] just a change in 302 responses per https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=1689413672923&to=1689437553915&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=22
[08:50:10] indeed, and a corresponding 100k get/s to mcrouter, I'm thinking maybe more expensive requests
[08:50:19] the 302s are a red herring
[08:50:31] zooming out shows it was just a spike: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-7d&to=now&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=22
[08:50:47] yeah, I noticed the get/s to mcrouter too, huge increase there as well
[08:51:33] ok, filing a task in the meantime
[08:55:19] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (fgiunchedi)
[08:55:19] https://phabricator.wikimedia.org/T342085
[08:55:59] ruling out some specific parsoid server being the culprit too, they all have similar patterns
[08:57:30] this is starting to stink of a template change
[08:59:04] ah yeah, that would explain it indeed
[09:00:14] and yet I can't find the usual consequences of that, e.g. increased rate of jobs etc
[09:02:15] and nothing of note in SAL, as expected given it was a Saturday
[09:10:30] restarts on parse1001 did nothing
[09:12:10] ok, the first kinda clear sign: https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPanel=27
[09:12:20] indeed, transclusions
[09:21:32] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (akosiaris) This is apparently due to transclusions per https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPane...
[09:23:40] nice find
[09:24:04] the concerning thing is that increase at the end of the graph
[09:24:09] akosiaris: do we wait for the transclusions to be done in this case or is there something else to do?
[09:24:13] heh, was about to ask about that too
[09:24:13] apparently the backlog is increasing
[09:24:39] but yes, generally we just wait for the transclusions to be done
[09:24:47] not much we can do anyway
[09:25:07] still, I am not sure I have a causal link here
[09:25:30] I mean, I'd expect a lot more requests to parsoid due to transclusion changes
[09:25:43] but what we witness is increased latencies without increased traffic
[09:26:23] fair enough, (spitballing here, please bear with me) could we bring parsoid codfw online to help? afaics it is idle in codfw
[09:28:05] it's active/passive as it also serves POSTs, not just GETs
[09:28:31] so, no, I don't think so, given our current infra
[09:28:50] got it, thank you
[09:29:11] :x
[09:29:17] oopsie - sorry
[09:29:52] that's okay, :x to you too jayme
[09:30:01] ❤️
[09:33:22] this is mostly zh.wikipedia.org and mobileapps
[09:33:35] in a weird way that I don't yet comprehend
[09:47:52] <_joe_> akosiaris: I think the zh.wikipedia.org stuff is usually a red herring
[09:48:13] yeah, quite possibly
[09:48:49] the biggest problem in all of this is... once more, restbase
[09:49:04] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=9&from=now-7d&to=now
[09:49:24] <_joe_> something happened on the 15th at around 12:20
[09:49:34] _joe_: read the backlog :P
[09:49:58] just saw that you go down the same paths I already went down
[09:50:01] so*
[09:50:26] <_joe_> it doesn't fully fit though
[09:50:34] btw, I found one more thing that breaks with json access logs
[09:50:40] <_joe_> have you checked deployments around those times?
[09:50:43] <_joe_> akosiaris: sigh
[09:50:51] and this one is way bigger than user-agent
[09:50:54] it's page titles
[09:50:55] :-(
[09:50:59] <_joe_> oh gosh
[09:51:07] <_joe_> they get escaped by apache too, do they
[09:51:22] yes...
[09:51:35] <_joe_> ok, I *guess* we'll have to somehow unescape those logs before passing them to rsyslog
[09:51:51] yeah, this just became more important
[09:51:52] dammit
[09:52:23] <_joe_> can you comment on the task?
[09:52:24] <_joe_> btw
[09:52:37] <_joe_> https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=now-7d&to=now&viewPane&viewPanel=2 seems to show this is *not* related to transcludes
[09:53:05] <_joe_> we had a bigger spike of transclude jobs on the 11th but latency wasn't that bad
[09:54:24] <_joe_> the transclude jobs match pretty well with the GET requests https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&from=now-7d&to=now&viewPanel=17
[09:54:36] <_joe_> so I think there's something else going on
[09:57:51] the json thing is pretty weird btw
[09:57:56] we log the page titles with
[09:58:04] http://%{Host}i%U%q
[09:58:11] but
[09:58:26] For security reasons, starting with version 2.0.46, non-printable and other special characters in %r, %i and %o are escaped using \xhh sequences, where hh stands for the hexadecimal representation of the raw byte
[09:58:38] so... why is something like this happening?
[09:58:56] "Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/\xe4\xb8\xa4\xe6\xb9\x96\xe4\xb9\xa6\xe9\x99\xa2" ?
[10:01:19] <_joe_> elukey: ^^
[10:01:21] <_joe_> :D
[10:03:03] the correct URL btw is https://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/%E4%B8%A4%E6%B9%96%E4%B9%A6%E9%99%A2
[10:03:59] anyway, putting this in the task and we'll circle back to it later
[10:06:52] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (akosiaris) Just found something more important than User-Agent unfortunately ` Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pag...
[10:08:55] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) I don't think transclusions are the actual cause - we had a very high rate of transclusions on the 11th when the latencies didn't explode like this. Looking at...
[10:14:08] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) Interestingly, this doesn't seem to have affected the prewarm job at all: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqia...
[10:33:28] _joe_ sorry I was away for a doc apt, what's up?
[10:35:43] <_joe_> elukey: nothing, more apache logging woes
[10:35:49] ahahahha okok
[10:35:58] I thought I had messed something up
[10:37:45] <_joe_> akosiaris: maybe we could do something like apache | echo -e | logger
[10:39:04] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (akosiaris) A search says: https://svn.apache.org/viewvc/httpd/httpd/trunk/modules/loggers/mod_log_config.c?r1=98912&r2=98911&pathrev=98912 with the message saying ` Fo...
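A minimal sketch of why those lines choke the pipeline, assuming jq is available as a strict JSON parser and using a shortened one-field stand-in for the real access-log record: JSON strings only define \uXXXX escapes, so apache's \xhh sequences get rejected outright, while the same sequences expanded back into raw UTF-8 bytes parse fine.

    # one-field stand-in for the JSON access-log line, as apache emits it
    line='{"Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/\xe4\xb8\xa4\xe6\xb9\x96\xe4\xb9\xa6\xe9\x99\xa2"}'
    # plain echo keeps the backslashes literal; \x is not a legal JSON escape, so jq rejects the line
    echo "$line" | jq -r .Url || echo "invalid json"
    # echo -e expands the \xhh sequences back into raw UTF-8 bytes, which jq parses cleanly
    echo -e "$line" | jq -r .Url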
[10:39:23] <_joe_> but I'd rather make apache not escape the logs, given we're not persisting them to file
[10:41:11] <_joe_> I think the reason to escape unicode chars is to make sure you don't allow control chars to be injected
[10:49:32] with this new (but really old, age-wise) finding, we are kinda screwed on our approach
[10:51:22] <_joe_> akosiaris: again, one solution is to add to the httpd image the following script
[10:51:46] <_joe_> #!/bin/bash
[10:52:06] <_joe_> echo -e "$@" | logger $LOGGER_OPTS -
[10:52:16] lol
[10:52:18] <_joe_> right now we pipe directly to logger
[10:52:33] <_joe_> but yeah, I'd like a somewhat better one tbh
[10:52:53] <_joe_> maybe we can also sed \x to \\x
[10:52:56] what makes you believe that would work?
[10:53:02] <_joe_> echo -e ?
[10:53:15] <_joe_> it will convert the \x escapes back into utf8
[10:53:20] yup
[10:53:32] <_joe_> which is valid json, at least according to parsers
[10:53:32] -e only does interpretation of backslash sequences
[10:53:52] but here we kinda have utf-8 that has been decoded to bytes
[10:53:58] s/kinda// actually
[10:54:13] <_joe_> yeah, and echo -e does handle that correctly
[10:55:46] hmm, indeed it does, I didn't expect that
[10:59:06] <_joe_> but I mean, there is some value I guess in the security concerns the apache devs had
[10:59:19] <_joe_> non-printable chars should be filtered
[11:00:31] yeah, bypassing/undoing/disabling all that should not be taken lightly. It is possible btw to disable it, it's a build-time flag. Not keen at all to try that out in our case.
[11:00:46] <_joe_> we could also use printf(1)
[11:00:54] <_joe_> :D
[11:01:27] <_joe_> In terms of cleanliness, solutions would be:
[11:01:38] <_joe_> 1 - write a proper apache module to log to json
[11:01:40] that being said, their concerns might be rooted in assumptions made in the 1990s
[11:01:47] <_joe_> yeah
[11:01:53] e.g. always viewing logs via a terminal
[11:02:03] <_joe_> which we will do less and less of
[11:04:10] <_joe_> 2 - stop using mmjsonparse in rsyslog, write a ruleset and use mmnormalize, then use field format "jsonf" in your templates. That will make rsyslog escape the fields correctly
[11:04:58] <_joe_> 3 - use printf | logger in the apache log
[13:37:05] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) I also see that the increase is strongly correlated to the change in http 302 responses --- https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-re...
[13:42:22] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) So, this is just a change in request profile ... so, very likely a lot more wt2html requests where the oldid is provided vs. just a title .. obviously it is f...
[13:49:07] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) https://logstash.wikimedia.org/app/dashboards#/view/4582cb60-6dc4-11ec-9f01-8f331b96b090?_a=h@3f49d9f&_g=h@d7d2a59 shows it is possible it is all commons. I...
[14:13:56] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) I see that changeprop is calling some specific image files over and over. Specifically, the problem seems to be this upload of a full book as individual images...
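Circling back to the logging discussion above: a minimal sketch of the "unescape, then logger" wrapper (roughly option 3), assuming apache's CustomLog is pointed at a piped program rather than piping straight to logger; the script path is hypothetical and LOGGER_OPTS stands in for whatever flags the httpd image currently passes to logger.

    #!/bin/bash
    # Hypothetical /usr/local/bin/unescape-logger: apache starts this once via
    # CustomLog "|/usr/local/bin/unescape-logger" and writes each JSON access-log
    # line to its stdin. echo -e re-expands the \xhh sequences apache inserted,
    # so string fields carry raw UTF-8 bytes again before reaching syslog/rsyslog.
    while IFS= read -r line; do
        echo -e "$line"
    done | logger $LOGGER_OPTS

Note that this also re-expands sequences like \n back into real control characters, which is exactly the concern the apache developers had, so it is a stopgap next to the cleaner mmnormalize/jsonf or custom-module options listed above.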
[14:18:31] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) It looks to me like this only happens on requests from change-propagation, but there is no re-parsing induced by the parsoid cache warm job. @Daniel is it poss...
[14:38:12] serviceops, MW-on-K8s: Migrate api-gateway to mw-api-int - https://phabricator.wikimedia.org/T334065 (Joe) Open→Resolved
[14:38:25] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[14:39:09] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[15:42:44] serviceops, Data-Platform-SRE, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (JMeybohm) Could you please also add the expected events/s, requests/s towards the kafka-main cluster? @elukey...
[15:49:23] elukey: fyi I messed up how cirrus routes jobs between cirrusSearchLinksUpdate & cirrusSearchLinksUpdatePrioritized, the latter is now getting most jobs instead of the former, will ship a fix asap
[15:51:21] dcausse: ack, thanks!