[02:06:58] serviceops, Observability-Logging, WMF-General-or-Unknown, Performance-Team (Radar): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (Tgr)
[06:41:43] serviceops, Data-Platform-SRE, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) Can someone please elaborate on what the "generic content store" would need to look like for this project...
[07:08:59] elukey: buon giorno! I've added you to https://phabricator.wikimedia.org/T341625 as a subscriber. Serviceops has been asked whether the new cirrus update events can make it to kafka-main (eventually replacing the old system that also uses kafka-main) and there is some reluctance on our side given the ongoing work on kafka-main.
[07:08:59] Your input would be much appreciated.
[07:10:17] akosiaris: o/ kalimera! Yep yep, already discussing it with Janis. I think we need to do the rebalances first, and then we can decide.. today I should finish main-codfw, then I'll do eqiad; ETA ~a week
[07:14:58] ok, thanks
[08:33:22] hi folks, the avg latency for parsoid get/200 alert has been firing ~constantly since the weekend, is it known?
[08:41:34] hmm https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&from=now-7d&to=now&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=9
[08:41:57] what happened on Saturday at 12pm?
[08:43:32] excellent question, and it looks like parsoid basically never recovered
[08:47:29] the interesting thing is that there isn't a change in requests
[08:48:00] just a change in 302 responses per https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=1689413672923&to=1689437553915&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=22
[08:50:10] indeed, and a corresponding 100k get/s to mcrouter, I'm thinking maybe more expensive requests
[08:50:19] the 302s are a red herring
[08:50:31] zooming out shows it was just a spike: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-7d&to=now&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=22
[08:50:47] yeah, I noticed the get/s to mcrouter too, huge increase there as well
[08:51:33] ok, filing a task in the meantime
[08:55:19] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (fgiunchedi)
[08:55:19] https://phabricator.wikimedia.org/T342085
[08:55:59] ruling out some specific parsoid server being the culprit too, they all have similar patterns
[08:57:30] this is starting to stink of a template change
[08:59:04] ah yeah, that would explain it indeed
[09:00:14] and yet I can't find the usual consequences of that, e.g. increased rate of jobs etc
[09:02:15] and nothing of note in SAL, as expected given it was a Saturday
[09:10:30] restarts on parse1001 did nothing
[09:12:10] ok, the first kinda clear sign: https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPanel=27
[09:12:20] indeed, transclusions
[09:21:32] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (akosiaris) This is apparently due to transclusions per https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPane...
[09:23:40] nice find
[09:24:04] the concerning thing is that increase at the end of the graph
[09:24:09] akosiaris: do we wait for the transclusions to be done in this case or is there something else to do?
[09:24:13] heh, was about to ask about that too
[09:24:13] apparently the backlog is increasing
[09:24:39] but yes, generally we just wait for the transclusions to be done
[09:24:47] not much we can do anyway
[09:25:07] still, I am not sure I have a causal link here
[09:25:30] I mean, I'd expect a lot more requests to parsoid due to transclusion changes
[09:25:43] but what we witness is increased latencies without increased traffic
[09:26:23] fair enough, (spitballing here, please bear with me) could we bring parsoid codfw online to help? afaics it is idle in codfw
[09:28:05] it's active/passive as it also serves POSTs, not just GETs
[09:28:31] so, no, I don't think so, given our current infra
[09:28:50] got it, thank you
[09:29:11] :x
[09:29:17] oopsie - sorry
[09:29:52] that's okay, :x to you too jayme
[09:30:01] ❤️
[09:33:22] this is mostly zh.wikipedia.org and mobileapps
[09:33:35] in a weird way that I don't yet comprehend
[09:47:52] <_joe_> akosiaris: I think the zh.wikipedia.org stuff is usually a red herring
[09:48:13] yeah, quite possibly
[09:48:49] the biggest problem in all of this is... once more, restbase
[09:49:04] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&viewPanel=9&from=now-7d&to=now
[09:49:24] <_joe_> something happened on the 15th at around 12:20
[09:49:34] _joe_: read the backlog :P
[09:49:58] just saw that you go down the same paths I already went down
[09:50:01] so*
[09:50:26] <_joe_> it doesn't fully fit though
[09:50:34] btw, I found one more thing that breaks with json access logs
[09:50:40] <_joe_> have you checked deployments around those times?
[09:50:43] <_joe_> akosiaris: sigh
[09:50:51] and this one is way bigger than user-agent
[09:50:54] it's page titles
[09:50:55] :-(
[09:50:59] <_joe_> oh gosh
[09:51:07] <_joe_> they get escaped by apache too, do they
[09:51:22] yes...
[09:51:35] <_joe_> ok, I *guess* we'll have to somehow unescape those logs before passing them to rsyslog
[09:51:51] yeah, this just became more important
[09:51:52] dammit
[09:52:23] <_joe_> can you comment on the task?
[09:52:24] <_joe_> btw
[09:52:37] <_joe_> https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=now-7d&to=now&viewPane&viewPanel=2 seems to show this is *not* related to transcludes
[09:53:05] <_joe_> we had a bigger spike of transclude jobs on the 11th but latency wasn't that bad
[09:54:24] <_joe_> the transclude jobs match pretty well with the GET requests https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m&var-site=All&var-cluster=parsoid&var-method=GET&var-code=200&var-php_version=All&from=now-7d&to=now&viewPanel=17
[09:54:36] <_joe_> so I think there's something else going on
[09:57:51] the json thing is pretty weird btw
[09:57:56] we log the page titles with
[09:58:04] http://%{Host}i%U%q
[09:58:11] but
[09:58:26] For security reasons, starting with version 2.0.46, non-printable and other special characters in %r, %i and %o are escaped using \xhh sequences, where hh stands for the hexadecimal representation of the raw byte
[09:58:38] so... why is something like this happening?
[09:58:56] "Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/\xe4\xb8\xa4\xe6\xb9\x96\xe4\xb9\xa6\xe9\x99\xa2" ?
[10:01:19] <_joe_> elukey: ^^
[10:01:21] <_joe_> :D
[10:03:03] the correct URL btw is https://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/%E4%B8%A4%E6%B9%96%E4%B9%A6%E9%99%A2
[10:03:59] anyway, putting this in the task and we'll circle back to it later
[10:06:52] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (akosiaris) Just found something more important than User-Agent unfortunately ` Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pag...
[10:08:55] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) I don't think transclusions are the actual cause - we had a very high rate of transclusions on the 11th when the latencies didn't explode like this. Looking at...
[10:14:08] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) Interestingly, this doesn't seem to have affected the prewarm job at all: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=eqia...
[10:33:28] _joe_ sorry I was away for a doc apt, what's up?
[10:35:43] <_joe_> elukey: nothing, more apache logging woes
[10:35:49] ahahahha okok
[10:35:58] I thought I had messed something up
[10:37:45] <_joe_> akosiaris: maybe we could do something like apache | echo -e | logger
[10:39:04] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (akosiaris) A search says: https://svn.apache.org/viewvc/httpd/httpd/trunk/modules/loggers/mod_log_config.c?r1=98912&r2=98911&pathrev=98912 with the message saying ` Fo...
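A minimal sketch of why those lines choke the pipeline, assuming jq is available as a strict JSON parser and using a shortened one-field stand-in for the real access-log record: JSON strings only define \uXXXX escapes, so apache's \xhh sequences get rejected outright, while the same sequences expanded back into raw UTF-8 bytes parse fine.

    # one-field stand-in for the JSON access-log line, as apache emits it
    line='{"Url": "http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/\xe4\xb8\xa4\xe6\xb9\x96\xe4\xb9\xa6\xe9\x99\xa2"}'
    # plain echo keeps the backslashes literal; \x is not a legal JSON escape, so jq rejects the line
    echo "$line" | jq -r .Url || echo "invalid json"
    # echo -e expands the \xhh sequences back into raw UTF-8 bytes, which jq parses cleanly
    echo -e "$line" | jq -r .Url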
[10:39:23] <_joe_> but I'd rather make apache not escape the logs, given we're not persisting them to file
[10:41:11] <_joe_> I think the reason to escape unicode chars is to make sure you don't allow control chars to be injected
[10:49:32] with this new (but really old, age-wise) finding, we are kinda screwed on our approach
[10:51:22] <_joe_> akosiaris: again, one solution is to add to the httpd image the following script
[10:51:46] <_joe_> #!/bin/bash
[10:52:06] <_joe_> echo -e "$@" | logger $LOGGER_OPTS -
[10:52:16] lol
[10:52:18] <_joe_> right now we pipe directly to logger
[10:52:33] <_joe_> but yeah, I'd like a somewhat better one tbh
[10:52:53] <_joe_> maybe we can also sed \x to \\x
[10:52:56] what makes you believe that would work?
[10:53:02] <_joe_> echo -e ?
[10:53:15] <_joe_> it will convert the \x escapes back into utf8
[10:53:20] yup
[10:53:32] <_joe_> which is valid json, at least according to parsers
[10:53:32] -e only does interpretation of backslash sequences
[10:53:52] but here we kinda have utf-8 that has been decoded to bytes
[10:53:58] s/kinda// actually
[10:54:13] <_joe_> yeah, and echo -e does handle that correctly
[10:55:46] hmm, indeed it does, I didn't expect that
[10:59:06] <_joe_> but I mean, there is some value I guess in the security concerns the apache devs had
[10:59:19] <_joe_> non-printable chars should be filtered
[11:00:31] yeah, bypassing/undoing/disabling all that should not be taken lightly. It is possible btw to disable it, it's a build-time flag. Not keen at all to try that out in our case.
[11:00:46] <_joe_> we could also use printf(1)
[11:00:54] <_joe_> :D
[11:01:27] <_joe_> In terms of cleanliness, solutions would be:
[11:01:38] <_joe_> 1 - write a proper apache module to log to json
[11:01:40] that being said, their concerns might be rooted in assumptions made in the 1990s
[11:01:47] <_joe_> yeah
[11:01:53] e.g. always viewing logs via a terminal
[11:02:03] <_joe_> which we will do less and less of
[11:04:10] <_joe_> 2 - stop using mmjsonparse in rsyslog, write a ruleset and use mmnormalize, then use field format "jsonf" in your templates. That will make rsyslog escape the fields correctly
[11:04:58] <_joe_> 3 - use printf | logger in the apache log
[13:37:05] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) I also see that the increase is strongly correlated to the change in http 302 responses --- https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-re...
[13:42:22] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) So, this is just a change in request profile ... so, very likely a lot more wt2html requests where the oldid is provided vs. just a title .. obviously it is f...
[13:49:07] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (ssastry) https://logstash.wikimedia.org/app/dashboards#/view/4582cb60-6dc4-11ec-9f01-8f331b96b090?_a=h@3f49d9f&_g=h@d7d2a59 shows it is possible it is all commons. I...
[14:13:56] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) I see that changeprop is calling some specific image files over and over. Specifically, the problem seems to be this upload of a full book as individual images...
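Circling back to the logging discussion above: a minimal sketch of the "unescape, then logger" wrapper (roughly option 3), assuming apache's CustomLog is pointed at a piped program rather than piping straight to logger; the script path is hypothetical and LOGGER_OPTS stands in for whatever flags the httpd image currently passes to logger.

    #!/bin/bash
    # Hypothetical /usr/local/bin/unescape-logger: apache starts this once via
    # CustomLog "|/usr/local/bin/unescape-logger" and writes each JSON access-log
    # line to its stdin. echo -e re-expands the \xhh sequences apache inserted,
    # so string fields carry raw UTF-8 bytes again before reaching syslog/rsyslog.
    while IFS= read -r line; do
        echo -e "$line"
    done | logger $LOGGER_OPTS

Note that this also re-expands sequences like \n back into real control characters, which is exactly the concern the apache developers had, so it is a stopgap next to the cleaner mmnormalize/jsonf or custom-module options listed above.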
[14:18:31] serviceops, Parsoid: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 (Joe) It looks to me like this only happens on requests from change-propagation, but there is no re-parsing induced by the parsoid cache warm job. @Daniel is it poss...
[14:38:12] serviceops, MW-on-K8s: Migrate api-gateway to mw-api-int - https://phabricator.wikimedia.org/T334065 (Joe) Open→Resolved
[14:38:25] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[14:39:09] serviceops, MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[15:42:44] serviceops, Data-Platform-SRE, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (JMeybohm) Could you please also add the expected events/s, requests/s towards the kafka-main cluster? @elukey...
[15:49:23] elukey: fyi I messed up how cirrus routes jobs between cirrusSearchLinksUpdate & cirrusSearchLinksUpdatePrioritized, the latter is now getting most jobs instead of the former, will ship a fix asap
[15:51:21] dcausse: ack, thanks!