[08:48:56] Welcome back everyone! I'll be catching up on email and slack. As always, ping me if there is something urgent
[09:01:56] welcome back!
[09:09:53] Welcome back!
[09:16:29] Enabling kafka topic compaction + deletion for page_rerender reduces the topic size by ~60% (53GB on main vs 21GB on jumbo) - I guess we should make the push for enabling this on main.
[09:24:10] pfischer: nice! I've seen a couple of alerts, so some tools might need some adaptations (goblin?), but it's definitely worth a task
[09:27:53] re space usage in main vs jumbo: jumbo is generally populated by mirror-maker, so it benefits from better batching, compression generally works better, and kafka partitions are generally smaller
[09:33:20] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-kafka_broker=All&var-topic=codfw.mediawiki.page_change.v1&var-topic=codfw.mediawiki.cirrussearch.page_rerender.v1&from=now-7d&to=now
[09:33:22] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.page_change.v1&var-topic=codfw.mediawiki.cirrussearch.page_rerender.v1&from=now-7d&to=now
[09:34:19] page_rerender becomes smaller in jumbo than page_state, which does not have compaction enabled - seems like a pretty big win
[10:18:57] So batching contradicts compression? I thought it was the other way round: if batching, the client can better compress records (compress batches, to be precise).
[10:23:35] batching helps compression; jumbo topics are generally smaller than their kafka-main source topics because jumbo is populated by mirror-maker, which does a good job at batching
[10:24:10] Oh, then I just got that wrong. That makes sense.
[10:25:10] page_change drops from ~44G in main down to ~25G in jumbo (I think only due to better batching and compression)
[10:26:01] BTW: I looked at the envoy metrics for the last backfill on Friday, and he’d like to see if allowing more CPU for envoy (1 instead of 500m) helps
[10:26:22] …with Janis…
[10:27:56] yes I agree, seems to me that envoy is still struggling a bit; looking at logs I saw a bunch of "UF,UO" (upstream failures, upstream overflow) and thus I wonder if it's causing unnecessary retries
[10:29:13] We also hit a threshold for busy php-fpm workers behind mw-api-int-ro (at around 1.5k req/s)
[10:29:59] For the sake of comparable results I would still go with the same max. parallel requests (600)
[10:30:01] interesting... we should perhaps lower the capacity a bit?
[10:30:06] ok
[11:51:33] lunch
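For context on the 09:16 compaction discussion: operationally the change comes down to setting cleanup.policy=compact,delete on the topic. Below is a minimal sketch using kafka-python's admin client, not the actual production change; the broker address and retention value are placeholder assumptions (the topic name is the one from the Grafana links above).

```python
# Sketch only: enable compaction + time-based deletion on a Kafka topic.
# Broker address and retention.ms are placeholders, not production values.
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topic_config = ConfigResource(
    ConfigResourceType.TOPIC,
    "codfw.mediawiki.cirrussearch.page_rerender.v1",
    configs={
        # compact keeps only the latest record per key,
        # delete still expires old segments past retention.ms
        "cleanup.policy": "compact,delete",
        "retention.ms": str(7 * 24 * 3600 * 1000),
    },
)

admin.alter_configs(config_resources=[topic_config])
admin.close()
```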
[14:42:49] trying to access wdqs102[2-4] from hadoop but getting connection refused; other hosts do seem accessible, and I'm not finding anything obvious in puppet to open hadoop <-> wdqs102[2-4]
[14:58:10] Hm, did any of those services move recently? At least Erik encountered DNS resolution issues on Friday.
[15:02:02] no clue :/
[15:40:07] ah, it's because the tls termination is not up, but looks like I can access it via plain http
[15:44:00] How’s that possible? I’d expect a reverse proxy in front of wdqs hosts that takes care of *both* TLS termination + routing to the upstream service.
[15:45:50] there is a local nginx doing SSL termination on each of the wdqs hosts
[15:46:49] but not a single reverse proxy doing the routing
[15:47:42] actually seems like envoy is now doing the tls termination: net -> envoy -> nginx -> blazegraph
[15:48:30] and envoy seems down on these 3 test hosts
[15:48:59] quick errand
[16:49:58] :eyes on the wdqs hosts... could have something to do with them being associated as "test" hosts as opposed to internal or prod
[17:08:12] o/
[17:08:50] inflatador: thanks! but since I can use plain http that's more than enough for me
[17:29:06] FYI, we might be missing some prometheus k8s metrics due to an ongoing OOM issue, see https://phabricator.wikimedia.org/T354399 for more details
[17:33:05] Q: What, if anything, does the `useragent` filter do at https://gerrit.wikimedia.org/g/operations/puppet/+/ea0266624be030aa0a874a7c391dc2e5531c7c78/modules/profile/files/apifeatureusage/filters/50-filter-apifeatureusage.conf#11
[17:33:26] It appears, from querying the index, that the `agent` field is an unmodified user-agent string, and I see no other keys in the document
[17:35:04] e.g. https://phabricator.wikimedia.org/P54550
[17:35:25] maybe the `prune` action is rendering it a no-op?
[17:35:39] Also, MW metrics are migrating to Prometheus, more details in https://phabricator.wikimedia.org/T350591
[17:39:43] Krinkle: hmm, i don't remember anything related to that. Based on a quick look at the docs your premise seems reasonable: useragent is adding fields and prune is taking them back out. Not sure why
[17:41:27] seems to have been there, in varying forms, since brad added it in 2014
[17:45:18] ebernhardson: thx, that gives me some confidence that I'm starting to understand a bit how elastic and logstash work under the hood
[17:45:31] would it also be correct to say we don't store the clientip field in elastic right now?
[17:45:53] or is there something I can put in the query to show additional hidden fields/columns?
[17:47:01] Krinkle: if you aren't using any source filters on the elastic query then it should be returning the complete source doc; in elasticsearch there is a hidden source field which is simply a compressed json blob, and that should be what you are getting
[17:48:57] if we sent arbitrary unknown fields to elasticsearch it would still retain them in the source doc
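To make the "complete source doc" point concrete: a quick sketch that pulls a few hits without any _source filtering and lists the stored keys, so a field like clientip either shows up or demonstrably isn't stored. The host and index name here are placeholder assumptions, not the real apifeatureusage endpoint.

```python
# Sketch: list which fields are stored in the _source of a few documents.
# Elasticsearch host and index name are placeholders.
import requests

ES = "http://localhost:9200"
INDEX = "apifeatureusage-2024.01.08"  # hypothetical daily index name

resp = requests.post(
    f"{ES}/{INDEX}/_search",
    json={"size": 5, "query": {"match_all": {}}},  # no _source filter: full doc returned
    timeout=10,
)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    # _source is everything that was sent for this doc; anything not listed
    # here (e.g. clientip) simply was never stored.
    print(sorted(hit["_source"].keys()))
```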
[18:17:09] low priority MR to fix a puppet error. The PCC failure is a false positive: https://gerrit.wikimedia.org/r/c/operations/puppet/+/988682/
[18:36:49] thanks dcausse !
[18:57:43] lunch, back in ~30
[19:09:57] * ebernhardson wonders why /w/rest.php/eventbus/v0/internal/job/execute was created a few years ago but the migration seems to have stalled out; endpoint not available on old or new jobrunner clusters (see T246389)
[19:26:19] unclear indeed :/
[19:26:59] it also doesn't work, at least in my dev environment. Although maybe i just have it not fully configured: it tries to pass null to a place that accepts a Job instance
[19:28:40] :/
[19:29:19] and the /rpc script is not in MW core?
[19:29:38] correct, i basically copied it out of mw-config and into my mediawiki directory to test things. Works there
[19:30:08] i wonder about the mediawiki signature though, at least the rest endpoint validates a signature with a secret key before executing
[19:30:28] a bit of verification that jobs weren't injected i suppose
[19:35:48] hm of course... "You have been banned until 2024-01-09T19:15:34.264Z" running sparql queries to a single wdqs node from hadoop...
[19:36:28] ohh, it's just me being silly. And perhaps something not being careful enough. If i don't set the content-type header it fails with a null job. If i set an application/json content-type header it fails signature validation
[19:37:00] i feel a bit dubious having to duplicate the signature over to SUP
[19:37:37] it's to hash the job params?
[19:38:20] s/hash/sign I mean
[19:38:36] dcausse: it uses EventBus::serializeEvents to serialize the event minus the mediawiki_signature, and then does a sha1 hmac with the secret key
[19:38:55] ottomata ^^ any insight on why dcausse is getting banned from the hadoop nodes?
[19:39:21] inflatador: it's actually the wdqs machine banning me :)
[19:39:32] oops, nvm ;(
[19:39:50] I need to bypass this throttling mechanism somehow
[19:40:10] ebernhardson: using /w/rest.php/eventbus/v0/internal/job/execute would mean we could use any mw cluster?
[19:40:20] maybe we can patch our nginx config to cut around it
[19:40:50] yes, I'll take a look tomorrow, there must be some config to tune this
[19:41:05] sounds good
[19:41:50] dcausse: well, any cluster that allows that endpoint. I haven't looked closely enough, but it 404's on the public endpoints. It also 404's on the job runners, but that's because the apache config only allows two urls and 404's the rest. That part can be changed pretty easily
[19:43:42] oh ok, it will still need some adaptations to our infra... but at least being in EventBus the code is more easily runnable/testable than copying mw-config to a local dev env
[19:44:04] yea, it seems like a better approach. Still a bit hacky, but at least it's not copying random files around
[19:44:42] dinner
[19:45:04] looks like in addition to whatever is currently blocking the rest endpoint in prod, there is also a config flag checked when actually executing the endpoint that is only set to true on async clusters
[19:45:24] (assuming they are using async as a proxy for internal clusters)
[19:57:11] appointment, back in ~90
[20:51:03] Having some car trouble I gotta try to take care of, so I'll be out of pocket for a while.
[21:03:53] hmm, one annoyance of this rest api is that it uses php output buffering to capture and then throw away anything printed by the job. Commented as 'Clear all errors that might have been displayed if display_errors=On'
[21:04:21] we can force it to flush anyways, but that might cause other oddities.
[21:21:34] car things went so badly that I'm back already, lol
[21:26:59] oof
[21:29:52] back
[21:48:44] * ebernhardson is struggling to find a reasonable way to extract out the difference between mw api and job based document building... maybe drop the mw api bits entirely?
[21:49:03] inside SUP
[22:10:34] * bd808 feels like ebernhardson was trying to get someone to say "what's SUP?" ;)
[22:10:56] bd808: lol
[22:12:15] it's the Search Update Pipeline, btw
[22:13:33] thanks for the acronym expansion, and the unintended dad joke setup
[22:37:44] {◕ ◡ ◕}
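As a footnote to the mediawiki_signature discussion above (19:30/19:38): a rough sketch of the scheme described there, an HMAC-SHA1 over the event serialized without the mediawiki_signature field. The json.dumps serialization here is an assumption; duplicating the check in SUP would require matching EventBus::serializeEvents byte-for-byte, which this only approximates.

```python
# Sketch: compute/verify a mediawiki_signature-style HMAC as described in the chat.
# Caveat: json.dumps is only an approximation of EventBus::serializeEvents;
# the real serialization must match exactly for signatures to agree.
import hashlib
import hmac
import json

def sign_event(event: dict, secret_key: str) -> str:
    # Serialize the event minus the signature field, then HMAC-SHA1 it.
    unsigned = {k: v for k, v in event.items() if k != "mediawiki_signature"}
    serialized = json.dumps(unsigned, separators=(",", ":"), sort_keys=True)
    return hmac.new(secret_key.encode(), serialized.encode(), hashlib.sha1).hexdigest()

def verify_event(event: dict, secret_key: str) -> bool:
    claimed = event.get("mediawiki_signature", "")
    return hmac.compare_digest(sign_event(event, secret_key), claimed)
```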