[07:59:36] in the end I'm tempted to add a way to load the subgraph defs from a file, mainly to add a quick split definition for test.wikidata.org that we use in k8s@staging
[09:59:53] lunch
[10:06:29] lunch
[10:07:53] ryankemper: after discussion in #wikimedia-serviceops (~8:30 UTC), we're good to go without LVS. I'm closing T368972
[10:07:54] T368972: Use Envoy instead of LVS to route internal federation traffic for WDQS - https://phabricator.wikimedia.org/T368972
[12:40:55] o/
[12:49:18] quick CR to test the envoy TLS terminator in relforge if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052819
[13:00:00] o/
[13:12:39] dcausse we talked about airflow logs showing up in the WebUI last week... can you send me a screenshot of what that looks like? Going to try and track down where those logs are stored
[14:09:39] \o
[14:11:25] inflatador: the logs should be in /srv/airflow-search/logs/
[14:17:54] inflatador: https://people.wikimedia.org/~dcausse/airflow_logs_screenshot.png this is where I get debug info from, esp. the yarn application_id to fetch more info
[14:18:58] if they're going away when moving to k8s I guess that might be an issue for usability
[14:19:38] ahh, yea that would be a problem. iirc airflow has remote logging support, although i dunno what it involves
[14:20:07] according to the docs, apparently it can write directly to elasticsearch. maybe another use case for the shared cluster
[14:20:19] write and read
[14:20:34] oh if it can read from there that's nice
[14:21:24] actually on closer read, what it accepts is a url template
[14:21:35] so it links to the external logs
[14:22:38] re: https://airflow.apache.org/docs/apache-airflow-providers-elasticsearch/stable/logging/index.html#
[14:25:07] back
[14:26:51] dcausse ah, thanks for that. Hopefully those are written to the PostgresDB but I doubt it
[14:28:32] ebernhardson my reading is that Airflow can both write to ES and link to Kibana?
[14:29:52] inflatador: i think that's what the page is saying, yes
[14:30:28] inflatador: but that doesn't really fit our generic logging setup. what we probably want, to stay similar to everything else, is to write to syslog and link to kibana (assuming k8s syslog auto-wires to the standard wmf logging)
[14:31:00] i dunno, but i doubt observability wants us writing directly to elastic clusters and skipping the pipelines
[14:31:28] or use a separate elasticsearch from logging, possible but extra work over time i imagine
[14:33:00] ebernhardson ACK, I guess the airflow logging today passes through the standard pipeline? Sorry, I need to look a bit closer at our current airflow
[14:33:10] inflatador: airflow today writes to the local disk
[14:33:39] i suppose we never worried about it too much since the disks stay around for a while, and we generally only need the last week or maybe two at most of logs
[14:34:07] ah, I see it now, `/srv/airflow-search/logs`
[15:30:24] could this go to a ceph volume if we can't have elastic?
[15:30:37] yes, I think so
[15:47:32] hmm, reading tickets it sounds like i should be looking into why that cache doesn't work like it's expected to
[15:47:36] parser cache
[15:48:49] ebernhardson: you mean the memcached key group CirrusSearchParserOutputPageProperties?
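A side note on the airflow remote-logging option discussed above (14:19-14:31): below is a minimal sketch, in env-var form, of the settings the linked elasticsearch provider docs describe. The hostnames and the kibana url are made-up placeholders, not anything we actually run, and whether this fits the standard wmf logging pipeline is exactly the open question raised above.

    # sketch only: airflow elasticsearch log handler config (env-var form),
    # per the provider docs linked at 14:22; hosts and urls are placeholders
    export AIRFLOW__LOGGING__REMOTE_LOGGING=True
    # write task logs as json to stdout; something else still has to ship them
    # into elasticsearch, airflow does not index them itself
    export AIRFLOW__ELASTICSEARCH__WRITE_STDOUT=True
    export AIRFLOW__ELASTICSEARCH__JSON_FORMAT=True
    # where the webserver reads task logs back from
    export AIRFLOW__ELASTICSEARCH__HOST=elasticsearch.example.org:9200
    # "frontend" is the url template mentioned at 14:21: the webui links out
    # to it (e.g. a kibana discover url) instead of rendering the logs itself
    export AIRFLOW__ELASTICSEARCH__FRONTEND='https://kibana.example.org/app/discover#/?_a=(query:(language:kuery,query:%27log_id:"{log_id}"%27))'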
[15:49:24] dcausse: the bit ladsgroup is concerned about
[15:49:42] dcausse: yea, that's the one
[15:50:47] yes makes sense, quickly looked at the code and saw nothing obvious :/ wondering if it's keyed with getTouched() or something we overlooked in WANObjectCache
[15:53:14] indeed the code looks quite straightforward, there isn't too much for us to get wrong
[15:55:08] touched does seem like the most likely culprit, maybe we have logs of key requests somewhere to poke through
[16:02:15] workout, back in ~40
[16:18:20] * ebernhardson notes that WANObjectCache has quite a bit more functionality than i would have guessed
[16:21:41] * ebernhardson has to pagedown 11 times to cover the docblock of the getWithSetCallback function :P
[16:24:01] :)
[16:53:24] dinner
[16:56:15] sorry, been back
[17:26:59] ryankemper fwiw, it does look like we're going to have to do something like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/tlsproxy.pp if we want a separate envoy instance per port (which we do)
[17:27:40] lunch, back in time for pairing
[18:22:32] back
[20:27:21] hmm, so consumer-search and consumer-cloudelastic in eqiad and codfw are 100% backpressured.
[20:28:59] and at least consumer-search-eqiad is backpressured in the elasticsearch writer
[20:29:09] should be post-fetch
[20:35:59] more document_missing_exceptions than i would expect
[20:36:40] 20:34:00-20:35:00 shows ~280 document_missing_exception
[20:36:57] but i wouldn't expect those to cause backpressure :S
[20:39:14] are you finding that in logstash?
[20:39:57] inflatador: from kubectl logs, i haven't found a nice way to see all this in logstash
[20:40:32] i wrote a little bash script i use: https://phabricator.wikimedia.org/P62389
[20:41:02] so in this case the invocation (to see the taskmanager logs and not the default jobmanager logs) is: ssh deployment.eqiad.wmnet /usr/bin/env cluster=eqiad release=consumer-search pod=flink-app-consumer-search-taskmanager-1-2 ~/bin/klog -f | ~/.cargo/bin/fblog
[20:41:48] it looks like the backpressure is fixing itself, but the document_missing_exceptions are a bit concerning (though probably unrelated)
[20:42:27] something caused it to take in far fewer records from 20:15-20:25, then from 20:25-20:40 (now) it's been catching up
[20:43:09] lunch
[20:48:18] maybe a bunch of large rerenders? unsure. There was a spike of rerenders around 20:15, then elastic sink req duration went from ~1s to ~6s for the time it was having problems
[21:25:56] * ebernhardson mutters at the analysis chains used in logstash... in `Memcached error for key "foo:bar:baz:bang"` you can't search for foo and find it, because the analysis doesn't break it up like that :S
[21:30:12] (╯°□°)╯︵ ┻━┻
[21:31:16] i suspect that's part of how they manage such impressive ingestion numbers in logstash pipelines, by turning off the bits that are expensive but make it great :P
[21:37:21] at least we have all the raw logs on mwlog1002, can grep through them and at least find we aren't having (obvious) problems with memcached refusing to store values
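To make the 20:41 invocation above easier to follow, here is a hypothetical sketch of what a klog-style wrapper could look like. This is not the actual content of P62389; the kube-env helper path, the flink-app-<release> namespace naming, and the argument handling are all assumptions for illustration.

    #!/usr/bin/env bash
    # hypothetical sketch only, not the real P62389. expects cluster/release/pod
    # in the environment (as in the 20:41 invocation) and passes any remaining
    # args (e.g. -f) straight through to kubectl logs.
    set -euo pipefail
    : "${cluster:?set cluster= (e.g. eqiad)}"
    : "${release:?set release= (e.g. consumer-search)}"
    : "${pod:?set pod= (e.g. a taskmanager pod name)}"
    # assumption: the wmf kube_env helper selects the kubeconfig for a
    # flink-app-<release> namespace on the given cluster
    source /etc/profile.d/kube-env.sh
    kube_env "flink-app-${release}" "${cluster}"
    exec kubectl logs "$@" "${pod}"

Piping the output through fblog, as in the invocation above, just pretty-prints the json log lines.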