[07:38:09] main reload is chugging along fine. scholarly reload failed, haven't had time to debug it yet but it failed at the timestamp stage
[07:38:11] https://www.irccloud.com/pastebin/oza1PPOV/
[07:52:18] ryankemper: thanks, will take a look. regarding the kafka topic, I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060049 is required
[07:53:20] if the updater on main starts without this patch I'm afraid that the journal will get polluted with bad data
[07:58:29] can't access wdqs1023...
[08:02:00] looking at some dashboards it appears that it did not even load the dumps
[08:07:06] and it seems like it's pooled somehow and triggering wikidata maxlag
[08:26:39] stopped blazegraph on wdqs1023, I think it's the monitoring queries that make the system believe that this host is live
[08:43:32] stevemunene: o/ I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060065 (this list of user agents is used to detect queries that are coming from us, so that they are not taken into account when deciding whether a host is live or not). Knowing whether a host is live is used to detect if a host serving queries is lagged, which will then instruct wikidata to throttle edits
[08:44:23] in our case wdqs1023 was brought up and started to receive monitoring queries, but since it's not loaded it's considered lagged and thus caused edits on wikidata to be throttled
[08:45:10] I stopped blazegraph there as a quick workaround, but I think it might be best to have this patch applied so that it does not cause problems again
[08:47:52] these "prometheus-$something-sparql-ep-check" UAs are set in (for instance) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046123/16/modules/profile/manifests/query_service/monitor/wikidata_main.pp
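(Aside: a rough Python sketch of the liveness logic described above, not the actual query service updater or monitoring code. The user agent strings, function names, and threshold are invented for illustration; the real UA list lives in the puppet patch linked above.)

```python
# Sketch only -- not the real code. The idea: queries carrying our own
# monitoring user agents must not count as real traffic, otherwise a freshly
# provisioned host like wdqs1023 (no data loaded, only receiving health
# checks) looks live-but-lagged and pushes wikidata into maxlag throttling.

MONITORING_USER_AGENTS = {
    "prometheus-blazegraph-sparql-ep-check",   # placeholder value
    "prometheus-categories-sparql-ep-check",   # placeholder value
}

def host_is_live(recent_queries):
    """A host is 'live' only if it served queries that did not come from us.

    recent_queries: iterable of dicts with a 'user_agent' key, e.g. parsed
    from the query logs of the host being evaluated.
    """
    return any(q["user_agent"] not in MONITORING_USER_AGENTS for q in recent_queries)

def should_throttle_edits(recent_queries, lag_seconds, max_lag_seconds=600):
    """Only a live *and* lagged host should contribute to throttling edits."""
    # max_lag_seconds is an arbitrary value for illustration
    return host_is_live(recent_queries) and lag_seconds > max_lag_seconds
```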
[09:01:41] o/ dcausse having a look
[09:04:19] thanks!
[09:32:46] stevemunene: thanks for the review! I can't +2 myself on puppet, would you be OK merging this patch?
[09:54:23] lunch
[13:13:38] \o
[13:15:24] o/
[13:39:14] * ebernhardson notes that the cirrus doc fetchers don't use the specialized MediaWikiHttpClient... wonder if i did that on purpose
[13:40:28] where is this?
[13:42:08] dcausse: in SUP. the site matrix and max doc id fetchers use the MediaWikiHttpClient class, but the CirrusFetcher RichAsyncFunction uses the CloseableHttpAsyncClient
[13:42:18] i suppose the answer is in the name, we needed async for the fetcher
[13:42:39] ah ok, yes async is completely different
[13:42:58] I suppose that means you need to implement auth twice :P
[13:43:17] i decided to be lazy and provide the header directly, so it shouldn't be too bad
[13:43:23] sure
[13:43:37] there was just too much silliness with credential providers and auth schemes and such to do it the "right" way
[13:44:11] yes... the extension points of these clients are always tricky to get right
[15:16:24] nice flink presentation and q&a dcausse! (thx also ottomata and elsewhere milimetric)
[15:16:57] Search Platform Team, reminder for today - standup updates for 2024-08-06
[17:33:25] dinner
[18:25:40] dcausse: any idea where the `cluster` key of the `blazegraph_lastupdated` prometheus metric gets set? I see blazegraph_lastupdated coming from `modules/query_service/files/monitor/prometheus-blazegraph-exporter.py` but I don't see where the cluster comes from
[18:26:37] ah i can prob pair with ebernhardson on that during pairing
[18:28:13] hmm, yea we can poke around
[18:29:09] ryankemper: at first guess, with it not appearing in the exporter i would assume it's in the definition that tells prometheus where to scrape from
[18:31:59] yea, there is a related prometheus::resource_config() definition in modules/profile/manifests/prometheus/ops.pp. resource_config always injects a cluster label
[18:33:03] ebernhardson: will be at pairing in ~3 mins
[18:53:12] ryankemper: no clue... it's not set by the exporter itself, so probably configured somewhere on the prometheus nodes polling this data
[18:54:28] dcausse: think erik and i just figured it out, it's ultimately from the `clusters` hieradata variable in `hieradata/role/common/wdqs/[main,scholarly].yaml`
[18:54:41] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060158
[19:03:21] ebernhardson: https://puppet-compiler.wmflabs.org/output/1060158/4134/wdqs1021.eqiad.wmnet/index.html only thing i'm not sure of is the envoy cluster changes. probably fine though
[19:26:17] ryankemper: https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=1722963271049&to=1722965786449
[20:16:06] spark always being silly... df.explain() doesn't return anything, it just prints from somewhere :P
[20:17:04] but it does at least show that the spark optimizer decided to lift the udf evaluation up before the repartition call... not having luck yet convincing it to not do that
[20:25:31] found why it does it twice in the explain as well: isnotnull(UDF(query#178))
[20:28:23] so it's not really that it lifted the udf evaluation to before the repartition; instead it "optimized" by lifting the is-not-null check to before the repartitioning, which causes it to evaluate the UDF twice
[20:43:23] for laughs, plugged a related question into gemini. It offered terrible suggestions like caching the udf results, or dropping the query_info column before the final repartitioning :P
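(Aside: a small, self-contained PySpark sketch of the shape being discussed above. The column names `query` and `query_info` match the log; the UDF `parse_query` and the toy data are invented, and the real job may well use a Scala/Java UDF given the `UDF(query#178)` rendering. The point is to show where to look in the printed plan: whether the isnotnull(...) filter, written after the repartition, gets pushed below the Exchange and therefore carries a second evaluation of the UDF.)

```python
# Hypothetical sketch of a UDF-derived column, a repartition, then a filter
# on the UDF result. Inspect the printed plan to see whether the optimizer
# moved the isnotnull filter below the Exchange (repartition), duplicating
# the UDF expression in the process.

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.master("local[2]").appName("udf-pushdown-demo").getOrCreate()

@F.udf(returnType=T.StringType())
def parse_query(query):
    # stand-in for the real, presumably expensive, query-parsing UDF
    return query.strip() or None

df = (
    spark.range(1000)
    .withColumn("query", F.concat(F.lit("q "), F.col("id").cast("string")))
    .withColumn("query_info", parse_query(F.col("query")))
    .repartition(10)
    .where(F.col("query_info").isNotNull())   # written after the repartition...
)

# ...but the optimizer is free to move it. explain() prints the plan to stdout
# (it returns None), which is where the isnotnull(UDF(query#...)) fragment
# showed up in the log.
df.explain(True)
```

One commonly mentioned way to stop Spark from pushing or duplicating a UDF like this is to mark it non-deterministic (e.g. `parse_query.asNondeterministic()` in PySpark, or `.asNondeterministic()` on a Scala UDF), since the optimizer won't move non-deterministic expressions; the log doesn't say whether that was tried here, and it does disable other optimizations.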