[07:56:01] ^ I have built that image :)
[07:56:24] as `docker-registry.wikimedia.org/dev/cirrus-elasticsearch:7.10.2-s2`
[09:23:09] hashar: thanks! Did you build it manually (and locally) or do you have some kind of automation for that?
[09:23:49] pfischer: the git repository has a shell script wrapper to trigger a build
[09:23:55] something like `./fab deploy_devimages`
[09:24:23] which really SSHes to contint.wikimedia.org, git pulls the repo there showing the diff, then asks for confirmation to run `docker-pkg`
[09:24:32] which then eventually pushes the resulting image(s) to the registry
[09:27:20] Ah, good to know. Could that be wrapped in a (manual) .gitlab-ci.yaml job?
[09:31:52] dcausse: I bumped the docker image name for the ConsumerApplicationIT so the PR is complete now: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/82 Would you have a minute?
[09:32:59] pfischer: sure
[09:50:08] pfischer: merged
[09:50:38] dcausse: thanks!
[09:50:55] I’ll deploy an update once CI passes.
[10:43:51] dcausse: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982774
[10:44:16] +1
[11:42:09] dcausse: I forgot about the split fetcher, I have to distinguish the metrics, otherwise it’s unclear which HTTP client they are coming from
[11:43:01] pfischer: oh.. I thought that flink would have added the operator name in the metric labels?
[11:47:38] Huh, we might be lucky in that case.
[11:49:26] quickly looking I only see metrics with operator_name="Map", which I assume would be the synchronous client used by the CirrusNamespaceIndexMap operator
[11:51:15] flink_taskmanager_job_task_operator_http_method_authority_path_request_duration_count would be one of the new ones. But you are right, I forgot about the label
[11:52:49] hm for some reason I don't see any metrics from the consumer app
[11:53:07] oh it failed yesterday around 19 utc
[11:53:09] Right k8s cluster? Only see values on codfw
[11:56:07] last failure is https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.50?id=4YJvX4wBRtLP5wy6VVuy
[11:56:12] lunch
[12:00:28] dcausse: That was probably during the update of relforge ES instances
[12:21:22] Hm, seems like the consumer is in some limbo state. It stopped last night, but when I restart the application (using restartNonce) the pod remains untouched (uptime 19h).
[12:29:54] Okay, consumer is running but flooding the logs with warnings of duplicate metrics. 👀
[13:29:05] inflatador: I just launched a docker container from `http://docker-registry.wikimedia.org/dev/cirrus-elasticsearch:7.10.2-s2` and it fails due to conflicting JARs: https://phabricator.wikimedia.org/P54377
[13:30:54] pfischer: oh good point regarding the relforge upgrade yesterday, probably something we should tune in the flink restart strategy so that it survives such operations
[13:33:12] dcausse: definitely, I’ll create a ticket.
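As a side note on that restart-strategy ticket: below is a minimal sketch of what such tuning could look like, assuming Flink's standard `RestartStrategyOptions` and an exponential-delay strategy. The concrete values are illustrative, not the updater's actual configuration.

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestartStrategyOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical sketch: back off exponentially instead of giving up after a fixed number
// of quick restarts, so a short Elasticsearch maintenance window (e.g. the relforge
// plugin upgrade mentioned above) does not leave the job in a terminally failed state.
public final class RestartStrategySketch {
    public static StreamExecutionEnvironment createEnvironment() {
        Configuration conf = new Configuration();
        conf.set(RestartStrategyOptions.RESTART_STRATEGY, "exponential-delay");
        conf.set(RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_INITIAL_BACKOFF, Duration.ofSeconds(1));
        conf.set(RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_MAX_BACKOFF, Duration.ofMinutes(10));
        // build the pipeline on this environment as usual
        return StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```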
[13:35:32] I’ll have to re-release all extra-common dependent plugins to work around the classpath issue mentioned above
[13:36:38] * pfischer wonders how that did not break for wmf9
[13:40:14] the deb is broken
[13:40:42] elastic uses different classloaders per plugin
[13:43:09] pfischer: no need to release other plugins I think
[13:43:49] most probably a stale jar on the repo that was used to build the debian package
[13:43:54] Okay, I couldn’t remember that this was an issue with wmf9
[13:44:13] the repo used to build the deb was probably cleaned up
[13:44:31] the script should perhaps take care of this
[13:44:31] inflatador: mentioned something like that
[13:45:02] https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/982444/2/debian/sha256sums does not have the faulty jar so it must be stale somewhere
[13:46:17] Alright. I rolled back the consumer tempora
[13:46:28] temporarily
[13:46:54] Nonetheless, looks like the new version is capable of higher throughput: https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin=&forceLogin=&forceLogin=true&from=now-3h&orgId=1&to=now&var-k8sds=eqiad%20prometheus%2Fk8s-staging&var-opsds=eqiad%20prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=eqiad&var-app=All&var-operator_name=enrich_page_change_with_revision&var-operator_name=Source:_cirrussearch_update_pipeline_update_rc0_source&refresh=5m
[13:48:04] 10 ops/s for enrich_page_change_with_revision vs. < 1 ops/s
[13:48:55] Let’s wait until inflatador: comes back on, until then I’ll try to fix the duplicate metrics complaints.
[13:49:24] pfischer: I wonder if it's because it's now catching up, if you increase the timerange to 3 days it's roughly similar
[13:50:40] also don't trust the kafka lag on this dashboard, we don't have such metrics on the test kafka cluster
[13:51:18] (which is where the update_stream is stored)
[13:53:09] hm the ops-plugin repo should run the makefile task clean_blobs when downloading new blobs, not sure why that did not work here :/
[13:55:35] inflatador: if you still have logs of the "prepare_build" task you ran on the host that created the deb this would be helpful, something's not cleaned up properly I'm afraid
[14:01:27] I’m surprised that relforge produces valid responses after all: I queried relforge for /_cat/plugins and it showed the expected version, but while the updated consumer was running, I saw only _FAILED responses. Now that the previous version is running, we can see the usual load of NOOPs and a few UPDATED. It appears as if there’s an old version of the extra plugin processing the requests, that is not capable of extracting script.source and hence fails if params.source is missing.
[14:11:08] I thought that Brian did fix the relforge machine by removing the jar manually?
[14:11:52] I don't think elastic would even start with a broken plugin
[14:16:25] we seem to run 50 http rps, for an avg response time of 0.5s, so that's about 25 concurrent requests while configured for 100 concurrent requests, with 15% for rev_based and the rest for rerenders
[14:16:31] o/
[14:16:34] o/
[14:18:16] Looks like I broke the elastic package? ;(
[14:19:40] The package does have multiple jars for extra-common and extra-7.10.2. It will need to be rebuilt
[14:20:44] But the deb is packaging all plugins, right?
[14:21:29] Correct, nothing else should have changed though
[14:21:48] So it may have multiple versions of that JAR if there are plugins from different releases.
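As an aside, the concurrency figure quoted at 14:16:25 follows directly from Little's law, using only the numbers stated in that message:

```latex
L = \lambda \cdot W \approx 50~\text{req/s} \times 0.5~\text{s} \approx 25~\text{concurrent requests}
```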
[14:22:04] Is it possible to diff two debs?
[14:22:38] Would be interesting to see what changed from wmf9 to wmf10
[14:22:59] Possibly, but I don't know how. I can definitely rebuild the plugin though
[14:23:21] inflatador: do you run prepare_build locally and then upload to the server that creates the package?
[14:23:32] Y
[14:23:37] so that might explain
[14:23:59] if the folder on the remote server is not cleaned up
[14:24:25] That's what I'm thinking too
[14:24:40] Sounds like there are still issues even after I manually deleted the duplicate jar?
[14:25:29] dcausse: regarding the 50 req/s: this may be explained by how the connection pool is initialised: with maxTotal > 0 ? maxTotal : 50
[14:25:34] inflatador: you mean deleting the jar on the elastic hosts? I think after that elastic was running fine
[14:26:16] dcausse correct, just trying to gauge the urgency...whether or not relforge is working ATM
[14:26:25] Since we do not specify maxTotal but only defaultMaxPerRoute, 50 should apply (until the new release runs)
[14:26:47] oh if that's the case then yes
[14:28:31] on my local copy I see that we set both the total and the max per route
[14:29:12] inflatador: relforge is fine I think, no need to do anything there, it's just that we should not deploy this plugin further
[14:29:41] dcausse ACK, I'll fix it ASAP and report back
[14:30:02] in the meantime I'll stop puppet and check plugin versions
[14:32:37] going to try bumping the pod resources a bit, gc times aren't great and cpu seems to be throttled a bit
[14:44:53] dcausse I'm having issues w/prepare_build, it looks like it's trying to download the new version I'm trying to build?
[14:44:54] `curl --fail --head https://apt.wikimedia.org/wikimedia/pool/component/elastic710/w/wmf-elasticsearch-search-plugins/wmf-elasticsearch-search-plugins_7.10.2-11.tar.gz`
[14:45:23] oh strange
[14:45:48] trying on my machine
[14:46:02] dcausse I think the problem is that I merged the patch before running prepare_commit?
[14:46:27] hm... prepare_commit should only update the SHAs
[14:46:31] but they should not change here
[14:48:58] inflatador: I did run "./debian/rules prepare_build" on top of your latest commit "Fix typo in BUILD_VERSION" and it worked
[14:49:59] hm wait, that's not the latest version, my bad
[14:50:01] dcausse we're in https://meet.google.com/rgb-ebzq-ern if you wanna join
[14:58:27] pfischer: if you have a couple sec: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982809
[15:00:18] dcausse: sure
[15:00:56] dcausse: If you do have a second, too: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/83 we could merge that in
[15:01:02] sure
[15:05:16] thx
[15:05:17] pfischer: lgtm, but not sure I fully understand how all this works :/
[15:07:20] Welcome to bridging land ;-)
[15:09:56] * pfischer is annoyed by lengthy CI pipelines: the sonar job re-runs all tests. Couldn’t we reuse the target/ folders as artefacts from previous stages?
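To illustrate the maxTotal / defaultMaxPerRoute point from around 14:25-14:28, here is a minimal sketch assuming Apache HttpClient's classic pooling connection manager; the updater's actual client and its `maxTotal > 0 ? maxTotal : 50` fallback (quoted from the code above) may be wired differently, so the numbers and method are only illustrative.

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Sketch only: if maxTotal is left at the fallback (50 in the code quoted above), the
// pool's overall cap stays at 50 even if defaultMaxPerRoute is raised, so the intended
// 100 concurrent requests are never reached. Setting both lifts the cap.
public final class HttpPoolSketch {
    public static CloseableHttpClient build(int maxConcurrentRequests) {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(maxConcurrentRequests);           // overall cap across all routes
        pool.setDefaultMaxPerRoute(maxConcurrentRequests); // cap per target host (e.g. relforge)
        return HttpClients.custom().setConnectionManager(pool).build();
    }
}
```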
[15:20:59] pfischer dcausse MR for dev-images https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57
[15:24:57] NM, I self-merged
[15:26:11] Pinged in releng, will let you know when the image is updated
[15:59:40] will be 5-10m late to Weds mtg
[16:02:12] I'll skip the Wednesday meeting in favor of the Product+Tech staff meeting
[16:04:26] pfischer: something suspicious as well is that the test relying on src/test/resources/wiremock/mappings/elasticsearch.bulk.json should have failed since it still has the source as a param, not the script
[16:09:43] looks like everyone else is in the product/tech mtg...joining too
[16:24:24] pfischer: Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.update.max_compilations_rate]
[16:24:40] elastic wants to cache the script
[16:25:20] and has some circuit breaker to avoid compiling it too many times, I should have known that :/
[16:34:21] suspending the consumer while we figure out a solution
[16:40:08] Makes sense. Thank you for looking that up. Was it in the relforge logs?
[16:42:17] pfischer: no, got it while crafting a bulk request and sending it against relforge
[16:48:59] although that does not make much sense... since we "compile" it on every request today, since the source param is part of the SuperDetectNoopScript instance...
[16:53:13] So we could tune the rate? I was about to open a ticket for ES asking about the size estimation.
[16:56:01] yes I feel that we might want to ask es what they think about the estimate taking only the source into account indeed
[16:56:47] regarding tuning the compilation rate I'm not entirely sure yet, I don't fully understand how that was even working previously
[16:58:20] https://github.com/elastic/elasticsearch/issues/103406
[16:58:47] thanks!
[16:59:11] workout, back in ~40
[17:03:01] Hm, that’s somewhat frustrating… the other upstream PR is also stuck: https://github.com/apache/flink-connector-elasticsearch/pull/83 - I could provide a patched elasticsearch client JAR that solves the calculation issue
[17:06:58] pfischer: I would not spend time trying to upstream a calculation fix before they agree with the problem
[17:07:24] to make this even more frustrating it's very likely that the fix won't be backported to a version we use
[17:07:47] and we might also have to patch the opensearch client
[17:11:28] or should we fork all this and own our elastic connector?
[17:22:13] Hm, you mean, we could use a plain http client and build the bulk requests ourselves?
[17:24:59] pfischer: I meant start forking the flink-connector *and* some of the elastic client classes and start adapting them to our needs, and not be blocked trying to convince upstream
[17:27:13] looking at this script issue I don't fully understand how it's working today tbh... if the script is cached then it means we always apply the same update; if it's not cached then I don't know how elastic can track the number of compilations per minute...
[17:28:16] Well, we only pass in “” or “super_detect_noop” as ‘code/source’, so that can be cached.
[17:28:53] It’s not the result of the script that’s cached but the interpreted script code.
[17:29:23] If it were an inline painless script, you wouldn’t want to interpret it on every request, but only once
[17:30:19] So if ES stores a hash of some kind for every script source/code block, it would increment the meter whenever it encounters a hash it has not seen yet.
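To make the caching point above concrete, here is a hedged sketch of the scripted-update shape being discussed: only the constant lang/source pair is what Elasticsearch compiles (and hashes against the compilation-rate meter), while the per-document data travels in `params`. The index, document id, and field names are invented; per the log, the real source is just `""` or `"super_detect_noop"`.

```java
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

// Illustrative only: the cacheable part is (lang, source, options); the params map carries
// the per-request document, so every update can reuse the same compiled-script cache entry.
public final class ScriptedUpdateSketch {
    public static UpdateRequest build(String index, String docId, Map<String, Object> updatedFields) {
        Map<String, Object> params = new HashMap<>(updatedFields); // per-document payload
        Script script = new Script(
            ScriptType.INLINE,
            "super_detect_noop", // script lang registered by the extra plugin
            "",                  // constant source, per the discussion above
            params);
        return new UpdateRequest(index, docId).script(script);
    }
}
```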
[17:30:21] pfischer: yes but the Map holding the params and thus the updated doc is part of the script instance
[17:30:47] Yes, but the only thing to be interpreted would be the source.
[17:32:18] Source is an arbitrary string that has to be interpreted; for all other properties of the script object, their syntax and semantics are pre-defined
[17:34:33] regarding the fork: we still have to patch in two places: the sink/connector + ES core (since the BulkRequest supplier is not overridable, see org.elasticsearch.action.bulk.BulkProcessor.Builder#build)
[17:35:21] So we’d need a patched sink that relies on a patched ES client.
[17:36:42] re script cache this still does not make sense to me: https://github.com/elastic/elasticsearch/blob/v7.10.2/server/src/main/java/org/elasticsearch/script/ScriptCache.java#L99
[17:37:17] pfischer: yes I meant putting all this in a "forked" repo but not sure that's viable
[17:39:01] CacheKey cacheKey = new CacheKey(lang, idOrCode, context.name, options); - options does not mean parameters
[17:40:03] So the key works as assumed further up: language + script code + interpreter options
[17:40:17] back
[17:42:03] pfischer: yes but the thing I don't get is how the compiled script is going to get the params (we put them in the compiled script)?
[17:44:19] Doing a rolling restart of elastic to enable the new plugins
[17:45:47] dcausse: Ah, so the ScriptCache you linked, is that only used for painless scripts? The extra plugin’s SuperDetectNoopScript simply does not use caching in its compile method
[17:47:07] pfischer: I think that ScriptCache is wrapping all ScriptEngines but could be wrong
[17:49:16] pfischer: sorry I think I understood...
[17:49:34] compilation returns a factory
[17:52:03] So if we increase the rate, that script cache is going to be big unless we’re able to cap it at the same time.
[17:52:50] script.cache.max_size
[17:52:58] yes but not sure that's what we want in the end :/
[17:53:54] script.cache.expire could be low too. But yeah, it’s probably not ideal.
[17:55:36] yes esp. since we believe that the root cause is elastic not properly estimating the request...
[17:57:11] another terrible hack would be to have 3 or 4 strings of varying length (S, M, L, XL) we'd put in the source before doing our own estimation
[17:58:13] but in retrospect using the existing connector has been pretty annoying so far :(
[18:13:18] dinner
[18:27:36] lunch, back in ~1h
[18:43:18] Dinner
[19:15:14] back
[19:17:13] balthazar and I met earlier today to work on the search SLO graphite queries. Think we got it figured out, it was as simple as doing `divideSeries(hitcount(isNonNull(removeBelowValue(Search.FullTextResults.p95, 4000)), "90day"), hitcount(transformNull(removeAboveValue(Search.FullTextResults.p95, 0), 1), "90day"))` (/s)
[19:17:14] does anyone know if these 'selenium-daily-beta-CirrusSearch' alerts are relevant to Search Platform?
[19:17:35] simple ;P
[19:21:47] inflatador: I think it must be, but I don't have a lot of context. I found https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/6bfbf18cc3175973d0ec494182f9a748c38bf65c/tests/selenium/README.md and https://www.mediawiki.org/wiki/Selenium/How-to/Run_tests_using_selenium-daily_Jenkins_job
[19:24:04] ryankemper understood, I'm wondering if these are actionable though? Sounds like a MW/CirrusSearch dev issue. dcausse gehel any opinions on the above?
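Circling back to the "compilation returns a factory" point from 17:49, a minimal sketch, assuming the stock ES 7.10 `ScriptEngine`/`UpdateScript` API, of why caching the compiled script does not freeze the params: what `ScriptCache` stores under its (lang, source, context, options) key is a factory, and the per-request params only enter when that factory creates a script instance for a given update. The engine name and behavior here are invented, not the real plugin's.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import org.elasticsearch.script.UpdateScript;

// Sketch only: compile() returns a factory that carries no request data; ScriptCache can
// safely cache it, and each update still gets its own params via Factory.newInstance().
public class SketchNoopScriptEngine implements ScriptEngine {

    @Override
    public String getType() {
        return "sketch_noop"; // hypothetical script lang
    }

    @Override
    @SuppressWarnings("unchecked")
    public <T> T compile(String name, String source, ScriptContext<T> context, Map<String, String> options) {
        if (context.equals(UpdateScript.CONTEXT) == false) {
            throw new IllegalArgumentException("context not supported: " + context.name);
        }
        // Cached artifact: behaviour derived from `source`, but no per-request state.
        UpdateScript.Factory factory = (params, ctx) -> new UpdateScript(params, ctx) {
            @Override
            public void execute() {
                // the per-request params (the updated document) are available here via getParams()
            }
        };
        return (T) factory;
    }

    @Override
    public Set<ScriptContext<?>> getSupportedContexts() {
        return Collections.singleton(UpdateScript.CONTEXT);
    }

    @Override
    public void close() throws IOException {
        // nothing to release in this sketch
    }
}
```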
[19:24:42] (that wiki page lists Erik and David as contacts)
[19:26:40] ryankemper thanks for the context, looks like alerts are configured at https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/%2B/master/jjb/mediawiki-extensions.yaml#191
[19:27:37] I imagine the action is "if it's a one-off failure, ignore it, but if there are persistent build failures, look into it"
[19:28:32] Those seem to be integration tests, so probably related to a code change, and a CI alert, not a production alert
[19:28:51] dcausse or pfischer can confirm tomorrow
[19:31:27] ACK, no rush
[20:44:33] break, back in ~20
[21:07:53] back
[22:31:29] running codfw elastic restart in a tmux window on cumin2002
[23:06:00] dcausse: I think I found a workaround that only requires a patched flink-connector-elasticsearch: the connector exposes an interface, `ElasticsearchEmitter`, that we implement. If that were passed ES’ `BulkProcessor` via its `open()` method (called by `ElasticsearchWriter`), then we could do the size calculation inside the emitter and call `BulkProcessor.flush()` if adding another action would exceed the bulk size limit. The ES implementation only flushes *after* an action has been added and the `BulkRequest`’s estimated size is *greater* than the limit.
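A rough sketch of the shape that workaround could take, assuming the forked connector hands the `BulkProcessor` to the emitter through the proposed `open()` overload. The class name, size estimate, limit, and package paths are assumptions (they depend on the connector version); only `ElasticsearchEmitter`, `RequestIndexer`, and `BulkProcessor` come from the discussion above.

```java
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchEmitter;
import org.apache.flink.connector.elasticsearch.sink.RequestIndexer;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.update.UpdateRequest;

// Sketch of the proposed emitter: unlike BulkProcessor's built-in check (flush after the
// estimate already exceeds the limit), it flushes *before* adding an action that would
// push its own estimate over the limit.
public class SizeAwareEmitter implements ElasticsearchEmitter<UpdateRequest> {

    private static final long MAX_BULK_BYTES = 5L * 1024 * 1024; // assumed bulk size limit

    private transient BulkProcessor bulkProcessor;
    private transient long estimatedBulkBytes;

    // Hypothetical hook added by the patched connector (called from ElasticsearchWriter).
    public void open(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor;
    }

    @Override
    public void emit(UpdateRequest update, SinkWriter.Context context, RequestIndexer indexer) {
        long actionBytes = estimateSize(update);
        if (estimatedBulkBytes + actionBytes > MAX_BULK_BYTES) {
            bulkProcessor.flush(); // force the pending bulk out before it grows too large
            estimatedBulkBytes = 0;
        }
        indexer.add(update);
        estimatedBulkBytes += actionBytes;
    }

    private long estimateSize(UpdateRequest update) {
        // Illustrative only: a real estimate would also account for script params, which is
        // exactly what ES' own accounting misses per the discussion above.
        long bytes = 64; // rough per-action overhead, made up
        if (update.doc() != null && update.doc().source() != null) {
            bytes += update.doc().source().length();
        }
        if (update.upsertRequest() != null && update.upsertRequest().source() != null) {
            bytes += update.upsertRequest().source().length();
        }
        return bytes;
    }
}
```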