[07:56:01] ^ I have built that image :)
[07:56:24] as `docker-registry.wikimedia.org/dev/cirrus-elasticsearch:7.10.2-s2`
[09:23:09] hashar: thanks! Did you build it manually (and locally) or do you have some kind of automation for that?
[09:23:49] pfischer: the git repository has a shell script wrapper to trigger a build
[09:23:55] something like `./fab deploy_devimages`
[09:24:23] which really SSHes to contint.wikimedia.org, git pulls the repo there showing the diff, then asks for confirmation to run `docker-pkg`
[09:24:32] which then eventually pushes the resulting image(s) to the registry
[09:27:20] Ah, good to know. Could that be wrapped in a (manual) .gitlab-ci.yaml job?
[09:31:52] dcausse: I bumped the docker image name for the ConsumerApplicationIT so the PR is complete now: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/82 Would you have a minute?
[09:32:59] pfischer: sure
[09:50:08] pfischer: merged
[09:50:38] dcausse: thanks!
[09:50:55] I’ll deploy an update once CI passes.
[10:43:51] dcausse: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982774
[10:44:16] +1
[11:42:09] dcausse: I forgot about the split fetcher, I have to distinguish the metrics, otherwise it’s unclear which HTTP client they are coming from
[11:43:01] pfischer: oh.. I thought that flink would have added the operator name in the metric labels?
[11:47:38] Huh, we might be lucky in that case.
[11:49:26] quickly looking I only see metrics with operator_name="Map", which I assume would be the synchronous client used by the CirrusNamespaceIndexMap operator
[11:51:15] flink_taskmanager_job_task_operator_http_method_authority_path_request_duration_count would be one of the new ones. But you are right, I forgot about the label
[11:52:49] hm for some reason I don't see any metrics from the consumer app
[11:53:07] oh it failed yesterday around 19 utc
[11:53:09] Right k8s cluster? Only see values on codfw
[11:56:07] last failure is https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.50?id=4YJvX4wBRtLP5wy6VVuy
[11:56:12] lunch
[12:00:28] dcausse: That was probably during the update of relforge ES instances
[12:21:22] Hm, seems like the consumer is in some limbo state. It stopped last night, but when I restart the application (using restartNonce) the pod remains untouched (uptime 19h).
[12:29:54] Okay, consumer is running but flooding the logs with warnings of duplicate metrics. 👀
[13:29:05] inflatador: I just launched a docker container from `http://docker-registry.wikimedia.org/dev/cirrus-elasticsearch:7.10.2-s2` and it fails due to conflicting JARs: https://phabricator.wikimedia.org/P54377
[13:30:54] pfischer: oh good point regarding the relforge upgrade yesterday, probably something we should tune in the flink restart strategy so that it survives such operations
[13:33:12] dcausse: definitely, I’ll create a ticket.
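As a side note on that restart-strategy ticket: below is a minimal sketch of what such tuning could look like, assuming Flink's standard `RestartStrategyOptions` and an exponential-delay strategy. The concrete values are illustrative, not the updater's actual configuration.

```java
import java.time.Duration;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestartStrategyOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical sketch: back off exponentially instead of giving up after a fixed number
// of quick restarts, so a short Elasticsearch maintenance window (e.g. the relforge
// plugin upgrade mentioned above) does not leave the job in a terminally failed state.
public final class RestartStrategySketch {
    public static StreamExecutionEnvironment createEnvironment() {
        Configuration conf = new Configuration();
        conf.set(RestartStrategyOptions.RESTART_STRATEGY, "exponential-delay");
        conf.set(RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_INITIAL_BACKOFF, Duration.ofSeconds(1));
        conf.set(RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_MAX_BACKOFF, Duration.ofMinutes(10));
        // build the pipeline on this environment as usual
        return StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```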
[13:35:32] I’ll have to re-release all extra-common dependent plugins to work around the classpath issue mentioned above
[13:36:38] * pfischer wonders how that did not break for wmf9
[13:40:14] the deb is broken
[13:40:42] elastic uses different classloaders per plugin
[13:43:09] pfischer: no need to release other plugins I think
[13:43:49] most probably a stale jar on the repo that was used to build the debian package
[13:43:54] Okay, I couldn’t remember that this was an issue with wmf9
[13:44:13] the repo used to build the deb was probably cleaned up
[13:44:31] the script should perhaps take care of this
[13:44:31] inflatador: mentioned something like that
[13:45:02] https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/982444/2/debian/sha256sums does not have the faulty jar so it must be stale somewhere
[13:46:17] Alright. I rolled back the consumer tempora
[13:46:28] temporarily
[13:46:54] Nonetheless, looks like the new version is capable of higher throughput: https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin=&forceLogin=&forceLogin=true&from=now-3h&orgId=1&to=now&var-k8sds=eqiad%20prometheus%2Fk8s-staging&var-opsds=eqiad%20prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=eqiad&var-app=All&var-operator_name=enrich_page_change_with_revision&var-operator_name=Source:_cirrussearch_update_pipeline_update_rc0_source&refresh=5m
[13:48:04] 10 ops/s for enrich_page_change_with_revision vs. < 1 ops/s
[13:48:55] Let’s wait until inflatador: comes back on, until then I’ll try to fix the duplicate metrics complaints.
[13:49:24] pfischer: I wonder if it's because it's now catching up, if you increase the timerange to 3 days it's roughly similar
[13:50:40] also don't trust the kafka lag on this dashboard, we don't have such metrics on the test kafka cluster
[13:51:18] (which is where the update_stream is stored)
[13:53:09] hm the ops-plugin repo should run the makefile task clean_blobs when downloading new blobs, not sure why that did not work here :/
[13:55:35] inflatador: if you still have logs of the "prepare_build" task you ran on the host that created the deb this would be helpful, something's not cleaned up properly I'm afraid
[14:01:27] I’m surprised that relforge produces valid responses after all: I queried relforge for /_cat/plugins and it showed the expected version, but while the updated consumer was running, I saw only _FAILED responses. Now that the previous version is running, we can see the usual load of NOOPs and a few UPDATED. It appears as if there’s an old version of the extra plugin processing the requests, that is not capable of extracting script.source and hence fails if params.source is missing.
[14:11:08] I thought that Brian did fix the relforge machine by removing the jar manually?
[14:11:52] I don't think elastic would even start with a broken plugin
[14:16:25] we seem to run 50 http rps, for an avg response time of 0.5s, so that's about 25 concurrent requests while configured for 100 concurrent requests, with 15% for rev_based and the rest for rerenders
[14:16:31] o/
[14:16:34] o/
[14:18:16] Looks like I broke the elastic package? ;(
[14:19:40] The package does have multiple jars for extra-common and extra-7.10.2. It will need to be rebuilt
[14:20:44] But the deb is packaging all plugins, right?
[14:21:29] Correct, nothing else should have changed though
[14:21:48] So it may have multiple versions of that JAR if there are plugins from different releases.
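As an aside, the concurrency figure quoted at 14:16:25 follows directly from Little's law, using only the numbers stated in that message:

```latex
L = \lambda \cdot W \approx 50~\text{req/s} \times 0.5~\text{s} \approx 25~\text{concurrent requests}
```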
[14:22:04] Is it possible to diff two debs?
[14:22:38] Would be interesting to see what changed from wmf9 to wmf10
[14:22:59] Possibly, but I don't know how. I can definitely rebuild the plugin though
[14:23:21] inflatador: do you run prepare_build locally and then upload to the server that creates the package?
[14:23:32] Y
[14:23:37] so that might explain
[14:23:59] if the folder on the remote server is not cleaned up
[14:24:25] That's what I'm thinking too
[14:24:40] Sounds like there are still issues even after I manually deleted the duplicate jar?
[14:25:29] dcausse: regarding the 50 req/s: this may be explained by how the connection pool is initialised: with maxTotal > 0 ? maxTotal : 50
[14:25:34] inflatador: you mean deleting the jar on the elastic hosts? I think after that elastic was running fine
[14:26:16] dcausse correct, just trying to gauge the urgency...whether or not relforge is working ATM
[14:26:25] Since we do not specify maxTotal but only defaultMaxPerRoute, 50 should apply (until the new release runs)
[14:26:47] oh if that's the case then yes
[14:28:31] on my local copy I see that we set both the total and the max per route
[14:29:12] inflatador: relforge is fine I think, no need to do anything there, it's just that we should not deploy this plugin further
[14:29:41] dcausse ACK, I'll fix it ASAP and report back
[14:30:02] in the meantime I'll stop puppet and check plugin versions
[14:32:37] going to try bumping the pod resources a bit, gc times aren't great and cpu seems to be throttled a bit
[14:44:53] dcausse I'm having issues w/prepare_build, it looks like it's trying to download the new version I'm trying to build?
[14:44:54] `curl --fail --head https://apt.wikimedia.org/wikimedia/pool/component/elastic710/w/wmf-elasticsearch-search-plugins/wmf-elasticsearch-search-plugins_7.10.2-11.tar.gz`
[14:45:23] oh strange
[14:45:48] trying on my machine
[14:46:02] dcausse I think the problem is that I merged the patch before running prepare_commit?
[14:46:27] hm... prepare_commit should only update the SHAs
[14:46:31] but they should not change here
[14:48:58] inflatador: I did run "./debian/rules prepare_build" on top of your latest commit "Fix typo in BUILD_VERSION" and it worked
[14:49:59] hm wait, that's not the latest version, my bad
[14:50:01] dcausse we're in https://meet.google.com/rgb-ebzq-ern if you wanna join
[14:58:27] pfischer: if you have a couple sec: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/982809
[15:00:18] dcausse: sure
[15:00:56] dcausse: If you do have a second, too: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/83 we could merge that in
[15:01:02] sure
[15:05:16] thx
[15:05:17] pfischer: lgtm, but not sure I fully understand how all this works :/
[15:07:20] Welcome to bridging land ;-)
[15:09:56] * pfischer is annoyed by lengthy CI pipelines: the sonar job re-runs all tests. Couldn’t we reuse the target/ folders as artefacts from previous stages?
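To illustrate the maxTotal / defaultMaxPerRoute point from around 14:25-14:28, here is a minimal sketch assuming Apache HttpClient's classic pooling connection manager; the updater's actual client and its `maxTotal > 0 ? maxTotal : 50` fallback (quoted from the code above) may be wired differently, so the numbers and method are only illustrative.

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Sketch only: if maxTotal is left at the fallback (50 in the code quoted above), the
// pool's overall cap stays at 50 even if defaultMaxPerRoute is raised, so the intended
// 100 concurrent requests are never reached. Setting both lifts the cap.
public final class HttpPoolSketch {
    public static CloseableHttpClient build(int maxConcurrentRequests) {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(maxConcurrentRequests);           // overall cap across all routes
        pool.setDefaultMaxPerRoute(maxConcurrentRequests); // cap per target host (e.g. relforge)
        return HttpClients.custom().setConnectionManager(pool).build();
    }
}
```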
[15:20:59] pfischer dcausse MR for dev-images https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/57
[15:24:57] NM, I self-merged
[15:26:11] Pinged in releng, will let you know when the image is updated
[15:59:40] will be 5-10m late to Weds mtg
[16:02:12] I'll skip the Wednesday meeting in favor of the Product+Tech staff meeting
[16:04:26] pfischer: something suspicious as well is that the test relying on src/test/resources/wiremock/mappings/elasticsearch.bulk.json should have failed since it still has the source as a param, not the script
[16:09:43] looks like everyone else is in the product/tech mtg...joining too
[16:24:24] pfischer: Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.update.max_compilations_rate]
[16:24:40] elastic wants to cache the script
[16:25:20] and has some circuit breaker to avoid compiling it too many times, I should have known that :/
[16:34:21] suspending the consumer while we figure out a solution
[16:40:08] Makes sense. Thank you for looking that up. Was it in the relforge logs?
[16:42:17] pfischer: no, got it while crafting a bulk request and sending it against relforge
[16:48:59] although that does not make much sense... since we "compile" it on every request today, since the source param is part of the SuperDetectNoopScript instance...
[16:53:13] So we could tune the rate? I was about to open a ticket for ES asking about the size estimation.
[16:56:01] yes I feel that we might want to ask es what they think about the estimate taking only the source into account indeed
[16:56:47] regarding tuning the compilation rate I'm not entirely sure yet, I don't fully understand how that was even working previously
[16:58:20] https://github.com/elastic/elasticsearch/issues/103406
[16:58:47] thanks!
[16:59:11] workout, back in ~40
[17:03:01] Hm, that’s somewhat frustrating… the other upstream PR is also stuck: https://github.com/apache/flink-connector-elasticsearch/pull/83 - I could provide a patched elasticsearch client JAR that solves the calculation issue
[17:06:58] pfischer: I would not spend time trying to upstream a calculation fix before they agree with the problem
[17:07:24] to make this even more frustrating it's very likely that the fix won't be backported to a version we use
[17:07:47] and we might also have to patch the opensearch client
[17:11:28] or should we fork all this and own our elastic connector?
[17:22:13] Hm, you mean, we could use a plain http client and build the bulk requests ourselves?
[17:24:59] pfischer: I meant start forking the flink-connector *and* some of the elastic client classes and start adapting them to our needs, and not be blocked trying to convince upstream
[17:27:13] looking at this script issue I don't fully understand how it's working today tbh... if the script is cached then it means we always apply the same update; if it's not cached then I don't know how elastic can track the number of compilations per minute...
[17:28:16] Well, we only pass in “” or “super_detect_noop” as ‘code/source’, so that can be cached.
[17:28:53] It’s not the result of the script that’s cached but the interpreted script code.
[17:29:23] If it were an inline painless script, you wouldn’t want to interpret it on every request, but only once
[17:30:19] So if ES stores a hash of some kind for every script source/code block, it would increment the meter whenever it encounters a hash it has not seen yet.
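To make the caching point above concrete, here is a hedged sketch of the scripted-update shape being discussed: only the constant lang/source pair is what Elasticsearch compiles (and hashes against the compilation-rate meter), while the per-document data travels in `params`. The index, document id, and field names are invented; per the log, the real source is just `""` or `"super_detect_noop"`.

```java
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

// Illustrative only: the cacheable part is (lang, source, options); the params map carries
// the per-request document, so every update can reuse the same compiled-script cache entry.
public final class ScriptedUpdateSketch {
    public static UpdateRequest build(String index, String docId, Map<String, Object> updatedFields) {
        Map<String, Object> params = new HashMap<>(updatedFields); // per-document payload
        Script script = new Script(
            ScriptType.INLINE,
            "super_detect_noop", // script lang registered by the extra plugin
            "",                  // constant source, per the discussion above
            params);
        return new UpdateRequest(index, docId).script(script);
    }
}
```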
[17:30:21] pfischer: yes but the Map holding the params and thus the updated doc is part of the script instance
[17:30:47] Yes, but the only thing to be interpreted would be the source.
[17:32:18] Source is an arbitrary string that has to be interpreted; for all other properties of the script object, their syntax and semantics are pre-defined
[17:34:33] regarding the fork: we still have to patch in two places: the sink/connector + ES core (since the BulkRequest supplier is not overridable, see org.elasticsearch.action.bulk.BulkProcessor.Builder#build)
[17:35:21] So we’d need a patched sink that relies on a patched ES client.
[17:36:42] re script cache this still does not make sense to me: https://github.com/elastic/elasticsearch/blob/v7.10.2/server/src/main/java/org/elasticsearch/script/ScriptCache.java#L99
[17:37:17] pfischer: yes I meant putting all this in a "forked" repo but not sure that's viable
[17:39:01] CacheKey cacheKey = new CacheKey(lang, idOrCode, context.name, options); - options does not mean parameters
[17:40:03] So the key works as assumed further up: language + script code + interpreter options
[17:40:17] back
[17:42:03] pfischer: yes but the thing I don't get is how the compiled script is going to get the params (we put them in the compiled script)?
[17:44:19] Doing a rolling restart of elastic to enable the new plugins
[17:45:47] dcausse: Ah, so the ScriptCache you linked, is that only used for painless scripts? The extra plugin’s SuperDetectNoopScript simply does not use caching in its compile method
[17:47:07] pfischer: I think that ScriptCache is wrapping all ScriptEngines but could be wrong
[17:49:16] pfischer: sorry I think I understood...
[17:49:34] compilation returns a factory
[17:52:03] So if we increase the rate, that script cache is going to be big unless we’re able to cap it at the same time.
[17:52:50] script.cache.max_size
[17:52:58] yes but not sure that's what we want in the end :/
[17:53:54] script.cache.expire could be low too. But yeah, it’s probably not ideal.
[17:55:36] yes esp. since we believe that the root cause is elastic not properly estimating the request...
[17:57:11] another terrible hack would be to have 3 or 4 strings of varying length (S, M, L, XL) we'd put in the source before doing our own estimation
[17:58:13] but in retrospect using the existing connector has been pretty annoying so far :(
[18:13:18] dinner
[18:27:36] lunch, back in ~1h
[18:43:18] Dinner
[19:15:14] back
[19:17:13] balthazar and I met earlier today to work on the search SLO graphite queries. Think we got it figured out, it was as simple as doing `divideSeries(hitcount(isNonNull(removeBelowValue(Search.FullTextResults.p95, 4000)), "90day"), hitcount(transformNull(removeAboveValue(Search.FullTextResults.p95, 0), 1), "90day"))` (/s)
[19:17:14] does anyone know if these 'selenium-daily-beta-CirrusSearch' alerts are relevant to Search Platform?
[19:17:35] simple ;P
[19:21:47] inflatador: I think it must be, but I don't have a lot of context. I found https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/6bfbf18cc3175973d0ec494182f9a748c38bf65c/tests/selenium/README.md and https://www.mediawiki.org/wiki/Selenium/How-to/Run_tests_using_selenium-daily_Jenkins_job
[19:24:04] ryankemper understood, I'm wondering if these are actionable though? Sounds like a MW/CirrusSearch dev issue. dcausse gehel any opinions on the above?
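Circling back to the "compilation returns a factory" point from 17:49, a minimal sketch, assuming the stock ES 7.10 `ScriptEngine`/`UpdateScript` API, of why caching the compiled script does not freeze the params: what `ScriptCache` stores under its (lang, source, context, options) key is a factory, and the per-request params only enter when that factory creates a script instance for a given update. The engine name and behavior here are invented, not the real plugin's.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import org.elasticsearch.script.UpdateScript;

// Sketch only: compile() returns a factory that carries no request data; ScriptCache can
// safely cache it, and each update still gets its own params via Factory.newInstance().
public class SketchNoopScriptEngine implements ScriptEngine {

    @Override
    public String getType() {
        return "sketch_noop"; // hypothetical script lang
    }

    @Override
    @SuppressWarnings("unchecked")
    public <T> T compile(String name, String source, ScriptContext<T> context, Map<String, String> options) {
        if (context.equals(UpdateScript.CONTEXT) == false) {
            throw new IllegalArgumentException("context not supported: " + context.name);
        }
        // Cached artifact: behaviour derived from `source`, but no per-request state.
        UpdateScript.Factory factory = (params, ctx) -> new UpdateScript(params, ctx) {
            @Override
            public void execute() {
                // the per-request params (the updated document) are available here via getParams()
            }
        };
        return (T) factory;
    }

    @Override
    public Set<ScriptContext<?>> getSupportedContexts() {
        return Collections.singleton(UpdateScript.CONTEXT);
    }

    @Override
    public void close() throws IOException {
        // nothing to release in this sketch
    }
}
```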
[19:24:42] (that wiki page lists Erik and David as contacts)
[19:26:40] ryankemper thanks for the context, looks like alerts are configured at https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/%2B/master/jjb/mediawiki-extensions.yaml#191
[19:27:37] I imagine the action is "if it's a one-off failure, ignore it, but if there are persistent build failures, look into it"
[19:28:32] Those seem to be integration tests, so probably related to a code change, and a CI alert, not a production alert
[19:28:51] dcausse or pfischer can confirm tomorrow
[19:31:27] ACK, no rush
[20:44:33] break, back in ~20
[21:07:53] back
[22:31:29] running codfw elastic restart in a tmux window on cumin2002
[23:06:00] dcausse: I think I found a workaround that only requires a patched flink-connector-elasticsearch: the connector exposes an interface, `ElasticsearchEmitter`, that we implement. If that were passed ES’ `BulkProcessor` via its `open()` method (called by `ElasticsearchWriter`), then we could do the size calculation inside the emitter and call `BulkProcessor.flush()` if adding another action would exceed the bulk size limit. The ES implementation only flushes *after* an action has been added and the `BulkRequest`’s estimated size is *greater* than the limit.
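A rough sketch of the shape that workaround could take, assuming the forked connector hands the `BulkProcessor` to the emitter through the proposed `open()` overload. The class name, size estimate, limit, and package paths are assumptions (they depend on the connector version); only `ElasticsearchEmitter`, `RequestIndexer`, and `BulkProcessor` come from the discussion above.

```java
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchEmitter;
import org.apache.flink.connector.elasticsearch.sink.RequestIndexer;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.update.UpdateRequest;

// Sketch of the proposed emitter: unlike BulkProcessor's built-in check (flush after the
// estimate already exceeds the limit), it flushes *before* adding an action that would
// push its own estimate over the limit.
public class SizeAwareEmitter implements ElasticsearchEmitter<UpdateRequest> {

    private static final long MAX_BULK_BYTES = 5L * 1024 * 1024; // assumed bulk size limit

    private transient BulkProcessor bulkProcessor;
    private transient long estimatedBulkBytes;

    // Hypothetical hook added by the patched connector (called from ElasticsearchWriter).
    public void open(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor;
    }

    @Override
    public void emit(UpdateRequest update, SinkWriter.Context context, RequestIndexer indexer) {
        long actionBytes = estimateSize(update);
        if (estimatedBulkBytes + actionBytes > MAX_BULK_BYTES) {
            bulkProcessor.flush(); // force the pending bulk out before it grows too large
            estimatedBulkBytes = 0;
        }
        indexer.add(update);
        estimatedBulkBytes += actionBytes;
    }

    private long estimateSize(UpdateRequest update) {
        // Illustrative only: a real estimate would also account for script params, which is
        // exactly what ES' own accounting misses per the discussion above.
        long bytes = 64; // rough per-action overhead, made up
        if (update.doc() != null && update.doc().source() != null) {
            bytes += update.doc().source().length();
        }
        if (update.upsertRequest() != null && update.upsertRequest().source() != null) {
            bytes += update.upsertRequest().source().length();
        }
        return bytes;
    }
}
```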