[07:40:08] somehow the backfill job did not end and I'm pretty sure it consumed all its backlog...
[07:53:10] can see "Finished reading from splits [eqiad.cirrussearch.update_pipeline.update.rc0-1]" for all 5 partitions but the job is still running
[07:53:36] perhaps the way we disable the saneitizer?
[07:56:38] yes, seems like its source is always connected, which makes sense, we don't want to change the job graph when disabling it
[07:57:24] might need another option to not install it during backfill, or perhaps interpret that if we run with a bounded kafka source
[09:28:46] started the wikidata backfill for cloudelastic
[10:29:26] lunch
[10:56:54] You can make the saneitizer source bounded by passing “saneitize-max-runtime: 5s” and additionally pause it via “saneitize: false”. This way it is still in the graph, does nothing and is considered bounded by the execution environment.
[10:58:07] So once the kafka source reaches its end, the application should shut down. That’s how Erik’s SaneitizerApplicationIT terminates.
[12:21:41] pfischer: oh thanks, did not realize that was possible
[12:51:27] I updated your PR
[12:52:36] Oops, overlooked your approval. Thanks! I just added more detailed config docs.
[13:12:40] If someone has time for a few code reviews on the migration to our new pom.xml:
[13:12:52] https://gerrit.wikimedia.org/r/c/search/glent/+/1030169 (and parent)
[13:12:56] https://gerrit.wikimedia.org/r/c/search/highlighter/+/1030147
[13:54:14] o/
[14:08:59] workout, back in ~40
[14:17:58] gehel: looking
[14:45:41] dcausse: Regarding the jackson version inside the WDQS POM reactor: Was 2.12.2 arbitrary or could I bump it to the latest (2.17.1) instead (excluding blazegraph and blazegraph-service (war))?
[14:45:49] back
[14:45:56] o/
[14:47:10] pfischer: I don't really remember :(
[14:47:48] dcausse: s/arbitrary/deliberate/ - looking at random blazegraph dependencies they were built against 2.3.x, so I don’t know where the 2.6.x is coming from
[14:49:56] the 2.6 is coming from this patch: https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/blazegraph/+/b03e93016eef5a97799235c4cf8808b206b52d12
[14:50:26] why we used 2.12 in non-blazegraph artifacts I can't really remember
[14:50:52] dcausse: alright. Thanks!
[14:52:15] perhaps we should be careful with what's being used by eventutilities-flink?
[14:53:31] * dcausse should stop removing cindy -1 votes and start looking into what's happening...
[14:58:28] cindy was right :)
[14:59:11] \o
[14:59:18] dcausse: lol, i wonder that sometimes too :)
[14:59:24] o/
[14:59:30] but every time i try and make it less failure prone... it still fails :P
[14:59:36] :)
[15:00:34] next idea is perhaps to split it in half, one test of indexing, one test of search. In that way we can prepare an index all up front and perhaps avoid oddities. But then i dunno about testing move/delete/etc...
[15:00:48] probably still have the same inconsistency problems
[15:01:15] yes...
[15:01:50] iirc there are tests asserting something before and after an index change
[15:02:50] yea, perhaps it just concentrates the failures into the indexing side where those are tested
[15:10:38] ebernhardson: thank you for fixing convert_to_esbulk!
[15:13:08] pfischer: np, it was just random minor stuff. And it took longer because for some reason i looked at the pyarrow v1 FileSystem impl instead of v2 when writing it...
[15:32:45] Oh, and why is fs.exists no longer sufficient? Is that the workaround for the issue with the HDFS file system you mentioned earlier?
[15:33:04] pfischer: fs.exists is the v1 api, it's not in v2 apparently
[15:33:57] Oh, so I have to fix my mark_skipped method, too
[15:34:00] this one is v2: https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html vs v1: https://arrow.apache.org/docs/1.0/python/generated/pyarrow.HadoopFileSystem.exists.html#pyarrow.HadoopFileSystem.exists
[15:34:29] pfischer: ahh i should have looked, yea probably then
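(Editor's note: a minimal sketch of the v2-style existence check discussed above, assuming pyarrow's pyarrow.fs module, which has no fs.exists(); the equivalent is get_file_info(). The helper name, the path and the HDFS connection setup are illustrative, not taken from the actual convert_to_esbulk / mark_skipped code.)

# Sketch only: pyarrow v2 filesystem API has no exists(); check via get_file_info().
from pyarrow import fs

def path_exists(filesystem: fs.FileSystem, path: str) -> bool:
    # v2 replacement for the old v1 HadoopFileSystem.exists():
    # get_file_info() returns a FileInfo whose type is NotFound for missing paths.
    return filesystem.get_file_info(path).type != fs.FileType.NotFound

hdfs = fs.HadoopFileSystem("default")  # "default" picks up fs.defaultFS from the Hadoop config
if path_exists(hdfs, "/tmp/example_partition/_SUCCESS"):  # hypothetical path
    print("already written, skipping")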
[15:43:13] ebernhardson: no worries, I’ll fix it
[15:44:37] wow, cindy really doesn't like my one-line change to extension.json
[15:49:04] yes...
[15:50:13] it's getting the page indexed so not sure what's going on, if it broke indexing I would expect a lot more failures...
[15:52:30] i'm also going to be proposing a fix to core extension registration, in which case this patch is unnecessary. But i haven't decided quite how to test what changes we will actually see in prod if i change that
[15:52:46] probably end up haxing mwdebug1002 with the registration fix and dumping all the configs or something
[16:03:38] dcausse: the tools module keeps failing to build due to 503 responses from the on-demand jetty. Is that supposed to work? Looking at the jenkins build logs, it does not seem to run jetty at all: https://integration.wikimedia.org/ci/job/search-highlighter-maven-java11/4/consoleFull
[16:33:32] pfischer: looking
[16:35:09] pfischer: the jenkins log is about the highlighter but you mentioned tools & jetty so I believe you mean wdqs?
[16:36:03] I think we have some integration tests that talk to blazegraph so I think jetter should run
[16:36:13] s/jetter/jetty
[16:37:51] I have merged the migration to the new parent pom on wdqs today. I might have broken something
[16:38:08] looking
[16:45:18] pfischer: I suspect that you're trying to build with java > 8, this project does not build without java8
[16:46:23] We should have another look at adding sdkman to our projects...
[16:46:37] builds fine with java8 but fails in blazegraph integration tests with java11
[17:14:09] * ebernhardson also finds no reason why cindy dislikes provide_default so much :S
[17:15:57] i spoke too soon. Somehow that results in $wgCirrusSearchWriteClusters === []... which suggests there are either more problems with extension registration, or i really don't understand something :)
[17:18:06] :(
[17:20:33] the other option mentioned in #mediawiki-core is that we are kinda being non-standard here, mediawiki would probably use false instead of [] to disable. It works fine but i suppose i've always had a (perhaps unnecessary) aversion to union types
[17:43:31] linting errors should be reported in line number descending order...
[17:46:31] or me starting to read them backwards...
[18:02:20] i always do them backwards, otherwise the earlier lint failure lines are wrong
[18:03:03] * ebernhardson now has to look at LiquidThreads to understand if this config value change does anything :S
[18:03:15] i wonder what liquid threads still does...
[18:48:45] dinner
[19:08:50] lunch, back in ~1h
[20:20:25] back
[21:25:49] * ebernhardson tries to understand why discolytics gets versioned as v0.20.0 but packaged as 0.20.0-v0.20.0, and fails :P
[23:10:34] deployed the new metrics dags, started both on march 1st and they will catch up to now.
[23:11:58] webrequests hasn't finished one successfully yet, but it will probably take about an hour per day of processing (on a small-ish 128 core/384GB mem limit)