[07:05:59] e.bernhardson ack, thanks
[11:08:34] Lunch
[11:09:34] pfischer: I might need a second pair of eyes on https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/895708/2 (unrelated to anything you're working on at the moment)
[11:10:14] There are a few flaky tests in EventProcessorTest (see https://integration.wikimedia.org/ci/job/metrics-platform-pipeline-test-java/374/console).
[11:10:51] They run fine if I start them individually from my IDE, but fail when run from Maven, or if I run all the tests at once from my IDE.
[11:12:11] There must be some state that is leaking from test to test, but I've been staring at that code for too long :/
[11:12:39] Could you have a look? (only if you have time, and don't spend too long!)
[11:13:10] gehel: I'll have a look
[11:14:32] pfischer: thanks!
[11:32:52] lunch
[13:01:30] gehel: I can't reproduce the failure locally, tested on Liberica JDK 8 and 11. Which JDK do you use?
[13:02:07] tested on both JDK 8 and 11, same issue on my side for both
[13:02:27] weird :/
[13:05:01] I'll simplify that test class a bit and see if I can get to a stable state
[13:07:05] Can you log the object IDs of mockEventSender and eventProcessor during createEventProcessor and during eventsAreEnrichedBeforeBeingSent to see if they remain the same?
[13:10:22] System.out.println("eventsAreEnrichedBeforeBeingSent: " + System.identityHashCode(eventProcessor));
[13:11:02] both are changing for each execution
[13:12:22] So they change between @BeforeEach and the @Test every time?
[13:13:06] Oh no, I misread what you were asking. They change between each test. Let me check between @BeforeEach and @Test
[13:13:48] Between tests is fine/expected; they should be stable within a single test lifecycle though.
[13:15:27] You are onto something! The mock changes!
[13:15:50] If not, you may have to create the mock for EventSender manually inside @BeforeEach
[13:15:52] so MockitoExtension probably runs after @BeforeEach
[13:18:08] JUnit creates multiple copies of a test class, one for every @Test, and I've seen cases where those instances did not get initialised properly.
[13:19:07] my bad, typo on my side. The mock is stable between @BeforeEach and @Test
[13:19:56] I'm still going to remove the @Mock magic, at least until I understand what the issue is
[13:25:26] time for a good old println / bisect
[13:42:59] errand
[13:53:49] I'm tracing that error to `CurationFilter.apply()` returning `false` in some cases, so the events are filtered out. Not sure why this happens.
[13:54:24] But it really seems like this test should not depend on curation filters at all (or only minimally). Time for another refactoring.
[13:59:26] I'll be working on the re-rack maintenance starting in 30m
[14:02:23] That will probably take all day, so I declined most meetings
[14:32:42] https://etherpad.wikimedia.org/p/T322082 working out of the etherpad and #wikimedia-dcops
[14:51:53] I won't be there for the Wednesday meeting. Conflicting WMF/WMDE steering committee meeting.
[15:02:09] o/
[15:31:40] \o
[15:41:48] o/
[18:13:19] dinner
[18:30:28] The elastic rerack maintenance is complete! I just unbanned/repooled elastic1060-1066, so we should have the full 10Gbps speed for all elastic hosts in eqiad now
[18:33:00] Heading to workout/lunch, back in ~90m
[19:44:46] back
[19:49:40] hmm, so apache-airflow-providers-apache-spark==4.0.0 has a hardcoded list of executables it allows, `ALLOWED_SPARK_BINARIES = ["spark-submit", "spark2-submit"]`. I wonder if we just hack it to append spark3-submit to the list
[19:52:26] There may be a justification for it somewhere, but the patch doesn't explain why, it only says what it did: https://github.com/apache/airflow/pull/27646/commits/5a061573463731654d5c7abc212c2f197954ee7d
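
A minimal sketch of that "append spark3-submit" hack; it assumes ALLOWED_SPARK_BINARIES is the module-level list in airflow.providers.apache.spark.hooks.spark_submit (as in provider 4.0.0), and that the shim runs before any Spark hook/operator is instantiated. The module path and where the shim would live are assumptions, not something confirmed above.

    # Hypothetical shim, imported before any SparkSubmitHook/SparkSubmitOperator
    # is built; the module path is an assumption based on provider 4.0.0.
    from airflow.providers.apache.spark.hooks import spark_submit

    # Append spark3-submit to the provider's allow-list; guarded so repeated
    # imports don't add the entry twice.
    if "spark3-submit" not in spark_submit.ALLOWED_SPARK_BINARIES:
        spark_submit.ALLOWED_SPARK_BINARIES.append("spark3-submit")

Whether the operator validates against that same module-level list is worth checking in the provider source before leaning on this.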
[21:16:10] ebernhardson ryankemper guessing this one should fix the deploy issue we saw yesterday: https://phabricator.wikimedia.org/T331568
[21:19:26] inflatador: realized something related to the new airflow instance, on stat* and an-airflow1001 I have sudo rights for `(analytics-search) NOPASSWD: ALL` but on an-airflow1005 that's missing
[21:19:44] not sure why that is yet, but can poke at it
[21:21:28] I wonder if that's somehow related to the scap deploy problems (not obviously, but it's a permissions discrepancy)
[21:21:41] could have similar underlying reasons
[21:23:05] inflatador: ok, I should have read the related ticket first :P That does seem to have resolved the `scap deploy` issue from yesterday. I can deploy the repo now
[21:23:52] ebernhardson np, I think the sudo stuff is probably related to groups. Maybe compare the output of `id` on new vs old? Just guessing
[21:24:58] inflatador: hmm indeed, on 1001 I'm in the wikidev, analytics-search-users, and airflow-search-admins groups, but 1005 is missing analytics-search-users
[21:25:23] which makes sense I suppose, that's the one that defines the sudo right in data.yaml
[21:25:46] I suppose there is some hieradata somewhere that needs that string added
[21:26:33] agreed, let me see if I can luck into something
[21:31:08] guessing a bit, `hieradata/role/common/search/airflow.yaml` has `profile::admin::groups` including the string, that's the old instance. It's missing in `hieradata/role/common/analytics_cluster/airflow/search.yaml`
[21:31:09] inflatador: ^
[21:32:07] ebernhardson excellent, I love it when people do my work for me ;)
[21:32:37] I'll get a patch up shortly... guessing that will probably give d-causse the perms he needs too
[21:32:40] well, I'm pretty sure I wrote the patch that missed including this value as well :P
[21:32:57] happens when I just kind of bodge things into place based on examples without understanding what everything does
[21:44:18] bodge and ye shall receive! https://gerrit.wikimedia.org/r/c/operations/puppet/+/895865
[23:10:14] Brian and I took a stab at figuring out a new value for the elastic per-instance shard recovery throughput limit. Check it out here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895874
[23:11:24] Like all magic numbers it was somewhat arbitrarily chosen, but basically the idea is to allow no more than 20% of the total available network throughput for shard recoveries. That's per-instance, so at theoretical max shard recoveries across both elasticsearch instances (since we have 2 clusters per node ofc), that would leave 60% of the 10Gbps available for normal cluster operations.
[23:11:28] Of course that's an upper limit; I'd be shocked if the smaller cluster in particular ever maxed out that number for a significant length of time
[23:12:23] yeah, we also want to look at `indices.recovery.max_concurrent_file_chunks` at some point; I have a feeling we'll need to raise it to get faster recoveries for those huge wiki shards
[23:12:48] ebernhardson just merged that patch to give you admin on an-airflow1005
[23:13:03] (ran puppet and all that)
[23:16:48] thanks!
[23:17:06] looks to have done the trick
[23:33:37] * ebernhardson does a quick test run of the test suite using airflow 2.5.1, which is deployed to an-airflow1005, and everything fails :P should be fun
[23:35:10] surely nothing could go wrong with running tests against 2.1.4 but running 2.5.1 on the live instance :)
[23:58:48] * ebernhardson wonders what airflow datahub is for and whether we are supposed to be using it... tasks finish, but the logs end with an Exception message related to it
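
For reference on the 23:10 recovery-throughput discussion, a rough back-of-the-envelope under the stated assumptions (10Gbps links, a 20% per-instance cap, two Elasticsearch instances per host). The setting name (indices.recovery.max_bytes_per_sec) and the computed value are illustrative; the actual number is whatever the patch sets.

    # Back-of-the-envelope for the per-instance shard recovery cap; the 20%
    # fraction and 10Gbps link speed come from the discussion above, the rest
    # is illustrative.
    LINK_GBPS = 10           # NIC speed per host after the rerack
    FRACTION = 0.20          # cap per elasticsearch instance
    INSTANCES_PER_HOST = 2   # two clusters share each node

    per_instance_mb_per_s = LINK_GBPS * FRACTION * 1000 / 8  # Gbit/s -> MB/s
    worst_case = FRACTION * INSTANCES_PER_HOST               # both instances recovering at once

    print(f"indices.recovery.max_bytes_per_sec ~ {per_instance_mb_per_s:.0f}mb per instance")
    print(f"worst case {worst_case:.0%} of the link on recoveries, {1 - worst_case:.0%} left for normal traffic")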