[07:05:59] e.bernhardson ack, thanks
[11:08:34] Lunch
[11:09:34] pfischer: I might need a second pair of eyes on https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/895708/2 (unrelated to anything you're working on at the moment)
[11:10:14] There are a few flaky tests in EventProcessorTest (see https://integration.wikimedia.org/ci/job/metrics-platform-pipeline-test-java/374/console).
[11:10:51] They run fine if I start them individually from my IDE, but fail when run from Maven, or if I run all the tests at once from my IDE.
[11:12:11] There must be some state that is leaking from test to test, but I've been staring at that code for too long :/
[11:12:39] Could you have a look? (only if you have time, and don't spend too long!)
[11:13:10] gehel: I'll have a look
[11:14:32] pfischer: thanks!
[11:32:52] lunch
[13:01:30] gehel: I can't reproduce the failure locally, tested on Liberica JDK 8 and 11. Which JDK do you use?
[13:02:07] tested on both JDK 8 and 11, same issue on my side for both
[13:02:27] weird :/
[13:05:01] I'll simplify that test class a bit and see if I can get to a stable state
[13:07:05] Can you log the object IDs of mockEventSender and eventProcessor during createEventProcessor and during eventsAreEnrichedBeforeBeingSent to see if they remain the same?
[13:10:22] System.out.println("eventsAreEnrichedBeforeBeingSent: " + System.identityHashCode(eventProcessor));
[13:11:02] both are changing for each execution
[13:12:22] So they change between @BeforeEach and the @Test every time?
[13:13:06] Oh no, I misread what you were asking. They change between each test. Let me check between @BeforeEach and @Test
[13:13:48] Between tests is fine/expected; they should be stable within a single test lifecycle though.
[13:15:27] You are onto something! The mock changes!
[13:15:50] If not, you may have to create the mock for EventSender manually inside @BeforeEach
[13:15:52] so MockitoExtension probably runs after @BeforeEach
[13:18:08] JUnit creates multiple copies of a test class, one for every @Test, and I've seen cases where those instances did not get initialised properly.
[13:19:07] my bad, typo on my side. The mock is stable between @BeforeEach and @Test
[13:19:56] I'm still going to remove the @Mock magic, at least until I understand what the issue is
[13:25:26] time for a good old println / bisect
[13:42:59] errand
[13:53:49] I'm tracing that error to `CurationFilter.apply()` returning `false` in some cases, so the events are filtered out. Not sure why this happens.
[13:54:24] But it really seems like this test should not depend on curation filters at all (or only minimally). Time for another refactoring.
[13:59:26] I'll be working on the re-rack maintenance starting in 30m
[14:02:23] That will probably take all day, so I declined most meetings
[14:32:42] https://etherpad.wikimedia.org/p/T322082 working out of the etherpad and #wikimedia-dcops
[14:51:53] I won't be there for the Wednesday meeting. Conflicting WMF/WMDE steering committee meeting.
[15:02:09] o/
[15:31:40] \o
[15:41:48] o/
[18:13:19] dinner
[18:30:28] The elastic rerack maintenance is complete! I just unbanned/repooled elastic1060-1066, so we should have the full 10Gbps speed for all elastic hosts in eqiad now
[18:33:00] Heading to workout/lunch, back in ~90m
[19:44:46] back
[19:49:40] hmm, so apache-airflow-providers-apache-spark==4.0.0 has a hardcoded list of executables it allows, `ALLOWED_SPARK_BINARIES = ["spark-submit", "spark2-submit"]`. I wonder if we just hack it to append spark3-submit to the list
[19:52:26] There may be a justification for it somewhere, but the patch doesn't explain why, it only says what it did: https://github.com/apache/airflow/pull/27646/commits/5a061573463731654d5c7abc212c2f197954ee7d
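
A minimal sketch of that "append spark3-submit" hack; it assumes ALLOWED_SPARK_BINARIES is the module-level list in airflow.providers.apache.spark.hooks.spark_submit (as in provider 4.0.0), and that the shim runs before any Spark hook/operator is instantiated. The module path and where the shim would live are assumptions, not something confirmed above.

    # Hypothetical shim, imported before any SparkSubmitHook/SparkSubmitOperator
    # is built; the module path is an assumption based on provider 4.0.0.
    from airflow.providers.apache.spark.hooks import spark_submit

    # Append spark3-submit to the provider's allow-list; guarded so repeated
    # imports don't add the entry twice.
    if "spark3-submit" not in spark_submit.ALLOWED_SPARK_BINARIES:
        spark_submit.ALLOWED_SPARK_BINARIES.append("spark3-submit")

Whether the operator validates against that same module-level list is worth checking in the provider source before leaning on this.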
[21:16:10] ebernhardson ryankemper guessing this one should fix the deploy issue we saw yesterday: https://phabricator.wikimedia.org/T331568
[21:19:26] inflatador: realized something related to the new airflow instance, on stat* and an-airflow1001 I have sudo rights for `(analytics-search) NOPASSWD: ALL` but on an-airflow1005 that's missing
[21:19:44] not sure why that is yet, but can poke at it
[21:21:28] I wonder if that's somehow related to the scap deploy problems (not obviously, but it's a permissions discrepancy)
[21:21:41] could have similar underlying reasons
[21:23:05] inflatador: ok, I should have read the related ticket first :P That does seem to have resolved the `scap deploy` issue from yesterday. I can deploy the repo now
[21:23:52] ebernhardson np, I think the sudo stuff is probably related to groups. Maybe compare the output of `id` on new vs old? Just guessing
[21:24:58] inflatador: hmm indeed, on 1001 I'm in the wikidev, analytics-search-users, and airflow-search-admins groups, but 1005 is missing analytics-search-users
[21:25:23] which makes sense I suppose, that's the one that defines the sudo right in data.yaml
[21:25:46] I suppose there is some hieradata somewhere that needs that string added
[21:26:33] agreed, let me see if I can luck into something
[21:31:08] guessing a bit, `hieradata/role/common/search/airflow.yaml` has `profile::admin::groups` including the string, that's the old instance. It's missing in `hieradata/role/common/analytics_cluster/airflow/search.yaml`
[21:31:09] inflatador: ^
[21:32:07] ebernhardson excellent, I love it when people do my work for me ;)
[21:32:37] I'll get a patch up shortly... guessing that will probably give d-causse the perms he needs too
[21:32:40] well, I'm pretty sure I wrote the patch that missed including this value as well :P
[21:32:57] happens when I just kind of bodge things into place based on examples without understanding what everything does
[21:44:18] bodge and ye shall receive! https://gerrit.wikimedia.org/r/c/operations/puppet/+/895865
[23:10:14] Brian and I took a stab at figuring out a new value for the elastic per-instance shard recovery throughput limit. Check it out here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895874
[23:11:24] Like all magic numbers it was somewhat arbitrarily chosen, but basically the idea is to allow no more than 20% of the total available network throughput for shard recoveries. That's per-instance, so at theoretical max shard recoveries across both elasticsearch instances (since we have 2 clusters per node ofc), that would leave 60% of the 10Gbps available for normal cluster operations.
[23:11:28] Of course that's an upper limit; I'd be shocked if the smaller cluster in particular ever maxed out that number for a significant length of time
[23:12:23] yeah, we also want to look at `indices.recovery.max_concurrent_file_chunks` at some point; I have a feeling we'll need to raise it to get faster recoveries for those huge wiki shards
[23:12:48] ebernhardson just merged that patch to give you admin on an-airflow1005
[23:13:03] (ran puppet and all that)
[23:16:48] thanks!
[23:17:06] looks to have done the trick
[23:33:37] * ebernhardson does a quick test run of the test suite using airflow 2.5.1, which is deployed to an-airflow1005, and everything fails :P should be fun
[23:35:10] surely nothing could go wrong with running tests against 2.1.4 but running 2.5.1 on the live instance :)
[23:58:48] * ebernhardson wonders what airflow datahub is for and whether we are supposed to be using it... tasks finish, but the logs end with an Exception message related to it
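
For reference on the 23:10 recovery-throughput discussion, a rough back-of-the-envelope under the stated assumptions (10Gbps links, a 20% per-instance cap, two Elasticsearch instances per host). The setting name (indices.recovery.max_bytes_per_sec) and the computed value are illustrative; the actual number is whatever the patch sets.

    # Back-of-the-envelope for the per-instance shard recovery cap; the 20%
    # fraction and 10Gbps link speed come from the discussion above, the rest
    # is illustrative.
    LINK_GBPS = 10           # NIC speed per host after the rerack
    FRACTION = 0.20          # cap per elasticsearch instance
    INSTANCES_PER_HOST = 2   # two clusters share each node

    per_instance_mb_per_s = LINK_GBPS * FRACTION * 1000 / 8  # Gbit/s -> MB/s
    worst_case = FRACTION * INSTANCES_PER_HOST               # both instances recovering at once

    print(f"indices.recovery.max_bytes_per_sec ~ {per_instance_mb_per_s:.0f}mb per instance")
    print(f"worst case {worst_case:.0%} of the link on recoveries, {1 - worst_case:.0%} left for normal traffic")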