[07:34:00] dcausse: Welcome back! How did you survive the travel back home?
[07:34:46] You should have an email from Erin about the collocated onboarding with Peter, if you could reply...
[07:42:23] gehel: hey, travel was ok
[07:42:43] I already replied to Erin (forgot to reply-all, so that's why you did not see it)
[07:42:46] but we're all set
[07:45:47] great!
[08:00:03] Landlord’s coming to do a walkthrough of our house Thursday morning so I'm probably missing retro, should be around for the puppet window though
[08:39:15] ebernhardson: Hi, thank you for the help yesterday! I just used a list comprehension to simply take each table from eventgate. I now have another problem: `AssertionError: Large job (mem > 420g) should not be using the default pool`. What configs should I specify in this case?
[08:39:51] ^ that's for a spark job
[09:01:23] tanny411: dcausse might also have some knowledge on that topic
[09:02:45] dcausse: I'm scheduling our ITC for next Tuesday during our regular 1:1. Can you make sure you answer https://app.betterworks.com/app/#/conversation/2330358 by then?
[09:02:55] sure
[09:06:22] tanny411: I think pool='sequential'
[09:09:09] we don't seem to have any other pool
[09:11:16] TY!
[09:18:18] dcausse: glad you are back, hope you are feeling good!
[09:20:26] tanny411: thanks!
[10:01:42] lunch+errand
[12:48:09] dcausse: itamarWMDE has pinged me on T307869. Could you have a look? I'm not sure if there is enough context on the task, but otherwise you can probably ping itamarWMDE or Lucas_WMDE
[12:48:09] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869
[12:51:22] gehel: sure, looking
[13:00:48] greetings
[13:31:29] dcausse: how can I return a search error?
[13:31:56] ejoseph: you can notify the user with a warning
[13:32:11] searchContext should have addWarning or something
[14:22:13] quick errand, back in ~20
[14:57:52] back
[16:14:22] going to physical therapy, back in ~1 h or so
[16:51:28] \o back around now
[17:10:06] * ebernhardson reconnects Slack and blows up his chat client to ~45 channels. And that doesn't even include the hundreds of threads. Makes my chat client much less usable :(
[17:18:43] ebernhardson: I keep Slack contained in its own browser tab and only check it once in a while. Maybe it's okay to keep it separate and lower priority (but still priority > 0)
[18:00:34] still unclear what to do about the cloudelastic GC alerts. I restarted 1003 yesterday at 21:24, which made its GC happier, but 1001, 1005, and 1006 all transitioned their old-pool graphs from mostly OK to problematic at the same moment
[18:06:11] continuous rolling restart? :P
[18:07:42] seems highly related to the size of the term dictionaries, maybe there is some way to keep fewer? Maybe we have to drop some of the analyzed fields?
[18:08:59] or maybe we hope that 7.10 is magic: https://www.elastic.co/blog/significantly-decrease-your-elasticsearch-heap-memory-usage
[18:17:06] back
[18:18:24] nice
[19:42:29] gehel: I accidentally lied to you :(. The swift creds **are** in puppet ( https://gerrit.wikimedia.org/r/c/labs/private/+/803287/1/hieradata/common/profile/thanos/swift.yaml ). Working on a fix now.
[19:42:57] yes, but in another profile. It might make sense to duplicate them in this case
[19:48:11] I'm OK w/ that... less work for me ;). I am going to check how we do it for the wdqs_flink user though
[19:50:31] ryankemper ^^ FYI
[20:18:46] and the answer is ....
cross cluster configuration for chi(9200) -> omega refers to all decommissioned hosts
[20:19:07] * ebernhardson will some day remember that ellipsis is three dots, and does not get spaces around it like a word
[20:24:03] ebernhardson: the answer to the garbage collector alerts? or all those failed messages on the dashboard?
[20:24:09] inflatador: the 1k errors/min
[20:24:25] ACK, that makes more sense
[20:25:01] I gave up on the garbage collector, there really isn't a lot that can be done :P Could try manually moving shards away from the instances with the biggest problems, but that's a game of whack-a-mole
[20:25:25] I'm into the "pray 7.10 is magic" approach for now, if not we will have to more aggressively shrink the term dictionaries (how? I dunno. I bet David does :)
[20:26:22] we could also throw another 10G of memory into the heap I guess
[20:26:48] but I completely failed to account for things when pondering that, 4 instances only have 128G of memory so it's a bit of a robbing-Peter-to-pay-Paul situation
[20:27:30] (I mean David and I briefly discussed increasing cloudelastic memory at the offsite, but I had bad information about the current state)
[20:30:20] it wouldn't hurt to add some memory to the heap I guess. Although you'd know more than me as far as historical memory usage
[20:32:05] maybe we **should** revisit the whole "dedicated masters" thing if ES7 doesn't save our hide
[20:37:45] hmm, I dunno how to fix cross cluster though :S We set `persistent.cluster.remote.omega.seeds` correctly, but my best guess is it's reading `persistent.search.remote.omega.seeds`, which is from elastic 5.x and still supported for BC
[20:38:14] and we can't change persistent.search.remote.omega.seeds, because it's no longer the correct way to configure that :S
[20:38:21] (or I could be completely missing something, also happens :)
[20:39:22] I wanna say that David ran into something similar, maybe even the same setting...
[20:40:22] Elastic has a helpful page on how to force-change cluster settings, step 1: turn the whole cluster off :P
[20:41:19] I guess you only have to turn off the master nodes... but same effect
[20:51:21] * ebernhardson goes hunting for wherever this BC was implemented in elastic...
[20:51:59] Turn off all masters at the same time... guess we should sort out our restore procedures first ;P
[20:53:59] unrelated, but do you know how we get the thanos swift creds into flink itself? I found the creds in puppet but unsure where/how they get into flink
[20:54:00] that process also isn't documented for 6.x, might not be supported
[20:54:13] inflatador: 42 nested templates
[20:54:52] inflatador: more seriously, it's going to be buried in a helmfile, at the end of the day I think all those templates end up creating a docker volume containing creds that is mounted to the container
[20:56:18] inflatador: grep through operations/deployment-charts for swift_api_key
[20:56:49] that's OK, just looking at how we can plumb through the swift creds for backup ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/807623 ). Looks like we have something already, but was curious if there's a better way. If it's in k8s in a separate repo, probably not too useful ;)
[21:07:34] looks like wdqs1006 cleared up on its own, unless someone did something I'm not aware of
[21:07:40] not I
[21:18:03] PIDs for the blazegraph process did change, maybe it was the auto restart timer or something?
no one logged in
[21:18:17] anyway, not a huge deal either way
[21:42:56] well, I have no idea what fixed it, but after sending dozens of basically the same request over and over again, it eventually decided to configure cross cluster with the new seeds. But the old ones are still in there
[21:43:29] maybe it actually did delete all the cross-cluster configuration when I told it to, even though it didn't remove the BC settings from the cluster settings
[21:44:06] (I suspect there remains a ticking time bomb with those settings still there :P)
[21:44:59] failures down to 5/min
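
For the `Large job (mem > 420g) should not be using the default pool` error and the `pool='sequential'` answer from the 08:39–09:06 exchange, here is a minimal sketch of where that setting would go, assuming the job is scheduled through Airflow's SparkSubmitOperator (pools are an Airflow scheduler concept). The DAG name, task name, script, and Spark resources are invented for illustration; only the pool name comes from the log.

```python
# Sketch: assigning a large Spark job to the 'sequential' Airflow pool instead
# of the default one. DAG/task names, script, and resources are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="eventgate_tables_export",       # hypothetical DAG name
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export_tables = SparkSubmitOperator(
        task_id="export_eventgate_tables",  # hypothetical task name
        application="export_tables.py",     # hypothetical job script
        conf={"spark.dynamicAllocation.maxExecutors": "100"},
        executor_memory="8g",
        # Jobs whose total memory exceeds 420g must not use the default pool;
        # 'sequential' is the only other pool mentioned in the chat.
        pool="sequential",
    )
```

Since `pool` is a base operator argument, the same one-liner applies regardless of which operator actually submits the job.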
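
The "manually moving shards away from the instances with the biggest problems" option mentioned at 20:25 would look roughly like the sketch below, using Elasticsearch allocation filtering through the `_cluster/settings` API. The host and node names are hypothetical, and as noted above this only relocates shards; it does nothing about the term-dictionary size driving the heap pressure.

```python
# Sketch of the shard "whack-a-mole" option: exclude one hot node by name so
# the cluster relocates its shards elsewhere. Hostnames/node names are made up.
import requests

ES = "http://cloudelastic1001.example:9200"  # hypothetical coordinating host

# Exclude one node; a transient setting so a later cleanup (or full restart)
# clears it again.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": "cloudelastic1005-chi"}},
)

# Watch relocation progress, then remove the exclusion once shards have moved.
print(requests.get(f"{ES}/_cat/shards", params={"h": "index,shard,node,state"}).text)
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": None}},
)
```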
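
For the cross-cluster seeds problem discussed from 20:37 onward, this is a sketch of how the relevant settings could be inspected and updated through the same `_cluster/settings` API. Hostnames and seed addresses are invented; whether the legacy 5.x-era `search.remote.*` keys can actually be removed on this 6.x cluster is exactly the open question in the log, so the removal attempt is hedged accordingly.

```python
# Sketch: inspect and update cross-cluster seeds via the Elasticsearch
# _cluster/settings API. Hostnames and seeds below are hypothetical.
import json
import requests

ES = "http://cloudelastic1001.example:9200"  # hypothetical coordinating host

# 1. Read persistent settings with flat keys so the current
#    cluster.remote.omega.seeds and any leftover search.remote.omega.seeds
#    (kept for backwards compatibility) show up side by side.
settings = requests.get(f"{ES}/_cluster/settings", params={"flat_settings": "true"}).json()
print(json.dumps(settings.get("persistent", {}), indent=2))

# 2. Point the omega remote at current hosts. Sending null for a key asks
#    Elasticsearch to drop it; per the log, the deprecated key may be rejected
#    or silently kept, so check the response and re-read the settings after.
body = {
    "persistent": {
        "cluster.remote.omega.seeds": [
            "cloudelastic1005.example:9500",  # hypothetical new seed
            "cloudelastic1006.example:9500",
        ],
        "search.remote.omega.seeds": None,
    }
}
resp = requests.put(f"{ES}/_cluster/settings", json=body)
print(resp.status_code, resp.text)
```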