[05:33:47] > If a unit has a Conflicts= setting on another unit, starting the former will stop the latter and vice versa.
[05:34:18] I think that's not quite what we want because puppet will start the es 7 units and then the es 6 units will get stopped
[05:35:24] There's probably some other option that means "don't start unit X if unit Y is running" though
[08:01:41] disabling the export queries to relforge dag while we upgrade elastic-hadoop to 7.10
[10:00:00] Lunch
[10:54:38] lunch
[13:05:50] gretings
[13:05:57] or greetings
[14:00:34] o/
[14:16:51] Retrospective time! https://meet.google.com/eki-rafx-cxi (inflatador)
[15:29:11] dcausse: any thoughts on jvm 11 vs 15 on elasticsearch? they shipped 15 with 7.10 but we went with the system jvm (11) for now
[15:29:57] i'm not sure how much it matters, i'm only really familiar with 15 getting new GC's
[15:29:59] ebernhardson: not really I think jvm 11 is fine (tbh I stopped following java updates)
[15:30:29] good enough for me i suppose :)
[15:30:49] you mean the ZGC (just discovered about that)?
[15:31:10] i'm not entirely sure what i mean, but i've seen numerous references in only things over the last years about advances in java GC
[15:31:17] s/only thing/online things/
[15:32:03] i suppose 11 isn't that old though, i have to rework my mental model about java version numbers. 11 is from 2018
[15:32:25] :)
[15:33:19] randomly, also surprised to see in the support tables that java 8 has non-commercial free updates from oracle until 2030
[15:33:52] yes I think that java6 is still supported if you pay (a lot I suppose)
[15:34:39] ah no
[15:35:05] yea, looks like azul.com (totally unfamiliar with them) are offering builds of openjdk 6 until 2027
[15:35:38] and i imagine with enough money they could continue as long as necessary :)
[15:35:53] yes :)
[15:51:43] ryankemper looking at the patches now
[15:57:22] * ebernhardson now remembers he didn't use .flatMap(authorizedSessions::getIfPresent) because that throws IOException, and java hates generic handling of functions that throw exceptions :P
[15:58:10] yes they added UncheckedIOException for this
[15:59:27] ahh, i guess i could transform it, or does lombok have a magic annotation for that?
[15:59:44] * ebernhardson can probably check docs
[15:59:52] lombok has SneakyThrows
[16:00:17] but if you own the method perhaps you can just throw UncheckedIOException
[16:01:09] the initial throw comes from the httpClient, but we are already catching it to check for 404, i suppose can throw the unchecked version
[16:01:31] yes
[16:02:36] can't compile wdqs... it fails on a spark test of the subgraph analysis, not sure what I messed up :/
[16:02:56] percentile(`count`, array(0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9), 1L)' due to data type mismatch: Percentage(s) must be between 0.0 and 1.0, but got cast(array(0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9) as array
[16:03:18] hmm, i managed to do a full compile yesterday against the full rdf repository.
[16:03:22] huh, that is odd
[16:03:47] yes jenkins is happy so it must be on my side...
[16:04:32] might locale related perhaps
[16:04:37] *be
[16:05:32] could locale changes cause some decimal separator to write out with commas instead of periods? i suppose that's plausible but i had never considered it
[16:05:55] if it gets written into a string that then gets executed as SQL
[16:07:19] yea, i suppose that would be it. in SubgraphUtils.getPercentileExpr
[16:08:15] yes must be it
[16:08:45] quick looking around suggests f-strings can't have a locale specified. Do f-strings need to be banned by linters in favor of String.formatLocale or some such?
[16:09:16] yes seeing this is as well
[16:09:35] sounds cumbersome but that's the sole option?
[16:10:12] indeed, seems like a big backwards step in ergonomics. Can otherwise try and set the locale properly at all the right places, but seems more error prone
[16:10:50] perhaps spark has some builders?
[16:11:00] instead of building the sql by hand
[16:11:30] hmm, yea in python i always tried to prefer using the builders unless the api totally didn't support what i was doing (like higher-order functions)
[16:11:47] yes compilation works with LC_ALL=C
[16:11:47] this ought to be doable with functions directly
[16:12:44] would have to poke around all the places getPercentileExpr is used though, might be a lot of refactoring
[16:13:13] i've certainly noticed when other teams write spark they seem to prefer formatting sql strings over constructing the object-graph
[16:15:54] yes if it implies a bigger refactoring I might just add a call to formatLocale or set the env from maven
[16:19:13] ryankemper ebernhardson I'm thinking we should have a "Conflicts=" in the ES6 and 7 unit files in addition to the ConditionPathExists in the PR. Should make it a little easier to roll back. I'm going to try this in deployment-prep, but if you have any opinions yes or no LMK
[16:19:58] inflatador: ryan's comment from earlier suggested it might not work, but i'm not opposed to trying it in deployment-prep and verifying what happens
[16:20:48] inflatador: "If a unit has a Conflicts= setting on another unit, starting the former will stop the latter and vice versa."
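
A rough illustration of the locale failure diagnosed at 16:04 to 16:15, sketched in Java because Scala's f-interpolator is, as far as I know, built on the same java.util.Formatter machinery. The class and the method `percentileExpr` are hypothetical stand-ins; the real SubgraphUtils.getPercentileExpr is not shown in this log and may look quite different. The point is only that formatting decimals with a comma-separator locale turns `0.1, 0.2` into `0,1, 0,2`, which SQL then reads as the integers 0, 1, 0, 2, matching the error pasted at 16:02:56.

```java
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PercentileExprDemo {

    // Hypothetical helper: builds a SQL percentile expression for the
    // deciles 0.1 .. 0.9, formatting each number with the given locale.
    public static String percentileExpr(Locale locale, String column) {
        String percentages = IntStream.rangeClosed(1, 9)
                // String.format(Locale, ...) picks the decimal separator
                // from the locale; the overload without a Locale uses the
                // JVM default, which is what the environment controls.
                .mapToObj(i -> String.format(locale, "%.1f", i / 10.0))
                .collect(Collectors.joining(", "));
        return "percentile(`" + column + "`, array(" + percentages + "))";
    }

    public static void main(String[] args) {
        // Comma as decimal separator: the array no longer holds fractions.
        System.out.println(percentileExpr(Locale.FRANCE, "count"));
        // Locale.ROOT always uses '.', independent of LC_ALL or user settings.
        System.out.println(percentileExpr(Locale.ROOT, "count"));
    }
}
```

Pinning the locale at the call site (Locale.ROOT here) is the per-call counterpart of the LC_ALL=C workaround mentioned at 16:11:47; using Spark's column functions instead of hand-built SQL strings, as suggested at 16:10:50, would sidestep string formatting altogether.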
[16:21:10] inflatador: which would mean when puppet tries to start es7 it will stop es6 first
[16:21:20] ebernhardson cool, will give it a try. Also my original comment about "Conflicts=" was made before I saw the PR. I think we need both
[16:21:42] not **just** conflicts= but also the ConditionPathExists
[16:23:08] I dunno, maybe I'm being too reactive to the situation we had yesterday. Let me kick it around in DP and see what happens
[16:24:22] inflatador: certainly what happened yesterday is something we want the automation to never do :) Although on review it sounded like we expected it to do what it did, we simply didn't include the understanding that starting the jvm would allocate the full heap during jvm startup, and in cloudelastic's case we didn't have enough memory for that
[16:25:04] inflatador: when you have some time https://gerrit.wikimedia.org/r/c/operations/puppet/+/826589/ should be good to go
[16:25:05] 5 and 6 worked as we expected them to because they are 256G machines, while 1-4 failed because they are 128G
[16:25:19] Yeah, I'm trying to avoid the situation where we have to roll back and the flag file already exists to allow ES7 to start
[16:25:32] dcausse ACK, :eyes
[16:25:54] should fix an annoying bug on wcqs
[16:27:23] for puppet deploy window i also have https://gerrit.wikimedia.org/r/c/operations/puppet/+/825925 for wcqs, that's half the fix for the 500 errors (the other half is mostly ready but is getting some polishing in java-land and will require a deploy)
[16:27:37] the "updater" should be restarted after deploying it (but I can take care of that later tonight)
[16:28:14] going offline, might be back later in the night
[16:28:14] doh, actually my puppet patch is incomplete ...sec
[16:30:47] patch updated (it was missing the part that re-adds Cookie headers to the internal auth requests)
[16:34:33] Will be @ the deploy window in a few mins
[16:46:33] dcausse patch is merged, working on ebernhardson 's
[16:56:11] inflatador: for mine if we can still load https://commons-query.wikimedia.org/ then it didn't break everything, and if we can POST to /sparql and get a failure quickly instead of after 30s then it did what we wanted it to do
[16:56:56] ebernhardson it's merged, feel free to give it a try and LMK if I need to roll back
[16:57:19] inflatador: is puppet also run on all the servers? It's still stalling :(
[16:57:37] i suppose i probably land on the codfw servers
[16:58:48] doesn't look like puppet ran yet, i suppose i can do that part
[16:59:14] ebernhardson oh good point, I can do that
[16:59:40] sed s/can/should have/g
[16:59:52] no worries :)
[17:00:18] i'm just running it in codfw, since my geo queries should land there. It doesn't actually fix the root problem so no rush running it in eqiad, it can land whenever. It needs the second half of the fix to be complete
[17:00:30] It's for both wqds and wcqs?
[17:00:41] err.wdqs that is
[17:00:56] inflatador: only wcqs, it only affects nginx where auth is enebabled
[17:01:00] *enabled
[17:02:50] hmm, curiously my response headers say i'm getting wcqs1001. Really expected my geo-queries to land in dallas
[17:03:21] ran puppet for the 3 servers in eqiad too, it now fails quickly with a 307 redirect to oauth. so works as expected :)
[17:03:46] ACK, just ran puppet on all wcqs hosts
[17:07:10] btw rather than a `Conflicts=` we could probably just take this approach: https://serverfault.com/a/1047958
[17:12:24] ryankemper nice find. I'm just surprised that systemd doesn't seem to handle this natively. Seems like a fairly obvious use case
[17:12:43] Also, does anyone have any concerns about me pointing this to a host that actually exists? https://github.com/wikimedia/puppet/blob/production/hieradata/cloud/eqiad1/deployment-prep/common.yaml#L251
[17:12:57] (working on the deployment-prep upgrade too)
[17:15:22] No objections
[17:21:18] inflatador: the problem with updating deployment-prep to 7.10 is cirrus won't be compatible with querying 7.10 until we merge the es710 branches on monday (after the weekly branch cut)
[17:21:45] i'm not sure how broken it would be, probably not super broken but also not everything will work
[17:22:00] ebernhardson oh yeah, good point. Would that potentially cause trouble for other teams?
[17:22:14] inflatador: well, the other teams should be used to beta cluster being broken :)
[17:22:34] LOL, I'll take that as a "refocus efforts on cloudelastic for now"
[17:22:58] probably for the best :)
[17:27:44] will merge ryankemper 's patches prior to the pairing session and we can go from there
[17:40:13] ryankemper I don't see the cookbook patch, your link above is from one that was already merged
[17:41:01] typical
[17:41:04] inflatador: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/826397
[17:55:32] ryankemper ACK, both patches merged and puppet ran on cumin hosts. Going to lunch but will be back in time for SRE pairing
[17:55:46] ack
[18:30:36] ~3 mins late to pairing
[20:08:52] * ebernhardson has no clue if the way i'm setting up the streaming-updater java project is correct,
[20:09:21] but `./mvnw verify` does something, so it's probably not 100% wrong :P
[20:13:28] in particular i suspect the testTools module in the rdf repo doesn't quite do what i thought, somehow having a dependency in testTools against junit with the compile scope allows junit to be available to the other modules to run the tests, but having an assertj dependency there doesn't make assertj available to the other modules to use in their tests and has to be re-declared per module
[20:13:30] * ebernhardson shrugs
[20:43:03] quick break, back in ~15
[20:43:43] (looks like the cookbook made it through without crashing btw)
[20:43:52] indeed
[20:44:00] 4th time was the charm
[21:06:48] back
[21:19:14] ebernhardson FYI, cloudelastic is on ES 7 now...not sure if you want to wait a week, but I'm pretty confident we can start Monday. We can talk about it with Ryan further tomorrow if you like
[21:21:56] inflatador: cool! if the cirrus side jobs aren't failing (shouldn't) then we are good to go next week :)
[21:22:14] jobqueue dashboards look sane, nothing silly like a bunch of writes failing and re-queueing, building up a backlog: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite
[21:22:37] well, i suppose there is a minor backlog right now, will check again in an hour
[21:23:30] hmm, bunch of new deprecation messages in the logs though, 400k per 10 min as of an hour ago :S
[21:23:39] those aren't from cloudelastic though
[21:25:02] cirrus doesn't look to be complaining about cloudelastic, all looks happy enough on that front
[21:26:43] oh, sigh...the warnings in codfw are mjolnir ML jobs via the norm clustering, which clusters based on search results. Not a big deal, can always pause it for a few weeks (or try and fix it this afternoon, will check how hard it is. Probably not bad at all)
[21:27:19] anyways, all looks happy for 7.10 merges and rollout to codfw hosts next week, and i've gotta go do a school run now
[21:27:25] ACK
[22:08:04] I'm out for the day, but if anyone is bored and wants to review a trivial PR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/826630
[22:11:29] back
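
For reference, the 15:57 to 16:01 exchange about `.flatMap(authorizedSessions::getIfPresent)` could look roughly like the sketch below. Everything here is an assumption for illustration: `SessionLookupDemo`, `SESSIONS`, `fetchChecked` and `lookup` are made-up names, and an in-memory map stands in for the real httpClient call, which is not shown in the log. It demonstrates the suggestion from 16:00:17: catch the checked IOException inside the method you own and rethrow it as UncheckedIOException, so the method reference fits the `Function` that `Optional.flatMap` expects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SessionLookupDemo {

    // Hypothetical session store standing in for the remote auth service.
    public static final Map<String, String> SESSIONS = new HashMap<>();

    // Declares a checked exception, so SessionLookupDemo::fetchChecked
    // cannot be passed to Optional.flatMap directly.
    static Optional<String> fetchChecked(String token) throws IOException {
        if (token.isEmpty()) {
            throw new IOException("simulated transport failure");
        }
        return Optional.ofNullable(SESSIONS.get(token));
    }

    // Wrapper with no checked exceptions: the IOException is preserved
    // as the cause of an UncheckedIOException.
    public static Optional<String> getIfPresent(String token) {
        try {
            return fetchChecked(token);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The method reference now compiles as a flatMap argument.
    public static Optional<String> lookup(Optional<String> cookie) {
        return cookie.flatMap(SessionLookupDemo::getIfPresent);
    }

    public static void main(String[] args) {
        SESSIONS.put("abc", "user1");
        System.out.println(lookup(Optional.of("abc")));
        System.out.println(lookup(Optional.empty()));
    }
}
```

Callers that still care about the I/O failure can catch UncheckedIOException and call `getCause()` to recover the original IOException. Lombok's SneakyThrows, mentioned at 15:59:52, is the alternative that avoids the wrapper at the cost of hiding the exception from the method signature.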