[07:17:20] Been feeling a bit under the weather the last few days so going to try to get a full night's sleep tonight. Means I likely won't be at the `WQS & Blazegraph/Neptune` meeting this morning
[13:04:26] greetings
[13:04:32] feel better soon!
[13:05:17] o/
[13:20:16] \o
[14:44:19] o/
[15:33:16] thanks for the meeting, I'll try to get something up
[15:33:38] re: Java 11, it looks like someone at least got it to work with BG: https://github.com/blazegraph/database/issues/204 . Far from "supported" but I guess it's something?
[15:39:09] if I were to completely randomly guess where to start talking to the amazon/neptune team, maybe this guy: https://github.com/beebs-systap (still comments on blazegraph issues, rarely)
[15:41:56] yeah, was thinking that too, I'll add it to the page
[15:52:14] anyone know what this cookbook is about? https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/831908/
[15:52:44] also, quick workout, back in ~30
[16:04:05] meh, creating the doc page for commons api access says: An automated filter has identified this edit as potentially unconstructive, and it has been disallowed. If this edit is constructive, please report this error.
[16:04:23] aka, the abuse filter doesn't like me :P
[16:14:12] reading the filter, it doesn't make sense how I'm hitting it :S it's an awfully specific filter that targets spambots (and is a private rule, but you can look it up in the commons sql db)
[16:21:08] oh, it didn't like my example token, deadbeefdeadbeef.deadbeefdeadbeef
[16:21:50] plausible wcqs docs for usage as an api: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/API_endpoint
[16:26:55] back
[16:49:05] ebernhardson: that doc is great! Thanks!
[17:17:58] inflatador: wrt the cookbook https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/831908/ , not sure what specifically prompted it, but looking at the SAL there's definitely a lot of restarts of apache/nginx in general https://sal.toolforge.org/production?p=0&q=moritzm&d= so I'd wager this is just something moritzm has wanted for a while
[17:31:53] ryankemper ACK, sounds like it's for security updates specifically
[17:32:14] Also, first draft on the blazegraph stuff, feel free to add/subtract/edit https://office.wikimedia.org/wiki/User:BKing_(WMF)/WQS_and_Blazegraph_Neptune
[17:34:05] inflatador: Just checking, but I assume this is on office wiki because we don't want our brainstorming publicly viewable in this case?
[17:34:16] ryankemper correct
[17:35:29] open to moving it to Google Docs or whatever if that's better
[17:36:23] Nope, office wiki makes sense, just checking
[17:47:24] es7 eqiad upgrade patch is staged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831938
[17:49:59] lunch, back in ~30-45
[18:06:02] ryankemper, inflatador: indeed, e.g. today I've been rolling out libxslt security updates, and to fully apply these nginx needs a restart
[18:06:42] and it's convenient to not have to figure out the procedure each time, but just have it codified
[18:07:35] and there are plenty of other rdeps seeing updates (openssl e.g.), so these tend to get restarted at least monthly
[18:08:03] so it would be appreciated if one of you could review the basic parameters (like whether the batch size and delay between batches are okay)
[18:29:14] back
[19:18:42] moritzm: ack, I threw a +1, I think those params sound fine as a starting point
[19:20:00] ryankemper ebernhardson I halted the upgrade, seeing errors on the first batch
[19:20:34] seeing "ERROR Null object returned for RollingFile in Appenders."
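As background for the WCQS API-endpoint doc linked at 16:21:50, here is a minimal sketch of what programmatic access could look like in Python. The endpoint URL, the `wcqsOauth` cookie name, and the example SPARQL query are assumptions for illustration and are not confirmed anywhere in this log; the token is a placeholder in the same deadbeef format mentioned above.

    import requests

    # Assumptions for illustration (not values from this log): the endpoint URL,
    # the wcqsOauth cookie name, and the example SPARQL query.
    WCQS_ENDPOINT = "https://commons-query.wikimedia.org/sparql"
    TOKEN = "deadbeefdeadbeef.deadbeefdeadbeef"  # placeholder token, same shape as the example above

    QUERY = """
    SELECT ?file ?depicts WHERE {
      ?file wdt:P180 ?depicts .
    }
    LIMIT 5
    """

    session = requests.Session()
    # WCQS requires an OAuth token for API access; presenting it as a cookie
    # named wcqsOauth is an assumption based on the linked doc.
    session.cookies.set("wcqsOauth", TOKEN, domain="commons-query.wikimedia.org")

    resp = session.get(
        WCQS_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wcqs-api-sketch/0.1 (example only)"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["file"]["value"])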
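On the halted first batch just above: the root cause, found further down in the log, was nodes coming back up still on the 6.8 package. A quick way to make a mixed-version cluster visible is the `_cat/nodes` API; a minimal sketch, assuming a placeholder cluster address rather than the real production hosts:

    import requests

    CLUSTER = "https://localhost:9243"  # placeholder address, not a production host

    # List every node with its name and Elasticsearch version; after an upgrade
    # batch, any node still reporting the old version stands out immediately.
    resp = requests.get(
        f"{CLUSTER}/_cat/nodes",
        params={"h": "name,version,node.role", "format": "json"},
    )
    resp.raise_for_status()
    for node in sorted(resp.json(), key=lambda n: n["name"]):
        print(f'{node["name"]:<24} {node["version"]:<8} {node["node.role"]}')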
[19:22:07] yeah, trying to figure out if those are a red herring or not
[19:24:28] Running puppet on `elastic1058` and seeing if that clears things up
[19:24:35] ACK, doing same on 1080
[19:32:13] poked the profiling apis a bit for the regex that was now timing out in es7; it does look perhaps a little slower but nothing completely out there. Disabling timeouts I got results in 12s from eqiad running (mostly) es6, and 16s from codfw running es7. But eqiad is also mostly idle, which could also allow for faster execution. profiling looks quite similar, times vary but the work done is
[19:32:15] all primarily in the `match` section, which is where we run the actual regex
[19:33:07] for a random shard we take ~250ms to select the documents that the regex will run against, then 9s to run the regex
[19:34:01] ebernhardson: would the cross-cluster latency be enough to move the needle? like, are we close enough to the 10s timeout that that makes a difference and it will plausibly go away when we're back to normal operation
[19:34:35] inflatador: so there's some stuff about the remote seeds that seems very plausibly the issue. not sure why it would surface now though, maybe there's something that es does only upon upgrade
[19:34:37] https://www.irccloud.com/pastebin/UA4XvoUW/
[19:35:21] I am confused why the messages mention 6.8.23 though
[19:35:29] Interesting, I'm seeing connection failures
[19:36:02] ryankemper: shouldn't, these timeouts are internal to elasticsearch
[19:36:11] we tell it to cancel after x seconds
[19:36:22] ah, right
[19:36:51] hmm, name or service not known? sounds like dns problems. odd
[19:37:23] Not sure what's happening yet, could be more connection issues like what we saw with the F rows a couple weeks back
[19:37:25] also, isn't 1034 really old and decommed by now? I guess the problem is that host is listed somewhere
[19:37:52] Yes, 1034 is totally inactive. I was trying to see if it was still listed in the remote cluster seeds, but I don't think it is
[19:37:55] oh! I wonder if this is that config key we couldn't reset in es6 ...sec
[19:39:11] yes, search.remote.omega.seeds in cluster state has old hosts in it; it's supposed to be overridden by cluster.remote.omega.seeds, but annoyingly, with the way elastic did their deprecation, we weren't able to remove it from the cluster state. David did some testing and found that es7 will remove it from the cluster state though, as a no-longer-supported value
[19:39:30] it's only for cross-cluster search, and that's only done in response to user search requests, which all go to codfw right now. Should be safe to ignore
[19:40:13] Okay, so presumably we can ignore that for now
[19:40:51] 1080 definitely has the elasticsearch 6.8 package, not 7.10
[19:41:30] `elasticsearch-oss/bullseye-wikimedia 7.10.2 all [upgradable from: 6.8.23]`
[19:41:53] inflatador: so I think we should manually unstick these hosts and then try another batch
[19:42:05] It could be as simple as us having run the cookbook before the second puppet run, so stuff wasn't in a good state
[19:43:01] ryankemper agreed, I'm manually upgrading on 1080 now
[19:44:54] inflatador: cool, doing 1058
[19:45:52] ryankemper services started cleanly after a manual apt-get upgrade for the es pkgs and a puppet run
[19:47:24] inflatador: ack, 1058 done, doing the last host now
[19:50:10] inflatador: okay, all looks good. we can start the next batch whenever we're ready
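Regarding the profiling ebernhardson describes at 19:32-19:33: the production insource regex goes through CirrusSearch's custom source_regex query (from the wikimedia `extra` plugin), which isn't reproduced here, but the `match` section he quotes comes from Elasticsearch's standard search profiler. A generic sketch of enabling it, with placeholder cluster, index, and field names and a plain `regexp` query standing in for the real one:

    import requests

    CLUSTER = "https://localhost:9243"  # placeholder
    INDEX = "enwiki_content"            # assumed index name, for illustration only

    body = {
        "profile": True,     # ask each shard for a per-query timing breakdown
        "size": 0,
        "timeout": "120s",   # effectively disable the usual timeout while measuring
        # Stand-in query: production uses CirrusSearch's source_regex, not a plain regexp.
        "query": {"regexp": {"source_text": "some.?pattern"}},
    }
    resp = requests.post(f"{CLUSTER}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    for shard in resp.json()["profile"]["shards"]:
        for q in shard["searches"][0]["query"]:
            b = q["breakdown"]
            # "match" is where the regex is actually evaluated against candidate
            # docs, which is the section dominating the timings quoted above.
            print(shard["id"], q["type"], b.get("match"), b.get("next_doc"))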
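On the search.remote.omega.seeds vs cluster.remote.omega.seeds issue discussed at 19:37-19:40: two standard endpoints make the current cross-cluster state visible, `_remote/info` and the cluster settings API. A minimal sketch, assuming a placeholder cluster address (the `omega` alias is the one named in the log):

    import json
    import requests

    CLUSTER = "https://localhost:9243"  # placeholder address

    # 1. Which remote clusters (e.g. "omega") does this cluster know about,
    #    which seed hosts is it using, and is it currently connected?
    remote_info = requests.get(f"{CLUSTER}/_remote/info").json()
    print(json.dumps(remote_info, indent=2))

    # 2. Which seed lists are actually stored in the cluster settings, including
    #    any lingering deprecated search.remote.*.seeds keys?
    settings = requests.get(
        f"{CLUSTER}/_cluster/settings", params={"flat_settings": "true"}
    ).json()
    for scope in ("persistent", "transient"):
        for key, value in settings.get(scope, {}).items():
            if ".remote." in key and key.endswith(".seeds"):
                print(scope, key, value)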
[19:50:24] sounds like we need to wedge another puppet run into the cookbook around https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L256 ?
[19:50:32] or do you think it will work now?
[19:55:22] ryankemper does look like the apt repos are correct now, will start again
[19:55:25] inflatador: it'll work now, we just need puppet to have been run twice since the patch was merged, which it now has
[19:55:33] ack
[19:58:32] ryankemper looks like it's making progress now
[20:05:54] quick errand, back in 20
[20:06:05] we're at 12% on the eqiad ES 7 upgrade
[20:23:22] back
[20:46:16] we're up to 18%
[21:13:53] inflatador: occurred to me we've still got `indices.recovery.max_bytes_per_sec: 80mb`, I think we can probably bump it now (should do a quick scan of the NICs first though)
[21:22:43] my mac is stuck in some kind of xcode update loop, which is breaking git
[21:22:47] (╯°□°)╯︵ ┻━┻ THIS IS RIDICULOUS
[21:23:12] did you upgrade the OS first or Xcode first?
[21:23:20] and the cli tools?
[21:25:41] I have a cronjob that runs git every hour...it started complaining that I needed xcode (which I installed long ago), so I followed the prompts and reinstalled it
[21:26:01] then software update popped up and said I needed new versions of xcode, so I installed them
[21:27:01] but it doesn't take, the same thing keeps happening
[21:27:22] weird
[21:27:45] I disabled the cronjob for now, we'll see if that helps anything
[21:30:05] hmm, looks like rubymine might be part of the problem
[21:34:52] Giving up and rebooting, back in a few
[21:39:36] inflatador: I had the same problem with Xcode and git earlier today. I got it to behave by opening the Xcode application itself (among all the other (re)installing and updating and stuff). HTH
[21:40:26] I hate when that sort of crap happens. I don't think I updated anything, but everything is suddenly broken. That's a quality experience!
[21:42:01] Trey314159 yeah, OS 12.6 just came out, and I think WMF blocks the update...but maybe doesn't block Xcode updates
[21:42:13] but I'll try opening it
[21:43:08] I'm seeing some chatter about this in #macports
[22:11:58] I guess it's just my day...my wife's car won't start so I have to take my son to bass practice
[22:12:15] ryankemper could you keep an eye on the upgrade? I'll be gone for at least an hour
[22:13:04] inflatador: yup, I'll be around
[22:54:46] at the 50% mark on the upgrade now
[23:06:10] back
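Following up on ryankemper's note at 21:13:53 about `indices.recovery.max_bytes_per_sec: 80mb`: raising it is a dynamic cluster setting change. A minimal sketch; the cluster address and the 160mb target are placeholders, and as noted in the log the NICs should be checked for headroom first (in production the value may also be managed by the cookbook/puppet rather than set by hand):

    import requests

    CLUSTER = "https://localhost:9243"  # placeholder address

    resp = requests.put(
        f"{CLUSTER}/_cluster/settings",
        json={
            # "transient" keeps the bump from outliving the upgrade window;
            # use "persistent" if the new ceiling should stick across restarts.
            "transient": {"indices.recovery.max_bytes_per_sec": "160mb"}  # example value only
        },
    )
    resp.raise_for_status()
    print(resp.json().get("transient"))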