[07:50:46] dcausse: can we reschedule our meeting for 30 mins today, I am out on an errand
[07:51:22] ejoseph: sure, please move the meeting when you want on the calendar
[09:13:07] Early lunch break and another quick run to the hospital with Oscar
[10:45:42] lunch
[15:01:46] dcausse, ryankemper, inflatador, ejoseph : triaging is starting: https://meet.google.com/eki-rafx-cxi
[15:02:28] oops
[15:48:03] greetings
[15:49:31] inflatador: you might want to check -dcops and see the message from papaul. He needs to do work on a wdqs and maps machine.
[15:49:52] ACK, will take a look. Thanks RhinosF1
[16:00:34] I just realized I didn't actually schedule the elasticsearch kickoff for q4. Does anybody think we still need one after today's Asana check in?
[16:01:14] mpham: fine to skip for me
[16:08:34] errand
[16:11:54] I'm looking at T304437. I recall hearing that for federation endpoints, we just review patches people send to us. But it sounds like the author here is waiting for us to move something along. I think we don't have a formal process, but what's the informal process here?
[16:11:54] T304437: Allow federated queries with cellar endpoint of the Publication Office and European Commission - https://phabricator.wikimedia.org/T304437
[16:13:49] mpham: i think it's just updating some list. Don't really know where that list is, but can probably find it from some old tickets and related patches
[16:15:15] ok, cool. just wanted to understand the process. I'll let them know it's on our ready for dev and we'll get to it soon
[16:19:23] it probably needs a one line change like this and then a deploy, can prepare it easy enough: https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/699746/
[16:20:50] if I need to deploy LMK, happy to help
[16:25:19] inflatador: i put a patch up, https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/779069 can probably deploy with other changes if we have things going later in the week
[17:18:29] quick errand, back in ~20
[17:23:01] should we restart wdqs instances that are alerting? wdqs100[47] both have active GC death spiral alerts. Perhaps we are letting jvmquake do that now?
[17:40:48] hmm, while reviewing the elasticsearch docs on version upgrades i note they suggest upgrading master-eligible nodes last. I don't think we have any support for that yet?
[17:49:33] and back
[17:52:07] ebernhardson re: wdqs I'm not sure if that alert is properly tuned, let me check the individual nodes (or if you have and confirmed that they are good alerts LMK)
[18:14:30] inflatador: i didn't actually check, i suppose i should have. Was just going through a week's worth of emails and it was at the end.
[18:15:49] quick look at the graphs doesn't look concerning, for a GC death spiral i would expect to see rising old GC/hr and here it's fairly typical 0 or 1 per hour
[18:16:00] np, I need to follow up with dcausse and see if these alerts are tuned yet. In the meantime, I'm dealing with Grafana's tiny, disappearing vertical scrollbar ;). From what I can tell on the dashboard, these alerts can be ignored
[18:23:15] ebernhardson re: master-eligible, let me see if I can get that into the cookbook
[18:23:55] inflatador: i don't know we've done it before, but the docs make it sound fairly important. They don't explicitly say but seem to suggest nodes may have trouble joining if the master is a higher version than themselves
[18:24:28] we don't expect to restart nodes while an upgrade is ongoing unless they are to be upgraded, but who knows
[18:25:36] looks like the docs suggest that master-eligible goes last: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html
[18:25:47] ya
[18:26:11] duh, that's what you already said. Sorry
[18:53:08] lunch, back in ~30-45
[19:22:56] back
[19:33:14] lunch
[19:58:39] e-bernhardson for when you get back, do you have any context on why we explicitly set timeouts in the spicerack es script? https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/776999/comment/da4cb0d5_2330e8c2/
[20:15:07] back
[20:16:18] inflatador: hmm, in terms of vol's question the call is indeed blocking, that blocking doesn't come from the python library but rather the underlying elasticsearch http query. that query supports the timeout parameter: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/cluster-health.html
[20:16:47] i suppose i should look, but i'm assuming the python library passes those query params along to elastic without considering them
[20:17:36] yeah, I was going to test that out myself...more curious about whether or not you think we need to pass that timeout value
[20:18:40] inflatador: hmm, probably depends on why it was added. I could imagine that as a workaround for an http client that times out early or something, telling elastic to error before the client drops the connection for not receiving anything
[20:18:43] looking
[20:19:00] np. Also, how can you tell whether or not a particular call is blocking?
[20:20:11] inflatador: i don't know about in the general case, but elasticsearch uses the wait_for_* query parameters in a couple places as a convention to say don't return from the call until complete.
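To make the wait_for_* convention above concrete, here is a minimal sketch of a cluster health request that asks Elasticsearch to hold the response until a status is reached or a timeout expires. The endpoint and the `wait_for_status`, `timeout` and `timed_out` names come from the cluster health API; `health_url`, `poll_until_ready` and the injected `fetch` callable are hypothetical helpers for illustration, not what spicerack does. It also assumes `fetch` returns the JSON body even when the wait timed out, which a real HTTP client may instead surface as an error.

```python
from urllib.parse import urlencode


def health_url(base_url, wait_for_status="green", timeout="30s"):
    """Build a cluster health URL that asks Elasticsearch to hold the response.

    The blocking happens on the server side: the client only forwards
    the wait_for_status and timeout query parameters.
    """
    query = urlencode({"wait_for_status": wait_for_status, "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def poll_until_ready(fetch, base_url, attempts, **params):
    """Call fetch(url) up to `attempts` times; stop once a wait did not time out.

    `fetch` is a hypothetical callable returning the decoded JSON body as a dict.
    A missing "timed_out" key is treated as a timeout, to stay on the safe side.
    """
    url = health_url(base_url, **params)
    for _ in range(attempts):
        body = fetch(url)
        if not body.get("timed_out", True):
            return True
    return False
```

A short server-side timeout combined with repeated calls keeps each HTTP request brief while still allowing the overall wait to run much longer than a single request.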
[20:20:31] i suppose that means it's case-by-case checking the docs for APIs on the elastic side
[20:22:47] Gotcha. I am reading thru the API calls, but my expectations for a giant blinking "this is blocking" tag were perhaps excessive ;P
[20:27:08] inflatador: some comments from ge.hel about that timeout at bottom of this file (patch set 5, if the link doesn't work right): https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/456322/5/spicerack/wmf_elasticsearch.py
[20:27:24] inflatador: no info really on why we pass the timeout though
[20:28:44] without passing the timeout it would default to 30s, so we need the @retry regardless as it might take an hour
[20:30:36] as for the exact value, 1s feels short and i wonder why we don't take the default, but in practice it probably doesn't matter that much
[22:10:46] ebernhardson got time for a quick chat re: ES upgrade logic to do the master-eligibles last?
[22:10:52] inflatador: sure
[22:11:41] thanks! https://meet.google.com/nkh-xejx-ofi
[22:20:38] ebernhardson https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/776999
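
The "master-eligibles last" ordering discussed above can be sketched as a stable sort over the node list. This is only an illustration under stated assumptions: the `Node` class, the `upgrade_order` function and the host names in the usage example are hypothetical and are not taken from the spicerack cookbook or the linked patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    master_eligible: bool


def upgrade_order(nodes):
    """Return the nodes with master-eligible ones moved to the end.

    False sorts before True, and sorted() is stable, so the relative
    order inside each group is preserved.
    """
    return sorted(nodes, key=lambda node: node.master_eligible)
```

For example, given hypothetical hosts `es1` (master-eligible), `es2` and `es3` in that order, `upgrade_order` would yield `es2`, `es3`, `es1`, so data-only nodes are upgraded before any node that could be elected master.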