[07:50:46] <ejoseph>	 dcausse: can we reschedule our meeting for 30 mins today, I am out on an errand
[07:51:22] <dcausse>	 ejoseph: sure, please move the meeting when you want on the calendar
[09:13:07] <gehel>	 Early lunch break and another quick run to the hospital with Oscar
[10:45:42] <dcausse>	 lunch
[15:01:46] <gehel>	 dcausse, ryankemper, inflatador, ejoseph : triaging is starting: https://meet.google.com/eki-rafx-cxi
[15:02:28] <dcausse>	 oops
[15:48:03] <inflatador>	 greetings
[15:49:31] <RhinosF1>	 inflatador: you might want to check -dcops and see the message from papaul. He needs to do work on a wdqs and maps machine.
[15:49:52] <inflatador>	 ACK, will take a look. Thanks RhinosF1 
[16:00:34] <mpham>	 I just realized I didn't actually schedule the elasticsearch kickoff for q4. Does anybody think we still need one after today's Asana check in?
[16:01:14] <dcausse>	 mpham: fine to skip for me
[16:08:34] <dcausse>	 errand
[16:11:54] <mpham>	 I'm looking at T304437. I recall hearing that for federation endpoints, we just review patches people send to us. But it sounds like the author here is waiting for us to move something along. I think we don't have a formal process, but what's the informal process here?
[16:11:54] <stashbot>	 T304437: Allow federated queries with cellar endpoint of the Publication Office and European Commission - https://phabricator.wikimedia.org/T304437
[16:13:49] <ebernhardson>	 mpham: i think it's just updating some list. Don't really know where that list is, but can probably find it from some old tickets and related patches
[16:15:15] <mpham>	 ok, cool. just wanted to understand the process. I'll let them know it's on our ready for dev and we'll get to it soon
[16:19:23] <ebernhardson>	 it probably needs a one line change like this and then a deploy, can prepare it easy enough: https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/699746/
[16:20:50] <inflatador>	 if I need to deploy LMK, happy to help
[16:25:19] <ebernhardson>	 inflatador: i put a patch up, https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/779069 can probably deploy with other changes if we have things going later in the week
[17:18:29] <inflatador>	 quick errand, back in ~20
[17:23:01] <ebernhardson>	 should we restart wdqs instances that are alerting? wdqs100[47] both have active GC death spiral alerts.  Perhaps we are letting jvmquake do that now?
[17:40:48] <ebernhardson>	 hmm, while reviewing the elasticsearch docs on version upgrades i note they suggest upgrading master-eligible nodes last. I don't think we have any support for that yet?
[17:49:33] <inflatador>	 and back
[17:52:07] <inflatador>	 ebernhardson re: wdqs I'm not sure if that alert is properly tuned, let me check the individual nodes (or if you have and confirmed that they are good alerts LMK)
[18:14:30] <ebernhardson>	 inflatador: i didn't actually check, i suppose i should have. Was just going through a week's worth of emails and it was at the end.
[18:15:49] <ebernhardson>	 quick look at the graphs doesn't look concerning, for a GC death spiral i would expect to see rising old GC/hr and here it's fairly typical 0 or 1 per hour
[18:16:00] <inflatador>	 np, I need to follow up with dcausse and see if these alerts are tuned yet. In the meantime, I'm dealing with Grafana's tiny, disappearing vertical scrollbar ;). From what I can tell on the dashboard, these alerts can be ignored
[18:23:15] <inflatador>	 ebernhardson re: master-eligible, let me see if I can get that into the cookbook
[18:23:55] <ebernhardson>	 inflatador: i don't know we've done it before, but the docs make it sound fairly important. They don't explicitly say but seem to suggest nodes may have trouble joining if the master is a higher version than themselves
[18:24:28] <ebernhardson>	 we don't expect to restart nodes while an upgrade is ongoing unless they are to be upgraded, but who knows
[18:25:36] <inflatador>	 looks like the docs suggest that master-eligible goes last: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html
[18:25:47] <ebernhardson>	 ya
[18:26:11] <inflatador>	 duh, that's what you already said. Sorry
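
The "master-eligible goes last" ordering discussed above could be sketched roughly as follows. This is only an illustration, not the cookbook's actual logic: the function name `upgrade_order`, the node names, and the dict shape (`name` plus a `roles` list containing "master" for master-eligible nodes) are all assumptions made for the example.

```python
from typing import Dict, List


def upgrade_order(nodes: List[Dict]) -> List[str]:
    """Return node names with master-eligible nodes moved to the end.

    Assumes each node dict has a "name" and a "roles" list, and that a
    master-eligible node carries the role "master" (hypothetical shape).
    """
    # sorted() is stable and False sorts before True, so nodes keep their
    # relative order within the non-master and master-eligible groups.
    ordered = sorted(nodes, key=lambda node: "master" in node["roles"])
    return [node["name"] for node in ordered]


if __name__ == "__main__":
    example_nodes = [
        {"name": "es-a", "roles": ["master", "data"]},
        {"name": "es-b", "roles": ["data"]},
        {"name": "es-c", "roles": ["master", "data"]},
        {"name": "es-d", "roles": ["data", "ingest"]},
    ]
    print(upgrade_order(example_nodes))
```

A stable sort keeps whatever ordering the caller already chose (for example by row or rack) and only pushes the master-eligible nodes to the back.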
[18:53:08] <inflatador>	 lunch, back in ~30-45
[19:22:56] <inflatador>	 back
[19:33:14] <ebernhardson>	 lunch
[19:58:39] <inflatador>	 e-bernhardson for when you get back, do you have any context on why we explicitly set timeouts in the spicerack es script? https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/776999/comment/da4cb0d5_2330e8c2/
[20:15:07] <ebernhardson>	 back
[20:16:18] <ebernhardson>	 inflatador: hmm, in terms of vol's question the call is indeed blocking, that blocking doesn't come from the python library but rather the underlying elasticsearch http query. that query supports the timeout parameter: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/cluster-health.html
[20:16:47] <ebernhardson>	 i suppose i should look, but i'm assuming the python library passes those query params along to elastic without considering them
[20:17:36] <inflatador>	 yeah, I was going to test that out myself...more curious about whether or not you think we need to pass that timeout value
[20:18:40] <ebernhardson>	 inflatador: hmm, probably depends on why it was added. I could imagine that as a workaround for an http client that times out early or something, telling elastic to error before the client drops the connection for not receiving anything
[20:18:43] <ebernhardson>	 looking
[20:19:00] <inflatador>	 np. Also, how can you tell whether or not a particular call is blocking?
[20:20:11] <ebernhardson>	 inflatador: i don't know about in the general case, but elasticsearch uses the wait_for_* query parameters in a couple places as a convention to say don't return from the call until complete.
[20:20:31] <ebernhardson>	 i suppose that means it's case-by-case checking the docs for api's on the elastic side
[20:22:47] <inflatador>	 Gotcha. I am reading thru the API calls, but my expectations for a giant blinking "this is blocking" tag were perhaps excessive ;P
[20:27:08] <ebernhardson>	 inflatador: some comments from ge.hel about that timeout at bottom of this file (patch set 5, if the link doesn't work right): https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/456322/5/spicerack/wmf_elasticsearch.py
[20:27:24] <ebernhardson>	 inflatador: no info really on why we pass the timeout though
[20:28:44] <ebernhardson>	 without passing the timeout it would default to 30s, so we need the @retry regardless as it might take an hour
[20:30:36] <ebernhardson>	 as for the exact value, 1s feels short and i wonder why we don't take the default, but in practice it probably doesn't matter that much
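
The pattern described above (a short server-side timeout on a blocking wait_for_* health call, wrapped in a client-side retry) might look roughly like the sketch below. It is not the spicerack code: `wait_for_green`, `fetch_health`, and the response shape (a dict with `timed_out` and `status` keys) are assumptions for illustration, and the real HTTP call is replaced by a caller-supplied function so the example runs without a cluster.

```python
import time
from typing import Callable, Dict


def wait_for_green(
    fetch_health: Callable[[str], Dict],
    server_timeout: str = "1s",
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> int:
    """Poll until the cluster reports green; return the attempts used.

    fetch_health is a hypothetical callable that issues one blocking
    health request with the given timeout and returns the parsed body.
    """
    for attempt in range(1, max_attempts + 1):
        health = fetch_health(server_timeout)
        # A request that gave up server-side is treated as "not yet",
        # not as a failure; the loop simply asks again.
        if not health.get("timed_out") and health.get("status") == "green":
            return attempt
        time.sleep(delay_seconds)
    raise TimeoutError(f"cluster not green after {max_attempts} attempts")


if __name__ == "__main__":
    responses = iter([
        {"timed_out": True, "status": "yellow"},
        {"timed_out": False, "status": "green"},
    ])
    print(wait_for_green(lambda timeout: next(responses)))
```

With this shape the per-request timeout only controls how long each individual call blocks; the total wait is bounded by the retry settings, which fits the remark that the exact timeout value probably does not matter much in practice.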
[22:10:46] <inflatador>	 ebernhardson got time for a quick chat re: ES upgrade logic to do the master-eligibles last?
[22:10:52] <ebernhardson>	 inflatador: sure
[22:11:41] <inflatador>	 thanks! https://meet.google.com/nkh-xejx-ofi
[22:20:38] <inflatador>	 ebernhardson https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/776999