[06:35:28] hello folks!
[06:35:53] there are several CRITs for elastic eqiad nodes in icinga, all for "ElasticSearch setting check - 9200"
[06:36:06] and port 9600
[06:36:27] probably nothing, but is it related to some in-progress work?
[06:52:28] elukey: thanks! I'll have a look in a minute
[07:15:34] elukey: gehel: ah, I mostly fixed those last week to bring them into alignment w/ the new master configuration, but mixed up a couple of hosts for the omega config. fixing that now
[07:16:31] <3
[07:20:02] thanks for the heads up! those should be resolved
[07:20:23] ryankemper: thanks!
[07:44:41] ejoseph: I had another look at https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/838232. I think we should stay on Java 8 at the moment. Sorry, I know I'm the one who proposed using Map.of() :(
[09:21:16] dcausse: Hey! Would you have some time to chat? I'm trying to do some diagrams of the WDQS Streaming Updater...
[09:21:30] sure
[09:21:52] Oh, I see you have a meeting with Emmanuel in a few minutes. Maybe at 2pm?
[09:22:06] gehel: whenever you want
[09:22:24] I've sent an invite for 2pm. Let's see how that goes
[09:22:32] ok
[09:43:14] g/o jay
[09:43:20] nope! sorry for the noise
[10:37:22] lunch
[12:33:31] ejoseph: time for our 1:1?
[13:11:50] o/
[13:50:12] issue using LocalTime.now: https://www.irccloud.com/pastebin/sOraNyTF/metrics-client-log.txt
[15:09:19] \o
[15:12:54] o/
[15:17:13] looks like CI isn't in a happy place :S
[15:23:47] yes...
[15:35:57] pfischer: I'm en route to the hospital with Oscar. I won't be there for our meeting.
[15:36:02] Let's reschedule
[15:36:42] ryankemper: I'm also probably not going to be back in time for our meetings later today
[15:42:24] gehel: good luck, hope it's nothing serious
[15:44:05] Pretty deep cut. Probably requires needle and thread. Way more than I'm comfortable doing on my own
[15:44:18] But nothing life-threatening!
[15:45:19] pfischer: I'll let you decide if you prefer rescheduling the meeting with Yunita or doing it without me.
[16:07:36] hmm, getting intermittent 503s from codfw while running my script that checks if the reindex is complete (a GET against the index base URL for all indices)
[16:08:13] i can add the retry options in there easily enough, but i wonder why it's not reliable
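(Editor's note: a minimal sketch of the retry wrapper mentioned at 16:08. The base URL, index list, and completeness condition are placeholders — the actual script isn't shown in the log.)

```python
import time
import requests

# Hypothetical values; the real script's endpoint and index list aren't shown in the log.
BASE_URL = "https://search.svc.codfw.wmnet:9243"
INDICES = ["commonswiki_file", "enwiki_content"]

def get_with_retries(url, attempts=5, backoff=2):
    """GET a URL, retrying on transient 5xx responses and connection errors."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp
        except requests.ConnectionError:
            pass
        time.sleep(backoff * attempt)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")

for index in INDICES:
    # A GET against the index base URL returns its settings/mappings; a 200 here
    # only proves the index exists and the cluster answered. The real completeness
    # check would inspect the response body, which the log doesn't describe.
    resp = get_with_retries(f"{BASE_URL}/{index}")
    print(index, resp.status_code)
```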
[16:09:42] this query in logstash starts getting 1-4k hits per 10 min starting 6 hours ago: host:elastic2* AND "no master"
[16:11:03] looks like 2052 is somehow broken
[16:11:30] indeed: PROBLEM - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0
[16:12:47] ryankemper: inflatador: ^
[16:14:24] i've run depool on 2052
[16:15:13] i suppose i should also ban it from cluster allocation, but it already looks to have no shards
[16:19:40] not clear that depool is working to stop traffic coming into 2052. still seeing log errors and still getting intermittent 503s
[16:20:01] yea, still listed at https://config-master.wikimedia.org/pybal/codfw/search-https
[16:22:10] ebernhardson: need superpowers to depool it from conftool?
[16:22:42] hmm, depool runs confctl with --quiet and redirects output to /dev/null. Running conftool directly gets `socket.timeout: timed out`
[16:22:48] volans: please :)
[16:22:53] probably because the host is borked
[16:23:28] {done}
[16:24:24] if the health check was already not responding, though, pybal would have already depooled it
[16:24:37] at the LVS layer (not conftool)
[16:24:41] hmm, guess i should check what we do for health checks
[16:27:50] hmm, if i'm reading service.yaml right we do `https://localhost/`, which can't possibly work (we don't listen on :443)
[16:28:09] so i'm probably reading it wrong :)
[16:28:43] perhaps it overrides with the port from other parts of the config. But that's not a great health check, it just returns the banner without doing anything
[16:31:37] yes, IIRC it uses host=server IP and port=server port
[16:31:53] at least for the http ones
[16:31:56] the https ones not sure
[16:32:11] I guess the same, TLS over port XXX
[16:32:57] yea that would make sense
[16:33:20] it looks like depool is working, the node still complains as it's trying to rejoin the cluster, but it's not rejecting requests anymore
[16:33:28] I see that search_9200 uses "http://localhost/" while all the other searches use the https version
[16:33:31] well, it's not getting requests to reject
[16:33:45] volans: that makes sense, the 9x00 is http, 9x43 is https
[16:33:52] k
[16:34:25] will have to ponder what a better health check endpoint is, maybe /_cat/master, but i have to check the impl
[16:34:45] it still doesn't mean the node is healthy, but for that to respond it would have to be joined to the cluster
[16:37:11] is there any query that would always return something, will never break and is cheap enough to do all the time?
[16:37:12] ebernhardson ACK, will take a look
[16:38:15] volans: not one that would be guaranteed to run against the local instance. `_cat/master?local=true` would be super cheap, but it responds with a 200 and `- - - -` when unhealthy, which doesn't match pybal expectations
[16:39:55] off the top of my head I don't recall if it allows checking for content
[16:42:12] Elastic will route traffic to any node it considers healthy, right? Is depooling from LVS enough?
[16:42:42] inflatador: the node itself right now isn't joined to the cluster, so the failures were lvs sending requests to a node that wasn't able to route requests
[16:42:55] inflatador: but if the node is in the cluster, then yes, it will route it wherever (with some preference for local shards)
[16:42:59] for the disk issue you can use T320482 ofc
[16:42:59] T320482: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482
[16:43:26] annoyingly i'm not seeing good options in the elastic APIs to report local node health
[16:43:37] ebernhardson: and there isn't a query that responds whether the host is part of a cluster or not, I guess
[16:43:40] would be too easy
[16:43:46] or it exists and always returns a 200
[16:44:32] volans: it exists but returns 200. We could do `_cat/master?local=false`, which wouldn't be the end of the world and will 503, but then we are constantly sending requests to the master instance. probably fine, but seems not-ideal
[16:45:17] yeah, definitely not ideal. IIRC pybal should support multiple URLs for the proxyfetch, but I don't recall the logic
[16:45:56] documentation is... "scarce"... :)
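(Editor's note: a sketch of the content-aware check being discussed at 16:38 — `_cat/master?local=true` answers from the node itself but signals "no master" only in the response body, which is exactly what pybal's proxyfetch cannot inspect. Host and port are assumptions.)

```python
import requests

# Hypothetical local check against the node's own HTTP port (9200 assumed here).
# Per the log, `_cat/master?local=true` is cheap but returns 200 with "-" placeholder
# columns when no master is known, so the *body* has to be inspected.
NODE_URL = "http://localhost:9200"

def node_sees_master():
    """Return True if the local node reports an elected master."""
    resp = requests.get(f"{NODE_URL}/_cat/master", params={"local": "true"}, timeout=5)
    if resp.status_code != 200:
        return False
    fields = resp.text.split()
    # Healthy output starts with the master's node id; unhealthy output is "- - - -".
    return bool(fields) and fields[0] != "-"

if __name__ == "__main__":
    print("master known" if node_sees_master() else "no master")
```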
[16:47:45] I assigned myself and ryankemper to the RAID ticket
[16:48:11] that's most likely for dcops in the end, if the disk broke
[16:50:40] sorry, I have to step out...
[16:51:39] volans: thanks for the help!
[16:53:17] v-olans, ACK, sending it their way shortly
[16:53:29] and thanks again!
[17:02:31] apparently the change in banner behaviour happened between 6.x and 7.x: in 6.x it would 503 if there was no master. They suggest using `/_cluster/health?timeout=0s`, which will query the master node, but "Health requests should be pretty cheap, and there are benefits to getting an answer from the master" (maybe benefits in other use cases, but not here :P)
[17:02:55] err, timeout=1s
[17:27:29] checked the pybal/proxyfetch source. it doesn't have any ability to check the output, only http codes. it has an option to run a command instead of doing http, but that seems meh... we probably have to accept querying the master node for all health checks.
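(Editor's note: a sketch of the status-code-only probe the discussion settles on — `/_cluster/health?timeout=1s`, trusting the HTTP code alone, since that is all proxyfetch can act on. The port and the assumption that the endpoint returns a non-200 when no master is reachable come from the log, not from a tested setup.)

```python
import sys
import requests

# Assumed node address; in the pybal case this would be the pooled realserver.
NODE_URL = "http://localhost:9200"

def healthy():
    """Return True if the cluster health endpoint answers with a 2xx.

    With a short timeout the request fails fast when no master can be reached;
    the log implies a non-200 (e.g. 503) in that case, so the status code alone
    is enough -- no body inspection needed.
    """
    try:
        resp = requests.get(f"{NODE_URL}/_cluster/health", params={"timeout": "1s"}, timeout=5)
    except requests.RequestException:
        return False
    return resp.ok

if __name__ == "__main__":
    sys.exit(0 if healthy() else 1)
```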
[17:42:09] hmm, the alert that just fired seems to be about index_not_found_exception. spot-checking a few, they seem to be cross-cluster requests
[17:45:17] hmm, actually it's probably poolcounter, the index_not_found_exceptions just happen to be common too right now :S
[17:58:59] gehel ~5m late to pairing session
[18:01:02] inflatador: still at the hospital, I won't make it for our pairing session
[18:11:16] creating index for igwiktionary
[18:11:25] randomly guessing from logs... pool counter rejections at 17:43-17:44 and then from 17:51-17:52. at 17:42:57 elastic1060 lost connection to the master node, at 17:43:40 it came back. at 17:51:11 elastic1060 again lost connection to the master node, at 17:51:25 it came back
[18:12:03] if a node intermittently falls out, and the other nodes simply sit around waiting for pending shard requests, those requests could take more time and use more poolcounter space
[18:14:51] going to create igwikiquote too
[18:15:22] something's not right with how we create wikis...
[18:15:36] it's not automated, it's a step on a checklist :S
[18:16:19] i forget why, https://wikitech.wikimedia.org/wiki/Add_a_wiki#Search
[18:17:14] because addWiki.php isn't run in the context of the wiki, instead a dummy wiki that uses the same sql shard. ref: https://phabricator.wikimedia.org/T254331
[18:17:16] ebernhardson: Is the (hypothetical) alternative that we could just have mediawiki do it, since it knows about its own wikis, basically?
[18:17:47] ryankemper: it suggests to me that a higher-level automation is needed above the addWiki.php script that they use
[18:21:51] seeing jawikinews missing too, but this one should exist :/
[18:24:18] it actually exists, so it must have been a transient failure?
[18:24:47] gehel ACK
[18:35:58] ebernhardson: are you still reindexing commons?
[18:36:32] seems like yes, seeing it from mwmaint1002
[18:37:19] asked because I was seeing errors in logstash with {"type":"search_phase_execution_exception","reason":"Phase failed","phase":"fetch","grouped":true,"failed_shards":[],"caused_by":{"type":"node_not_connected_exception","reason":"[elastic1062-production-search-eqiad][10.64.48.132:9300] Node not connected"}}
[18:39:10] dinner
[18:43:13] dcausse: yea, i just started up commons reindexing this morning. it had been stalled waiting for last thursday's deploy and i forgot to run it friday
[19:54:51] Back home at last! 3 X-rays and 3 stitches...
[20:01:26] gehel: ouch! I hope the X-rays didn't show any bone damage or the like
[20:03:01] X-rays were only to check if there were pieces of glass in the wound. All clean, only superficial damage
[20:03:59] glad to hear it
[20:23:27] +1 to that
[20:44:02] meh, the commonswiki reindex failed with: {"type":"search_phase_execution_exception","reason":"Phase failed","phase":"fetch","grouped":true,"failed_shards":[],"caused_by":{"type":"node_not_connected_exception","reason":"[elastic1062-production-search-eqiad][10.64.48.132:9300] Node not connected"}}
[20:44:54] * ebernhardson totally didn't realize that's exactly what david posted earlier :P
[20:53:30] as for why... hmm :( the reasons given by the master are a mix of "disconnected" and "follower check retry count exceeded". both suggest networking problems
[20:57:05] perhaps suspiciously, everything that left and rejoined the cluster today was row=D, various racks
[21:05:46] can see in the `network activity` dashboard that there were a few noticeable dips in activity during that time (17:31, 17:44, 17:51), not limited to row D instances. But the dip on non-row-D instances could be from them not getting requests from the row-D instances (maybe)
[21:09:40] supporting the idea of a general network issue, eqiad:restbase_dev had high retransmits (>10%) and alerted at 17:31, 17:44, 17:51. Lining up nicely... not sure what we can do about it; if the network fails, elastic is going to have trouble
[21:27:18] I see some references to row D/eqiad in #wikimedia-sre-foundations, such as https://phabricator.wikimedia.org/T320566
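(Editor's note: one way to make the intermittent drop-outs described above visible is to poll `_cat/nodes` and log membership changes. A rough sketch only; the endpoint, node count, and polling interval are invented for illustration, not part of any tooling mentioned in the log.)

```python
import time
import requests

# Hypothetical values: the real cluster endpoint and node count would differ.
CLUSTER_URL = "https://search.svc.eqiad.wmnet:9243"
EXPECTED_NODES = 50
POLL_SECONDS = 15

def current_nodes():
    """Return the set of node names the cluster currently knows about."""
    resp = requests.get(f"{CLUSTER_URL}/_cat/nodes", params={"h": "name"}, timeout=10)
    resp.raise_for_status()
    return {line.strip() for line in resp.text.splitlines() if line.strip()}

previous = current_nodes()
while True:
    time.sleep(POLL_SECONDS)
    try:
        nodes = current_nodes()
    except requests.RequestException as exc:
        print(f"poll failed: {exc}")
        continue
    # Log joins/leaves so short-lived departures (like the row D blips) are visible.
    for name in sorted(previous - nodes):
        print(f"node left: {name} ({len(nodes)}/{EXPECTED_NODES} present)")
    for name in sorted(nodes - previous):
        print(f"node joined: {name}")
    previous = nodes
```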