[00:25:54] i'm heading out now too
[02:23:20] inflatador: gehel: one thing I've discovered today is that there doesn't seem to be any utility in slowly adding new elastic hosts to the cluster, since the shard reshuffling is heavily ratelimited by [what I assume must be] our shard-recoveries-at-once global limit of 8:
[02:23:21] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=64&orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&refresh=1m&from=now-12h&to=now
[02:38:26] also as an aside, we haven't decommed 2035 yet (i'm about to now) and it's alerting for disk space. interestingly it is part of the main production search cluster, even though it *should* be banned. see https://phabricator.wikimedia.org/P20161 for the ban settings and contrast that with the following excerpt from _cat/nodes on the main cluster (port 9200):
[02:38:28] `10.192.48.74 26 56 4 0.25 0.77 1.00 di - elastic2035-production-search-codfw`
[02:39:01] So our understanding of banning a host based off its name might be wrong? will need to dig more into that later. for now i'm just gonna decom the host
[08:04:05] ryankemper: elastic2035 might have been banned correctly (too late for me to check). Banning a node only prevents that node from having any shards allocated; it is still part of the cluster and can still act as a client node.
[08:05:59] Yes about not needing to slowly add new nodes, rate limiting of shard relocation should prevent any cluster overload. We still want to test on a single node first to not crash 15 nodes at the same time if we have issues with reimaging. And we probably want to batch the reimages to not put too much strain on our APT repo.
[08:13:37] ryankemper: and thanks for the decommission!
[08:21:05] inflatador, ryankemper: thanks for the deb!
[09:57:18] errand, back in ~20'
[10:53:57] lunch
[11:02:12] lunch 2
[13:14:27] Errand, back in 20'
[14:02:50] Greetings
[14:03:58] dcausse np, sorry it took so long
[14:04:10] no worries! :)
[15:12:44] inflatador: good suggestion to address our puppet pain points! Do you want to lead that effort? Or send it over to me (see email for some context)
[15:21:36] gehel thanks, I will call the mtg but need your help with the agenda. Will reach out to you in a bit
[15:35:47] sigh... spark could fail when you forget to call lit("value") on equality: .filter(col("meta.domain").equalTo("commons.wikimedia.org")) does not fail
[15:35:56] I wonder what it does tho...
[15:38:07] hm wait my bad... it works actually... problem is not there :/
[15:42:00] errand
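For reference, a minimal sketch of why the filter above works without an explicit lit(): Spark promotes a bare value to a literal column for equality comparisons, so both forms build the same predicate. The snippet in the chat is Scala-style; this sketch uses PySpark with toy data, and "domain" is a stand-in for the real meta.domain field.

```python
# Hypothetical, self-contained example; not the actual job.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master("local[1]").appName("lit-demo").getOrCreate()

df = spark.createDataFrame(
    [("commons.wikimedia.org", 1), ("en.wikipedia.org", 2)],
    ["domain", "n"],
)

# Both filters compile to the same equality predicate: the bare string on the
# right-hand side is wrapped in a literal column automatically.
without_lit = df.filter(col("domain") == "commons.wikimedia.org")
with_lit = df.filter(col("domain") == lit("commons.wikimedia.org"))

assert without_lit.collect() == with_lit.collect()
without_lit.explain()  # the literal shows up in the plan either way

spark.stop()
```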
[15:58:17] \o
[15:58:34] * ebernhardson should rewrite the OldPoolFlatlined alert...my first attempt fires too often when nothing is going wrong
[16:03:33] * ebernhardson also forgot how long mjolnir runs, it's still training models. good sign i suppose :)
[16:29:38] meh bluetooth borked. reboot before meeting
[16:37:34] Trey314159 Thunderbolt commercial, see if this rings any bells https://www.youtube.com/watch?v=ivyWdd8oduE
[16:40:47] Damn, I missed the unmeeting!
[17:18:54] inflatador: I watched the commercial without sound during the meeting and didn't recognize it.. but once I turned the sound on.. oh, yeah! "We put the yeeeee-haw! back in your motor and transmission!" So many memories....
[17:23:13] yeah, during the summer I didn't do much except watch TV and play NES. So this kinda stuff is etched into my brain...
[17:24:02] Here's the Chopped and Screwed version of Mattress Mack feat. Paul Wall! https://www.youtube.com/watch?v=P74UnjzKJOU
[17:28:19] I had not seen that version.. wow, that was a trip and a half!
[18:07:23] Lunch/errands/etc, back in ~1 hr
[18:34:50] aww, you can use {icon check} in phabricator to get a checkmark, but you can't use it as the text of a link :(
[18:57:08] hmm, since analysis-hebrew is AGPL i think that means we can't just build a custom package, we need to have a publicly visible repo somewhere ?
[18:57:35] * ebernhardson might be forgetting how that's supposed to work
[19:19:21] came back, realized I forgot to eat lunch...back in ~20
[19:19:57] heh, the query builder link on wcqs goes to wdqs
[19:20:06] will see if I can turn it off or redirect...
[19:48:40] ebernhardson: re: analysis-hebrew. I've added it to the agenda for next Wednesday.
[19:53:05] aaand back
[19:55:39] aaand lunch :) tag team
[20:03:16] し(*・∀・)/♡\(・∀・*) BRO FISTU
[20:50:51] back
[21:15:26] gehel: oops my message from earlier didn't send, but I did see actual shards scheduled on 2035, which was causing disk space issues (see https://phabricator.wikimedia.org/T300944)
[21:16:23] I suppose it's possible that it was banned properly but that elasticsearch felt it couldn't migrate the shards from that host due to not having a viable place to move them to...but seems unlikely because i'd expect that to have plunged the cluster into yellow status when I decom'd it
[21:16:37] ryankemper: that's strange
[21:17:13] not sure if elastic expects the hostname or the node name
[21:17:26] We should have a cookbook to take care of that, including the multi cluster mess
[21:18:47] `.transient.cluster.routing.allocation.exclude` has fields for any of `_host`, `_name`, or `_ip`, and I did confirm that the name matched what I saw in `_cat/nodes`
[21:19:14] in any case I do think in the past when I've banned hosts by name and checked the shards I did see them gone later
[21:19:32] might run a couple small experiments to better understand the behavior
[21:20:16] Create a cookbook at the same time, those experiments should be documented in code!
[21:21:42] ack
[21:21:53] there is also es-tool that used to do bans
[21:22:37] * ebernhardson wonders if it's time to retire es-tool
[21:25:11] I haven't used it in forever. I somewhat doubt that it even works anymore
[21:25:32] mpham: Thanks for the invite to the WDQS discussions! I'm soo looking forward to those!
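A rough sketch of the kind of small experiment discussed above (not the eventual cookbook): banning a node by name and then checking whether shards actually drain off it comes down to two API calls. The endpoint and node name below are placeholders, and a real cookbook would also need to loop over the multiple clusters mentioned in the chat.

```python
# Sketch only: endpoint and node name are placeholders.
import requests

ES = "http://localhost:9200"                     # assumed cluster endpoint
NODE = "elastic2035-production-search-codfw"     # node name as reported by _cat/nodes

# Ban by node name. The node stays in the cluster (it can still act as a
# client/coordinating node); it just can't have shards allocated to it.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": NODE}},
)
resp.raise_for_status()

# Shards should relocate away; check what is still assigned to the node.
shards = requests.get(f"{ES}/_cat/shards", params={"format": "json"}).json()
remaining = [s for s in shards if s.get("node") == NODE]
print(f"{len(remaining)} shard(s) still on {NODE}")
```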
[21:29:12] Hmm, looks like Puppet is pushing "Bolt" these days, which appears to be their version of Ansible: https://puppet.com/open-source/bolt/
[21:33:42] neat
[21:34:03] but be warned, a good config management setup is like a good marriage, kept healthiest by not fantasizing about the other options out there :D
[21:37:35] ebernhardson: I'm looking at https://gerrit.wikimedia.org/r/c/operations/alerts/+/759733/3/team-search-platform/cirrussearch.yaml#bFILE, but a bit confused
[21:38:01] If I compare the old way of doing the metric to the new one, I don't see much of a difference in the actual number
[21:38:06] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22eqiad%20prometheus%2Fops%22,%7B%22exemplar%22:true,%22expr%22:%22max_over_time(elasticsearch_jvm_memory_pool_max_bytes%7Bexported_cluster%3D~%5C%22production-search-.*%5C%22,%20pool%3D%5C%22old%5C%22%7D%5B1d%5D)%20-%20on%20(name)%20min_over_time(elasticsearch_jvm_memory_pool_used_bytes%7Bexported_cluster%3D~%5C%22production-search-.*%5C%22,%20pool%3D%5
[21:38:06] C%22old%5C%22%7D%5B1d%5D)%22,%22requestId%22:%22Q-6aae423a-f863-4ec5-a095-d7f1ca8d4f68-0A%22%7D%5D
[21:38:13] bleh, hold on need to send that as a snippet
[21:39:16] ryankemper: the main difference should be that the old way was max(used memory over 24h) - min(used memory over 24h). Now it should be max(available memory over 24h) - min(used memory over 24h)
[21:39:25] https://www.irccloud.com/pastebin/PSO7m4fG/
[21:40:07] ryankemper: i suppose what i'm trying to avoid is the extra alerts we are getting, such as when starting a new instance or just intermittently, where the memory usage isn't changing much but it's also nowhere near the max available memory
[21:40:31] hmm
[21:40:34] ebernhardson: ack. I should have clarified that the approach makes a lot of sense to me at first glance, but when I'm running the actual numbers on, say, elastic1081, I don't see much difference
[21:41:30] hmm, possible :( checking
[21:41:32] also not sure if it's a red herring but for the new way the grafana ui (so presumably prometheus is actually the one emitting the following) warns of `Metric elasticsearch_jvm_memory_pool_max_bytes is a counter. Try applying a rate() function.`
[21:42:09] it's not really a counter though
[21:42:17] but indeed maybe that means it will do wrong math elsewhere?
[21:42:43] Well, i guess i better double check how prometheus defines counter :)
[21:42:58] yup, not a counter: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
[21:45:39] ryankemper: hmm, so this is what i looked at earlier that suggested it would work; it shows it should have only alerted twice in 30 days, and at least one of those two was a real one: https://grafana.wikimedia.org/goto/uQ-tWV-7k
[21:45:55] not sure yet why the other end is getting different numbers :S hmm
[21:48:43] I do suspect that the behavior we'll see from alertmanager will match that graph you just sent, rather than the numbers we're seeing with the manual queries I linked
[21:48:54] that is bizarre though, it'd be nice to understand what's going on there
[21:49:30] ebernhardson: do you know if the `last X hours` ui dropdown in the top right would impact it? I was assuming because the `1d` range is specified manually that it wouldn't, but maybe so?
[21:49:59] i.e. for the numbers I was looking at, it had `last 1 hour` selected in the ui
[21:50:32] ryankemper: i know when it was querying graphite that could have a huge difference on things, sadly i'm not as learned in prometheus but i suspect the same is the case. I only went down as far as 24h though, i suppose bringing it down further and checking the timespan of the recent false-alarms might be sufficient?
[21:50:36] nevermind, I forgot I already sanity checked that and the numbers I spot checked were the same
[21:51:29] ebernhardson: anyway i'm a bit confused by the numbers but I definitely trust https://grafana.wikimedia.org/goto/uQ-tWV-7k as representative of when the alerts will actually fire, so merging the patch now
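One way to take the Grafana explore UI out of the picture when spot-checking those per-node numbers is to evaluate the expression directly against the Prometheus HTTP API. A rough sketch with a placeholder Prometheus URL; the expression is the old-pool "headroom" from the explore link above (max available old-pool memory over a day minus the minimum used over the same day, matched per node name).

```python
# Sketch only: base URL is a placeholder for the relevant ops Prometheus.
import requests

PROM = "http://prometheus.example.org/ops"
EXPR = (
    'max_over_time(elasticsearch_jvm_memory_pool_max_bytes'
    '{exported_cluster=~"production-search-.*", pool="old"}[1d])'
    ' - on (name) '
    'min_over_time(elasticsearch_jvm_memory_pool_used_bytes'
    '{exported_cluster=~"production-search-.*", pool="old"}[1d])'
)

resp = requests.get(f"{PROM}/api/v1/query", params={"query": EXPR})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    node = sample["metric"].get("name", "?")
    headroom_gib = float(sample["value"][1]) / 1024**3
    print(f"{node}: {headroom_gib:.2f} GiB of old-pool headroom over the last day")
```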
[22:09:45] quick break, back in ~15
[22:34:32] hmm, fired again at 2:11, not fixed :(
[22:37:52] aaand back