[05:35:18] Hmm, reimage of the first relforge host went fine but w/ the second one the cluster never recovered properly, lots of stuff in red status. Elasticsearch can't find the data on the other node (weird b/c status was green before proceeding to the reimage of the second host)
[05:35:32] https://www.irccloud.com/pastebin/POgIatvP/_cluster%2Fallocation%2Fexplain
[05:36:09] Presumably we'll need to restore from a dump tomorrow
[06:56:48] ebernhardson: thanks
[06:56:58] i will look into it
[09:56:18] lunch + errand
[12:59:42] Greetings
[13:01:32] ebernhardson (and dcausse when back): Will is asking for some help on Java code review for his team. It should be low traffic. Expect to be tagged on gerrit (or gitlab) and maybe poked on IRC.
[13:19:55] Looks like I still haven't solved the "no calendar reminders" issue ;(
[13:21:06] * inflatador twiddles another couple of bits and hopes for the best
[13:27:09] ebernhardson (and dcausse): for more context: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/libs/metrics-platform/+/refs/heads/master/java/
[13:33:27] ebernhardson: do you need anything from me on the image-suggestions import?
[14:06:46] good morning inflatador. When you want, I can make the batch changes to add the AAAA records to the codfw elastic hosts
[14:07:12] (see also my reply in the task)
[14:09:31] volans thanks for the heads-up, I'm ready anytime
[14:09:54] do you see clients connecting via ipv6 on the test hosts you've already done?
[14:13:13] no, I haven't checked yet. The processes are definitely listening on IPv6 and passing health checks
[14:13:57] But as far as LB health checks go, are they IPv6-aware?
[14:14:16] no, LVS is v4-only in our setup
[14:14:28] all svc records are v4 only
[14:15:02] so if we hit the search.svc VIP, will LVS point the client to the elastic host's IPv6 address?
[14:21:39] FWIW I don't see any IPv6 traffic hitting elastic2027's public IP, there are ip6tables rules allowing it though
[14:21:41] no, that would go via v4, but any client that would connect via other means (FQDN, or else), for example cookbooks
[14:22:00] would probably go via v6 naturally
[14:22:16] yeah, my SSH connection is v6 now
[14:24:03] I assume that a v6-only client can still search wikipedia, that the end-to-end connection doesn't need to be v6
[14:24:37] the user connection is terminated so many layers before that ;)
[14:24:38] so that was my main concern as far as this switchover, but if we need to run some other tests (like cookbooks) let me know, happy to do so
[14:24:58] yeah, that's what I thought ;)
[14:25:34] what I would test is just some basic API call internally from a random host to an elastic endpoint (and check that it uses v6, it should by default)
[14:26:01] so to make sure we're not breaking say maintenance scripts/cron/timers that maybe run directly with the FQDN of the elastic hosts
[14:26:48] that makes sense, let me try from a bastion
[14:32:40] so it looks like 9243 (HTTPS elastic port) traffic makes it to the host, the host responds, but it never comes back to the bastion. This happens on v4 too, so let's ignore that for now
[14:34:55] try from a cumin host maybe :)
[14:36:32] 9200 (HTTP elastic port) works on v4. With v6, traffic makes it to the host, the host responds, but it never comes back to the client. Tried from bastion and cumin
[14:36:46] cormacparle: i can make the thing work, can make almost anything work with enough random cli flags, but it's far from ideal
[14:37:02] cormacparle: your dataset is 800MB, it should be in something like 3 to 5 files, not 70k
[14:37:16] hehe ok
[14:37:31] ssh ipv6 traffic looks to be flowing both ways from cumin host
[14:37:37] I don't have any idea why it's split up like that, that's the trouble - we're pretty new to Spark on the team
[14:37:44] cormacparle: for concrete things, this gets into how spark works. Most likely you are reading in lots and lots of individual partitions elsewhere and never coalescing them
[14:38:19] cormacparle: spark makes it tempting to ignore, but it's actually super important how the data is partitioned. I typically have to explicitly partition data at seemingly random points of a spark job
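A minimal PySpark sketch of the coalescing advice above — the input/output paths and target partition count are made up, but the idea is to explicitly collapse the data to a handful of partitions before writing, instead of inheriting thousands of tiny upstream partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical paths -- the real image-suggestions job will differ
df = spark.read.parquet("hdfs:///path/to/upstream_data")

print("partitions inherited from upstream:", df.rdd.getNumPartitions())

# collapse to a few output files; coalesce() merges partitions without a full
# shuffle, while repartition(n) would shuffle but balance the sizes more evenly
df.coalesce(4).write.mode("overwrite").parquet("hdfs:///path/to/output")
```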
[14:38:55] inflatador: I get telnet: Unable to connect to remote host: Connection refused
[14:39:18] if I do it with the FQDN it tries v6, fails, and then tries v4, which succeeds
[14:39:23] ferm maybe?
[14:39:26] cormacparle: i'd need to look over the code for anything more concrete, in the past it's very common to run the same script over and over, monitoring it in the Spark UI to see how many partitions / how big they are / etc
[14:40:02] volans I'm dumping v6 from elastic2027 and I do see traffic reaching eno1, where/what port are you hitting from?
[14:40:23] telnet 2620:0:860:101:10:192:0:77 9200 from cumin2002
[14:40:29] that's elastic2025
[14:41:06] sounds like ferm, but let me check
[14:43:09] volans I'm getting the same results from elastic2025 as I did for elastic2027, but I'm using cumin1001. Let me try 2002
[14:44:57] getting same results from cumin2002, using `curl -6s http://elastic2025.codfw.wmnet:9200/_cluster/health`
[14:46:24] nothing grosser than IPv6 tcpdump but here's what I'm seeing https://phabricator.wikimedia.org/P28365#121555
[14:48:03] if you remove the -s you get the same as me
[14:48:04] curl: (7) Failed to connect to elastic2025.codfw.wmnet port 9200: Connection refused
[14:52:31] OK, then that does sound like ferm, grepping thru ip6tables rules now
[14:58:21] hmm, looks like cumin2002 is allowed, so probably not a FW rule. Checking listening ports now
[14:59:22] errand
[15:01:53] OK, so Elastic is definitely not listening on 9200 on its public interface, and I don't think we should change that
[15:02:33] (at least for v6, the v4 address is class A private)
[15:03:54] that's "public" because of IPv6, but practically speaking it's private
[15:03:57] 9200 is cleartext, no one should be using that directly...even internal users **should** be hitting the vip
[15:04:35] But if we should notify some stakeholders before making that change, that's fine w/me
[15:04:51] * volans doesn't get the last sentence
[15:05:29] Sorry, they should be coming in thru search.svc and LVS, specifically through the https-enabled ports
[15:05:47] not hitting cleartext ports directly on a single elastic host
[15:05:54] sure, but what about maintenance / monitoring?
[15:06:01] if there is a check on port 9200 with FQDN
[15:06:11] adding the AAAA means it will go to v6 instead of v4, for example
[15:06:35] that I don't know, although since we've had several v6-enabled hosts in eqiad for the last few months, I kind of doubt it
[15:06:40] that was one of my first questions... is the cluster v6 ready? It seems not from what you're telling me right now :)
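One way to answer the "is it v6 ready?" question per host is sketched below with Python's standard library (the host and port are just the ones from the thread above; the script itself is an assumption, not existing tooling): resolve both address families and attempt a TCP connect on each, reproducing the telnet/curl comparison done above.

```python
import socket

HOST = "elastic2025.codfw.wmnet"   # host under test, from the discussion above
PORT = 9200                        # cleartext Elasticsearch port

for family, label in ((socket.AF_INET, "v4"), (socket.AF_INET6, "v6")):
    try:
        sockaddr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
    except socket.gaierror:
        print(f"{label}: no DNS record")        # e.g. no AAAA yet
        continue
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect(sockaddr)
        print(f"{label} {sockaddr[0]}: connect OK")
    except OSError as exc:
        print(f"{label} {sockaddr[0]}: {exc}")  # "Connection refused" = not listening / filtered
    finally:
        sock.close()
```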
[15:06:54] It all depends on what you mean by "v6 ready" ;)
[15:07:47] From my perspective, that means IPv6-only clients can use the search feature on the sites we host
[15:09:09] If LVS was giving clients the IPv6 address, it would not work as it is today
[15:09:41] it doesn't give out the address, it just routes it, but yes I agree
[15:10:55] HOWEVER, curl -6k https://elastic2025.codfw.wmnet:9243/_cluster/health does work, so maybe not
[15:13:12] Anyway, we need to agree on a goal here, if LVS never uses v6 I don't see much utility in making it listen on the cleartext ports. Looks like it actually **does** listen on the TLS ports, so my guess is it actually would work
[15:13:54] so, the idea of the concerns expressed in https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it? is that once you have the dual stack it's natural to think that your host works the same on v4 and v6 addresses, exposes the same servers, etc... This also prevents a random change in the infra in the future from breaking things
[15:14:00] because something will change how it's connecting.
[15:14:16] that said, if your team is happy with the current setup no problem for me to add the AAAA records.
[15:16:53] s/servers/services/
[15:17:37] and you already are in a mixed setup anyway (some hosts have AAAA, some not)
[15:18:03] I agree, we need to make sure it works on v6, but I'm confident that it does. If anything, we should probably stop exposing the cleartext ports entirely
[15:18:38] anyway, yeah feel free to add those AAAA records for codfw and I'll start a new ticket for more IPv6 readiness testing
[15:18:59] ack, proceeding
[15:22:26] done in netbox, running sre.dns.netbox cookbook
[15:28:30] inflatador: all done, replied to the task
[15:29:19] Thanks volans! Just opened https://phabricator.wikimedia.org/T309112 for reviewing the firewall rules and listeners
[15:31:16] thanks! let me know when you're ready for eqiad in the next few days
[15:34:10] np, will hit you up soon. Thanks again for your help!
[15:34:32] anytime, sorry for the additional work
[15:35:26] that's OK, they pay me ;)
[15:48:04] Workout, back in 30-45
[15:55:10] dinner
[15:55:40] https://www.datanami.com/2022/03/15/home-depot-finds-diy-success-with-vector-search/ "seen a 13% increase in nDCG, ... an 8% decrease in query reformulations, ... and 45% decrease in the share of complaints tied to the relevance of search results"
[16:32:25] back
[16:59:06] Is it possible to have cirrus store search data for multiple wikis in just one index? I'm increasingly coming to the conclusion that we have way too many shards (due to too many indices) and trying to think of ways to reduce this
[17:40:11] lunch, back in 30-45
[17:40:54] tarrow: too many shards is a common problem, we had to split our clusters into 3 separate clusters not because of data sizes, but simply to keep the number of shards per cluster reasonable
[17:41:31] tarrow: unfortunately, storing multiple wikis in the same index isn't something cirrus can do. I started down that road a few years ago, but it was quite painful and splitting our clusters ended up being a simpler way forward
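For the "way too many shards" question, one rough way to see where the shard count comes from is the _cat/indices API; a small sketch (the endpoint is a placeholder, and the per-index count is primaries × (1 + replicas)):

```python
import json
from urllib.request import urlopen

ES = "http://localhost:9200"  # placeholder endpoint, adjust for the real cluster

with urlopen(f"{ES}/_cat/indices?h=index,pri,rep&format=json") as resp:
    indices = json.load(resp)

total = 0
for idx in sorted(indices, key=lambda i: int(i["pri"]) * (1 + int(i["rep"])), reverse=True):
    copies = int(idx["pri"]) * (1 + int(idx["rep"]))  # primaries x (1 + replicas)
    total += copies
    print(f"{idx['index']}: {copies} shard copies")
print(f"cluster total: {total} shard copies")
```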
[17:42:37] tarrow: there are also other problems we were trying to avoid, particularly that when you put multiple wikis into the same index you end up building a combined language model of all the wikis, and we know in our case that, for example, the language model of en.wiktionary is very different from en.wikipedia's
[18:09:11] back
[18:10:52] ebernhardson: gotcha! Thanks for the input! Right now we have a three-node cluster with one primary and two replicas of each shard. Would it sound smart to you to drop to just one replica of each shard to reduce the number each node has to handle? That should still be enough to prevent an outage when we have to update a node or something?
[18:18:41] tarrow: one primary and one replica will generally work, i suppose the important factor would be how long/hard recovery is if there really is a problem. Recovery would be the same as the reindexing process, if the wikis have <100k docs it's probably less than an hour
[18:25:57] tarrow: i'd be curious if that fixes the restart problems though, my intuition is you have a plausible number of shards, we run 150-250 shards per instance which seems like the same ballpark
[18:26:45] Couple mins late to pairing
[18:26:48] i suppose the mid-sized clusters with more shards use 8G jvm heaps
[18:32:39] ryankemper: ack
[18:32:55] inflatador: we're in https://meet.google.com/eki-rafx-cxi with ebernhardson already
[18:34:51] ebernhardson: there's a strong chance we're still rather underspecced. We have ~400 shards from ~100 wikis. Running with 4GB heap and 8GB of RAM available (on a Google Kubernetes cluster). Right now recovery takes much longer than an hour. Probably the best part of 24hrs. Slowest bit (for us) is recreating indices rather than populating them
[18:36:02] Most wikis have far fewer than 100k docs but we have a few outliers around that order of magnitude
[18:40:17] if it sounds like we don't really know what we're doing that is about right: we (wikimedia germany) have inadvertently bounced ourselves into this position because Wikibase 1.36+ requires elasticsearch but the overhead of maintaining it for all these tiny wikis caught us (me) out
[19:05:53] tarrow: hmm, with the actual index creation taking forever (typically waiting on master state changes in my experience) that certainly seems a direct mirror of the issues that prompted us to split our clusters. I guess i'm surprised available memory seems to be so closely tied to the performance of the master, but it certainly seems to be having an effect
[19:17:40] we also have been regularly struggling with what I now see you describe here: https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell except on all our nodes, where the load introduced by a node restarting is sufficient to make another node fall over
[19:18:15] Is there an overview of the clusters, the wikis on them and what sort of machines they are?
[19:19:42] tarrow: i wrote something many years ago at the top of wikitech:Search, but nothing too accurate any more. I typically look at the grafana cluster overview to get an idea of the individual nodes: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=elasticsearch&var-instance=All&var-datasource=thanos
[19:20:03] the short of it is these should be something like 256G memory, 40 cores, 3.6TB SSD per server, x36
[19:20:21] not everything is that same spec yet, that's the new spec. Some old machines are 128GB RAM
[19:20:48] Thanks! I really appreciate all the time you're spending explaining all this :)
[19:20:49] to see how things are spread between clusters, you can query cloudelastic.wikimedia.org:8[246]43 from WMCS
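A quick sketch of that cross-cluster check — querying each of the three cloudelastic ports (8[246]43 expands to 8243, 8443, 8643) for cluster health. This assumes those are the TLS-terminated HTTPS endpoints and that the CA is trusted by the client, neither of which is confirmed in the log:

```python
import json
from urllib.request import urlopen

# the three per-cluster ports mentioned above
for port in (8243, 8443, 8643):
    url = f"https://cloudelastic.wikimedia.org:{port}/_cluster/health"
    with urlopen(url, timeout=5) as resp:
        health = json.load(resp)
    print(port, health["cluster_name"], health["status"],
          f"{health['active_shards']} active shards")
```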
[19:22:00] those are basically the same as the main clusters, but instead we somehow shoved the entire thing into 6 instances. We tried to shove it into 4 and couldn't make it work, had to expand to 6 instances and reduce replicas to 1
[19:23:04] tarrow: not a problem, few people are even curious about these things :) I suppose i should write docs on it, but i suppose i expect them to be rarely referenced
[19:23:58] but the cloud instances are still these crazy beefy machines?
[19:24:22] like 1/4TB RAM and 40 cores?
[19:24:31] tarrow: yes, but no. They are different :P they have 1/2TB of RAM and 20 cores
[19:24:39] yikes
[19:25:43] tarrow: as my predecessor nik said before leaving to go work at elastic, 'ram is life'. search needs ram :)
[19:25:44] gotta say it makes our original optimistic deployment of 500m heap and even our 4gb heap now look rather silly
[19:26:17] hehe, seems like an apt aphorism
[19:27:03] tarrow: do you graph jvm heap over time? in particular for the old gc hell problem, depending on how bad it is it can usually be fixed by regularly rebooting instances, we have a few instances we reboot once or twice a month when they complain
[19:28:02] we don't, but we shall as soon as we find time to get some better observability stuff rigged (which now looks like it should have been yesterday ;) )
[19:28:29] I'm basically just refreshing /_cat/nodes?h=name,heap.max,heap.percent,heap.current,ram.percent,cpu,load_5m,role,flush.total
[19:29:51] tarrow: if you are collecting GC logs, have a look at https://gceasy.io/
[19:30:04] tarrow: i dunno if it would be super telling, but here is an example of what a troubled node looks like for old gc hell:
[19:30:06] https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&viewPanel=56&var-datasource=eqiad%20prometheus%2Fops&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1049&var-top5_fielddata=All&var-top10_terms=All&var-top5_completion=All&from=now-30d&to=now
[19:30:32] you can see that when it restarts on May 9th it has about 1GB of free RAM, but then over the next few weeks the bottom of the sawtooth gets closer and closer to the top, until there is no room to work with anymore
[19:30:42] gotcha!
[19:30:54] and the associated config we use on the JVM: https://github.com/wikimedia/puppet/blob/production/modules/elasticsearch/manifests/instance.pp#L177-L186
[19:32:16] we see lots of these horrible lines: `"[2022-05-24T12:12:14,200][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-master-1] [gc][9569] overhead, spent [10.4m] collecting in the last [5.9m]"`
[19:32:26] tarrow: oh wow! that's exceptionally bad
[19:32:47] tarrow: really the jvm should kill itself before spending 10 minutes on a GC :S
[19:33:11] i guess that's not one collection, that's aggregate over time? but still that's bad :)
[19:33:26] yeah, doubling the memory seems to have solved it but... I also thought that last week until it started up again
[19:33:51] I mean, the line is just funny: 10 of the last 5 minutes?? what??
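If graphing heap over time isn't wired up yet, a stopgap sketch that polls the nodes-stats API instead of refreshing _cat/nodes by hand — the endpoint and interval are placeholders, and in practice you'd feed this into Prometheus/Grafana rather than printing it:

```python
import json
import time
from urllib.request import urlopen

ES = "http://localhost:9200"  # placeholder endpoint, adjust for the real cluster

while True:
    with urlopen(f"{ES}/_nodes/stats/jvm") as resp:
        nodes = json.load(resp)["nodes"]
    stamp = time.strftime("%H:%M:%S")
    for node in nodes.values():
        mem = node["jvm"]["mem"]
        print(f"{stamp} {node['name']}: "
              f"{mem['heap_used_in_bytes'] / 2**30:.1f}/{mem['heap_max_in_bytes'] / 2**30:.1f} GiB "
              f"({mem['heap_used_percent']}%) heap used")
    time.sleep(60)  # watching the bottom of the sawtooth creep upward is what matters
```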
[19:34:04] surely at most 5 of the last 5 would be reasonable
[19:34:19] lol, i'm also wondering about that :) I suppose that might mean 10.4 minutes of cpu time, which might be multi-threaded
[19:34:33] but then it's not really useful to know how much cpu-time it used
[19:38:07] right, since it's now well past CEST business hours I'm off for the night. Thanks so much for all the input!
[19:38:26] g'night
[23:01:40] * ebernhardson wonders how many times he can search for fetchPhrase instead of fetchPhase before realizing it's the wrong thing
[23:21:20] New rolling reimage operation cookbook more or less works! In this last run `relforge1003` was reimaged and then failed to come back to green in time, so `relforge1004` didn't get reimaged, but the logic itself seemed to work fine
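For context on the "failed to come back to green in time" step, the shape of such a check is roughly the following — this is not the actual cookbook code, just a sketch with a placeholder endpoint and an arbitrary timeout:

```python
import json
import time
from urllib.request import urlopen

ES = "http://localhost:9200"          # placeholder endpoint for the cluster being rolled
DEADLINE = time.time() + 30 * 60      # arbitrary recovery budget per host

# poll cluster health until it reports green, or give up and stop the roll
while True:
    with urlopen(f"{ES}/_cluster/health") as resp:
        status = json.load(resp)["status"]
    if status == "green":
        print("cluster green, safe to reimage the next host")
        break
    if time.time() > DEADLINE:
        raise RuntimeError(f"cluster still {status}, stopping the rolling operation")
    time.sleep(30)
```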