[13:17:52] \o
[15:59:56] errand, back in ~30
[16:21:00] back
[17:55:18] ebernhardson or anyone else, do you have an opinion on the cloudelastic disk config? Right now we're using 6x 2 TB SSDs in RAID-10; it seems like we could use
[17:56:02] RAID-0 for the data and rely on elastic to fix data corruption instead of the OS/RAID
[17:56:24] handle fault tolerance, that is
[17:59:32] inflatador: hmm, cloudelastic has lower fault tolerance than the main clusters; in the main clusters we keep 3 copies of all data, but cloudelastic only keeps 2. I'm not 100% sure which is the right approach, but since we don't need the extra disk space (currently ~50% used) it seems easy enough to keep it how it's been
[18:01:13] also curious, no Special:ApiSandbox on wikidata.org
[18:04:56] ebernhardson thanks, I'm kinda skeptical that it's worth buying 4x more SSDs/host and a custom hardware config. Then again, restoring that from backup probably wouldn't be fun ;)
[18:05:58] also lunch, back in ~40 or so
[18:16:22] ahh, I forgot this is about replacement machines rather than changing the current deployment. We could probably get away with 4T in RAID-0; for the most part elastic's recovery seems to work for us. Hard to say how much RAID-10 is likely to save us
[18:48:59] looks like cloudelastic uses about 2 TB, which briefly hopped up to ~75% late last month https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=cloudelastic&var-instance=All&var-datasource=thanos&from=now-2y&to=now
[18:49:20] (I think there was some crazy log spamming happening?)
[18:58:02] hmm, I think our only related constraint is growth and the disk low watermark: at >= 85% disk usage elastic will stop allocating new shards to the node, at 90% it will try to relocate shards away, and at 95% indices on it go read-only
[20:01:46] hmm, one hour of sparql logs fails to process because the query triggers a StackOverflow in the sparql parser
[21:23:27] My thoughts on the cloudelastic disk situation, feel free to respond here or in the ticket https://phabricator.wikimedia.org/T334210#8769525
[21:31:11] oh joy, a bunch of RDF streaming updater alerts!
[21:37:14] :(
[21:44:14] restarting BG on a couple of nodes, let's see if it helps
[22:03:45] OK, lag is (mostly) trending down. Will check again after dinner, but I think we're good for now
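
For reference, the thresholds mentioned at [18:58:02] correspond to Elasticsearch's cluster-level disk watermark settings (low 85%, high 90%, flood_stage 95% by default). Below is a minimal sketch of checking, and optionally overriding, those settings via the cluster settings API; the endpoint URL and any override values shown are assumptions for illustration, not cloudelastic's actual configuration.

```python
# Minimal sketch: inspect Elasticsearch disk watermarks via the cluster settings API.
# The endpoint is an assumption; the real cloudelastic host/port may differ.
import json
import requests

ES = "http://localhost:9200"  # assumed endpoint, adjust for the target cluster

# Show the effective watermark settings, including built-in defaults
# (low 85%, high 90%, flood_stage 95%).
resp = requests.get(
    f"{ES}/_cluster/settings",
    params={
        "include_defaults": "true",
        "filter_path": "**.routing.allocation.disk.watermark*",
    },
    timeout=10,
)
print(json.dumps(resp.json(), indent=2))

# Example transient override (values illustrative only, not a recommendation):
# requests.put(
#     f"{ES}/_cluster/settings",
#     json={
#         "transient": {
#             "cluster.routing.allocation.disk.watermark.low": "85%",
#             "cluster.routing.allocation.disk.watermark.high": "90%",
#             "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
#         }
#     },
#     timeout=10,
# )
```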