[06:53:26] hm... the import_ttl dag does not look the way I expected, the complete task is connected to the split_graph task...
[07:35:34] the cirrus sanitizer is pushing jobs directly to a cluster and might still end up trying to run a cirrusElasticaWrite, possibly causing a "Received cirrusSearchElasticaWrite job with page updates for an unwritable cluster"
[07:37:32] ah no, the maint script should fail early with "$cluster is not in the set of check_sanity clusters"
[09:40:07] cloudelastic is struggling a bit, I guess due to the reindex
[09:54:52] lunch
[10:33:15] gehel: meeting?
[11:30:18] break
[13:16:58] o/
[14:07:05] dcausse: when you said cloudelastic is struggling, was that from seeing the GC alerts/ES memory dashboard or somewhere else?
[14:07:43] inflatador: yes, mem alerts, and I remember one node not responding yesterday
[14:08:06] ACK, just wanted to make sure I wasn't missing anything
[14:08:58] the unresponsive node was probably from me restarting 1005 to clear the GC alert. It caused the cluster to go red but the problem corrected itself... you can see it in scrollback
[14:22:49] ah, makes sense
[14:26:21] do we still use the forceSearchIndex.php script to kick off backfills, or is there another way to do it with the new SUP?
[14:27:05] inflatador: it's using another script
[14:27:44] ah OK, I see it now... /cirrus_reindex.py?
[14:27:59] yes, at https://wikitech.wikimedia.org/wiki/Search
[14:28:07] with the backfill command
[14:28:07] cool, I'm rewriting the docs
[14:44:57] still very crude, but the idea is to move the code snippets out into a repo where we can update them a bit more easily: https://gitlab.wikimedia.org/repos/search-platform/sre/one-liners/-/blob/main/streaming-updater.md?ref_type=heads
[14:46:50] thanks!
[14:48:29] np, will send out for review once I've added more
[14:50:03] will merge https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031522/, feel free to revert or adjust the thresholds or "for" time if it gets noisy
[14:52:53] ACK, will keep an eye out
[15:04:47] ebernhardson: retro time: https://meet.google.com/eki-rafx-cxi
[15:54:50] * ebernhardson ponders "less of solving generic problems with extremely WMF-specific code" while writing 800+ lines of python to orchestrate running two commands the right way :P
[15:56:36] :)
[15:56:49] careful, you might provoke a rant from yours truly ;P
[15:57:06] (not directed at you of course)
[15:59:32] does anyone think Categories could write/read directly to/from thanos-swift? or would that be too unreliable?
[16:00:25] just pondering if we could move it to k8s and use S3 as storage
[16:01:06] inflatador: how big was the dataset? My naive answer would be it depends on what % fits in the memory cache, and what % is "hot" data
[16:01:58] looks like 34G on disk, hard to say. If it was 1 or 2G it would be easy, yes.
[16:02:11] data size is 34 GB. As for hot data I'm not sure
[16:02:11] inflatador: I have no clue, I suspect that blazegraph won't like that much
[16:02:33] np, just thinking out loud. I'll keep looking at the ganeti approach
[16:03:09] i'm kinda surprised there is 34G of category information. I mean sure it's indexed multiple ways, various duplication, but still, that's a lot of a->b
[16:05:27] once we have persistent storage in K8s I'm guessing it will be easier. In the meantime, I created T365735
[16:05:31] T365735: Consider creating a separate WDQS server type for categories - https://phabricator.wikimedia.org/T365735
[16:07:13] it's updated daily, I wonder what the size is after a fresh reload
[16:07:25] thanks for the ticket!
[16:07:33] heading out, back later tonight
[16:07:43] enjoy!
[16:28:33] oh, we should probably turn on that performance governor stuff for Elastic... can't remember if we've done that yet. Might help cloudelastic during backfills?
[16:35:09] maybe? I suspect it's going to be more IO-limited, but hard to say
[16:36:40] CPU and IO load for cloudelastic report ~20%. At least by those metrics it's not too loaded down
[17:20:52] lunch, back in ~40
[18:01:27] not feeling so hot... going to rest up and come back tomorrow. ryankemper: I cancelled our pairing
[18:01:43] inflatador: ack. feel better!
[19:51:20] run, back in 1hr
[19:56:18] Oh, and wrt the performance governor stuff above, agreed that WDQS is uniquely CPU-bound compared to elasticsearch. I wouldn't expect much benefit from swapping governors, although I'm totally fine trying it out
[20:24:27] get well i.nflatador!
[20:50:25] sure, there is no harm in trying the performance governor
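
A minimal sketch of the kind of cluster-health watching behind the 14:08 discussion (seeing cloudelastic go red after restarting 1005 and recover on its own): it polls the standard Elasticsearch `_cluster/health` endpoint. The endpoint URL below is a placeholder, not the real cloudelastic address.

```python
# Minimal sketch: poll an Elasticsearch cluster health endpoint while a node
# restarts, so you can watch it go red -> yellow -> green.
import time
import requests

CLUSTER_URL = "https://cloudelastic.example.org:9243"  # placeholder, not the real endpoint

def wait_for_green(timeout_s: int = 900, poll_s: int = 15) -> bool:
    """Return True once the cluster reports 'green', False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(f"{CLUSTER_URL}/_cluster/health", timeout=10)
        resp.raise_for_status()
        health = resp.json()
        print(f"status={health['status']} "
              f"relocating={health['relocating_shards']} "
              f"unassigned={health['unassigned_shards']}")
        if health["status"] == "green":
            return True
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("cluster recovered" if wait_for_green() else "still not green")
```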
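For the 15:54 aside about writing 800+ lines of Python to "orchestrate running two commands the right way": a rough, illustrative-only shape of that kind of wrapper. The commands here are placeholders, not the actual reindex/backfill invocations from cirrus_reindex.py.

```python
# Illustrative-only: run two shell commands in sequence, retrying the second
# step on failure. Placeholder commands stand in for the real ones.
import subprocess
import time

def run(cmd: list[str], retries: int = 0, delay_s: int = 30) -> None:
    """Run a command, retrying on non-zero exit up to `retries` extra times."""
    for attempt in range(retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        if attempt < retries:
            time.sleep(delay_s)
    raise RuntimeError(f"command failed after {retries + 1} attempts: {cmd}")

# Hypothetical two-step flow: reindex first, then backfill the gap.
run(["echo", "step 1: reindex"])
run(["echo", "step 2: backfill"], retries=2)
```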
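On the 15:59 thanos-swift/S3 question: one plausible pattern (under the caveat from the log that Blazegraph itself probably won't read object storage directly) is staging the categories dump through an S3-compatible store and pulling it to local disk before loading. The endpoint, bucket, key, and credentials below are all made up.

```python
# Rough sketch of staging a categories dump via an S3-compatible store such
# as thanos-swift. All names here are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://thanos-swift.example.org",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

BUCKET = "wdqs-categories"              # hypothetical bucket
KEY = "dumps/categories-latest.nt.gz"   # hypothetical object key

def upload_dump(local_path: str) -> None:
    """Push a local dump file to object storage."""
    s3.upload_file(local_path, BUCKET, KEY)

def download_dump(local_path: str) -> None:
    """Fetch the latest dump back to local disk before loading Blazegraph."""
    s3.download_file(BUCKET, KEY, local_path)
```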
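For the 16:28 performance-governor question: a quick, read-only check of which CPU frequency governor each core is currently using. Actually switching governors needs root and is normally done through configuration management or cpupower rather than an ad-hoc script like this.

```python
# Read-only check of the scaling governor per core via /sys (Linux only).
from collections import Counter
from pathlib import Path

def current_governors() -> Counter:
    counts: Counter = Counter()
    for gov_file in Path("/sys/devices/system/cpu").glob("cpu*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return counts

if __name__ == "__main__":
    for governor, cores in current_governors().items():
        print(f"{governor}: {cores} cores")
```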