[09:24:15] Xabriel is working on wmf_dumps.wikitext_raw (T358366). I think this isn't something that makes sense as intermediate storage for our Search pipelines (and even less for WDQS), but could someone who knows more have a look? If we don't care about it, please just ignore. Otherwise, have a look and comment!
[09:24:17] T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw - https://phabricator.wikimedia.org/T358366
[09:26:43] I don't think we ever used this; we generally use our own dumps
[10:14:38] lunch
[13:14:25] o/
[14:59:31] \o
[15:05:22] o/
[15:14:57] .o/
[15:35:56] testing the reindex-all script against the closed wikis on cloudelastic. I don't know why this didn't occur to me before, but printing the output of 8 UpdateSearchIndexConfig invocations in parallel is a mess...
[15:36:11] i guess i need per-wiki log files or something
[15:43:20] not terrible, took ~8 minutes to reindex, but of course closed wikis are all empty. but that's the overhead
[15:44:06] ebernhardson: i added a request to https://phabricator.wikimedia.org/T364600#9817447 . look doable? hope is to add this to the regular weekly updates if possible.
[15:44:47] dr0ptp4kt: hmm, we do that for rdf, lemme check how we did that
[15:46:25] hmm, for wikidata rdf imports we didn't change the ownership, just the umask to 0022, which makes it group readable. poking a few more things
[15:48:12] ahh, yea, that's probably what we need. Right now data is stored 0750, we want 0755
[15:48:22] dr0ptp4kt: should world readable be sufficient?
[15:48:37] hmm, maybe not, since this is webrequest-derived
[15:50:01] one sec, checking whether world readable (for users on the cluster) is really what's wanted.
[15:51:54] poking spark, i'm not seeing obvious ways to control the group it's written as, but we could always chown from airflow
[15:53:47] hmm, actually i see a suggestion that spark will inherit the parent directory group if you have perms, testing
[15:53:56] right. okay, i checked - let's chown/chgrp for the group perm portion, not the world portion
[16:04:55] workout, back in ~40
[16:33:07] * ebernhardson realizes while looking at permissions that i forgot to repartition the data before writing
[16:33:15] we have 200 70kB files per hour, instead of 1 or 2
[16:33:36] * ebernhardson wishes spark could be a little smarter there... it could just look at the data after writing and then re-write it
[16:34:38] hmm, ideally we could fit a repartition declaration into the output spec string... wonder how
[17:00:16] dinner
[17:11:10] sorry, been back a while
[17:36:16] lunch, back in ~40
[18:05:12] back
[18:32:03] inflatador ryankemper okay if i join the sre pairing session for a bit?
[18:32:15] dr0ptp4kt: yeah, hop in!
[18:32:34] omw
[19:13:22] kicked off full-cluster cloudelastic reindex with the new script, will see how it goes. It will run some in parallel (up to 32 shards), but it's doing biggest first and commonswiki takes the whole capacity
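(A minimal, hedged sketch of the two Spark write-side changes discussed above around 15:44–15:53 and 16:33–16:34: setting a 0022 umask so the output comes out group readable, and coalescing before the write so each hourly partition lands as one or two files instead of ~200 tiny ones. The table name, output path, and partition columns below are invented placeholders, not the actual job.)

```python
# Hedged illustration, not the real Airflow/Spark job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example_hourly_write")
    # Pass the Hadoop umask through Spark config: 022 yields 0755 directories
    # and 0644 files instead of the 0750/0640 currently seen on the cluster.
    .config("spark.hadoop.fs.permissions.umask-mode", "022")
    .getOrCreate()
)

# Hypothetical webrequest-derived source table.
df = spark.read.table("some_db.some_webrequest_derived_table")

(
    df
    # Collapse the ~200 shuffle partitions so at most 2 files are written per
    # hourly partition; coalesce avoids a full shuffle, unlike repartition().
    .coalesce(2)
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day", "hour")
    .parquet("hdfs:///wmf/data/discovery/example_output")  # hypothetical path
)
```

(If a given Hadoop/committer combination ignores the umask setting, a chown/chgrp step from Airflow after the write, as suggested above, is a reasonable fallback.)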
[19:37:57] OK, so the Mac docker workflow for REALLY ACTUALLY DELETING IMAGES appears to be 1) remove all images from Docker Desktop (could probably do it with the cli, but I'm lazy), then 2) run `docker buildx prune` from the CLI
[19:43:16] 🫣
[20:11:20] break, back in ~40
[20:31:25] back
[20:56:55] It looks like wdqs1022 and wdqs1024 (test servers of the graph split project) don't have it running, but both wdqs1023 and wdqs2023 (also test servers of the graph split project) have wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service timers. i'm wondering, do we just want to stop and disable wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service on wdqs1023 and wdqs2023 for now?
[20:57:02] By the way, after having manually updated aliases.map and bouncing wdqs-categories, I ran reloadCategories.sh on wdqs2023 just to see how long it would normally take.
[20:57:48] i imagine you can, it's a separate graph
[20:58:04] i'm guessing d.causse may want to run some things for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1032544 like these, but not sure that the prometheus check needs to be on atm
[20:58:06] dr0ptp4kt: yeah, looks like the timers are setting off alerts. I'll try masking them and see if that does the job
[21:00:10] inflatador: out of curiosity, what's the preferred way to mask them?
[21:00:47] `systemctl mask wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer` ... I'm running puppet now to see if this change is puppet-proof
[21:01:18] i see the 'acknowledge' with a short-lived silence checkbox, and then the 'silence this group' in the hamburger menu
[21:02:27] Oh, good point... I've masked it on the host, but that won't stop alerts
[21:02:35] (looked at the manpage, i see - so masking suppresses their runs altogether)
[21:04:08] let me try deleting the units completely. Either puppet will put them back, or (hopefully) we've written code that doesn't install them anymore on new hosts
[21:08:54] we shouldn't be running categories at all on these split hosts, right?
[21:11:32] I don't think our puppet code is smart enough to remove everything that was already installed, so when we changed these to graph split hosts those unit files were still lying around
[21:12:02] I manually deleted them from 1023 and 2023 and it looks like puppet is not adding them back, so hopefully we are good
[21:15:04] i don't know that we need categories on them immediately, although d.causse's refactoring work may want to use these servers for that purpose over the upcoming several weeks (not sure if he might need it within the next several days)
[21:15:39] thanks for puppeteering!
[21:19:22] np, anytime!
[22:25:17] building the docker image takes 313.9s on an 8-core cloud server, 302.9s on my desktop. Bah