[10:09:51] lunch
[13:10:26] o/
[14:08:34] I'll be at the triage meeting today, not much going on today at the SRE offsite for virtual attendees
[14:20:03] \o
[14:20:46] o/
[14:38:35] dcausse: hmm, the docs for setting the consumer position (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Manual_maintenance#Downstream_consumer_%28wdqs1009%29 ?) suggest using a set_offsets.py script from stat1004, but that script only seems to print the offsets and not set them, is there a more official script somewhere that does perhaps?
[14:40:32] ebernhardson: it should be a cookbook IIRC
[14:43:14] dcausse: oh! that wasn't clear from the task, so i stopped the updater friday and depooled wcqs2001 and ran the wcqs-data-reload.sh script from the rdf repo. guess that didn't do much then :(
[14:43:17] hm... can't find a cookbook for it, I guess the function in spicerack was added
[14:43:29] oh o
[14:43:31] k
[14:43:39] i see it in the repo, sre.wdqs.data-reload looks like it probably should
[14:44:12] should we drop the data reload scripts from the rdf deploy repo?
[14:44:12] wcqs-data-reload.sh is reusing the same journal
[14:44:17] yes
[14:44:30] hm.. yes the journal doubled
[14:44:41] ok, so total failure :)
[14:44:56] I completely forgot that this script existed
[14:46:23] this cookbook is not very practical...
[14:46:39] it forces you to read the data beforehand to know the offset
[14:46:52] s/offset/timestamp
[14:47:32] in theory could adjust this to re-use the munged files on disk already, would probably be a flag
[14:48:33] yea the offset handling there is a bit tedious. gosh i should remember this script...i wrote the wcqs integration there
[14:48:41] s/script/cookbook/
[14:48:44] :)
[14:49:00] I barely remember something reusing the munged data
[14:49:35] it needs to change anyways, your.org is 2 weeks behind on dumps and changed their directory layout so it wouldn't work
[14:49:55] i put up a patch to mount nfs dumps to one wcqs and one wdqs host each
[14:50:51] :/
[14:51:46] in theory we munge all these dumps every week
[14:52:18] i suppose it could have been rsync, the options were nfs or rsync but you need sre agent-forwarding capabilities to rsync so thought it would overly limit the reload script. But that turns out to not be the case, it's already sre-limited by being a cookbook :)
[14:52:36] :)
[14:53:36] where do we munge the dumps weekly?
[14:53:39] in yarn?
[14:53:40] yes
[14:53:50] the data is in hive
[14:54:09] hmm, so the question would be how to get it out of yarn....the typical answer would be swift i suppose
[14:54:22] what's missing is something like hive -> RDF file -> swift or the like -> import
[14:54:37] yes
[14:54:53] upload-to-swift should be easy enough, can poke at what it would take to get back out and into the data reload step
[14:55:06] that'd be great
[14:55:29] i wonder if they ever turned on ttl deleting in swift, the upload-to-swift never cleans up after itself but instead sets ttl's and expects swift to clean itself up
[14:55:38] I mean that'd remove the dependency on the external dump mirror and would save 1 day of munging
[14:56:24] (i suppose more important here because it would be uploading 10's of GB per week)
[14:56:29] hmm... these files are big indeed, so pruning them is important
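For reference, the object-level expiry behaviour described above comes down to one extra header on the PUT. A minimal sketch with python-swiftclient, not the refinery upload script itself; the auth endpoint, credentials, container and object names, and the 30-day TTL are all illustrative placeholders:

```python
# Hedged sketch: upload an object with an expiry header via python-swiftclient.
# Auth endpoint, credentials, container and object names are placeholders.
from swiftclient.client import Connection

conn = Connection(
    authurl='https://swift.example.org/auth/v1.0',  # hypothetical auth endpoint
    user='analytics:admin',
    key='REDACTED',
)

with open('wikibase_rdf.ttl.gz', 'rb') as f:
    conn.put_object(
        'search_dumps',                      # hypothetical container
        'commons/wikibase_rdf.ttl.gz',       # hypothetical object name
        contents=f,
        # Ask swift to expire the object after ~30 days; nothing actually
        # deletes it unless the object expirer daemon is running.
        headers={'X-Delete-After': str(30 * 24 * 3600)},
    )
```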
[15:02:13] poking puppet it looks like we have a profile::swift::storage::expirer and it's included from role::swift::storage, with the hiera set to present, so probably it works now
[15:02:29] but only on deployment-ms-be05, no clue if that covers the cluster we use
[15:02:43] (seems like no, because that's deployment-prep :P)
[15:04:00] (╯°□°)╯︵ ┻━┻ THIS IS RIDICULOUS
[15:04:34] inflatador: could be bold, make the patch to turn it on in prod and ask filippo to review
[15:04:57] is there some way to find out what all it will delete when first enabled?
[15:05:17] ebernhardson: dcausse: triage btw https://meet.google.com/eki-rafx-cxi?authuser=1
[15:05:21] I don't think directly, but probably a simple script
[15:05:38] oops
[15:05:41] https://meet.google.com/eki-rafx-cxi?authuser=1 *
[16:00:49] gehel: can we close T304954?
[16:00:50] T304954: Import data from hdfs to commonswiki_file - https://phabricator.wikimedia.org/T304954
[16:01:26] cbogen: gehel is out today, but from our side it should be complete
[16:01:41] ah, ok I'm going to close, thanks!
[16:03:18] inflatador: any hints on how to ask swift to tell me about objects with x-delete-after? Should that be something the listing APIs can filter on?
[16:06:22] hmm, per docs there is a hidden .expiring_objects account, i guess would need admin access to look there
[16:06:53] (i just have the swift creds we use for uploading, the analytics:admin account which i don't think is a real admin)
[16:07:16] ebernhardson good question, let me check the API docs...probably something we can do with swiftly or a python script
[16:11:49] heh, docs refer to the existing process as legacy. But then in a big note: The new task-queue system has not been completed yet. So an expirer with dequeue_from_legacy set to False will currently do nothing.
[16:12:32] ebernhardson would we be checking existing objects or setting a new TTL policy for containers?
[16:12:42] (or both?). Hit me up with the phab task if you don't mind
[16:13:41] inflatador: i was thinking we would want some kind of report of "what will be deleted if we turn this on" to attach to the patch asking filippo to deploy the expirer in prod. I'm not sure what clients other than us have been setting the x-delete-after
[16:14:02] basically make it easier for him to say yes
[16:14:54] but my reading is we would need access to the hidden account to query that
[16:15:42] i suppose before reading these docs more i thought swift would expose that through a query, but instead it seems they manage a special hidden account that contains some sort of files that say what has to be deleted when
[16:16:00] ebernhardson so we already are setting x-delete-after on our objects?
[16:16:04] inflatador: yup
[16:16:31] since 2019, so probably a lot of not-deleted things at least in our own account :) I'm mostly curious about others though
[16:17:04] this comes about because i'm thinking about adding a new thing that will x-delete-after with 10's of GB per week
[16:17:46] Oh yeah, I thought TTL/expiry was native to swift, you're saying it's not active on our installation?
[16:18:40] https://docs.openstack.org/swift/latest/overview_expiring_objects.html
[16:18:47] inflatador: the expirer isn't enabled in our installation. There is a puppet profile for it, but it's only turned on in deployment-prep
[16:18:56] (I know you've already read it, sorry for my slow catchup here)
[16:18:58] (for ~2 years)
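Without access to the hidden .expiring_objects account, one way to draft the "what would be deleted" report discussed here is to walk a single account's containers and HEAD each object for an X-Delete-At header. A rough sketch assuming python-swiftclient, with placeholder auth details; note it issues one HEAD per object, so it will be slow on large containers:

```python
# Hedged sketch: report objects in one account that carry an X-Delete-At header.
# Auth endpoint and credentials are placeholders; only covers one account.
from datetime import datetime, timezone
from swiftclient.client import Connection

conn = Connection(
    authurl='https://swift.example.org/auth/v1.0',  # hypothetical auth endpoint
    user='analytics:admin',
    key='REDACTED',
)

_, containers = conn.get_account(full_listing=True)
for container in containers:
    _, objects = conn.get_container(container['name'], full_listing=True)
    for obj in objects:
        # HEAD each object; swiftclient returns response headers lower-cased.
        headers = conn.head_object(container['name'], obj['name'])
        delete_at = headers.get('x-delete-at')
        if delete_at:
            when = datetime.fromtimestamp(int(delete_at), tz=timezone.utc)
            print(f"{container['name']}/{obj['name']} expires {when.isoformat()}")
```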
[16:20:06] the patch itself should amount to choosing an instance to run the thing, and setting one hiera variable on that machine. But i was hoping to include in that a list of what would be deleted to make filippo more comfortable shipping it
[16:21:09] Yeah, makes sense. Do we set the headers at the container or object level? Hit me up with a link to our code if you don't mind
[16:22:58] for our own use case we set it at the object level while uploading, https://github.com/wikimedia/analytics-refinery/blob/master/oozie/util/swift/upload/swift_upload.py#L383-L392
[16:23:09] but i wonder more about other users of the swift cluster
[16:24:21] yeah, that's a very likely concern for go-dog and whoever owns swift. On the other hand, if you're creating objects with those headers, don't you want to delete them? ;)
[16:24:54] depends. Things have never been deleted before, so maybe people are using code paths where they don't realize they are setting delete-after headers
[16:25:10] at least, not auto-deleted
[16:25:39] maybe the expirer can be limited to specific containers/users? i should look closer..
[16:26:00] so should I!
[16:26:06] i just don't want to find out turning it on deletes masses of thumbnails in mediawiki, and then the thumbnail renderers get overloaded or something silly :P
[16:26:23] oh yeah, and you're right to anticipate concerns from the service owner
[16:27:54] I missed whatever task that is, but if you want to assign it to me I can try to run it down. Hopefully that object-server.conf config allows enabling the expirer at the tenant or container level
[16:27:59] hmm, actually per the docs it shouldn't be dangerous: The expired-but-not-yet-deleted objects will still 404 Not Found if someone tries to GET or HEAD them
[16:28:16] so those objects are already inaccessible, deleting them is only a cleanup
[16:29:20] seems enough to make me comfortable at least
[16:31:28] inflatador: (hopefully) last question, any idea if eqiad and codfw swift are separate clusters, thus needing their own expirer each, or if it's one cluster that keeps replicas in multiple datacenters?
[16:33:23] ebernhardson I actually don't know, I'm guessing the former but will check it out
[16:40:47] https://config-master.wikimedia.org/discovery/ leads me to believe the clusters are synced somehow
[16:43:14] asking in sre
[16:57:14] based on what i see in puppet, i think they are independent clusters using https://docs.openstack.org/swift/pike/overview_container_sync.html
[17:17:54] ah, that makes sense
[17:40:30] lunch, back in ~45
[17:49:35] * ebernhardson ponders how the cookbook should know what munged data to download
[17:56:01] lol, query `select * from wikibase_rdf where `date`='20220911' and wiki='commons' limit 1;` result: `java.lang.OutOfMemoryError: Java heap space`
[18:07:43] weird that hive is not able to grab the first row...
[18:07:53] beeline can, something is just up with hive
[18:07:57] hive-cli
[18:09:56] i suppose an open question i have is how to transform the table into the .ttl.gz files that the import needs, doesn't seem like it should be too complicated, maybe a groupby on subject and some string formatting of the resulting groups, but unsure
[18:11:05] and i suppose all the prefix handling, the table looks unprefixed (but i suppose that's not very complicated)
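A rough sketch of the groupby-on-subject idea for getting RDF back out of the hive table. It assumes a PySpark session, a table with plain subject/predicate/object string columns, and hypothetical table and output names; real munged output would additionally need prefix handling and proper Turtle serialization:

```python
# Hedged sketch: dump one snapshot of a wikibase_rdf-style hive table as
# N-Triples-like lines, keeping each subject's triples together.
# Table name, column names and output path are assumptions, not verified.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

triples = (
    spark.table('discovery.wikibase_rdf')      # hypothetical table name
    .where("`date` = '20220911' AND wiki = 'commons'")
    .select(
        F.concat_ws(' ', 'subject', 'predicate', 'object', F.lit('.')).alias('line'),
        'subject',
    )
)

(
    triples
    .repartition('subject')                    # group a subject's triples in one file
    .sortWithinPartitions('subject')
    .select('line')
    .write.option('compression', 'gzip')
    .text('hdfs:///user/example/wikibase_rdf_commons_20220911')  # hypothetical path
)
```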
[18:42:14] sorry, been back
[19:03:44] for the main task of reloading the wcqs instances i think i'm going to have to punt, i prepped the commandline that should work here but an SRE will need to run the cookbook: https://phabricator.wikimedia.org/T316236#8246427
[19:04:09] also found it tedious to figure out that ms timestamp, so put up a patch that lets the cookbook accept more standard time strings: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/833082
[19:04:35] we can still look into moving the yarn based munging over, but i think that's going to take more than a day or two to figure out
[19:04:48] OK, I can help with the PR/cookbook
[19:05:16] the reason string i used might be a bit meh :P but i figure the task-id is the important bit
[19:21:46] screwing up gerrit yet again...I'll merge this shortly if I can get out of my own way ;(
[19:37:07] inflatador: no worries :) Also that will run for a couple days, so need to run in a tmux/screen/etc.
[19:37:56] one thing i'm not 100% sure of is what the retention is on the kafka topic for mutations. The initial tasks say 30 days, more than plenty, but i can't verify
[19:38:14] ebernhardson yeah i figured. It's bombing out now, looks like it can't find one of the bz2 files. Do we need to merge anything else before running?
[19:38:25] hmm, what's the error message? It must be complaining that it's not finding the pre-downloaded dump?
[19:38:53] yeah, it's cut off but I think it's https://dumps.wikimedia.your.org/commonswiki/entities/latest-mediainfo.ttl.bz2
[19:38:53] oh! it's probably looking for .bz2 and i downloaded the .gz version :(
[19:39:50] will have to re-download from our mirrors the .bz2 and try again. We can't let the cookbook download because it's reading the your.org mirror which is out of date
[19:41:11] annoyingly that download is going to be 4 hours :P
[19:41:25] or i suppose we could merge the NFS patch? dcaro seemed ok with it
[19:41:52] ebernhardson up to you, I'm sure ryankemper and/or myself could start the cookbook again in 4h
[19:42:24] or we could patch the cookbook to look for gz, ungz and bzip, whatever's clever
[19:43:39] hmm, i suppose could re-compress it. would take less time than downloading i imagine
[19:45:25] yeah, probably a pigz pipe to bz2 or something? I'll start testing unless you have a 1-liner handy
[19:46:21] oh, no it won't be faster :S we don't seem to have pbzip2 here. zcat runs through the input at 35MB/s, but piping it into bzip2 -1 brings that down to sub 2MB/s, ~6 hours
[19:47:25] so I guess it's the station wagon full of exabyte tapes, then ;P
[19:48:00] i cheated a little and copied pbzip2 into my home dir :P Now it's running at 25MB/s, about 23 minutes to re-compress
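The pipeline being run here is presumably zcat piped into pbzip2. A minimal Python sketch of the same idea, assuming both tools are on PATH and using illustrative file names (the actual command line was not shared in the log):

```python
# Hedged sketch: recompress a .gz dump to .bz2 by piping zcat into pbzip2.
# File names are placeholders; assumes zcat and pbzip2 are on PATH.
import subprocess

src = 'latest-mediainfo.ttl.gz'   # hypothetical input
dst = 'latest-mediainfo.ttl.bz2'  # hypothetical output

with open(dst, 'wb') as out:
    zcat = subprocess.Popen(['zcat', src], stdout=subprocess.PIPE)
    # -1 trades compression ratio for speed, matching the discussion above,
    # so the resulting .bz2 can end up larger than the .gz.
    pbzip2 = subprocess.Popen(['pbzip2', '-1', '-c'], stdin=zcat.stdout, stdout=out)
    zcat.stdout.close()   # let zcat get SIGPIPE if pbzip2 exits early
    pbzip2.communicate()
    zcat.wait()
```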
[19:49:49] ebernhardson much better! also it looks like pbzip2 is available thru the repos
[19:49:58] deb repo that is
[19:50:55] yea it should be installable, curiously i don't see a single mention of it in the puppet repo
[19:51:12] i guess no one else needed it :)
[19:53:13] for self-contained deb pkgs like that, I'm fine with one-off installing...debmonitor keeps track and should warn us if there is an exploit
[19:53:59] may as well install it w*qs wide if the cookbooks are manipulating bz2 files. What do you think?
[19:55:04] they aren't actually doing it directly, rather the decompression happens inside java
[19:55:44] it's probably pretty rare we create bz2 files, mostly decompress them
[19:59:41] ah, nm then
[20:12:54] inflatador: ok it should be ready now. Amusingly the .bz2 is larger than the .gz since i used -1 (which is supposed to mean faster, but larger)
[20:21:44] ebernhardson Cool. Just started the cookbook and it's not immediately dying
[20:22:59] I doubt it's that exciting, but here's what it looks like: https://phabricator.wikimedia.org/P34883
[20:23:52] yup that looks about right, it'll count up to ~1580 i think
[20:56:01] Has anyone used any of the wikidata python libs listed here? https://www.wikidata.org/wiki/Wikidata:Tools/For_programmers
[21:20:28] break, back in ~15
[21:31:33] back
[22:03:14] see y'all tomorrow
[22:38:23] i dunno how i didn't notice this before .... the elastic docs aren't explicit but one of the tickets says: if a field has a value ["one two three", "one two"], today the token_count would index [3, 2]
[22:38:43] and not 5 like we would expect
[22:41:22] we can still get the value from scripting, it's just slightly stupid :P {"script":{"source":"doc['outgoing_link.token_count'].length"}}
[22:41:40] getting the length of the doc values, instead of the actual values
[22:42:56] or can manually sum the values in painless, but that's silly since we know all the values are 1
[22:51:40] * ebernhardson shrugs and continues the reindex....i guess i must have only tested before with 1 and 0 outgoing_link's
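The two counting approaches mentioned above, side by side as script_fields. A sketch assuming elasticsearch-py and a cirrus-style index with an outgoing_link.token_count subfield; the host and index name are placeholders:

```python
# Hedged sketch: compare doc-value count vs. summed token_count values.
# Host and index name are illustrative; assumes elasticsearch-py.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
resp = es.search(index='enwiki_content', body={
    'size': 5,
    'script_fields': {
        # number of indexed values (one per outgoing_link entry)
        'num_values': {
            'script': {'source': "doc['outgoing_link.token_count'].length"}},
        # sum of the per-value token counts
        'token_sum': {
            'script': {'source': ("long s = 0; "
                                  "for (def v : doc['outgoing_link.token_count']) { s += v; } "
                                  "return s;")}},
    },
})
for hit in resp['hits']['hits']:
    print(hit['_id'], hit['fields']['num_values'], hit['fields']['token_sum'])
```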