[06:17:41] hi folks, not sure if the tags are correct on this task, but who can take a look at it? T314835
[06:17:42] T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[07:52:00] tanny411: hello! any chance we could move our 1:1 slightly earlier? For example now?
[07:52:38] Sure, I'm here
[07:52:53] thanks!
[08:12:22] godog: how urgent is it (before we run out of space)?
[08:13:06] David and Erik are both on vacation (I know, bad planning). They are the ones who really understand how all of this works.
[08:13:37] I can dig into it, but I might need to give them a call.
[08:14:53] gehel: when are they back?
[08:15:20] I think a week might be ok, but I'd rather not push it
[08:17:25] Next week
[08:20:08] It sounds like Flink stopped some kind of cleanup process, but knowing how to fix that is going to require more knowledge of Flink than I have.
[08:20:26] Let me check with Platform Engineering to see if they can help
[08:22:14] gehel: thank you!
[08:23:20] let me know how it goes; if we can't mitigate it I can purge some historical metrics, but I'd rather not have to
[08:25:48] Purge metrics? To recover space from a different application?
[08:26:43] Nah, worst case, we stop the updater. That's in line with our SLO (at least mostly).
[08:48:09] gehel: oh ok! thank you, that's good to know we can stop the updater instead
[08:49:16] That's a bit of a worst case scenario: it would (obviously) stop WDQS from being updated, which has a significant impact on Wikidata.
[08:49:54] But we are under-resourced to manage WDQS well, so that's part of the expected risks.
[08:50:14] cc: mpham (for when you wake up)
[09:03:26] *nod*
[09:41:39] Lunch + errand
[12:46:32] greetings
[12:49:52] I'm working today in lieu, will start the reimage soon
[12:50:07] Doubt I can help much with the Flink situation but do let me know if I can learn
[13:02:00] inflatador: there is some conversation on Slack: #data-platform-value-stream
[13:02:31] Not sure if you can help, or learn something in the process
[13:05:34] inflatador: also, if you could check and merge T314853. Marco is checking if the data import is running correctly. It might be interesting for you to pair with him and learn a bit about airflow in the process.
[13:06:22] Just a suggestion, and in case the reimages are running as they should
[13:08:18] o/
[13:17:09] the problem seems to be codfw only and related to swift
[13:18:05] "Unable to wrap exception of type class org.apache.flink.fs.openstackhadoop.shaded.org.apache.hadoop.fs.swift.exceptions.SwiftAuthenticationFailedException: it has no (String) constructor"
[13:22:13] dcausse: hi! interesting, I can't follow up ATM though
[13:22:41] likely in ~1.5h though, I'll read what you find!
[13:23:56] gehel ACK, looking at puppet patch now
[13:25:51] inflatador: thanks!
[13:30:17] dcausse: thanks a lot! This isn't super urgent (we have a few days before we run out of space) and you should be on vacation. So don't drop everything for this!
[13:30:35] But honestly, I have some doubts that we will be able to fix it without you
[13:31:08] gehel: I won't have access to a computer for 1 week starting from tomorrow
[13:31:31] :/
[13:31:53] gehel OK, merged. Do I need to run puppet on all/any hosts listed here? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
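(Editor's note: a minimal sketch of forcing the Puppet run being asked about here, assuming the standard WMF tooling: cumin for fleet execution and the run-puppet-agent wrapper, run from a cumin master against the host named in the reply that follows. Both tool names are from memory; verify on wikitech.)

```bash
# Force an immediate Puppet run on the Airflow host instead of waiting
# for the ~30-minute scheduled run mentioned below.
sudo cumin 'an-airflow1001.eqiad.wmnet' 'run-puppet-agent'
```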
[13:33:32] Ideally on 'an-airflow1001.eqiad.wmnet', but it runs automatically every 30' and I don't think that Marco is in as much hurty
[13:33:46] s/hurty/hurry/
[13:37:15] hey folks
[13:37:19] NP, I ran it on an-airflow1001.eqiad.wmnet
[13:37:33] o/
[13:37:47] are we discussing the flink/thanos issue in here?
[13:37:50] or do we prefer slack?
[13:38:03] elukey: o/ here is fine
[13:38:17] hey dcausse!
[13:38:42] I just added in the task that there seem to be ~18T of checkpoints saved in thanos
[13:38:54] for rdf-streaming-updater-codfw+segments
[13:38:58] elukey: yes I think codfw is misbehaving
[13:39:18] was about to blame the swift client that we still use for flink ha
[13:39:55] it caused us problems in the past and we dropped it in favor of the S3 client
[13:40:06] but we still use it for flink_ha storage
[13:41:23] can we drop some of the old checkpoints by any chance?
[13:41:26] will try to stop the job and make some cleanups at least
[13:41:28] yes
[13:41:35] super, lemme know if you need any help
[13:41:40] sure, thanks!
[13:52:59] wdqs lag on codfw might start to scream and might cause bots to stop editing
[13:53:40] we should route traffic to eqiad only and update wikidata max lag detection to only check eqiad
[13:53:55] inflatador: ^ is this something you could take care of?
[13:59:35] dcausse I think so, checking now
[13:59:52] thanks!
[14:09:15] inflatador: https://gerrit.wikimedia.org/r/821753 should stop polling wdqs for the max lag detection
[14:15:31] I mean wdqs@codfw
[14:17:34] dcausse cool, merged that PR. Still working on the depool command, I've reached out in sre, just wanna make sure I don't accidentally depool everything
[14:18:14] inflatador: thanks! lemme know once it's depooled and I'll stop the wdqs job to (stopped the wcqs job already)
[14:18:25] s/to/too
[14:30:35] dcausse codfw is depooled from wdqs
[14:30:45] inflatador: thanks!
[14:30:58] stopping the wdqs job
[15:01:15] dcausse: huge thanks for taking the time during your vacation!
[15:25:22] starting to cleanup the rdf-streaming-updater-codfw swift bucket
[15:38:42] dcausse any idea how long we might need to keep codfw depooled?
[15:39:22] inflatador: I hope we can repool it in a couple hours if the cleanup is going fast enough
[15:39:58] ACK, just wanted to make sure we don't leave it down too long. If there's a dashboard I can watch LMK
[15:40:07] inflatador: I'm tempted to move forward with T304914 (at least partially). I mean by using the s3 client on codfw at least
[15:40:08] T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914
[15:40:33] but I might need help with this (delete some k8s resources and some configmaps)
[15:42:55] dcausse I have a mtg in about an hour, do you think we could finish before then? Or maybe after that? I know you're on vacation, just wanna be sensitive to that
[15:43:43] the cleanup will take some time, but we could perhaps prepare the k8s namespace in codfw if you have time?
[15:45:04] we'd need to fully undeploy the rdf-streaming-updater deployment in k8s@codfw but I'm not sure I can do that
[15:48:20] dcausse OK, I'm rescheduling my mtg (lots of ppl cancelled anyway) and will create a Meet shortly
[15:49:38] thanks!
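(Editor's note: a rough sketch of the "fully undeploy in k8s@codfw" step, assuming WMF's usual deployment-charts/helmfile layout on the deployment hosts. The path and environment name are assumptions from memory; the wikitech notes linked below are the authoritative write-up of what was actually done.)

```bash
# From a deployment host: tear down the rdf-streaming-updater release in
# codfw so its k8s resources and configmaps can be recreated cleanly.
cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e codfw destroy
# Redeploying later uses the same layout with `apply`:
helmfile -e codfw apply
```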
[15:50:43] OK, up at meet.google.com/ngq-nvrq-mir
[15:51:03] joining
[16:34:33] purging swift is quite slow and I'm getting errors like "Error Deleting: rdf-streaming-updater-codfw/flink_ha_storage/default/completedCheckpoint26ba64d3a1ec: ('Connection broken: IncompleteRead(6 bytes read)', IncompleteRead(6 bytes read))"
[16:36:33] I see some space being reclaimed but the folder I delete with "swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS delete rdf-streaming-updater-codfw --prefix commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3c" still has entries
[16:36:55] godog: in case you know if there's a better way to purge some files ^
[16:47:13] Added some notes on the k8s purge/deploy we just did, feel free to add/change: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes#Kubernetes_operations
[17:02:59] workout, back in ~45
[17:12:14] hm... purging the container like that is not going to work
[17:14:11] I could possibly store the savepoints to hdfs and then completely drop the container, this might perhaps be a lot faster
[17:24:29] but I don't know how to drop a container so I'm a bit stuck
[17:47:34] back
[17:50:24] dcausse if you still need help LMK. not sure about dropping containers though
[17:51:17] inflatador: thanks but not sure what to do, I'm going to drop the flink_ha_storage folder first so that I can resume operation, but the mass cleanup will take a while
[17:51:30] ACK
[17:58:55] updated the ticket, will check back a bit later
[18:15:41] is slack down for anybody else?
[18:16:04] yeah I can't load threads
[18:16:30] mpham I tried sending you a slack msg and it was rejected
[18:16:55] ok, good to know. yeah, it's not working for me either
[19:13:43] well... not much progress on the "swift" cleanup
[19:15:04] one option would be to resume the jobs on a new container and hope that there's a command to drop a full container
[19:15:47] we used to have to delete swift containers for customers all the time at my old job
[19:16:04] swiftly was the preferred tool, but it's been a loooong time ( http://gholt.github.io/swiftly/2.06/ )
[19:16:08] everything I see requires deleting all the objects before
[19:16:21] but it's from the swift client
[19:16:43] there might be admin commands that allow bypassing this check
[19:17:20] dcausse yeah, that was/is the problem... swift won't let you delete the container until it's empty
[19:17:36] swiftly and some other tools will do that for you automatically, trying to remember what the best one is
[19:18:42] there are bazillions of files in flink_ha_storage/default/ and not sure how much time it'll take before it's empty...
[19:19:24] I should have added an alert on this...
[19:20:17] I think the best way forward is to put the data in a new container and resume the jobs so that it's "easy" to clean up the bad container later
[19:20:25] swiftly is oooold, probably only works with python2, but it does allow you to delete all, see https://docs.rackspace.com/support/how-to/install-the-swiftly-client-for-cloud-files/ . Also allows concurrent object deletion, but we should check with data persistence before we start hammering the API
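(Editor's note: the stock swift client used above can also parallelize the purge. `swift delete` issues its per-object DELETEs over a thread pool sized by --object-threads, a standard option of python-swiftclient's CLI; the thread count below is an arbitrary example.)

```bash
# Same prefix purge as quoted above, but with more client-side concurrency
# (the default is 10 object threads). Higher values hammer the API harder,
# so check with data persistence first.
swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K "$PASS" \
    delete rdf-streaming-updater-codfw \
    --prefix commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3c \
    --object-threads 20
```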
[19:20:26] please let me know if you have a better option
[19:21:12] I don't have any better ideas
[19:21:39] ok, going to configure the system to use the "rdf-streaming-updater-codfw-T314835" container then
[19:21:40] T314835: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835
[19:21:56] I'm guessing that deleting multiple TB of data from swift probably will take a few days unless data persistence knows any backend magic
[19:26:04] Lunch, back in ~30
[20:12:00] sigh... no luck, can't start the job on this new swift container: Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: txda600f7bcca7429ab42ab-0062f2bef2; S3 Extended Request ID: txda600f7bcca7429ab42ab-0062f2bef2; Proxy: null
[20:15:31] bah
[20:16:42] not sure what's wrong...
[20:17:04] perhaps the S3 compat layer is something that needs to be activated on a per-container basis?
[20:17:44] the only thing I can think of offhand is the path style vs bucket style we saw with the Elastic stuff https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable#Path-style_and_bucket-style_access
[20:23:04] yes it's on here
[20:25:45] back to square 0...
[20:32:16] maybe there's a way to test it outside of k8s?
[20:32:39] I was testing from yarn (the analytics cluster)
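(Editor's note: one way to test the S3 compatibility layer outside of k8s; a sketch, not what was actually run. It assumes the aws CLI, path-style addressing as described on the wikitech page above, and S3-style credentials for the wdqs account exported as AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY.)

```bash
# Force path-style addressing, matching what Swift's S3 layer expects here.
aws configure set default.s3.addressing_style path
# List the new container through the S3 endpoint; a 400/401 here would
# reproduce the Flink failure without involving the k8s deployment.
aws --endpoint-url https://thanos-swift.discovery.wmnet \
    s3 ls s3://rdf-streaming-updater-codfw-T314835/
```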
[20:33:31] btw testing swift from codfw (search-load2002) I randomly get Container GET failed: https://thanos-swift.discovery.wmnet/v1/AUTH_wdqs/rdf-streaming-updater-codfw?format=json&prefix=flink_ha_storage/default 401 Unauthorized [first 60 chars of response] b'<html><h1>Unauthorized</h1><p>This server could not verify t'
[20:33:33] Failed Transaction ID: tx019b7dab38e944ef80bdd-0062f2c481
[20:35:05] Per conversation w/ gehel, I think we're stable enough if you want to get on with your vacation. Also FWIW, I have seen problems with swift-proxy on the backend manifest as 401/403s on the frontend
[20:37:48] dcausse: go enjoy your vacation! You've done a lot already! We'll do our best to survive until you get back for real!
[20:38:00] ok, I can leave it down for the rest of my vacation, but not sure we have enough retention on the kafka topics
[20:38:28] I can also run the jobs from yarn using the same swift container
[20:39:15] your call
[20:39:55] If there is something quick that you can do, please go ahead. But you should really be on vacation.
[20:40:33] Worst case, a bit more work next week to reset everything from scratch, but it should not have user impact.
[20:40:43] ok I'll start them from yarn, will update the ticket with paths that should not be cleaned up
[20:41:51] And we need to talk about how you get that day back!
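(Editor's note: on "how to drop a container" and watching the slow cleanup, a small sketch using the same swift CLI and credentials quoted earlier. `stat` and a bare `delete` are standard python-swiftclient subcommands; nothing else is assumed.)

```bash
# Watch space being reclaimed: `stat` on the container reports the current
# object count and bytes used, so re-running it shows the purge progressing.
swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K "$PASS" \
    stat rdf-streaming-updater-codfw
# `delete <container>` with no object arguments removes every object and
# then the container itself -- which is why it is so slow on ~18T of
# checkpoints; the plain client offers no server-side shortcut.
swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K "$PASS" \
    delete rdf-streaming-updater-codfw
```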