[00:48:51] Here's where I am. I am trying to load my dump and I keep running into this error. https://hackerpaste.hns.siasky.net/#AABfcZlhjUFLP-vPt-KvOKyvmY6tTrE9Vk-IY_TQFKVSwAvSsvkYpGheR90boqeFhc [00:48:54] I thought, okay, 20 GB is small for a VM, so I created a 100 GB mounted volume and mounted it onto my controller VM. (Easier than DigitalOcean!) And I moved the gzipped dump to that volume, and tried to run the command again, and still got that error. [01:04:28] What I could do – what I wanted to begin with – was just import the dump remotely from my own machine, but this would require the DB to have a hostname that is accessible from the Internet [01:34:24] the problem is postgres taking all disk space, not the dump file [01:38:23] don't know the details (so this might not be helpful at all), but increase size for partition where postgres lives or move postgres to some volume with enough disk space [02:03:57] postgres server or postgres client? the postgres server is using a new service called trove (re @Edgars: don't know the details (so this might not be helpful at all), but increase size for partition where postgres lives or move postgres to some volume with enough disk space) [02:19:02] postgres server [03:05:13] harej, I can probably attach a temporary public IP to your db instance -- want me to try that? [03:10:43] Sure, let's try that andrewbogott [03:15:41] harej, try -h 185.15.56.102 [03:15:51] for example psql -h 185.15.56.102 -p 5432 -U root -d postgres [03:16:19] (and let me know when you're done so I can close it up again) [03:27:28] andrewbogott I have the dump import running and I will let you know when it finishes or if it crashes again [03:27:42] nice!
[03:27:53] It hasn't crashed *yet* which I am taking to be a good sign [03:33:49] andrewbogott: same error as before. What is the size of the disk of the VM running the postgres server? If it's 20 GB like the other VMs it is probably running into that. So for imports we need some way to tell postgres to stage the imports somewhere else. [03:34:10] The dump is 11 GB gzipped. Uncompressed it is somewhere around 100 GB I think. [03:35:47] harej, When you created the instance I think you selected a flavor, looks like g3.cores4.ram8.disk20 [03:36:33] harej, I must've misunderstood, can't you decompress the file on your local system and do the import from there? [03:36:59] That might work [03:37:17] slow but it'll give you more control over some of this [03:40:00] even with blazing fast single core speeds this gzip -d is taking a while [03:40:43] It must be getting late where you are, thank you for helping me navigate this [03:41:24] np [04:10:33] psql -h 185.15.56.102 -U cguser -d citationgraph < citationgraph_20210726 from my local machine returns the same error [04:13:57] The same "no space left on device" error [07:24:07] where do I go to kill an instance on wmcloud?
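[Editor's note: the `gzip -d` then `psql < file` approach above needs roughly 100 GB of scratch space for the decompressed dump. A way around that is to stream the decompression straight into psql. The sketch below uses a tiny stand-in file to demonstrate only the streaming step; the host, user, and database names in the comment are the ones from this conversation.]

```shell
# In practice the import would be piped without ever writing the
# uncompressed dump to disk, e.g. (not run here -- needs the live DB):
#   gunzip -c citationgraph_20210726.gz | psql -h 185.15.56.102 -p 5432 -U cguser -d citationgraph
# Tiny stand-in demonstrating that gunzip -c streams to stdout:
printf 'CREATE TABLE demo (id int);\n' | gzip > /tmp/demo_dump.sql.gz
gunzip -c /tmp/demo_dump.sql.gz   # prints the SQL; no uncompressed file is created
```

[Piping this way also means an import that dies mid-run leaves no giant temporary file behind on the client.]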
I cannot see linkwatcher and if it is renamed liwa3 it isn't responding [07:24:20] coibot is responding fine [07:24:51] not to worry, just taking foooooorever [07:24:57] horizon.wikimedia.org lets you force reboot instances, if that's what you mean [07:25:25] thx [07:26:41] weird just logging into the instance got the bot operational in irc, nothing else needed [07:26:54] seems it was lost [12:57:11] hi need help from an admin [12:58:15] to reset two factor auth of my user [12:59:20] orenbo: the process for that is here: https://wikitech.wikimedia.org/wiki/Password_and_2FA_reset#For_users, lmk if the docs are unclear [13:03:59] such fancy syntax highlighting on that page ^^ [13:06:23] !log toolsbeta rebuild toolsbeta-sgeexec-1001 as -1003 T287666 [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:06:27] T287666: toolsbeta-sgeexec-1001/2: buster sgeexec apt fails to write to /tmp - https://phabricator.wikimedia.org/T287666 [13:06:48] harej, sorry I dropped off last night, 'The Fifth Element' came on TV and I forgot about everything else :) That 'no space left on device' error is accurate -- here's the df line from the database instance: [13:06:52] /dev/sdb 118G 112G 152K 100% /var/lib/postgresql [13:24:32] so I don't seem to be able to connect to bastion.wmflabs.org [13:25:19] perhaps since I changed my ssh key on gerrit recently [13:26:26] I recall a login/logout might be needed for it to get updated ?
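[Editor's note: the `df` line quoted above is the check that settles "is the data volume actually full?" immediately. Assuming shell access to the database host (which Trove does not normally give you, hence the confusion later in the log), a minimal sketch; `/var/lib/postgresql` is the Debian default data path used here, and the fallback to `/` just keeps the sketch runnable on machines without it.]

```shell
# Free space on the volume holding the Postgres data directory
# (falls back to / when that path doesn't exist on this machine):
df -h /var/lib/postgresql 2>/dev/null || df -h /
# Biggest space consumers inside the data directory, largest last:
du -sh /var/lib/postgresql/* 2>/dev/null | sort -h | tail -n 5
```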
[13:51:33] orenbo: the gerrit ssh key is unrelated to the key you would use for ssh [14:08:06] !log toolsbeta add mdipietro as projectadmin T287287 [14:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:08:10] T287287: Onboard Michael DiPietro to Wikimedia Foundation as SRE in Cloud Services - https://phabricator.wikimedia.org/T287287 [14:09:02] !log paws add mdipietro as projectadmin T287287 [14:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [14:24:53] I really miss high traffic irc. All the Jupyter realtime chat channels are quieter [14:58:18] andrewbogott: is it really as simple as me underprovisioning my DB? [15:05:30] harej, looks like [15:26:32] Okay, I am going to delete that DB and make a bigger one, and try again [15:30:24] andrewbogott: what is my DB quota? [15:31:22] harej, 120GB at the moment [15:32:23] Would it be possible to increase that to around 200 GB? [15:33:32] Interesting, the uncompressed db dump is 58,881,933,770 bytes [15:33:50] I guess the actual data structures are bigger than the SQL statements needed to recreate those structures [15:34:09] There's also hella indexing [15:34:29] harej, yes but do you mind opening a quota ticket so we have records? https://phabricator.wikimedia.org/project/view/2880/ [15:49:12] There we go, the current size of my workstation's postgres folder is 172 GB [15:50:01] I filed the ticket [16:23:59] Down the line I may also want to re-do my data model; there's a heavy reliance on varchar(255) columns because user-generated data is a total free-for-all [16:24:11] But that number is pretty arbitrary [16:27:36] @harej: https://stackoverflow.com/a/1067095/8171 -- tuning varchar(255) to varchar(N) probably makes no difference. Preallocated row size for varchar died out in most RDBMS systems 20 years ago.
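[Editor's note: a quick way to get that exact uncompressed byte count before provisioning a volume. One caveat worth knowing: `gzip -l` stores the uncompressed size in a 32-bit header field, so for an 11 GB archive it reports the size modulo 4 GiB and can't be trusted; streaming through `wc -c` is exact at any size. Sketch with a tiny stand-in file in place of the real dump:]

```shell
# Stand-in for the real dump; in practice this would be citationgraph_20210726.gz
printf 'some dump contents\n' | gzip > /tmp/dump.sql.gz
gzip -l /tmp/dump.sql.gz             # fine for small files, wrong above 4 GiB
gunzip -c /tmp/dump.sql.gz | wc -c   # exact uncompressed byte count at any size
```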
[16:28:04] I was wondering about that, actually [16:28:12] The idea of fixed width columns felt very old school to me [16:42:48] harej, I upped your quota; I think you can resize your existing volume but let me know how that goes [16:47:59] I will try that, thank you [17:41:44] I've resumed the import [18:04:47] !log tools reset sul account mapping on striker for developer account "Derek Zax" T287369 [18:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [18:04:53] T287369: Unlink SUL account from dev account - https://phabricator.wikimedia.org/T287369 [18:13:17] andrewbogott: can you attach a floating IP to fxf3dbjplex.svc.trove.eqiad1.wikimedia.cloud? [18:13:36] I was running into issues so just in case I deleted the database and will re-import [18:14:55] harej, 185.15.56.96 [18:19:42] cheers! Thank you for continuing to walk me through this. I hope this is giving the team insight into how Trove operates and I am not just being a fussy customer. [18:24:12] One thing that would help me is if I could get more insight into the db server's performance. Unless I can ssh directly into the spawned DB server (is that an option?) I'm not sure how I would have figured out the actual database volume was full (I mean, other than the error messages literally telling me this.) [18:24:44] The confusion was over whether it was the *database* that was full or if it was caching stuff on the *database VM*, i.e. I overthought this [18:24:57] harej, yeah it's quite useful for me to see your process. [18:25:02] 🙂 [18:25:28] Getting a direct login on the database instance is pretty hard to do securely but it's a reasonable thing to want [18:27:08] bd808: I /think/ it could be fixed by adding %(project) in the config.sal.phab message template, is that right?
[18:27:23] I just traced it through the code base and it seems like it's based on dict(**bang) which would already contain everything we need [18:27:37] Still a bit unsure as to what it would be for production or where that comes from [18:27:41] A direct login to the db instance might be useful but the thing I really want is insight into how overloaded/overworked the DB is [18:41:50] !log tools.wikibugs Updated channels.yaml to: d7555428ef9264c75b83b097fd78e9918bc8b250 Send SRE-* projects to `#wikimedia-operations` too [18:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL [18:46:26] Fingers crossed... [19:34:18] Krinkle: I need to get back into figuring out the issue myself. It must have seemed tricky when I wrote T186845 or I would have just fixed it. :) [19:34:19] T186845: Display project name in SAL messages posted to Phabricator - https://phabricator.wikimedia.org/T186845 [19:36:55] I think I'm remembering that the project name is only data for a toolhub/cloud vps !log and not really for prod logs. I think that's probably what stuck my brain before. That it would likely need a branch in the code and config to render messages into phab differently based on the !log origin. [19:38:18] toolhub !log? i assume you mean toolforge? [19:38:36] yeah... my brain keeps flipping them :/ [19:39:14] i replaced the labs, labs, labs problem with tools, tools, tools in my own brain [19:48:06] This is partially my fault for naming Toolhub Toolhub [19:50:05] The psql import has not crashed yet, which is encouraging. When I tried re-sizing my DB that didn't seem to achieve the desired outcome, so I deleted the DB and started over, and that appears to be working. I also gave it 16 GB of RAM instead of 8 GB, which probably isn't a factor but may have been a good idea anyway. [19:50:55] hm, did you resize the db volume or the db instance?
(those both being things you can resize) [19:52:36] Resizing the db instance only gave me the choice to change core count and RAM, since stock VMs are all 20 GB. So I resized the volume [19:53:48] Hard drive sizes are in the terabytes now, but we're still setting up systems with 20 GB of disk storage, like when I first started using computers in the late 1990s. Everything old is new again [19:54:36] (I was not around for the times when hard drives were measured in megabytes.) [22:27:28] @harej: except now that 20GiB is a thinly provisioned, 2x replicated block storage allocation backed by enterprise grade SSD devices ;) [22:31:08] 3x [22:32:06] 3 copies total, correct? primary and 2 replicas? [22:33:41] hmmmm [22:33:53] it isn't really primary and replicas, it's three peer replicas [22:34:11] reads are distributed all over the place for performance, kind of like a multi-server raid [22:34:19] but yes, three copies total :) [22:37:40] Words for this stuff often feel so imprecise. Some part of my brain rejects saying "3 replicas" because replicas of what? Don't you have to have a thing first to replicate it? But yeah the point is really "3 copies of the data" for 1) read speed increase and 2) hardware failure loss protection. It doesn't protect against replicating bad writes, but should recover from loss of up to 2 copies. [22:44:48] andrewbogott: This has come up mid-import, but has not killed the import: "ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp419.0.sharedfileset/0.10": No space left on device" [22:45:09] huh [22:45:38] I don't know what that means, except that maybe making the db takes more space than storing it [22:46:26] So long as the import hasn't crashed and burned I am content with ignoring it [22:46:47] Well, speak of the devil [23:49:52] But I think most of the import has otherwise gone through, I am going to check
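[Editor's note: that `base/pgsql_tmp` path is where Postgres spills scratch files during large sorts and index builds, which is why space can run out mid-import even when the finished database would fit on the volume. The settings below are real Postgres configuration parameters, but the values and the tablespace name are purely illustrative, and whether Trove exposes them (e.g. via its configuration-group feature) would need checking.]

```
# postgresql.conf fragment (sketch, not a recommendation)
maintenance_work_mem = '1GB'      # larger in-memory budget for CREATE INDEX sorts,
                                  # reducing how much spills to base/pgsql_tmp
temp_tablespaces = 'scratch_ts'   # hypothetical tablespace on a volume with headroom,
                                  # so spill files land off the main data volume
```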