[09:04:43] jynus: is it possible for you to provide me with a dump of the ipoid service DB, or a backup from this morning/last night if it exists? We are debugging an application issue and having the DB contents would be helpful.
[09:43:25] <_joe_> kostajh: wouldn't it be more convenient to have a dedicated user on the DB you can use to query it?
[09:47:46] _joe_: we have one. But we're trying to work out why the app is failing at a certain step of the import process, and it will be much easier to just plug in the DB dump to our local dev instances and step through that way.
[09:48:06] <_joe_> ah gotcha
[09:48:14] in the meantime, yeah we are making do with manual queries to try to work out what is going on
[10:09:34] kostajh: what is that service?
[10:09:43] where is it?
[10:10:41] where is your local dev instance?
[10:12:52] jynus: https://wikitech.wikimedia.org/wiki/Service/IPoid
[10:12:57] my local dev instance is on my laptop.
[10:13:23] so I imagine I would download the file from a deployment server
[10:13:38] generally, that's not allowed: moving data from production to outside of the network
[10:14:11] I see it is on m5
[10:14:15] there is no user-generated data in that application
[10:14:45] I can generate mostly the same data locally on my laptop -- the thing I need is to be able to debug some data that exists in the `import_status` table
[10:15:02] so, even just that table alone would be great, if possible.
[10:16:00] jynus: I need to step out for a bit, but Tran is on my team and can follow up on this as well
[10:16:14] :wave
[10:17:09] backups for ipoid were not set up
[10:17:56] as when the db was requested, it was said backups: no
[10:18:09] > Backup Policy: Needed? Frequency? no
[10:18:57] <_joe_> which makes sense as the data is populated from an external feed
[10:19:49] I can set them up now
[10:20:13] or generate a new one
[10:20:46] but I would prefer to dump it inside the network and give you access to the file or test db
[10:21:02] let me know which you would prefer
[10:21:04] Could we do it without a backup? We don't need this data to be backed up for any kind of archival reason; we're just trying to replicate a bug locally.
[10:21:23] you mean, a dump?
[10:21:49] Yes, but we'd want to download the dump to our local machine. We don't have any kind of sandbox server for this service.
[10:22:11] give me a host you have access to on prod and I will put it there
[10:22:22] deployment?
[10:22:38] Yes, I should have access to that. I'm on deploy2002 right now if that's helpful.
[10:23:01] do you mind creating a ticket or adding me to the existing one, just for tracking?
[10:23:11] Yes, please hold, I'll add you and link.
[10:23:23] thank you
[10:23:35] should have this relatively quickly if the data size is small
[10:24:32] https://phabricator.wikimedia.org/T355246
[10:25:06] Is it possible to dump just a portion of the database? It's actually quite large (iirc ~25M rows in one of the tables) but we only need a subset of the data.
[10:34:52] Tran: you should have said that before I started dumping :-D
[10:35:05] we dump in a file-per-table format, that could help
[10:35:34] but if you give me a regex I can apply?
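The per-table dump being discussed here can be approximated with stock mysqldump; a minimal sketch, assuming the database is called ipoid and ignoring whatever wrapper the production backup tooling actually uses:

```bash
# Dump only the import_status table from the ipoid database (assumed names),
# compressed on the fly; --single-transaction avoids locking InnoDB tables.
mysqldump --single-transaction ipoid import_status | gzip > ipoid.import_status.sql.gz

# Schema only (the CREATE TABLE, no rows), matching the *-schema.sql.gz file delivered later.
mysqldump --no-data ipoid import_status | gzip > ipoid.import_status-schema.sql.gz
```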
[10:35:36] hey topranks - I'm seeing lvs2012 in homer diffs (along with a few other IPs on ! lines that aren't related to my change). Seems scary, I'm not gonna apply that :D https://phabricator.wikimedia.org/P54895
[10:36:26] :') The only table we really need is `import_status`. Is that name enough?
[10:36:55] yeah
[10:37:53] (for context re hnowlan's message, the ! IPs are the ones topranks added manually for me yesterday when I complained about the lvs2012 diff)
[10:37:56] this is the backup progress: https://phabricator.wikimedia.org/P54897
[10:38:16] I can send the import_status table your way now
[10:38:33] what's your production account?
[10:38:55] I'm unable to see that phab ticket. Could you add me if necessary? I'm STran on phab and stran on prod.
[10:40:53] as an aside, please ask for NDA rights on phab
[10:41:09] it will help with seeing non-public data on phab
[10:41:12] but not important now
[10:41:43] Tran: you should be able to see it now
[10:42:12] Tran: (for later) https://phabricator.wikimedia.org/project/view/974/
[10:42:18] Thank you 🙇
[10:42:53] I will copy the files ipoid.import_status-schema.sql.gz and ipoid.import_status.sql.gz to a folder in your home
[10:43:26] with deployment rights, so it can be seen by you and others
[10:51:05] Tran: while the transfer finishes, you will soon find 2 files in a subdir called ipoid_backup
[10:51:20] one is called ipoid.import_status-schema.sql.gz
[10:51:39] it is a gzip with the table definition (the CREATE TABLE)
[10:51:53] the other is called ipoid.import_status.sql.gz
[10:52:01] hnowlan: hey sorry for the delayed response
[10:52:17] Tran: it is a series of import statements to recreate the production table
[10:52:33] That's perfect, thank you so much
[10:52:36] topranks: no bother!
[10:52:52] em we have a patch almost ready to merge to fix the problem
[10:52:53] you can examine it with regular unix tools (zcat) or by running it on mysql: gunzip -c file | mysql ipoid
[10:53:25] hnowlan: just got the +1, let me merge and then you should be good
[10:53:36] or I can run homer if you like - it will help me to test it's ok
[10:54:10] Tran: I've made it "chown -R stran:wikidev ." so any other wikidev can see it too
[10:54:22] I think that's all, I will comment on the ticket
[10:54:41] topranks: go for it, no huge rush on my change
[10:55:29] hnowlan: cool, give me 20 mins; it's a plugin change so it's a little more involved to update Homer and deploy to the cumin hosts
[10:56:06] Thanks again!
[10:56:21] topranks: grand, thanks!
[11:08:42] hnowlan: I assume the devices you were adding were mw2267, mw2357 and mw2395?
[11:10:56] I pushed those changes now, and BGP is established with all 3
[11:11:03] jynus: thanks so much
[11:11:16] topranks: deadly, thanks!
[11:12:09] wondering if we should rethink our "no backups" policy for ipoid. if something goes catastrophically wrong in a data update or otherwise, it will be faster to reimport from a DB backup than to fully re-generate its data (2-3 hours)
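A minimal sketch of the inspect-and-restore step described above, assuming a local MariaDB/MySQL instance where an empty ipoid database has already been created:

```bash
# Peek at the compressed dumps without extracting them to disk.
zcat ipoid.import_status-schema.sql.gz | head    # the CREATE TABLE statement
zcat ipoid.import_status.sql.gz | head           # the row data

# Load the schema first, then the data, into a local database named ipoid.
zcat ipoid.import_status-schema.sql.gz | mysql ipoid
zcat ipoid.import_status.sql.gz | mysql ipoid
```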
[15:32:37] volans what's a reasonable number of servers to reimage at the same time? I have ~20 that need reimaging
[15:38:52] inflatador: define "same time"
[15:38:54] :)
[15:39:53] If I have a Dell system set up as JBOD, megaraid will tell me about unhappy drives, but the usual MegaRAID check will be OK "no disks configured for RAID" [cf. T355330 and icinga status for ms-be2072]. The megacli plugin knows about the sad drive, is there an easy way to get it to appear in Nagios?
[15:39:53] T355330: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330
[15:40:55] Emperor: sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli reports it correctly
[15:41:16] if what you meant is the predictive failure
[15:41:37] volans: yeah (and Swift ends up grumbling too); I sort of expected that would show up in nagios (but possibly my expectation is just wrong :) )
[15:41:44] volans I'd be running the reimage cookbook simultaneously on as many hosts as possible, each in their own tmux window on cumin2002. Hosts are all in the same codfw
[15:41:53] err...same DC (codfw)
[15:42:53] inflatador: as long as you give 2m between the starts it's all fine. The reason is that after the debian-installer they will run the downtime cookbook with --force-puppet on the alert host and a puppet run there takes more than 1 minute
[15:43:32] volans ACK, will do. Thanks for cleaning that up
[15:43:32] ka.mila opened T355187 the other day; we'll add a lock there together that will prevent the downtime failure, but it will still make the cookbook wait
[15:43:33] T355187: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187
[15:43:39] err...clearing that up. /me cannot type today
[15:43:49] lol
[15:43:52] no worries
[15:43:55] thanks for asking
[15:44:11] Emperor: I don't recall if we alert for predictive failure as opposed to a real failure
[15:44:15] in general (JBOD or not)
[15:45:39] ah, that might be then. The filesystem keeps going splat, so I figured it was time to replace, predictive failure or not :)
[15:45:50] s/be/& it/
[15:49:55] you got real errors on the FS?
[16:12:31] yeah
[16:16:28] ack :/
[16:16:46] double check if the current alert is supposed to fire in this case or not
[16:22:25] The -n output of get-raid-status-megacli is "raw_disk: 26 OK | enclosure_slot: 1/26 CRIT | adapter: 1 OK"
[16:24:16] but the check is probably using check_raid megacli ?
[16:25:25] if icinga is to be believed check_nrpe -c check_raid_megaraid
[16:25:26] check_raid_megaraid in the icinga config, which on the nrpe side calls "sudo /usr/local/lib/nagios/plugins/check_raid megacli"
[16:25:40] and that gives
[16:25:41] OK: no disks configured for RAID
[16:25:41] OK
[16:25:51] so that's clearly not checking what you should check
[16:26:23] where should I have found that? I was trying the puppet repo
[16:26:46] which part?
[16:26:51] (from https://wikitech.wikimedia.org/wiki/Icinga#Custom_Checks )
[16:26:59] volans: that command icinga was actually running
[16:27:07] on the host
[16:27:10] in /etc/nagios/nrpe.d
[16:27:58] the .cfg files there define the commands to run
[16:30:06] if you want you could switch to use the get-raid-status-megacli script with -n for the check
[16:31:31] Mmm, though that would require i) working out where we're currently setting the check (presumably via nrpe::somethingorother) and ii) working out why we aren't using get-raid-status-megacli
[16:31:45] that's defined in modules/raid/manifests/megaraid.pp
[16:32:13] for context the get-raid-status-megacli is the one used by the raid handler that creates the tasks automatically when a check fails
[16:32:17] and acks the check itself
[16:32:37] I don't currently recall why the nagios output was added with -n
[16:33:03] and potentially it could check less/different things
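For illustration only, an NRPE command definition of the kind living in /etc/nagios/nrpe.d could be pointed at get-raid-status-megacli -n roughly as below; in production the definition is puppet-managed (modules/raid/manifests/megaraid.pp), so this is a sketch of the intended end result rather than something to hand-edit on the host:

```
# /etc/nagios/nrpe.d/check_raid_megaraid.cfg (illustrative only; generated by puppet)
# Current behaviour discussed above:
#   command[check_raid_megaraid]=/usr/bin/sudo /usr/local/lib/nagios/plugins/check_raid megacli
# Hypothetical switch to the script the raid handler already uses, with nagios-style (-n) output:
command[check_raid_megaraid]=/usr/bin/sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -n
```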