[09:04:43] jynus: is it possible for you to provide me with a dump of the ipoid service DB, or a backup from this morning/last night if it exists? We are debugging an application issue and having the DB contents would be helpful.
[09:43:25] <_joe_> kostajh: wouldn't it be more convenient to have a dedicated user on the DB you can use to query it?
[09:47:46] _joe_: we have one. But we're trying to work out why the app is failing at a certain step of the import process, and it will be much easier to just plug in the DB dump to our local dev instances and step through that way.
[09:48:06] <_joe_> ah gotcha
[09:48:14] in the meantime, yeah we are making do with manual queries to try to work out what is going on
[10:09:34] kostajh: what is that service?
[10:09:43] where is it?
[10:10:41] where is your local dev instance?
[10:12:52] jynus: https://wikitech.wikimedia.org/wiki/Service/IPoid
[10:12:57] my local dev instance is on my laptop.
[10:13:23] so I imagine I would download the file from a deployment server
[10:13:38] generally, that's not allowed: moving data from production to outside of the network
[10:14:11] I see it is on m5
[10:14:15] there is no user-generated data in that application
[10:14:45] I can generate mostly the same data locally on my laptop -- the thing I need is to be able to debug some data that exists in the `import_status` table
[10:15:02] so, even just that table alone would be great, if possible.
[10:16:00] jynus: I need to step out for a bit, but Tran is on my team and can follow up on this as well
[10:16:14] :wave
[10:17:09] backups for ipoid were not set up
[10:17:56] as when the db was requested, it was said backups: no
[10:18:09] > Backup Policy: Needed? Frequency? no
[10:18:57] <_joe_> which makes sense as the data is populated from an external feed
[10:19:49] I can set them up now
[10:20:13] or generate a new one
[10:20:46] but I would prefer to dump it inside the network and give you access to the file or test db
[10:21:02] let me know which you would prefer
[10:21:04] Could we do it without a backup? We don't need this data to be backed up for any kind of archival reason; we're just trying to replicate a bug locally.
[10:21:23] you mean, a dump?
[10:21:49] Yes, but we'd want to download the dump to our local machine. We don't have any kind of sandbox server for this service.
[10:22:11] give me a host you have access to on prod and I will put it there
[10:22:22] deployment?
[10:22:38] Yes, I should have access to that. I'm on deploy2002 right now if that's helpful.
[10:23:01] do you mind creating a ticket or adding me to the existing one, just for tracking?
[10:23:11] Yes, please hold, I'll add you and link.
[10:23:23] thank you
[10:23:35] should have this relatively quickly if the data size is small
[10:24:32] https://phabricator.wikimedia.org/T355246
[10:25:06] Is it possible to dump just a portion of the database? It's actually quite large (iirc ~25M rows in one of the tables) but we only need a subset of the data.
[10:34:52] Tran: you should have said that before I started dumping :-D
[10:35:05] we dump in a file-per-table format, that could help
[10:35:34] but if you give me a regex I can apply?
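The per-table dump being discussed here can be approximated with stock mysqldump; a minimal sketch, assuming the database is called ipoid and ignoring whatever wrapper the production backup tooling actually uses:

```bash
# Dump only the import_status table from the ipoid database (assumed names),
# compressed on the fly; --single-transaction avoids locking InnoDB tables.
mysqldump --single-transaction ipoid import_status | gzip > ipoid.import_status.sql.gz

# Schema only (the CREATE TABLE, no rows), matching the *-schema.sql.gz file delivered later.
mysqldump --no-data ipoid import_status | gzip > ipoid.import_status-schema.sql.gz
```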
[10:35:36] hey topranks - I'm seeing lvs2012 in homer diffs (along with a few other IPs on ! lines that aren't related to my change). Seems scary, I'm not gonna apply that :D https://phabricator.wikimedia.org/P54895
[10:36:26] :') The only table we really need is `import_status`. Is that name enough?
[10:36:55] yeah
[10:37:53] (for context re hnowlan's message, the ! IPs are the ones topranks added manually for me yesterday when I complained about the lvs2012 diff)
[10:37:56] this is the backup progress: https://phabricator.wikimedia.org/P54897
[10:38:16] I can send the import_status table your way now
[10:38:33] what's your production account?
[10:38:55] I'm unable to see that phab ticket. Could you add me if necessary? I'm STran on phab and stran on prod.
[10:40:53] as an aside, please ask for NDA rights on phab
[10:41:09] it will help with seeing non-public data on phab
[10:41:12] but not important now
[10:41:43] Tran: you should be able to see it now
[10:42:12] Tran: (for later) https://phabricator.wikimedia.org/project/view/974/
[10:42:18] Thank you 🙇
[10:42:53] I will copy the files ipoid.import_status-schema.sql.gz and ipoid.import_status.sql.gz to a folder in your home
[10:43:26] with deployment rights, so it can be seen by you and others
[10:51:05] Tran: while the transfer finishes, you will soon find 2 files in a subdir called ipoid_backup
[10:51:20] one is called ipoid.import_status-schema.sql.gz
[10:51:39] it is a gzip with the table definition (the CREATE TABLE)
[10:51:53] the other is called ipoid.import_status.sql.gz
[10:52:01] hnowlan: hey sorry for the delayed response
[10:52:17] Tran: it is a series of import statements to recreate the production table
[10:52:33] That's perfect, thank you so much
[10:52:36] topranks: no bother!
[10:52:52] em we have a patch almost ready to merge to fix the problem
[10:52:53] you can examine it with regular unix tools (zcat) or by running it on mysql: gunzip -c file | mysql ipoid
[10:53:25] hnowlan: just got the +1, let me merge and then you should be good
[10:53:36] or I can run homer if you like - it will help me to test it's ok
[10:54:10] Tran: I've made it "chown -R stran:wikidev ." so any other wikidev can see it too
[10:54:22] I think that's all, I will comment on the ticket
[10:54:41] topranks: go for it, no huge rush on my change
[10:55:29] hnowlan: cool, give me 20 mins; it's a plugin change so it's a little more involved to update Homer and deploy to the cumin hosts
[10:56:06] Thanks again!
[10:56:21] topranks: grand, thanks!
[11:08:42] hnowlan: I assume the devices you were adding were mw2267, mw2357 and mw2395?
[11:10:56] I pushed those changes now, and BGP is established with all 3
[11:11:03] jynus: thanks so much
[11:11:16] topranks: deadly, thanks!
[11:12:09] wondering if we should rethink our "no backups" policy for ipoid. if something goes catastrophically wrong in a data update or otherwise, it will be faster to reimport from a DB backup than to fully re-generate its data (2-3 hours)
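A minimal sketch of the inspect-and-restore step described above, assuming a local MariaDB/MySQL instance where an empty ipoid database has already been created:

```bash
# Peek at the compressed dumps without extracting them to disk.
zcat ipoid.import_status-schema.sql.gz | head    # the CREATE TABLE statement
zcat ipoid.import_status.sql.gz | head           # the row data

# Load the schema first, then the data, into a local database named ipoid.
zcat ipoid.import_status-schema.sql.gz | mysql ipoid
zcat ipoid.import_status.sql.gz | mysql ipoid
```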
[15:32:37] volans what's a reasonable number of servers to reimage at the same time? I have ~20 that need reimaging
[15:38:52] inflatador: define "same time"
[15:38:54] :)
[15:39:53] If I have a Dell system set up as JBOD, megaraid will tell me about unhappy drives, but the usual MegaRAID check will be OK "no disks configured for RAID" [cf. T355330 and icinga status for ms-be2072]. The megacli plugin knows about the sad drive, is there an easy way to get it to appear in Nagios?
[15:39:53] T355330: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330
[15:40:55] Emperor: sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli reports it correctly
[15:41:16] if what you meant is the predictive failure
[15:41:37] volans: yeah (and Swift ends up grumbling too); I sort of expected that would show up in nagios (but possibly my expectation is just wrong :) )
[15:41:44] volans I'd be running the reimage cookbook simultaneously on as many hosts as possible, each in their own tmux window on cumin2002. Hosts are all in the same codfw
[15:41:53] err...same DC (codfw)
[15:42:53] inflatador: as long as you give 2m between the starts it's all fine. The reason is that after the debian-installer they will run the downtime cookbook with --force-puppet on the alert host and a puppet run there takes more than 1 minute
[15:43:32] volans ACK, will do. Thanks for cleaning that up
[15:43:32] ka.mila opened T355187 the other day; we'll add a lock there together that will prevent the downtime failure, but it will still make the cookbook wait
[15:43:33] T355187: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187
[15:43:39] err...clearing that up. /me cannot type today
[15:43:49] lol
[15:43:52] no worries
[15:43:55] thanks for asking
[15:44:11] Emperor: I don't recall if we alert for predictive failure as opposed to a real failure
[15:44:15] in general (JBOD or not)
[15:45:39] ah, that might be then. The filesystem keeps going splat, so I figured it was time to replace, predictive failure or not :)
[15:45:50] s/be/& it/
[15:49:55] you got real errors on the FS?
[16:12:31] yeah
[16:16:28] ack :/
[16:16:46] double check if the current alert is supposed to fire in this case or not
[16:22:25] The -n output of get-raid-status-megacli is "raw_disk: 26 OK | enclosure_slot: 1/26 CRIT | adapter: 1 OK"
[16:24:16] but the check is probably using check_raid megacli ?
[16:25:25] if icinga is to be believed check_nrpe -c check_raid_megaraid
[16:25:26] check_raid_megaraid in the icinga config, which on the nrpe side calls "sudo /usr/local/lib/nagios/plugins/check_raid megacli"
[16:25:40] and that gives
[16:25:41] OK: no disks configured for RAID
[16:25:41] OK
[16:25:51] so that's clearly not checking what you should check
[16:26:23] where should I have found that? I was trying the puppet repo
[16:26:46] which part?
[16:26:51] (from https://wikitech.wikimedia.org/wiki/Icinga#Custom_Checks )
[16:26:59] volans: that command icinga was actually running
[16:27:07] on the host
[16:27:10] in /etc/nagios/nrpe.d
[16:27:58] the .cfg files there define the commands to run
[16:30:06] if you want you could switch to use the get-raid-status-megacli script with -n for the check
[16:31:31] Mmm, though that would require i) working out where we're currently setting the check (presumably via nrpe::somethingorother) and ii) working out why we aren't using get-raid-status-megacli
[16:31:45] that's defined in modules/raid/manifests/megaraid.pp
[16:32:13] for context the get-raid-status-megacli is the one used by the raid handler that creates the tasks automatically when a check fails
[16:32:17] and acks the check itself
[16:32:37] I don't currently recall why the nagios output was added with -n
[16:33:03] and potentially it could check less/different things
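For illustration only, an NRPE command definition of the kind living in /etc/nagios/nrpe.d could be pointed at get-raid-status-megacli -n roughly as below; in production the definition is puppet-managed (modules/raid/manifests/megaraid.pp), so this is a sketch of the intended end result rather than something to hand-edit on the host:

```
# /etc/nagios/nrpe.d/check_raid_megaraid.cfg (illustrative only; generated by puppet)
# Current behaviour discussed above:
#   command[check_raid_megaraid]=/usr/bin/sudo /usr/local/lib/nagios/plugins/check_raid megacli
# Hypothetical switch to the script the raid handler already uses, with nagios-style (-n) output:
command[check_raid_megaraid]=/usr/bin/sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli -n
```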