[09:22:34] jynus: I'm looking at dispatch-be1001 for the Backup freshness alert, looking at the runbook this is indeed a new host/backup, anything I should be doing, or will it fix itself?
[09:23:51] let's see
[09:24:19] No backups: 1 (dispatch-be1001) - it is only a warning
[09:25:49] got it, there is data in the db now, am I right in thinking the warning should go away, say, next week or so?
[09:25:51] but it should have run tonight
[09:26:11] let me see when it is scheduled
[09:26:41] ok! iirc it should be daily indeed
[09:27:30] weird, I don't see it scheduled for today or tomorrow
[09:27:40] when did you add the backups setup?
[09:28:17] earlier this week, like Tuesday IIRC
[09:28:57] so on the first of the month we do the monthly full backups on eqiad
[09:29:17] maybe that has caused some clogging, so not what I expected
[09:29:43] I will wait for the current backups to finish and then I will reload the daemon manually to make sure it is scheduled
[09:29:53] ok! thank you for your help
[09:30:08] are the dumps being produced?
[09:30:14] checking
[09:30:31] that way I can also do a manual run to make sure it works
[09:30:51] yeah, data is in /srv/postgres-backup
[09:31:11] so the alert is fair, there is an unexpected anomaly
[09:32:18] I will force a config reload in a few hours and ping you when I know more, it is weird
[09:32:27] cheers
[09:33:27] also please be aware of the T316655 defect
[09:33:27] T316655: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655
[09:33:54] postgres backups don't have the same support as others, there is a plan to make that better but it is not there yet
[09:34:20] thank you, I'll subscribe
[09:35:34] we have been in discussions to potentially support gitlab and postgres in the wmfbackups framework: T274463, but that will take time
[09:35:34] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[13:00:58] jynus: do you think it would be worth opening a bug for mydumper?
[13:01:11] it's complicated
[13:01:53] because the issue only happens on the old version, but the new version breaks other things
[13:02:44] I see
[13:03:05] and it has things we don't like or need
[13:03:36] so not sure if trying to fix things for mydumper is the right way; I need to do more tests
[13:03:48] sure, that makes sense
[13:04:09] for example, maybe the problem is the server version it is linked to
[13:04:32] but it is not easy to check, because that also requires more code changes, etc.
[13:06:07] the Debian maintainer is not keeping the package up to date (my guess is for a reason), there are lots of open questions about the best way to move forward
[13:07:29] another possibility could be to backport only the fix to the current version, so we keep current functionality
[13:07:44] that can be complicated, I would guess
[13:07:54] Like, is there a fix in place to start with?
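Stepping back to the backup-freshness check discussed at the start of this log: a minimal sketch of that kind of check, which flags a backup directory whose newest dump is older than the expected daily schedule. The /srv/postgres-backup path comes from the conversation; the threshold, file pattern, and function names are assumptions and this is not the actual wmfbackups/alerting code.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

# Assumed values for illustration only; the real freshness check lives in the
# wmfbackups/alerting stack and may use different thresholds and layouts.
BACKUP_DIR = Path("/srv/postgres-backup")  # path mentioned in the conversation
MAX_AGE = timedelta(days=2)                # daily schedule plus some slack


def newest_backup_age(backup_dir: Path) -> Optional[timedelta]:
    """Return the age of the most recently modified file under backup_dir."""
    files = [p for p in backup_dir.rglob("*") if p.is_file()]
    if not files:
        return None  # no backups at all, matching the "No backups" warning
    newest_mtime = max(f.stat().st_mtime for f in files)
    return datetime.now() - datetime.fromtimestamp(newest_mtime)


if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age is None:
        print("WARNING: no backup files found")
    elif age > MAX_AGE:
        print(f"WARNING: newest backup is {age} old (expected < {MAX_AGE})")
    else:
        print(f"OK: newest backup is {age} old")
```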
[13:08:21] there is a fix, somewhere in the ~20 versions between 0.10 and 0.13
[13:09:53] Ah interesting, I didn't know that part
[13:10:05] So maybe that's an option, but it might break in future versions too
[13:10:12] But it would give us some more time
[13:10:13] alternatively, we could keep the dumper and create our own loader
[13:10:28] as the issues seem to me to be mostly about loading back the data
[13:11:02] I need to do more testing
[13:11:21] Yeah, I would prefer if we went for something that we don't have to maintain ourselves, although it has pros and cons
[13:12:07] the problem is the stability is not great: many features have been added that we don't care about and that are creating nothing but constant bugs
[13:12:27] so another option would be to use an alternative tool
[13:12:49] see https://github.com/mydumper/mydumper/releases
[13:13:19] But we don't have many alternative tools to import data in parallel like myloader does :(
[13:13:28] apparently there is a new release - maybe that fixes our issues?
[13:13:47] We can try, I guess
[13:17:09] the maintainer has changed again: https://github.com/mydumper/mydumper/issues/688 and also the philosophy has changed (fast + stable + not many features)
[13:17:57] I think it might be worth opening a bug, maybe it is not that hard to fix
[13:19:58] I think there is a higher chance that we find the problem ourselves and send it as a Debian patch
[13:20:18] ok
[13:20:44] reporting "mydumper doesn't work" won't help us, it is not a bug report
[13:21:12] That's not what I have suggested, but anyways...
[13:21:25] I know, but that is what I have so far
[13:22:29] What I am trying to say is that if we believe we can report something to the maintainer, it might be easier for them to fix & release than for us. But you know the problem better than I do, so up to you
[13:23:04] there is an old report with the same symptoms, the answer was "reduce the number of threads"
[13:23:16] (which I tried and doesn't work)
[13:23:55] we first need to find what is special about our setup that makes it fail
[13:27:14] if you could try the latest release on a test env, that would help me a lot
[13:28:02] you have db1124 and db1125 there
[13:28:09] They are not being used at the moment
[14:25:49] Amir1: does https://phabricator.wikimedia.org/T320835#8365339 make sense to you?
[14:26:05] let me check
[14:26:47] Amir1: TTBOMK, what was proposed was to use memcache for these mp3s, and then on the page save hook, copy the final iteration to swift via a deferred update. That comment is meant to explain why that will not work. :/
[14:27:26] hmm yeah
[14:27:31] I see. Thanks
[14:31:03] urandom: I wrote something
[14:32:16] Amir1: thanks; FWIW I'm inclined to think we should just let it go forward too... but here is my concern...
[14:33:21] This thing is going to ramp up slowly over $some_period, to grow to $some_expected_size, but will almost certainly be something > $some_expected_size
[14:33:37] So what *should* we expect to see when looking at it in 6m?
[14:34:08] How would we gauge that it was OK? Why do we think 6 months is the right time frame?
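One way to get the kind of numbers being asked for here is to poll the container's object count and bytes used over time and compare growth against expectations. A minimal sketch, assuming python-swiftclient; the auth endpoint, account, key, and container name are placeholders, not the actual Phonos/Swift production configuration.

```python
from swiftclient.client import Connection

# All connection details and the container name are placeholders, not the
# actual Phonos/Swift production configuration.
conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # hypothetical auth endpoint
    user="AUTH_phonos:monitor",                     # hypothetical account
    key="REPLACE_ME",
)


def container_stats(container: str) -> tuple:
    """Return (object_count, bytes_used) taken from the container HEAD headers."""
    headers = conn.head_container(container)
    return (
        int(headers["x-container-object-count"]),
        int(headers["x-container-bytes-used"]),
    )


if __name__ == "__main__":
    count, size = container_stats("phonos-render")  # hypothetical container name
    print(f"objects={count} bytes={size}")
```

Logged periodically, those two numbers would answer "what should we expect to see in 6 months" with data rather than a guess, though correlating object count with distinct terms would still need the request logs mentioned below.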
[14:35:04] I'm concerned that we'll only notice it's a problem 2 years from now, when the number of files in the container has grown into the many millions and (whoever is around then) will need to fumble around for answers and a solution
[14:36:13] urandom: yeah, my thinking is along the lines of "let's deploy this and while it's being adopted develop a clean-up strategy"
[14:36:29] we can farm request logs to get an idea of what the hit ratio is, but if we look in 6 months and see (just throwing numbers out there) 300k files, how would we know that wasn't for 50k actual terms?
[14:36:45] based on actual data, not something like "let's look at it in six months and if it's below some number, we don't care"
[14:37:15] right, but I think that's something we have to get a commitment on now
[14:37:40] otherwise I'd expect them to move on to other priorities, and not have the time
[14:37:55] (and they'd be right to)
[14:37:58] yup yup
[14:39:53] as I said before, I think this is quite parallel to T211661 and will probably get the same solution. Maybe solving that gives more freedom to Phonos
[14:39:53] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[14:40:28] Oh, right, LRU
[14:40:44] yes, that's the right semantics for this
[14:41:31] allocate capacity that fits the cost:benefit, and cull the least recently accessed
[14:44:18] yeah
[14:44:32] I'm planning to get some numbers on this
[15:03:18] PROBLEM - MariaDB sustained replica lag on m2 on db1117 is CRITICAL: 3802 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[15:03:33] oh
[15:03:40] PROBLEM - MariaDB sustained replica lag on m2 on db2133 is CRITICAL: 226 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[15:04:23] mwaddlink
[15:05:40] RECOVERY - MariaDB sustained replica lag on m2 on db2133 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[15:08:18] PROBLEM - MariaDB sustained replica lag on m2 on db2160 is CRITICAL: 332 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13322
[15:09:16] RECOVERY - MariaDB sustained replica lag on m2 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[15:12:18] RECOVERY - MariaDB sustained replica lag on m2 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13322
[16:15:00] PROBLEM - MariaDB sustained replica lag on m2 on db1117 is CRITICAL: 101 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[16:17:00] RECOVERY - MariaDB sustained replica lag on m2 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
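The alerts above fire when a replica's sustained lag stays above a small threshold. As a rough illustration of what they measure, a minimal sketch assuming pymysql, a monitoring account, and hypothetical connection details; the production check is the Icinga/Prometheus tooling linked in the alerts, not this script, and it evaluates the lag over a time window rather than a single sample.

```python
from typing import Optional

import pymysql

# Host, port, and credentials are placeholders; the production check is driven
# by the monitoring stack referenced in the alert URLs above, not this script.
WARN, CRIT = 1, 2  # seconds, mirroring the "(C)2 ge (W)1" thresholds in the alerts


def replica_lag(host: str, port: int = 3306) -> Optional[int]:
    """Return Seconds_Behind_Master for a replica, or None if unavailable."""
    conn = pymysql.connect(host=host, port=port, user="monitor",
                           password="REPLACE_ME",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row.get("Seconds_Behind_Master") if row else None
    finally:
        conn.close()


if __name__ == "__main__":
    lag = replica_lag("db1117.eqiad.wmnet")  # hypothetical direct connection
    if lag is None:
        print("UNKNOWN: replication status unavailable")
    elif lag >= CRIT:
        print(f"CRITICAL: {lag} ge {CRIT}")
    elif lag >= WARN:
        print(f"WARNING: {lag} ge {WARN}")
    else:
        print(f"OK: replica lag is {lag}s")
```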
[18:02:42] o/ I'm having some issues with the dbproxy haproxy in front of the wikireplicas, could anyone here have a quick look?
[18:03:14] HAProxy is complaining: 'Health check for server mariadb-s8/clouddb1016.eqiad.wmnet failed, reason: Layer4 timeout, check duration: 3000ms'
[18:03:25] but 'nc' to clouddb1016.eqiad.wmnet works fine
[18:03:36] full story at T313445
[18:03:36] T313445: hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445
[18:05:51] dhinus: did the IP change?
[18:05:58] it did
[18:06:08] new rack, new subnet
[18:06:15] new vlan
[18:07:03] then that's the issue, the grants for the haproxy user
[18:07:05] what's the new IP?
[18:07:05] dcaro just spotted an Access denied in the haproxy logs
[18:07:10] (and the old one)
[18:07:22] new one: 10.64.16.14/22
[18:07:42] old one: give me a sec :)
[18:09:18] dhinus: check again if it works now
[18:11:40] Nov 03 18:10:34 dbproxy1019 haproxy[21020]: [WARNING] 306/181034 (21020) : Health check for server mariadb-s8/clouddb1016.eqiad.wmnet succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.4.22-MariaDB", check duration: 0ms, st>
[18:11:50] Let me apply the fix for all the other ones then
[18:16:09] dhinus: they are all now up
[18:16:23] https://phabricator.wikimedia.org/P38085
[18:17:27] yes, I think it's now working correctly!
[18:17:38] thanks marostegui!
[18:17:48] I commented on the task, can you please let me know the old IP on the task?
[18:17:57] https://phabricator.wikimedia.org/T313445#8367942
[18:18:16] I will clean up the old one tomorrow, see you o/
[18:20:40] thanks, yes I'll add the old IP there
[18:20:46] haven't found it yet :D
[18:25:13] I can probably find it tomorrow in our mysql grants, no worries
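For reference, the fix above amounts to making sure the MariaDB account used by the HAProxy health check is allowed to connect from the proxy's new source address. A minimal sketch assuming pymysql; the check user name, its lack of password, and granting for the single new IP rather than the whole /22 subnet are assumptions, and the actual production grants are the ones referenced in P38085, not this snippet.

```python
import pymysql

# The check user name, its (lack of) password, and granting for a single IP
# rather than the whole new subnet are assumptions; the real grants live in
# the paste/config referenced above and are not reproduced here.
CHECK_USER = "haproxy"                    # hypothetical health-check user
NEW_PROXY_IP = "10.64.16.14"              # new dbproxy1019 address from the log
BACKENDS = ["clouddb1016.eqiad.wmnet"]    # extend with the other clouddb hosts


def allow_proxy(backend: str) -> None:
    """Let the health-check user connect from the proxy's new source address."""
    conn = pymysql.connect(host=backend, user="root", password="REPLACE_ME")
    try:
        with conn.cursor() as cur:
            # Constants only; real tooling should not build SQL by formatting.
            cur.execute(f"CREATE USER IF NOT EXISTS '{CHECK_USER}'@'{NEW_PROXY_IP}'")
            cur.execute(f"GRANT USAGE ON *.* TO '{CHECK_USER}'@'{NEW_PROXY_IP}'")
    finally:
        conn.close()


for backend in BACKENDS:
    allow_proxy(backend)
```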