[09:22:34] jynus: I'm looking at dispatch-be1001 for the Backup freshness alert, looking at the runbook this is indeed a new host/backup, anything I should be doing, or will it fix itself?
[09:23:51] let's see
[09:24:19] No backups: 1 (dispatch-be1001) - it is only a warning
[09:25:49] got it, there is data in the db now, am I right in thinking the warning should go away, say, next week or so?
[09:25:51] but it should have run tonight
[09:26:11] let me see when it is scheduled
[09:26:41] ok! iirc it should be daily indeed
[09:27:30] weird, I don't see it scheduled for today or tomorrow
[09:27:40] when did you add the backups setup?
[09:28:17] earlier this week, like Tuesday IIRC
[09:28:57] so on the first of the month we do the monthly full backups on eqiad
[09:29:17] maybe that has caused some clogging, so not what I expected
[09:29:43] I will wait for the current backups to finish and then I will reload the daemon manually to make sure it is scheduled
[09:29:53] ok! thank you for your help
[09:30:08] are the dumps being produced?
[09:30:14] checking
[09:30:31] that way I can also do a manual run to make sure it works
[09:30:51] yeah, data is in /srv/postgres-backup
[09:31:11] so the alert is fair, there is an unexpected anomaly
[09:32:18] I will force a config reload in a few hours and ping you when I know more, it is weird
[09:32:27] cheers
[09:33:27] also please be aware of the T316655 defect
[09:33:27] T316655: Convert Netbox data (PostgresQL) longterm storage backups (bacula) into full backups rather than incrementals - https://phabricator.wikimedia.org/T316655
[09:33:54] postgres backups don't have the same support as others, there is a plan to make that better but it is not there yet
[09:34:20] thank you, I'll subscribe
[09:35:34] we have been in discussions to potentially support gitlab and postgres in the wmfbackups framework: T274463, but that will take time
[09:35:34] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[13:00:58] jynus: do you think it would be worth opening a bug for mydumper?
[13:01:11] it's complicated
[13:01:53] because the issue only happens on the old version, but the new version breaks other things
[13:02:44] I see
[13:03:05] and it has things we don't like or need
[13:03:36] so not sure if trying to fix things for mydumper is the right way; I need to do more tests
[13:03:48] sure, that makes sense
[13:04:09] for example, maybe the problem is the server version it is linked to
[13:04:32] but it is not easy to check, because that also requires more code changes, etc.
[13:06:07] the Debian maintainer is not keeping the package up to date (my guess is for a reason), there are lots of open questions about the best way to move forward
[13:07:29] another possibility could be to backport only the fix to the current version, so we keep current functionality
[13:07:44] that can be complicated, I would guess
[13:07:54] Like, is there a fix in place to start with?
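Stepping back to the backup-freshness check discussed at the start of this log: a minimal sketch of that kind of check, which flags a backup directory whose newest dump is older than the expected daily schedule. The /srv/postgres-backup path comes from the conversation; the threshold, file pattern, and function names are assumptions and this is not the actual wmfbackups/alerting code.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

# Assumed values for illustration only; the real freshness check lives in the
# wmfbackups/alerting stack and may use different thresholds and layouts.
BACKUP_DIR = Path("/srv/postgres-backup")  # path mentioned in the conversation
MAX_AGE = timedelta(days=2)                # daily schedule plus some slack


def newest_backup_age(backup_dir: Path) -> Optional[timedelta]:
    """Return the age of the most recently modified file under backup_dir."""
    files = [p for p in backup_dir.rglob("*") if p.is_file()]
    if not files:
        return None  # no backups at all, matching the "No backups" warning
    newest_mtime = max(f.stat().st_mtime for f in files)
    return datetime.now() - datetime.fromtimestamp(newest_mtime)


if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age is None:
        print("WARNING: no backup files found")
    elif age > MAX_AGE:
        print(f"WARNING: newest backup is {age} old (expected < {MAX_AGE})")
    else:
        print(f"OK: newest backup is {age} old")
```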
[13:08:21] there is a fix, somewhere in the ~20 versions between 0.10 and 0.13
[13:09:53] Ah interesting, I didn't know that part
[13:10:05] So maybe that's an option, but it might break in future versions too
[13:10:12] But it would give us some more time
[13:10:13] alternatively, we could keep the dumper and create our own loader
[13:10:28] as the issues seem to me to be mostly about loading back the data
[13:11:02] I need to do more testing
[13:11:21] Yeah, I would prefer if we went for something that we don't have to maintain ourselves, although it has pros and cons
[13:12:07] the problem is the stability is not great: many features have been added that we don't care about and that are creating nothing but constant bugs
[13:12:27] so another option would be to use an alternative tool
[13:12:49] see https://github.com/mydumper/mydumper/releases
[13:13:19] But we don't have many alternative tools to import data in parallel like myloader does :(
[13:13:28] apparently there is a new release - maybe that fixes our issues?
[13:13:47] We can try, I guess
[13:17:09] the maintainer has changed again: https://github.com/mydumper/mydumper/issues/688 and also the philosophy has changed (fast + stable + not many features)
[13:17:57] I think it might be worth opening a bug, maybe it is not that hard to fix
[13:19:58] I think there is a higher chance that we find the problem ourselves and send it as a Debian patch
[13:20:18] ok
[13:20:44] reporting "mydumper doesn't work" won't help us, it is not a bug report
[13:21:12] That's not what I have suggested, but anyways...
[13:21:25] I know, but that is what I have so far
[13:22:29] What I am trying to say is that if we believe we can report something to the maintainer, it might be easier for them to fix & release than for us. But you know the problem better than I do, so up to you
[13:23:04] there is an old report with the same symptoms, the answer was "reduce the number of threads"
[13:23:16] (which I tried and doesn't work)
[13:23:55] we first need to find what is special about our setup that makes it fail
[13:27:14] if you could try the latest release on a test env, that would help me a lot
[13:28:02] you have db1124 and db1125 there
[13:28:09] They are not being used at the moment
[14:25:49] Amir1: does https://phabricator.wikimedia.org/T320835#8365339 make sense to you?
[14:26:05] let me check
[14:26:47] Amir1: TTBOMK, what was proposed was to use memcache for these mp3s, and then on the page save hook, copy the final iteration to swift via a deferred update. That comment is meant to explain why that will not work. :/
[14:27:26] hmm yeah
[14:27:31] I see. Thanks
[14:31:03] urandom: I wrote something
[14:32:16] Amir1: thanks; FWIW I'm inclined to think we should just let it go forward too... but here is my concern...
[14:33:21] This thing is going to ramp up slowly over $some_period, to grow to $some_expected_size, but will almost certainly be something > $some_expected_size
[14:33:37] So what *should* we expect to see when looking at it in 6m?
[14:34:08] How would we gauge that it was OK? Why do we think 6 months is the right time frame?
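One way to get the kind of numbers being asked for here is to poll the container's object count and bytes used over time and compare growth against expectations. A minimal sketch, assuming python-swiftclient; the auth endpoint, account, key, and container name are placeholders, not the actual Phonos/Swift production configuration.

```python
from swiftclient.client import Connection

# All connection details and the container name are placeholders, not the
# actual Phonos/Swift production configuration.
conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",  # hypothetical auth endpoint
    user="AUTH_phonos:monitor",                     # hypothetical account
    key="REPLACE_ME",
)


def container_stats(container: str) -> tuple:
    """Return (object_count, bytes_used) taken from the container HEAD headers."""
    headers = conn.head_container(container)
    return (
        int(headers["x-container-object-count"]),
        int(headers["x-container-bytes-used"]),
    )


if __name__ == "__main__":
    count, size = container_stats("phonos-render")  # hypothetical container name
    print(f"objects={count} bytes={size}")
```

Logged periodically, those two numbers would answer "what should we expect to see in 6 months" with data rather than a guess, though correlating object count with distinct terms would still need the request logs mentioned below.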
[14:35:04] I'm concerned that we'll only notice it's a problem 2 years from now, when the number of files in the container has grown into the many millions and (whoever is around then) will need to fumble around for answers and a solution
[14:36:13] urandom: yeah, my thinking is along the lines of "let's deploy this and while it's being adopted develop a clean-up strategy"
[14:36:29] we can farm request logs to get an idea of what the hit ratio is, but if we look in 6 months and see (just throwing numbers out there) 300k files, how would we know that wasn't for 50k actual terms?
[14:36:45] based on actual data, not something like "let's look at it in six months and if it's below some number, we don't care"
[14:37:15] right, but I think that's something we have to get a commitment on now
[14:37:40] otherwise I'd expect them to move on to other priorities, and not have the time
[14:37:55] (and they'd be right to)
[14:37:58] yup yup
[14:39:53] as I said before, I think this is quite parallel to T211661 and will probably get the same solution. Maybe solving that gives more freedom to Phonos
[14:39:53] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[14:40:28] Oh, right, LRU
[14:40:44] yes, that's the right semantics for this
[14:41:31] allocate capacity that fits the cost:benefit, and cull the least recently accessed
[14:44:18] yeah
[14:44:32] I'm planning to get some numbers on this
[15:03:18] PROBLEM - MariaDB sustained replica lag on m2 on db1117 is CRITICAL: 3802 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[15:03:33] oh
[15:03:40] PROBLEM - MariaDB sustained replica lag on m2 on db2133 is CRITICAL: 226 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[15:04:23] mwaddlink
[15:05:40] RECOVERY - MariaDB sustained replica lag on m2 on db2133 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[15:08:18] PROBLEM - MariaDB sustained replica lag on m2 on db2160 is CRITICAL: 332 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13322
[15:09:16] RECOVERY - MariaDB sustained replica lag on m2 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[15:12:18] RECOVERY - MariaDB sustained replica lag on m2 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13322
[16:15:00] PROBLEM - MariaDB sustained replica lag on m2 on db1117 is CRITICAL: 101 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
[16:17:00] RECOVERY - MariaDB sustained replica lag on m2 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13322
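The alerts above fire when a replica's sustained lag stays above a small threshold. As a rough illustration of what they measure, a minimal sketch assuming pymysql, a monitoring account, and hypothetical connection details; the production check is the Icinga/Prometheus tooling linked in the alerts, not this script, and it evaluates the lag over a time window rather than a single sample.

```python
from typing import Optional

import pymysql

# Host, port, and credentials are placeholders; the production check is driven
# by the monitoring stack referenced in the alert URLs above, not this script.
WARN, CRIT = 1, 2  # seconds, mirroring the "(C)2 ge (W)1" thresholds in the alerts


def replica_lag(host: str, port: int = 3306) -> Optional[int]:
    """Return Seconds_Behind_Master for a replica, or None if unavailable."""
    conn = pymysql.connect(host=host, port=port, user="monitor",
                           password="REPLACE_ME",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row.get("Seconds_Behind_Master") if row else None
    finally:
        conn.close()


if __name__ == "__main__":
    lag = replica_lag("db1117.eqiad.wmnet")  # hypothetical direct connection
    if lag is None:
        print("UNKNOWN: replication status unavailable")
    elif lag >= CRIT:
        print(f"CRITICAL: {lag} ge {CRIT}")
    elif lag >= WARN:
        print(f"WARNING: {lag} ge {WARN}")
    else:
        print(f"OK: replica lag is {lag}s")
```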
[18:02:42] o/ I'm having some issues with the dbproxy haproxy in front of the wikireplicas, could anyone here have a quick look?
[18:03:14] HAProxy is complaining: 'Health check for server mariadb-s8/clouddb1016.eqiad.wmnet failed, reason: Layer4 timeout, check duration: 3000ms'
[18:03:25] but 'nc' to clouddb1016.eqiad.wmnet works fine
[18:03:36] full story at T313445
[18:03:36] T313445: hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445
[18:05:51] dhinus: did the IP change?
[18:05:58] it did
[18:06:08] new rack, new subnet
[18:06:15] new vlan
[18:07:03] then that's the issue, the grants for the haproxy user
[18:07:05] what's the new IP?
[18:07:05] dcaro just spotted an Access denied in the haproxy logs
[18:07:10] (and the old one)
[18:07:22] new one: 10.64.16.14/22
[18:07:42] old one: give me a sec :)
[18:09:18] dhinus: check again if it works now
[18:11:40] Nov 03 18:10:34 dbproxy1019 haproxy[21020]: [WARNING] 306/181034 (21020) : Health check for server mariadb-s8/clouddb1016.eqiad.wmnet succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.4.22-MariaDB", check duration: 0ms, st>
[18:11:50] Let me apply the fix for all the other ones then
[18:16:09] dhinus: they are all now up
[18:16:23] https://phabricator.wikimedia.org/P38085
[18:17:27] yes, I think it's now working correctly!
[18:17:38] thanks marostegui!
[18:17:48] I commented on the task, can you please let me know the old IP on the task?
[18:17:57] https://phabricator.wikimedia.org/T313445#8367942
[18:18:16] I will clean up the old one tomorrow, see you o/
[18:20:40] thanks, yes I'll add the old IP there
[18:20:46] haven't found it yet :D
[18:25:13] I can probably find it tomorrow in our mysql grants, no worries
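For reference, the fix above amounts to making sure the MariaDB account used by the HAProxy health check is allowed to connect from the proxy's new source address. A minimal sketch assuming pymysql; the check user name, its lack of password, and granting for the single new IP rather than the whole /22 subnet are assumptions, and the actual production grants are the ones referenced in P38085, not this snippet.

```python
import pymysql

# The check user name, its (lack of) password, and granting for a single IP
# rather than the whole new subnet are assumptions; the real grants live in
# the paste/config referenced above and are not reproduced here.
CHECK_USER = "haproxy"                    # hypothetical health-check user
NEW_PROXY_IP = "10.64.16.14"              # new dbproxy1019 address from the log
BACKENDS = ["clouddb1016.eqiad.wmnet"]    # extend with the other clouddb hosts


def allow_proxy(backend: str) -> None:
    """Let the health-check user connect from the proxy's new source address."""
    conn = pymysql.connect(host=backend, user="root", password="REPLACE_ME")
    try:
        with conn.cursor() as cur:
            # Constants only; real tooling should not build SQL by formatting.
            cur.execute(f"CREATE USER IF NOT EXISTS '{CHECK_USER}'@'{NEW_PROXY_IP}'")
            cur.execute(f"GRANT USAGE ON *.* TO '{CHECK_USER}'@'{NEW_PROXY_IP}'")
    finally:
        conn.close()


for backend in BACKENDS:
    allow_proxy(backend)
```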