[06:25:54] taavi: Is cloudlb1001,2 supposed to add clouddb hosts via haproxy user?
[06:26:06] We just got an alert about it
[06:30:08] s/add/access
[06:44:05] Starting to failover m2
[08:56:33] marostegui: yeah, those are intentional, added via https://gerrit.wikimedia.org/r/c/operations/puppet/+/973777. sorry about the alert, did I miss something that would have prevented it?
[08:56:54] taavi: no worries, I will whitelist those hosts
[08:56:55] thanks
[09:56:16] did you see db2096? it may p*ge soon: https://grafana.wikimedia.org/goto/C9z6JgNIk?orgId=1
[09:59:52] That host should've been decommissioned I think
[10:00:11] there's so much stuff going on that I lose track
[10:00:54] np
[10:00:54] it is weird that it is growing in size
[10:00:54] I cannot see an obvious reason
[10:00:56] it is still in production, though, receiving core traffic
[10:01:08] yeah, it is one of the old ones with smaller disks
[10:01:39] yeah, scheduled for replacement in Q3
[10:02:18] interesting, I didn't know x1 had grown so much
[10:02:24] it has loads of binlogs too
[10:02:27] I am going to clean that up a bit
[10:02:34] 3.5 TB
[10:05:13] it is not the DBs, it is the binlogs
[10:05:26] However, why are we generating so many? Amir1 are we writing a lot more to x1?
[10:06:18] jynus: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db2096&var-datasource=thanos&var-cluster=mysql&viewPanel=28&from=now-5m&to=now
[10:06:19] yeah, it is common to all x1s
[10:06:37] since 1 Nov, there is a lot of growth there
[10:06:41] Amir1: ^
[10:06:52] it makes sense that this host complained first
[10:06:55] db2097 will be next
[10:06:56] ah, no
[10:07:07] I was looking at the wrong graph
[10:07:13] db2097 looks ok
[10:07:20] it is only that host in particular
[10:08:22] no, db2131 also has lots of binlogs (2k)
[10:08:47] for example, db1225 has s2, s3, s6 and x1 without binlogs and it is only using 4 TB
[10:08:55] yeah, without binlogs
[10:09:04] But db2131 has 2k binlogs, that's insane
[10:09:05] so it could be that
[10:09:18] and only for 30 days?
[10:09:59] yeah
[10:10:04] I don't want to have to reduce that for now
[10:10:09] I'd rather find out why this has started
[10:10:11] once the p*ge has been avoided, I would just create a ticket and look at it later
[10:10:17] because we are using TBs of binlogs...
[10:10:22] yep
[10:11:26] let me know if I can help, I didn't see anything immediately obvious
[10:12:13] Just created https://phabricator.wikimedia.org/T351871 for Amir1 XD
[10:12:59] x1 I believe uses ROW
[10:13:14] so it could make sense it is growing due to some new pattern
[10:13:26] yeah, that's why I included transaction on the ticket too
[10:13:37] imagine new deletes or updates with loooots of rows
[10:13:42] that'd make the binlogs grow a lot
[10:39:44] I'd like to switch dborch1001 to Puppet 7, any concerns/objections?
[10:39:58] no
[10:43:58] ok, I'll start in ~5m
[11:12:12] dborch1001 has been switched over to Puppet 7 and seems all fine, could I also quickly reboot it for extra validation?
[11:12:23] sure
[11:12:35] k, doing that now
[11:17:20] done, all good
[11:17:41] yep, looks good from my side too
[11:37:47] marostegui: Thanks. I'll take a look :D
[11:37:57] ta
[13:06:52] Does this make sense? https://gerrit.wikimedia.org/r/c/operations/puppet/+/977078/
[13:07:11] It is preventing me from reimaging dbproxy2004 :)
[13:10:31] Could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/977077 please? make envoy the default for both ms swift clusters (and thus tidy up all the host-specific hiera stuff)
[13:10:44] thank you jaime!
[13:11:50] thanks :)
[13:11:55] np
[13:12:25] marostegui: LMK when you're done with puppet-merge?
[13:12:30] done!
[13:12:50] actually not done, it looks stuck
[13:12:52] at the last step
[13:14:44] done now
[13:14:48] it takes ages
[13:14:49] ta
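For context on the binlog investigation above, here is a minimal sketch of the checks a DBA might run on one of the affected MariaDB replicas (e.g. db2096 or db2131). It assumes only standard MariaDB statements; the 14-day cutoff in the commented-out PURGE example is purely illustrative and not something done in this log.

    -- Confirm the binlog format and the retention window. With ROW format
    -- (as believed for x1), every row touched by a bulk UPDATE/DELETE is
    -- written to the binlog, so a new write pattern can add terabytes fast.
    SELECT @@global.binlog_format, @@global.expire_logs_days;

    -- List the binary logs kept on disk with their sizes; a sudden jump in
    -- file count or total size shows roughly when the new pattern started.
    SHOW BINARY LOGS;

    -- Last resort only (the chat prefers finding the root cause before
    -- reducing retention): purge binlogs older than an illustrative cutoff.
    -- PURGE BINARY LOGS BEFORE NOW() - INTERVAL 14 DAY;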