[07:49:29] Can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/740714 ?
[08:42:10] thanks jaime! <3
[08:55:42] marostegui: I want to start dropping useless grants (covered by other grants, no db being covered on that host, etc.). Objections? Notes?
[08:56:25] yeah, if those dbs don't exist, that should be fine, but do not do it with replication enabled
[09:10:58] I wonder how the bullseye migration of the db hosts is going - do you have any host already reimaged, and how did it go? (informal enquiry only, I am just curious about the amount of things broken)
[09:11:38] it is going well, some stuff with the packages but sorted for now
[09:11:42] db1125 is now reimaged
[09:11:49] and I am doing db1124 (master role) now
[09:11:57] to have both and do proper testing
[09:12:26] and the server is technically running etc, generating metrics?
[09:13:02] oh, no metrics
[09:13:19] no, no metrics yet
[09:14:02] there are host ones, though: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1125&var-datasource=thanos&var-cluster=mysql&from=1637648038937&to=1637658838937
[09:14:20] yes, it is the mysqld exporter
[09:14:59] when things are more mature I may have space for 1 test host on backup sources, but not just yet
[09:15:36] sure, no worries
[09:15:50] I am expecting to fix the exporter thing in the morning
[09:15:56] but yes, too early to test anywhere else
[09:16:10] Today I am freeing up a host in m5, so I am going to use that to place it on s1, and let it replicate
[09:17:18] no pressure, sorry
[09:17:27] I am just excited about what you are doing!
[09:17:32] 0:-)
[09:17:52] :)
[09:18:39] plus also keeping an eye on it to learn hopefully not too long after you, so I (backups) don't become a blocker
[09:35:00] I am about to start rebuilding db1139/s1 from the logical dump
[09:39:55] Amir1: you can proceed with either db1124 or db1125 if needed, anytime, they are test hosts
[09:40:07] if they are down, just ping me, as I am working with them for the bullseye thing
[10:11:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db1139:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[10:26:47] that must be me ^
[10:27:11] it started firing now that I have started mysql for the import but haven't set up monitoring yet
[10:27:45] I think I can fix that in parallel
[10:36:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db1139:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[10:41:17] FYI the extra 120K row writes/s is me with the reimport: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&var-site=eqiad&var-group=All&var-shard=All&var-role=All&from=1637662501159&to=1637664026672
[10:43:33] I think we should split mariadb::dbstore_multiinstance into mariadb::backup_source and mariadb::analytics_multiinstance roles (while keeping the same functionality)
[10:44:37] I will file a ticket and ask analytics and k*rmat for feedback when they are back
[11:02:09] https://www.irccloud.com/pastebin/jYO9axst/
[11:02:22] marostegui: now testwiki is on write both for rev_actor
[11:04:13] sweeeet
[11:12:52] running the script now, the lag on s3 seems fine
[11:14:49] finally sent my dumb swift-recon -d patch upstream so they can tell me Not To Do It Like That :)
[11:15:58] progress :P
[11:18:44] it's taken ~4 weeks to get to this point? [I think I wrote the patch on 20 October]
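A note on the grants cleanup from the 08:55/08:56 exchange above: one common way to keep such drops from replicating is to disable binary logging for the session before issuing the REVOKE. This is only a minimal sketch of that idea; the grantee, host pattern and database name below are invented, and it is not necessarily the exact procedure that was followed.

```bash
# Sketch only: user, host pattern and db name are hypothetical.
# First check what the suspect grant actually covers:
sudo mysql -e "SHOW GRANTS FOR 'legacy_app'@'10.64.%';"
# Then revoke it without writing the statement to the binlog,
# so the change stays local and does not replicate:
sudo mysql -e "SET SESSION sql_log_bin = 0;
               REVOKE ALL PRIVILEGES ON dropped_db.* FROM 'legacy_app'@'10.64.%';"
```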
[11:22:50] I was thinking oooh this script is fast, now I realized I missed a digit
[11:25:10] sigh ms-be2058 looks like it went down and nothing on console, I'll powercycle it
[11:31:32] I will start the pre work for m5 switchover in around 30 minutes or so, the switchover is at 14:00 UTC
[11:36:43] I go eat lunch before epooling db1121
[11:36:50] *repooling
[12:28:00] the mediabackups database is a toy compared with mediawiki, but I am starting to have large performance issues due to the mariadb query planner: https://phabricator.wikimedia.org/P17801
[12:28:24] "instead of using the index explicitly created for this query, I am going to do a full table scan"
[12:57:26] maybe a coincidence, but when recovery started going over revision table, throughput dropped considerably: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-job=All&var-server=db1139&var-port=13311&from=1637661428086&to=1637672228087
[13:01:11] jynus: sometimes an analyze fixes it by refreshing table stats
[13:01:31] but not always cause the optimizer could be simply silly :(
[13:01:32] I guess it could be that, I am inserting a lot of records
[13:01:49] back
[13:04:05] Re: import, I think it is just a coincidence, that was the point when the buffer pool got full and started scratching disk heavily
[13:12:04] oh the db backup of image has got halved (even less)
[13:12:04] https://www.irccloud.com/pastebin/HLISBFR3/
[13:12:14] jynus: fyi ^
[13:12:40] thank you!
[13:12:45] I saw it this morning with:
[13:12:48] Last dump for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2021-11-23 00:00:02 is 150 GB, but previous one was 164 GB, a change of 8.7%
[13:14:00] I hope next year with normalization of links tables it'll reduce to below 100GB (and possibly less)
[13:19:29] going for lunch now that backups are running and before switchover, will return for that, but soon after that I have onfire meeting
[13:59:49] back
[14:00:02] right on time for the fun!
[14:02:29] o/
[14:02:50] bd808: o/ we are on -operations
[14:09:29] le sigh re: ms-be2058 T296300
[14:09:29] T296300: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300
[14:12:27] hardware, who'd have it?
[14:13:24] heheh
[14:14:03] * godog shakes fist
[14:18:20] I mean I like hw issues #notmyproblem :P
[14:18:34] I am not planning to push this until tomorrow, but if I can get a sanity check on the IPs/hostnames, that'd be nice https://gerrit.wikimedia.org/r/c/operations/puppet/+/740839/
[14:19:30] The reason not to push it today is to make sure db1132 is stable as a master
[14:20:17] marostegui: hostnames/IPs match OK
[14:20:25] thanks!
[15:30:14] godog: sorry, tried to rolling-restart Thanos swift frontends with restart-thanos-swift (which I understood to be the correct depool-restart-repool script in absence of a proper cookbook) and it barfed on thanos-fe1001. https://phabricator.wikimedia.org/P17802 What did I do wrong?
[15:32:38] Emperor: looks like the script defaults to the (wmf) service name being equal to the systemd unit that needs restarting
[15:33:27] so how should I be doing this?
[15:34:24] I'm looking into how to override the service name, that should be the right fix
[15:34:55] TY
[15:35:30] this would also work (which is what the script does but less brutally): depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool
[15:36:02] Ah, is that your usual approach? :)
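Going back to the query-planner exchange at 12:28–13:01 above: the two usual workarounds are the ANALYZE mentioned there (refreshing the statistics the optimizer plans with) and forcing the intended index by hand. A minimal sketch, using a hypothetical mediabackups table, column and index name; the real schema in P17801 may look quite different.

```bash
# Sketch only: table, column and index names are hypothetical.
# Refresh the table statistics the optimizer bases its plan on:
sudo mysql mediabackups -e "ANALYZE TABLE files;"
# If it still insists on a full table scan, pin the index and check the plan:
sudo mysql mediabackups -e "EXPLAIN SELECT id, wiki, upload_name
                            FROM files FORCE INDEX (files_sha1)
                            WHERE sha1 = '0123456789abcdef';"
```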
[15:37:17] yeah
[15:40:05] <_joe_> Emperor: i guess something wrong was written in profile::lvs::realserver::pools
[15:40:16] <_joe_> I can take a look in 10 minutes
[15:40:35] That'd be kind :)
[15:40:48] _joe_ knows :) I'll leave it up to him
[15:41:25] I'm due a typing break now anyhow
[15:49:46] <_joe_> Emperor: I have a couple questions for you
[15:50:21] <_joe_> so I see two pools defined for thanos frontends, thanos-swift and thanos-query
[15:50:41] <_joe_> I guess the service that responds to the thanos-swift pool is swift-proxy, correct?
[15:51:17] <_joe_> while thanos-query is served by all the thanos-* services I see there?
[15:56:03] <_joe_> Emperor: please check https://gerrit.wikimedia.org/r/c/operations/puppet/+/740859
[15:58:31] 👀
[15:58:44] <_joe_> err sorry I inverted the order
[15:58:50] <_joe_> amending
[16:00:09] <_joe_> so, if you want to restart swift-proxy, you can just run "restart-swift-proxy" once that patch is merged
[16:00:42] <_joe_> https://puppet-compiler.wmflabs.org/compiler1003/32586/ looks correct to me
[16:01:27] _joe_: thanks, that looks good to me, but I'd like godog to approve since they are still the Swift expert :)
[16:02:04] I'll check too
[16:02:41] LGTM, thanks _joe_
[16:10:16] <_joe_> Emperor: the script should be everywhere now
[16:10:26] <_joe_> err sorry spoke too early :D
[16:10:30] <_joe_> puppet still running
[16:10:48] <_joe_> the script is /usr/local/sbin/restart-swift-proxy
[16:11:29] <_joe_> and btw, you should be able to run it in parallel on all hosts, as it will anyways be limited to concurrency of 1
[16:11:33] <_joe_> by poolcounter
[16:12:10] <_joe_> the script is now everywhere, but I would suggest removing /usr/local/sbin/restart-thanos-proxy anyways
[16:17:16] _joe_: parallel - YM "sudo cumin O:thanos::frontend /usr/local/sbin/restart-swift-proxy" should be safe/sane?
[16:17:37] <_joe_> yeah but also -b 1 will have the same effect
[16:17:50] <_joe_> it depends on how fast the service is to come back up
[16:18:02] few seconds, I think
[16:18:38] <_joe_> but yeah anyways as usual I suggest -b1 -s5 to be super safe first
[16:23:29] Done, thanks.
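For reference, combining the cumin command quoted at 16:17:16 with the -b1 -s5 suggestion, the rolling restart would look roughly like this; it is assembled from the messages above, not a verified runbook.

```bash
# Restart one host at a time (-b 1) with a 5s pause between batches (-s 5),
# on top of the poolcounter concurrency limit the script already enforces.
sudo cumin -b 1 -s 5 'O:thanos::frontend' '/usr/local/sbin/restart-swift-proxy'
```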