[07:49:29] Can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/740714 ?
[08:42:10] thanks jaime! <3
[08:55:42] marostegui: I want to start dropping useless grants (covered by other grants, no db being covered on that host, etc.). Objections? Notes?
[08:56:25] yeah, if those dbs don't exist, that should be fine, but do not do it with replication enabled
[09:10:58] I wonder how the bullseye migration of the db hosts is going - do you have any host already reimaged, and how did it go? (informal enquiry only, I am just curious about the amount of things broken)
[09:11:38] it is going well, some stuff with the packages but sorted for now
[09:11:42] db1125 is now reimaged
[09:11:49] and I am doing db1124 (master role) now
[09:11:57] to have both and do proper testing
[09:12:26] and the server is technically running etc, generating metrics?
[09:13:02] oh, no metrics
[09:13:19] no, no metrics yet
[09:14:02] there are host ones, though: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1125&var-datasource=thanos&var-cluster=mysql&from=1637648038937&to=1637658838937
[09:14:20] yes, it is the mysqld exporter
[09:14:59] when things are more mature I may have space for 1 test host on backup sources, but not just yet
[09:15:36] sure, no worries
[09:15:50] I am expecting to fix the exporter thing in the morning
[09:15:56] but yes, too early to test anywhere else
[09:16:10] Today I am freeing up a host in m5, so I am going to use that to place it on s1, and let it replicate
[09:17:18] no pressure, sorry
[09:17:27] I am just excited about what you are doing!
[09:17:32] 0:-)
[09:17:52] :)
[09:18:39] plus also keeping an eye on it to learn hopefully not too long after you, so I (backups) don't become a blocker
[09:35:00] I am about to start rebuilding db1139/s1 from the logical dump
[09:39:55] Amir1: you can proceed with either db1124 or db1125 if needed, anytime, they are test hosts
[09:40:07] if they are down, just ping me, as I am working with them for the bullseye thing
[10:11:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db1139:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[10:26:47] that must be me ^
[10:27:11] it started firing now that I have started mysql for the import but haven't set up monitoring yet
[10:27:45] I think I can fix that in parallel
[10:36:16] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db1139:13311) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
[10:41:17] FYI the extra 120K row writes/s is me with the reimport: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&var-site=eqiad&var-group=All&var-shard=All&var-role=All&from=1637662501159&to=1637664026672
[10:43:33] I think we should split mariadb::dbstore_multiinstance into mariadb::backup_source and mariadb::analytics_multiinstance roles (while keeping the same functionality)
[10:44:37] I will file a ticket and ask analytics and k*rmat for feedback when they are back
[11:02:09] https://www.irccloud.com/pastebin/jYO9axst/
[11:02:22] marostegui: now testwiki is on write both for rev_actor
[11:04:13] sweeeet
[11:12:52] running the script now, the lag on s3 seems fine
[11:14:49] finally sent my dumb swift-recon -d patch upstream so they can tell me Not To Do It Like That :)
[11:15:58] progress :P
[11:18:44] it's taken ~4 weeks to get to this point? [I think I wrote the patch on 20 October]
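A note on the grants cleanup from the 08:55/08:56 exchange above: one common way to keep such drops from replicating is to disable binary logging for the session before issuing the REVOKE. This is only a minimal sketch of that idea; the grantee, host pattern and database name below are invented, and it is not necessarily the exact procedure that was followed.

```bash
# Sketch only: user, host pattern and db name are hypothetical.
# First check what the suspect grant actually covers:
sudo mysql -e "SHOW GRANTS FOR 'legacy_app'@'10.64.%';"
# Then revoke it without writing the statement to the binlog,
# so the change stays local and does not replicate:
sudo mysql -e "SET SESSION sql_log_bin = 0;
               REVOKE ALL PRIVILEGES ON dropped_db.* FROM 'legacy_app'@'10.64.%';"
```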
[11:22:50] I was thinking oooh this script is fast, now I realized I missed a digit
[11:25:10] sigh ms-be2058 looks like it went down and nothing on console, I'll powercycle it
[11:31:32] I will start the pre work for m5 switchover in around 30 minutes or so, the switchover is at 14:00 UTC
[11:36:43] I go eat lunch before epooling db1121
[11:36:50] *repooling
[12:28:00] the mediabackups database is a toy compared with mediawiki, but I am starting to have large performance issues due to the mariadb query planner: https://phabricator.wikimedia.org/P17801
[12:28:24] "instead of using the index explicitly created for this query, I am going to do a full table scan"
[12:57:26] maybe a coincidence, but when recovery started going over revision table, throughput dropped considerably: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-job=All&var-server=db1139&var-port=13311&from=1637661428086&to=1637672228087
[13:01:11] jynus: sometimes an analyze fixes it by refreshing table stats
[13:01:31] but not always cause the optimizer could be simply silly :(
[13:01:32] I guess it could be that, I am inserting a lot of records
[13:01:49] back
[13:04:05] Re: import, I think it is just a coincidence, that was the point when the buffer pool got full and started scratching disk heavily
[13:12:04] oh the db backup of image has got halved (even less)
[13:12:04] https://www.irccloud.com/pastebin/HLISBFR3/
[13:12:14] jynus: fyi ^
[13:12:40] thank you!
[13:12:45] I saw it this morning with:
[13:12:48] Last dump for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2021-11-23 00:00:02 is 150 GB, but previous one was 164 GB, a change of 8.7%
[13:14:00] I hope next year with normalization of links tables it'll reduce to below 100GB (and possibly less)
[13:19:29] going for lunch now that backups are running and before switchover, will return for that, but soon after that I have onfire meeting
[13:59:49] back
[14:00:02] right on time for the fun!
[14:02:29] o/
[14:02:50] bd808: o/ we are on -operations
[14:09:29] le sigh re: ms-be2058 T296300
[14:09:29] T296300: ms-be2058 memory error - https://phabricator.wikimedia.org/T296300
[14:12:27] hardware, who'd have it?
[14:13:24] heheh
[14:14:03] * godog shakes fist
[14:18:20] I mean I like hw issues #notmyproblem :P
[14:18:34] I am not planning to push this until tomorrow, but if I can get a sanity check on the IPs/hostnames, that'd be nice https://gerrit.wikimedia.org/r/c/operations/puppet/+/740839/
[14:19:30] The reason not to push it today is to make sure db1132 is stable as a master
[14:20:17] marostegui: hostnames/IPs match OK
[14:20:25] thanks!
[15:30:14] godog: sorry, tried to rolling-restart Thanos swift frontends with restart-thanos-swift (which I understood to be the correct depool-restart-repool script in absence of a proper cookbook) and it barfed on thanos-fe1001. https://phabricator.wikimedia.org/P17802 What did I do wrong?
[15:32:38] Emperor: looks like the script defaults to the (wmf) service name being equal to the systemd unit that needs restarting
[15:33:27] so how should I be doing this?
[15:34:24] I'm looking into how to override the service name, that should be the right fix
[15:34:55] TY
[15:35:30] this would also work (which is what the script does but less brutally): depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool
[15:36:02] Ah, is that your usual approach? :)
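Going back to the query-planner exchange at 12:28–13:01 above: the two usual workarounds are the ANALYZE mentioned there (refreshing the statistics the optimizer plans with) and forcing the intended index by hand. A minimal sketch, using a hypothetical mediabackups table, column and index name; the real schema in P17801 may look quite different.

```bash
# Sketch only: table, column and index names are hypothetical.
# Refresh the table statistics the optimizer bases its plan on:
sudo mysql mediabackups -e "ANALYZE TABLE files;"
# If it still insists on a full table scan, pin the index and check the plan:
sudo mysql mediabackups -e "EXPLAIN SELECT id, wiki, upload_name
                            FROM files FORCE INDEX (files_sha1)
                            WHERE sha1 = '0123456789abcdef';"
```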
[15:37:17] yeah
[15:40:05] <_joe_> Emperor: i guess something wrong was written in profile::lvs::realserver::pools
[15:40:16] <_joe_> I can take a look in 10 minutes
[15:40:35] That'd be kind :)
[15:40:48] _joe_ knows :) I'll leave it up to him
[15:41:25] I'm due a typing break now anyhow
[15:49:46] <_joe_> Emperor: I have a couple questions for you
[15:50:21] <_joe_> so I see two pools defined for thanos frontends, thanos-swift and thanos-query
[15:50:41] <_joe_> I guess the service that responds to the thanos-swift pool is swift-proxy, correct?
[15:51:17] <_joe_> while thanos-query is served by all the thanos-* services I see there?
[15:56:03] <_joe_> Emperor: please check https://gerrit.wikimedia.org/r/c/operations/puppet/+/740859
[15:58:31] 👀
[15:58:44] <_joe_> err sorry I inverted the order
[15:58:50] <_joe_> amending
[16:00:09] <_joe_> so, if you want to restart swift-proxy, you can just run "restart-swift-proxy" once that patch is merged
[16:00:42] <_joe_> https://puppet-compiler.wmflabs.org/compiler1003/32586/ looks correct to me
[16:01:27] _joe_: thanks, that looks good to me, but I'd like godog to approve since they are still the Swift expert :)
[16:02:04] I'll check too
[16:02:41] LGTM, thanks _joe_
[16:10:16] <_joe_> Emperor: the script should be everywhere now
[16:10:26] <_joe_> err sorry spoke too early :D
[16:10:30] <_joe_> puppet still running
[16:10:48] <_joe_> the script is /usr/local/sbin/restart-swift-proxy
[16:11:29] <_joe_> and btw, you should be able to run it in parallel on all hosts, as it will anyways be limited to concurrency of 1
[16:11:33] <_joe_> by poolcounter
[16:12:10] <_joe_> the script is now everywhere, but I would suggest removing /usr/local/sbin/restart-thanos-proxy anyways
[16:17:16] _joe_: parallel - YM "sudo cumin O:thanos::frontend /usr/local/sbin/restart-swift-proxy" should be safe/sane?
[16:17:37] <_joe_> yeah but also -b 1 will have the same effect
[16:17:50] <_joe_> it depends on how fast the service is to come back up
[16:18:02] few seconds, I think
[16:18:38] <_joe_> but yeah anyways as usual I suggest -b1 -s5 to be super safe first
[16:23:29] Done, thanks.
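For reference, combining the cumin command quoted at 16:17:16 with the -b1 -s5 suggestion, the rolling restart would look roughly like this; it is assembled from the messages above, not a verified runbook.

```bash
# Restart one host at a time (-b 1) with a 5s pause between batches (-s 5),
# on top of the poolcounter concurrency limit the script already enforces.
sudo cumin -b 1 -s 5 'O:thanos::frontend' '/usr/local/sbin/restart-swift-proxy'
```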