[05:20:22] Amir1: I don't remember if you said I can do the s2 codfw switchover because you're finished with your stuff there?
[05:26:40] will start https://phabricator.wikimedia.org/T366259
[05:52:05] I'm a bit ahead of schedule, waiting for 6 UTC to disable puppet and move to the next steps
[05:52:14] ok :)
[05:52:52] arnaudb: wait
[05:53:01] arnaudb: the topology isn't moved entirely
[05:53:03] did you see that?
[05:53:11] oh yes I'm on orchestrator
[05:53:18] my message was ahead of schedule as well x)
[05:53:24] ok
[05:53:30] waiting for the cookbook to properly finish up
[05:53:46] ah ok
[05:54:06] I was not clear, sorry about that haha
[05:54:22] (now it's over)
[05:54:38] yep
[05:54:57] bbiab
[06:30:34] arnaudb: once you are done with the old master, leave it depooled (but start replication please)
[06:30:44] I have a bunch of stuff to run there - I think you do too?
[06:32:36] I do indeed
[06:32:46] ok, I will run them once you've done yours
[06:32:47] not a bunch, but a quick schema change
[06:33:11] cool, let me know when you are done with it
[06:33:17] so I can run mine
[06:33:38] will do
[06:35:03] thanks
[07:09:00] marostegui: it's all yours
[07:09:20] arnaudb: great, thank you!
[07:22:35] Morning folks. I'm still looking for a +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038391 please :)
[07:54:15] thanks arnaud.b
[08:14:31] Last snapshot for s4 at eqiad (db1150) taken on 2024-06-03 07:46:21 is 1592 GiB, but the previous one was 1693 GiB, a change of -6.0 %
[08:31:08] Probably because of https://phabricator.wikimedia.org/T364069
[09:03:29] marostegui: I am done with s2 codfw
[09:04:28] Yeah. That rebuild dropped 100GB from templatelinks. I will explain why
[09:20:02] To explain more: the community cleaned up a lot of rows from templatelinks on commons by refactoring templates: https://phabricator.wikimedia.org/T343131#9467054 (and onwards)
[09:41:27] marostegui: when you're done, please let me have some fun with the old s1 masters (codfw is not switched over yet, I know, but once you're done)
[09:43:07] Amir1: will do, it probably won't be finished today
[09:43:12] s1 codfw I am going to switch now
[09:43:43] thanks
[11:17:37] Amir1: marostegui: I just spotted the lag on clouddb1021 s4/s8, which I see you marked with downtime - it seems to be dropping very slowly. Is there anything I should know about it, or have you got it in hand?
[11:18:34] I was about to start the bookworm upgrade and then move onto the transfer from clouddb1021 to an-redacteddb1001, but I'll hold off for now.
[11:21:25] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:10] le sigh
[11:31:25] FIRING: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:00] I literally just fixed that
[11:34:30] btullis: meeting, but that's expected. Give it a bit
[11:35:00] Amir1: Ack, thanks. Happy meeting.
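The swift_dispersion_stats alert above was expected to clear on its own ("give it a bit"), but for reference, a minimal sketch of how a SystemdUnitFailed alert like this is usually inspected on the affected host (thanos-fe1001 here). These are generic systemd/journalctl commands, not the fix that was actually applied in this case:

    # Check why the unit failed and review its recent logs
    systemctl status swift_dispersion_stats.service
    journalctl -u swift_dispersion_stats.service --since "2 hours ago"
    # Once the underlying issue is resolved, clear the failed state so the alert can resolve
    sudo systemctl reset-failed swift_dispersion_stats.service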
[11:36:25] RESOLVED: [2x] SystemdUnitFailed: swift_dispersion_stats.service on thanos-fe1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:57] btullis: I was in a meeting but I see Amir answered
[12:00:14] Yup, thanks. I'll just monitor it and not do anything on it right now. Thanks.
[13:28:39] marostegui: okay if I bump the weight to 200 for db1194 (the s7 host we discussed)? Maybe in batches of 50
[13:28:50] yeah, do it in batches
[13:52:55] I'm fixing the big bugs again... https://github.com/ceph/ceph/pull/57868
[13:55:42] making the world a better place, one misplaced character at a time
[13:56:08] 🦾
[14:38:08] do we have clone cookbook documentation?
[14:39:04] (like a wikitech page)
[14:41:53] Amir1: https://wikitech.wikimedia.org/wiki/MariaDB/Clone_a_host in the cloning section
[14:42:01] Clone via the sre.mysql.clone cookbook in a tmux:
[14:42:02] sudo cookbook sre.mysql.clone --source $source_server --target $destination_server --primary $cluster_replication_source
[14:42:49] awesome. Thanks.
[14:43:42] I think it needs its own page, since cloning can be done for multiple reasons, not just provisioning. It can get really confusing
[14:44:43] Yeah, that mixes in a bit of the provisioning part (like the zarcillo database addition)
[14:44:55] But that is the only part that maybe requires a note
[14:45:00] I will add it
[14:45:04] Thanks!
[14:47:12] something that bothers me about the clone cookbook (and it's sorta my fault) is that you have to put the FQDN, otherwise it gives a cryptic error message
[14:47:28] I'd add an example in the clone section so people would know
[14:47:28] yeah
[14:47:38] ok, I will do it now too
[14:48:27] thanks <3
[14:53:58] Amir1: db1184, the old s1 master, is ready for you. Can you take care of repooling it once you are done?
[14:54:05] marostegui: sorry if I'm misunderstanding this: https://phabricator.wikimedia.org/T366552#9860018 but if you're done with the old s1 codfw master, I have some stuff to discuss with that host
[14:54:15] oh sure thing
[14:54:23] Amir1: can you take db1184 first?
[14:54:33] definitely
[14:54:36] let's do that
[14:54:58] db2203 is being repooled; once you are done with db1184, repool that one, and then you can proceed with db2203, if that works?
[14:55:08] sounds good and fun
[14:55:17] enjoy db1184 then
[14:55:22] I think each one will take a day or more
[14:55:42] I have pagelinks and externallinks schema changes
[14:56:23] the pagelinks one in s4 codfw took 2 weeks XD
[14:57:08] for s8 it'd take longer, that's why I'm doing codfw and eqiad in parallel
[14:58:07] db1171:3318 is on its 13th hour and counting 😭
[14:58:17] sorry, 14
[14:58:19] I will switch s2 codfw tomorrow
[14:58:29] awesome
[15:48:55] heads up, I'm depooling, stopping and rebooting clouddb1013 (T366555)
[15:58:01] clouddb1021.s8 lag is still going up, not down. Still expected? Is it running an alter table or something?
[16:12:47] clouddb1013 rebooted and repooled
[17:01:36] btullis: yes. Alter table
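A concrete version of the sre.mysql.clone invocation quoted at 14:42, reflecting the point made at 14:47 that the hosts must be given as FQDNs. The hostnames below are placeholders, not real provisioning targets:

    # Run in a tmux session, as the wikitech page suggests; pass all three hosts as FQDNs
    sudo cookbook sre.mysql.clone \
        --source db1234.eqiad.wmnet \
        --target db1235.eqiad.wmnet \
        --primary db1236.eqiad.wmnet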
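On the batched weight bump for db1194 discussed at 13:28: a sketch of the generic dbctl flow, assuming the weight is changed through the instance's config and committed per batch (the exact subcommand for weight changes may differ; check dbctl's help first):

    sudo dbctl instance db1194 get       # inspect the current s7 weight
    sudo dbctl instance db1194 edit      # raise the weight by 50 in the editor
    sudo dbctl config diff               # review the pending change
    sudo dbctl config commit -m "Bump db1194 weight in s7, first batch"
    # let traffic and replication lag settle, then repeat until the weight reaches 200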
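On the clouddb1021.s8 lag question at 15:58 (answered at 17:01 as an ALTER TABLE): a sketch of how that is typically confirmed on a multi-instance replica, assuming the s8 instance listens on a per-section socket (the socket path below is an assumption):

    # Look for the replicated ALTER in the process list
    sudo mysql -S /run/mysqld/mysqld.s8.sock -e "SHOW PROCESSLIST" | grep -i alter
    # Check lag and what the replication SQL thread is doing
    sudo mysql -S /run/mysqld/mysqld.s8.sock -e "SHOW SLAVE STATUS\G" \
        | grep -E "Seconds_Behind_Master|Slave_SQL_Running_State"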