[01:08:53] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 17.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:31] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[08:56:50] I am going to shut down orchestrator's database for a few minutes
[09:03:48] did yesterday's meeting have any interesting thoughts about T327253 BTW?
[09:03:49] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[09:41:09] orchestrator is available again
[10:44:05] "Last dump for matomo at eqiad (db1108) taken on 2023-01-24 03:26:49 is 270 MiB, but the previous one was 216 MiB, a change of +24.8 %" FYI btullis
[10:44:24] I will ack it
[10:45:05] jynus: Thank you. I'll check it out.
[10:52:05] marostegui: this is just a crazy random idea of mine, but do you think we could fight to have es hosts on a 10G network sometime in the future?
[10:52:30] jynus: I am up for that, but I believe the problem has always been having enough 10G ports on the switches
[10:52:40] I know
[10:53:23] but in theory the network is going to be redone soon
[10:54:29] but I am looking at the graphs and I am seeing that it takes more time to export a db from a local host than to send it to a different dc! https://grafana.wikimedia.org/goto/wvFX2BT4z?orgId=1
[10:54:55] haha
[10:55:07] Yeah, if they can give us 10G ports, I am fine
[10:55:20] We've been buying them with 10G ports for years, so we are fully ready
[10:55:33] jynus: we can definitely start with backup-related hosts
[10:55:44] my suggestion is to try to start with es hosts in general
[10:55:51] sure
[10:55:54] as they have 12T of content
[10:55:58] yeah
[11:02:17] some day I will get used to how slow our network is ;p
[11:03:09] jynus: I think all backups in eqiad mX have finished (per the dbbackups.backups table), but can you confirm? I would need to reboot db1117 (the backups source for mX)
[11:03:26] only es backups are running at the moment
[11:03:38] cool, doing it now then
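(Note: a minimal sketch, in Python, of the kind of pre-reboot check discussed above — querying the dbbackups.backups metadata table for backups still marked as running before rebooting a backup source such as db1117. The connection parameters, the 'ongoing' status value and the column names are assumptions for illustration and may not match the real schema.)

    # Sketch only: list backups still marked as running in the dbbackups metadata DB.
    # Host, credentials, status value and column names are assumed, not taken from the log.
    import pymysql

    def unfinished_backups(section_prefix="m"):
        conn = pymysql.connect(host="localhost", user="backup_check",
                               password="REDACTED", database="dbbackups")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, name, source, start_date FROM backups "
                    "WHERE status = 'ongoing' AND name LIKE %s",
                    (section_prefix + "%",),
                )
                return cur.fetchall()
        finally:
            conn.close()

    if __name__ == "__main__":
        rows = unfinished_backups("m")
        if rows:
            for row in rows:
                print("still running:", row)
        else:
            print("no mX backups marked as running; source host looks safe to reboot")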
[16:23:11] jynus: is Thursday around 07:00 UTC a good time for the m1 master switchover? (dbbackups lives there)
[16:23:17] expected RO time is around 30 seconds
[16:23:57] in a meeting; will have to check how to minimize running backups
[16:24:01] no worries
[16:24:04] take your time
[16:24:10] don't worry about dbbackups as much as about bacula
[16:24:18] good point :)
[16:24:23] just take your time and let me know tomorrow
[17:30:33] marostegui: for tomorrow, I think I can make it work by disabling some long-running jobs and running them early or afterwards (e.g. es bacula), but I may need some time after 7 AM just to be sure
[17:31:11] jynus: I was suggesting Thursday :)
[17:31:21] if that helps in any way
[17:31:22] 8 or later, to make sure things are idle for both dbbackups and bacula
[17:31:36] I can do it anytime on Thursday, really
[17:31:40] whatever is more convenient
[17:31:41] yes, Thursday, but later in the day
[17:31:45] sure
[17:31:49] that works
[17:32:11] let's chat on Thursday and I can do it pretty much ad hoc
[17:32:12] in theory 7 would be OK, but I would like a larger buffer just in case
[17:32:23] no problem at all
[17:35:21] es bacula backups take 17 hours and run on Thursdays, so I will disable them tomorrow: https://grafana.wikimedia.org/goto/blaQnYoVk?orgId=1 and start them after the switchover
[17:38:18] bacula finishes at 6:53 and dbbackups at 6:42, so I'd want some extra time just in case
[17:43:23] sure, we can do it at 9 or 10 UTC
[18:17:40] also for tomorrow, I wonder if the db_inventory topology could be added to orchestrator?
[18:55:52] jynus: sure, we can do it tomorrow
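(Note: on adding the db_inventory topology to orchestrator — a hedged Python sketch of registering one instance through orchestrator's HTTP discovery API, from which orchestrator walks the rest of the replication topology. The service URL and the primary's host/port are placeholders; in practice this would more likely be done with orchestrator-client or configuration management than an ad-hoc script.)

    # Sketch only: ask orchestrator to discover one instance of the db_inventory topology.
    # ORCH_URL and PRIMARY are placeholders, not real service endpoints or hosts.
    import requests

    ORCH_URL = "https://orchestrator.example.org"
    PRIMARY = ("db-inventory-primary.example", 3306)

    def discover(host, port):
        # GET /api/discover/:host/:port registers the instance; orchestrator then
        # follows replication links to map the rest of the topology.
        resp = requests.get(f"{ORCH_URL}/api/discover/{host}/{port}", timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(discover(*PRIMARY))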