[16:40:37] note there is actually redundancy on sanitariums, as I designed it years ago. eqiad and codfw are rarely delayed at the same time, and cloud dbs were supposed to switch between replicating from one datacenter or the other in order to keep replication up to date, as much as reasonably possible. Not sure why that is not used more often (maybe it just lacks the automation for the schema change + master switch?).
[16:41:52] but I agree generally that if a production-like SAL is needed, services should be migrated to production
[16:42:01] *SLA
[16:44:18] e.g. I advocated for a long time to generate reports on production once to avoid duplicate runs on cloud. E.g.: T59617
[16:44:19] T59617: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617
[17:14:53] it caught up now
[19:43:15] Thanks for the note Jaime, I'll keep it in mind and will investigate how to use that. The biggest complexity is the fact that cross-dc replication is by nature slow, but not impossible, as we replicate to another dc.
[20:37:54] wasn't the issue the ALTER on cloud dbs?
[20:38:24] if the replication is halted at labs side, I don't see how a different replicating master would help
[20:40:58] it can reduce the time that it lags behind. Currently every run makes the cloud lag behind three times the alter table time: first for it to go through the sanitarium master, then the sanitarium host, and then the third one on the db itself. This would eliminate the first two
[20:41:20] sure, commons would lag for seventeen hours but that's better than fifty?
[20:53:25] so there is a sanitarium master in eqiad and another in codfw?
[20:53:53] switching masters would help, indeed
[21:49:22] yup
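
Note on the lag arithmetic discussed at 20:40:58 and 20:41:20: a minimal sketch, assuming the same ALTER runs one after another on each hop of the replication chain and that each run takes about the same time. The function name `replication_lag_hours` and the 17-hour figure used as the ALTER duration are illustrative assumptions drawn from the "seventeen hours" remark, not measured values.

```python
def replication_lag_hours(alter_hours: float, serial_hops: int) -> float:
    """Estimate lag when one ALTER is applied serially on each hop.

    Hypothetical model: every hop blocks replication for the full
    duration of the ALTER, and the delays simply add up.
    """
    if alter_hours < 0 or serial_hops < 0:
        raise ValueError("alter_hours and serial_hops must be non-negative")
    return alter_hours * serial_hops


# Assumption: roughly 17 hours per ALTER on commons, per the chat.
ALTER_HOURS = 17.0

# Current path: sanitarium master, sanitarium host, then the cloud db.
current_lag = replication_lag_hours(ALTER_HOURS, serial_hops=3)

# After switching to the other datacenter's already-altered chain,
# only the ALTER on the cloud db itself remains.
switched_lag = replication_lag_hours(ALTER_HOURS, serial_hops=1)

print(f"current: ~{current_lag:.0f}h, after master switch: ~{switched_lag:.0f}h")
```

Under these assumptions the three-hop path lands near the "fifty" hours mentioned, while the switched path leaves only the single ALTER on the cloud db.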