[08:01:15] I am going to temporarily set db2185 (db_inventory) to read-write to test p*aging, ok? don't worry
[08:46:41] ah thanks!
[09:00:27] Amir1: can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188769 ?
[09:01:01] sorry, just going through my emails. Yeah let's go for it
[09:04:23] thanks, deployed
[10:11:09] Amir1: looks good? https://wikitech.wikimedia.org/wiki/MariaDB/Provisioning/es_hosts
[11:04:58] still trying to figure out why s1 in codfw has two fewer replicas than in eqiad
[11:05:21] plus according to dbctl there are two candidate masters for s1 in eqiad, which is not fun
[11:07:37] speaking of which, a good while ago Manuel was saying that by moving towards statement-based replication we would have all replicas being suitable as candidate master
[11:08:08] (and with standardizing the weights it's also helpful)
[11:10:01] that can be helpful when the master is down and not coming back online, in which case you can pick the one that has the most recent commit, since the designated candidate might be missing some transactions. But for normal times, I need a designated candidate, especially since automations like switchmaster rely on it
[11:13:05] Amir1: can I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1189131 ?
[11:16:04] go for it
[13:02:55] FIRING: SystemdUnitFailed: mariadb.service on es2050:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:32:47] Amir1: can I depool es2027 now?
[15:56:42] hi DP, just wanted to confirm that it's okay to proceed with the switchover live-test? (apologies, I could have been more vocal about the re-test!)
[16:00:13] just to provide some context on what this means: the live test _will_ make production changes, but only ones that are either noops or monitoring only.
examples include:
[16:00:13] * adding and removing downtimes for read-only state
[16:00:13] * setting db primaries to read-only in the current read-only DC (codfw) (a noop)
[16:00:13] * running puppet on db primaries
[16:48:23] hi again, folks - so, it looks like there are some operations ongoing involving es2027, which is the codfw es3 primary (T402859), and thus would be touched by the live test as described above.
[16:48:23] we're having a hard time telling if any of that might conflict with or otherwise complicate the ongoing work. guidance would be appreciated :)
[16:48:23] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859
[17:07:25] after a bit more searching, it looks like es3 is *not* among the sections that would be operated on by the cookbooks, so this is probably not a problem in practice.
[17:07:25] even so, if a DBA could confirm that all of the above operations are good to proceed at any time, that would be greatly appreciated.
[17:22:05] regarding T402859: the ongoing work is not going to affect the switchover afaict, as it currently involves only 2 hosts in codfw, both depooled
[17:22:06] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859
[17:23:31] (yet on the Data Persistence side we planned to stop any activity tomorrow
[17:26:06] ...at 11:00 UTC)
[17:33:49] sounds good, thanks f.edrico,!
[17:34:07] just to confirm, what time does the scheduled test start at?
[17:56:26] on our side we stop DB maint tomorrow at 11:00 UTC to then set up circular replication between 13:00 and 14:00 UTC
[18:09:18] ah, that's good to know! that means we should expect this check [0] to succeed during tomorrow's live test.
[18:09:18] [0] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/03-set-db-readonly.py#30