[04:34:42] DBA, SRE, Datacenter-Switchover: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Marostegui) Treating it like parsercache would also be my first approach. I would be comfortable with doing RO on both DCs, then the switchover and then the RW on both DC...
[04:34:47] legoktm: just answered, essentially...I agree with timo
[04:41:43] FYI, I am going to start warming buffer pools
[04:47:13] thanks
[04:48:29] I think I'm going to remove x2 from CORE_SECTIONS in spicerack, we don't list the parsercache hosts there, right?
[04:54:54] legoktm: don't know, I haven't checked :)
[04:55:17] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/mysql_legacy.py#21
[04:58:11] legoktm: right, yeah, then let's remove it from there
[04:58:21] do we set parsercache in read only during the switch?
[04:58:22] or not even?
[04:59:03] I don't think we touch parsercache at all
[05:00:23] Ah ok
[05:00:46] I am fine either way, I would prefer to set x2 as read-only and then rw once the switch is done, but 1) it is not in use and has no data 2) don't know how hard it is
[05:01:08] legoktm: It is a matter of how important it is to have consistent data on x2, which is not something I can really tell, more a question for Timo
[05:03:35] actually I don't think it would be too hard
[05:04:05] Up to you and timo I would say
[05:04:42] we already set it to read only and then set it back to read-write in the primary DC, so we just need a line or two to set it to read-write in the other DC
[05:06:39] there's a secondary problem that when running in live test mode that we set the codfw x2 to read-only which triggered a page, so we'd need to exempt it. but that's also doable too
[05:12:49] I thought we'd downtime things before the test
[05:12:56] And even before the dc switch itself
[05:13:04] I don't recall correctly if we did that last time
[05:15:31] in live test mode, we basically run the switchover as if it were going from codfw -> eqiad, and it tries to avoid doing problematic things that would affect eqiad. So we mostly need to teach it that if --live-test is running, to skip marking x2 in codfw as read only
[05:16:18] because with the exception of x2, everything in codfw is already read-only, so running the read-only step tests the procedure while being a no-op
[05:16:23] right yeah
[05:17:12] anyways, I'm nearly done writing the other patch, so whichever direction is picked the patch will be ready
[05:17:22] legoktm: excellent, thanks
[06:29:00] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Summary from `#wikimedia-databases`: `lang=irc 22:00:46 I am fine either way, I would prefer to set x2 as read-only and then...
[07:55:29] marostegui: oh, one more thing. it was mentioned that some things might have changed around the heartbeat setup, is this query still up to date / correct? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/mysql_legacy.py#89
[07:56:14] legoktm: The only thing that changed as far as I know was the way we start/stop heartbeat, but nothing related to how we measure lag (the query itself)
[07:56:22] kormat: can you confirm? ^
[07:56:30] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (jcrespo) Legoktm asked me to copy comments I had given on the patches here- I think Manuel had already spoken my mind already- it is for the applica...
[07:57:13] that's the thing I would ask her to double check (what I mentioned about looping in performance/other people)
[07:57:34] I don't know what you are referring to
[07:57:55] remember on a meeting I said to loop in other people on heartbeat changes?
[07:58:07] even if theoretically no change was made
[07:58:30] We've not changed anything regarding the inner working of pt-heartbeat, just how we stop/start it
[08:00:08] jynus: hey. your input is not required here.
[08:00:25] ok
[08:02:16] nothing has changed re: how heartbeat works or how we measure lag. the only thing i am wondering is how the existing code handles circular replication (c.f. x2). i'll take a look in a few.
[08:02:51] thanks
[08:03:03] and x2 is being discussed at https://phabricator.wikimedia.org/T285519
[08:03:08] 👍
[08:04:45] legoktm: To sum up my thoughts which are already there: I would treat x2 as pcX, if data is important we should go RO - switchover - RW, if data isn't important or we don't care like we do with pc, I guess it is fine to leave it RW entirely (like we do with pcX), but this is for Timo to decide on the data part. (Note: there is no data at the moment, but we should treat it as if it had data)
[08:05:33] If we go RO - switch - RW, there'll be a page (like last night's) about read_only on both hosts, so that needs to be handled too
[09:44:45] DBA, Orchestrator: orchestrator: Upgrade to v3.2.5 - https://phabricator.wikimedia.org/T275784 (Kormat)
[09:46:44] DBA, Patch-For-Review: Investigate pt-heartbeat-wikimedia failure modes - https://phabricator.wikimedia.org/T285079 (Kormat) We now have decent monitoring for pt-heartbeat - if it's not in the expected status (running on 'masters', stopped everywhere else) then we'll get an alert after 2 minutes. The on...
[10:00:40] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Kormat) >>! In T285519#7176764, @Legoktm wrote: > * [[https://gerrit.wikimedia.org/r/701471|spicerack: Revert "mysql_legacy.py: Add x2"]]: this basi...
[10:02:08] legoktm: marostegui: i've weighed in on the x2 stuff on the task.
[10:02:58] i still need to figure out exactly what mysql_legacy is doing about lag detection. it's less straight-forward than i had hoped
[10:03:06] if there's anything else that's pending input from me, please ping me.
[13:48:23] PROBLEM - MariaDB sustained replica lag on db2133 is CRITICAL: 217.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[13:49:58] marostegui: we're having m2 lag again
[13:50:19] RECOVERY - MariaDB sustained replica lag on db2133 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2133&var-port=9104
[13:51:22] kormat: what's running there?
[13:51:56] https://wikitech.wikimedia.org/wiki/MariaDB/misc#m2
[13:52:09] no, I mean what process was running
[13:52:15] no idea
[13:52:16] if it is over I can check binlogs later
[13:53:23] it's mostly over now. db2078 still needs to catch up
[13:53:53] tendril didn't show any slow queries in the right time period
[13:55:05] Yeah, it is probably the master with writes
[13:55:09] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1107&var-port=9104
[13:55:12] Let me check binlogs
[13:56:07] Might be the mwlinkadd table
[13:56:11] But I need to confirm
[13:56:16] s/table/database
[14:05:30] Might be otrs, around the time of the lag I see lots of inserts for communication_log_object_entry table
[14:07:48] root@db1107:/srv/sqldata# mysqlbinlog --start-datetime="2021-06-25 13:40:00" --stop-datetime="2021-06-25 13:53:00" db1107-bin.000477 -vvv | grep communication_log_object_entry | wc -l
[14:07:48] 1582
[14:07:48] root@db1107:/srv/sqldata# mysqlbinlog --start-datetime="2021-06-25 13:20:00" --stop-datetime="2021-06-25 13:33:00" db1107-bin.000477 -vvv | grep communication_log_object_entry | wc -l
[14:07:48] 0
[14:07:54] So yeah, looks like a storm of those
[16:50:56] I +2'd the "Revert "mysql_legacy.py: Add x2"" patch, but I'll leave the task open in case Krinkle wants to add anything
[18:38:00] Data-Persistence-Backup, SRE, SRE-swift-storage, Epic, Goal: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (Jclark-ctr)
[21:59:56] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Ack, thanks for all the input. For next week we'll just ignore x2, it'll stay RW in both DCs throughout. @krinkle does that also work as th...
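
Editor's sketch of the "RO - switchover - RW" option discussed in the log for x2 (the alternative to leaving it RW throughout, which is what was finally chosen for the next switchover). It only builds an ordered list of steps and touches no hosts. Nothing here is spicerack's actual API: `plan_switchover`, the step names, and the `both_dcs` parameter are hypothetical, and skipping x2 entirely under live test (rather than only its read-only step) is an assumption made to keep the example small.

```python
def plan_switchover(sections, dc_from, dc_to, live_test=False,
                    both_dcs=frozenset({"x2"})):
    """Return ordered (action, a, b) steps for an RO -> switch -> RW plan.

    Hypothetical helper, not part of spicerack. `both_dcs` names the
    sections that are writable in both datacenters (x2 in the log).
    """
    steps = []

    # Step 1: read-only. Sections writable in both DCs go RO in both.
    for section in sections:
        if live_test and section in both_dcs:
            # In live test mode the RO step is meant to be a no-op on
            # the passive DC; x2 is RW there, so it is skipped to avoid
            # a real state change (and the page it would trigger).
            continue
        steps.append(("read_only", dc_from, section))
        if section in both_dcs:
            steps.append(("read_only", dc_to, section))

    # Step 2: the switch itself.
    steps.append(("switch", dc_from, dc_to))

    # Step 3: read-write. The new primary always; both DCs for x2.
    for section in sections:
        if live_test and section in both_dcs:
            continue  # assumption: left untouched in live test mode
        steps.append(("read_write", dc_to, section))
        if section in both_dcs:
            steps.append(("read_write", dc_from, section))

    return steps


if __name__ == "__main__":
    for step in plan_switchover(["s1", "x2"], "eqiad", "codfw"):
        print(step)
```

Calling `plan_switchover(["s1", "x2"], "codfw", "eqiad", live_test=True)` yields a plan with no x2 steps at all, which matches the exemption described at 05:15:31.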