[08:36:06] marostegui: i'm going to switch pc2011 to be pc1 primary. choosing pc1 because if anything somehow explodes, we have more spares to handle it
[08:36:15] \o/
[08:38:42] I reran doc1001 and phabricator1001 full backups, they had failed
[08:39:00] doc1001?
[08:39:17] where docs.wikimedia.org lives
[08:40:22] ah ok ok
[08:46:23] marostegui: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/712115 - first step
[08:46:32] checking
[08:59:15] B5 et al are rack locations?
[09:00:41] yeah
[09:00:54] Emperor, https://netbox.wikimedia.org/dcim/racks/?q=B2
[09:08:31] thought so, but seemed worth checking :)
[09:11:15] OOI, why does gerrit say failed for one of the tests? If I click through to jenkins, it looks to have succeeded...
[09:11:37] Emperor: if it's the Diff one, ignore it
[09:11:56] Yeah it is
[09:16:32] marostegui: 2 more for you: https://gerrit.wikimedia.org/r/712119 and https://gerrit.wikimedia.org/r/712120
[09:16:40] checking!
[09:17:22] RhinosF1: that makes me a bit sad, but OK :)
[09:17:39] 😭 🐼
[09:18:06] Emperor: I know the job works for changing dblists. There are tasks somewhere for the stuff.
[09:18:23] But that patch was only a comment change so I doubt it would ever see a change
[09:18:46] kormat: I am here for the spare -> spare(s) CR :)
[09:19:06] :D
[09:20:25] kormat: is the plan to decommission pc2007 at some point, then?
[09:20:40] Emperor: there is, i'm told
[09:20:57] Emperor: https://phabricator.wikimedia.org/T223602
[09:29:23] marostegui: /o\ i've just realised how painful it is to change primary when we have circular replication :(
[09:29:55] yeah, it is a bit painful and dangerous
[09:30:01] marostegui: i'm going to downtime all of pc1 for 1h while i try to get this right.
having you looking over my shoulder would be appreciated
[09:30:01] At least we don't care about parsercache hosts
[09:30:12] kormat: sure, just let me know how I can help
[09:31:10] let me write up on the task what i'm going to do, then you can tell me how it's going to go wrong
[09:31:18] haha
[09:31:22] sure
[09:36:33] marostegui: i'm trying to play it somewhat safe: https://phabricator.wikimedia.org/T284825#7278477
[09:36:54] in theory it might be possible to use db-move-replica for everything except pc2007, but i'm scared of doing bad things
[09:37:31] let me read
[09:40:22] kormat: why not reset pc2010 too?
[09:40:44] pc2010 is pc1 too, no?
[09:40:45] marostegui: because i figure using db-move-replica to move it is safe, and easier than manually resetting
[09:41:00] it is, yes. https://orchestrator.wikimedia.org/web/cluster/alias/pc1
[09:42:14] so that procedure would work, but it would lose data on pc1 (unless you correlate the stop slave positions with the new ones in both binlogs - or set everything to read-only for the moves. But as it is parsercache we don't really care)
[09:42:35] oh yeah, definitely not something i'd do for a section we care about
[09:43:27] shouldn't pc2011 replicate from pc1011?
[09:43:41] no?
[09:43:49] pc1007 is the current pc1/eqiad primary. i haven't touched it
[09:44:22] ah ok, you are only changing the codfw one
[09:44:37] maybe it is easier to change pc1 entirely in codfw and eqiad to make things simpler?
[09:44:38] you reviewed the CR! :P
[09:44:51] but I forgot as soon as I did +1
[09:44:54] :D
[09:45:13] why are you complicating things and then calling it 'easier'?
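[Editor's note: the "classic" file/position move discussed below is roughly the following in MariaDB. This is a minimal sketch, not the actual commands run during this switchover; the hostnames come from the log, and the binlog file/position values and the `.codfw.wmnet` hostname suffix are illustrative assumptions.]

```sql
-- On the new primary (pc2011), once it has caught up,
-- record its binlog coordinates:
SHOW MASTER STATUS;
-- e.g. File: pc2011-bin.000042, Position: 1234  (illustrative values)

-- On each replica being moved (e.g. pc2010, pc2014):
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'pc2011.codfw.wmnet',     -- hostname pattern assumed
  MASTER_LOG_FILE = 'pc2011-bin.000042',  -- illustrative
  MASTER_LOG_POS = 1234,                  -- illustrative
  MASTER_USE_GTID = no;                   -- the "classic" (file/position) mode
START SLAVE;
```

As noted in the log, doing this without making the old primary read-only (or correlating positions in both binlogs) can lose writes; that is accepted here only because parsercache data is disposable.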
[09:45:32] * kormat feels she is getting pranked
[09:45:50] cause you'd have pc1011 <> pc2011 and the topology would be the same (hostname-wise) between eqiad and codfw
[09:46:13] But the procedure you posted above is correct, +1 to go with it
[09:46:15] it's a shame we don't have any tools that we can use to observe the topology
[09:46:26] :p
[09:47:05] and definitely not 3 different ones. that'd be mind-blowing
[09:48:44] kormat: you could also use orchestrator to move hosts
[09:48:54] marostegui: oh yeah! i'll try that, see what happens
[09:49:14] it will be interesting to see how it handles it
[09:49:32] oh. i made an oops.
[09:49:43] i should not have nuked the replication on pc2011
[09:49:46] now i can't use any tools
[09:49:48] sigh.
[09:49:54] It might not let you do it with the smart mode (gtid), so you might need to go for the classic one (file/position)
[09:51:26] ok, set up pc2007 to replicate from pc2011, now i can use tools for pc2010/pc2014
[09:52:02] kormat: in this particular case you can also break circular replication, do the move, and then enable it again
[09:54:01] pc2010 worked fine, pc2014 broke due to gtid. i reset it, and just did it manually.
[09:54:24] due to gtid? impossible position I assume?
[09:54:26] i should have used the console, also, not the web UI
[09:54:28] yeah
[09:54:35] I wonder why it let you do it
[09:55:04] That's why I was saying that maybe it would fail (without breaking) and tell you to use the "classic" move
[09:55:10] yeah
[09:58:30] ok. the tree _looks_ right now
[09:58:46] need to wait for replication to catch up, and then re-enable gtid
[09:58:50] 🤞
[09:58:59] do we need to make changes to the heartbeat table?
[09:59:20] already done
[09:59:34] aw
[10:05:37] huh.
i'm surprised the perf-team's PC dashboard doesn't display latency
[10:07:41] ok, gtid enabled for all of pc1
[11:33:56] if I am on the right channel:
[11:34:23] heads up for some high write and read activity on db2151 (misc/mediabackupstemp)
[11:35:13] nothing to worry about then?
[11:35:27] no, it should not affect mw or anything
[11:35:32] good!
[11:35:35] thanks
[11:35:37] but in case you see some spikes of db traffic
[11:38:32] I realize now that my first message was ambiguous: I meant "this is me, don't worry if you see: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&var-site=codfw&var-group=All&var-shard=All&var-role=All&from=1628767126346&to=1628768891510"
[11:39:10] nice increase
[11:39:35] oh, this is just the beginning (1 thread)
[11:40:11] but it kinda justifies my fear of not putting it on m1 from the beginning
[13:15:32] Anyone know offhand where the primary DB query on GET logs end up in logstash?
[13:30:06] sorry, I don't quite get "where the primary DB query on GET logs"
[13:31:09] MW logs if a GET request makes a query to a primary DB server
[13:31:20] As it's generally something that's undesirable
[14:30:50] sorry, don't know much about that; my guess is performance set it up
[14:31:45] I think, but I'm not sure, that they are leading the cross-dc work
[17:13:20] the backup1001 backup freshness alert will soon be fixed
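[Editor's note: the "wait for replication to catch up, then re-enable gtid" step from the log could be sketched as below in MariaDB syntax. This is a hedged sketch of the general technique, not the exact commands used on the pc1 hosts.]

```sql
-- On each replica, once SHOW SLAVE STATUS reports
-- Seconds_Behind_Master = 0, switch back from file/position to GTID:
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;

-- Verify: Using_Gtid should now report Slave_Pos,
-- and replication lag should stay at 0.
SHOW SLAVE STATUS;
```

Switching to `slave_pos` at a caught-up position avoids the "impossible position" failure seen on pc2014, where the GTID state on the replica did not match what the new primary could serve.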