[06:49:54] Going to switch over the phabricator master - phabricator will be read-only for around 1 minute
[08:19:36] topranks: the row A maintenance task isn't closed yet, but just to be clear, there's nothing else from your side pending, right?
[08:20:39] I am working on row A hosts assuming there's nothing else pending and it is business as usual
[09:20:16] marostegui: correct, everything is complete. I'll close the task off today
[09:20:32] topranks: excellent, thank you
[12:06:55] vgutierrez: When do you think you'd have some time to discuss or give me some pointers for https://phabricator.wikimedia.org/T331318 ?
[12:07:39] claime: hmm first ATS layer?
[12:07:53] what's the first and second ATS layer for you? :)
[12:08:20] vgutierrez: As I understood it (and I may be completely wrong), traffic goes ATS -> varnish -> ATS ?
[12:08:28] haproxy -> varnish -> ATS
[12:08:37] Ah, then the haproxy layer, editing
[12:09:23] Basically the idea would be to catch certain domains at the first layer, add a header that we'd catch in trafficserver::backend to redirect to mw-on-k8s
[12:09:32] So we can avoid this https://gerrit.wikimedia.org/r/c/operations/puppet/+/894529
[12:13:48] claime: I'm assuming a regex_map wouldn't fix the issue?
[12:18:48] vgutierrez: it'll quickly become unreadable I think, and error-prone
[12:19:06] regex? unreadable? surely not!
[12:19:08] :)
[12:19:43] * claime blabla two problems
[12:24:42] vgutierrez: Basically, I'm ok with doing it via regex for those two domains, but a more flexible and legible approach for the longer term would be nice
[12:25:03] claime: hmmm we could put something on top rather than having a human write the regex
[12:25:39] just provide a list of FQDNs that should hit k8s rather than the legacy mw instances
[12:26:32] Can you do that directly in ATS?
[12:26:43] I don't know that much about it
[12:26:51] <_joe_> it's lua scripting
[12:27:06] <_joe_> so it means you can do almost anything, but you'll dearly regret doing it in the future
[12:27:17] :')
[12:27:19] hmm just some puppetization? O:)
[12:28:20] I would *really* like it to be as simple a solution as possible
[12:28:33] So we don't end up with a bunch of headaches
[12:28:51] you are just moving the problem of URL matching from one place to another
[12:32:09] we can maybe reduce the question to: where is it easier to specify a list of URLs?
[12:33:22] As it stands, we need the three blocks of conf for rest, api, and standard
[12:33:41] Which means triplicating the URL matching regex
[12:33:56] err
[12:34:10] That's why I was thinking of doing the URL matching on another layer, and just matching a header in ATS
[12:34:54] Maybe we only need two, api/rest and standard
[12:36:04] you need the three rules, one regex to match the FQDN part of the URL
[12:38:56] yes, but that regex will be declared three times, one target in each block, right?
[12:48:39] It will be used those three times, yes
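One possible shape for the "list of FQDNs" idea above: keep the k8s-bound hostnames as plain data and render the three rule blocks from templates, so no human maintains the alternation regex by hand. A minimal Python sketch; the hostnames, rule templates, and backend targets below are illustrative placeholders, not the real ATS or puppet configuration.

```python
import re

# Hypothetical list of wikis that should hit mw-on-k8s.
K8S_FQDNS = ["test.wikipedia.org", "test2.wikipedia.org"]

# One template per conf block (rest, api, standard) as discussed above;
# the backend targets are made-up placeholders.
RULE_TEMPLATES = [
    "regex_map https://({hosts})/api/rest_v1/ https://mw-rest-k8s.example/",
    "regex_map https://({hosts})/w/api.php https://mw-api-k8s.example/",
    "regex_map https://({hosts})/ https://mw-web-k8s.example/",
]

def render_rules(fqdns):
    # Build the alternation once; escape dots so they match literally.
    hosts = "|".join(re.escape(f) for f in fqdns)
    return "\n".join(t.format(hosts=hosts) for t in RULE_TEMPLATES)

if __name__ == "__main__":
    print(render_rules(K8S_FQDNS))
```

The same declared FQDN list could just as well feed the haproxy-header approach claime describes; the point is that the matching data lives in one list regardless of which layer consumes it.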
[12:54:38] when is eqiad being repooled?
[12:59:38] marostegui: repooled in which way?
[13:00:14] If it's the traffic layer, I did it at 1121 UTC
[13:00:38] claime: yeah, I am talking about MW reading from eqiad dbs specifically :)
[13:01:36] Current switchback is scheduled for April 25th and 26th
[13:01:55] I did repool restbase-async in eqiad this morning, but idk if it access dbs or not
[13:02:06] accesses*
[13:05:19] I am confused, at the moment we are only using codfw databases for reads, meaning we are not in multi-dc on that layer anymore. Are we leaving eqiad out until April 26th?
[13:07:31] marostegui: That is the plan as I understand it
[13:07:46] akosiaris can maybe clarify when he gets back if I'm wrong
[13:07:55] Then we'll need to warm up eqiad before the switch back
[13:07:58] yes
[13:08:26] Ok, as long as we are all on the same page, that's ok. I thought we were going to repool eqiad at some point and only depool it for the switch maintenances
[13:08:48] We'll clarify that this afternoon, I gotta go to lunch rn
[13:09:02] No problem, enjoy!
[13:09:06] I can't concentrate with the noise my tummy is making :D
[14:09:31] <_joe_> not how I understood the plan either fwiw :)
[14:09:56] <_joe_> I thought we'd switch back read-only traffic after we were done with network maintenances
[14:11:41] Then we really do need to clarify that :D
[14:19:53] <_joe_> I'm pretty sure I read something that to me meant we were going to do that for mw RO and services alike
[14:20:13] Yeah, same
[14:20:51] I'd scheduled the switchback the day before the mw-rw switchback from the start
[14:21:16] But we can discuss doing it another way, the network maintenances end on March 21st
[14:22:31] Oh I can see the confusion https://phabricator.wikimedia.org/T328903
[14:22:58] It's my bad, I had taken the multi-dc RO repool as just traffic, but yeah you're right
[14:23:14] I'll change the task and schedule the services and RO switchback after the last row maintenance
[14:25:04] Changed
[14:26:20] We will still probably need to warm up since it'll be 3 weeks of being completely depooled
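For context, "warming up" here means replaying read traffic at the idle DC so its edge caches and database buffer pools are hot before real users land on it again. A rough sketch of the idea, assuming a direct eqiad entry point and a sample of popular URLs; both are made up, and this is not the actual warmup tooling.

```python
import urllib.request

# Assumed direct eqiad entry point and sample high-traffic URLs;
# both are illustrative, not production values.
EQIAD_ENDPOINT = "https://eqiad-entrypoint.example"
TOP_URLS = [
    "/wiki/Main_Page",
    "/w/api.php?action=query&meta=siteinfo",
]

def warm(host="en.wikipedia.org"):
    for path in TOP_URLS:
        req = urllib.request.Request(EQIAD_ENDPOINT + path,
                                     headers={"Host": host})
        try:
            # Reading the body forces the full fetch, priming caches
            # and DB buffer pools along the way.
            urllib.request.urlopen(req, timeout=10).read()
        except OSError as exc:
            print(f"warmup failed for {path}: {exc}")

if __name__ == "__main__":
    warm()
```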
[14:50:59] o/
[14:51:07] my understanding was the same. Eqiad R/O
[14:51:28] full flip of roles between the 2 DCs that is
[14:52:08] otherwise we'd need to warm up eqiad, which multiDC R/O would keep warm
[14:52:53] akosiaris: Yeah, I got my wires crossed at some point.
[14:53:00] I'd like to depool sessionstore in eqiad and vigorously reboot some nodes to reproduce T327954, is there any reason today would be a bad day to do that?
[14:53:01] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[14:53:23] urandom: it's depooled right now
[14:53:35] akosiaris: oh, why?
[14:53:44] datacenter switchover
[14:53:46] akosiaris: I edited the task so it reflects that, with a repool date the day after the eqiad row B update, sounds good ?
[14:54:12] urandom: in fact you're asking right on the day the original plan was to repool eqiad R/O
[14:54:17] claime: looking
[14:55:02] akosiaris: out of curiosity, why was it depooled?
[14:55:21] urandom: cause of the datacenter switchover. Full depool of all of eqiad
[14:55:32] right now nothing goes to eqiad. Absolutely nothing
[14:55:43] well, that's not true. Nothing mediawiki related
[14:55:47] that's more correct
[14:56:07] I see, I guess I assumed (sessionstore-wise) they'd simply switch roles
[14:56:30] that's the plan but for 1 week we wanted to have proof that codfw can survive on its own
[14:56:43] ah, ok, makes sense
[14:56:56] when is it slated to be made RO?
[14:57:00] claime: so, basically extend the situation for another 2 weeks, right ?
[14:57:10] urandom: that's what we are working out right now
[14:57:20] akosiaris: If we don't want to have to do another depool dance for the network maintenance
[14:57:35] and 4 weeks of actual multiDC
[14:57:36] But I can also repool RO tomorrow, and we'll depool on the 20th
[14:58:35] point being to avoid the eqiad row B issues... that's tempting for sure
[14:58:52] I don't really see an issue with pushing the current situation for another 2 weeks, it's holding up well
[14:59:17] And 4 weeks of multidc will be plenty of time to warm it back up
[14:59:58] but should the row B maintenance really be on the 21st? that's within sprint week and it feels counterproductive to the idea of sprint week to distract half of SRE for the duration of the maintenance
[15:00:39] (personally I only realised the date overlap right now, not sure if that was discussed before)
[15:01:33] I hadn't realized that
[15:01:49] so, the other issue is that we are penalizing readers closer to eqiad with extra latency
[15:01:57] for another 2 weeks that is
[15:01:58] maybe we can do row B one week earlier, next Tue
[15:02:16] topranks, XioNoX ^ thoughts ?
[15:02:25] Arzhel is off this week
[15:02:34] I think we probably don't want to penalize our readers, do we ?
[15:02:36] the latency diff isn't huge in the grand global scheme of things
[15:02:44] <_joe_> yeah the latency diff is very small
[15:02:57] it's like.. something less than 40ms, right ?
[15:03:07] <_joe_> I would rather do more traffic/pooling gymnastics though
[15:03:10] i don't think I can have my stuff ready before the 21st, cause I am basically alone for 3 weeks
[15:03:11] <_joe_> it keeps us trained :)
[15:03:12] depends on the users' location really, but either way it's reasonable
[15:03:16] <_joe_> ah right
[15:03:18] <_joe_> see marostegui
[15:03:24] Around 30ms ?
[15:03:27] <_joe_> (also I'm on clinic duty during sprint week)
[15:03:33] (eqiad -> codfw)
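To put the ~30ms figure above in perspective: a reader near eqiad whose request has to be served from codfw pays roughly one extra cross-DC round trip per sequential fetch. A back-of-the-envelope sketch with assumed request shapes:

```python
# Approximate eqiad<->codfw round-trip time, per the discussion above.
CROSS_DC_RTT_MS = 30

# Illustrative request shapes: number of sequential cross-DC round trips
# a reader near eqiad would pay while eqiad is depooled.
SCENARIOS = {
    "served from local edge cache": 0,
    "cache miss, single origin fetch": 1,
    "cache miss plus two sequential subrequests": 3,
}

for scenario, trips in SCENARIOS.items():
    print(f"{scenario}: +{trips * CROSS_DC_RTT_MS} ms")
```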
[15:04:16] I am talking about the network maintenance. To repool codfw, I am ready by Monday next week
[15:04:31] marostegui: we aren't talking codfw :P
[15:04:41] just catching up here
[15:04:43] we are talking eqiad MultiDC R/O
[15:05:01] em in general I see no reason why we can't move the eqiad row B upgrade to next Tues, March 14th
[15:05:03] akosiaris: Sorry, yeah
[15:05:04] and there is a second conversation, which is https://phabricator.wikimedia.org/T330165 right on sprint week
[15:05:12] My brain defaults to codfw as sby :)
[15:05:30] akosiaris: Yes, to be clear. I am ready to repool eqiad by Monday and NOT ready to do the switch maintenance before the 21st
[15:05:39] akosiaris: yes it makes sense to avoid that week for that reason alone
[15:05:49] I am alone till the end of the month
[15:06:03] marostegui: ok np
[15:06:11] topranks: if you are flexible and amenable to shifting forward 1 week, super!
[15:06:32] marostegui: ehm, I need some help here. So, the original plan was to put eqiad in R/O multiDC today
[15:06:38] we can also push back the row B to the 28th, or push them all out 2 weeks even
[15:06:48] do I interpret your comment correctly that we wouldn't be able to do it anyway ?
[15:07:02] cause we may have failed you somewhere then
[15:07:14] akosiaris: sorry if I am unclear. I think we'd be happy to move 1 week earlier or 1 week later
[15:07:34] given earlier is not suitable, then let's do it 1 week later, March 28th instead of the 21st.
[15:08:18] akosiaris: Today? let me check but I think we can (I just had planned stuff for tomorrow but that is ok)
[15:09:31] akosiaris: I am fine if we want to go multi-dc today, yes
[15:09:31] We're not glued to a particular day btw
[15:09:41] marostegui: we can push it forward, that's fine, I just wanna make sure we are all on the same page
[15:09:55] akosiaris: if we can push it till Monday/Tuesday, that'd be great. If not, that's also ok
[15:10:11] claime: I think we can be amenable to shifting to next Tuesday?
[15:10:41] actually, if topranks wants to move https://phabricator.wikimedia.org/T330165 to the 14th
[15:10:54] we can do Wednesday instead
[15:11:01] akosiaris: I'm good with either of the two
[15:11:12] and not have to do extra pool/depool gymnastics
[15:11:32] marostegui: would moving https://phabricator.wikimedia.org/T330165 a week earlier work for you ?
[15:11:39] akosiaris: I'm happy to move the network maint. to the 14th, but I don't think that works for Manuel?
[15:12:05] akosiaris: a week earlier wouldn't work for me
[15:12:24] I cannot do it, no
[15:14:40] ok then, let's untangle them ? Move the eqiad row B upgrade to the 28th so it doesn't interrupt Sprint Week and move MultiDC R/O to the 14th ?
[15:15:10] that works
[15:15:24] akosiaris: And to be clear, we depool eqiad again from multi-dc for the 28th row B upgrade, right?
[15:15:29] yes
[15:15:37] that works
[15:15:43] yes
[15:16:21] ok, I guess we got some announcements to make?
[15:16:50] ok yep, that works for me, I'll update the dates for the row maintenance and send some announcements
[15:16:51] ok, correcting the schedule in https://phabricator.wikimedia.org/T328903
[15:17:13] thanks everyone!
[15:27:15] urandom: you got 1 extra week of sessionstore in eqiad depooled. Knock yourself out!
[15:27:46] akosiaris: woohoo!
[15:28:04] * urandom is easy to please
[15:31:35] akosiaris: marostegui: RO repool scheduled for 14/03 at 11:30 UTC
[15:32:37] claime: great!
[15:33:17] Sorry, I goofed my timezones
[15:33:34] 1030 UTC
[15:34:35] works
[15:35:12] I'm taking the MediaWiki infra window for it https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230314T1000
[16:21:34] I'm pooling thumbor-k8s in codfw, let me know if anything weird pops up.
[16:21:50] ack
[16:29:01] and depooled
[16:57:10] how did it go?
[17:08:29] Not well :) elevated 500s in swift again, not much new information. Trying to trawl through the logs to figure out what's causing it
[19:31:22] bblack: if you are around could you possibly review a dns patch for me?
[19:31:24] https://gerrit.wikimedia.org/r/c/operations/dns/+/895848
[19:40:08] apologies, v.olans has done the honours, thanks both!
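On the thumbor-k8s investigation above (17:08), trawling for elevated 500s usually starts with a status-code breakdown of the access logs. A generic sketch for a space-separated log; the status field position is an assumption, not Swift's actual log format.

```python
from collections import Counter
import sys

def count_5xx(logfile, status_field=8):
    """Tally 5xx responses in a space-separated access log.

    status_field is the zero-based column holding the HTTP status;
    this position is assumed and will differ per log format.
    """
    counts = Counter()
    with open(logfile) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) > status_field and fields[status_field].startswith("5"):
                counts[fields[status_field]] += 1
    return counts

if __name__ == "__main__":
    for status, n in count_5xx(sys.argv[1]).most_common():
        print(status, n)
```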