[08:12:07] hello folks, I am upgrading ml-serve-eqiad to k8s 1.23
[08:12:23] some bgp/pybal alerts will complain; in that case, blame me :)
[08:51:16] Happy mailman day to those who celebrate ;-)
[09:10:12] So, did anything break during the night? :P
[09:14:09] ugh, someone is having fun with a chainsaw this morning :(
[09:16:40] <_joe_> claime: there were a couple of WTF moments during a scap deployment, but it was mostly panic
[09:16:59] _joe_: Yeah, I backlogged that
[09:17:01] Sneaky bug
[09:27:41] <_joe_> marostegui, XioNoX, topranks, Amir1 I would need to repool eqiad for read-only traffic for mediawiki briefly; any concerns?
[09:27:59] all good network-wise
[09:28:58] good
[09:31:26] <_joe_> thanks
[09:39:14] <_joe_> ok, test done, re-depooling
[10:08:05] I forgot re: the apt.w.o switch to codfw, and uploaded a package to apt1001's reprepro, what's the next action in this case? other than uploading to apt2001 too, that is
[10:09:15] or is the apt2001 motd correct? it says 1001 will rsync to 2001 and thus I should only wait
[10:09:40] If that motd is set via puppet, there was nothing changed for it in puppet
[10:10:07] I think what we did wrt switching over is just change what server apt.w.o hits
[10:10:37] godog: I suspect for now the motd is correct and apt1001 is still where one is expected to make the changes. I'll check in puppet and with moritz to see what we should do
[10:10:43] (not like deploy where we switched over everything in puppet)
[10:10:52] ack, thank you! ok so in theory I have to wait and that's it
[10:11:10] godog: yes, but one sec, I can at least kick off the rsync job
[10:11:23] cheers jbond
[10:11:28] kick it bop it twist it ship it
[10:11:59] (please excuse the shitty jokes, they're a stress response :P)
[10:12:16] godog: I think it should be done (was it pint?)
[10:12:21] 100% understandable claime
[10:12:21] :0
[10:12:25] jbond: that's right yeah, thank you
[10:12:40] that gave me the idea that uploading to reprepro should kick off the rsync job
[10:12:46] happy to open a task for it
[10:13:16] godog: please do
[10:13:33] * godog does
[10:13:53] Netbox has uncommitted DNS changes < Should we maybe clear that up before the switchover?
[10:15:55] T330843
[10:15:56] T330843: reprepro uploads should trigger rsync apt job - https://phabricator.wikimedia.org/T330843
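For context on T330843, a minimal sketch of what an upload-triggered sync could look like, assuming the current job is just an rsync of the reprepro tree from the active to the standby apt host; the basedir path and standby hostname below are illustrative guesses rather than what puppet actually configures, and wiring it into reprepro's hook mechanism is left out:

    #!/usr/bin/env python3
    """Hypothetical post-upload hook: push the reprepro tree to the standby host.

    The basedir and standby hostname are assumptions for illustration only.
    """
    import subprocess
    import sys

    REPO_ROOT = "/srv/wikimedia/"      # assumed reprepro basedir
    STANDBY = "apt2001.wikimedia.org"  # assumed standby host

    def sync_to_standby() -> int:
        # Mirror the whole repository tree; --delete keeps the standby identical.
        cmd = ["rsync", "--archive", "--delete", REPO_ROOT, f"{STANDBY}:{REPO_ROOT}"]
        return subprocess.run(cmd, check=False).returncode

    if __name__ == "__main__":
        sys.exit(sync_to_standby())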
[10:19:33] claime: is there some task you would like me to associate 893409 (apt: swap active and failover apt servers) with?
[10:20:19] jbond: yes please, T328907
[10:20:20] T328907: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907
[10:21:39] When the dust has settled I'll wrangle all the small things that needed to be done so they can be added to the doc
[10:21:51] ack, will do, thanks
[10:22:03] for future note, netbox too has some hiera related to the active server and will need some work to make it a truly discovery-managed service
[10:22:57] claime: as for the netbox uncommitted changes :(
[10:23:14] dcaro: do you have a prompt with the dns diff asking you to proceed or not?
[10:23:27] I see from SAL that sre.hosts.decommission for host cloudcephosd1010.eqiad.wmnet started a while ago
[10:24:15] I did, hit go right before your ping :)
[10:24:17] there is also a change from jgreen uncommitted :/
[10:24:26] thanks dcaro!
[10:24:38] claime: that should clear shortly
[10:26:13] volans: <3
[10:26:29] thx dcaro too
[10:43:58] q: why are the dns aliases for the active deployment server in eqiad/codfw.wmnet? given both point to the same host I was wondering if it should be something like deployment.discovery.wmnet instead
[10:48:22] taavi: I'm guessing because there are still quite a few manual steps involved with switching it over (cf https://wikitech.wikimedia.org/wiki/Switch_Datacenter/DeploymentServer), and tech debt
[10:48:39] I don't know if it's worth automating it more or not
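To illustrate the observation behind taavi's question: both per-DC aliases currently resolve to the single active deployment host, whereas a deployment.discovery.wmnet record would be flipped through DNS discovery like other services. A small sketch that just resolves and compares the two names from the conversation (it only works from inside the production network):

    import socket

    # Both per-DC aliases are expected to point at the same active deployment
    # server; a deployment.discovery.wmnet record would instead be switched via
    # DNS discovery. Resolution only works from inside the production network.
    ALIASES = ["deployment.eqiad.wmnet", "deployment.codfw.wmnet"]

    targets = {}
    for name in ALIASES:
        try:
            targets[name] = sorted({info[4][0] for info in socket.getaddrinfo(name, None)})
        except socket.gaierror as exc:
            targets[name] = [f"unresolvable: {exc}"]

    for name, addrs in targets.items():
        print(f"{name} -> {addrs}")

    if len({tuple(addrs) for addrs in targets.values()}) == 1:
        print("both aliases point at the same host")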
[10:56:05] I found /srv/deployment-charts/helmfile.d/admin on deploy2002, which I didn't recall being on 1002 (we only have admin_ng, no?)
[10:57:40] seems to contain old calico stuff, and git status looks clean
[10:59:38] elukey: probably a leftover, it's gitignored
[10:59:53] it matches helmfile.d/admin/*/*/private
[11:00:08] super, makes sense, didn't check gitignore
[11:00:19] if everybody is ok I'll clean it
[11:00:46] jayme: ^ ?
[11:01:55] thanks claime. Yeah, that's the old stuff (pre admin_ng) - go ahead and delete it elukey!
[11:03:08] ack!
[11:25:39] jbond: https://wikitech.wikimedia.org/wiki/Switch_Datacenter#apt
[11:27:04] claime: great thanks
[12:33:45] how terrible of me would it be to deploy some changes now?
[12:33:54] none are urgent tbh
[12:36:34] Amir1: https://media1.giphy.com/media/JRF85A7Bcl2YU/giphy.gif
[12:36:55] (jk)
[12:37:02] the real answer is that I don't know
[13:03:29] Amir1: I'd rather not tbh, limiting the number of changes before the switchover seems like a good idea
[13:04:31] yeah
[13:09:27] Adding scheduled maintenance to statuspage
[13:12:06] I will also be putting up a scap lock at 1330
[13:12:18] +1
[13:12:45] do we have a todo list of the additional lock capabilities we'd like to have in the future?
[13:12:54] not yet no
[13:15:26] tmux session for switchover on cumin1001: sudo tmux -S /tmp/tmux-40392/default attach -r
[13:16:14] can we coordinate on -operations this time? that way everyone interested can follow along there
[13:16:51] yes
[13:36:55] a moment of attention everyone: we're about to perform the mediawiki datacenter switchover. Please don't merge changes during the operation
[13:43:08] Emperor: fyi I'm going to grab some food
[13:43:35] ack
[14:44:56] <_joe_> (switchover is done, you're now free to break production however you'd like. As long as you fix it afterwards)
[14:54:55] akosiaris: sorry, I didn't mean to be a party pooper, but I have the experience of breaking production AFTER the switchover, at least twice
[14:55:43] <_joe_> jynus: the surrounding conditions were radically different this time though, that's why I was more relaxed than usual
[14:56:58] I've been super thankful for claime's work (as the visible head, but I know of many other folks, too) and he knows it
[14:58:04] <_joe_> :) sure, I was explaining why we felt less on edge as soon as read-only was over
[14:58:28] I have lived this from a distance, so I was quite worried (as usual) :-P
[14:59:18] even if I didn't doubt this was going to be the best switchover ever
[15:00:25] it's nice to see more side-benefits from multi-dc too :)
[15:00:40] <_joe_> bblack: definitely
[15:01:01] <_joe_> the edits were back at full speed as soon as read-only went away
[15:01:13] <_joe_> we didn't have the multiple minutes of slowdowns we had in the past
[15:01:14] does anyone remember how chaotic the first time was?
[15:01:22] <_joe_> how can I forget
[15:01:32] <_joe_> it was 9 puppet patches and 5 code deployments, too
[15:01:38] I literally set up parsercaches on codfw dbs the night before
[15:01:53] <_joe_> and yes, that switchover gave me RSI
[15:01:55] like, for the first time
[15:02:11] <_joe_> I was at my pc for 16 hours per day for a couple weeks
[15:03:02] congrats people :-) It's so awesome to see how far we've come with this process and our infrastructure
[15:03:45] <_joe_> question_mark: indeed, it really makes me proud of all of our work :)
[15:03:48] next time we should do it during a Tech All Hands demo ;-p
[15:03:50] <_joe_> highlight is https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=now-3h&to=now
[15:03:54] the real deal
[15:03:54] <_joe_> ahahah
[15:03:57] question_mark: lol
[15:03:57] <_joe_> question_mark: that's evil
[15:04:17] we don't have enough SREs to sacrifice to keep the demo gods happy
[15:05:15] <_joe_> this is a typical case of "if we do our job well, no one notices" btw :)
[15:05:22] as usual
[15:06:12] or worse, they notice only with a minor negative. Like, "hey did something happen Wednesday? there was a small dropout in "
[15:13:39] why not next time shut down a datacenter without advance notice? My dream is to delete all data in one datacenter one day!
[15:14:08] (or at least pretend to)
[15:15:23] <_joe_> I think we have to think of a procedure around that
[15:16:10] <_joe_> our switchover has two purposes: 1) allow disruptive maintenance in one dc 2) test that our procedure for a switchover in case of some partial outage or connectivity issue works as it should
[15:16:19] dc.switchover is done, time to think about dc.failover?
[15:16:34] <_joe_> we're not yet at the point "ok dc A has been sold to OVH"
[15:17:14] yeah, I mean "start thinking about it"
[15:17:30] <_joe_> so basically the scenario where someone pulls the plug on $live_dc
[15:17:40] <_joe_> there are potentially multiple layers of issues
[15:17:46] but with multi-dc many things got simplified
[15:17:50] <_joe_> we'd have to first depool traffic and dns
[15:17:54] <_joe_> then move gerrit
[15:17:59] <_joe_> then ensure puppet is ok
[15:18:08] <_joe_> then move the etcd master I guess
[15:18:19] <_joe_> well no, in an emergency we can avoid puppet/gerrit
[15:18:42] <_joe_> and just move dns traffic, kill etcdmirror and etcd read-only, and do the mw switchover
[15:18:45] <_joe_> with fewer checks though
[15:19:07] <_joe_> paradoxically in that scenario it would be ok to just change values in etcd with conftool
[15:19:09] I'd say it isn't far-fetched, at least for some basic scenarios
[15:19:13] <_joe_> no checks at all, one dc is on fire
[15:19:25] <_joe_> but yes, defining the scenarios would be important
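Writing the emergency ordering above down as a dry-run checklist makes it easier to turn into concrete scenarios later. The step descriptions come from the discussion; the example commands are only placeholders, since the real selectors, unit names, and cookbooks live in conftool-data and the sre.switchdc cookbooks and should be checked there:

    """Rough ordering of the "one dc is on fire" scenario as a dry-run checklist.

    Step descriptions come from the discussion above; example commands are
    illustrative placeholders, not verified syntax.
    """
    from dataclasses import dataclass

    @dataclass
    class Step:
        description: str        # taken from the discussion above
        example_command: str    # illustrative placeholder, syntax not verified

    EMERGENCY_STEPS = [
        Step("Depool edge traffic and DNS away from the failed DC",
             "# e.g. something like: confctl --object-type discovery select ... set/pooled=false"),
        Step("Stop etcdmirror replication towards the failed DC",
             "# systemctl stop <etcdmirror unit>  (unit name varies per cluster)"),
        Step("Lift etcd read-only so the surviving DC accepts config writes",
             "# cluster-specific, see the etcd runbook"),
        Step("Run the MediaWiki switchover with reduced checks",
             "# sre.switchdc.mediawiki cookbooks, skipping checks that need the dead DC"),
    ]

    def dry_run() -> None:
        """Print the checklist instead of executing anything."""
        for i, step in enumerate(EMERGENCY_STEPS, start=1):
            print(f"{i}. {step.description}")
            print(f"   {step.example_command}")

    if __name__ == "__main__":
        dry_run()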
[15:21:54] wow the switch over happened?! I DID NOT NOTICE
[15:21:56] amazing.
[15:23:32] <_joe_> YES
[15:23:41] <_joe_> props to claime who was our MC
[15:24:06] yeah totally, so amazing.
[15:27:03] +1 to that, thanks for your hard work claime and everyone else
[15:27:31] <3
[15:55:37] Congrats on the switchover!
[15:55:46] Nice work claime!
[15:55:59] dancy: Thanks <3
[16:17:56] Congrats claime and everyone involved
[16:18:06] Thanks RhinosF1 <3
[16:19:35] _joe_ │ we're not yet at the point "ok dc A has been sold to OVH" < lmao
[16:19:48] <_joe_> *cough cough*
[17:13:37] If the 94s of no edits on stream (see wikitech-l) is accurate, that’s insanely fast
[17:14:29] <_joe_> I think https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=1677678605302&to=1677680159519 indeed puts the ro time between 1 and 2 minutes pretty firmly
[17:15:45] So probably then
[17:16:10] <_joe_> I'm not 100% sure the times reported by eventstreams are accurate
[17:16:12] RhinosF1: I can clarify, we're just not counting the same thing. We measure the time from the moment we start being read-only, to the time we are back to completely read-write.
[17:16:43] technically it's from the first one becoming r/o to the first one becoming r/w, right? or does it wait for r/w to get applied everywhere?
[17:16:57] The 94s measures from the moment we are completely read-only to the first read-write.
[17:17:14] taavi: Our measurement waits for read-write everywhere
[17:17:20] ah, nice
[17:17:35] So we have a lower and an upper bound
[17:18:19] claime: that makes sense. It might be worth mentioning that on the thread. It’s still an amazing stat and a credit to you all. I kind of asked too because like _joe_ I wasn’t sure how accurate eventstream was
[17:18:29] RhinosF1: Amir already did :D
[17:18:36] tl;dr it's complicated
[17:18:47] <_joe_> So there are some complications
[17:18:48] We're now wondering what would give us the most accurate measure
[17:19:03] I now see
[17:19:08] <_joe_> how do we define "being read only"
[17:19:25] <_joe_> is it "a new edit won't work because we've flipped read only on"
[17:19:34] <_joe_> or rather "no edit makes it to the database"
[17:20:30] <_joe_> the answer is 119 or 94 seconds depending on that
[17:20:51] you could also compute the maximum time for any single wiki, which will be somewhere between the two ;)
[17:20:59] that's probably the "best" user impact number but, eh, 119s is fine.
[17:21:12] Yeah it really is nitpicking at this point
[17:21:15] (and very impressive)
[17:21:18] Less than 2 minutes is fantastic
[17:21:24] I'm really happy with that time
[17:24:10] yeah it becomes complex once you go nitpicking, what about central auth :D
[17:25:02] we could though, in the future, publish the not-fully-RW time as we do now and additionally the fully-RO time (as in all masters are RO)
[17:25:04] why would CA be affected differently than other features?
[17:25:45] that might affect all wikis
[17:26:12] <_joe_> look, I think honestly it's ok if we measure by excess
[17:26:31] so it might be more tricky to calculate the per-wiki impact taavi, just that ;)
[17:26:34] yeah
[17:26:37] <_joe_> the impact is from the first editor being denied an edit to the last editor to which this happens
[17:26:54] <_joe_> the best approximation of that we have is what we record now
[17:26:58] I agree with _joe_, the impact metric is this
[17:27:19] And that's what we care about, the user impact of the switchover
[17:27:23] <_joe_> we could look at statsd maybe :)
[17:27:31] <_joe_> if we are collecting such events
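The lower/upper bound distinction is easy to pin down with a toy calculation: per DB section, take the time it went read-only and the time it came back read-write; the 94s-style figure is the gap between the last section to go read-only and the first to come back read-write, while the 119s-style figure runs from the first read-only to the last read-write. A sketch with invented timestamps shaped to reproduce those two numbers (not the real data from the switchover):

    def ro_window_bounds(sections: dict[str, tuple[float, float]]) -> tuple[float, float]:
        """sections maps DB section -> (ro_at, rw_at) as unix timestamps."""
        ro_times = [ro for ro, _ in sections.values()]
        rw_times = [rw for _, rw in sections.values()]
        # Lower bound: fully read-only (last section to flip) -> first section back read-write.
        lower = min(rw_times) - max(ro_times)
        # Upper bound: first section read-only -> everything back read-write.
        upper = max(rw_times) - min(ro_times)
        return lower, upper

    # Invented numbers, shaped to reproduce the 94s / 119s split discussed above.
    example = {
        "s1": (0.0, 119.0),
        "s2": (10.0, 119.0),
        "s3": (25.0, 119.0),
    }
    print(ro_window_bounds(example))  # -> (94.0, 119.0)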
[18:16:19] what's with the esams intermittent network errors - they seem like a monitoring glitch, not real, right?
[18:16:21] are we OK to do puppet merges now?
[18:17:46] inflatador: post-switchover you mean? yep, business as usual again
[18:18:00] rzl: ACK thanks
[18:21:46] bblack: is there dns3001 maintenance?
[18:22:51] jynus: see sal
[18:22:53] 16:28 brett: Remove dns3001 DNS request routing via juniper - T321309
[18:22:54] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[18:23:08] actually I was looking at SAL and didn't see that, thanks
[18:23:14] too many lines
[18:23:21] ok, thank you
[18:23:50] I was wondering if there was a real issue and just dns redundancy worked, or it was just that maintenance, sorry
[18:33:29] jynus: No prob! Thanks for keeping your eyes peeled :)
[18:33:48] Would be done a lot quicker if idrac decided to work....