[07:23:13] topranks: no need for the sorry please, you helped a lot! I confirm that all nodes worked, but I had to manually clean one switch again when puppet broke.. I'll report if I find issues when reimaging the ml-serve nodes in row E/F :)
[07:45:00] elukey, topranks, what's the tl;dr of the issue? I don't fully get the latest task updates
[07:46:22] XioNoX: o/ it is not clear to me what the status is, I got the same issue again when reimaging the last node.. I worked around it by clearing the switch and hitting "retry" in the cookbook for the first puppet run
[08:01:51] same. The ones I did worked fine, entire process, didn’t need any manual intervention.
[08:02:19] So basically I wasn’t able to recreate the problem to work out what was going wrong
[08:02:49] There’s definitely something odd happening sometimes but still unsure what triggers it
[08:04:00] we should maybe get a test server in there to try a few more and try to catch exactly when it stops working
[08:06:47] In terms of the task there were some complications running those commands earlier. But more importantly I wanted to see what was happening
[08:07:15] the problem previously didn’t happen till 12 hours after the dhcp exchange (at lease expiry)
[08:07:31] I see
[08:07:37] so need to work out what can make it happen earlier in the process
[08:07:54] seems like here the mac-to-ip table was not getting populated, so maybe a variation of that issue
[08:08:13] hmm ok
[08:08:52] and right after dhcp too, so a single clean was good enough if there was no re-image afterwards
[08:09:15] The one I looked at yesterday had these symptoms:
[08:09:18] https://phabricator.wikimedia.org/P44879
[08:10:15] Which matches the previous. Basically there is an entry in the mac-ip table but none in the arp table
[08:10:31] The presence of the mac-ip entry stops the arp table refresh from working
[08:10:39] maybe I misremember?
[08:11:12] previously that circumstance was caused by dhcp binding expiry, which triggered arp deletion but not mac-ip deletion (as it should)
[08:11:55] so the circumstances look the same, but I’m not sure what’s causing it
[08:12:34] there were a few traces I ran last time with TAC I was trying to do again, but it didn’t happen when I had them running
[08:13:05] yeah from my comment, the `clear ethernet-switching mac-ip-table XXX` fixed it, so I guess it was not empty
[08:13:55] (relocating, back in 10min)
[08:15:10] (bbl)
[08:50:57] I seem to have come down with something, sore throat, head etc :(
[08:51:20] gonna have to take the day sick, but ping me if anything urgent (cc jobo)
[08:52:10] GWS :(
[08:53:06] take care!
[08:56:02] indeed, take care!
[09:52:23] Sorry to hear that Cathal, take care.
[10:02:16] slyngs: got a small CR for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/892908
[10:03:08] vgutierrez: Looks good
[10:11:28] cheers
[10:11:38] merged, NOOP in terms of certificates
[10:17:30] Perfect, thank you
[10:48:21] Just a quick reminder that we're switching over services and traffic today, starting at 14:00 UTC. If we could please freeze any merges and deployments starting around 13:00 UTC I'd be very grateful :)
[10:51:46] For sre.hosts.reimage, the default when not specifying --os is Bullseye these days, right?
[10:52:00] klausman: it's a mandatory parameter
[10:52:04] there is no default
[10:52:18] ah, makes sense. ty!
[10:52:45] --os {stretch,buster,bullseye,bookworm}
[10:52:45] the Debian version to install. One of stretch, buster, bullseye, bookworm (default: None)
[10:52:53] yw :)
[10:53:41] yeah, sometimes "Default: None" just hides that there is a "fallback default" somewhere in the stack
[10:55:09] let me expand that help message
[10:57:49] volans: i think we can drop stretch from that list now
[10:58:08] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/892923
[10:58:33] jbond: we can definitely kick off some stretch reimages every now and then to upset moritzm :D :D
[10:58:35] jbond: that comes from OS_VERSIONS and I'd rather keep it until we really have removed all stretches
[10:58:58] LGTM
[10:59:17] Removing all stretches sounds bad for physical health
[10:59:24] (I'll grab my coat)
[10:59:36] volans: ack
[10:59:38] Well, if I never have to stretch again, I'm sorta ok with that
[10:59:43] lol
[11:07:46] * Emperor weeps in Swift
[11:11:16] Does it help?
[13:41:47] @claime hi! for today's event i notice there is a meet link, are we dialing in or plan to IRC only?
[13:42:13] I was planning IRC only, but we can join a meet if you want
[13:42:32] lmata: I think the meet's added automagically by Gcal
[13:42:58] In the past we've never done it via meeting, so people can follow it publicly on IRC
[13:43:12] I think we can join the meet in case of fire
[13:43:27] But otherwise IRC should suffice, at least today
[13:43:56] thanks, im not advocating for it, just curious mainly because it was there.
[13:44:07] :-)
[13:44:10] :)
[14:11:13] claime: dns revert agent ran on all dns hosts, and did a no-op "authdns-update" to confirm everything's clean/ok.
[14:11:16] Ok we're for the moment holding because of an issue with the apt service. We will fix it, then do a dry-run, then go ahead.
[14:11:32] so, we're back to normal and ready to proceed further
[14:11:41] do we need to merge https://gerrit.wikimedia.org/r/c/operations/dns/+/892963?
[14:11:50] ok, let's merge the dns patch to add apt.discovery.wmnet
[14:12:15] yeah that seems like a reasonable interim solution to me
[14:12:21] ack merging
[14:12:24] thx
[14:13:01] in the longer term, if we're going to support having these records in other domains and/or multiple places, we probably need deeper updates to metadata and tooling (to record which domains the disc records live in)
[14:13:17] or something
[14:13:47] at which point we could even have the dns tooling generate those lines, too
[14:13:51] There's work to do to add discovery stanzas and records for quite a few services, see T329193
[14:13:52] T329193: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193
[14:14:46] yeah, I think the intent of the discovery records and related stanza was that they were going to be internal, we need to figure that out
[14:14:55] it's probably not universally true that we want everything in dnsdisc, but I may not be up on current forward thinking in this space!
[14:15:53] bblack: If we want to be able to fail over quickly to codfw in an emergency, it'd be good if most (I know all is a pipe dream) services were supported by sre.discovery.datacenter is my reasoning
[14:16:26] cgoubert@cumin1001:~$ dig +short apt.discovery.wmnet
[14:16:27] 208.80.154.30
[14:16:38] this is probably not the time to go into the history. however this came about from speaking with em.a; the idea is that instead of using cnames for services that just have two hosts, we should use discovery records but bypass lvs so that we can fail over services with conftool instead of needing to perform a git commit
[14:16:42] I'd tend to agree the long-term goal is everything should be multi-dc and ideally a/a, but that doesn't necessarily mean that dnsdisc has to be the tool that delivers that for everything
[14:16:57] running another dry-run
[14:17:45] jbond: I think that makes sense, it's just the novel domain name that's throwing things for a loop
[14:18:05] i think there are a few services configured in this manner, however most are internal, apt is the only one (i think) that has no internal ip and therefore no discovery record
[14:18:30] jbond: yeah, the use case definitely makes sense. We are already using this anyway, e.g. helm-charts (I added those, it was the first service without an LVS stanza)
[14:18:41] i also suspect that the intention was to move this behind the cache and give it internal addresses (but need to check with morit.zm )
[14:18:55] and helm-charts is explicitly excluded in the cookbook btw
[14:19:04] but it has up to now always been .wmnet hosts and thus discovery.wmnet entries too
[14:19:31] ack thanks i was unaware of this ^^ bit and the complication caused by it
[14:19:44] apt move ok in dry-run
[14:19:54] dry-run successful
[14:20:07] ready to run for real if nobody objects
[14:20:34] green light from me
[14:20:47] go go gadget cookbook
[14:21:01] lol
[14:21:12] jhathaway: topranks volans heads up, switching over services now
[14:21:19] * volans following along
[14:21:37] \o/
[14:21:46] (ok, go :))
[14:21:53] did you notify the people oncall?
[14:22:00] just in case :)
[14:22:13] I thought I just did lol
[14:22:32] they are different ones :D
[14:22:45] wth
[14:22:50] not according to topic
[14:22:50] My topic is not up to date
[14:23:00] maybe some netsplit split the topics?
[14:23:11] jbond: cdanis Emperor heads up, switching over services now
[14:23:19] :) thanks
[14:23:23] sirenbot: !update
[14:23:33] Can't remember the command, to hell with it
[14:23:37] sirenbot is not opped here :/
[14:24:02] yeah that's interesting, the two channels disagree
[14:24:18] I know I'm not oncall :D
[14:24:24] tmux command again: sudo tmux -S /tmp/tmux-40392/default attach -r
[14:24:24] -private has the right people because sirenbot didn't update here
[14:24:25] you are now! :)
[14:24:34] running smoothly now
[14:25:23] FYI for less involved people, it will take a while, there are 55 services to switch and it's doing them with all the checks to ensure things are moving correctly.
[14:26:03] yep
[14:26:04] FYI there is also a dangerous mode to switch over things quickly in case of emergency (that could get some additional improvement ;))
[14:26:16] * Emperor twitches
[14:26:43] It's flagged --fast-insecure
[14:26:47] I think it's pretty explicit :D
[14:27:21] Still fyi on the process, we're doing active/active first and depooling eqiad
[14:27:32] Then moving on to A/P where we selectively switch
[14:27:43] claime: <3
[14:28:22] I was wondering, I thought we used to do the traffic-level change before the dnsdisc?
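(For context on the conftool-based failover jbond describes above, the rough shape of flipping a discovery record by hand is sketched below, using the apt record purely as an illustration; the exact confctl object names and flags should be checked against the current Wikitech DNS/Discovery docs before running anything.)

    # check which address the discovery record currently resolves to
    dig +short apt.discovery.wmnet

    # pool codfw / depool eqiad for that record in etcd via conftool;
    # no git commit or authdns-update needed, confd on the authdns hosts
    # regenerates the gdnsd state from etcd
    confctl --object-type discovery select 'dnsdisc=apt,name=codfw' set/pooled=true
    confctl --object-type discovery select 'dnsdisc=apt,name=eqiad' set/pooled=false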
[14:28:30] it can work either way
[14:28:45] If you're not on the tmux and just want to see progress, sudo cookbook -d sre.discovery.datacenter status --filter all on a cumin host
[14:29:11] bblack: That is not what is documented at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Overall_switchover_flow
[14:31:45] I think api-gateway did not switch over, we'll see at the end of the run, if only one service doesn't automagically switch over I'll be pretty happy
[14:35:21] Ah no we're just not sorting alphabetically, it's switching now
[14:35:22] claime: it's towards the end of the service catalog file
[14:35:26] yeah
[14:36:12] as I suggested another time I think we should sort them by priority, adding some sort of priority tier in the service::catalog data structure based on importance
[14:36:22] z-order
[14:37:09] cdanis: you mean https://en.wikipedia.org/wiki/Z_%28video_game%29#/media/File:Z_The_Bitmap_Brothers.PNG ?
[14:37:14] A/P in progress
[14:38:30] Hoy
[14:38:35] Careful with my tmux :D
[14:39:17] sudo tmux -S /tmp/tmux-40392/default attach -r
[14:39:24] Please attach read-only, whoever is sending control chars
[14:39:50] <_joe_> did someone press ctrl+q?
[14:40:18] Hmm
[14:40:21] Something's weird
[14:40:25] I don't see progress
[14:40:35] and the logs are only for the status
[14:40:36] watch
[14:40:47] killing watch
[14:40:54] <_joe_> yeah please
[14:41:09] stuck in swift-rw ?
[14:41:14] root 550850 1.7 0.1 165236 85608 pts/15 Sl+ 14:21 0:20 | \_ /usr/bin/python3 /usr/bin/cookbook sre.discovery.datacenter depool eqiad --all --reason Datacenter Switchover
[14:41:29] did someone send C-b C-q perhaps
[14:41:30] it might have stopped to ask for input?
[14:41:39] <_joe_> yeah possible
[14:41:44] Something screwed up
[14:41:45] <_joe_> it's idempotent though
[14:41:52] <_joe_> so you can just restart it
[14:41:57] I'm killing it and restarting
[14:42:05] <_joe_> claime: you can type "skip" maybe
[14:43:33] ok it doesn't like it when an A/P service is already moved over
[14:43:41] <_joe_> heh ok
[14:44:01] Temperamental cookbook
[14:44:11] All done.
[14:45:07] _joe_: You can use -d for status
[14:45:13] That way it won't log to SAL
[14:45:17] claime: don't close the tmux I want to try something
[14:45:22] ack
[14:45:24] Wasn't going to
[14:46:29] <_joe_> uh we migrated appserver-ro and co already?
[14:46:39] For a first real-world try of a full switch (including A/P which we hadn't done) with that cookbook I'd say it went well
[14:46:42] _joe_: yes
[14:46:52] <_joe_> it's ok, but a bit strange to do it one day in advance
[14:47:05] <_joe_> claime: definitely
[14:47:19] <_joe_> and I think I know where the a/p thing was failing fwiw
[14:47:57] _joe_: I'd asked, and was given the go-ahead to do so and only list the rw services in the mediawiki switchover, I thought we were all ok with that
[14:48:32] <_joe_> claime: again, it's ok
[14:48:43] _joe_: Yeah, just laying out the context
[14:50:23] volans: What did you want to try? I'm gonna open another window in the tmux to do the deploy if you need these
[14:50:38] claime: nah my try didn't work, thx
[14:50:43] I'm going to go ahead and start switching over the deployment server
[14:50:53] do whatever you want, thx :)
[14:52:33] Puppet disabled, merging dns patch
[14:54:01] Ok patch not good apparently
[14:54:03] E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' for name 'gr-4-3-0-1.cr2-eqiad.wikimedia.org.' and IP '2620:0:861:fe03::1', PTRs are:
[14:54:12] that's not related to the switchover
[14:55:20] +logmsgbot │ !log dcaro@cumin1001 START - Cookbook sre.dns.netbo
[14:55:27] Gotta wait
[14:55:56] \o hi, can I help?
[14:57:52] How's your sre.dns.netbox cookbook going ?
[14:58:07] Generating the DNS records from Netbox data. It will take a couple of minutes.
[14:58:34] it was ~5min ago though
[14:58:43] "2023-02-28 14:53:28,586"
[14:58:56] netbox seems super slow to me, it has been switched to codf
[14:58:58] *codfw
[14:59:01] Yes.
[14:59:01] yes
[14:59:02] so maybe it's because of that
[14:59:21] cc: jbond, XioNoX who had worked on the a/a netbox
[14:59:28] that would make sense yep
[15:00:03] Next time maybe hold until I'm done switching over datacenters
[15:00:05] claime: JIC prepare yourself to repool netbox in eqiad and then we can debug this later
[15:00:10] Yeah sure
[15:00:12] volans: netbox is still active/passive, the active/active change got rolled back
[15:00:31] I switched it over to codfw as an A/P service
[15:00:33] jbond: yes but it got switched
[15:00:51] ack and it's slow so perhaps active/active is not a good idea :)
[15:01:12] claime: can we revert it to eqiad?
[15:01:21] volans: sure
[15:01:25] thanks!
[15:02:27] sudo cookbook sre.discovery.service-route pool eqiad netbox; sudo cookbook sre.discovery.service-route depool codfw netbox
[15:02:32] good for everybody ?
[15:02:44] I would have used the datacenter one with the filter :-P
[15:02:50] volans: fyi https://phabricator.wikimedia.org/T296452#8653039
[15:02:56] no nitpicking right now please :P
[15:02:57] Well I would have to filter everything BUT netbox
[15:02:58] fyi. my cookbook passed that step (Generating DNS records now)
[15:03:08] <_joe_> ok so let's hold on netbox
[15:03:13] akosiaris: it's just that I don't recall if the service-route manages the a/p ones
[15:03:17] <_joe_> I guess maybe the service owners can decide what to do?
[15:03:32] <_joe_> dcaro: let us know when it's done
[15:03:38] ack
[15:04:08] should I abort the cookbook given the opportunity? (by prompt "type go to continue")
[15:04:18] <_joe_> no, I would say
[15:04:27] <_joe_> if the records were generated
[15:04:30] no, the last part is without netbox interaction
[15:04:33] 👍 sorry for the bad timing
[15:04:36] so it will work at normal speed
[15:05:16] volans: got any hints on the error above?
[15:05:25] 16:57:36 E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa.' for name 'gr-4-3-0-1.cr2-eqiad.wikimedia.org.' and IP '2620:0:861:fe03::1', PTRs are:
[15:05:27] repasting
[15:05:39] akosiaris: yes, there are no PTRs
[15:05:52] I was looking in netbox but it was hard due to slowness
[15:06:00] to see if in the changelog anything changed related to that IP
[15:06:03] well, apparently no forwards either ?
[15:06:12] ah wait, that's netbox generated?
[15:06:18] yes
[15:06:20] ok
[15:06:21] wikimedia.org-eqiad:gr-4-3-0-1.cr2-eqiad 1H IN AAAA 2620:0:861:fe03::1
[15:06:34] <_joe_> ok so, dcaro@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:06:40] 3.0.e.f.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa:1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR gr-4-3-0-1.cr2-eqiad.wikimedia.org.
[15:06:41] <_joe_> claime: can you re-try?
[15:06:48] <_joe_> is it still failing?
[15:06:56] Sending rechec
[15:06:58] k
[15:07:09] so the records are there AFAICT
[15:07:27] volans: https://netbox.wikimedia.org/ipam/ip-addresses/2937/ ?
[15:07:31] that seems wrong?
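(As an aside on how an E003 failure like the one above can be verified by hand, the forward and reverse records can be compared directly with dig; which resolver answers depends on the local resolv.conf.)

    # forward: name -> IPv6 address
    dig +short AAAA gr-4-3-0-1.cr2-eqiad.wikimedia.org

    # reverse: IPv6 address -> PTR (dig -x builds the ip6.arpa name for you)
    dig +short -x 2620:0:861:fe03::1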
[15:07:42] note the gr-4 vs the gr-3
[15:07:49] and that it's a 10.66.0.6/30 IP ?
[15:07:56] weird
[15:08:07] also... Created 2020-08-06 00:00 · Updated 2 years, 6 months ago
[15:08:13] what is this old thing?
[15:08:20] https://netbox.wikimedia.org/dcim/interfaces/9021/changelog/
[15:08:21] and why did it decide to bite us right now ?
[15:08:47] verified is good
[15:08:52] moving on
[15:09:17] no gate-and-submit on dns right ?
[15:09:27] akosiaris: good question, I'll try to dig into what happened there
[15:09:32] <_joe_> claime's patch is ok
[15:09:37] <_joe_> claime: authdns-update
[15:09:44] <_joe_> from authdns1001 for example
[15:09:54] Date: Thu Jan 26 13:33:26 2023 +0000
[15:09:54] cmooney@cumin1001: Remove DNS records for removed esams eqiad GRE tunnel link IPs.
[15:10:21] this removed both direct and PTR records
[15:10:31] commit cacfe24968b8137f783e4e28872290c2bc516bb4 in the generated repo
[15:10:47] <_joe_> volans: can we move debugging of such an issue elsewhere?
[15:10:56] Restbase CRIT 503, cf -ops
[15:11:04] <_joe_> claime: I'll look into it
[15:11:07] ack
[15:12:29] authdns done, merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/892373
[15:12:46] (once jenkins validates the rebase)
[15:21:20] updated https://wikitech.wikimedia.org/w/index.php?title=Deployment_server&diff=2057221&oldid=1917261
[15:22:08] Hmm, puppet didn't remove the deployment timers from deploy1002
[15:22:20] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng$ systemctl list-units | grep -A1 sync_deployment_dir
[15:22:22] sync_deployment_dir.timer loaded active waiting Periodic execution of sync_deployment_dir.service
[15:22:24] sync_patches_dir.timer loaded active waiting Periodic execution of sync_patches_dir.service
[15:22:54] Wait no, I'm wrong
[15:23:00] They're supposed to be on the passive
[15:24:56] <_joe_> I was about to say
[15:26:02] <_joe_> ok let's try a null mediawiki deployment
[15:26:13] <_joe_> I fear something will be broken wrt k8s stuff
[15:26:20] yes
[15:26:23] probably
[15:26:25] <_joe_> should I try?
[15:26:39] Grab my tmux on deploy2002
[15:26:40] I'll do it
[15:28:07] <_joe_> ok so
[15:28:17] <_joe_> it's rebuilding the images from scratch
[15:28:24] <_joe_> which is expected after the switchover
[15:28:36] yes
[15:28:44] We maybe should have done a prebuild
[15:30:20] <_joe_> nah it's ok
[15:30:28] <_joe_> we can re-deploy later and find out it's now fast
[15:30:40] Heh
[15:31:11] We can maybe try the scap3 test in parallel, hnowlan you had a restbase deployment on deck didn't you?
[15:31:40] <_joe_> I would prefer not to
[15:31:45] ack
[15:31:46] <_joe_> but we can, ofc
[15:32:13] I'm just seeing the time tick by, but the traffic switchover should be quick so we should still be in-window
[15:32:27] claime: I do, but given citoid/zotero issues I would like to pause a sec
[15:32:35] ack
[15:36:02] claime: happy to proceed whenever you fancy, restbase isn't the problem
[15:36:13] hnowlan: Waiting for scap2 test to finish
[15:38:08] ack
[15:40:00] It's currently pulling the docker images on every k8s node and it's taking a minute (for a very large value of 1 minute)
[15:41:09] Since it's all just post-switchover checks, I'm going to go ahead with the traffic switchover bblack vgutierrez jbond Emperor cdanis akosiaris
[15:41:16] ack
[15:41:32] ack
[15:44:52] One canary error on mw1450.eqiad.wmnet during scap test
[15:45:18] yeah, FWIW technically the traffic switchover is functionally independent anyways. There's some bikeshedding you can do about observed latencies or efficiency of the process, but no functional reason to care about synchronizing services-vs-traffic or ordering them in a particular way.
[15:45:54] <_joe_> claime: what was the precise error?
[15:46:01] <_joe_> that is worrisome tbh
[15:47:19] 500 on /wiki/Special:Version
[15:47:23] But only on one
[15:48:09] Traffic switchover done
[15:48:30] hnowlan: we're go for the scap3 test if you want
[15:48:42] <_joe_> claime: it's all ok
[15:48:57] note the usual caveat on traffic depools: it takes ~10 minutes to see most of the effect on graphs, due to the 10 minute TTL on the public records that changed.
[15:49:03] yep
[15:49:23] Not declaring the switchover "done" until I see the graph go down
[15:49:34] claime: groovy - want me to do it?
[15:49:52] 4xx rising on https://grafana.wikimedia.org/goto/guLEZTx4z?orgId=1
[15:49:55] hnowlan: go ahead
[15:49:59] thank you
[15:50:21] running
[15:50:57] I'm a bit worried about that 4xx rise tbh
[15:51:39] claime: hmm those requests should drop to 0 as soon as eqiad gets depooled
[15:52:18] Request rate starting to dip in eqiad (as expected)
[15:54:23] claime: BTW, that's happening on other DCs as well: https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=All&var-cluster=text&from=now-30m&to=now&var-site=esams&refresh=30s&viewPanel=7
[15:54:41] restbase deploy is going a little slow compared to what I'd expect but otherwise looking fine
[15:54:50] vgutierrez: ack
[15:57:18] claime: whatever it is, it isn't new
[15:57:20] see https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams&var-instance=cp3050&var-layer=backend&viewPanel=6&from=now-7d&to=now
[15:58:35] vgutierrez: That's what happens when you look at graphs just for an operation lol
[15:58:38] You see ghosts
[15:58:51] We're at about 50% request rate in eqiad now
[16:03:32] 30%
[16:04:30] do we include anything about the multi-dc a/a reads from ats->appservers in the switchover now? or is it just handled implicitly by the public traffic switch more or less?
[16:04:49] well and the dnsdisc switch too I guess, since it goes through appservers-ro
[16:04:49] <_joe_> we do
[16:04:53] bblack: all a/a appservers were switched with services
[16:05:45] ok makes sense now. I got a little lost threading through the puppetization and got worried :)
[16:05:53] claime: those 4xx are pretty innocent BTW.. just naive PURGE requests
[16:06:16] vgutierrez: ok, thanks for checking up on it
[16:06:20] np
[16:06:22] I'm a bit paranoid :P
[16:07:09] claime: so in I/F we decided to be on the safe side and revert netbox back to eqiad, do you want us to do it or do you want to take care of it as part of the follow-ups, being more practiced than us with the routing cookbooks? :)
[16:07:47] volans: I can do it, please just add a comment to T330651
[16:07:48] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651
[16:07:54] sure
[16:07:56] I'll switch you back rn
[16:09:05] sudo cookbook sre.discovery.service-route pool eqiad netbox; sudo cookbook sre.discovery.service-route depool codfw netbox
[16:09:10] claime: rb scap done, all okay
[16:09:10] still good for everyone ?
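(On the 10 minute TTL caveat above, one way to watch propagation is to compare what the authoritative servers currently serve with what a recursor still has cached; the record name here is only an example.)

    # answer straight from an authoritative server, no caching involved
    dig +noall +answer en.wikipedia.org @ns0.wikimedia.org

    # answer from the local recursor; the first column is the TTL still
    # remaining before clients re-resolve and pick up the change
    dig +noall +answer en.wikipedia.org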
[16:09:17] hnowlan: Awesome, thank you <3
[16:10:31] with cache wipe (for netbox)
[16:10:37] claime: I think so
[16:10:46] ok let's go
[16:10:47] commented in T330651#8653355
[16:12:24] It's pooled on both dcs for the duration of the 5 minute timeout
[16:12:27] Hope that's ok
[16:12:53] is it? being a/p it should not be
[16:12:58] even if a/a in etcd
[16:13:04] confd will not make it a/a
[16:13:14] Since I'm using sre.discovery.service-route I think so
[16:13:25] and there will be a stale file to delete on the authdns hosts (I guess the cookbook will take care of them?)
[16:13:36] It appears pooled in both with sre.discovery.datacenter
[16:13:45] I can run the depool in codfw in parallel
[16:13:47] you call
[16:13:49] nah
[16:13:50] your call
[16:13:58] db is in eqiad for both
[16:14:01] ok
[16:14:20] Then it'll be switched back in a few minutes
[16:14:50] and I can confirm I get netbox2002.codfw.wmnet. from both eqiad and codfw hosts
[16:15:02] until you depool codfw nothing changes
[16:15:37] ij
[16:15:39] ok*
[16:15:58] so I don't get what's the timeout it's waiting for :D
[16:16:14] volans: forced 300 seconds wait baked into the cookbook
[16:16:17] ok switched now
[16:16:19] Waiting 296.29 seconds for DNS changes to propagate
[16:16:56] * volans still puzzled, it did run the sre.dns.wipe-cache one too...
[16:17:01] I guess I'm lacking some context
[16:17:02] You should be switched back
[16:17:06] yep
[16:17:15] confirmed
[16:17:24] volans: I think it's forcing the wait even with wipe-cache, and we just didn't bother to skip it
[16:17:27] back to usual speed!
[16:17:30] (in code)
[16:17:34] thanks a lot
[16:17:39] np
[16:18:18] as for the failed DNS CI earlier, I didn't find any smoking gun
[16:18:58] the only thing I can think of is that the git clone in CI had some weird data, and I'd love to understand if that was the case and why
[16:21:15] As far as traffic goes, we're now flat at ~175rps per instance non-purge in eqiad
[16:21:28] but ofc the tmp directory where it's cloned is gone
[16:21:43] claime: what failed in the cookbook?
[16:21:44] Cache writes down to ~5/s
[16:21:56] spicerack.dnsdisc.DiscoveryError: Unable to resolve netbox.svc.eqiad.wmnet
[16:22:01] dns.resolver.NXDOMAIN: None of DNS query names exist: netbox.svc.eqiad.wmnet., netbox.svc.eqiad.wmnet.
[16:22:02] there is no service
[16:22:06] no lvs
[16:22:30] so yeah the route-service is not really able to handle non-lvs services I'd say ;)
[16:22:42] as opposed to the datacenter one
[16:22:44] Looks like it
[16:23:03] Well it did what we wanted to
[16:23:08] :D
[16:23:14] It failed successfully
[16:23:36] * claime pats cookbook's head
[16:23:39] You did your best
[16:23:47] ahahahah
[16:24:07] Need to wipe the stale confd files
[16:24:44] volans: refresh my memory on where/how?
[16:24:47] puppetmaster right ?
[16:25:19] claime: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/mediawiki/09-restore-ttl.py
[16:25:24] line 25
[16:25:37] ack
[16:26:06] sudo cumin 'A:dns-auth' 'ls /var/run/confd-template/.discovery-netbox.sate*.err'
[16:26:09] and then rm -fv
[16:26:46] if there are any
[16:27:14] I can't see any, claime
[16:40:09] As I was saying to the void of the netsplit
[16:40:53] I think we can safely say the traffic switchover is done https://grafana.wikimedia.org/goto/grMBVTx4z?orgId=1
[16:41:24] as done as it reasonably can be anyways :)
[16:41:45] \o/
[16:41:49] there's probably a handful of public users that mysteriously won't switch until a day later (or never!), but we can ignore it
[16:42:25] I'm looking at you, Java user :D
[16:43:56] Also some ISPs here in France cache results for a looooong time
[16:44:07] At some point Orange was caching for 48h, regardless of TTL
[16:46:12] great work, everybody!
[16:47:26] Thanks all :D
[16:48:22] We managed to finish in time too
[16:54:26] awesome work folks!
[16:55:05] claime: nice, out of curiosity.... why are you checking the ATS dashboard instead of the varnish one?
[16:55:25] vgutierrez: That's the one I had on hand, figured it was good enough
[16:55:44] https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1&viewPanel=8&from=now-12h&to=now for future reference
[16:56:23] bookmarked
[16:56:31] thank you
[20:56:44] > WARNING: You may need to add/update 'UserKnownHostsFile /Users/krinkle/.ssh/known_hosts.d/wmf-prod' to your ~/.ssh/config
[20:56:55] I love whoever added that :)
[20:56:59] * Krinkle is using the script on macOS
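(On that UserKnownHostsFile warning at the end, the matching ~/.ssh/config stanza would look something like the sketch below; the Host pattern and file path are only examples and depend on your own setup, the Wikitech production access docs have the canonical version.)

    # append a stanza keeping production host keys in a dedicated known_hosts
    # file maintained by the fetch script (example pattern and path, adjust to taste)
    cat >> ~/.ssh/config <<'EOF'
    Host *.wikimedia.org *.wmnet
        UserKnownHostsFile ~/.ssh/known_hosts.d/wmf-prod
    EOF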