[00:02:38] :D
[01:36:53] great job rzl :D
[11:08:14] btullis I'm deploying your ferm change alongside my k8s config/version patch
[11:10:16] Many thanks.
[11:17:04] pcc workers aren't happy: Warning: Request to pcc-worker1005.puppet-diffs.eqiad1.wikimedia.cloud on 443 at route https://pcc-worker1005.puppet-diffs.eqiad1.wikimedia.cloud/pdb/query/v4 timed out after 30 seconds.
[14:43:06] hmm, I'm seeing some unexpected output when using wmflib::service::get_pool_nodes() in a pcc node. It's returning eqiad nodes as expected, but it fails for the other sites/DCs
[14:43:38] https://www.irccloud.com/pastebin/GAgn3ihF/
[14:45:35] is this a known limitation of pcc?
[14:46:10] I'd expect all the data from conftool-data/node/ to be available there, not only eqiad.yaml
[14:50:24] vgutierrez: I recently noticed that $site is empty, it could be that
[14:51:16] for the services switchover today, to view tmux on cumin1003: sudo -u blake tmux attach -rt blake
[14:51:22] as in, it used the right location in hiera, but the string was empty
[14:52:00] jynus: unless I'm missing something, get_pool_nodes() will only return the nodes for $site
[14:53:41] that's it.. :(
[14:53:43] `$site_nodes = loadyaml("${module_path}/../../conftool-data/node/${::site}.yaml")[$::site]`
[14:54:14] yeah, that was what didn't work
[14:54:24] that's working as expected though
[14:54:28] ah
[14:54:35] it's loading eqiad data only
[14:54:40] mmmh
[14:54:45] I was wrong to assume that I could fetch data from other sites
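To make the limitation pinned down above concrete: the following is a loose Python rendering of what the quoted Puppet line does, showing why only the local $site's nodes ever come back. This is an illustrative sketch, not the actual wmflib implementation; the function shape and paths are assumptions based on the conftool-data layout discussed.

```python
# Sketch only: mirrors the quoted Puppet lookup
#   loadyaml(".../conftool-data/node/${::site}.yaml")[$::site]
import yaml  # pip install pyyaml

def get_pool_nodes(site: str, conftool_dir: str = "conftool-data/node") -> dict:
    """Load pool nodes the way the Puppet function does: from the single
    YAML file named after the compiling node's $site fact."""
    with open(f"{conftool_dir}/{site}.yaml") as f:
        data = yaml.safe_load(f)
    # Each file is keyed by its own site name, so only that site's
    # clusters/nodes are ever visible to the caller.
    return data.get(site, {})

# In a PCC run the node under test compiles with site=eqiad, so
# get_pool_nodes("eqiad") works, while the codfw/esams/... data lives
# in files that are simply never loaded.
```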
[15:16:47] codfw dns depooled
[15:16:53] Moving on to service switchover now
[15:18:41] godspeed! o>
[15:46:56] service switchover is complete :)
[15:47:36] \i/
[15:54:27] nice!
[15:54:33] nice, well done :)
[15:54:58] \m/
[15:55:14] nicely done! :)
[15:55:20] gg
[15:59:58] brouberol: Mar 24 15:59:24 deploy2002 git[3061907]: error: The following untracked working tree files would be overwritten by merge:
[15:59:59] Mar 24 15:59:24 deploy2002 git[3061907]: helmfile.d/dse-k8s-services/global.yaml
[16:00:15] ah, I did it again, sorry. on it
[16:00:37] thanks all :)
[16:00:37] done, sorry again
[16:01:42] something is still broken, the file is untracked
[16:02:05] rm the local copy and sudo git pull?
[16:02:25] brouberol: ^
[16:02:52] rm-ed
[16:03:20] ty
[16:09:00] I'm gonna merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1259067 (no-op ATS lua change to prepare for the rerouting of api-gateway routes), no impact planned but I'd rather flag it
[16:35:08] ggwp bjensen!
[17:50:31] topranks: it's late for you, not that that will stop you from indulging. but I wonder what's up in eqsin.
[17:51:05] it seems like at least for 10.3.0.10 and 10.3.0.1 (both anycast)
[17:51:08] traffic is going to ulsfo
[17:51:54] two ways I saw this: ping for 10.3.0.10 shows an RTT of ~200 ms, and additionally a DNS NSID query returns the ulsfo servers
[17:52:42] sukhe: most likely a side effect of https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1259199 ...
[17:53:32] ouch
[17:54:24] confirming
[17:55:04] sukhe@bast5004:~$ dig @10.3.0.1 CHAOS TXT id.server. +short
[17:55:04] "dns4003"
[17:55:09] yep
[17:57:26] ulsfo <-> eqsin is a snowflake as they're daisy-chained POPs, plus the ulsfo routed-ganeti and old-network-design transitional setup is making things messy
[17:57:45] I'll have a closer look shortly
[17:57:54] yeah, the mix of both is not helping. but anyway, I think your suspicion on that CR is spot on
[17:58:05] thanks, it's a little late for you both and we should be OK to ride it out, but yeah
[18:05:52] sukhe: I think the best bandaid is to also exclude eqsin (like we do for ulsfo) in https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1259199
[18:10:12] XioNoX: ok. would we need to update ulsfo -> eqiad as well for eqord?
[18:10:36] sukhe: nah, because we don't advertise anycast prefixes from POPs to core sites
[18:10:58] right right
[18:11:02] only from core sites to POPs, but because ulsfo is in the path to eqsin it creates this special case..
[18:11:40] not tested yet: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1260068 I will test it manually shortly
[18:12:30] I can run some quick checks if desired as well
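One such quick check, mirroring the manual dig verification above: ask the anycast resolver which server answered (CHAOS TXT id.server.) and compare it against the site the querying host should be routed to. A rough sketch that shells out to the system dig binary; the expected prefix is a hypothetical value for a check run from eqsin.

```python
# Rough sketch of the manual check above: which site answers for an
# anycast resolver address? Shells out to the system `dig` binary.
import subprocess

def id_server(resolver_ip: str) -> str:
    out = subprocess.run(
        ["dig", f"@{resolver_ip}", "CHAOS", "TXT", "id.server.", "+short"],
        capture_output=True, text=True, timeout=5, check=True,
    )
    return out.stdout.strip().strip('"')  # e.g. 'dns5003'

# Hypothetical expectation for a prober in eqsin: a dns5xxx host,
# not dns4xxx (ulsfo).
EXPECTED_PREFIX = "dns5"

if __name__ == "__main__":
    server = id_server("10.3.0.1")
    verdict = "OK" if server.startswith(EXPECTED_PREFIX) else "WRONG SITE"
    print(f"10.3.0.1 answered by {server}: {verdict}")
```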
[18:17:14] all this reminds me of T311618
[18:17:14] T311618: Use blackbox exporter for anycast monitoring - https://phabricator.wikimedia.org/T311618
[18:17:37] we are missing this critical piece I think and we should work on it, especially given the nature of the services now on anycast
[18:19:06] I will add it to Traffic's roadmap to see if we can take it in Q1 or Q2 of the coming FY
[18:21:03] sukhe: hahaha, I was thinking of that task while changing the last diaper of the day :)
[18:27:48] sukhe: change manually applied to cr2-eqsin, let me know if things look better for you, but it looks fine for me (ulsfo still prefers ulsfo, eqsin now prefers eqsin)
[18:31:15] XioNoX: looks good, but 10.3.0.1 still seems to be acting up, which is weird
[18:31:17] checking what's up there
[18:31:21] that's the internal rec still reaching ulsfo
[18:31:32] hcaptcha-proxy and wikimedia DNS look good though
[18:32:00] sukhe@bast5004:~$ dig @10.3.0.1 CHAOS TXT id.server. +short
[18:32:00] "dns4003"
[18:32:21] sukhe@bast5004:~$ dig +https +nsid wikipedia.org @wikimedia-dns.org | grep -i nsid
[18:32:26] ; NSID: 64 6f 68 35 30 30 31 ("doh5001")
[18:32:55] so why are we still going to ulsfo for 10.3.0.1? not sure
[18:35:09] yeah that's... weird
[18:36:21] same for all the dns prefixes, including 198.35.27.27, so that's not great
[18:36:58] there is one prepend still happening with the locally learned prefixes: AS path: 64605 64605 I, but right now I don't see where it's from
[18:38:43] sukhe: nevermind, found the issue...
[18:38:44] is there some config that is not in homer?
[18:38:47] /public
[18:39:09] I've just deleted `policy-options policy-statement anycast_import term anycast4 then as-path-expand last-as count 2`
[18:39:31] but that left `policy-options policy-statement anycast_import term anycast4 then as-path-expand last-as`, which by default prepends it once..
[18:39:58] the pending CR is going to do the right thing, it was my manual change that was a bit too short
[18:40:03] ah ok
[18:40:07] sukhe@bast5004:~$ dig @10.3.0.1 CHAOS TXT id.server. +short
[18:40:07] "dns5003"
[18:40:08] :)
[18:40:21] other stuff also looks good now
[18:41:08] sukhe@bast5004:~$ dig +nsid wikipedia.org @198.35.27.27 | grep -i nsid
[18:41:12] ; NSID: 64 6e 73 35 30 30 33 2d 61 75 74 68 ("dns5003-auth")
[18:41:13] also good
[18:41:55] but yeah, if you can prioritize https://phabricator.wikimedia.org/T311618 it would help reduce the risk of sub-optimal routing
[18:42:44] XioNoX: looking at it again, I mean, even if we can simply have a probe on the bastions/prometheus host that checks for basic stuff like an incorrect NSID being returned
[18:42:47] it is something
[18:43:01] it's certainly much better than what we are doing right now, which is finding this manually
[18:43:14] anyway, let's talk more about this, I have added it to my notes for Traffic
[18:44:20] cool
[18:54:16] sukhe: change deployed, noop where it was manually done already, so we should be all good now
[18:55:31] thank you <3
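The probe floated above could be as small as the id.server check run on a loop and exported as a Prometheus gauge for alerting to key on. A sketch under stated assumptions: the resolver list, expected prefixes, port, and interval are all placeholders, and this is one possible shape rather than what T311618 will actually implement.

```python
# Sketch of the bastion-side probe idea discussed above (cf. T311618):
# periodically check which site each anycast resolver answers from and
# expose the result as a Prometheus gauge. All names/values here are
# illustrative assumptions.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical: what a prober running in eqsin would expect to see.
RESOLVERS = {"10.3.0.1": "dns5", "10.3.0.10": "doh5"}

ROUTING_OK = Gauge(
    "anycast_routing_expected_site",
    "1 if the anycast address answered from the expected site, else 0",
    ["resolver"],
)

def id_server(resolver_ip: str) -> str:
    out = subprocess.run(
        ["dig", f"@{resolver_ip}", "CHAOS", "TXT", "id.server.", "+short"],
        capture_output=True, text=True, timeout=5, check=True,
    )
    return out.stdout.strip().strip('"')

if __name__ == "__main__":
    start_http_server(9099)  # arbitrary example port
    while True:
        for ip, prefix in RESOLVERS.items():
            try:
                ok = id_server(ip).startswith(prefix)
            except Exception:
                ok = False  # treat timeouts/errors as a failed check
            ROUTING_OK.labels(resolver=ip).set(1.0 if ok else 0.0)
        time.sleep(60)
```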