[07:56:06] morning
[08:15:00] greetings
[09:08:33] godog: we're soon going to be able to upgrade the switches for https://phabricator.wikimedia.org/T390813 ?
[09:10:29] XioNoX: yes, next wed we can totally do C3
[09:11:14] we haven't tested the other racks yet, though I'm not expecting many surprises
[09:11:37] the only class of servers we haven't tested for failover is cloudvirts, though we have an established procedure to drain those
[09:11:47] brb
[09:29:09] XioNoX: after wed then I'm happy to schedule the rest of the cloudsw upgrades
[09:29:21] cool!
[09:50:27] could someone double check that what I have in https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/304 matches the task, please?
[09:51:55] LGTM
[09:52:22] decided not to have 32G of RAM?
[09:52:48] I see the comment, ack
[09:52:53] see the last comment, I asked and they realized that they'd have no use for that
[09:53:14] 👍
[10:28:24] anyone have opinions on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1260072 ? (adding kubetail to toolforge bastion/workers/control/...)
[10:30:12] taavi: about the new flavor for CI, I talked about it with Peter from the test platform team this morning. My ask to raise the memory from 24G to 32G was indeed not backed by anything, thanks for flagging it
[10:30:46] we looked at the instances and they have a lot of free and cached memory. They don't seem to use much more, so we think the current 24G is enough
[10:30:54] I have replied on the task https://phabricator.wikimedia.org/T421242#11777574
[10:31:15] hashar: heh thanks for confirming, I will get the flavor out when I can get the laptop out in a moment
[10:31:27] awesome thank you!
[10:31:49] and I am quite happy to have rediscovered that the instances do not use that much memory after all, all because you asked the right question 🎉
[10:31:58] I am off for lunch
[11:03:29] hashar: your flavor is live
[11:19:23] quick review when anyone has a minute, improving the deployment MR feedback message https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1266954
[12:22:48] Can I get a quick review of https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/303 ? Adding a DNS record in tofu-infra
[12:23:47] is there an associated task to add to the comment?
[12:24:42] lgtm, it would be nice to have the task in the comment too though
[12:24:57] you mean https://phabricator.wikimedia.org/T421025?
[12:25:36] yep, kinda like what we do for the projects, though I see it's in the commit/MR already, so just a nit
[12:25:49] oh, you mean inline in the code
[12:25:52] Sure, I can add that
[12:26:27] yep sorry, I meant that :)
[12:29:51] git hates me today
[12:31:45] added, now waiting for CI
[12:33:52] thanks!
[12:34:13] taavi: awesome thank you very much!
[12:50:27] dcaro (or anyone): when you're at a stopping point, can you help me understand what's going wrong with my new magnum cluster? You can log in to pawsdev-bastion.pawsdev.codfw1dev.wikimedia.cloud and the config is /root/kubeconfig.yaml
[12:51:12] what have you tried+discovered already?
[12:51:23] (what is wrong too)
[12:51:47] Magnum says it's in state 'creating' and has been since Tuesday <- that's ultimately what is wrong
[12:52:02] 'get pods --all-namespaces' looks basically fine to me
[12:52:30] The orchestrator's complaint is 'Cluster Controller has not yet set OwnerRef'
[12:52:44] ack
[12:52:45] I take that to mean there's something wrong with inter-pod networking?
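
(Aside: with the Cluster API driver, an "OwnerRef not yet set" complaint usually shows up on the CAPI objects held by the management cluster, not the workload cluster itself. A minimal sketch of where one might look, assuming the stock capi/capo namespaces and that a management-cluster kubeconfig is reachable — the kubeconfig on the bastion may well point at the workload cluster instead, and the resource names below are placeholders, not the real ones from this log.)

```
# Sketch only: inspect the Cluster API objects behind a Magnum cluster stuck in 'creating'.
# Assumes default cluster-api / cluster-api-provider-openstack install locations.
kubectl get clusters,machines,openstackclusters -A         # CAPI objects and their phases
kubectl describe cluster <cluster-name> -n <namespace>     # conditions often explain the missing OwnerRef
kubectl -n capi-system logs deploy/capi-controller-manager --tail=100
kubectl -n capo-system logs deploy/capo-controller-manager --tail=100
```
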
[12:53:14] But I haven't actually found any test for what's wrong, outside of the capi engine being unhappy
[12:53:20] seems likely, or for some reason the controller is not working as expected
[12:53:40] (network might be the reason)
[12:55:23] clusters throw so many errors during startup, it's hard for me to tell which things in the logs are actually 'this is broken' and which are 'this was only broken until the other pods came up'
[12:56:19] being "eventually consistent" and such makes it "mostly broken all the time" until it's not xd
[12:56:27] yeah
[12:56:54] do you mind if I download k9s there for debugging?
[12:57:00] although the logs in kube-controller-manager-paws-etc seem to have just given up, it's not trying again and again
[12:57:04] nope, please do
[12:58:34] there's a debian package now :)
[12:58:55] I'll have to go to a meeting in 2 min, but might try to take a look (/me is curious about magnum)
[12:59:33] thanks -- I'd like to follow along as you dig, but you might be all out of workday in your day
[13:02:52] xd, I'll let you know if I find >5min in a row
[13:03:30] thanks!
[13:04:24] hi, for info I am spinning up 6 new VM instances in the `integration` project
[13:04:50] hashar: ack, it's $300
[13:04:59] :-P
[13:05:17] *per instance
[13:05:24] no worries, I'll happily file the form once you have created it in Coupa :-b
[13:05:28] that is per year?
[13:05:28] :b
[13:06:08] have we ever considered billing internally?
[13:07:27] at JOB-1 we did quotes internally, so the devs/program budget was partly consumed/transferred to the ops infra budget. I think they used that as an incentive to avoid consuming too many resources
[13:09:43] Ages ago I ran the openstack-standard billing/telemetry system out of curiosity, but it turned out to be extremely resource-intensive so we switched it off again. It would be sort of interesting to see those numbers, but not, I think, interesting enough to justify the effort
[13:09:56] definitely possible, though, if the org decides that they want to track cloud-usage expenses somehow
[13:12:59] I can't imagine how complicated the system could be :-]
[13:13:25] having worked with Radius to bill RTC/landline customers calling in, it was an interesting problem
[13:13:45] (where data loss directly affects revenue, fun times)
[13:14:07] anyway I am glad we have WMCS, that is very valuable
[13:21:04] It's certainly more fun for us to respond to a request with "sure, approved" than it would be to respond with a price quote
[13:45:06] dcaro, now I'm playing with k9s; this seems important:
[13:45:11] https://www.irccloud.com/pastebin/5rWFKXst/
[13:45:59] yep, that looks troubling :)
[13:46:37] surprised that it's using both IPv4 and IPv6
[13:49:17] yep, and at the same time, usually one is the fallback of the other (maybe I'm misunderstanding the logs)
[13:50:19] I am pretty sure I have seen this magnum version work in the past, so my next step is probably to just try again with various different template network settings. It would be nice to have a better idea of what's specifically happening, though...
[13:55:43] andrewbogott: is some k8s/docker/magnum internal thing re-using the 172.20.0.0/16 internal net by any chance?
[13:59:13] probably! let's see...
[13:59:58] https://www.irccloud.com/pastebin/f5GqVyRl/
[14:00:15] that's that last one, cloud-flat-codfw1dev, isn't it?
[14:00:48] oops, meeting time
[14:26:31] oh, I misunderstood the question. Hm...
[14:30:00] taavi, we had that problem earlier, didn't we? do you remember any specifics?
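
(Aside on the 172.20.0.0/16 question above: a quick way to see which pod/service ranges the new cluster actually picked. A sketch only, assuming a kubeadm-based cluster; none of these commands were run in the log and no specific values are implied.)

```
# Sketch only: check whether the cluster's pod/service CIDRs collide with the 172.20.0.0/16 cloud network.
kubectl -n kube-system get cm kubeadm-config -o yaml | grep -iE 'podSubnet|serviceSubnet'
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
kubectl get svc -A      # ClusterIPs reveal the service range in use
```
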
[14:32:18] not the specifics
[14:36:34] so far, in k9s I only see 10.x addresses and the three addresses that are the VM controller or workers
[14:37:10] which container has the errors?
[14:37:21] maybe we can attach to it and play with dig
[14:38:36] looking
[14:39:27] coredns
[14:40:03] wait, 10.22.183.6:35984->172.20.254.1:53
[14:40:09] what is 172.20.254.1 doing in there?
[14:40:43] xd
[14:41:52] the template sets the DNS server to 8.8.8.8 but it's using internal DNS anyway. not necessarily bad...
[14:41:54] but bad if it doesn't route
[14:43:30] it says timeout, not route failure, so that means someone is dropping it (could be bad routing along the path, but at least it finds a route itself); for IPv6 it does not find a route at all
[14:43:34] no shell in that container :(
[14:44:20] oh, so it might just be our recursor refusing that origination host
[14:44:25] but why is it using that recursor in the first place?
[14:45:07] magnum clusters should imho be using that recursor and not some public one
[14:45:18] wait, 254? in codfw1dev?
[14:46:35] that IP is ns-recursor.openstack.codfw1dev.wikimediacloud.org according to dig -x
[14:46:54] I'm hacking the recursor config to allow 10.x
[14:47:58] ah, no, I misremembered, 254 is indeed correct
[14:48:48] andrewbogott: there should be a layer of NAT involved there, 10.x addresses shouldn't (and can't) appear on the neutron and cloudgw layers
[14:49:08] so the recursor should never see that
[14:49:17] hm, right. So it should appear as a regular cloud-vps IP
[14:49:21] which should already be allowed
[14:49:33] I would expect the worker to NAT that to its own instance address, yes
[14:49:50] can the worker reach the recursor IP?
[14:50:45] I'll have to dig up the SSH key for that, give me a bit...
[14:55:58] no idea where (if anywhere) the private 'paws-magnum-vm' key is, I might have to rebuild with a known key to answer that
[14:56:06] should do that anyway
[14:56:25] so, dcaro, I'm probably going to rebuild that cluster unless you're still in the middle of looking
[14:56:54] feel free to nuke it :)
[17:40:13] * dcaro off
[17:40:20] cya after the break!
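
(Closing sketch for the planned rebuild around 14:55-14:56 and the still-open "can the worker reach the recursor IP?" question. The keypair, template, and cluster names are placeholders, the cluster-create flags assume a stock Magnum CLI, and the SSH login user depends on the image used.)

```
# Sketch only: rebuild the cluster with a keypair whose private half we actually hold
openstack keypair create --public-key ~/.ssh/id_ed25519.pub paws-magnum-debug
openstack coe cluster delete <old-cluster-name>
openstack coe cluster create \
    --cluster-template <template-name> \
    --keypair paws-magnum-debug \
    --master-count 1 --node-count 2 \
    <new-cluster-name>

# then, from a worker instance, test the recursor directly
ssh <login-user>@<worker-instance-ip>
dig @172.20.254.1 wikimedia.org +time=2 +tries=1
```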