[07:01:26] morning
[07:08:36] o/
[07:44:29] XioNoX: hey! I'm trying to deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/960027/, but there's an unrelated diff from https://gerrit.wikimedia.org/r/c/operations/homer/public/+/959732 on the eqiad and codfw core routers - is that safe to deploy too?
[07:45:40] taavi: eh, yeah I'm rolling it out, I see your diff too :)
[07:46:00] * dcaro be back in a bit
[07:46:02] pushing your change to cr1-codfw
[07:54:07] morning
[09:38:01] I think I found the problem with the build service on lima-kilo
[09:38:28] awesome
[09:39:03] but I have yet to confirm my theory
[09:39:29] at the moment it seems builds-builder fails to deploy via lima-kilo. Will wait until https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/81 is merged
[09:40:18] the proposed lima-kilo fix is this
[09:40:19] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/81
[09:41:00] that is working outside lima-kilo on my laptop
[09:42:21] * taavi lunch
[09:42:35] the thing is that some of the steps in the pipeline are generated like this
[09:42:37] https://www.irccloud.com/pastebin/hhydSFCn/
[09:42:52] note the last args array entry
[09:43:03] it uses the URI without specifying the protocol
[09:43:08] and apparently it defaults to https
[09:43:11] locally though I do have the protocol:
[09:43:14] https://www.irccloud.com/pastebin/iWakrtpQ/
[09:43:32] that arg is injected by builds-api
[09:43:32] so was it lost in the lima-kilo setup?
[09:43:41] no, it's the docker-0 annotation
[09:45:12] the docker-0 annotation is for the `-basic-docker=basic-user-pass` argument
[09:45:26] yes, but the docker-0 annotation is the one missing the protocol there
[09:45:28] no?
[09:45:35] that one should have been set
[09:45:35] v
[09:45:37] dockerconfig: '{"insecure-registries": ["http://{{ exec "./helpers/get_harbor_ip.sh" (list) }}"]}'
[09:45:48] (in the new helmfile based setup)
[09:46:09] I need to step out for a bit, doctor appointment
[09:46:13] I think we are on the right track here
[09:47:17] hmmm... we should not need that secret yaml at all, it should be handled by the deployment, not lima-kilo :/
[09:47:48] that's ok, we can fix that easily
[09:48:06] it was required before the helmfile migration
[09:48:19] * arturo back later
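For context, a minimal sketch of the failure mode being discussed: container tooling that is handed a bare registry address (no scheme) assumes HTTPS, so an HTTP-only Harbor needs either an explicit http:// prefix or an insecure-registries entry. The IP below is a hypothetical stand-in for the get_harbor_ip.sh output:

    HARBOR_IP=192.168.1.50   # hypothetical stand-in for ./helpers/get_harbor_ip.sh

    # HTTPS attempt (the default for a bare URI) fails against an HTTP-only registry:
    curl -sSf "https://${HARBOR_IP}/v2/" || echo "TLS handshake fails: registry is HTTP-only"

    # an explicit http:// prefix (as in the dockerconfig above) works:
    curl -sSf "http://${HARBOR_IP}/v2/" && echo "plain HTTP works"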
[09:50:36] this is happening again:
[09:50:36] 2023/09/25 09:39:32 [error] 23#23: *106220 upstream SSL certificate verify error: (10:certificate has expired) while SSL handshaking to upstream, client: 172.16.3.8, server: , request: "GET /envvars/v1/envvar/TOOL_TOOLSDB_USER HTTP/1.1", upstream: "https://10.111.78.72:8443/v1/envvar/TOOL_TOOLSDB_USER", host: "api.svc.tools.eqiad1.wikimedia.cloud:30003"
[09:50:49] I think our certs are not restarting the pods ;/
[09:52:35] oh, I think that the issue is that in lima-kilo harbor is on a non-standard port, so you need to specify both the host and the port, but the charts don't allow specifying the port
[09:52:43] (they just don't have the option)
[09:54:06] btw. I think that now with the vagrant setup, if it works for everyone, we can remove a lot of the stuff in there (cleanup/non-root handling/etc.)
[09:55:48] dcaro: is that envvars-api specific or happening somewhere else too?
[09:56:09] it happened with the builds-api
[09:56:13] too
[09:58:25] hmmm
[09:59:00] what's the difference between these two certs? https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/blob/main/deployment/chart/templates/certificate.yaml.tpl
[10:02:07] there's a bunch, anything specific?
[10:02:29] https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/blob/main/deployment/chart/templates/deployment.yaml.tpl#L8 only listens to changes for one of them
[10:06:20] yep, that's probably it, and copy-pasted from builds-api, which copied it from the jobs-api
[10:08:15] I think that one of them is not needed
[10:15:05] I think this should do it https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/49/diffs
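The "only listens to changes for one of them" remark refers to the common Helm pattern of hashing a rendered template into a pod annotation, so that any change to the certificate re-renders the annotation and forces a rollout; if only one of the two cert templates is hashed, renewals of the other never restart the pods. A minimal sketch of the pattern, with illustrative template/namespace/deployment names rather than the actual chart contents:

    # the deployment fragment that ties pod restarts to cert changes:
    cat <<'EOF'
    spec:
      template:
        metadata:
          annotations:
            # re-rendered (and thus rolls the pods) whenever the certificate template output changes
            checksum/certificate: {{ include (print $.Template.BasePath "/certificate.yaml.tpl") . | sha256sum }}
    EOF

    # immediate workaround until the chart change lands: roll the pods by hand
    kubectl -n envvars-api rollout restart deployment envvars-api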
[10:31:23] this might fix the harbor/builds setup on lima-kilo, testing locally: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/82
[11:26:51] dcaro: oh, cool! will test this now
[11:53:09] builds-builder won't run `./deploy.sh local` on a fresh cluster because of missing CRDs
[11:53:25] https://www.irccloud.com/pastebin/E4v1Pj88/
[11:54:15] that's interesting, the crds are set in a specific folder
[11:54:51] I think it may be related to helmfile calling helm via `upgrade` instead of `install`
[11:55:03] because per helm docs
[11:55:09] CRDs can't be upgraded, only installed
[11:55:10] https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations
[11:55:27] https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#method-1-let-helm-do-it-for-you
[11:57:06] we only do `helm upgrade` right?
[11:57:18] well that's helmfile crafting the call to helm
[11:57:31] so I guess we need to tell helmfile to call helm in a different way
[12:08:16] this is related v
[12:08:18] https://github.com/roboll/helmfile/issues/1353
[12:16:04] I think I got something working :/, not sure how nice it is
[12:21:33] share it! :-)
[12:23:19] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/11
[12:23:25] untested on lima-kilo yet
[12:24:19] I was working on something similar here locally
[12:25:13] dcaro: it should work just fine with lima-kilo
[12:25:43] testing
[12:25:45] ...
[12:25:55] I just tested, it works!
[12:27:16] nice
[12:27:48] starting vagrant from scratch is slower xd
[12:29:09] speaking of vagrant, unfortunately it looks like it doesn't have a very active freely licensed fork (like terraform now has) :/
[12:32:33] it's the same company right?
[12:32:51] hashicorp yep
[12:33:46] taavi: did they switch the licence for vagrant too?
[12:34:14] yep
[12:34:20] :-(
[12:34:36] MPL 2.0
[12:34:39] that may be an indication of the technology not being very actively used
[12:35:22] *from MPL to BSL
[12:35:29] I guess we could do minikube-within-docker or kind-within-docker for lima-kilo
[12:36:07] arturo: when you have a moment could you review this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/960162
[12:36:09] I like not having lima-kilo touch anything system-wise
[12:38:22] taavi: +1'd
[12:39:04] oh noo, does that mean we can no longer use vagrant? I thought the license thing was specific to terraform but it's all of hashicorp?
[12:40:22] you can probably keep using it. It's just no longer FLOSS
[12:52:39] dcaro:
[12:52:39] Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: CustomResourceDefinition "clustertasks.tekton.dev" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "builds-builder"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "builds-builder"
[13:04:02] it works for me, the issue on my side is that helm uninstall fails, and leaves the crds and a couple of namespaces hanging around, so install -> uninstall -> install does not work
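The "invalid ownership metadata" failure above is helm refusing to adopt a CRD left behind by a previous release. A sketch of the two usual ways out, using the resource and release names from the pasted error (assuming a throwaway lima-kilo cluster with kubectl access):

    # option 1: remove the leftover CRD so the next install recreates it
    # (note: this also deletes any remaining objects of that CRD's type)
    kubectl delete crd clustertasks.tekton.dev

    # option 2: add the ownership metadata helm checks for, so it adopts the CRD
    kubectl label crd clustertasks.tekton.dev app.kubernetes.io/managed-by=Helm --overwrite
    kubectl annotate crd clustertasks.tekton.dev \
        meta.helm.sh/release-name=builds-builder \
        meta.helm.sh/release-namespace=builds-builder --overwrite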
[13:38:54] topranks: do you have a few minutes to go over dns config with me again? It still seems broken in ways that I'm not clear on.
[13:41:08] I'm confused by why we seem to be getting notifications on both private /and/ public ips: https://phabricator.wikimedia.org/P52608
[13:41:15] current pool config is https://phabricator.wikimedia.org/P52609
[13:41:24] andrewbogott: sure, let me have a look
[13:41:48] I expected with that latest pool config for notifies to only happen on private ips
[13:41:54] arturo seemed to be of the opinion there was no way to get the correct settings in the db from the "pool.yaml" file
[13:42:16] the logs in your paste are logs of the updates being ACCEPTED.
[13:44:11] if you're talking about the 'master' record in the db, that's correct but I don't think it's related to my immediate question (I hope)
[13:44:27] dhinus: Raymond_Ndibe sent this patch to disable pages from icinga, https://gerrit.wikimedia.org/r/c/operations/puppet/+/960622
[13:44:30] Can you tell me more about what you're thinking re: 'accepted'?
[13:45:04] When it was rejecting transfers it was logging these:
[13:45:05] Sep 14 16:51:27 cloudservices1005 pdns_server[2067494]: Received NOTIFY for 16.172.in-addr.arpa from 10.64.151.4:53684 which is not a master (Refused)
[13:45:07] the "temporary" ns0 ip assignment to cloudservices1006 was never cleaned up, that might be messing up some logs or similar
[13:45:13] I guess we only need pages from alertmanager, we can do a check though to make sure we have what we need there
[13:45:50] topranks: ok -- for the moment I'm not worried about what's accepted or rejected, just literally the IPs that it's getting notifies from
[13:45:56] andrewbogott: I just mean the logs saying "queueing check" in your paste are just logs that it got the message, and it's gonna act on it
[13:46:00] they're not errors
[13:46:05] Yep, agreed.
[13:46:16] but WHY is it getting incoming notifications from 185.* at all?
[13:46:58] From my previous observation (note: observation, I don't understand the logic) it will get updates from the IPs listed under "eventual setup" in this task:
[13:46:59] https://phabricator.wikimedia.org/T346385
[13:47:24] it's using its public loopback to send updates to the local system, and a 172.20.x IP to reach the remote one
[13:48:06] ok, so you'd expect each cloudservices node to get updates from one public and one private IP, yes?
[13:48:45] yeah, from its own public and the remote private
[13:49:01] but you can see from my paste that that's also not what's happening. 1006 is getting updates from both public IPs.
[13:49:06] dcaro: thanks, does that mean we will lose some pages we currently have when a physical host goes down?
[13:49:38] and a private one
[13:50:26] is that from taavi's comment above? Is 1006 using both public IPs somehow?
[13:50:33] not sure of the logic why it picks the IP
[13:51:00] but the root cause is cloudservices1006 still has 185.15.56.162 configured on its loopback interface
[13:51:22] the behaviour is the same, it's using its own public loopback to update itself, and alternating between the two configured
[13:51:22] ok -- let's fix that and see if everything clears up :)
[13:51:39] 185.15.56.162 should be removed though, it belongs to cloudservices1005
[13:52:07] What's the cleanest way to remove that IP?
[13:52:12] sudo ip addr del 185.15.56.162/32 dev lo
[13:52:27] (btw, I should explain that I'm not just hunting log ghosts, there is an actual problem with new records getting updated in a timely manner)
[13:52:53] looks like the conf files in /etc/network/ are ok, the IP just wasn't removed when they were updated
[13:53:05] ok... well either way good we caught this!
[13:53:33] ok, removed -- now I need to update master records and let's see if things work again
[13:54:03] andrewbogott: the only confusing piece is I thought the approach in https://phabricator.wikimedia.org/T346385#9182344 would allow that "wrong" IP to work
[13:54:20] (i.e. all 4 IPs were added to the db on both systems I thought)
[13:55:17] IIRC we can only add the IPs that are actually getting updates, otherwise pdns starts to regard things as stale. I'm not positive about that but I saw bad behavior around that last week. so I'd like to try to make the master records correct...
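For a PowerDNS secondary with a database backend, the IPs it will accept NOTIFYs and transfers from for a zone live in the `master` column of the `domains` table, which is presumably what "update master records" means here (the "which is not a master (Refused)" log above is what a non-listed sender gets). A hedged sketch, assuming the gmysql backend; the database name and the exact IP list are illustrative:

    # inspect which masters pdns currently trusts for the reverse zone
    sudo mysql pdns -e "SELECT name, type, master FROM domains WHERE name='16.172.in-addr.arpa';"

    # set the masters to exactly the IPs the NOTIFYs actually come from
    # (illustrative values: the local public loopback and the remote private IP)
    sudo mysql pdns -e "UPDATE domains SET master='185.15.56.162,172.20.5.4' WHERE name='16.172.in-addr.arpa';"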
[13:59:09] dhinus: yes, we would lose those pages, though I think that as long as any of the critical services are still up, we don't care so much about getting paged right away
[13:59:40] we should give it a thought though, to make sure we have the alerts we want
[13:59:49] yeah I was also thinking we have pages for ceph, openstack, etc... but I'm worried about forgetting something :)
[13:59:50] *minimal paging alerts
[14:00:11] didn't we have some paging alerts for hosts down in the operations/alerts.git repo?
[14:02:55] lol this wasn't helping https://gerrit.wikimedia.org/r/c/operations/puppet/+/960624
[14:06:01] topranks: thanks for your help; I'm starting a meeting now but will watch the logs and see if things are any better
[14:10:17] at least my new DNS monitoring system worked :D https://prometheus.wmcloud.org/graph?g0.expr=probe_success{instance%3D"ns1.openstack.eqiad1.wikimediacloud.org%3A53"}&g0.tab=0 now I just need to move the alerts for that from icinga to alertmanager
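Moving that check from icinga to alertmanager means expressing it as a Prometheus alerting rule on the same probe_success series shown in the graph. A minimal sketch; the group/alert names, the 5m threshold, and the severity label are illustrative, not the eventual operations/alerts.git change:

    cat <<'EOF' > dns_probe_alerts.yaml
    groups:
      - name: cloud-dns
        rules:
          - alert: CloudDnsProbeFailed
            expr: probe_success{instance="ns1.openstack.eqiad1.wikimediacloud.org:53"} == 0
            for: 5m
            labels:
              severity: page
            annotations:
              summary: "DNS probe failing for {{ $labels.instance }}"
    EOF
    promtool check rules dns_probe_alerts.yaml   # validate before deploying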
[15:00:22] topranks: I'm still not loving the pdns logs but my test case (nova-fullstack) is working now so I think we're good, or close to it. Thank you!
[15:00:41] arturo: looks like cloudcontrol1007 was moved in the dc. so what do I need to change in netbox to be able to reimage the host into the new setup? https://netbox.wikimedia.org/dcim/devices/4290/interfaces/
[15:00:55] taavi: let me see
[15:01:14] I imagine I need to change the cable to use the new switch and port, configure VLANs on the switch port and allocate IP addresses. is that correct, am I missing something?
[15:01:41] andrewbogott: what about the logs isn't to your taste?
[15:01:48] taavi: let me have a look
[15:01:55] taavi: I see the NIC port shows as connected to asw2. That's not desirable. May mean the DC work is still ongoing
[15:02:08] we need to know the new switch port in cloudsw before we can move forward
[15:02:13] that's in the task
[15:02:20] something is odd about this setup
[15:02:26] e4 switch port #9
[15:02:39] https://phabricator.wikimedia.org/T346892#9195510
[15:03:26] My guess is we are at the point in the process before we "delete existing interfaces apart from mgmt"
[15:03:27] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs
[15:04:45] Once we're at the "Verify that the device now has 2 interfaces in Netbox, mgmt and ## PRIMARY ##" step we can add the additional vlan interface / change the switch side
[15:05:30] yeah
[15:05:31] I believe yes - should I try to follow those steps?
[15:05:48] taavi: yes, just don't click the top right DELETE button
[15:05:52] yep fire away, just ping me if you've any q's
[15:05:58] uh huh - golden netbox rule :P
[15:06:00] because that will delete the whole server from netbox taavi
[15:06:11] does the cable id stay the same?
[15:06:48] The cable ID needs to be re-entered, DC ops may have listed it in the task
[15:07:10] not impossibly it's the same, but they'd need to provide it either way
[15:07:24] ok, let me ask
[15:07:35] ok, otherwise we can use a dummy ID for now to proceed
[15:11:47] which format does the "switch interface" field use? do I just put "9" in there?
[15:11:56] xe-0/0/9
[15:13:37] https://netbox.wikimedia.org/extras/scripts/results/5038853/ this looks good to me
[15:14:02] yep looks good!
[15:14:11] running the dns cookbook now
[15:14:31] now the manual bit is to adjust the switch port setup: https://netbox.wikimedia.org/dcim/interfaces/31474/
[15:15:11] we need to click 'edit' on that page, then at the bottom change the vlan setup
[15:15:13] from this:
[15:15:14] https://usercontent.irccloud-cdn.com/file/fqgYPMzf/image.png
[15:15:15] adding cloud-private with a tag?
[15:15:53] to this:
[15:15:55] https://usercontent.irccloud-cdn.com/file/0H5Ualln/image.png
[15:17:40] We also need to assign it an IP in the cloud-private-e4-eqiad (1153) vlan
[15:17:57] let me do that - I think you may need to put the IP in puppet somewhere?
[15:18:23] the puppetization queries that via DNS now
[15:20:02] ah ok cool
[15:20:17] I've added it, just gonna run the dns cookbook to add it so
[15:20:52] the instructions now say to run sre.network.configure-switch-interfaces, does that still apply with the cloud-private workflow?
[15:21:14] taavi: yes
[15:21:42] it will configure the vlans on the switch port
[15:22:43] ok, done
[15:24:09] from gitlab CI
[15:24:11] #17 ERROR: failed to push toolsbeta-harbor.wmcloud.org/toolforge/builds-api:dev-image-0.0.97-mr-47: failed to copy: unexpected status: 500 Internal Server Error
[15:26:52] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/960642/
[15:27:08] 👀
[15:27:59] taavi: I think you can assign the new role directly
[15:28:07] I get NS_ERRORs
[15:28:07] in manifests/site.pp
[15:28:08] https://usercontent.irccloud-cdn.com/file/D8TzZo9e/image.png
[15:28:37] when getting to toolsbeta-harbor
[15:28:41] arturo: done
[15:29:13] same for tools, is that related to the cloudservices reimage?
[15:29:29] dcaro: should not be related
[15:29:31] I'm reimaging a cloudcontrol, not a cloudservices
[15:29:34] andrewbogott: ^ for the DNS issue
[15:30:18] hm, I'm going to join our checkin meeting and you can explain
[15:30:19] so even less?
[15:30:22] taavi: mind that the reimage may take a while, including most likely manual steps. Do you want to do it today, or wait until tomorrow?
[15:30:35] uhh good call, let's do that first thing tomorrow morning
[15:30:49] ack
[15:45:50] * arturo offline
[16:31:31] * dhinus off
[16:38:18] dhinus: I updated the schema on the wiki and applied it on 2004-dev, seems ok so far
[20:31:04] Any views on https://phabricator.wikimedia.org/T347150 ?