[07:11:51] morning. maintain-dbusers grants have fully been sorted out. patch for taking cloudcontrol1006 out of service for the move is https://gerrit.wikimedia.org/r/c/operations/puppet/+/961442
[07:29:32] taavi: it's going to be added back again as cloudcontrol1006 right?
[07:29:53] yes, it'll be cloudcontrol1006.eqiad.wmnet once it's been moved
[07:32:20] is the IP changing? (I see 208.80.154.149 mentioned in the wikireplica grants)
[07:32:41] can be changed later though
[07:33:04] yes. but I won't have the new IP before the server has been moved, so I can't update that yet :P
[07:38:15] can you run a pcc on a few hosts though? (ceph, cloudcontrol, ...)
[07:38:21] sure
[07:43:27] pcc looks good, merging
[07:54:28] 👍
[08:01:10] arturo: hey, I was reading something about the OpenStack networking options
[08:01:16] is this the kind of thing you had in mind?
[08:01:17] https://docs.openstack.org/neutron/latest/admin/deploy-ovs-selfservice.html
[08:13:26] topranks: yes. It is called 'tenant networks'
[08:13:47] meaning each project (tenant) can create their own networks within openstack
[08:14:21] maybe this? https://docs.openstack.org/liberty/networking-guide/scenario-classic-ovs.html
[08:15:23] ok yeah that was kind of my follow-up question - the aim is to make use of an overlay to provide those customizable separate networks, manageable from openstack?
[08:15:51] this kind of thing integrates with the physical network better:
[08:15:52] https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
[08:16:38] The virtualized approach gives much more flexibility, at the cost of efficiency (both in terms of traffic paths across the physical network and packet flow within hosts)
[08:17:13] I think that's the key trade-off to weigh up in terms of selecting one or the other
[08:17:49] No issue on my behalf with either approach, just best to make sure we select the one that best fits our needs
[08:19:52] you all will need to figure out what you want :-)
[08:20:58] heh... well I'm trying to work out why someone would do all these crazy things :P
[08:21:00] https://usercontent.irccloud-cdn.com/file/pNuQXDJ6/image.png
[08:21:09] :D
[08:21:18] I have a feeling I have tons of reading I need to do about all of this
[08:22:01] yeah we can work it out. Ultimately I'd stick with "keep it simple" unless there's a good reason not to.
[08:22:43] customizable per-user or per-group virtual networks, like how in AWS you set up your own VPC etc., would be the reason to go down the OpenStack/Tenant route
[08:23:07] if we don't anticipate needing such functionality there is a good case for some of the other options
[08:23:29] I don't think, in the WMCS deploy, we're at a scale where we're forced to choose one or the other - we have the choice
[08:57:07] I'd appreciate a quick review of this email before I send it to ops@ https://etherpad.wikimedia.org/p/T340241-new-bastion
[08:58:06] arturo: it looks like you can edit this page https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastion.wmcloud.org but I cannot, does anyone else in the team have the required permissions?
[09:06:06] I guess we need to put you in some special wikitech group
[09:08:30] I think you need to be in either the 'Content administrators' or 'Administrators' group. I can edit it myself but don't have the permissions to give you the rights
[09:15:29] maybe we need to review wikitech permissions for everyone currently in the team
[09:27:19] arturo: I'll create a task
[09:28:28] anyone familiar with the openstack LDAP setup? yesterday a.ndrew and I tried to figure out what's happening in codfw but we couldn't really understand it. I summarized my understanding in T347555
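(For context on the 'tenant networks' option discussed around 08:13: in the self-service model each project can carve out its own overlay network with plain neutron/OSC commands, roughly as in the sketch below. The names, CIDR and external network here are made up for illustration, not anything configured in WMCS.)

```
# illustrative only: a project creating its own self-service network;
# "demo-net", the CIDR and "external-net" are placeholder names
openstack network create demo-net
openstack subnet create demo-subnet --network demo-net --subnet-range 192.168.10.0/24
openstack router create demo-router
openstack router set demo-router --external-gateway external-net
openstack router add subnet demo-router demo-subnet
```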
[09:28:29] T347555: [openstack] LDAP is broken in codfw - https://phabricator.wikimedia.org/T347555
[09:29:01] dhinus: was cloudservices2005-dev reimaged yet?
[09:29:05] not yet
[09:29:08] dhinus: I could take a look at LDAP later
[09:29:11] ok, good
[09:29:13] but it looks like it was broken from before the reimage
[09:29:15] looking at the logs
[09:29:27] (maybe slightly LESS broken LOL)
[09:29:44] I think you want to have 2004-dev replicate the current database from 2005-dev
[09:30:03] yes, but replication is not working :/
[09:30:04] it could be! I think I was the last person who did something in the codfw1dev ldap, did some modifications, such as mirror mode etc
[09:30:15] but I think I left it in a working state!
[09:30:28] arturo: if you can have a look later that's great, I have plenty of other tasks to work on in the meantime :)
[09:37:24] yeah, I fear I have too many open topics at the moment :-(
[09:37:32] same :D
[10:07:03] taavi: blancadesal you should have received an invite for a builds-cli project on pypi, for the rest, I did not find your usernames xd, we should have them somewhere
[10:07:56] I requested a pypi organization for toolforge a while ago, but it's still pending approval it seems. that would let us maintain pypi package access in one place
[10:08:43] that'd be nice yes
[10:24:46] dcaro: invite received!
[11:02:15] arturo, taavi: realised I'm at SREcon on Oct 12th, can I move our meeting to Mon 11th?
[11:03:17] no issues from my side
[11:05:45] topranks: I'm confused. I see a calendar event for Oct 10th (Tue)
[11:06:11] Sorry, brain fart, yeah Oct 10th, and suggest moving to Oct 9th.
[11:06:24] yeah, works for me
[11:06:28] Was gonna suggest 11:00 CEST?
[11:06:40] ok
[11:54:38] dhinus: can puppet be enabled on cloudservices-dev hosts?
[12:10:50] arturo: I think a.ndrew disabled it yesterday to test something, but I don't remember what :)
[12:11:11] I think it's fine to re-enable it, it was just a test anyway
[12:11:33] dhinus: this patch may solve the TLS problems: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961780/
[12:12:12] let's give it a go!
[12:12:24] ok
[12:14:06] fyi. taking one osd down on codfw brings the cluster to a halt due to lack of space
[12:14:52] :/
[12:15:04] maybe we can free up some space?
[12:15:24] might be, yes
[12:15:48] dhinus: now some ldap schema problems
[12:15:50] https://www.irccloud.com/pastebin/eh0X71IB/
[12:16:14] :/ I think a.ndrew found there is some difference in the new version of slapd
[12:16:28] mmm wait, they are different OS versions?
[12:16:33] yes :/
[12:16:43] 2004 has been reimaged to bookworm, 2005 not yet
[12:16:55] because I think if we reimage both we'll lose the data
[12:17:05] maybe we could backup and restore
[12:17:12] I think you need something similar to https://gerrit.wikimedia.org/r/c/operations/puppet/+/961066
[12:17:48] nice one taavi -- I wonder if it will work if one server is still on the old version though?
[12:18:37] looks like m.oritz is also looking into it (see the last comment in the patch)
[12:19:47] that patch seems to already be on the radar of both moritz and andrew, so I'll step back and let them work on that
[12:24:36] sounds good
[12:24:54] arturo: can I get your help with updating https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/restricted.bastion.wmcloud.org?
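(Side note on the fingerprint page: the values on the etherpad can be regenerated and cross-checked with standard OpenSSH tooling, roughly as below. This is a generic sketch, not the exact procedure used here.)

```
# fetch the new bastion's host keys and print the fingerprints for the wiki page
ssh-keyscan restricted.bastion.wmcloud.org > /tmp/bastion-keys 2>/dev/null
ssh-keygen -lf /tmp/bastion-keys          # SHA256 fingerprints
ssh-keygen -E md5 -lf /tmp/bastion-keys   # MD5 form, if the page lists both
```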
[12:25:01] yeah
[12:25:04] I think I'm ready to switch the IP to the new bastion
[12:25:11] the new fingerprints are here https://etherpad.wikimedia.org/p/T340241-new-fingerprints
[12:25:22] they just need to be copied to the wiki, but wait until I switch the IP
[12:26:48] let me know when you're ready to save the new page, and I'll try to associate the floating IP exactly at the same time :)
[12:27:09] *the new version of the page
[12:28:20] dhinus: done
[12:28:33] oops
[12:28:39] didn't read your message :-P
[12:28:39] that's fine, changing the IP now
[12:28:48] taavi: building debs seems not to work when they need the backports/toolforge repos: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/jobs/146154
[12:28:54] any ideas?
[12:29:04] the script to build locally using docker uses a different command
[12:29:05] IP changed, can you try connecting to a cloudvps vm?
[12:30:05] I'll skip it for now and create tasks for it
[12:30:57] dhinus: works as expected
[12:31:01] nice!
[12:31:11] dcaro: I believe the .gitlab-ci file should use 'ENABLE_TOOLFORGE_REPO' instead of 'ENABLE_TOOLS_REPO'
[12:31:33] oh nice, though it seems it's not installing other packages either (dh-python)
[12:33:12] I think the error message is just confusing. `python3-toolforge-weld` (which is in the toolforge repo) is the only one it says it doesn't know how to install
[12:34:04] btw did you also notice the CI pipeline to publish packages to pypi automatically?
[12:35:31] yes, that's ok, only on tag right?
[12:36:09] yeah https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/py3.9-bullseye-tox-pypi-debian/gitlab-ci.yaml#L39
[12:36:21] it's nice :)
[12:36:22] * taavi afk, be back later
[12:36:41] it failed, similar error, but without the 'and it's not found on any repo' message
[13:01:44] dhinus: connected also correctly, had to delete the key from my hosts file as expected :)
[13:03:35] nice one
[13:08:01] the upgrade to Bookworm (and to the latest OpenSSH) did help with the Cumin speed when going through the bastion, but not as much as I hoped, details at T340241
[13:08:03] T340241: cloudcumin is slow when targeting many cloud VMs - https://phabricator.wikimedia.org/T340241
[13:22:24] do you know what's the bottleneck?
[13:22:35] still spawning the extra python process for each connection?
[13:41:56] okok, so got something while building builds-cli, it fails because it depends on poetry>=1.4, which comes from the toolforge repo, but even if you add a pin on the repo (sources.list.d/toolforge.list), it seems build-dep does not care, and tries to pull it from backports (which by default has priority 100, while toolforge gets 500)
[13:58:06] anyway, that's why this is in the current local deb build script xd
[13:58:08] https://www.irccloud.com/pastebin/ULNtG03F/
[13:58:17] I guess I was not able to sort that out before either
[14:03:15] wait no, it's completely ignoring the other repo :/
[14:05:59] dhinus: quick ldap update: there seems to be a backup, but until a few minutes ago I didn't know how to restore it (apparently when you reimage a backed-up host it becomes harder to restore). So I'll try to restore again but it won't be until this afternoon that I have a chance.
[14:06:08] I can point you to the process if you want to try before then
[14:08:18] ooohh, it's the `--target-release bullseye-backports` that prevents the toolforge repo from being used
[14:09:41] andrewbogott: doesn't cloudservices2005-dev have a replica of the data that we can use?
[14:10:27] taavi: unclear. Syncing had been broken for ages.
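(On the syncing question: a common way to tell whether two OpenLDAP providers have converged is to compare their contextCSN values, e.g. as in the sketch below. The host names and base DN are assumptions for illustration, not taken from the task.)

```
# compare the replication cookies on the two codfw1dev LDAP servers;
# matching contextCSN values mean the databases are in sync.
# host names and suffix are assumptions, adjust to the real setup.
for host in cloudservices2004-dev.wikimedia.org cloudservices2005-dev.wikimedia.org; do
    echo "== ${host}"
    ldapsearch -x -LLL -H "ldap://${host}" -s base -b 'dc=wikimedia,dc=org' contextCSN
done
```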
[14:10:36] ah
[14:10:40] But also we couldn't make syncing work, which maybe arturo fixed
[14:10:41] Hmm... I think we don't need that, but we might want to pin the backports at 500, same as the rest of the sources, so they are preferred over the regular repos if there's a newer version
[14:10:59] */me talking about --target-release, sorry for the double conversation
[14:11:10] andrewbogott: I think there are several problems, yes. I think I fixed the TLS one. I'm glad I introduced backups in the last round
[14:11:22] me too!!
[14:12:13] arturo: happen to have an idea about when syncing broke? If we only lost a few weeks then I maybe don't care, it's not like we create a lot of users there.
[14:12:46] I would guess in the latest round of renames/reracks the cluster never got to sync for real
[14:12:52] that was... months ago?
[14:13:23] I doubt any user was created meanwhile, but it could be
[14:15:12] taavi: I think the last haproxy puppet cleanup broke puppet for cloudlb @ codfw
[14:15:24] let me have a look
[14:15:25] https://www.irccloud.com/pastebin/KZGPlAPF/
[14:16:01] that looks like a hiera issue
[14:16:14] (empty string somewhere)
[14:17:39] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961815/
[14:17:41] * arturo nods
[14:18:48] taavi: +1'd
[14:18:58] I wonder why the compiler didn't complain about the missing hiera key instead
[14:19:25] andrewbogott: it's fine if you try the LDAP restore later today, I'm working on a couple other things now
[14:19:55] arturo: fixed
[14:29:55] quick review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/18 (ci packaging fix)
[14:33:04] hmm
[14:33:18] so that would download stuff from backports every time there's something newer?
[14:34:37] yep
[14:35:31] approved
[14:35:46] in this case it was:
[14:35:48] https://www.irccloud.com/pastebin/WJctcmN5/
[14:37:16] I need to rebuild the image xd
[14:41:38] hmpf... as we use :latest the image is cached xd
[14:54:54] Raymond_Ndibe: there are a couple of big changes of yours undeployed on https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/10
[14:55:24] can you verify that they work?
[14:55:47] or rather, are you around to help me verify that they work?
[15:04:51] hello, I had an instance failing to reattach a volume and ended up recreating it **with the same hostname**
[15:05:16] my issue is the host dns entry now has two `A` records:
[15:05:19] integration-castor05.integration.eqiad1.wikimedia.cloud. 55 IN A 172.16.1.98
[15:05:19] integration-castor05.integration.eqiad1.wikimedia.cloud. 55 IN A 172.16.0.78
[15:05:45] hashar: are you willing to recreate it yet again? If so you can delete, I can clean up leaked records, and then you can recreate.
[15:05:46] the correct one is 172.16.0.78
[15:06:15] ^^ is something I know how to do; selectively editing the record will require me to re-learn how :)
[15:06:15] ah
[15:06:19] yeah I can recreate it
[15:06:27] ok. lmk when it's deleted and I'll run a cleanup job
[15:07:02] I have once AGAIN tried to outsmart openstack and it failed :D
[15:07:20] PSA: the restricted bastion will be down for a few seconds as I increase the number of CPUs
[15:08:12] andrewbogott: I have delete the instance :)
[15:08:18] deleteD
[15:08:22] ok. this will take a couple of minutes...
[15:09:02] I could have used another hostname, but I don't remember what kind of havoc it can cause :)
[15:09:43] you just had bad luck, it /should/ have cleaned up.
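(Going back to the backports pinning idea from 14:10: raising bullseye-backports from its default priority of 100 to 500, so build-dep can pick newer packages from it without `--target-release`, would look roughly like this. A sketch of the idea only, not the change that actually got merged.)

```
# give bullseye-backports the same priority (500) as the regular sources,
# so apt and build-dep will consider it without --target-release
cat > /etc/apt/preferences.d/backports.pref <<'EOF'
Package: *
Pin: release n=bullseye-backports
Pin-Priority: 500
EOF
apt-get update
apt-get build-dep -y .   # no --target-release needed with the pin in place
```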
[15:10:13] the instance was broken because it could not attach a volume: Invalid input received: Invalid volume: Volume 3f90c3f2-158d-4e45-a919-0f048f47c3b6 status must be available or downloading to reserve, but the current status is attaching. (HTTP 400) (Request-ID: req-ddd07558-b6b7-4ec6-8258-c4e5efb83a07)
[15:10:36] and I guess since it was in a poor state, when I deleted it the dns entry did not get deleted for whatever reason
[15:10:45] I no longer see the DNS entry, thank you andrewbogott!
[15:10:57] yep, should be all set (barring local cache)
[15:13:34] on a different topic, when a volume is attached to an instance, do we have a puppet class/define to have it mounted in the instance?
[15:15:14] no, but the prepare-cinder-volume script should create an fstab entry
[15:16:23] yeah last time I followed the guide at https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder:_Attachable_Block_Storage_for_Cloud_VPS
[15:16:33] Raymond_Ndibe: hmm, I think it depends on https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/36, before it would just work, now it errors out with 'no matching operation was found'
[15:16:40] but the instance is new, so the fstab entry is no longer present
[15:16:46] I guess I can add it manually
[15:17:17] doesn't profile::labs::cindermount::srv do exactly that?
[15:17:42] oh `cinder`
[15:20:17] I will give it a try
[15:24:52] taavi: that worked like a charm https://phabricator.wikimedia.org/T304080#9207447 !
[15:29:54] so this is really funny: "cinder mount" is the name of a pizza place near my house...
[15:30:15] hahahah
[15:30:25] which in turn is a pun on the name of the street where it's located (monteceneri, which literally translated to "cinder mount")
[15:30:29] a bit too crunchy pizzas xd
[15:30:43] *translates
[15:31:20] https://10619-2.s.cdn12.com/rests/original/801_19661660.jpg
[15:32:07] nice
[15:43:46] taavi: andrewbogott: I have updated the doc at https://wikitech.wikimedia.org/w/index.php?title=Help:Adding_Disk_Space_to_Cloud_VPS_instances&diff=prev&oldid=2115896 :)
[15:44:40] thanks hashar
[15:46:38] let me word it a bit differently, since I think we want folks not familiar with puppet to use `wmcs-prepare-cinder-volume`
[15:51:09] taavi: I originally wrote prepare-cinder-volume to work with all stages (e.g. it would mount an already formatted volume) but last time I used it it reformatted an already-formatted volume. So I no longer really know what it does :( I guess we need to recheck to make sure the docs are still accurate.
[15:51:43] yes, I also remember some issues with /etc/fstab (that maybe were not caused by prepare-cinder-volume, but I want to double check)
[15:51:45] and the Puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/961844/
[15:51:55] so my instance now only has `role::ci::castor::server` :)
[15:55:26] completely unrelated, I thought I had added disk IO metrics to https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board
[15:55:40] or maybe I only dreamed about it :)
[15:55:49] * arturo offline
[15:58:10] hmm, I have added node_filesystem_files_free
[16:22:26] and I have added the # of disk IOs in progress + the read/write throughput (hopefully with the right unit)
[17:06:05] rook, you mentioned spinning up environments for each patch as part of the discussion on Catalyst. I'm curious if you see the cluster control plane as "longer"-lived in that scenario? It would be one potential way to lower the deployment time if each patch was within that same cluster. Similarly, I presume you wouldn't expect to spawn new cloud vps projects each time, but rather new k8s clusters within some predefined projects, yes?
[17:06:56] ^ as in one project per team?
[17:08:45] In my mind it would be a new k8s cluster within the same project (maybe a few different projects, one per team or the like). At which point the k8s cluster could stay around as long as is appropriate. There are hundreds of patches out at any one time, but suggesting that every one of them needs a running example at a moment's notice seems unnecessary. Testing should be fire and forget, so the time to spin up the cluster is immaterial in my view. And if you need to go track it down afterwards, you can deploy another with a tag that instructs it to be reserved for some amount of time, maybe 24 hours.
[17:08:56] re: dcaro. I'm more inclined to say one project per project (that is, per code repo / set of code repos?). But yes, "team"
[17:09:44] ack
[17:09:47] All of the teardown requirements are just based off of quota and how much hardware we have. If the quota offered is deemed sufficient for all of the patches to have a running instance, then they will. If not, there are easy ways to reduce the overall load
[17:10:20] I can see each environment only living for 24 hours as a general rule, that would likely meet most use cases. The error should be repeatable, so it's simply a click away from spawning again
[17:10:24] is there a way to speed up the cluster creation? (pre-built images or something)
[17:10:56] I have a feeling that a cluster-per-environment gets too expensive and slow very quickly, and doesn't have that many advantages over namespace-per-environment in this specific case
[17:11:20] well, I guess that you will not have to maintain a live cluster
[17:11:24] (upgrades and such)
[17:11:38] ^^ this is actually a bigger plus than it seems I think
[17:11:40] and I have added a dashboard showing failing Puppet agents https://grafana.wmcloud.org/d/SQM7MJZSz/cloud-vps-puppet-agents?orgId=1
[17:12:05] hashar: nice! 53 years xd
[17:12:10] (unrelated to the ongoing conversation but that is a follow-up to another dashboard tweak I did an hour+ ago)
[17:12:21] The faster environment creation is (including spawning a k8s cluster), the less need there is to retain anything long-term
[17:12:24] Yes, if you're always recreating you don't have to worry about upgrades
[17:12:37] yeah I don't know why it is 53 years, I am guessing the last run is 0 seconds in UNIX epoch, so... 53 years, i.e. Puppet never ran?
[17:13:13] the tldr is that Puppet runs pretty much everywhere
[17:13:28] yes, I'm kinda happily surprised too :)
[17:13:29] that is all for today!
[17:13:34] It is possible to improve the build time somewhat. I get magnum deploys between 10-30 minutes. If we really wanted to maintain our own local repo this would probably speed up, I assume most of the build time is pulling down images from beyond our realm
[17:13:42] andrewbogott: taavi: thank you very much for your assistance earlier today, that kind of saved my day :-]
[17:14:14] Rook: that might be fairly easy to do
[17:14:29] hashar: glad to hear it
[17:14:48] we could try a POC to see how much it helps
[17:15:38] how many resources does magnum need for a cluster at a minimum?
[17:15:47] dcaro: yes, but it would incur a maintenance burden of always keeping the images up to date.
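(For reference, spinning up one of these throwaway Magnum clusters from the CLI looks roughly like the sketch below; the template and cluster names are placeholders, and the 10-30 minute creation time is what Rook reports above, not a guarantee.)

```
# create a minimal single-master, single-worker cluster for one patch,
# then tear it down once the tests have run; names are placeholders
openstack coe cluster create patch-123456 \
    --cluster-template catalyst-k8s-template \
    --master-count 1 --node-count 1
openstack coe cluster show patch-123456    # wait for CREATE_COMPLETE
# ... run the test suite against the new cluster ...
openstack coe cluster delete patch-123456
```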
[17:16:11] taavi: I think the smallest I've tried is 4 cpu, 2 for the control node, and 2 for the worker
[17:16:15] might work with 1 and 1 though
[17:16:44] I think harbor is able to cache images, so it would be slow the first time, but the second time it would already be there
[17:17:15] as in acting as a proxy of sort
[17:17:18] *sorts
[17:17:44] I guess if we cached everything in a remote repo, we might end up caching a bunch of stuff we don't ever use. And then there is licensing: are we allowed to cache things with a mess of random code? I asked but never got a clear answer
[17:18:17] as long as it's for local usage and not exposed to the world that's ok, it should cache only things that have been requested, not the whole repo (not a mirror)
[17:18:35] we should not be using that kind of stuff in the first place then
[17:19:15] probably yes
[17:22:13] and, again, spinning up a cluster per patch seems incredibly wasteful if most environments aren't actively receiving traffic and could be spun down when not used
[17:23:22] In my mind the env is spun up, automated tests are run, then it is spun back down. If manual attention is needed it can be spun back up with an option passed to keep the cluster around for a while.
[17:23:26] feels like it, maybe we can halt the VM
[17:23:53] I agree having them all around feels wasteful, deleting them seems the way forward in my mind
[17:25:01] hmm, I think that in any case, the application (catalyst backend or whatever) should not live in that cluster, that allows easily recreating it without affecting the app itself
[17:25:14] I would think not
[17:26:43] I guess it might be relatively easy then to recreate the 'team' cluster for upgrades and such
[17:26:44] Additionally I am suspicious of the idea of one cluster for many patches. My experience with such is that usually one gets unexpected errors from the differences between the test env (which has a bunch of patches, or different projects) and the production env (which does not)
[17:27:24] But why bother with recreating any shared clusters? Seems easier to make it easy to create a cluster on the fly
[17:27:37] as long as the catalyst app doesn't assume everything is in a single cluster, the cluster-per- or namespace-per- is something relatively simple to change later
[17:27:44] agree
[17:28:23] they might be able to delay the integration with the magnum api by using shared clusters, as it's not so much work doing it manually for starters
[17:28:34] for on-the-fly clusters, you are forced to integrate right away
[17:28:43] (that might not be a bad thing, but requires more time)
[17:29:30] What does integrate mean in this context?
[17:29:51] write code to manage/handle calls to a different system
[17:29:53] if you wanted to match the wikimedia prod environment as closely as possible, you would put all of the stateless apps in the same cluster and all of the databases outside of k8s. but I don't think being as close to production as possible was the goal here
[17:30:19] that's another one, yes xd, the prod setup is hard to reproduce
[17:31:40] maybe stef would like to be in this conversation xd
[17:32:34] yeah, this is maybe a conversation we should have with them, not with ourselves
[17:34:32] I mean, we can still discuss stuff, but don't forget to share xd
[17:35:56] kindrobot: tag, in case you weren't privy to the conversation
[17:36:17] oh, I thought it was badrobot xd, awesome
[17:36:37] gtg though, cya tomorrow
[17:36:40] dcaro: yeah, but talking about stuff like prod-like requirements doesn't really make sense since we don't know the app and what they actually require
[17:38:22] agree
[18:25:01] FYI guys I'm renaming the cloud-hosts vlan on cloudsw1-b1-codfw to keep the naming convention (should include rack) consistent
[18:25:13] Won't cause an issue, but there's always that 1-in-a-million glitch/bug, so just a heads up
[18:26:32] done, looks fine
[19:16:52] andrewbogott: I updated the CloudVPS/Admin/Network page on Wikitech to document the current setup for host connectivity
[19:16:57] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network
[19:17:02] thank you!
[19:17:19] I hope that covers it, and that it was an appropriate place to do so. It had older info about it that needed to be updated anyway
[19:51:26] +1, thank you for those edits!
[20:06:30] is someone doing something to cloudweb2002-dev?
[20:22:46] not that I'm aware of
[20:35:02] taavi: I rebooted it
[20:35:10] which worked surprisingly well -- horizon timed out before, works great now :)
[20:35:49] I'm trying to piece codfw1dev back together after a long ldap outage so that Francesco can break it again tomorrow :)
[20:36:25] that explains :P I was typing a command to fix a mediawiki thing and my ssh connection was interrupted just before I hit enter
[20:36:45] :( sorry
[20:37:09] but I'm surprised and delighted that labtestwikitech was good for something
[20:37:12] if however briefly
[21:35:26] I'm going to try to save Francesco some headaches tomorrow and reimage cloudservices2005-dev now (which will break codfw1dev, which I just fixed.) Will that mess with anyone? taavi I'm looking at you
[21:36:27] nope, thanks for checking
[21:39:12] I've spent all week fixing broken DNS, why should tomorrow be any different?
[21:51:14] :(
[21:52:26] I'm getting the hang of it