[07:28:50] good morning, I have a Puppet patch for a CI instance which could use a merge. It is to enable Cinder / attached volume for a specific instance ( integration-castor05 ) and disabling the management of the instance extended disk storage (there is no extended disk and both mount at /srv ): https://gerrit.wikimedia.org/r/c/operations/puppet/+/961844
[07:30:04] I have merely moved the settings from Horizon to Puppet, cherry picked it on the local Puppet master and as a result in Horizon the instance Puppet config has a single configuration: class role::ci::castor::server )
[07:32:04] so you are sure it does what you want?
[07:32:57] yeah I cherry picked it :]
[07:33:11] 👍 merging then :)
[07:34:05] when I switched to use the Cinder / attached volume some months ago I followed a doc on wikitech
[07:34:06] done
[07:34:24] but as I rebuilt the instance the manual modifications got lost and Andrew gave me the magic profile to apply to the instance :)
[07:34:26] thanks dcaro !
[07:35:18] yw
[07:57:59] dcaro: are the builds-builder deploy instructions up to date?
[07:58:13] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder
[07:59:23] Actually no (it would work, but most of it is automated), you can use the instructions https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy
[08:01:36] The builds-builder is not listed among the components in the toolforge-deploy readme (I can see it's in the components folder though)
[08:02:31] oh, it should be there, that might have been overlooked
[08:04:03] blancadesal: you can try using this script to generate the branch + commit https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/104
[08:05:53] blancadesal: for the readme https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/107
[08:10:13] blancadesal: for the builds-builder readme https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/14
[08:16:40] so after running the script it's deployment via cookbook?
[08:18:14] blancadesal: the script should have created a new branch, with the upgrade from one version to the other, with some notes of the tasks and changes included
[08:18:41] you should manually verify it has what you want, and then create an MR out of it to the toolforge-deploy repo
[08:19:42] hm, it looks like it bumped the builds-api, not builds-builder
[08:20:09] https://www.irccloud.com/pastebin/MBd3kZcY/
[08:20:44] did it add any commits at all?
[08:21:01] that seems to be just the last commit from origin/main
[08:21:22] you can git log -1
[08:21:53] you're right, that's the previous commit
[08:22:09] it failed to grep though
[08:22:16] that might be a diff between linux grep and macos
[08:23:51] there are sometimes differences in the flags, let me try from the debian server
[08:27:54] yeah that worked
[08:28:36] so then usually I create an MR, then deploy the MR on toolsbeta (before merging), test it, deploy on tools, test it again and if everything went ok, then merge
[08:29:16] (using the `--git-branch` and `--cluster-name toolsbeta/tools` options for the deploy cookbook)
[09:46:51] dcaro: which wikitech page is our buildpack tutorials linked from?
[09:48:12] There's this https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Tutorials_for_popular_languages
[09:48:20] (just updated it with the php one)
[09:49:22] hopefully soon they will appear here https://wikitech.wikimedia.org/wiki/Help:Toolforge/Quickstart#Build_and_host_your_first_tool
[09:54:08] thanks!
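
Pulling the deploy steps above together as a rough shell outline. Only `git log -1` and the `--git-branch` / `--cluster-name` options are confirmed in the discussion; the cookbook name and the branch name are assumptions, and the version-bump script itself is the one from the MR linked above.

    # in a local toolforge-deploy checkout, after running the version-bump script
    git log -1    # verify the generated bump commit is actually there and looks right
    # push the branch, open an MR against toolforge-deploy, then deploy it to
    # toolsbeta first and to tools once it has been tested there
    # (cookbook name below is assumed, not confirmed in the discussion)
    cookbook wmcs.toolforge.component.deploy --cluster-name toolsbeta --git-branch <bump-branch>
    cookbook wmcs.toolforge.component.deploy --cluster-name tools --git-branch <bump-branch>
    # merge the MR only after both deployments have been tested
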
[10:04:13] starting to think about writing an in-depth tech blog post about build service... xd
[10:08:11] that would be a good thing for the open beta
[10:08:43] who should we talk with for blog posts nowadays?
[10:10:10] I think the techblog is mostly dead :-(
[10:10:40] I also had an idea for a blog post but saw the news and then I lost interest
[10:10:47] I may write it for my personal blog though
[10:23:28] last post was from 4 months ago :(
[10:23:41] tricia might know what the status is?
[10:24:15] I don't know if the subject would fit Diff?
[10:28:48] * arturo errand, back later
[12:49:34] np, warning might have been nice though
[12:50:09] yes
[12:50:21] or at least a SAL entry
[12:53:15] arturo: there is no issue with the cloudgw routing traffic towards the openstack APIs I think
[12:53:37] but might be worth considering if it could route directly into the cloud-private vlan from the cloudnet / neutron hosts?
[12:54:09] that's definitely a possibility
[12:54:26] but some research should be done on the neutron side. I won't be here for that, so up to you!
[12:54:34] slight optimization, if it is complex on the neutron side we can leave it as is
[12:54:36] yep :)
[12:55:17] topranks: do you have a ticket for the subnet renumber from /30 to /29 that I can reference in the patch?
[12:56:00] let me create one real quick - there are some related but better to make a short new one and reference those
[12:56:18] topranks: maybe make it a subtask of T347469
[12:56:18] T347469: cloudgw improvements - https://phabricator.wikimedia.org/T347469
[12:56:21] to group them together
[13:08:59] sry for the delay
[13:09:07] created T348140
[13:09:08] T348140: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140
[13:11:01] thanks topranks. Could you please review these 2 patches? https://gerrit.wikimedia.org/r/c/operations/puppet/+/963298
[13:11:08] https://gerrit.wikimedia.org/r/c/operations/puppet/+/922105
[13:11:21] ok yep, just looked at the first one +1 on that
[13:11:26] checking the second one now
[13:14:54] one small comment on the second one but looks good - assuming those interface::post-up parts work as expected, I've actually not seen them before
[13:15:56] we use them in the cloud-private module and others
[13:16:15] it's basically the augeas way to add post-up entries to the interfaces stanza
[13:18:04] who knows, maybe for the future migration to systemd-networkd (or whatever) it's better if we have a templated file instead of the interface:: base module
[13:20:43] Yeah that's a good point. In that future we should be able to drive all the interfaces from netbox directly, but the additional routes and sysctls will need some consideration
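
For context on the post-up entries being discussed: a post-up line in an interfaces(5) stanza simply runs a command after the interface comes up. A generic ifupdown illustration using documentation addresses, not the actual cloudgw configuration:

    # /etc/network/interfaces (illustrative only)
    auto eno2
    iface eno2 inet static
        address 192.0.2.10/29
        # executed after the interface is brought up, e.g. to add an extra route
        post-up /sbin/ip route add 198.51.100.0/24 via 192.0.2.9
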
[13:21:11] Probably we could re-use "interface::post-up"
[13:21:41] which means we in IF can worry about how they get added to the files, so prob better than the template
[13:22:21] that was my idea too, to leave this as generic as possible so any future updates can come from IF instead of WMCS :-P
[13:22:48] Nice :)
[13:23:11] the PCC diff looks scary however
[13:23:11] https://puppet-compiler.wmflabs.org/output/922105/2448/cloudgw1002.eqiad.wmnet/fulldiff.html
[13:25:46] maybe we should merge this tomorrow, declare an operation window and such
[13:26:40] please test that in codfw1dev first
[13:32:36] yeah, might be better to postpone, looking at the diff nothing looks "wrong", but it's a little hard to grok without seeing the complete resulting file (we know there are ordering issues potentially etc)
[13:32:48] so tomorrow, and codfw1dev first makes sense to me
[13:33:46] augeas can be tricky
[13:37:04] I'll create a calendar entry
[13:40:33] please nobody break codfw1dev today, so the network tests work in there
[13:40:37] I'll need them tomorrow
[13:41:56] arturo: I've just started a cookbook that might break it :/
[13:42:21] dhinus: the upgrade?
[13:42:27] yes, only on one cloudcontrol for now
[13:42:41] maybe I can wait before doing the others?
[13:42:42] dhinus: you should continue
[13:42:48] let's see how this one goes
[13:42:54] ack, let me know
[13:46:46] * arturo lunch time
[14:20:00] andrewbogott: the openstack upgrade in cloudcontrol2001-dev failed with a glance error (see T341285)
[14:20:00] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285
[14:20:17] arturo: any reason for us not to do this, other than it breaking our existing numbering scheme? https://gerrit.wikimedia.org/r/c/operations/puppet/+/963325
[14:20:25] andrewbogott: I'm investigating but let me know if you have seen that before
[14:21:03] dhinus: I haven't. I think you can retest with something like 'glance-manage db sync' or similar
[14:21:08] glance-manage help will tell you
[14:21:33] the names of the revisions are quite different... maybe it's using the wrong db?
[14:22:08] nah, that's the name of the specific upgrades
[14:24:20] andrewbogott: reading
[14:25:10] andrewbogott: LGTM
[14:25:12] thx
[14:26:38] hmm running "glance-manage db sync" manually worked
[14:28:18] maybe a typo in the cookbook?
[14:28:33] Or maybe the cookbook just forgot about glance entirely? It doesn't change much so it might have not been an issue for the last few upgrades.
[14:28:52] (and by 'the cookbook forgot' I mean 'Andrew forgot')
[14:30:09] the cookbook failed running that very command :D
[14:30:17] but re-running it apparently worked...
[14:30:42] I'm also confused by the fact the error mentions "wallaby" which is a much older version
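
The manual fix that unblocked the upgrade, spelled out. Both commands come straight from the discussion; running them on the cloudcontrol being upgraded is the only assumption here.

    glance-manage help       # lists the available db subcommands
    glance-manage db sync    # apply the pending glance schema migrations by hand
    # then re-run the openstack upgrade cookbook; in this case it resumed and
    # completed without further glance errors
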
[14:34:59] puppet started failing in cloudcontrol1006 and 1005
[14:35:49] dcaro: component no longer exists as an option, so it's just the git url right? https://www.irccloud.com/pastebin/rR4IzpPX/
[14:36:23] blancadesal: you should not need to pass the git url, all use toolforge-deploy
[14:36:33] do you have the latest code for cookbooks?
[14:36:43] think so
[14:37:07] or maybe not xd
[14:37:12] I still see it there (and in your output)
[14:37:31] no, that's from the docs. I don't have it
[14:37:33] oh, that's from the docs right?
[14:37:34] yep
[14:37:49] https://www.irccloud.com/pastebin/xApQRdul/
[14:39:05] puppet fixed itself 🤷
[14:43:00] ok, now my cookbooks are up to date. got distracted by having to fix my cookbook config and forgot to merge after fetching
[14:46:11] the cloudcumins should have the latest version all the time :)
[14:46:25] dcaro: I was about to ask if that cookbook works from cloudcumins, it should :)
[14:46:41] and since yesterday blancadesal should be able to SSH to cloudcumins too!
[14:47:07] but I need sudo there? don't think I have that
[14:47:24] docs at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Cookbooks#Running_a_cookbook_from_a_cloudcumin_host
[14:47:34] you should have sudo since I merged a patch yesterday!
[14:47:48] \o/
[14:48:15] it might still fail for new and unknown reasons :D
[14:49:22] I do have sudo now, but didn't half an hour ago!
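
The cloudcumin route, as a minimal sketch. The host name and the listing flag are assumptions on my part; the wikitech page linked above is the authoritative reference, and sudo is required as noted in the discussion.

    # cloudcumin1001.eqiad.wmnet is assumed to be one of the cloudcumin hosts
    ssh cloudcumin1001.eqiad.wmnet
    sudo cookbook -l                      # list the available cookbooks (needs sudo)
    sudo cookbook <cookbook.name> --help  # then run the one you need with its options
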
[14:50:20] can anyone +1 https://phabricator.wikimedia.org/T347905 ?
[14:51:17] looking
[14:51:50] dcaro: I had puppet disabled on those cloudcontrols overnight, was the alert just about that?
[14:51:58] blancadesal: interesting, I merged the patch yesterday... but as long as it's now working I won't debug further
[14:52:12] andrewbogott: it was an email yes, CRITICAL: Puppet last ran 21 hours ago
[14:52:54] huh, so it alerted in the time between me re-enabling it and it finishing a run. Very lucky!
[14:53:04] this one also needs a +1, doubling the cronjobs quota to 100 https://phabricator.wikimedia.org/T347951
[14:53:51] andrewbogott: right on time :)
[14:54:12] dcaro: +1'd both
[14:54:37] thanks!
[14:54:47] although the project should be created without a - in the name due to radosgw
[14:54:48] andrewbogott: for the project names, should I push them not to use dashes?
[14:54:53] yep xd
[14:55:05] dcaro: oh, yes! good thinking
[14:55:13] are underscores ok? or should we camelCase?
[14:55:17] hopefully soon we'll switch to uuids and then it won't matter, but today it matters.
[14:55:20] (/me has no preference)
[14:55:25] underscores are ok as far as I know
[14:55:54] no, underscores are not valid in DNS names
[14:56:15] re-running the cookbook after manually doing the glance db sync worked fine ¯\_(ツ)_/¯
[14:56:24] my preference is just all lowercase without any separators
[14:58:10] sounds good to me
[14:59:02] andrewbogott: arturo: one cloudcontrol is upgraded, given arturo needs to run some tests in codfw tomorrow, do you think it is safer to upgrade all cloudcontrols, or to leave only one host on antelope and the other two on zed? :)
[14:59:20] I think that the cookbook is trying to authenticate as the new project when creating it xd
[14:59:50] dcaro: which cookbook? I remember hitting a similar issue
[14:59:54] dhinus: I think you can do one more. The network tests only run from a single cloudcontrol
[15:00:01] vps.create_project
[15:00:16] network tests from the upgraded cloudcontrol are failing apparently :/
[15:00:29] dcaro: yes, I might have opened a task to fix that (or maybe not)
[15:00:40] oh, it seems it did create it
[15:01:02] wait, no, it's trying to create admin
[15:08:46] dhinus: I'm suspicious of keystone being down @ codfw1dev, and that would be the main problem with the network tests
[15:08:54] (because it sshs into VMs)
[15:57:32] dhinus: I discovered this T348157
[15:57:32] T348157: keystone: segfaults in debian bookworm - https://phabricator.wikimedia.org/T348157
[16:02:01] a couple of quick reviews https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/15 https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/52
[16:02:16] dcaro: 👀
[16:05:14] dcaro: OOPS, mistake, I clicked `merge` on https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/52
[16:05:22] instead of `approve`
[16:06:06] np
[16:06:08] dcaro: is it OK to merge now?
[16:06:14] yes, I'll release
[16:06:28] ok
[16:06:48] arturo: that looks pretty bad :/
[16:06:56] apparently when there is already an approval, the button changes. That was the thing that got me into clicking the wrong button
[16:06:56] (the segfaults)
[16:07:38] dhinus: I've seen this kind of stuff before unfortunately. Chances are this is known to upstream
[16:08:07] a version dependency we are not tracking correctly, a missing package in the bpo repo or similar
[16:08:35] I'm trying to google for related discussions
[16:08:43] thanks
[16:08:46] * arturo offline
[16:14:43] blancadesal: I see you have a pending deploy of https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/109
[16:15:20] it's blocking me on releasing the 'next' version of builds-api, are you working on it? Can I deploy it for you?
[16:15:44] wait no, that's builds-builder xd
[16:15:45] nm
[16:21:44] :)
[16:22:23] (just fyi, latest builds-builder is deployed on toolsbeta but as yet untested)
[16:23:09] ack, careful though, tomorrow there's a long maintenance on gitlab itself, if we need to revert anything it will have to be by hand (no MRs to toolforge-deploy)
[16:23:33] 🚧 4 hours planned downtime: Thursday, October 5th @ 09:00 UTC. 🚧 Support: #wikimedia-gitlab, #GitLab on Phabricator.
[16:24:08] ack
[16:24:42] do we have any test tools for buildservice on toolsbeta?
[16:26:01] yes, I usually use this
[16:26:11] https://www.irccloud.com/pastebin/KeanASRj/
[16:26:59] thanks
[16:29:41] dhinus: sorry, multitasking w/wiki connect things. Um... it might be easiest for Arturo if you just stop puppet and all openstack services on the one upgraded cloudcontrol
[16:30:04] because having a version disagreement between cloudnets and cloudcontrols might cause trouble
[16:32:10] andrewbogott: I will try that, thanks
[16:32:17] btw if you have ideas about T348157
[16:32:18] T348157: keystone: segfaults in debian bookworm - https://phabricator.wikimedia.org/T348157
[16:32:39] I think the segfault could be a red herring (see my comment in the task), but keystone is not starting
[16:33:44] nothing in the logs at all?
[16:33:50] can you try starting it up manually?
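
A sketch of the kind of manual check being suggested here, assuming the unit is simply called keystone (the conversation that follows shows it just wraps /etc/init.d/keystone):

    # on the affected cloudcontrol
    systemctl status keystone
    journalctl -u keystone -n 100        # recent systemd-side logs
    # run the init script by hand with tracing to see where it actually exits
    sudo sh -x /etc/init.d/keystone start
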
[16:33:54] dhinus: I'm looking
[16:34:32] dcaro: nothing in that log file, but I wonder if it logs somewhere else
[16:34:44] journalctl shows something, but not useful
[16:35:10] hmm, it should go to kibana eventually, but I think that's systemd logs only
[16:36:38] I think it's the systemd module that's broken, the server itself seems to start up fine...
[16:38:03] ah interesting
[16:39:32] the systemd service seems to run /etc/init.d/keystone xd
[16:40:34] dhinus: here's a mild cleanup https://gerrit.wikimedia.org/r/c/operations/puppet/+/963377
[16:41:09] dcaro: it's expecting that file to run as a daemon, but instead it exits with exit code 0 :)
[16:41:33] is the init script overwritten by puppet? If so we should fall back on the packaged one and see if that works
[16:41:59] let me try that...
[16:43:39] * andrewbogott confused
[16:48:07] I don't understand why apt-file tells me that the keystone package installed /etc/init.d/keystone but when I do apt-get install --reinstall keystone the file doesn't get installed...
[16:49:56] and yet purging and installing /does/ create it.
[16:50:01] What even is --reinstall then :(
[16:51:05] maybe it's created by some post-script or such?
[16:51:08] google tells me "The config packages are special for apt. To let apt re-install them, first they must be purged."
[16:51:17] uuu
[16:51:19] but I'm not apt expert
[16:51:22] *no
[16:51:30] dhinus: that certainly 'explains' what I was seeing
[16:51:34] :D
[16:52:59] ok now I see some keystone processes running
[16:53:02] I'll have a patch for the keystone thing in a few minutes
[16:55:25] meanwhile... dhinus can you see if the keystone-admin service was removed in A?
[16:56:12] gtg
[16:56:13] * dcaro off
[16:56:39] andrewbogott: let me check
[16:57:59] dhinus: this is largely a blind patch but it should work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963378
[16:59:07] you may need similar changes for other services, it's hard to predict.
[16:59:22] but the init script on bullseye/zed was broken so I have high hopes for this one
[16:59:42] I seem to be running into networking restrictions on metal systems in codfw, namely that I can't install with pip. How does one deal with this?
[17:01:32] Rook, that doesn't seem weird since those hosts mostly can't talk to the public internet. I don't know what the right fix is though, I think mostly it's considered a security threat to install directly from pip onto prod.
[17:01:44] But, sorry, that's a drive-by, I'm going to watch lightning talks now.
[17:03:57] dhinus: does that patch make sense? and/or have you punched out for the day?
[17:04:01] Rook: this might help, but I think the proxies might have further allowlists https://wikitech.wikimedia.org/wiki/HTTP_proxy
[17:04:33] So we're blocked from installing?
[17:05:05] I don't know what current thinking is.
[17:09:22] Sorry, whose brain is in the context of your statement?
[17:10:00] security folks
[17:10:30] Rook: I would try exporting those proxy variables to see if by any chance that lets you pip install. if that doesn't work, it's a harder problem
[17:10:41] * andrewbogott agrees
[17:11:32] https://phabricator.wikimedia.org/L3 is the general "you really shouldn't do that" document for anything that has a connection to the production realm
[17:12:05] Seemingly not, but it fails faster now. So something?
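
The apt behaviour behind the confusion above: a deleted conffile (like /etc/init.d/keystone here) is not put back by a plain --reinstall. Two standard dpkg/apt ways to restore it, neither specific to keystone:

    # confirm which package owns the missing file
    dpkg -S /etc/init.d/keystone
    # option 1: ask dpkg to reinstall missing conffiles in place
    apt-get install --reinstall -o Dpkg::Options::="--force-confmiss" keystone
    # option 2: purge and reinstall (also wipes the package's other config; on a
    # puppetised host the managed bits should be put back on the next puppet run)
    apt-get purge keystone && apt-get install keystone
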
[17:15:11] taavi: thanks for that link, though I'm finding some phab discussions that seem more nuanced? T300977
[17:15:11] T300977: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977
[17:21:28] dhinus: yeah. it seems like folks are used to installing stuff for the analytics workflows on stat* boxes. not sure if that's something that was explicitly allowed by the policy, or if people just started doing that and now it's hard to block because everyone is doing it
[17:49:33] * andrewbogott relocating, off for a while
[18:14:42] I tried (unsuccessfully) to find out if/when the keystone-admin binary was removed from keystone :/
[18:16:31] I'm gonna call it a day, andrewbogott if you have time later can you please stop openstack services on cloudcontrol2001-dev, in the hope that will give a.rturo a working cluster for tomorrow, using the 2 other cloudcontrols?
[20:31:54] arturo, dhinus, it's unlikely that codfw1dev will be useful for network testing tomorrow. The openstack databases have been upgraded to A but the services mostly haven't (because of us being mid-upgrade) and right now the neutron-api isn't responding to any api calls.
[20:32:10] I will try to work on it a bit more today but probably the experimentation will have to wait for a future more-stable era.