[00:50:58] * bd808 off
[09:02:08] I created a new tool a few days ago, sample-ruby-rails-buildpack-app. Two issues:
[09:02:08] 1. There is a discrepancy between the passwords in replica.my.cnf and the one returned by `toolforge envvars list`
[09:02:08] 2. I can't auth with either of these passwords
[09:02:08] I checked some of my other tools, but they work fine.
[09:27:58] * taavi going to meet some people for lunch, be back later
[11:10:15] blancadesal: I think it's worth creating a bug in phabricator
[11:10:51] I will try to create a new tool and see what happens
[11:12:29] dhinus: yes, will create a ticket after lunch :pizza
[11:12:29] will you be at collab today? I'm looking for someone to throw a bunch of puppet questions at
[11:15:06] I'm on a train (to ItWikiCon tomorrow!) so I can't join the collab :/
[11:15:26] feel free to throw some questions here or on a private chat though!
[11:18:18] oh, exciting! where is it hosted?
[11:20:03] Bari, southern Italy
[11:20:13] re puppet, it's not so much that I have one specific question but more like the need to have a back and forth to clear up my understanding of how different things work and fit together
[11:20:25] the wiki is only in Italian https://meta.wikimedia.org/wiki/ItWikiCon/2023
[11:20:30] blancadesal: I can join collab if that's useful
[11:20:46] dhinus: nice! that's a long train ride!
[11:20:50] (I'm in a tram back home at the moment still)
[11:21:49] taavi: that would be helpful. would around 13 UTC work for you?
[11:22:29] if I did the timezone calculations in my head correctly, sure
[11:22:54] I have a UTC clock on my status bar for a reason :P
[11:24:05] * dhinus considers adding a UTC clock too :)
[11:27:23] hehe :))
[11:28:30] I once had a fuzzy clock in my status bar and I really enjoyed it :D
[11:28:52] not great for meetings though
[11:56:03] blancadesal: I created a new tool from toolsadmin, and the passwords in replica.my.cnf and "toolforge envvars list" match, and I can connect to ToolsDB with that password
[11:57:50] hi all i wonder if anyone was able to take a look at the codfw1dev cluster to see if there were any issues post puppet7 migrations. And whether i can progress to migrating the eqiad cluster?
[12:07:01] jbond: I don't see anything broken but let's wait for a.ndrew to be online before proceeding with eqiad
[12:10:24] dhinus: ack thanks
[12:26:56] dhinus: thank you for testing! what would be the way to start investigating what's going on?
[12:29:56] can anyone else log in to dev.toolforge.org?
[12:34:54] blancadesal: not sure, I think both replica.my.cnf and the envvars are set by maintain-dbusers but I'm not familiar with the source code
[12:35:33] taavi: it seems to hang after accepting the fingerprint (I never use it so I didn't have it in my known_hosts)
[12:35:58] dhinus: thanks for confirming, I'll reboot it
[12:36:03] same for me, also never use it
[12:37:20] I almost always use it since it tends to have much lighter load than login.toolforge.org :-)
[12:59:30] blancadesal: I'm in the co-working space meet
[12:59:55] coming
[14:24:06] andrewbogott: shall we try reimaging cloudvirtlocal100[1-3], and what would you do (if anything) before running the first reimage script?
[14:26:34] jbond: codfw1dev is still pretty broken but I've no idea if it's due to puppet 7. I'm hoping to go through and try to fix things this afternoon.
[14:26:47] ack
[14:27:28] dhinus: i have meetings this morning but I think you should try it. I haven't used wmcs-cold-migrate lately but in theory it should work to move the etcd nodes around so they don't get clobbered during reimage.
[14:27:46] I think that's going to be easier than relying on partman to not delete them.
[14:27:58] but if cold-migrate doesn't work, then we can give partman a look.
[14:29:57] any docs on wmcs-cold-migrate?
[14:30:15] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-cold-migrate
[14:30:36] Used it a lot pre-ceph; barely use it at all these days so it might have rotted.
[14:30:55] but we have tools-beta etcd nodes to experiment on
[14:31:13] toolsbeta is a good idea
[14:46:12] I'm trying wmcs-cold-migrate on toolsbeta-test-k8s-etcd-22
[14:46:16] great
[14:46:48] it's doing something (copying files), let's see if it completes
[14:46:54] btw jbond I reimaged cloudcontrol2001-dev last night with --puppet 7. It prompted me to make the hiera change first, which I didn't, and then it seemed to work fine.
[14:47:02] So I have created an edge case for you if you're interested :)
[14:48:17] andrewbogott: great glad it worked
[14:49:47] I was expecting some alert to fire when I stopped that VM, but I don't see any
[14:49:57] which VM?
[14:50:08] toolsbeta-test-k8s-etcd-22
[14:50:32] when I say "I stopped", I mean "when wmcs-cold-migrate stopped it"
[14:51:00] the alerting would be done by toolschecker, which only exists for tools not toolsbeta
[14:51:19] makes sense
[14:51:31] wmcs-cold-migrate failed :/
[14:51:46] "Can't connect to server on 'openstack.eqiad1.wikimediacloud.org'"
[14:51:48] if the script cleanly shuts it down, then the prometheus discovery job will just ignore the VM: it's been told to shut down and it did so cleanly, so it would be in the SHUTOFF state
[14:52:24] dhinus: that's not what I expected! You were running it on a cloudcontrol?
[14:52:26] nice, we'll find out when I do it on the "tools" vms
[14:52:31] cloudcontrol1005
[14:52:38] I guess something changed in the hostnames or other things
[14:52:41] "ERROR: failed to update the instance's db record.Host not moved."
[14:53:02] wmcs-cold-migrate did not crash, it simply failed at the step "updating nova db"
[14:53:02] the mysql command in update_nova_db might need an explicit port to be set
[14:53:23] oh yeah I bet that's it. Moved to 23306 but it's probably trying 3306
[14:54:00] hm, wait, 3306 should be right if it's contacting the haproxy endpoint which it seems to be
[14:54:48] yeah, but I've noticed before that if you try to use mysql on a cloudcontrol it will just hang if you connect to an external host and don't specify a port
[14:55:18] I can try replicating the mysql command used by the script
[14:57:17] could it also just be the password? It used to be set in the env script but I'm not positive that it still is
[14:57:24] Anyway, I have to go, sorry for handing you a broken tool dhinus
[14:57:54] maybe it's missing NOVA_MYSQL_PASS?
[14:58:11] andrewbogott: I think it looks quite good, slightly broken but shouldn't be hard to fix
[15:00:01] from the error I think it's the missing port param. https://phabricator.wikimedia.org/P53520
[15:01:00] hmm I'm getting something different
[15:01:36] I left a comment in the Paste
[15:01:53] oh wait maybe -p stops before connecting
[15:02:18] does it give you any response after giving a password?
[15:02:48] yep I think that's it
[15:03:01] yeah I think it hangs, I ctrl-c'd without paying attention
[15:03:13] but where is the script getting the password from?
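For reference, a minimal sketch of replaying by hand the connection that wmcs-cold-migrate's update_nova_db step makes. The variable names NOVA_MYSQL_USER and NOVA_MYSQL_PASS are illustrative (only NOVA_MYSQL_PASS is mentioned above, and it is not certain the env script still sets it):

    # First see which options (including a port) mysql picks up implicitly:
    mysql --print-defaults

    # Then connect with the port given explicitly; without it, mysql on a
    # cloudcontrol tends to hang against an external host. 3306 is the haproxy
    # frontend discussed above; the backend itself moved to 23306.
    mysql --host=openstack.eqiad1.wikimediacloud.org --port=3306 \
          --user="$NOVA_MYSQL_USER" --password="$NOVA_MYSQL_PASS" \
          --execute="SELECT 1;"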
[15:03:41] thanks for your help taavi, that cleared up a bunch of things
[15:05:30] glad I could be helpful
[15:13:01] mysql --print-defaults shows it gets a bunch of options from somewhere
[15:13:18] including a different port, so that's why it needs --port
[15:17:43] that'd do it
[15:47:13] hmm now wmcs-cold-migrate fails in a different way :/ I'm trying to debug it
[15:48:55] ok stupid mistake on my side, fixed. now it's running again, but I'm not convinced it will connect successfully to mysql
[15:49:04] because I couldn't find where the password is coming from
[15:50:51] I think it did work
[15:55:27] these are the changes I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/975012
[15:58:31] now, how do I test if toolsbeta is still ok after I moved one of its etcd nodes?
[16:26:57] the etcd logs showed a couple warnings after the vm was moved and restarted, but they seem fine
[16:27:35] I've tried using "etcdctl cluster-health" to confirm the cluster is in good health, but I'm struggling to find the right options I need to pass
[16:59:36] ok dhinus I am back! catching up...
[17:00:35] sounds like pretty much all is well... taavi do you have advice about how to check etcd health?
[17:02:50] dhinus: does "etcdctl --endpoints=$ENDPOINTS endpoint health" work?
[17:12:45] trying
[17:13:42] should $ENDPOINTS be defined in the environment? it's not
[17:14:18] where do you run etcdctl from? I'm trying from the etcd vm itself
[17:16:40] you'll have to define endpoints to be the ips of the three hosts in the cluster
[17:16:43] and yeah, on the etcd node
[17:17:07] I'm just going from https://etcd.io/docs/v3.5/tutorials/how-to-check-cluster-status/
[17:19:48] ips or hostnames? I tried a similar command and it was complaining about certs
[17:19:59] the etcd docs are not great
[17:20:20] hmph no idea
[17:20:47] _joe_: what's the easiest way to answer the question "are all the etcd nodes in the cluster responsive?"
[17:21:55] <_joe_> andrewbogott: what you have described above should work
[17:22:11] I just have to figure out the right value for $ENDPOINTS :)
[17:22:12] <_joe_> as for $ENDPOINTS, use what the TLS cert has
[17:22:31] <_joe_> https://your-server-name:2379 usually
[17:22:37] thanks _joe_ !
[17:22:51] <_joe_> you can have multiple ones separated by commas
[17:25:04] so "endpoint status" gives me a weird "No help topic for 'endpoint'"
[17:25:15] while "cluster-health" gives me error #0: remote error: tls: bad certificate
[17:25:50] I might be using the wrong hostname
[17:26:29] jbond: using bookworm for your puppet7 servers?
[17:26:42] dhinus: what's the fqdn of the host you're testing on?
[17:27:48] toolsbeta-test-k8s-etcd-21.toolsbeta.eqiad1.wikimedia.cloud, but you should probably use -22 which is the one I moved
[17:28:01] it won't make a difference probably, the cluster is -20,-21,-22
[17:29:54] wtf 'no help topic'
[17:30:06] haha that's so weird, maybe it's a different etcdctl version
[17:30:37] I've tried something like etcdctl --debug --endpoints $ENDPOINTS cluster-health
[17:30:44] which seems to be slightly better
[17:31:26] andrewbogott: pm7.puppet-dev.eqiad1.wikimedia.cloud
[17:31:31] I'm gonna be afk for the next 30 mins, back later for a little while
[17:31:38] https://www.irccloud.com/pastebin/I8zIqwup/
[17:31:43] the good news is that wmcs-cold-migrate is working just fine, with the patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/975012
[17:32:05] (I applied it manually, but puppet has already reset it, so you'll need to merge the patch or apply it again if you need it)
[17:32:12] thx jbond
[17:32:30] so bullseye
[17:32:48] no its bookworm i did an in place upgrade
[17:32:54] it was from before there were bookworm images
[17:34:06] oh dang
[17:34:10] * andrewbogott rewinds a bit
[17:34:21] andrewbogott: tbh both should work
[17:38:42] andrewbogott: fyi there is also agent7.puppet-dev.eqiad1.wikimedia.cloud which is an agent to that pm
[17:40:17] andrewbogott: taavi: seems i lied a bit, i did end up creating a specific wmcs class so they are a bit different to prod
[17:40:21] profile::puppetserver::wmcs
[17:41:27] as far as i can tell ^^ class and the following hiera is all you need:
[17:41:30] profile::puppet::agent::force_puppet7: true
[17:44:29] that hiera setting is for the server or the client?
[17:44:51] andrewbogott: that is needed to force puppet7
[17:44:59] definitely needed on the server
[17:45:09] if you want the agent to be puppet7 then needed on the agent
[17:45:10] ah, that chooses the package
[17:45:25] well, changes the priorities / installs the component
[17:45:42] on the agent you will still need to manually update.
[17:45:59] i suspect the puppetserver packages will automatically pull in the correct puppet version
[17:46:04] 'k
[17:51:15] dhinus: I'm frustrated, it seems like those hosts aren't set up for etcdctl to work. It seems that etcd is otherwise working but I don't know how to check the cluster health.
[17:51:24] And now I need to eat lunch but will be around off and on
[17:58:55] dhinus: andrewbogott: sorry I was afk, what was the question?
[18:22:21] taavi: how can we tell if etcd is broken or not broken on toolsbeta?
[18:22:34] None of the status commands work because the endpoints don't seem to be configured locally
[18:22:43] (but they don't work on tools either so I don't think this is a new issue)
[18:26:24] I'm back but not for long... andrewbogott if you feel like continuing with the reimages, it should be easy to move tools-k8s-etcd-18 in the same way, then cloudvirtlocal1001 can be reimaged
[18:26:56] otherwise we can move the toolsbeta one I moved earlier back to its original place
[18:27:11] ok -- will see if I get to it
[18:27:54] if there's a cert issue that may very well be caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/973842
[18:28:43] it might be... but etcd seems to be able to communicate with the other nodes, it's just etcdctl that's complaining
[18:28:52] so maybe we're just using it wrong
[18:30:02] oh yeah.. there's the etcdctl v2 vs v3 confusion
[18:30:18] yeah, that's why endpoint status doesn't work but cluster-health should...
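A sketch of the two etcdctl invocations being mixed up here, run from one of the etcd nodes. The member hostnames and port come from the discussion above; the certificate paths are only placeholders and would need to match whatever the puppet role actually deploys:

    # The three cluster members (port 2379, as in the TLS cert):
    ENDPOINTS="https://toolsbeta-test-k8s-etcd-20.toolsbeta.eqiad1.wikimedia.cloud:2379,https://toolsbeta-test-k8s-etcd-21.toolsbeta.eqiad1.wikimedia.cloud:2379,https://toolsbeta-test-k8s-etcd-22.toolsbeta.eqiad1.wikimedia.cloud:2379"

    # v2-flavour etcdctl (the default binary here, hence "No help topic for 'endpoint'");
    # cert/key paths are illustrative:
    etcdctl --endpoints "$ENDPOINTS" \
        --ca-file /etc/etcd/ssl/ca.pem \
        --cert-file /etc/etcd/ssl/client.pem \
        --key-file /etc/etcd/ssl/client-key.pem \
        cluster-health

    # v3-flavour etcdctl (what the upstream tutorial assumes) uses different flag names:
    ETCDCTL_API=3 etcdctl --endpoints "$ENDPOINTS" \
        --cacert /etc/etcd/ssl/ca.pem \
        --cert /etc/etcd/ssl/client.pem \
        --key /etc/etcd/ssl/client-key.pem \
        endpoint health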
[18:31:16] taking out the ETCDCTL_API=3 from the health check command on https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#etcd_nodes gives https://phabricator.wikimedia.org/P53527 which looks fine?
[18:32:22] it does!
[18:32:30] ah, specifying the cert
[18:32:44] seems silly but it's good enough to keep us moving forward
[18:32:51] I suspected that was the trick, but how does etcd know to use those certs? I was hoping there was some kind of config file
[18:32:56] that etcdctl could rea
[18:32:59] *read
[18:33:24] i'm not aware of any
[18:36:55] dhinus: ok, so you're convinced that cold-migrate is safe with these etcd clusters, right? So the rest of the reimaging should be straightforward
[18:36:58] if tedious
[18:38:53] yes
[18:39:07] great!
[18:39:14] the remaining question is if once you reimage to bookworm it "just works" or not, but I don't see why it shouldn't
[18:39:19] it's just another cloudvirt
[18:39:30] Yeah, I guess we'll see. I'd expect it to.
[18:39:42] i'm not worried about it. as long as you check that the toolschecker etcd check is fine everything should be fine
[18:39:54] can we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/975012 ?
[18:40:04] in theory the cloudvirtlocals have the same puppet config as cloudvirt-wdqs, right? and those are already on bookworm
[18:40:05] it's a bit of a hacky fix with the hardcoded port
[18:40:12] but at least it makes the script work :)
[18:40:25] oh already done, sorry
[18:40:32] yeah, that whole script is a bit hacky
[18:40:46] In theory there's an openstack service to do that but I've never found it to work properly
[18:40:53] it's worth checking back to see if it's fixed I guess
[18:41:38] yeah I was thinking the same, maybe in zed/antelope it works
[18:53:44] andrewbogott: jbond: jhathaway: I filed tasks for what we talked about today, see T351450 and its subtasks
[18:53:44] T351450: Migrate Cloud VPS puppet infrastructure to Puppet 7 - https://phabricator.wikimedia.org/T351450
[18:54:38] thanks taavi
[19:01:42] a few days ago I changed some alerts in the mariadb database on metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud, but I noticed only now that prometheus.wmcloud.org/alerts is still showing the old ones
[19:03:02] dhinus: do you have an example?
[19:03:30] the ones I changed/added are https://phabricator.wikimedia.org/T350943#9323931
[19:03:54] maybe there's a syntax error somewhere? I'm not sure where to check
[19:04:49] hm
[19:05:06] dhinus: https://phabricator.wikimedia.org/P53528
[19:05:09] that seems unrelated
[19:05:30] unrelated to your rules, I mean
[19:05:58] yeah, but maybe it hasn't been parsed since those ones were added?
[19:08:00] ah, this seems to be expecting go templates to be any useful
[19:08:50] fixed I think, thanks for spotting that. not great that there was no alert about that
[19:09:14] we need an alert about broken alerts :D
[19:09:30] thanks for the fix
[19:10:01] i recall prometheus has a metric about the last config reload success/failure, so that should be simple
[19:10:11] or a proper interface for managing the alerts so it doesn't happen in the first place :P
[19:10:36] YES :D
[19:16:11] * bd808 lunch
[19:16:30] dhinus: is the alert that just started firing expected/known? :P
[19:23:32] yes
[19:23:42] that's how I found out my new alerts were not working
[19:24:02] in a kind of reverse way (I spotted by chance that the replication was broken)
[19:24:50] it seems that restarting the replication fixed it, but I'm not sure why it failed
[19:25:17] I've just updated the runbook at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running
[19:33:41] I've also acked the alert, the lag is going down and should be back to normal in a few hours
[19:37:08] jbond: if still here can I get a quick +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/975073 ?
[19:40:43] andrewbogott: just running pcc, strange we never hit this before
[19:41:09] maybe bookworm vs bullseye packaging differences?
[19:45:02] thx
[19:46:52] andrewbogott: you asked for a review and merged before i responded
[19:47:05] hm? I thought you marked it +1
[19:47:13] can you send a follow up patch with my comments
[19:47:15] oh it's verified, sorry
[19:47:20] yes, I'll make a followup
[19:47:21] the +1 was on verified from pcc
[19:47:24] no worries
[19:47:35] if you send the patch i'll review and merge tomorrow
[19:47:42] or add jesse
[19:47:58] random thing I spotted while I was looking for something else: the overall load on toolforge increased quite noticably last week: https://grafana.wmcloud.org/goto/ap6g-oISk?orgId=1
[19:48:14] *noticeably
[19:48:34] andrewbogott: how has testing been going so far?
[19:48:46] we're now at 3x the baseline load of only 1 month ago
[19:48:54] maybe just people migrating from the grid?
[19:49:45] * dhinus off for the day
[19:51:13] jbond: I had to catch up on etcd things so only just now testing (and finding that the puppet7 server wouldn't build)
[19:52:48] ack, feel free to ping me tomorrow when you get online and we can go through some things
[19:53:02] i'll also try to rebuild one from scratch
[19:53:48] thx
[19:55:43] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/975075
[19:58:46] thanks andrewbogott +1
[23:56:18] * bd808 off
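Following up on the "alert about broken alerts" idea above: Prometheus exposes the standard metric prometheus_config_last_reload_successful for exactly this, so an ad-hoc check (assuming the query API is reachable on the same host as the /alerts page) could look something like:

    # 1 = the last rule/config reload succeeded, 0 = it failed
    # (e.g. because of a bad template in an alert rule, as above).
    curl -s 'https://prometheus.wmcloud.org/api/v1/query' \
         --data-urlencode 'query=prometheus_config_last_reload_successful'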