[09:48:42] I'm not going to be around the next couple of weeks, should we shift the toolforge monthly/checkin one week? (next monthly would be the 15th) We also wanted to do a special session dedicated to the toolforge UI with anyone who wants to come.
[09:49:42] dcaro: I think you should feel free to move things around as you see fit :-)
[09:54:35] thanks! I'm trying to gather a bit of feedback to evaluate the fitness of the proposal though xd
[09:57:18] yeah the change sounds good to me
[10:20:01] we have https://gitlab.wikimedia.org/repos/cloud/paws and https://github.com/toolforge/paws/
[10:20:11] I assume the source of truth is the github one
[10:22:55] I'd say so, gitlab is behind a few commits
[10:23:36] rook might have been trying to migrate it though, there are some ci jobs that ran and such https://gitlab.wikimedia.org/repos/cloud/paws/-/pipelines
[10:31:25] ack
[10:31:28] can I get a +1 here?
[10:31:29] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/160
[10:33:48] +1d
[10:34:21] thanks
[10:39:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131013
[10:39:18] another quick review here ^^^
[10:42:34] there might be other places too, using a partial name like https://codesearch.wmcloud.org/search/?q=cloudinstances2b
[10:42:57] not sure if those matter
[10:43:37] yeah, we also have the virtual router called `cloudinstances2b`, but that one is not changing
[10:46:26] I think we are good for now, we will see if something breaks soon enough
[10:46:34] I'm curious to see how nova-fullstack reacts to the change
[10:47:14] there seem to be some cookbooks using it too
[10:47:59] dcaro: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1131043
[10:48:35] oops, sorry I gotta go, I have a doctor appointment
[10:48:47] +1, good luck!
[10:48:49] be back later
[11:19:49] I was thinking that we also have the metricsinfra/tofu-provisioning repo, but a.rturo already updated it :) https://gitlab.wikimedia.org/repos/cloud/metricsinfra/tofu-provisioning/-/merge_requests/1
[11:20:52] I'm curious if he found that one from codesearch, because if yes, then this patch I did last month was more useful than I imagined :P https://gerrit.wikimedia.org/r/c/labs/codesearch/+/1114742
[11:23:08] unfortunately tofu seems to want to replace everything in that repository now
[11:24:30] dhinus: I thought that was covered by https://gerrit.wikimedia.org/r/c/labs/codesearch/+/1053538 xd
[11:26:18] wait, the metricsinfra one is still in gerrit I see
[11:27:47] nm, I'm confused
[11:27:55] taavi: ouch, we need some "move" blocks maybe?
[11:28:00] i'm still looking into why
[11:28:12] is it expected that some trove databases are still showing as using g3 flavors?
[11:28:41] not sure, a.ndrew did some upgrades recently I think
[11:29:06] I think we can migrate all to g4 if they have the same specs
[11:31:25] dcaro: you're right, I was also confused. my patch added cloud/metricsinfra in gerrit, not in gitlab :P
[11:31:45] cloud/metricsinfra in gitlab was already covered by your patch adding repos/cloud
[11:32:13] ERR_TOO_MANY_REPOS
[11:32:31] (for my brain)
[11:35:28] dhinus: manually poking the state file to change the network name fixed it
[11:35:38] taavi: awesome, thanks!
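(Editor's note: the state surgery described here roughly follows the pull/edit/push workflow sketched below. This is a generic sketch, not the exact edit made to the tofu-infra state; the attribute being fixed is only an example.)

```sh
# Generic sketch of hand-editing an OpenTofu state (assumed workflow, not the exact commands run):
tofu state pull > state.json       # download the current remote state
"$EDITOR" state.json               # fix the stale attribute (here, the old network value)
                                   # and bump the "serial" field so the push is accepted
tofu state push state.json         # upload the edited state back to the backend

# Note: "tofu state mv" and "moved" blocks only help when a resource *address*
# changes; they don't fix a stale attribute value stored in the state.
```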
[11:35:42] not sure if there's a more elegant way, but `tofu state pull`/push worked fine
[11:36:27] re the trove problem: the VMs themselves are using g4 flavors, it's just that trove thinks they are g3
[11:36:28] there's "tofu state mv", but it just does the same thing
[11:36:59] no, the problem was that the server state had the wrong network name
[11:37:08] you're right, it was the value, not the name
[11:39:31] filed T390042 for the trove issue
[11:39:32] T390042: trove database flavors are out of sync with reality - https://phabricator.wikimedia.org/T390042
[11:39:39] thanks
[12:07:50] tf-infra-test is failing with "Unable to find fixed network lan-flat-cloudinstances2b"
[12:08:06] I think tf-infra-test is in github?
[12:08:49] https://github.com/toolforge/tf-infra-test
[12:09:44] yes, found the repos in codesearch
[12:09:53] any reason not to move that to gitlab?
[12:10:05] taavi: not that I can think of at the moment
[12:11:31] https://github.com/toolforge/tf-infra-test/pull/22
[12:11:40] +1 for moving to gitlab
[12:13:59] i'll import it to repos/cloud/cloud-vps
[12:14:15] thanks
[12:15:19] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test
[12:16:44] there wasn't any pipeline or anything extra on github for that repo?
[12:17:20] there's a formatting pipeline that needs to be replaced
[12:17:38] but otherwise no, it's all provisioned manually on the runner host, it seems :(
[12:18:00] yep, I'm finding some "git pull" in bash_history :)
[12:18:12] should be easy to pull with puppet
[12:18:42] the formatting pipeline was just running "tofu fmt" which we can do easily in gitlab
[12:18:49] (we already do for tofu-infra)
[12:19:15] the project has code for running on codfw1dev, does anyone know if that's actually in use?
[12:19:33] I don't think it is
[12:20:37] taavi: are you updating the checkout in tf-bastion.tofuinfratest to point to gitlab, or shall I update it?
[12:20:43] already done
[12:20:48] thanks!
[12:22:10] I will trigger a manual run of the cron job (systemd-cat -t tf-infra-test /root/tf-infra-test/tofu-test.sh eqiad1)
[12:22:32] * taavi resists the temptation to immediately convert that to a systemd timer
[12:22:58] * dhinus has resisted too many times :D
[12:23:12] we should puppetize it really
[12:23:38] T341814 :-)
[12:23:38] T341814: [cloudvps] puppetize the OpenTofu tests VM (tf-infra-test) - https://phabricator.wikimedia.org/T341814
[12:23:54] LOL
[12:24:26] hi, I have an instance that has been waiting for a migration confirmation since March 20th. It is up and I can ssh, but since it is in that weird state it does not show up in Prometheus :)
[12:24:36] https://horizon.wikimedia.org/project/instances/270b6533-dc99-4e5d-a642-c61138b11891/
[12:24:37] integration-agent-docker-1046.integration.eqiad1.wikimedia.cloud
[12:25:02] oh it is asking to confirm resize/migrate. I guess I could press Ok
[12:25:11] i was just about to ask if you tried the confirm button
[12:25:15] that's what i'd do
[12:26:22] I will try, I guess someone wanted to migrate/resize it and missed the confirmation
[12:32:41] the tfinfratest alert is gone
[12:37:52] taavi: while you're here, do you remember any quirks from the last time you failed over clouddumps100x?
[12:38:08] I found your patches from last time and prepared https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131051
[12:40:04] dhinus: what I wrote on the wiki last time should be it
[12:40:29] ack, this one right? https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Dumps#Failover
[12:40:36] exactly
[12:40:44] thanks!
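(Editor's note: as an aside on the tf-infra-test cron job mentioned above, a minimal systemd timer conversion could look like the sketch below. Unit names, the schedule, and the enable steps are assumptions; only the script path comes from the log, and the real fix should go through puppet per T341814.)

```sh
# Hypothetical unit names and schedule: a sketch of the timer conversion, not the actual puppetization.
cat > /etc/systemd/system/tf-infra-test.service <<'EOF'
[Unit]
Description=tf-infra-test OpenTofu checks against eqiad1

[Service]
Type=oneshot
ExecStart=/root/tf-infra-test/tofu-test.sh eqiad1
EOF

cat > /etc/systemd/system/tf-infra-test.timer <<'EOF'
[Unit]
Description=Periodic tf-infra-test run

[Timer]
OnCalendar=hourly
RandomizedDelaySec=10m

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now tf-infra-test.timer
```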
[12:41:48] the instance is fixed :)
[12:41:55] after I "confirmed" the resize
[12:42:03] I'm gonna merge my patch and check if active connections in 1001 slowly decrease
[13:18:55] so Prometheus would not collect from an instance that is pending a confirmation for migrate/resize :)
[13:19:10] I guess it has a filter for "state===ACTIVE" or something like that
[13:19:13] not a big deal
[13:42:07] yep, that sounds accurate yes
[13:48:16] since my instance has been pending a confirmation since March 20, it was no longer collected and eventually the metrics got garbage collected :b
[13:48:26] which is annoying, but really not the end of the world
[14:02:26] arturo: does the network change tomorrow require any config file changes? I'm just making sure things get forwarded to the dalmatian config
[14:06:37] update on my attempts at depooling clouddumps1001: Puppet changed the symlinks on all NFS workers, but did not restart the NFS service. maybe that's not required, but the number of connections hasn't changed.
[14:07:12] I'm not sure I understand all the subtleties of our NFS setup
[14:09:52] dhinus: "Puppet changed the symlinks on all NFS workers, but did not restart the NFS service" can you be more specific about what hosts you're talking about? You mean the symlinks on VMs that mount the dumps?
[14:10:00] And do you mean the NFS service on the server or the client VMs?
[14:10:06] yep, see the last comment in T383723
[14:10:07] T383723: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723
[14:10:19] for example in tools-k8s-worker-nfs-43
[14:10:39] it looks like NFS workers keep open NFS connections to BOTH clouddumps, regardless of which one is the active one
[14:11:01] I'm not sure how they're gonna react if the inactive clouddump is powered off
[14:11:04] yeah, that sounds familiar. I wouldn't expect a service restart to be needed, are they in a broken state now?
[14:11:14] no, they work fine
[14:11:19] ok :)
[14:11:29] but I'm scared of telling DCops to power down the inactive clouddump :)
[14:11:41] I guess if any jobs are still in progress from before the change then powering down the old host might cause workers to seize up.
[14:11:43] (or of powering it down myself :P)
[14:12:14] So I would wait an hour or two after the failover, then manually stop nfs on the host, then power down, then hand over to dcops.
[14:12:37] ('an hour or two' is a total guess. You could also wait overnight to be extra safe)
[14:12:40] makes sense. any way to check if there are "in progress jobs"?
[14:13:07] hm probably you can see connections from the server side. I don't know how to do that off the top of my head.
[14:14:17] I think I'll wait 24 hours to be as safe as possible, but if anybody knows how to do further checks let me know
[14:14:41] there are also a bunch of other hosts with currently active connections: https://phabricator.wikimedia.org/T383723#10637137
[14:14:51] (on top of the NFS workers)
[14:24:23] I would like to cause a few-minute outage of metricsinfra-db-1 in order to fix T390042; that will probably cause a gap in metrics. Is that OK?
[14:24:23] T390042: trove database flavors are out of sync with reality - https://phabricator.wikimedia.org/T390042
[14:28:21] andrewbogott: sure, it should not cause any gaps though, it might fail to reconfigure prometheus/alertmanager
[14:29:09] (the db is used to store the config to generate the prometheus/am configs, but not the stats)
[14:29:34] oh! ok, that's much less scary then.
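(Editor's note: on the earlier open question about checking for in-progress NFS activity from the server side, one possible approach is sketched below. It assumes the standard NFS port and a reasonably recent NFSv4 kernel server on the clouddumps host, and is untested against the actual clouddumps setup.)

```sh
# Run on the clouddumps host being depooled:
ss -tn state established '( sport = :2049 )'   # which clients still hold TCP connections to nfsd
ls /proc/fs/nfsd/clients/                      # NFSv4 clients known to the server (kernel >= 5.3)
cat /proc/fs/nfsd/clients/*/info               # client addresses and identifiers
nfsstat --server                               # op counters; run twice to see whether they are still increasing
```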
[14:31:11] should be fine yep :)
[15:00:53] omw to the network meeting
[15:01:57] folks, I'll be on in a sec, google meet is giving me some niggles here
[15:04:32] ok, I resized metricsinfra-db-1 and I'm done messing with it for now.
[16:49:32] chuckonwu: clearing the cache did indeed resolve my issue with lima-kilo!
[16:49:41] (I used "--no-cache" instead of clearing it manually)
[17:02:07] interesting, I wonder how github is cached in there xd
[17:17:15] maybe just a coincidence? but it did fail 3 times in a row before
[18:00:28] I created T390095 to keep a record of the issue, I'm pretty sure I had the same problem a few weeks ago
[18:00:28] T390095: [lima-kilo] ansible random timeouts downloading - https://phabricator.wikimedia.org/T390095
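(Editor's note: the metricsinfra-db-1 resize above was presumably done through trove; a rough sketch using the troveclient OpenStack CLI plugin follows. The flavor name is a placeholder and the actual steps taken may have differed, e.g. via Horizon.)

```sh
# Hypothetical flavor name: a sketch of a trove-level resize, not necessarily what was run.
openstack database instance list                       # shows the flavor trove thinks each DB instance uses
openstack database instance resize flavor metricsinfra-db-1 g4.cores2.ram4.disk20
openstack database instance show metricsinfra-db-1     # confirm the recorded flavor is back in sync
```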