[09:03:19] morning
[09:03:32] looking for reviews on https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1043744
[09:05:41] dhinus: I see an alert for tools-db-3 replication, is that expected / are you looking into it?
[09:16:36] taavi: not expected, looking
[09:19:17] 'Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug'
[09:19:44] I don't remember if I saw this one before
[09:20:14] START SLAVE resumed the replication
[09:26:12] yep it did happen before: T351457
[09:26:13] T351457: [toolsdb] Replication stopped because of invalid event - https://phabricator.wikimedia.org/T351457
[09:26:31] that's odd
[09:26:51] I did a quick google but did not find any clear explanation
[09:30:03] what's the right way to revert a revert in toolforge-deploy? e.g. https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commit/af86124a5a10668c0f3933217dda3df157ac8f46.
[09:30:04] or is it just a question of doing it manually?
[09:36:33] blancadesal: I think reverting again with the Options->Revert button would work, but I never tried
[09:37:40] are you concerned it would not deploy the right version?
[09:37:51] taavi: +1'd
[09:38:14] dhinus: I just never did a revert through the gitlab ui before
[09:39:16] me neither :D
[09:39:41] blancadesal: you want to roll back things to an older version again?
[09:40:35] taavi: I want to revert jobs-api to the version it was before it was rolled back to 0.0.305, so 0.0.308
[09:41:00] that's exactly what the commit you linked did
[09:41:15] the revert you created is rolling back to 305
[09:41:24] oh, okay, you already reverted the rollback
[09:41:40] yes, once I realized I was way too quick in doing that and should have used a valid YAML file to begin with
[09:42:01] my coffee was weak this morning it appears
[09:42:56] sorry about the confusion xd
[09:43:55] T367569 is the new real bug I ended up filing
[09:43:56] T367569: toolforge jobs load crashes if given YAML is an object (instead of an array) - https://phabricator.wikimedia.org/T367569
[09:44:23] 👍
[09:49:30] another deployment-related question: I deployed jobs-cli 16.0.11 on toolsbeta, found a bug while testing, reinstalled the previous version via cumin, merged a fix but should probably have bumped it again to 16.0.12 instead of amending 16.0.11 because now this conflicts with the already existing 16.0.11
[09:49:30] is there a way to remove a package, or should I just create a 16.0.12?
[09:50:41] if a 16.0.11 was already tagged in the jobs-cli repo, then yes, the next version should be 16.0.12
[09:53:12] ok
[10:20:48] Hello. Does anyone know if there is an easy way for us to find out which cloud projects are making the most calls to archiva? We have ~30k calls today from 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org) in archiva's logs. I'm wondering if we can find out which projects they were, or monitor usage in future. Thanks.
[10:27:32] btullis: not easy, but we could inspect the NAT table
[10:28:37] arturo: Thanks. I'm happy to have a go, if you give me some pointers. Perhaps setting up a monitor is easier than looking backwards?
[10:28:48] I see we still have cloud-cumin-03/04 in the cloudinfra project, are those still used for anything (now that cloudcumin1001/2001 exist) or could we just get rid of them entirely?
[10:29:03] taavi: I think we can delete them
[10:29:14] We're going to be deprecating archiva as soon as it's practical, so it would be useful for me to know which projects will be the most impacted.
[10:31:10] dhinus: T367725, will shut them down and delete if no-one notices the shutdown in a month
[10:31:12] T367725: Get rid of cloud-cumin VMs in cloudinfra project - https://phabricator.wikimedia.org/T367725
[10:31:18] sgtm, thanks
[10:31:30] btullis: try with something like `aborrero@cloudgw1002:~ $ sudo conntrack -L --dst 208.80.154.33`
[10:31:52] arturo: Awesome, thanks.
[10:31:54] spoiler: nothing at the moment
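To turn that one-off conntrack check into something closer to a per-source breakdown of the archiva traffic, a rough sketch along these lines could be run on the active cloudgw. This is an assumption-laden illustration rather than an existing tool: it assumes conntrack's default text output (src=/dst= fields) and reuses the archiva address from the example above.

    # count current NAT flow entries towards archiva, grouped by internal source address;
    # the awk keeps only the first src= field per flow (the original direction, i.e. the VM),
    # so the reply-direction src= (archiva itself) is not counted
    sudo conntrack -L --dst 208.80.154.33 2>/dev/null \
      | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^src=/) { sub("src=", "", $i); print $i; break } }' \
      | sort | uniq -c | sort -rn | head

Mapping those internal addresses back to instances and projects would still be a separate lookup (e.g. `openstack server list --all-projects --ip <address>` with admin credentials).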
[11:22:07] do we have a dashboard for per-cloudvirt capacity? basically I'm looking for a signal when I've moved enough VMs to the OVS hosts that I should move the next cloudvirt from linuxbridge to OVS
[11:32:39] we used to have a bunch of dashboards for cloudvirts, but it has been a while since I explored them
[11:33:53] the link I had saved a long time ago is dead now
[11:33:54] https://grafana.wikimedia.org/dashboard/db/wmcs-openstack-eqiad1?orgId=1
[11:37:52] also, looking for a review of https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/4
[11:41:37] taavi: LGTM
[11:42:55] taavi: re jobs-cli, I never actually pushed the 16.0.11 tag because I closed the MR after finding the bug: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/39
[11:43:09] does that change anything?
[11:43:54] blancadesal: so where exactly was that installed? just a hand-built deb hand-installed on the bastion?
[11:45:34] on the toolsbeta bastion yes, published via tools-services-05
[11:45:54] I reverted the install, but the package is still in the repo
[11:47:20] sigh
[11:47:56] if it was published to the apt repo, then I'd say the changelog and the matching git tag should be merged to main
[11:48:12] (as a side note, I'm not at all a fan of publishing apt packages *before* merging the relevant things to main)
[11:50:03] ok, I'll reopen the closed MR and merge then
[11:55:33] taavi: what would be the most straightforward way to test without publishing to the repo? wget from the MR pipeline to the bastion then apt install?
[11:56:15] clone the repo to your home directory and then install it in a venv? or that
[12:10:21] For old instances in openstack, I can just delete those from Horizon, along with associated objects, like proxies. Is there anything else that needs to be done?
[12:11:43] slyngs: if it's a project with a project-specific puppetserver, you might want to revoke the puppet certs for the to-be-deleted instance. but otherwise no, just deleting the instance in horizon is fine
[12:12:31] Cool, thanks, and no special puppet, just an instance for testing some deployment stuff last year
[12:37:34] taavi: feels? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/337
[13:05:54] ± kubectl get pods
[13:05:54] Unable to connect to the server: net/http: TLS handshake timeout
[13:06:09] I think I was just able to reproduce the toolforge outage from the other day on my laptop :-)
[13:06:47] * arturo food time
[13:30:07] dhinus: the mariadb replag alert for clouddb1017/s1 has been flapping all weekend, should we be worried? (or take a look?)
[13:30:59] taavi: I'll have a look
[13:36:29] it's using a lot of cpu, and there are a few active long-running queries... but both things could be normal
[13:37:13] the replication graph is not looking great, it started to struggle late on Friday
[13:38:30] https://grafana.wikimedia.org/goto/nZb63Y8Sg?orgId=1
[13:45:33] cpu usage is not actually that high because the host has a lot of cores (32) and most are idle
[13:45:57] I'll ask in #-data-persistence
[13:50:37] I wonder if this change in Quarry is related, but that seems odd https://github.com/toolforge/quarry/pull/51
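As a side note on the kind of first look taken above, a minimal sketch of the usual replica checks (assuming local root access to the mariadb instance on the host; clouddb hosts run multiple instances, so the exact socket or section flags may differ):

    # replication health: thread state, lag, and last error
    sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Running|Seconds_Behind|Last_.*Err'
    # longest-running active queries, which can hold replication back on a busy replica
    sudo mysql -e "SELECT id, user, time, LEFT(info, 80) AS query FROM information_schema.processlist WHERE command <> 'Sleep' ORDER BY time DESC LIMIT 10;"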
[14:42:07] taavi: latest discussion about cloudvirt-wdqs is https://phabricator.wikimedia.org/T324147
[14:48:57] andrewbogott: and when were these up for refresh (or decom?) exactly?
[14:49:32] Purchased 2019-10-16, 5 year refresh
[14:52:24] taavi: want me to create trove + magnum things so we can do test migrations?
[14:53:28] can you ping the wdqs people first?
[14:53:39] yep
[14:53:56] trove + magnum test things seem good too, but right now they're fairly low in my list of priorities
[14:57:25] ok
[14:57:37] If I create a VM today does it get scheduled on ovs or linuxbridge?
[14:57:49] andrewbogott: g4 ones on OVS, g3 ones on linuxbridge
[15:02:38] ok, as I hoped
[15:02:59] so it sounds like you can/should ignore -wdqs servers, I can do the decom stuff after the ticket is created.
[15:03:09] So, what else can I do to help?
[15:03:55] decom of those sounds good
[15:04:16] (those servers were purchased originally because someone, I don't remember who, was super worried about noisy neighbor issues. No idea if they ever actually turned out to be useful for high-precision performance tests.)
[15:05:10] hmmm
[15:05:23] so we need to get the announcement out, https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1046644 merged, and then we can start migrating projects
[15:08:40] Is merging the cookbook change just about satisfying the linter? I can certainly do that
[15:08:55] I'm also going to decom those cloudvirts before someone changes their mind
[15:11:09] yes
[15:23:50] ok, announcement sent
[15:27:12] looks good
[15:50:26] * arturo offline
[16:35:27] topranks: I just decom'd a few cloudvirts. My recollection is that not all the netbox bits are automated yet, do you mind checking to see if there's followup needed? https://phabricator.wikimedia.org/T367773
[16:36:03] andrewbogott: thanks, let me double check
[16:36:11] thank you!
[16:37:35] andrewbogott: looks ok at a glance, all the IPs have been removed which is the main thing, the servers have some vlan interfaces remaining but I don't think that's a problem
[16:38:05] topranks: ok, thanks for checking
[17:12:27] taavi: would you like me to drain and reimage some more cloudvirts today so that you have more room to move tomorrow? If the answer is 'yes' you'll need to tell me how to set up a new cloudvirt for OVS (other than via host aggregate)
[17:13:15] andrewbogott: yes, are the instructions at the bottom of T364457 good enough?
[17:13:16] T364457: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457
[17:14:32] taavi: yep! I'll give it a try.
[18:06:35] :( VMs with g3. flavors are migrating to ovs hosts
[18:17:24] andrewbogott: did you run the command that T364457 lists as the first step to run before draining the hosts?
[18:17:25] T364457: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457
[18:18:25] Ah, dammit, I didn't because phab started a new numbered list after the command...
[18:18:31] that would explain it :(
[18:18:44] Is there a reason to not run that right now for every remaining linuxbridge host?
[18:19:17] not really, except that initially that script seemed a bit scary
[18:19:30] ok
[18:19:32] can you get the now-broken VMs unbroken again?
[18:19:39] yeah, I'm in the process
[18:20:07] well, starting with the k8s nodes. If you can think of a clever way of listing them all I'll take it, otherwise I'll hunt them down in my backscroll
[18:21:27] https://phabricator.wikimedia.org/P65114
[18:21:45] I had a plumbing emergency in between running those commands so I am not at my best :(
[18:22:01] I would leave the k8s nodes for last, the cluster can handle a few nodes going down but the other nodes might not be as resilient
[18:22:08] * andrewbogott nods
[18:33:31] ok, mess unmessed
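On the "clever way of listing them all" question above: the list actually used here was the paste linked in the conversation, but one generic admin-side option is to list everything scheduled on a given hypervisor. A sketch, with a hypothetical cloudvirt hostname standing in for the affected hosts:

    # all VMs currently on one hypervisor, across all projects (admin credentials required)
    openstack server list --all-projects --long --host cloudvirt1031.eqiad.wmnet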
[18:52:52] andrewbogott: the metricsinfra trove db didn't seem to recover from your accidental migration
[19:00:58] due to a bug that I cannot convince the upstream devs is real :(
[19:02:39] I guess that's a thing that we have to remember about rebooting trove instances... gotta restart the agent afterwards.
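For completeness, a minimal sketch of checking on a Trove instance after an event like this (assuming the python-troveclient OSC plugin and credentials for the owning project; the instance name placeholder below is whatever metricsinfra's database instance is actually called):

    # status should return to its normal value (e.g. ACTIVE/HEALTHY) once the guest agent reports in
    openstack database instance list
    openstack database instance show <instance-name-or-id>

The guest agent runs inside the instance itself, so if the status stays stuck after a reboot, restarting the agent inside the guest (as noted above) is the usual fix.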