[08:00:25] FIRING: InterfaceSpeedError: brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06
[08:00:29] what is this?
[08:06:04] it is true, the interface shows up as 10Mb/s via ethtool
[08:06:06] https://www.irccloud.com/pastebin/uTynHS1J/
[08:28:41] arturo: I can find a few similar errors in Phabricator, and a task pointing to https://wikitech.wikimedia.org/wiki/Monitoring/check_eth#InterfaceSpeedError
[08:28:42] at least the canary VM doesn't have network connectivity, I'm draining the HV
[08:29:33] T353323
[08:29:33] T353323: Improve the InterfaceSpeedError alert - https://phabricator.wikimedia.org/T353323
[08:30:33] ack
[08:31:20] there was a phab ticket already created automatically: T368105
[08:31:21] T368105: InterfaceSpeedError brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T368105
[08:55:27] I'll reimage the host, I think it may be misconfigured
[09:36:02] arturo: that's very strange
[09:36:26] it is! also because it is the bridge device, which is 100% software-defined, no?
[09:37:45] arturo: ah ok!
[09:37:51] I hadn't seen that
[09:38:08] I'm not sure that's an issue, it may just be a quirk of using ethtool on a virtual device
[09:38:29] the VMs did not have connectivity, so I figured something was wrong for real
[09:38:37] traditionally, RJ45 ports could support the 10/100/1000Mb ethernet standards
[09:38:44] anyway, I just reimaged the server
[09:38:57] *however* with the SFP-based cloud switches we use copper SFP modules to connect 1G links
[09:39:24] which are typically fixed at 1G, i.e. there is no potential for an error to cause the link to fall back to a different speed but still work
[09:39:39] arturo: yeah, I logged on to mgmt to look and saw that, hopefully that sorts it out
[09:40:50] topranks: the server is now back online after the reimage, without the problem, so I assume it was some kind of misconfiguration
[09:41:23] yeah, some bug somewhere
[09:41:32] just looking at my machine here, the docker0 bridge reports a speed of 10G
[09:42:08] in terms of that alert, it's not so important now given we're moving to SFP-based switches everywhere
[09:42:29] it should probably ignore non-physical interfaces regardless, although I guess in this case it fired and there was an actual issue
[09:42:33] so maybe it can stay as it is?
[09:44:27] yeah, in this case it helped to surface an actual problem
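For context on the discussion above: the 1.25e+06 in the alert is presumably the link speed expressed in bytes per second, i.e. the same 10 Mb/s that ethtool reported. The "ignore non-physical interfaces" idea can be illustrated with a rough Python sketch: Linux exposes the speed under /sys/class/net/<iface>/speed, and purely virtual devices (brq* bridges, docker0, veth pairs, lo) resolve to /sys/devices/virtual/net, so they can be filtered out before comparing speeds. This is not the actual check_eth implementation, and the 1G expected speed is a placeholder based on the copper-SFP links mentioned above.

```python
#!/usr/bin/env python3
"""Rough sketch of an interface-speed check that skips virtual devices.

Not the real check_eth / Prometheus rule; it only illustrates the
"ignore non-physical interfaces" idea from the discussion. The expected
speed is a placeholder assumption (fixed 1G copper SFP links).
"""
import os

SYS_NET = "/sys/class/net"
EXPECTED_MBPS = 1000  # placeholder assumption


def is_physical(iface: str) -> bool:
    # Virtual devices (bridges like brq*, docker0, veth*, lo) live under
    # /sys/devices/virtual/net, so their resolved sysfs path contains "/virtual/".
    real_path = os.path.realpath(os.path.join(SYS_NET, iface))
    return "/virtual/" not in real_path


def speed_mbps(iface: str) -> int | None:
    # /sys/class/net/<iface>/speed is in Mb/s; it reads -1 (or raises EINVAL)
    # when the driver has no meaningful link speed, which is common for
    # software-defined devices.
    try:
        with open(os.path.join(SYS_NET, iface, "speed")) as f:
            value = int(f.read().strip())
    except OSError:
        return None
    return value if value > 0 else None


def main() -> None:
    for iface in sorted(os.listdir(SYS_NET)):
        if not is_physical(iface):
            continue  # skip brq*, docker0, veth*, lo, ...
        speed = speed_mbps(iface)
        if speed is not None and speed != EXPECTED_MBPS:
            print(f"{iface}: wrong speed {speed} Mb/s (expected {EXPECTED_MBPS})")


if __name__ == "__main__":
    main()
```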
[10:27:24] does anyone know what might be up with humaniki? https://wikimedia.slack.com/archives/C0153LQ5G82/p1718965568963049?thread_ts=1710868683.988599&cid=C0153LQ5G82
[10:32:43] blancadesal: I believe that error message is from nova-proxy when it cannot reach the backend
[10:32:47] maybe their VMs are down
[10:34:32] the backend VM was modified yesterday (!)
[10:35:06] humaniki-prod.wikidumpparse.eqiad1.wikimedia.cloud seems down
[10:37:43] I'm rebooting it
[10:39:49] it is failing to boot because of network problems
[10:41:32] other VMs on the hypervisor are just fine
[10:42:12] oh, the cookbook to migrate the project VMs to OVS failed yesterday
[10:42:15] 14:53 taavi@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=1)
[10:42:15] 14:53 taavi@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_project_to_ovs
[10:42:31] from https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidumpparse/SAL
[10:44:26] it failed to migrate because the VMs have an old g2. flavor
[10:45:13] so I think the problem is just that the VM instance is on the wrong hypervisor
[10:45:58] arturo: thank you for looking into this
[10:47:59] blancadesal: should be back online
[10:48:20] <3
[13:48:40] arturo: in theory the migration script doesn't move a VM if it doesn't know what flavor to use... so I'm not sure how humaniki-prod got moved there
[13:48:44] should I confirm the resize, for now?
[14:07:45] andrewbogott: yes
[14:07:59] not sure why it is waiting for confirmation
[14:08:09] I did not resize, just moved
[14:08:22] oh wait, maybe moving to another HV triggers a resize
[14:08:26] ah, a move happens via a resize, so the UI is the same :)
[14:08:33] ok
[14:08:42] is 1053 still suspect, or fixed after the reimage?
[14:08:55] (I'm wondering if I should pool it)
[14:08:57] fixed, I believe :-)
[14:09:11] I believe it is pooled again, but please double-check
[14:13:12] yep, it is. Just no new VMs since then, I guess :)
[14:13:38] thanks for dealing with that, I don't understand how it got imaged in that weird state, but I guess I need to check ssh'ing to the canaries as part of this process
[14:19:52] np
[14:52:10] * arturo off
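On the "move happens via resize" exchange at 14:08: Nova implements a cold migration as a resize onto the target hypervisor with the same flavor, so a migrated VM sits in VERIFY_RESIZE status until the resize is confirmed, which is why the UI shows the same confirmation prompt for both operations. A minimal, hypothetical sketch of confirming such pending migrations with openstacksdk might look like the following; it assumes a clouds.yaml entry named "eqiad1" with admin credentials, and it is not the wmcs cookbook code.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: confirm pending cold migrations with openstacksdk.

Assumes a clouds.yaml entry named "eqiad1" with admin credentials;
illustrative only, not the wmcs.openstack cookbook.
"""
import openstack


def confirm_pending_resizes(cloud_name: str = "eqiad1") -> None:
    conn = openstack.connect(cloud=cloud_name)
    # Cold migrations (and resizes) leave servers in VERIFY_RESIZE until
    # confirmed, hence the identical "confirm resize" prompt in the UI.
    for server in conn.compute.servers(all_projects=True, status="VERIFY_RESIZE"):
        print(f"confirming resize/migration for {server.name} ({server.id})")
        conn.compute.confirm_server_resize(server)


if __name__ == "__main__":
    confirm_pending_resizes()
```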