[09:27:15] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) 05Open→03In progress
[09:27:17] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) a:03dcaro
[09:27:28] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro)
[09:41:46] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) On the cli it shows no volumes attached to it, though the disk config is manual: ` root@cloudvirt1021:~# openstack --os-project-id mach...
[09:44:06] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) I tried to PXE boot and it didn't work, so I checked netbox and the interface listed looks weird: xe-4/0/010 https://netbox.wikimedia.org/dcim/...
[09:56:27] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) So I have deleted the /10 interface in netbox, and renamed the /010 to /10. Now homer offers me this diff: ` Changes for 1 devices: ['asw2-c-e...
[10:00:19] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10ayounsi) Looks good to deploy!
[10:02:21] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) ` elukey@asw2-c-eqiad> show interfaces descriptions xe-4/0/10 Interface Admin Link Description xe-4/0/10 up up ml-cache1002...
[10:08:19] interesting typo in one interface name --^ :D
[10:13:44] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (10elukey) @Cmjohnson I tried to reimage the node but I got `spicerack.dhcp.DHCPError: target file ttyS1-115200/ml-cache1002.conf exists`, I think that yo...
[10:28:31] Morning all!
[10:28:39] Up wild early for this flight
[10:29:00] wow :)
[10:29:04] have a nice flight chrisalbon !
[10:29:16] or have you already landed in NYC?
[10:29:30] Thanks! No I haven’t even gotten to the airport yet
[10:29:36] ack :)
[10:36:00] * elukey lunch
[10:57:00] Morning Chris, safe flight!
[12:26:57] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) p:05Triage→03Medium
[12:29:39] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) Horizon logging fixed, rechecking this. On the UI the volume seems attached (on the volumes tab): ` Displaying 1 item Instance Device...
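A rough sketch of the kind of CLI check referenced in the T304872 comments above; the exact command is truncated in the log, so the project name and invocation here are assumptions:

```
# Hypothetical re-check of the volume attachment state from the CLI, along
# the lines of the truncated command above; the project name is an assumption.
openstack --os-project-name machine-learning server show ml-sandbox
openstack --os-project-name machine-learning volume list
```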
[12:30:21] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) Detaching from the UI gives the error: ` Error: Unable to detach volume: Volume sandbox-srv on instance ml-sandbox ` Looking...
[14:33:32] elukey: I found a solution for the pickle issue, yay!! commenting in the review :)
[14:39:21] nice!!!
[15:32:51] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) I ended up updating the status of the volume itself in the database, essentially: ` root@cloudcontrol1005:~# cinder reset-state --state...
[16:06:18] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) a:05dcaro→03elukey
[16:06:28] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10dcaro) Re-assign to me if you still have issues, or just close if everything is ok :)
[16:12:23] aiko: your home dir should be on ml-sandbox!
[16:13:00] 10Machine-Learning-Team, 10Cloud-VPS, 10User-dcaro: Volume stuck for ml-sandbox.machine-learning.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T304872 (10elukey) 05In progress→03Resolved All good thanks!
[16:31:03] 10Machine-Learning-Team, 10articlequality-modeling: Articlequality model for nlwiki doesn't seem to track images correctly. - https://phabricator.wikimedia.org/T304973 (10Halfak)
[16:33:03] 10Machine-Learning-Team, 10articlequality-modeling: Articlequality model for nlwiki doesn't seem to track images correctly. - https://phabricator.wikimedia.org/T304973 (10Halfak) I think the right next step is to implement some tests to see if we detect the following image links: ` [[Bestand:Stevie Wonder 1967...
[16:42:05] klausman: https://phabricator.wikimedia.org/T304938 \o/
[16:42:28] I am wondering if we could use the reboot-single cookbook for this use case
[16:42:35] (for the ores nodes)
[16:43:45] for uwsgi it is sufficient to depool the node (and the cookbook takes care of it), for celery there is no graceful way of stopping it afaik (if it is processing something it will fail and another worker should pick up the job from the queue in theory)
[16:44:43] trying
[16:53:14] so it worked nicely for ores2001
[16:53:28] now I am wondering if the new kernel was rolled out for stretch nodes though
[16:55:24] seems so yes, proceeding with the reboots
[17:01:45] Ack
[17:02:15] (sorry, was busy devouring salsicca)
[17:03:01] I'll do the ml-staging bits in one go, since they don't do anything anyway
[17:04:10] ahahaha sorry for the ping during a delicate moment :D
[17:04:29] it is also fine tomorrow!
[17:04:36] I am going to log off in a bit
[17:04:43] roger
[17:05:29] for the ORES nodes we can split, I am doing codfw, if you want eqiad it is all yours :) (fine to reboot tomorrow/day-after-tomorrow)
[17:06:30] I'll need some guidance on the how/right order, but sure
[17:07:11] ah yes I am just rebooting one at a time, using the --depool option
[17:07:17] and watching metrics
[17:07:20] nothing special :)
[17:10:39] I presume we'll do the non-staging ml machines during the re-init for IPs?
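A minimal sketch of the reset-state workaround described in the T304872 comment at 15:32 above; the real command is truncated in the log, so the target state and placeholder volume ID here are assumptions:

```
# Hypothetical shape of the workaround from T304872; the target state and
# placeholder volume ID are assumptions (the real command is truncated above).
cinder reset-state --state available <volume-id>
# If the attachment record is also stale, it can be reset separately:
# cinder reset-state --attach-status detached <volume-id>
```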
[17:11:12] yeah I think we can even bump the os to bullseye if we want (for the VMs)
[17:11:23] the workers are already good
[17:11:32] Even the etcd VMs?
[17:12:19] nono those still on buster
[17:12:28] I meant the masters
[17:12:40] yeah, sounds like a good plan
[17:13:21] ack, going afk, enjoy your evening!
[17:13:47] \o
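For the per-host ORES reboots discussed above, the cookbook invocation would look roughly like the sketch below; the cookbook name and flag come from the conversation, while the exact argument order and the example host are assumptions:

```
# Hypothetical per-host reboot with depool/repool handled by the cookbook;
# argument order and the example host are assumptions based on the discussion.
sudo cookbook sre.hosts.reboot-single --depool ores2002.codfw.wmnet
```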