[08:01:57] I'm starting a schema change on s5 in eqiad in a bit
[10:56:12] Hi folks, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175120 please? Draining an SM node so it can have its disk controller swapped.
[11:33:57] looking
[11:34:43] I'm also starting a schema change in s1 codfw DC master in a bit for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175120
[11:35:41] Emperor: can/should I check if the host is depooled?
[11:50:48] federico3: no, it's a backend host so is never depooled (marking it to be drained will gradually remove it from the swift rings over the next couple of weeks)
[11:51:12] cf https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Removing_a_host
[13:17:25] FIRING: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:32:32] sorry, I'm an idiot - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175517 has the right syntax, if I could get a +1 ?
[13:39:49] ouch :(
[13:47:24] at least the fix is easy
[14:12:25] RESOLVED: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:05] that's a false positive, this is still broken, awaiting a review of my CR above
[14:16:25] FIRING: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:26] Emperor: did you get my +1?
[14:21:31] ah, oops, I missed the +1, done now
[14:47:20] TY
[14:52:18] ugh, hardware
[14:55:20] ms-be1091 has paths like pci-0000:98:00.0-sas-exp0x500304801ff9b73f-phy9-lun-0 and pci-0000:98:00.0-sas-exp0x500304801ffa4e3f-phy9-lun-0 ; ms-be2088 has instead paths like pci-0000:98:00.0-sas-exp0x500304801fd4573f-phy9-lun-0 and pci-0000:98:00.0-sas-exp0x500304801fd4903f-phy9-lun-0
[14:56:25] RESOLVED: SystemdUnitFailed: swift_ring_manager.service on thanos-fe1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:18] i.e. the JBOD controller seems to embed something like a varying serial number into the by-path entries
[15:07:32] I'm not sure how that's going to be possible to massage into something that doesn't involve us having to write a separate hosts.yaml stanza for each and every system (which will presumably have a different pair of serial numbers)
[15:51:02] I've attempted to summarise the issue in T401127 - any thoughts on non-hacky ways forward?
[15:51:02] T401127: Swift device facts / names for new JBOD controllers - https://phabricator.wikimedia.org/T401127
[15:51:07] urandom: ^-- ?
[21:29:03] E.mperor: I replied on ticket (unhelpfully)
[21:30:55] also, I'm not sure I understand the rationale for handling this from puppet (I'm probably missing some context)
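
For illustration only: a minimal sketch of one possible way to turn the by-path names quoted at [14:55:20] into host-independent ones. It replaces each expander address with its rank among the addresses seen on that host. It assumes, without any confirmation in the log or the ticket, that the numeric order of the two expander addresses is stable and maps consistently to the physical layout. `BY_PATH_RE` and `normalise_by_path` are hypothetical names, and the pattern is derived only from the four example paths above.

```python
import re

# Hypothetical pattern, derived only from the four example paths in the log.
BY_PATH_RE = re.compile(
    r"^(?P<prefix>pci-[0-9a-f:.]+-sas-exp)"
    r"(?P<addr>0x[0-9a-f]+)"
    r"(?P<suffix>-phy\d+-lun-\d+)$"
)


def normalise_by_path(paths):
    """Map each by-path name to one whose expander address is replaced by
    its rank (0, 1, ...) among the expander addresses seen on this host.

    Assumption: numeric ordering of the addresses is stable and meaningful.
    Names that do not match the pattern are returned unchanged.
    """
    matches = {p: BY_PATH_RE.match(p) for p in paths}
    addrs = sorted(
        {m["addr"] for m in matches.values() if m},
        key=lambda a: int(a, 16),
    )
    rank = {addr: i for i, addr in enumerate(addrs)}
    return {
        p: f"{m['prefix']}{rank[m['addr']]}{m['suffix']}" if m else p
        for p, m in matches.items()
    }


# Example paths copied from the log.
ms_be1091 = [
    "pci-0000:98:00.0-sas-exp0x500304801ff9b73f-phy9-lun-0",
    "pci-0000:98:00.0-sas-exp0x500304801ffa4e3f-phy9-lun-0",
]
ms_be2088 = [
    "pci-0000:98:00.0-sas-exp0x500304801fd4573f-phy9-lun-0",
    "pci-0000:98:00.0-sas-exp0x500304801fd4903f-phy9-lun-0",
]

for host, paths in (("ms-be1091", ms_be1091), ("ms-be2088", ms_be2088)):
    print(host, sorted(normalise_by_path(paths).values()))
```

Under that assumption both hosts end up with the same pair of names, so a single shared stanza could in principle cover them. Simply masking the address would not work, since both expanders on a host expose the same phy9 suffix and the two names would collide.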