[11:38:26] I think ganeti in codfw is kind of locked up migrating instance logstash2024.codfw.wmnet, which has been running since yesterday and has "memory transfer progress: 66179.58 %"
[11:39:14] I could try to cancel that job but I've no concrete understanding of the implications
[12:05:32] volans: have you seen something like that by chance ^^
[12:06:25] jayme: mmmh not my area of expertise, but I can have a look if akosiaris doesn't have an immediate answer (sorry for the ping but mor.itz is out)
[12:07:35] I asked him as well *fingers crossed*
[12:07:50] Ah yeah that's a known problem
[12:08:08] I'll have a look in like 10m
[12:08:22] you can also point me somewhere ofc
[12:08:27] glad it's known, if you want to share knowledge so that I can avoid the ping next time, I'm available for pairing ;)
[12:08:44] I couldn't seem to find anything about it
[12:08:45] OK I'll ping you
[12:09:19] We have deployed mitigations for it on a per-instance basis, but I'll share more in a few
[12:09:31] ack, thx
[12:22:36] volans: any idea why sre.hosts.downtime would throw this error when given `-t T303174`?
[12:22:37] `phabricator.APIError: ERR-CONDUIT-CORE: Monogram "T303174" does not identify a valid object.`
[12:22:50] https://phabricator.wikimedia.org/T303174 looks pretty valid to me
[12:22:58] kormat: it's security
[12:23:10] 🤦‍♀️
[12:23:18] so I guess the bot doesn't have access
[12:23:32] * kormat shoots the bot, securely
[12:23:56] ok, so. This is kinda known and we thought we had solved it but it shows up every now and then. The background is that when ganeti is migrating a VM from the primary node to the secondary node, it needs to copy over the entirety of the memory of the VM. This is in fact handled internally by KVM (ganeti just issues the commands with some parameters).
[12:24:40] the commands issued by ganeti are in the general form of: kvm -tons of other parameters -incoming tcp:0: on the secondary node
[12:24:58] kormat: that's https://phabricator.wikimedia.org/p/ops-monitoring-bot/ fwiw in case you want to make a case for changing its permissions
[12:25:19] and on the primary, ganeti connects to the monitor socket of kvm and issues something like migrate -d tcp::
[12:25:26] ok
[12:25:43] the monitor socket of kvm is configured by ganeti to be something like /var/run/ganeti/kvm-hypervisor/ctrl/logstash2024.codfw.wmnet.monitor
[12:26:15] that's the legacy one btw kvm-wise, there is also a newer one called .qmp but let's not dive into that right now
[12:26:33] lol
[12:26:36] so something like sudo socat STDIO /var/run/ganeti/kvm-hypervisor/ctrl/logstash2024.codfw.wmnet.monitor on the primary node gives you access to it
[12:26:46] you can run more commands there
[12:26:51] an interesting one is info migrate
[12:27:08] this is run periodically by ganeti and you see part of the output in ganeti job logs
[12:27:30] you can get the primary node of an instance with e.g. sudo gnt-instance list -o +pnode logstash2024.codfw.wmnet
[12:27:36] in this case it's ganeti2020
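Pulling the steps above together, checking on a stuck migration looks roughly like this; instance name, socket path, and node roles are the ones from this incident:

    # On the cluster master (ganeti2021 here): find the primary node of the instance
    sudo gnt-instance list -o +pnode logstash2024.codfw.wmnet
    # On that primary node (ganeti2020 here): attach to the KVM monitor socket
    sudo socat STDIO /var/run/ganeti/kvm-hypervisor/ctrl/logstash2024.codfw.wmnet.monitor
    # ...then type "info migrate" at the (qemu) prompt to see transfer progress and status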
[12:27:54] now here comes the tricky part
[12:28:03] there are 2 important limits here
[12:28:15] one is the speed at which we can copy memory over from one host to another
[12:28:37] this is currently set at 537.16 Mbps
[12:28:52] which is more than 50% of the network capacity of the node and is generally pretty sufficient
[12:29:10] the other limit is the amount of time that the VM can be paused for while the migration is ongoing
[12:29:15] the default is 30ms
[12:29:22] which is rather strict
[12:29:39] but on the other hand it makes most migrations almost packet-loss free
[12:29:59] BUT!
[12:30:03] ack so far
[12:31:05] if the memory in the VM gets altered a lot, faster than what the transfer rate and the expected downtime (those 30ms) allow
[12:31:13] the migration will never finish
[12:31:19] * Emperor having flashbacks to live-migration of Xen guests
[12:31:43] akosiaris: so if we could depool logstash2024 for a bit it should complete?
[12:32:02] typical examples of VMs whose memory gets altered fast enough to trigger this are logstash nodes, cassandras getting too much traffic, etc.
[12:32:10] volans: yes
[12:32:27] similarly if you just do a kill -STOP in the VM
[12:32:38] and then a kill -CONT after the migration is done
[12:32:41] it will also complete
[12:33:11] but the easier way is to just say that 30ms isn't enough and allow kvm to pause the VM for a longer amount of time
[12:33:16] and this is what I am going to do right now
[12:33:20] there is a kvm command for that
[12:34:17] ok
[12:34:52] ok
[12:34:56] is that per-VM?
[12:36:03] there is a default and a per-VM value
[12:36:07] sudo gnt-cluster info |grep downtime
[12:36:07] migration_downtime: 2000
[12:36:19] that's the default. That's in milliseconds
[12:36:29] as you can tell we have already increased it from the default of 30ms
[12:36:49] the 30ms is the software default, 2000ms is our cluster default
[12:36:50] there is also a migrate_set_downtime command in the monitor (via socat)
[12:36:58] that's the command I am gonna run
[12:36:59] but the help says
[12:37:00] migrate_set_downtime value -- set maximum tolerated downtime (in seconds) for migrations
[12:37:04] (seconds, vs ms)
[12:37:05] and this is clearly one time only
[12:37:16] ofc. Cause why would things be easy?
[12:37:22] and consistent
[12:37:35] lol
[12:38:00] [question] the cluster master is ganeti2021, but you're running the socat command on ganeti2020?
[12:38:20] yes, cause I need to talk to the local kvm process
[12:38:21] because 2020 is the primary node for logstash2024
[12:38:22] ok
[12:39:00] migrate_set_downtime 30
[12:39:00] migrate_set_downtime 30
[12:39:00] (qemu) info migrate
[12:39:01] Migration status: completed
[12:39:29] quick q from the sideline: Is it expected that this blocks jobs for other instances in the ganeti cluster as well?
[12:39:39] sudo gnt-instance info logstash2024.codfw.wmnet |grep down
[12:39:39] migration_downtime: default (2000)
[12:39:49] now, we clearly need a higher per-instance value for this one
[12:40:05] or alternatively raise the cluster default even more
[12:40:15] jayme: depends on where they are run
[12:40:34] if other instances run on a completely unrelated pair of nodes, no, they won't be blocked
[12:40:48] but if they run on the same node (either primary or secondary), yes
[12:42:14] ah, okay. ofc. I'm on 2021/2022 as well
[12:43:24] thanks for the help/explanation akosiaris!
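The migrate_set_downtime issued in the monitor above takes seconds and only applies to the KVM process it is sent to, so making the fix stick means raising Ganeti's migration_downtime hypervisor parameter (in milliseconds). A minimal sketch of the per-instance version follows; the gnt-instance modify syntax is an assumption based on standard Ganeti usage, not a command taken from this log:

    # Check the current per-instance value (falls back to the cluster default)
    sudo gnt-instance info logstash2024.codfw.wmnet | grep down
    # Assumed syntax: bump the tolerated downtime for this instance to 10000ms,
    # the same value the listing below shows for logstash2023 and puppetdb2002
    sudo gnt-instance modify -H migration_downtime=10000 logstash2024.codfw.wmnet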
[12:44:03] +1
[12:44:05] sudo gnt-instance list -o +hv/migration_downtime |grep -v 2000$
[12:44:05] Instance Hypervisor OS Primary_node Status Memory hv/migration_downtime
[12:44:05] logstash2023.codfw.wmnet kvm debootstrap+default ganeti2024.codfw.wmnet running 8.0G 10000
[12:44:05] puppetdb2002.codfw.wmnet kvm debootstrap+default ganeti2030.codfw.wmnet running 16.0G 10000
[12:44:30] so it looks like we have already bumped it for logstash2023 and puppetdb but not other nodes
[12:44:52] I am a bit reluctant to increase the default too much, but this per-instance thing isn't sustainable either
[12:45:19] unless it's set by the makevm cookbook maybe, but one would need to know a good value in advance, and that might be hard
[12:45:56] as you can tell, there is no way to guess a good value
[12:46:16] it's all about how fast the memory inside the VM changes
[12:46:28] which is (usually) a function of traffic received
[12:46:47] but can also happen because some large batch job is running on a VM
[12:47:04] I remember matomo was unmigratable during backups
[12:47:28] it was doing a mysqldump IIRC to the local disk and while that was happening, nope, no migration
[12:47:49] anyway, I'll bump the 2s another 50% to 3s and let's re-evaluate if it happens again.
[12:48:09] ack +2
[12:50:16] done and logged in SAL. Both eqiad and codfw. I think we have had no such problems yet on the caching PoPs?
[12:50:25] but we can do it there as well for consistency's sake
[12:50:32] not that I'm aware of
[12:50:35] also we have far fewer VMs there
[12:51:06] like https://netbox.wikimedia.org/virtualization/virtual-machines/?cluster_id=4
[12:52:43] akosiaris: would a value set per-VM by the makevm cookbook based on the memory size be of any help?
[12:52:49] the larger the memory, the larger the value
[13:05:53] maybe. But note there isn't a strict correlation here. It's about the rate of change, not the amount of memory used.
[13:07:43] yes I got that, but it's the closest thing I could think of that we know beforehand
[13:08:11] I get that if we have a VM with 100MB of RAM that continuously writes to memory it would have the problem anyway
[13:09:21] exactly. On the other hand, you are right that there is some correlation between a lot of memory and exhibiting that behavior
[13:09:42] more data to transfer, statistically speaking
[13:09:46] yeah
[13:09:52] or we have overprovisioned the VM :D
[13:11:35] depends. Adding a lot of memory to a VM that is serving a mostly read-only datastore would make it respond faster, but since it's mostly read-only it would not exhibit that behavior. But yeah, there are probably cases where we might have overprovisioned the VM
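The cluster-wide bump from 2s to 3s mentioned above would look something like the sketch below; the actual command used is not in the log, so the gnt-cluster modify invocation is an assumption based on standard Ganeti hypervisor-parameter syntax, while the verification line is the one quoted earlier:

    # Assumed syntax: raise the cluster-default tolerated downtime to 3000ms
    sudo gnt-cluster modify -H kvm:migration_downtime=3000
    # Verify the new default
    sudo gnt-cluster info | grep downtime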
[14:52:37] _joe_: Puppet has been failing on a lot of hosts for an hour or so. Output suggests a missing resource `User[mwbuilder]` in admin/groupmembers.pp. Maybe related to T303857?
[14:52:37] T303857: Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857
[15:08:58] <_joe_> cwhite: interview, but I'm fixing it soon(TM)
[15:10:33] <_joe_> basically I forgot the "deployment" group is also on the appservers
[15:12:22] <_joe_> thankfully the fix is simple, I just need to install a mwbuilder user on the appservers too with no permissions at all
[15:12:36] <_joe_> "simple"
[15:12:39] <_joe_> although
[15:12:50] <_joe_> why does the deployment group have access to all appservers?
[15:13:38] <_joe_> I guess it's a relic of the past, now most deployers just need to access mwdebug*
[15:18:54] <_joe_> the issue is that that group is used everywhere :/
[15:26:25] <_joe_> cwhite, godog https://gerrit.wikimedia.org/r/c/operations/puppet/+/778307/ should be the fix
[15:26:31] <_joe_> if you want to take a look as well :)
[15:26:39] cheers, checking _joe_
[15:28:16] <_joe_> uhm doesn't work
[15:28:22] <_joe_> I must have done something asinine
[15:28:44] <_joe_> yeah an easy fix
[16:29:32] _joe_: afaik deployers would run scap-pull on mwdebug*, and mwmaint* is not uncommon. I also sometimes run it on random appservers while testing, with the host depooled; however, for that to be possible the user also needs their own ssh access to be permitted there, which afaik we don't grant as part of deployment, so that's not actually needed. Assuming scap-deploy can work for SRE and perf-roots on appservers without this group, that'd be fine.
[16:29:59] I'm assuming you've also determined (I don't know off-hand) that this group isn't needed on the target side by scap in any way, e.g. around keyholder or some such.
[16:30:15] <_joe_> yeah it's not, but I didn't remove it
[16:30:30] <_joe_> not the day before going on vacation for 2.5 weeks :)
[16:31:51] * Krinkle imagines _joe_ sitting on a beach sipping a drink from a cut-open pineapple.
[16:33:12] btw, low prio, how would I go about testing this - and/or what would be your approach to reviewing a patch like this that modifies httpbb tests? https://gerrit.wikimedia.org/r/c/operations/puppet/+/778295
[16:45:03] Krinkle: are the tests expected to pass right now? or not until a code change is deployed?
[16:46:14] if they're supposed to pass now, you can dump those files in your homedir on deploy1002, and say `httpbb --hosts mw1414.eqiad.wmnet ~/test_whatever.yaml`
[16:46:47] (or --hosts the appserver of your choice, including e.g. an mwdebug host while testing a backport)
[16:48:50] rzl: should pass today indeed.
[16:49:03] thanks, okay, so this works standalone and is pre-installed there. awesome
[16:49:36] yep! there are a couple of special snowflake tests that need http auth credentials where we do some puppet weirdness, but you don't need any of that here, it'll just work
[16:51:04] the only other thing I'd look for in a review is: try to make sure you're testing everything you care about and nothing else :) we don't have CI on these yet, so when they break unexpectedly it can be a project to track down what's wrong
[16:51:17] ideally every assertion in there should be something that, if it came back false one day, would probably be an issue
[16:52:14] yeah, the main thing from my pov is that it's hard to have a simple standalone test that asserts static.php works without also picking a specific static file at random and hoping it won't change or get renamed, which would be a non-issue.
[16:52:26] short of planting a file specifically for this purpose :/
[16:52:30] should be rare enough
[16:52:37] nod, makes sense
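The deploy1002 workflow rzl describes above might look roughly like this sketch; the test file name, the URL, and the YAML field names are illustrative assumptions, so check the existing httpbb suites in the puppet repo rather than trusting them:

    # On deploy1002, with a hypothetical suite saved as ~/test_example.yaml
    # containing something along these lines (field names are assumptions):
    #
    #   https://en.wikipedia.org:
    #   - path: /wiki/Main_Page
    #     assert_status: 200
    #
    # run it against a single appserver (or an mwdebug host during a backport):
    httpbb --hosts mw1414.eqiad.wmnet ~/test_example.yaml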
[20:33:37] Hi, curious if anybody has any advice for troubleshooting a host that won't boot; the host in question is furud.codfw.wmnet, a backup hadoop client, so it shouldn't be an issue if it stays offline for a little bit (the host it is backing up is online). I ran `cookbook sre.hosts.reboot-single` and it looks like it will time out
[20:34:46] razzi: Are you able to get to the management console at all?
[20:35:04] yeah I'm in the management console, nothing's coming out of serial though
[20:37:00] You might want to try some power cycling options from here: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Power_cycling
[20:37:36] ok yeah btullis I'll give it a powercycle
[20:39:25] > Server power operation successful
[20:39:25] That's hopeful
[20:39:41] 👍 I'm pretty certain that furud is no longer important to us anyway
[20:40:29] ok serial console is spitting out text now, great
[20:41:34] ok the machine is back up. Thanks for the technical and emotional support btullis
[20:43:46] You're welcome. <3
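For reference, the power cycle and the serial-console check can also be done from a shell with ipmitool instead of the management web console; the .mgmt hostname pattern and credential handling below are assumptions, so defer to the Dell documentation page linked above:

    # ipmitool reads the password from the IPMI_PASSWORD environment variable via -E
    export IPMI_PASSWORD='...'   # management password (placeholder)
    ipmitool -I lanplus -H furud.mgmt.codfw.wmnet -U root -E chassis power status
    ipmitool -I lanplus -H furud.mgmt.codfw.wmnet -U root -E chassis power cycle
    # Watch the serial console come back after the power cycle
    ipmitool -I lanplus -H furud.mgmt.codfw.wmnet -U root -E sol activate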