[07:11:27] morning
[07:11:37] i'm looking at the puppet errors
[07:19:12] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node tools-puppetserver-01.tools.eqiad1.wikimedia.cloud: Exception while executing '/usr/local/bin/puppet-enc': Cannot run program "/usr/local/bin/puppet-enc" (in directory "."): error=0, Failed to exec spawn helper: pid: 1811354, exit
[07:19:13] value: 1
[07:19:57] seems like a puppetserver restart fixes that, but weirdly it's happening on multiple puppetservers
[08:21:41] does it say anything else in dmesg/journal? is it hitting any kind of limit? (open files/processes/...)
[08:22:06] morning o/
[08:25:29] dcaro: i did not have time to look at that deeply yet. tf-puppetserver-1.terraform.eqiad1.wikimedia.cloud is currently broken if you want to have a look
[08:27:04] ack
[08:50:31] did the servers get a new version of java or similar? it seems it fails to load some java helper files when trying to spawn the process
[08:52:37] https://www.irccloud.com/pastebin/0MvgMvl1/
[08:52:40] probably that
[08:54:12] oh that could definitely be it
[08:57:50] there was a security update for openjdk-17 yesterday, the cloud VPS puppet servers most probably got auto-upgraded via unattended-upgrades and ran into a variant of https://phabricator.wikimedia.org/T357900
[08:59:01] when we ran into this with the January Java sec updates, the effect was that the puppet servers were broken after the JRE update until we restarted puppetserver.service
[09:01:13] sounds similar yes, though maybe the way cloud sets up the puppetservers has changed now, and it does not restart itself anymore as the task says (or fails to do so)
[09:01:49] would the wmf-auto-restart script have automatically fixed this eventually?
[09:02:24] there's no wmf-auto-restart for puppetserver though
[09:02:36] https://www.irccloud.com/pastebin/7eKse2jD/
[09:03:23] if there were one, it should have fixed it I think (the process has deleted open file handles)
[09:07:57] this should enable the autorestart https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023383
[09:10:43] dcaro: LGTM
[09:11:53] i am moving a second codfw1dev cloudvirt to the openvswitch agent
[09:12:05] taavi: ack
[09:12:12] is your plan to leave codfw1dev fully on ovs?
[09:13:12] not sure yet - i have a few open questions for netops that i was planning to ask in the sync meeting tomorrow
[09:13:26] ok
[09:15:05] but for now I mostly want to experiment with how to migrate a host from one to the other, and how well a temporarily mixed setup would work
[09:16:17] in the new puppet7 setup, how should I test a patch? (can I just cherry-pick it in the /srv/git/operations/puppet repo?)
[09:17:23] taavi: ack
[09:25:14] did that, seemed to work
[09:33:57] xd, that patch actually only helps when the config changes; sent another one adding the auto_restart timer (tested it too, it worked)
[09:34:21] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023386/1
[09:36:01] dcaro: also LGTM
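
A rough sketch of the check being discussed, assuming the standard service name and /proc layout (this is not the actual wmf-auto-restart script): after a JRE upgrade, the running puppetserver JVM still maps files that were deleted by the package upgrade, and spotting those is enough to know a restart is needed.

    # run as root on the puppet server; service name and paths are assumptions
    pid=$(systemctl show --property MainPID --value puppetserver.service)
    if grep -q ' (deleted)' "/proc/${pid}/maps"; then
        systemctl restart puppetserver.service   # pick up the upgraded JRE
    fi
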
[13:23:41] is there a way to tell openstack to delete and re-create a server on the qemu level while keeping all the data intact?
[13:23:57] stop and start?
[13:24:06] as in hard reboot
[13:24:26] (not sure that's what you want)
[13:25:09] i don't think that does what I want, it stops the qemu process but things like `virsh` are still clearly aware of its existence
[13:25:38] hmm, they should not, as after that it might be started on a different node
[13:26:14] (so it will have to create the libvirt xml definition again)
[13:27:03] maybe there's some "just in case, don't delete the definition" kind of cache
[13:27:16] just a regular `openstack server stop` doesn't stop it that hard
[13:27:18] you can try migrating it, that should for sure create it anew on the other host
[13:27:56] but I need it to run on this specific host
[13:29:27] hmm
[13:29:42] I think openstack should be able to recreate the libvirt domain if you delete it manually
[13:31:22] andrewbogott or arturo might know better though
[13:32:29] that seems to have done it!
[13:33:36] nice, is everything where it should be? (attachments, IPs, ...)
[13:37:15] the disk is still there. no network connectivity, but that's the thing I'm debugging, so not very surprised there :/
[13:39:58] taavi: just curious (now that I'm logged back into irc), did you do 'virsh destroy'? Or something more drastic?
[13:41:43] andrewbogott: yeah, that.
[13:42:16] Huh, I wonder if that's different from reboot --hard
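
The exact commands used weren't pasted, but a hedged sketch of the destroy-and-recreate approach described above could look like this (the libvirt domain name is a placeholder; the UUID is the instance mentioned later in the log):

    # on the cloudvirt hosting the VM: stop qemu and drop the libvirt XML only,
    # the disk itself is untouched
    virsh destroy instance-0000abcd
    virsh undefine instance-0000abcd
    # then, with openstack credentials, let nova rebuild the domain definition
    openstack server reboot --hard 1811ebc0-bdc1-411d-8f67-173e8edd05c8
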
[13:43:44] dhinus: are you available/willing to glance at the clouddb replag alerts?
[13:44:05] so for context, I have an instance that I migrated from a linuxbridge-agent-based host to an ovs-agent-based host and now it's not able to connect to the flat vlan anymore (while other instances on the same ovs-based host are)
[13:44:06] It may be that it's catching up already, I foolishly didn't note down the lag last night
[13:44:28] I can have a look
[13:44:35] taavi: isn't that the opposite direction from what we'll actually do when/if we migrate?
[13:44:39] thx dhinus
[13:44:56] you might be able to find the progress of the lag in the alert logs in logstash
[13:46:03] they seem to come from icinga, I guess we should migrate them to prometheus at some point
[13:47:35] oops, yep, we should :)
[13:47:57] andrewbogott: no?
[13:48:22] taavi: oh you're right, I mentally exchanged ovs and linuxbridge for a second there
[13:51:21] taavi: so probably there's some bit of linuxbridge config that follows that VM when it migrates... I'm thinking about where to hunt in the db to find that difference.
[13:51:46] Can you tell me the ID of a working ovs VM and the non-working migrated one?
[13:52:01] I can see the lag is increasing when refreshing this page https://replag.toolforge.org/
[13:52:06] yeah. to make it a bit more complicated, for this VM i tried deleting and re-creating the Port object already. there's probably some vm where the original port is still there that can make debugging easier
[13:52:47] dhinus: https://orchestrator.wikimedia.org/web/cluster/alias/s2 shows the issue seems to start from db1155, so a question for -data-persistence probably
[13:53:15] thanks!
[13:53:25] they're already on it :)
[13:53:34] "corruption on db1156"
[13:53:43] taavi: sometimes the API persists the initial setup command and reuses it when resources are recreated, I'm trying to remember where that is
[13:53:50] 'corruption' doesn't sound great
[13:54:07] andrewbogott: the instance in question is 1811ebc0-bdc1-411d-8f67-173e8edd05c8 if you want to have a look
[13:54:09] dhinus: is there a task?
[13:55:02] nope, but there's an active alert in data-persistence and Amir said he's on it ("probably just needs an index rebuild")
[13:55:30] ok. Thanks for looking
[13:55:33] maybe I'm misreading the conversation there, but they're definitely aware
[13:56:09] yeah, i was mostly thinking it would be useful to have one to refer people to, since most people asking about the replag will likely come to us and not to the DBAs
[13:56:54] ack. I'll ask them to create a task if the thing is not resolved quickly
[13:59:26] already fixed, lag is back to 0
[14:02:33] well there's nothing about initial network config in instance_extra, which is where I was expecting it
[14:04:18] taavi: is that ovs instance in codfw1dev?
[14:04:34] yes, cloudvirt2001-dev
[15:39:00] * arturo offline
[16:42:31] * dcaro off
[16:42:34] cya tomorrow
[18:18:49] * bd808 lunch
[19:51:31] The volume of connectivity-related restarts for stewardbots () and bridgebot () recently makes the WMCS<->World network edge look a bit flaky.
[21:09:08] we have reviewed it several times, and found no problems. I'm starting to suspect k8s
[21:14:21] My skills in this area are so out of practice that I don't feel like I can be of much help without a major investment in learning about the specific software-defined networking things we use. :/
[21:19:26] also, I wonder if the prod edge network may have anything to do with it
[21:22:44] especially the SUL watcher: if it is hitting a wikiland endpoint, those may have strict rate limits and such. We already saw some tools hitting the limits, and even worse, a single tool on a k8s worker node depleting the wikiland network quota and not letting the other tools running on the same worker node connect to any endpoint
[23:51:15] a.rturo: I poked Otto on a Phab task about the SUL Watcher 429 problems to get his help in figuring out if the rate limit is one bucket for all of Cloud VPS.
[23:51:18] * bd808 off
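
A generic illustration of the rate-limit point above (not the SUL Watcher's actual code; the endpoint and timings are only examples): a client sharing an egress IP can back off when it sees HTTP 429 instead of hammering the API.

    # bash sketch: retry with exponential backoff while the API rate-limits us
    url='https://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&format=json'
    for attempt in 1 2 3 4 5; do
        status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        [ "$status" != "429" ] && break
        sleep $(( 2 ** attempt ))   # wait longer before each retry
    done
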