[01:06:32] I still have a TODO to make a ticket and announce email for killing the *.labsdb DNS records. Not done, but not forgotten.
[01:06:51] * bd808 off until Monday
[10:57:28] folks I have to reset a line card in cr1-codfw, this may cause a momentary blip in comms to cloudsw1-b1-codfw
[10:57:38] the switch has a link to cr2-codfw so it will fail over very quickly
[11:02:28] ack, I don't think that should cause any issues
[11:52:08] there were a bunch of neutron alerts 1 min ago
[11:52:23] some have already cleared
[11:52:31] i'm looking
[11:53:58] neutron-rpc-server on cloudcontrols is having trouble talking to rabbitmq. I'll restart the rpc server service and then the struggling l3 agents
[11:56:03] sorry, linuxbridge agents, not the l3 ones
[11:59:20] didn't help, unfortunately
[12:09:53] the network tests are fine, so I'm not clear if there's any end-user impact or not
[12:10:43] i don't think there is to existing VMs, but any new VMs or security policy changes would not work properly
[12:11:17] I'm struggling to find any issues - the agents are saying that their rabbit messages aren't getting responses but the rpc server is running properly etc
[12:14:12] maybe we can try a rabbit restart as in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Rabbitmq
[12:16:15] "rabbit_diagnostics:maybe_stuck()" finds 1 suspicious process
[12:23:34] hmm I think a.ndrew restarted or reimaged two cloudrabbits yesterday
[12:23:55] related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/979127
[12:24:49] uptime is 24 days on cloudrabbit1001, and 16 hours on 1002 and 1003
[13:51:38] the neutron alerts seem to have mysteriously fixed themselves
[13:52:07] no, they're back :(
[14:10:10] dhinus: the cloudvirt1046 alerts are you, right?
[14:10:15] yes, on it
[14:11:32] I reimaged it this morning, but I haven't added it back to the pool
[14:11:56] yeah, and I guess you can't do that until the neutron thing is resolved?
[14:12:44] good question, I haven't tried though
[14:13:18] for neutron, I'm tempted to try a reboot of all cloudrabbits, but maybe andrewbogott has other ideas?
[14:21:46] dhinus: you want to try rebooting them or should I?
[14:22:13] I'll do it, I will try just a systemctl restart first
[14:23:08] systemctl restarted on 1001
[14:25:25] restarted on 1002
[14:28:26] restarted on 1003
[14:28:54] this command is still showing a few "suspicious processes": sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
[14:29:40] the output of "sudo rabbitmqctl cluster_status" seems ok
[14:38:08] I'm afk for another 30 but restarting neutron services is a good start
[14:39:41] I think taavi tried that already, but let's try one more time
[14:39:45] yes
[14:43:05] restarted neutron* on cloudnet1005
[15:20:00] 1005 seems to be quite upset. it's also refusing to start nova-api unlike the other nodes
[15:20:28] weird
[15:21:18] I'm also confused that the mariadb logs in 1005 have no events at all from yesterday until you stopped it
[15:22:28] I will send an update to cloud-announce
[15:22:38] thanks
[15:27:42] the remaining mystery is what happened to cloudcontrol1005
[15:29:06] `sudo systemctl start nova-api.service` does nothing for example
[15:34:12] I'm here now, reading the backscroll
[15:35:17] do you have a theory about whether this is a db issue or a rabbit issue? Seems unlikely to be both
[15:35:47] andrewbogott: current theory is a db issue on cloudcontrol1005. that host seems to be upset more generally
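For reference, a minimal sketch of the RabbitMQ checks and the haproxy depool flag used above, meant to be run one node at a time; the rabbitmq-server unit name and the repool-by-removing-the-file step are assumptions, and the wikitech Rabbitmq admin page linked above is the authoritative procedure:

    # look for stuck Erlang processes in the broker
    sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
    # confirm all three cloudrabbit nodes still see each other
    sudo rabbitmqctl cluster_status
    # restart the broker on one node at a time (unit name assumed)
    sudo systemctl restart rabbitmq-server
    # on a misbehaving cloudcontrol, take its galera backend out of haproxy
    sudo touch /tmp/galera.disabled
    # removing the file should let the health check repool it (assumed)
    sudo rm /tmp/galera.disabled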
[15:36:33] nova-conductor shows a db deadlock in cloudcontrol1005 at 11:50 UTC (when the first alert fired)
[15:36:33] the rabbitmq issues were just a symptom, the neutron rpc server was not able to respond because database writes were hanging
[15:36:34] ok, if we remove that from the equation entirely does everything start working
[15:36:36] ?
[15:36:52] e.g. sudo touch /tmp/galera.disabled
[15:37:04] yes, things did start working when I created the magic file (/tmp/galera.disabled) to take it out of haproxy config
[15:37:52] cool. So maybe we can just burn down the database there and recreate?
[15:38:02] Or do you think there are like hw issues or something with that server?
[15:38:40] not sure. the system seems to be more generally broken atm, for example trying to re-start nova-api.service on it hangs
[15:38:48] * taavi needs to step away for a minute
[15:38:54] it would if it can't talk to the db
[15:38:59] hang I mean
[15:39:43] no, even the process is not starting for some reason
[15:40:36] well that's unexpected
[15:41:02] But things are generally working as far as users are concerned?
[15:41:52] we can just reimage 1005 entirely
[15:42:18] I think user features are mostly ok, though I wonder if API calls to 1005 are failing
[15:43:24] surely it's not pooled by haproxy
[15:43:43] they're not, haproxy health checks are failing
[15:43:49] i'm not aware of anything that's not working
[15:43:57] (from the user perspective)
[15:44:31] great, then let's just reimage unless one of you wants to do more investigation first
[15:46:08] sure
[15:46:59] sgtm
[15:47:15] I'll do it :)
[15:50:01] side question: isn't systemctl forwarding logs from cloud* hosts to logstash?
[15:50:45] dhinus: there's a magic json file somewhere that controls which units get forwarded there
[15:51:38] /etc/rsyslog.lookup.d/lookup_table_output.json
[15:52:31] that includes e.g. nova-conductor but I'm not finding anything in logstash
[15:53:44] dhinus: https://logstash.wikimedia.org/app/dashboards#/view/3ef008b0-c871-11eb-ad54-8bb5fcb640c0?_g=h@865c245&_a=h@cd6aa03
[15:54:31] if you click on 'OpenStack Services -> Edit Filter' you can adjust which services it selects.
[15:54:42] cool, let's see what I was doing wrong!
[15:57:19] I cursed us by mention 0 alerts last night. Now there are more than 0 :(
[15:57:26] *mentioning
[15:57:29] :D
[16:00:56] maybe neutron just missed arturo and wanted some attention.
[16:01:42] * bd808 is only here to spread snark today
[16:01:46] yep
[16:01:58] :P
[16:02:21] dhinus: I can't explain why cloudvirt1046 works now, but I'm happy to see it back up! Maybe give it a couple more test reboots just to be sure it's reliable?
[16:02:23] I found what I was doing wrong in Logstash: I had not selected "ecs-*" as the "index prefix"
[16:03:07] the default "logstash-*" prefix contains only logs from mw and similar
[16:03:32] andrewbogott: I'll reboot it a couple times.
[16:04:01] I think Cole made that ecs dashboard so I never had to figure out how to search the firehose
[16:04:09] I wrote an alert for the galera failure case: https://gerrit.wikimedia.org/r/c/operations/alerts/+/979385/
[16:04:29] the reimage worked straight away this morning, no idea what has changed, but it's possible some fix was pushed by other SREs? I remember we suspected it was a combination of bookworm+hardware issues
[16:05:02] the dashboard is handy, if I remember to look for it :D
[16:09:37] dhinus: cloudvirt1046 is still alerting. I think you need to run the cookbook to schedule a canary VM
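As a quick sketch of the log-forwarding check described above: the rsyslog lookup table controls which units get shipped to logstash, and the forwarded entries land under the ecs-* index prefix rather than the default logstash-* one. The jq expression assumes the standard rsyslog lookup-table layout (a "table" array of index/value pairs):

    # see whether a given unit (e.g. nova-conductor) is in the forwarding table
    sudo grep nova-conductor /etc/rsyslog.lookup.d/lookup_table_output.json
    # or list every forwarded unit, assuming the usual rsyslog lookup-table layout
    sudo jq -r '.table[].index' /etc/rsyslog.lookup.d/lookup_table_output.json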
[16:09:58] yep, I did run it earlier but it failed because of the outage
[16:10:05] I'll reboot it first
[16:11:55] taavi andrewbogott do you know why wsrep_local_state_comment was showing "Synced" even when it was not?
[16:12:25] I'm now wondering how reliable that value is for checking if the cluster is in sync
[16:14:38] that's a good question. Is it possible that the server was sufficiently frozen that it was unable to update the wsrep state?
[16:18:53] theoretically possible I guess
[16:19:59] I suspect that variable is about the initial join sync, and not about ongoing writes
[16:21:13] does that mean it's probably good enough to be used in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrading_cloudservices_nodes?
[16:21:39] and for the ongoing writes we can rely on your new alert?
[16:24:53] we could add some more info to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Galera or to the Troubleshooting page linked there
[16:25:46] in other news, cloudvirt1046 is rebooted and back in the pool
[16:26:12] I think the most complete thing is to compare wsrep_cluster_conf_id on the three nodes, but I'm not sure if it's easy to do cross-host checks like that
[16:26:34] but yeah, with taavi's new check we should be in decent shape
[16:28:19] hmm there is still one alert in cloudvirt1046
[16:28:43] it might just need a few minutes
[16:29:21] I ran the unset_maintenance cookbook but I might need the ensure_canary too
[16:35:04] hmm the ensure_canary cookbook is not working for some reason (hanging right at the start)
[16:35:38] oh yep, there's definitely nothing running on 1046
[16:35:51] probably the cookbook is hardcoded to use 1005 for running commands
[16:35:54] 1005 will be up in a few
[16:36:01] I hope
[16:36:17] oh yes
[16:36:19] that's it
[16:36:30] I'll patch it to use 1006 in the meantime
[16:44:11] 1005 is up now, which seems to be crashing neutron on all the cloudvirts. That's not great
[16:46:09] running cookbook wmcs.openstack.restart_openstack --neutron --cluster-name eqiad1
[16:46:31] hmmmm "openstack hypervisor list" is not working on 1006 either
[16:46:49] probably related to the neutron error
[16:48:29] taavi, dhinus, can you look for whatever galera thing you saw on 1005 before and see if it's happening again? Meanwhile I'll pursue other angles...
[16:49:27] andrewbogott: no, galera looks ok on 1005
[16:49:29] 'hypervisor list' doesn't work but 'compute service list' does
[16:49:41] and I don't use hypervisor list usually so I'm not really familiar
[16:50:39] I don't see neutron actually being down anywhere other than alert manager
[16:50:44] e.g. it seems fine on 1055
[16:50:45] andrewbogott: neutron alerts look fine to me, am I missing something?
[16:51:09] I see 50+ alerts on the alert manager dash like "Neutron neutron-linuxbridge-agent on cloudvirt1055 is down"
[16:52:49] I used
[16:52:58] I used "hypervisor list" this morning and it was working
[16:53:01] yep. looking at prometheus directly it looks like something flapped at around 16:37 and then recovered after a minute
[16:53:36] dhinus: if you specify cloudvirt1046 to the canary cookbook does it skip the hypervisor list step?
[16:54:09] nope :(
[16:57:45] it uses it to verify the hostname is valid, I'm doing a quick hack to remove that check
[16:57:54] ok canary created!
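A minimal sketch of the galera checks discussed above (16:11-16:26): wsrep_local_state_comment reflects a node's own view of its sync state, while wsrep_cluster_conf_id and wsrep_cluster_size should match on all three cloudcontrols if they agree on cluster membership. The bare sudo mysql invocation assumes root socket auth on those hosts:

    # on each cloudcontrol: the node's own view of its sync state
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
    # values that should be identical across the three nodes
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_conf_id'"
    sudo mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"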
[16:58:27] I'm restarting nova services to see if that fixes 'hypervisor list'
[17:16:25] fwiw 'openstack hypervisor list' is still hanging
[17:18:31] I'm logging off shortly, and won't be around much tomorrow, fingers crossed things will not break again
[17:21:55] I'll keep looking. Have a good evening!
[17:22:23] thanks!
[19:40:19] bd808: I have a striker deployment puzzle that may interest you. In codfw1dev the container immediately exits with "manage.py runserver: error: unrecognized arguments: --nostatic"
[19:40:30] but the same image seems to work fine in eqiad
[19:40:44] and manage.py is surely /in/ the image...
[19:41:06] I'm hoping this will reveal some vast gap in my understanding of how docker works :)
[21:05:40] ceph seems to have just blipped and made at least some part of toolforge rather unhappy
[21:08:30] andrewbogott: I fear we have some NFS kicking to do
[21:13:00] bah just when I thought ceph had stopped breaking
[21:14:16] taavi: I don't think I was around for the last couple of these. Do we need to do things on the server, or reboot client VMs?
[21:16:15] andrewbogott: client VMs
[21:16:21] checker recovered...
[21:16:33] I'll start the k8s reboot cookbook, can you look if any grid nodes are struggling?
[21:18:15] I'm running "sudo cumin 'O{project:tools}' "ls /mnt/nfs/labstore-secondary-tools-project""
[21:18:23] assuming that failures and successes are OK and timeouts are not
[21:18:45] of which there seem to be 4
[21:20:36] timeouts are tools-sgeweblight-10-[18,21,32].tools.eqiad1.wikimedia.cloud
[21:20:47] do you typically drain+reboot or just reboot?
[21:21:22] for web nodes it's just fine to reboot, I doubt draining will do much to help
[21:21:48] 'k
[21:28:06] all the grid nodes are responding to my ls now. tools-k8s-worker-80.tools.eqiad1.wikimedia.cloud isn't...
[21:28:18] want me to reboot that one too, or is it getting caught by your cookbook soon?
[21:30:02] and scratch is failing on tools-k8s-etcd-[16-18].tools.eqiad1.wikimedia.cloud
[21:38:52] taavi: anything else? I'm not finding any more broken things at the moment outside of the k8s-worker things which I think are already getting rebooted
[21:39:34] andrewbogott: I think we're good, thanks. the k8s reboots will take a while but that's fine and the cookbook is very robust
[21:39:42] cool
[21:40:12] I'm going to wait 5 minutes and then go get groceries (still have my outdoor clothes on from previous attempt)
[21:41:50] sorry :/ and I'm going to go back to packing for SF
[21:43:38] I'm just glad I wasn't out the door yet :)
[22:11:10] andrewbogott: that codfw1dev behavior is very confusing to me as well. Probably worth a phab task. There must be something "interesting" going on, but off the top of my head I'm not sure at all what.
[23:32:03] It also only started happening yesterday when nothing changed and the last deployment was months ago.
[23:34:43] andrewbogott: wacky!
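For reference, the NFS liveness check used during the ceph blip above (21:18), as a reusable sketch; the explicit timeout wrapper is an assumption, the original run relied on cumin's own timeout handling:

    # ask every tools-project VM to list the NFS-backed project directory;
    # successes and failures are fine, timeouts mean a stuck NFS mount
    sudo cumin 'O{project:tools}' 'ls /mnt/nfs/labstore-secondary-tools-project'
    # optionally bound each ls explicitly (timeout(1) from coreutils)
    sudo cumin 'O{project:tools}' 'timeout 30 ls /mnt/nfs/labstore-secondary-tools-project'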