[07:04:12] greetings
[07:57:39] morning!
[08:08:29] morning
[08:57:44] I just merged the libvirt cfssl patches btw, in case you see sth out of the ordinary
[08:57:54] ack
[08:57:58] * dcaro paged
[08:59:29] nova-compute reports being down
[08:59:37] sigh
[08:59:49] stopping puppet on cloudvirt
[09:00:19] libvirt.libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
[09:00:41] libvirtd is failing to start
[09:01:03] mmhh I think that's a puppet race, I'll take a look
[09:01:19] xd /me getting spammed with pages
[09:02:07] should we revert?
[09:02:42] VMs are still running though
[09:03:03] Aug 25 08:56:46 cloudvirt1059 libvirtd[3918460]: Unable to import server certificate /var/lib/nova/clientcert.pem
[09:03:06] I'd like to try rolling forward first and then revert
[09:03:07] they are failing with this
[09:03:11] okok
[09:03:28] I'm acking pages/alerts, let me know if I can help with anything else
[09:03:35] thank you dcaro! will do
[09:06:09] ok I have a "fix": run puppet again, stop libvirtd-tls.socket + libvirtd, then start libvirtd-tls.socket again
[09:06:21] will do a cumin run
[09:06:45] dcaro: there might be other pages coming in, apologies in advance
[09:06:54] np
[09:07:25] unless there's an easy way to silence the icinga alerts ?
[09:08:59] I think it's possible to create a silence on the icinga UI directly
[09:09:04] but it's ok, I'll just manually ack
[09:09:09] thank you
[09:09:17] no need to figure it out right now xd
[09:10:44] I refused to learn how to silence things in icinga given it was supposed to be deprecated "soon" :D
[09:11:03] good idea
[09:11:21] xd
[09:11:23] * dhinus 10 years later...
[09:11:43] the crystal ball tells me it was in fact not a good idea :D
[09:11:44] I'm also tempted to ditch the check_procs paging check and favor higher level signals
[09:11:53] will open a task about it
[09:16:01] server live migration works, tested on 1f4d88d7-5448-4f89-b22c-0366d8d9424e
[09:16:54] I think I have now silenced them xd
[09:17:02] (for 1h, not sure if that was helpful though)
[09:17:21] yes I think it was, I'm almost done
[09:17:42] you have to search on icinga for the alert, then select all the results, and on the upper right 'schedule downtime for checked service'
[09:18:22] * godog nods
[09:18:26] dcaro: thanks
[09:18:52] there might be an easier way to select all the alerts to downtime, but that worked ok
[09:19:04] do you know/remember what the nova-compute paging check was for ?
[09:19:20] if not that's fine too, VMs were actually ok without libvirt or nova-compute
[09:20:29] for us to notice that something was wrong, I think in the past we did not notice that nova was not starting, and let it go for too long
[09:21:26] it essentially prevents users from managing VMs (and prevents any other service that needs nova information, like magnum/trove/etc., from modifying/managing the current state)
[09:22:01] ok thank you that makes sense
[09:22:34] I'm happy to discuss if it should be a non-paging/warning instead
[09:23:01] in the end it's fooling around instead of actually defining SLOs xd
[09:23:23] heheh very true
[09:32:45] T402778 the tracking task
[09:32:46] T402778: Evaluate higher level signals for nova troubles rather than paging on nova-compute down - https://phabricator.wikimedia.org/T402778
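A rough sketch of the cumin run described at 09:06, assuming a cloudvirt host alias and the standard run-puppet-agent wrapper; the unit names are the ones quoted above, everything else is an assumption rather than what was actually typed:

    # run from a cumin host; 'A:cloudvirt' is an assumed alias for the cloudvirt fleet
    sudo cumin 'A:cloudvirt' 'run-puppet-agent'
    # stop the TLS socket and the daemon, then bring the socket back up
    sudo cumin 'A:cloudvirt' 'systemctl stop libvirtd-tls.socket libvirtd.service'
    sudo cumin 'A:cloudvirt' 'systemctl start libvirtd-tls.socket'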
[11:01:57] tools-harbor-1 is still sending puppet emails, can we at least shut it down as discussed in the meeting last week?
[11:16:20] yes please, cc dcaro Raymond_Ndibe ^
[11:16:38] I can do
[11:23:47] done
[13:07:44] taavi and/or topranks, can one of you check the jumbo frames situation on cloudvirt1045? That's the last one in the batch that doesn't pass the ping test T378828
[13:07:45] T378828: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828
[13:08:26] I will in 10
[13:08:57] jumbos are enabled globally so there won't be any issue
[13:09:20] cloudvirt1045 has no second-link configured in netbox however, so the storage network will be offline
[13:17:07] topranks: :( so back to dcops for that?
[13:17:16] * andrewbogott should know to check that by now
[13:17:24] Yeah I commented on the task
[13:18:28] thx
[14:09:09] topranks: want to follow along while I pool another drive on a 25G host?
[14:09:32] If indeed it alerts but nothing is actually wrong we might need to adjust alert thresholds
[16:02:07] the link will max - the alerts will fire - that's a given
[16:02:21] sorry I only saw this now for some reason
[16:13:37] * dhinus off
[17:09:34] * dcaro off
[17:29:35] sorry topranks I went to lunch :/ are you still around?
[17:30:24] yep still around if you want to do it
[17:30:55] I do! Here goes :)
[17:34:49] hm, ceph is reporting a lot of slow ops so I'm turning off rebalance for the moment
[17:37:05] just saw, anything triggered it?
[17:37:30] ` (muted: DAEMON_OLD_VERSION)`
[17:37:34] in ceph status, that's weird
[17:38:41] dcaro: yes, I muted it.
[17:38:48] It's because of the one test node running Q
[17:39:01] there's a long (maybe one-week?) delay before the alert fired
[17:39:06] so I silenced it since it's expected
[17:39:51] ack
[17:40:26] I think VMs might start to get affected (instancedown tools-k8s-worker-nfs-67)
[17:40:37] I don't like that we immediately get slow ops/inactive pgs as soon as we pool one of those 25G osd nodes :/
[17:40:55] yeah, I'm trying to reduce the rate of rebalance by a lot, hoping ceph will settle
[17:41:35] ok, yep, now it's down to only 2 inactive pgs
[17:41:35] okok, it's calming down
[17:41:48] is it saturating some nic or something?
[17:42:34] topranks can explain about the qos again and maybe I'll understand it this time. My suspicion would be that it's throttling to a fraction of 25G on the new OSD but that's still too much for the rest of the cluster which is still on 10G?
[17:42:50] https://usercontent.irccloud-cdn.com/file/s4Snl5Nf/image.png
[17:43:11] there's certainly strain on the network, et-0/0/55 seems to be getting to the limit
[17:43:29] I see no errors though
[17:43:43] nothing gets "throttled"
[17:44:20] if too much bandwidth gets queued up to be sent on an interface - and some of it therefore has to be dropped - the QoS will make sure each defined "category" gets a specific percentage of the available bw
[17:44:41] and in our case it should give a higher priority to the mon traffic (so we shouldn't have those probes dropped)
[17:44:49] topranks: ok, that makes me think I don't know what throttled means :/ what's the distinction?
[17:45:09] and a lower priority to the other ceph traffic (which means the other WMCS "everyday" traffic won't be squeezed as much as if we didn't have QoS)
[17:45:20] ceph still says '2 pgs inactive' and I don't understand why it isn't prioritizing fixing those
[17:45:39] throttled as I understand the word would be if we set some rule like "give 2Gb of traffic to this flow, throttle it if the speed goes beyond that"
[17:46:26] hm, ok
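A sketch of the standard Ceph CLI checks behind the slow-ops / "2 pgs inactive" observations above, run from a mon or admin node; these are generic commands for context, not a transcript of what was actually typed:

    # overall cluster state, including slow ops and inactive pgs
    ceph status
    ceph health detail
    # which placement groups are stuck, and in which state
    ceph pg dump_stuck inactive
    # per-pool client vs recovery/backfill throughput
    ceph osd pool stats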
[17:46:42] so... this all makes me think that whatever is going wrong with ceph is not a network bandwidth thing
[17:49:46] The C8 -> E4 40G link is the only one I see maxed out
[17:49:47] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-1h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-c8-eqiad:9804&var-interface=et-0%2F0%2F52&refresh=30s
[17:50:12] I mean more bandwidth never hurts :)
[17:50:28] oh, and it's /still/ maxed out even though I reduced the rebalance to a single OSD :/
[17:50:40] but I'm not sure what's going wrong with ceph or what it's complaining about... in theory it should just be doing what it needs, and whatever speed it is getting dictates how long those tasks take
[17:50:52] yeah
[17:51:00] it could be bottlenecked on other things like ram or cpu
[17:51:58] the current two osds with slow ops are on cloudcephosd1040 (E4) and cloudcephosd1015 (D5)
[17:55:10] bah, it was down to 8 slow ops, now back up to 54
[17:55:42] dcaro: if I set it to 'norebalance' will it calm down a bit? Or will that just prevent it from resolving the slowness?
[17:56:11] Or should I just not care about the slow ops and just leave it to do its thing?
[17:57:08] cloudcephosd1040 seems to be waiting for cloudcephosd1007 to peer with it
[17:57:31] (one of the slow ops, `root@cloudcephosd1040:~# ceph daemon osd.308 dump_ops_in_flight | vim -`)
[17:57:58] hm, that shouldn't be a problem (1040 <-> 1007)
[17:58:58] * andrewbogott sets 'norebalance'
[17:59:32] hm that does not seem to actually have stopped it from rebalancing
[18:01:52] oh, there we go
[18:03:00] let me try to restart one of the stuck osds, that might force it to retry the operations
[18:04:49] you restarted 308?
[18:04:52] yep
[18:05:01] seems to have helped
[18:05:09] seems to have helped, want to do 52?
[18:05:16] on it
[18:05:37] note that if there were many, or the cluster was in a worse state, that could make it even worse though
[18:05:52] yeah
[18:07:13] there we go
[18:07:41] and back on track :)
[18:07:50] ok, going to turn off norebalance
[18:08:05] there might be more nuanced issues than just the mon traffic :/
[18:08:40] so those two OSDs essentially crashed during the rebalance flood
[18:09:46] kinda, not completely crashed, but misbehaving
[18:12:04] all the operations (that I can see from the scroll so far) stuck on cloudcephosd1040 seemed to be stuck on peering
[18:12:05] right, I guess crashing would've been better
[18:12:25] right, I guess crashing would've been better
[18:12:28] oops
[18:12:33] https://phabricator.wikimedia.org/P81738
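The unblocking sequence above, condensed into a rough sketch; osd.308 lives on cloudcephosd1040 here, so the daemon and systemctl commands run on that host while the flag commands can run from any admin node:

    # pause data movement while poking at the stuck OSDs
    ceph osd set norebalance
    # on the OSD's host: see what the in-flight ops are blocked on (peering, in this case)
    ceph daemon osd.308 dump_ops_in_flight
    # restart the misbehaving daemon so it re-peers and retries the stuck ops
    systemctl restart ceph-osd@308
    # once the slow ops clear, allow rebalancing again
    ceph osd unset norebalance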
[18:12:58] ummm is toolsdb actually broken or is that just a stale alert?
[18:13:08] at least we did not see the 'heartbeat' storm :), so that's an improvement
[18:13:21] it might be broken, some stuff started failing
[18:13:29] 3 [N ] 08/25 20:02 root@wmflabs.org [Cloud-admin-feed] [FIRING:1] HarborComponentDown tools (database tools-harbor-2 harbor toolforge,build_serv
[18:13:32] for example
[18:13:45] all the other alerts cleared though
[18:14:32] I don't see that alert on karma
[18:14:53] "There should be exactly one writable MariaDB instance instead of 0"
[18:14:54] oh no
[18:15:12] if mysql restarted it would have come up as read only
[18:15:51] yeah, that's likely
[18:16:05] * andrewbogott can't tell which is the primary
[18:16:39] root@cloudcumin1001:~# cookbook wmcs.toolforge.toolsdb.get_cluster_status --cluster tools
[18:16:43] should help, but it's not working :/
[18:18:03] want me to just set -4 to read/write?
[18:18:05] https://www.irccloud.com/pastebin/MpFxEW8H/
[18:18:20] can you check in hiera?
[18:18:49] tools-readonly.db.svc.wikimedia.cloud points to 6
[18:18:51] profile::wmcs::services::toolsdb::primary_server: tools-db-4.tools.eqiad1.wikimedia.cloud
[18:18:52] which makes 4 the primary
[18:18:54] yep
[18:19:34] SET GLOBAL read_only=OFF;
[18:20:16] now does that cookbook work?
[18:20:51] running
[18:21:00] I removed `enabled=True` from the wmcs-enc-cli
[18:21:16] yep, works
[18:21:29] https://www.irccloud.com/pastebin/ayu94PGr/
[18:21:41] and all our alerts cleared
[18:21:56] so I guess we're good until/unless I pool another osd
[18:22:14] (btw this is definitely specific to the 25G hosts, I've pooled lots of 10G hosts over the last few days without incident)
[18:22:44] I think it's quite probably the network saturation
[18:23:32] I'm concerned that I don't understand the QoS thing that cathal said above. Can we just reduce thresholds for those new servers?
[18:24:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1181756 <- this fixes the cookbook, though I'm not sure what the `enabled=True` entails now (or if there's a replacement)
[18:24:41] the QoS should prioritize heartbeat over non-heartbeat traffic
[18:25:28] I think there might be some non-heartbeat traffic that's still critical of some sort, we can try to force it to use less bandwidth for recovery
[18:28:52] https://docs.ceph.com/en/reef/rados/configuration/osd-config-ref/#confval-osd_max_backfills that might help, and also https://docs.ceph.com/en/reef/rados/configuration/osd-config-ref/#confval-osd_recovery_max_active
[18:28:58] not exactly throughput though :/
[18:31:06] what is the difference between a recovery and a backfill?
[18:32:33] the backfill happens when rebalancing, the recovery when it's unplanned I think
[18:33:54] So osd_max_backfills is already set to 1? That seems like... the lowest it can be
[18:34:08] kinda, so backfill is when the cluster moves data around, recovery is when an osd that crashed comes back online and needs data to catch up
[18:34:55] this might be messing with those config values https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
[18:35:05] It changed the defaults in quincy
[18:36:45] https://www.irccloud.com/pastebin/FHcPMNog/
[18:37:20] hm, high_client already seems like the right thing
[18:38:43] yep, it's the default in our cluster too
[18:39:01] that's good news and bad news
[18:43:12] if the issue is the recovery traffic (that seems to be what's going around now), we can try to limit it more by setting https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#confval-osd_mclock_override_recovery_settings , and then tweaking https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#confval-osd_mclock_max_capacity_iops_ssd (10 by default)
[18:43:40] sorry, osd_recovery_max_active_ssd
[18:44:33] there's also an interestingly big spread in the iops_ssd one, but probably unrelated (it's only set for the quincy osds, which are new)
[18:44:36] https://www.irccloud.com/pastebin/2hBylVQX/
[18:45:17] anyhow, things are kinda stable, I'll go walk the dogs
[18:45:22] page me if you need me!
[18:45:25] cya!
[19:34:43] things still up, nice, I'll go dinner 🥘
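A minimal sketch of the recovery-throttling route discussed at 18:43, using the option names from the linked docs; the values are illustrative only and were not applied to the cluster in this log:

    # with mclock scheduling (quincy+), the classic recovery limits are ignored
    # unless this override is enabled
    ceph config set osd osd_mclock_override_recovery_settings true
    # then the usual knobs take effect again; example values only
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active_ssd 3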