[12:47:04] dhinus: I added some more stuff on T384591, feel free to add more stuff/remove/etc.
[12:47:05] T384591: [dbaas,toolsdb] Add support for management of toolsdb databases within toolforge - https://phabricator.wikimedia.org/T384591
[12:48:18] ack
[12:49:01] Hm, in my sleepiness I misread the invite and thought the dbaas meeting was starting now rather than ending now
[12:49:15] * andrewbogott reads the ticket
[12:50:02] are y'all thinking this will use trove, or some different k8s-based thing?
[12:53:21] half-half, the idea might be that instead of using trove and then support some other thing, we might support trove as is for now (manual), and jump to supporting the other thing first, though we agreed that toolsdb would be a good start no matter what
[12:53:57] the big question is storage on toolforge, like persistent volumes and such, specifically hooking toolforge to our ceph setup
[12:54:11] yeah, that solves a lot of things
[12:54:23] yep :)
[12:54:52] I know you don't love this idea but remember if we map tool accounts to keystone accounts then it's trivial to get per-tool trove and s3 from that.
[12:55:02] I think there's a demonstration of that sitting around someplace
[12:55:16] (of course that assumes that we don't hate trove, which... we might)
[12:57:42] I know yes :), that's all in the s3 support task, I'll add a note in the dbaas one when mentioning trove so we don't forget
[12:59:21] unrelated: If I run 'wmcs.ceph.osd.bootstrap_and_add' will that start filling all 8 OSDs all at once, or does it only set them up but not bring them up?
[12:59:49] * andrewbogott trying to not break NFS this week
[13:00:33] let me double check, but I think it batches them in groups of 4, it sets them up first, then brings them in bit by bit
[13:00:41] T384586 <- the dbaas generic task
[13:00:41] T384586: [dbaas] Add DB as a service capabilities to toolforge - https://phabricator.wikimedia.org/T384586
[13:00:57] (in case you want to add/remove stuff to, extra eyes are always appreciated)
[13:01:23] ok, are you comfortable with me running that cookbook now?
[13:02:09] the cookbook actually uses batches of 2 (you can increase it if you want using --batch-size)
[13:02:28] andrewbogott: ack from me, going for lunch though, but feel free to ping me here if anything goes awry (that it should not)
[13:02:59] ok! It's running
[13:03:35] you can keep an eye on the network panels here https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1
[13:04:22] especially the switches one, if we hit over the orange, that's when we usually started to have issues, if we do not, well, then we might want to undrain the next node faster
[13:05:34] hmm, there seems to be a gap at the end for some reason :/, if it does not show data in a bit, ping cathal about it, maybe there's something borked on the prometheus/gnmi side
[13:19:41] 'discards on switch/router links' is high but maybe that's just the QOS change working?
[14:02:01] yep, we should see some yes
[14:02:17] 200p/s is not a lot either, it usually spikes to the 1ks
[14:02:28] though that should be ok also if QoS is working
[14:02:36] any workers stuck?
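For reference, a rough sketch of how a run like the one above might be launched from cloudcumin1001. Only the cookbook name and the --batch-size flag appear in the conversation; the host-selection flag and its value are assumptions and would need checking against the cookbook's --help.

    # sketch only, assuming the cookbook is launched from cloudcumin1001;
    # --osd-hostname is a hypothetical flag standing in for however the
    # cookbook actually selects the target host
    cookbook wmcs.ceph.osd.bootstrap_and_add \
        --osd-hostname cloudcephosd1012 \
        --batch-size 2   # the default per the chat; can be raised
    # then keep an eye on the switch panels at
    # https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1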
[14:03:22] nothing pops out right now
[14:03:24] https://usercontent.irccloud-cdn.com/file/CnSlteLg/image.png
[14:03:34] there was a little spike ~2h:15m ago
[14:04:19] oh, tools-static might be misbehaving (message in -cloud)
[14:05:31] topranks: I don't see any switch traffic stats for the last couple hours https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1&viewPanel=137
[14:05:58] is there any known issue? Stat renaming?
[14:06:23] I see there's an alert on "cloudsw1-e4-eqiad.mgmt.eqiad.wmnet"
[14:06:51] on the nfs side, something happened at 11:48 UTC
[14:06:58] https://www.irccloud.com/pastebin/aTQGioCs/
[14:07:02] andrewbogott: ^
[14:07:43] that's actually **before** you started doing stuff right?
[14:07:53] yes, well before I did anything
[14:08:01] seems to match the time at https://librenms.wikimedia.org/device/device=242/tab=alerts/section=alert-log/
[14:08:02] xd, what a coincidence
[14:08:30] seems like it recovered?
[14:09:16] it matches the time the graphs for traffic from switches stopped too
[14:09:21] topranks: I think we have a network issue xd
[14:10:01] I'll reboot tools-static to recover the NFS there
[14:13:20] catching up guys
[14:13:37] em there is an issue with the gnmic stats polling right now, I'm debugging on the VM in eqiad
[14:14:04] okok, do you think that it would have caused a network blip or similar? we saw NFS issues at that same time
[14:14:09] (before we did anything xd)
[14:14:13] I am pooling a new cephosd which probably doesn't help, but that's close to finished
[14:14:14] no not at all
[14:14:23] just the daemon collecting stats isn't working right
[14:14:42] murphy's law timing-wise
[14:15:02] we added some new collection for bgp - let me revert that and get things back to how they had been
[14:15:16] oh wait, it was actually 1h before... damn UTC changing UIs xd
[14:22:29] is the network still misbehaving or was it a one-off that broke tools-static?
[14:23:45] one-off I think, things did stop getting stuck
[14:23:51] I might have just broken puppet on all cloud servers :/
[14:24:09] oops, anything we can help with?
[14:24:16] I think I know the fix
[14:24:27] that teaches me I should always use PCC :P
[14:24:37] Could not find declared class prometheus::node_kernel_panic
[14:24:45] I missed one rename
[14:25:07] ooohhh, yep, okok
[14:25:26] I just got deployment-prep down to a single puppet failure, that was probably the trigger for new puppet breakage
[14:26:10] dhinus: it's a race for you to fix it before the alert manager dashboard I'm watching goes all red :)
[14:26:16] hahahaha
[14:26:21] doing my best :P
[14:26:55] ok the network stats should be back working in eqiad now
[14:27:25] is there something specific we should look at?
[14:29:10] topranks: we were noticing the dashboard because I'm pooling new osds, I don't think anything interesting is happening right this minute
[14:29:26] ok yeah bad timing
[14:30:23] fwiw I'd tested all the new stuff in magru and it was fine, but I think the new set of (bgp) metrics was too heavy in eqiad given the higher number of devices there. will need to work on resourcing the gnmic stuff better to introduce those.
[14:31:04] np
[14:31:19] https://usercontent.irccloud-cdn.com/file/J1IvkkNf/image.png
[14:31:39] we are not sure we got to saturate the network much (there were some drops, so that's a good hint), but this test was a success :)
[14:32:17] dcaro: the cookbook finished but 'ceph osd tree' shows REWEIGHT of 0 for the new osds.
[14:32:19] are the new hosts now synced fully?
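A minimal sketch of the check raised in that last message; `ceph osd tree` is standard Ceph, and the osd id and weights below are illustrative only, not output captured from this cluster.

    # run on a mon/admin host, via sudo -i as elsewhere in this log
    sudo -i ceph osd tree
    # ID  CLASS  WEIGHT    TYPE NAME    STATUS  REWEIGHT  PRI-AFF   (illustrative line)
    # 93  hdd    1.74658        osd.93  up      0         1.00000
    # REWEIGHT is the temporary override weight: at 0, CRUSH places no data
    # on the OSD even though its CRUSH weight (capacity) is already set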
god damn my timing I wanted to see how things performed
[14:32:27] I'm not sure I know what 'reweight' is
[14:32:34] andrewbogott: looking
[14:32:36] topranks: we can drain the next one while you watch if you want
[14:32:40] after ^^ is sorted
[14:32:48] reweight is a 'temporary' weight, when the osds are restarted it clears up
[14:32:52] don't do anything that doesn't need to be done anyway
[14:32:56] it does
[14:33:17] cool, well ping me when it's happening will be interesting to see
[14:33:30] I guess a "drain" means it moves the data that host was serving elsewhere to keep it available?
[14:33:30] dcaro: ok, so it'll reset itself eventually?
[14:33:42] it should not have left it at 0 :/
[14:33:51] do you have the logs of the run?
[14:34:48] it sets the reweight to 0 right after setting them up, so it can undrain them bit by bit
[14:35:12] I have them in the backscroll, also probably they're someplace in /var/log on cloudcumin1001
[14:35:21] want me to copy/paste the scroll?
[14:36:24] any will do, just want to check the flow it went through
[14:37:21] topranks: yep, empty server getting filled up with data from others (many->one)
[14:37:38] https://phabricator.wikimedia.org/P72268
[14:39:00] I'm on the third iteration of my patch to fix puppet, but this time I'm checking with PCC :D
[14:39:10] andrewbogott: hmm, yep, they have no data
[14:39:22] https://www.irccloud.com/pastebin/6JAHLmjF/
[14:39:36] dcaro: what was all that waiting for rebalancing about then
[14:39:57] it still created space in the cluster for the new osds I think (new shards)
[14:40:52] andrewbogott: do you have the /var/log/spicerack logs?
[14:41:13] was this in cloudcumin? (if so I can get there and check myself)
[14:41:43] cloudcumin1001:/var/log/spicerack/wmcs/ceph/osd
[14:41:47] they're there
[14:41:55] ack, looking
[14:43:12] '2025-01-23 14:14:08,858 andrew 229356 [INFO] Completed command 'sudo -i ceph osd crush reweight osd.93 1.7465785369277 -f json''
[14:43:18] :/, that should have worked
[14:44:09] hmm, still at 0
[14:44:13] (manually ran it)
[14:44:36] 🤦‍♂️ it should be `ceph reweight` directly **also**
[14:45:19] * dcaro looking
[14:45:48] ok this should fix puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113814
[14:45:57] PCC is still broken, because it compiles only with the change, and is broken in prod
[14:47:51] do you need all the specifics for ensure=>absent?
[14:48:08] PCC failed without them
[14:48:14] I was also confused by that
[14:48:26] huh
[14:49:08] this was the PCC failure without the specifics: https://puppet-compiler.wmflabs.org/output/1113814/2801/cloudcontrol1005.eqiad.wmnet/change.cloudcontrol1005.eqiad.wmnet.err
[14:50:21] oh, I see, it's just that we're missing a bunch of optional[]s around the type defs
[14:50:42] I do not feel a need to fix that just now
[14:52:25] andrewbogott: no idea how I tested that before :/, this should fix it https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1113817
[14:53:44] ok! Shall I manually fix the 1012 osds or are you doing that?
[14:54:29] you can use that patch to test, with test-cookbook (or similar in cloudcumin)
[14:54:39] use the `undrain_osd` cookbook
[14:54:44] uses the same code in the end
[14:54:54] ok puppet is fixed!
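To make the fix discussed above concrete, a short sketch of the two commands involved; the osd.93 name and the weight value come from the cookbook log quoted at 14:43, the rest is standard Ceph usage rather than the exact cookbook code.

    # what the cookbook ran: sets the permanent CRUSH weight (capacity share)
    sudo -i ceph osd crush reweight osd.93 1.7465785369277
    # what was still missing: the temporary override weight shown in the
    # REWEIGHT column of `ceph osd tree`; left at 0 it keeps the OSD empty.
    # the undrain_osd cookbook raises it step by step; done by hand it would be:
    sudo -i ceph osd reweight 93 1.0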
[14:56:19] andrewbogott: hmm, now I'm thinking that it might not detect them as down, I can do the tests if you want, get it sorted out
[14:56:52] * andrewbogott trying to remember what -c takes with test-cookbook
[14:57:01] gerrit patch numbers
[14:57:38] ok
[14:57:48] dcaro: as you predicted, it does nothing :)
[14:57:51] andrewbogott: yep, I suspect that the cookbook will not be able to get those osds, as it will see them as in, looking
[14:58:01] I can depool and then repool but that'll take a while
[14:58:02] ok
[14:58:12] you can try to force it, but well, let me test it and fix it also
[14:59:55] * andrewbogott waves to taavi
[15:02:17] * taavi hides
[15:05:37] :), taavi is too notable to hide effectively ;)
[15:05:40] this looks better:
[15:05:43] https://www.irccloud.com/pastebin/l7DAaQpc/
[15:07:20] data should start moving around
[15:08:02] topranks: will you still be around in a couple hours? Might be that long before I'm ready to drain the next node
[15:08:11] on the other hand, lots of data should be rushing around right this minute
[15:09:15] cloudcephosd1012 network is reaching 1Gb/s, though it's barely noticeable on the switches
[15:09:40] I could drain another one right now, just to make things interesting :)
[15:09:45] as it did already shift the data it needed between the rest of osds
[15:10:04] andrewbogott: make sure to use the patch I sent, little by little, should be ok
[15:10:22] the reweight patch you mean?
[15:11:10] it actually should not matter for undraining xd
[15:11:16] but yep, I meant that one
[15:11:23] *draining
[15:11:51] andrewbogott: yeah I'll likely be online. storm coming in so I won't be outside anyway !
[15:11:55] ok, draining one drive on 1013
[15:12:50] topranks: the cluster is doing a lot of shuffling now if you want to watch the graphs
[15:12:55] * dcaro moves the traffic graphs to its own window in the other monitor
[15:37:14] dcaro: quick +1? I rebased https://gerrit.wikimedia.org/r/c/operations/alerts/+/1113508
[15:37:24] I also double checked with cumin that the new file is available everywhere
[15:39:46] dhinus: lgtm (it still has my +1 from before I think)
[15:40:15] the old +1 was lost because I had to adapt it after the puppet hotfix
[15:40:23] wrong patch :D
[15:40:31] right one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1113498
[15:40:53] +1d :)
[15:40:59] thanks :0
[15:41:01] :)
[15:42:16] hmm... my cookbook failed running `sudo -i ceph status -f json` :/
[15:46:27] I made a quick dash showing the queues on the inter-switch cloud links in eqiad
[15:46:28] https://grafana.wikimedia.org/goto/KFrEYcdHR?orgId=1
[15:46:58] there is an uptick last few hours but nothing is saturating
[15:47:12] occasional drops and they are in the low prio class but only small numbers
[15:47:16] we can keep an eye on it
[15:47:35] all those links are 40Gb bw
[15:49:10] !log admin cumin 'P:base::cloud_production' 'rm /var/lib/prometheus/node.d/kernel-panic.prom' T382961
[15:49:10] dhinus: Not expecting to hear !log here
[15:49:10] T382961: Kernel error metrics have overlapping definitions - https://phabricator.wikimedia.org/T382961
[17:25:24] dcaro: you still around?
[18:48:37] took me a bit more than expected, andrewbogott here's the task https://phabricator.wikimedia.org/T127367, feel free to rewrite/reword or anything
[18:48:49] * dcaro clocking off
[18:50:23] cya in some time!
[19:36:29] * taavi wonders the reasons for the sudden interest in logging
[19:38:57] the interest isn't new, we're just trying to actually do something about it
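Going back to the rebalancing earlier in the afternoon (15:07-15:42 above): a small sketch of how that shuffling can be followed from the command line alongside the Grafana panels. These are standard Ceph commands run via `sudo -i` as elsewhere in this log; the choice of commands is just one reasonable option, not taken from the cookbook itself.

    # overall health plus recovery/backfill progress while data moves around
    sudo -i ceph -s
    # per-OSD utilisation, to watch the newly undrained OSDs fill up
    sudo -i ceph osd df tree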