[07:13:51] morning
[07:22:06] ceph is saying "1 pool(s) do not have an application enabled" and I don't know what that means
[07:28:36] hello
[07:28:56] I have a special morning today. Baby appointment, will be online a bit later than usual
[07:30:05] also puppetdb on toolforge is down for whatever reason :/
[07:32:04] `FATAL: could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory`, probably caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/961839, cc jbond
[07:52:09] T347934
[07:52:10] T347934: Cloud VPS PuppetDB Postgres instances failing with "could not load root certificate file "/etc/ssl/certs/wmf-ca-certificates.crt": No such file or directory" - https://phabricator.wikimedia.org/T347934
[08:15:13] morning
[08:16:06] morning :)
[08:18:09] taavi: I'm guessing andrewbogott started playing with rgw: `application not enabled on pool '.rgw.root'`
[08:18:24] (from `ceph health detail`)
[08:19:31] in ceph you manage namespaces via "pools", and you can enable/disable which type of APIs/applications will be using each one (cephfs/rbd/rgw/...)
[08:19:55] I think that when you manually create a pool, it gets no app by default, and you have to specify one manually
[08:20:55] that pool has been there for a while though, not sure why it started complaining now
[08:21:12] git log for the puppet repo seems to support that theory :-P
[08:21:27] here's a patch for the dynamicproxy issue: https://gerrit.wikimedia.org/r/c/operations/puppet/+/962998/
[08:23:14] just manually set the .rgw.root pool as the rgw app
[08:24:23] health is ok now
[08:50:24] taavi: did you already handle the puppetdb issue?
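The manual fix described above ("just manually set the .rgw.root pool as the rgw app") maps to a single ceph command; a sketch, to be run wherever an admin keyring is available:

```shell
# Tag the pool with the application ceph expects, so the
# "1 pool(s) do not have an application enabled" warning clears.
ceph osd pool application enable .rgw.root rgw

# Confirm the warning is gone:
ceph health detail
```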
[08:50:26] Info: Applying configuration version '(644c46f682) root - hieradata: cloud: provision wmf-ca-certificates.crt'
[08:51:21] dcaro: I did a cherry-pick on tools to test that my proposed fix (https://gerrit.wikimedia.org/r/c/operations/puppet/+/962992/) works, still needs a proper review
[08:51:30] ack
[08:51:34] thanks!
[08:58:09] taavi: ack, looking
[09:28:59] kindrobot: web proxy deletion should be fixed now
[09:36:19] * taavi lunch
[10:17:20] dcaro: I merged the toolforge_weld MR to add a helper for streaming data, do you want me to cut a new release?
[10:17:38] taavi: sure!
[10:17:48] thanks :)
[10:36:03] Raymond_Ndibe: please never install an apt package by hand on toolforge, that should be done by puppet instead. for example right now tools-sgebastion-10 has toolforge-builds-cli installed but tools-sgebastion-11 is missing it
[10:36:31] so something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/963018/ should be used instead
[10:37:00] should we set them to latest?
[10:37:58] I might have been the one installing the package by hand btw, seems like something that would have been done way earlier in the builds development
[10:39:03] * dcaro lunch
[10:39:27] dcaro: I'd say `ensure => installed,` is fine, you can upgrade by hand if you need to and otherwise unattended-upgrades will do it, but the initial install should be done by puppet or you will end up in an inconsistent state and make instance replacements much harder
[10:40:03] and I checked the apt history logs to see why it was only installed on -10, it wasn't you
[11:39:28] I would consider moving the package to the tools-* repo to be a release in itself, so I'd expect that the next step would be to upgrade (as it had already been tested in toolsbeta), and only in very limited cases would that upgrade not be needed, no?
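The "manage the package via puppet" advice above might look like the following puppet resource; the class name is illustrative, not the one used in the actual gerrit patch:

```puppet
# Illustrative sketch only: have puppet ensure the CLI package is
# installed on every bastion, instead of ad-hoc apt installs on
# individual hosts.
class profile::toolforge::bastion_packages {
    package { 'toolforge-builds-cli':
        # initial install by puppet; later upgrades by hand or via
        # unattended-upgrades, as discussed above
        ensure => installed,
    }
}
```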
[11:39:54] not a strong opinion though, but as long as the upgrade step is manual, mistakes will happen
[12:14:35] btw, I got a slightly modified pywikibot (https://gitlab.wikimedia.org/taavi/pywikibot-script-buildservice/-/commits/stable) to run a built-in script in a buildservice-built image, which made a real edit (https://wikitech.wikimedia.org/w/index.php?title=Help:Tool_Labs/Database/Replica_drift&curid=441208&diff=2117003&oldid=2116999)
[12:25:42] nice!
[12:26:17] I had been meaning to set up + document how to use pywikibot on buildservice :) can you write your findings somewhere? (task/wiki/discussion page, any will do)
[12:27:22] what is it that you needed to change?
[12:29:34] I added that as a topic to the toolforge workgroup meeting today, I'll write my findings to T249787 too. essentially I had to add a Procfile to run the pwb.py script, add a user-config.py file to read the authentication data from envvars, and remove a check that blocked using user-config files owned by another user
[12:29:34] T249787: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787
[12:30:24] :+1:
[12:30:30] 👍
[13:04:20] dcaro: no idea why ceph decided to worry about that pool, thanks for fixing. I have another question though!
[13:05:03] There's this old discussion about using erasure-coded pools for swift: https://phabricator.wikimedia.org/T276961#7042719
[13:05:16] Do you have an opinion on whether that's a good idea, and what settings to use?
[13:05:55] * arturo food time
[13:10:31] andrewbogott: I think it's a good idea, I don't have any experience with it though, but we should play with it
[13:10:33] * dhinus paged
[13:10:51] dhinus: I think the page is taavi rebuilding cloudvirts
[13:10:57] sorry about that
[13:11:00] what's the page?
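The pywikibot changes taavi describes above (a Procfile plus a user-config.py reading auth from envvars) might look roughly like this; the sketch is illustrative and not taken from the linked repo:

```python
# (Alongside this file, a Procfile entry such as "bot: python3 pwb.py <script>"
# would tell the buildservice how to start the script; the exact entry
# used in taavi's repo is not known.)
#
# user-config.py sketch: read pywikibot credentials from environment
# variables instead of hardcoding them. Variable names are illustrative;
# "usernames" is predefined by pywikibot when it loads this file.
import os

family = 'wikipedia'
mylang = 'en'
usernames['wikipedia']['en'] = os.environ['PWB_USERNAME']
```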
[13:11:01] something about wdqs
[13:11:06] acked on icinga, let me see how to do that on victorops too
[13:11:25] I acked it on VO
[13:11:36] dcaro: if I'm tuning for "survive loss of n osd nodes" what would you like n to be?
[13:11:48] the default is 1
[13:12:02] are we still using the rack as the unit for HA?
[13:12:20] ideally yes
[13:12:26] (that would mean that we can lose a whole rack, a whole node, but not two of them)
[13:12:38] I would say 2 just in case?
[13:12:44] ok
[13:12:45] hmm, depends
[13:12:52] 1 for rack, 2 for node xd
[13:13:04] I think this is talking about destruction of a node, not just downtime.
[13:13:06] hmmm... gets tricky
[13:15:34] hmm... thinking about what erasure coding with rack-level HA would mean
[13:16:39] that'd mean that you'd have to have at least K+M racks, no? (otherwise, the HA of some of the chunks would overlap)
[13:16:53] we only have 4 racks in eqiad, and 3 in codfw
[13:16:56] iirc
[13:17:04] there's a setting 'crush-failure-domain=rack' but I definitely don't understand the algo behind that yet
[13:17:15] I take it 'rack' is a concept that ceph is aware of?
[13:17:38] kinda, it's a default that it can populate, you can modify it by hand as you want though, and rename and such
[13:17:58] it's a 'crush bucket'
[13:18:41] are you reading https://docs.ceph.com/en/latest/rados/operations/erasure-code/ ?
[13:19:01] "For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 67%" seems close to what we want
[13:19:46] I think that you need to have at least 5 racks for that
[13:19:51] (with the same capacity each)
[13:21:03] "The crush-failure-domain=rack will create a CRUSH rule that ensures no two chunks are stored in the same rack." yep, you will need M+K racks to support that HA level
[13:22:38] so if we set k=2 m=2 that will give us resiliency over one rack loss, won't it?
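The k/m arithmetic being worked through here (and untangled a few messages later: k is data chunks, m is coding/"checksum" chunks) can be summarized in a few lines; this is a sketch of the bookkeeping only, not of ceph's actual placement logic:

```python
# Erasure-coding bookkeeping: each object is split into k data chunks
# plus m coding chunks. Up to m chunks -- and, with
# crush-failure-domain=rack, up to m whole racks -- can be lost, and a
# rack-level CRUSH rule needs at least k+m racks so every chunk lands in
# a distinct rack. Raw storage used is (k+m)/k times the object size.
def ec_profile(k: int, m: int) -> dict:
    return {
        "total_chunks": k + m,
        "max_failures": m,        # OSDs or racks that may be lost
        "racks_needed": k + m,    # for crush-failure-domain=rack
        "overhead": (k + m) / k,  # raw / usable space ratio
    }

# k=3, m=2 matches the docs' "loss of two racks with a storage overhead
# of 67%" example: 5 racks needed, ~1.67x raw usage.
print(ec_profile(3, 2))
# The profiles discussed here:
print(ec_profile(2, 2))  # eqiad: survives 2 rack losses, needs 4 racks, 2x
print(ec_profile(2, 1))  # codfw: survives 1 rack loss, needs 3 racks, 1.5x
```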
I'm having a hard time relating the theory of what k and m are to how that translates to resiliency
[13:24:11] so m is the number of chunks, and k is the number of checksums, you can lose as many osds as checksums you have, but checksums don't store anything new, so if you have m=2 and k=2, you are doubling the space you are using to store anything, as any object will be split in 2, and 2 checksums will be created too
[13:24:37] if the domain is the rack, you can lose as many racks as k
[13:25:30] or as many osds (you can lose up to k chunks, be that checksums or actual data)
[13:27:17] I think you have k and m swapped? k is chunks, m is checksums
[13:27:36] oh, okok, yep xd
[13:27:52] naming is hard
[13:27:55] yep
[13:29:02] so we can have k=2 m=1 on codfw, and k=2 m=2 on eqiad
[13:29:34] that seems right to me
[13:30:47] the other thing I'm not clear on: radosgw auto-created all these different pools, do I need to rebuild all of them or just the .data one?
[13:30:55] (or is there some way to tell radosgw to do that by default...)
[13:31:26] taavi: thanks, I'll try deleting them today
[13:33:49] andrewbogott: hmm, I think only the data one is needed, but that means that there'd be a different HA on the "control" pools than on the data one, so we'd have to be careful there, as losing two osds would halt the control but not the data
[13:34:15] yeah, seems best to have them all the same
[13:34:23] easier to think of, yes xd
[13:35:14] taavi: ok if I wipe out any existing object storage in codfw1dev?
[13:35:38] andrewbogott: not sure what I have there, but that should be nothing important. so sure
[13:36:28] there might be disks for cloudservices vms and such, no?
[13:36:45] or do you mean rgw data?
[13:36:48] dhinus: I'm going to be shutting down the last VMs on cloudvirt-wdqs1001 too. I think I managed to downtime that check, but in case I haven't this is your advance warning/apology
[13:37:25] dcaro: yes, I'm going to recreate the rgw.
pools
[13:37:30] ack
[13:37:36] nothing that I know of either
[13:37:59] (I have no data in those pools, nor do I know of any non-deletable data there)
[13:39:03] (seems like I did manage to downtime that check. hooray!)
[13:40:48] ugh I forgot how hard it is to delete a pool
[13:41:31] taavi: thanks, I'll let you know if it pages :)
[13:43:18] it's a nagios/icinga check?
[13:43:22] dcaro: before I do this 5 more times, do you mind checking default.rgw.buckets.data in codfw1dev and telling me if it looks right to you?
[13:45:26] looking
[13:45:32] in codfw?
[13:46:04] yep
[13:46:04] (btw. we will need to add multisite/multicluster support to puppet... if we have more clusters)
[13:47:15] yeah, although I keep thinking that a few big ceph clusters is better than multiple smaller clusters... at least, it seems more redundant
[13:47:19] maybe not as performant though
[13:48:16] Is there some way to get all the settings for a pool rather than having to ask for them one by one with 'ceph osd pool get'?
[13:50:24] looks ok
[13:50:28] https://www.irccloud.com/pastebin/dq6xPEMN/
[13:50:40] ooh, that's so much better
[13:50:56] I think that for erasure-coded pools you still need the second command to get the profile details
[13:51:04] yeah
[13:51:16] ok, going to recreate the other pools in codfw and then we'll see if swift still works
[13:51:32] 🤞
[13:52:08] * taavi afk, will be back later for the toolforge meeting
[13:59:34] hm, nope, nothing works now
[14:00:48] I don't know enough about how access works in ceph. Were the keyrings bound to particular pools, and when I deleted/recreated did I break that association?
[14:02:06] maybe? let me look too
[14:02:21] thanks
[14:02:55] what are the rgw machines called?
[14:02:57] cloudswift?
[14:03:27] might be this:
[14:03:29] https://www.irccloud.com/pastebin/scdoGNqs/
[14:04:15] added the applications (using the command it mentions)
[14:04:21] rgw app
[14:06:40] ok, trying again...
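Recreating the data pool as erasure-coded, as being done here, would look roughly like the following; the profile name and PG counts are illustrative, while the pool name matches the one discussed:

```shell
# Define an EC profile with rack-level failure domain, then create the
# data pool with it and tag it for rgw. Profile name and PG count are
# illustrative, not taken from the actual cluster.
ceph osd erasure-code-profile set rgw-ec k=2 m=2 crush-failure-domain=rack
ceph osd pool create default.rgw.buckets.data 32 32 erasure rgw-ec
ceph osd pool application enable default.rgw.buckets.data rgw

# All settings for a pool in one go (instead of one-by-one "get"s);
# the EC profile details still need the second command:
ceph osd pool get default.rgw.buckets.data all
ceph osd erasure-code-profile get rgw-ec
```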
[14:08:24] hm, why is the haproxy health check only hitting cloudcontrol2004-dev?
[14:09:07] (container creation is still not working)
[14:09:36] which host are you running this on?
[14:10:48] the test command you mean?
[14:10:48] root@cloudcontrol2001-dev:~# OS_PROJECT_ID=testlabs openstack container create testlabscontainer
[14:11:20] where is the rgw running?
[14:11:32] cloudcontrols
[14:11:35] ack
[14:11:46] the log says
[14:11:48] https://www.irccloud.com/pastebin/w2dZj3UZ/
[14:11:53] not super helpful
[14:12:32] there's only one rgw daemon connected, no?
[14:13:49] I'm not sure I understand the question. In theory it's running on all three cloudcontrols but something may be wrong in the haproxy layer
[14:14:16] cloudcontrol2004 is the only one actually running the radosgw daemon
[14:14:24] the others are stopped/dead
[14:14:59] hm, haproxy says
[14:14:59] Server radosgw-api_backend/cloudcontrol2001-dev.private.codfw.wikimedia.cloud is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[14:15:24] because the daemon there is dead xd
[14:15:30] as in, the radosgw process is not running
[14:15:33] you're right, it seems to be crashing silently on e.g. 2001
[14:16:03] same for 2005
[14:16:21] https://www.irccloud.com/pastebin/lnuYC9SP/
[14:16:21] "corrupted double-linked list" ?
[14:16:30] that sounds tricky xd
[14:17:27] I'm not sure that's the issue
[14:17:40] it's not telling me much when I try to restart
[14:18:11] I restarted the 2001 rgw daemon and it seems ok
[14:18:28] why does it work for you and not for me? 'systemctl start radosgw.service'?
[14:18:37] aaaa
[14:18:50] it's ceph-radosgw@radosgw.service
[14:19:05] what the heck
[14:19:05] ok
[14:19:42] hmm, where does that one come from?
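The unit-name confusion above in a nutshell: the package ships a legacy init-style unit alongside a systemd template unit, and only the latter actually runs the daemon on these hosts. A sketch of checking and restarting the right one:

```shell
# The legacy sysv-style unit -- present on disk but not the one in use:
systemctl status radosgw.service

# The templated unit that actually runs the daemon here:
systemctl restart ceph-radosgw@radosgw.service
systemctl status ceph-radosgw@radosgw.service
```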
[14:20:32] hmm, both seem to be installed by the radosgw package
[14:20:50] https://www.irccloud.com/pastebin/1qJiBWQS/
[14:21:33] I think that the init one might be deprecated, and systemd is just showing it because the file is there
[14:21:51] anyhow, now we have 3 daemons
[14:21:52] rgw: 3 daemons active (122130562, 122130817, 122178836)
[14:22:43] I don't think that should have been an issue though xd
[14:26:38] yeah, still failing the same way. I was hoping that it was just because I'd restarted the wrong thing...
[14:27:39] ok, so what's different now from when this was working an hour ago...
[14:28:20] the pools?
[14:28:36] yeah, they're different :)
[14:28:48] where is the auth the service uses?
[14:29:46] oh, iirc with radosgw you have to create users and such as s3 would, right?
[14:29:52] user auth happens via keystone, that part seems to be working (according to the log)
[14:30:05] https://docs.ceph.com/en/quincy/radosgw/swift/auth/
[14:30:08] it ought to be implicitly creating those
[14:30:23] like, there should be an automatic rados user that corresponds to a keystone project
[14:30:24] okok, because those were probably all deleted
[14:30:31] true...
[14:30:35] maybe it does not recreate them unless forced?
[14:30:42] (as in, it thinks they're there already)
[14:30:45] it might be out of sync and think they're there
[14:30:46] yeah
[14:31:54] it seems to also fail in a project that hasn't used radosgw before
[14:33:11] (that setting is rgw_keystone_implicit_tenants = true)
[14:36:08] I see only 3 users: trove, testlabs and cinderprojectest
[14:36:24] That fits, those are the three I've tried since rebuilding things.
[14:37:25] what decides which pools a given ceph user can access?
[14:37:46] In this case ceph.client.radosgw.keyring
[14:39:20] when creating the auth, you can add pools= as a parameter
[14:40:36] ok, so likely that's what's broken. Can that be altered after the fact?
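Inspecting and changing which pools a ceph client can reach (the question being asked here) goes through the client's capabilities; a sketch, where the caps shown are illustrative rather than what the real client.radosgw key uses:

```shell
# Show the current key and capabilities for the rgw client:
ceph auth get client.radosgw

# Rewrite its caps after the fact; pool-scoped access is expressed
# inside the osd cap string (e.g. "allow rwx pool=<name>") rather than
# as a standalone pools= option:
ceph auth caps client.radosgw \
    mon 'allow rw' \
    osd 'allow rwx pool=default.rgw.buckets.data'
```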
[14:41:27] I think so, yes
[14:49:37] I don't see a pools option for that client auth though
[14:50:36] what about -- do you know how to query a user for what pools/applications it has access to?
[14:50:51] ceph auth ls
[14:53:11] (btw there is a working setup in eqiad1, so I'm trying to compare things to client.radosgw there. there are also some *.rgw users there which as far as I know are relics and don't do anything)
[15:41:38] I think that the draining of the ceph hosts is making the cluster run out of space somehow, probably some osds are getting full and the rebalancer has to move more stuff around
[15:41:40] taavi: maybe a good thing to think about for that PWB image would be if/how we might replace all the toolforge images with ones based on buildpacks? I'm not sure yet if that makes sense or not, but I think long term we probably want convergence.
[15:41:44] looking though
[15:46:18] bd808: hmm. you mean the images we manage via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/toollabs-images/+/refs/heads/master? I hadn't considered the possibility of using buildpack images to provide the dependencies and start script for code located on NFS
[15:48:37] taavi: yes, those are the images I mean. Like I said, I'm not sure this is entirely possible yet, but it seems like it would be nice to have converged systems for image generation if we can.
[15:50:27] Converting that repo to use `pack` seems like it might be possible if we made a custom stack to do it. Maybe that doesn't really buy us anything though.
[15:52:55] it would be great if we were able to use the build service directly (not sure if it's doable though)
[16:08:17] about the ceph health: yep, it's a side effect of the rebalancing, health is back to normal
[16:15:10] dcaro: here is the magic: "RBD can store image data in EC pools, but the image header and metadata still need to go in a replicated pool."
[16:15:40] I switched all pools except for .data to replication and things seem happy again.
[16:15:54] I'm guessing you would like me to not make this change in eqiad right now while you're depooling things?
[16:16:00] ooohhhh, tricky
[16:16:11] should be ok
[16:22:37] https://openstack.eqiad1.wikimediacloud.org:28080/swift/v1/AUTH_testlabs/ecbucket/foo.html
[16:23:36] \o/
[16:23:52] something is still a bit broken in codfw1dev I think
[16:29:32] ok, as per tradition, restarting the services has codfw1dev happy now
[16:30:25] Now lunch and then swift UI things
[16:31:06] taavi: I feel like you looked at this... do you recall if the same endpoint also serves the s3 API or if that's some kind of add-on? The docs sort of imply that the same endpoint can speak both swift and s3, which seems weird.
[16:33:24] andrewbogott: do you think I can try the Antelope upgrade again in codfw? if yes, please add a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/963029 and I'll try running the upgrade cookbook
[16:34:53] I think so!
[16:36:20] andrewbogott: I remember looking at that before. let me find out whether I documented that anywhere.
[16:36:27] ty
[16:40:54] I found https://phabricator.wikimedia.org/P49548 which was to document how to integrate it with terraform, but I don't see anything on how to set it up. iirc the openstack cli had a feature to create s3-style credentials and then you could just use that
[16:43:32] and I still feel like the swift api should be exposed on port 443 (https://phabricator.wikimedia.org/T341380#8998491)
[16:45:29] * dcaro off
[17:23:06] To verify before I break something: all the systems in https://phabricator.wikimedia.org/T342456 are fine for experimenting with? and to reimage, the command would be
[17:23:06] `sudo cookbook sre.hosts.reimage --os bullseye cloudcontrol2006-dev.codfw.wmnet` ?
[17:33:25] Rook: yes on both. Sometimes the reimage command needs a --new if the previous reimage went badly.
[17:34:07] neat, thanks
[20:33:49] Rook: (I believe you are on call), can you look at https://phabricator.wikimedia.org/T348067
[20:46:03] sigh
[20:46:27] sigh indeed
[20:46:47] that's the novaobserver password, it's intentionally public
[20:48:47] taavi: for a random user in -cloud, they don't know what's supposed to be public and what isn't. It's hard for them to tell and definitely right to be safe.
[20:51:21] I agree, prefacing it with "fakepass" or the like would reduce such confusion.