[08:20:15] headsup, I'll be rebooting the netbox (and netbox DB) hosts in 10 minutes. let me know if you have any ongoing reimages/decoms, then I can also wait some more
[08:44:47] Is there a way in netbox to see the history of an IP? I want to check to which host this IP belonged in the past: 10.192.48.16
[08:45:47] I think it used to be cumin2001 per https://gerrit.wikimedia.org/r/c/operations/puppet/+/505407/6/hieradata/common.yaml
[08:45:51] But I would like to confirm
[08:48:26] marostegui: https://netbox.wikimedia.org/extras/changelog/?q=10.192.48.16
[08:48:40] the search box on the right in the changelog page (to get there from the UI)
[08:49:07] oh, awesome volans, thank you
[08:49:21] it is indeed cumin2001
[08:49:23] thanks!
[08:53:14] then you can click on the datetime links on the left to see the diffs marostegui
[08:53:29] yeah, I did that :)
[08:53:31] thanks
[09:00:03] marostegui: for completeness, the current retention of the changelog is 730 days, we can ofc consider increasing it if necessary
[09:01:43] I guess that's probably more than enough!
[09:04:01] I was tempted to put 5.5y as that should cover the whole life of a server, but ymmv :D
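
The same changelog lookup can also be scripted against the NetBox API instead of the UI. A minimal sketch with pynetbox, assuming the object-changes endpoint accepts the same q= search as the changelog page's search box; the token is a placeholder and the exact filter support is an assumption, not something confirmed in the chat:

```python
# Sketch: querying the NetBox changelog for an IP via the API instead of the UI.
# Assumes pynetbox and an API token; the q= search is assumed to mirror the
# changelog page's search box.
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")  # placeholder token

# /api/extras/object-changes/ is the API counterpart of the changelog page
for change in nb.extras.object_changes.filter(q="10.192.48.16"):
    print(change.time, change.action, change.changed_object_type, change.object_repr)
```
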
[09:05:45] dbctl seems to be broken on cumin1001
[09:06:14] :-/
[09:06:16] what's all that?
[09:06:20] https://phabricator.wikimedia.org/P25584
[09:06:55] But some of the last pushes apparently worked? https://phabricator.wikimedia.org/P25581
[09:07:30] seems like the data is corrupted
[09:07:56] last change i made saved the prev config to /var/cache/conftool/dbconfig/20220420-090010-kormat.json
[09:09:18] do we have anyone around who understands dbctl enough to figure out what went wrong?
[09:14:15] I can have a look
[09:14:41] `dbctl config get` works, `dbctl config generate` does not, which indicates the issue is the local dbctl config, not the etcd store
[09:15:18] local? there is nothing local :D
[09:15:56] volans: my understanding is there are 2 datastores for dbctl
[09:16:01] etcd, and something on a filesystem somewhere?
[09:16:12] operations typically edit the fs-based one, and then push it to etcd
[09:16:14] is that wrong?
[09:16:36] the data of instance/section/config are all in etcd, just different objects
[09:16:49] ah
[09:16:58] ok, then the non-mw part of the etcd data is broken
[09:17:05] I'm checking
[09:22:49] volans: es1 doesn't look good
[09:22:55] `'es1': [{}, {'es1032': 100, 'es1027': 100}]`
[09:23:13] The config does look fine: https://noc.wikimedia.org/dbconfig/eqiad.json
[09:23:19] Maybe we need to add es1029 back?
[09:23:20] the other sections seem to have a non-empty first entry in that list
[09:23:52] dbctl instance es1029 edit shows a correct config though
[09:24:13] so the error given is that it fails validation of the data internally in dbctl
[09:24:20] es1029 is the issue, yeah
[09:24:31] trying to depool it is what broke things
[09:24:49] but it's not very helpful about which actual object is failing validation, could be improved
[09:25:10] volans: i'm going to repool es1029, and see if that fixes things
[09:25:44] (waiting for your ok)
[09:25:57] {"es1": {"master": "es1029", "min_replicas": 1, "readonly": false, "ro_reason": "PLACEHOLDER", "flavor": "external"}, "tags": "datacenter=eqiad"}
[09:26:03] so es1029 is es1 master?
[09:26:34] volans: es1, es2 and es3 do not have the concept of master in the topology (as they are standalone), but yes, for dbctl it is supposed to be the master
[09:27:13] I know, I know
[09:27:32] so depooling the master might have caused the issue, I'm even wondering why it allowed depooling it
[09:27:43] indeed
[09:27:56] I almost depooled the s4 master a few days ago and it stopped me, which was great
[09:30:31] kormat: how did you depool es1029?
[09:30:45] I'm checking if we had something special for the RO es sections to allow depooling the master
[09:31:59] `sudo dbctl instance es1029 depool`
[09:32:29] (using `software/dbtools/depool-and-wait`)
[09:34:56] ok, so yes I think that that's the cause, and it might all be 'by-design' (not sure, still refreshing my memory)
[09:35:29] basically what I'm checking is if all the validation/safety nets are done in the 'config' command, so it allows you to screw up things in the
[09:35:45] instance/section objects, but then prevents you from generating a faulty config for mediawiki
[09:35:48] volans: is there any reason not to fix the config state in the meantime?
[09:35:56] so +1 for me to repool es1029
[09:36:01] ok, doing.
[09:36:09] check the diff and see what happens
[09:36:26] done, diff is empty, as expected/hoped.
[09:36:42] ack
[09:36:52] phew
[09:37:32] I think it's all by design, because you might want to refactor stuff and need multiple dbctl commands to get into a new state that is valid, and potentially have some temporary invalid state while executing different dbctl commands
[09:37:49] as it's all virtual until you generate the new config
[09:37:57] that makes some sense
[09:38:11] it would just be really nice if the error actually said what the problem was
[09:38:15] yep
[09:38:23] instead of "welp, i don't like `{}`. that's your problem now"
[09:38:41] volans: thank you for your help! 💜
[09:38:58] can I repool a host then?
[09:39:11] marostegui: yep, go for it. we're back in service.
[09:39:16] cool thanks
[09:39:17] kormat: that error might come directly from jsonschema
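
A minimal illustration of the kind of error discussed above; this is not dbctl's actual schema, just a hypothetical jsonschema rule showing how the empty `{}` in the es1 loads produces a generic validation failure that does not say which object is wrong:

```python
# Minimal sketch (NOT dbctl's real schema): a jsonschema constraint that rejects
# an empty weights dict, producing the kind of unhelpful error seen above.
import jsonschema
from jsonschema.exceptions import ValidationError

# Hypothetical rule: every group in a section's load list must contain at least one host
schema = {
    "type": "array",
    "items": {"type": "object", "minProperties": 1},
}

es1_loads = [{}, {"es1032": 100, "es1027": 100}]  # the broken state seen for es1

try:
    jsonschema.validate(es1_loads, schema)
except ValidationError as e:
    # Prints a generic complaint about {} without saying which section or
    # object is affected -- i.e. "welp, i don't like {}"
    print(e.message)
```
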
[10:40:26] any puppeteers know what might cause this?
[10:40:28] `Evaluation Error: Unknown function: 'size'. (file: /etc/puppet/manifests/realm.pp, line: 103, column: 4)`
[10:41:41] kormat: context?
[10:41:44] PCC?
[10:41:52] volans: puppet failure on a pontoon env in WMCS
[10:44:31] it might be that size() was added in a more recent version of puppet's stdlib and you have an older version?
[10:44:53] volans: that line in realm.pp is 5 years old
[10:44:57] which is.. confusing.
[10:44:58] AFAICT it's not part of https://puppet.com/docs/puppet/5.5/function.html but it is part of the ones in v6
[10:45:17] I don't know the setup for pontoon, so can't help there, sorry
[11:54:50] kormat: were you able to fix the error? IIRC it has to do with the puppet master and vendor_modules not being in its search path upon rebasing
[12:12:35] godog: no i had lunch instead
[12:13:20] kormat: https://c.tenor.com/WkKfe2zUbwUAAAAC/you-have-chosen-wisely-choose.gif
[12:33:15] :D
[12:33:20] godog: any idea how to make fixy?
[12:34:51] kormat: mmhh can puppet run on the pontoon master? if not, then making sure vendor_modules is in the modules search path in the config and restarting apache should be enough
[12:35:36] ahh, there we go, it's at least running now
[17:46:17] volans: I tried using sre.hosts.reimage on a VM. I got a "spicerack.netbox.NetboxHostNotFoundError", and it's "line 105, in _fetch_virtual_machine" -> raise NetboxHostNotFoundError. I can find it in netbox manually, it's https://netbox.wikimedia.org/virtualization/virtual-machines/440/ is that worthy of a ticket?
[17:46:26] or is it not yet expected to work with existing VMs?
[17:46:41] mutante: it didn't work in the past, not sure if that changed
[17:47:02] there were rumours that might have changed. but not sure, yea
[17:47:22] the part that made me look twice was that it seems to fail when doing "fetch_virtual_machine"
[17:47:28] mutante: the documentation says it's in the same state: https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts
[17:47:34] jynus: ok, thanks!
[17:48:03] but obviously I could be not up to date with the latest features
[20:04:22] a "systemd::sysuser"'s homedir should never actually be in /home. is that right?
[20:05:20] so basically anything using `home_dir => "/home/something"` with systemd::sysuser should be changed and can't work
[20:05:45] "PANIC: mkdir /home/...: permission denied"
[20:07:40] Never saw a panic alert on unix; sounds frightening :)
[21:08:37] turns out the class in question has a Hiera setting to run stuff as root or non-root, and to create a new instance you have to temporarily change to root, run it once, then change back to the non-privileged user
[21:29:18] mutante: doc is up to date, the reimage cookbook works only for physical hosts
[21:30:35] there is a task (when I find it) with more details on next steps
[21:31:39] https://phabricator.wikimedia.org/T305589#7837933
[21:32:46] volans: gotcha! thank you. I have reimaged using the classic method
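
For illustration, a minimal pynetbox sketch of the kind of VM lookup that confirms the record itself is reachable via the API, which matches what was seen above: the VM exists in NetBox, and the cookbook failure is simply because sre.hosts.reimage only supports physical hosts, per the doc linked earlier. The hostname and token below are placeholders:

```python
# Sketch: confirming a VM exists in NetBox via the API, independent of the
# reimage cookbook. Assumes pynetbox; "examplehost" stands in for the real VM name.
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")  # placeholder token

vm = nb.virtualization.virtual_machines.get(name="examplehost")
if vm is None:
    print("not in NetBox at all")
else:
    # The record is there; per the Server_Lifecycle doc quoted above, the
    # reimage cookbook just does not support virtual machines yet.
    print(vm.id, vm.status, vm.cluster)
```
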