[07:26:26] <_joe_> something/someone broke the puppet compiler it seems
[07:26:54] arturo, dcaro o/ - On cloudcephmon nodes puppet keeps saying "/Stage[main]/Ceph::Auth::Load_all/Notify[No keydata found for key cinder-backups, skipping.]/message) defined 'message' as 'No keydata found for key cinder-backups, skipping.'" etc.
[07:27:00] <_joe_> ah nevermind
[08:20:39] > Marostegui moved this task from Backlog to Blocked on the DBA board
[08:20:48] phabricator is full of surprises
[08:21:27] :P
[08:28:47] <_joe_> Amir1: marostegui is still a manager at heart
[08:29:14] haha
[08:29:51] :-(
[09:16:22] is anyone looking at the huge amounts of cron spam from ms-be-01? Emperor?
[09:19:06] mmhh thanks kormat, I will (my bad)
[09:19:44] that's the 'swift' project in cloud FWIW
[09:24:49] * Emperor twitches
[09:25:01] (oddly, I don't seem to have a lot of cron-spam (yet?))
[09:25:43] I'm wondering what's the "right" "fix", possibly blackhole root@ if we're in cloud/pontoon on the local exim
[09:26:17] the alternative would be for each pontoon to have an "owner" who got the root cron-barf
[09:26:37] (which I guess might sometimes be useful)
[09:29:49] indeed that's even better
[09:30:13] elukey: thanks, will fix soon
[09:51:39] filed the owners contact for pontoon as T296373 for now to not forget
[09:51:40] T296373: Define owners email address for Pontoon - https://phabricator.wikimedia.org/T296373
[10:36:13] godog: i'm still seeing some cron spam btw.
[10:39:08] kormat: *sigh* from ms-fe-01 still ?
[10:40:55] godog: ms-be-01
[10:41:03] but yes to 'still'
[10:41:17] on that note, there was some cronspam recently from bounces from alerts still going to brooke's (now inactive) email
[10:42:55] on a separate note... there is no one lined up for our upcoming SRE Monday meeting session slot -- anyone interested?
[10:49:16] volans: shall I merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/741100 ? I don't think I've ever merged a spicerack patch before, are there any concrete steps to do?
[10:50:00] that's a cookbooks' patch ;) just +2 on gerrit and that's it, see https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Deployment
[10:50:17] right
[10:50:26] kormat: thank you, "fixed" but yeah the real solution will be the task above
[10:50:36] godog: 💜
[10:50:46] volans: thanks, done
[10:50:49] thank you
[11:56:27] how do we select which OS gets a system when installing them these days?
[11:58:07] `sudo cookbook sre.hosts.reimage --os bullseye` that's all?
[12:00:30] arturo, yep: https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Check_available_options
[12:00:53] I mean, assuming bullseye is your target :-)
[12:01:16] my doubt is basically about how to persist that information
[12:01:36] or how is that information persisted today, if we do that
[12:02:02] if the question is- how to reimage without an upgrade?
[12:02:35] well, let me give a bit more context
[12:02:45] we recently got a new system installed with the wrong OS
[12:02:52] not a big deal, we can work with that
[12:03:09] but I remember back in the day the OS information was stored in the puppet tree... for install servers
[12:03:48] the os to install should be on the ticket for dcops (I think it was there normally)
[12:03:57] so I wonder if we no longer store the OS information anywhere, as a source of truth kind of... is this a bit more error prone?
[12:04:26] well, puppetdb is our source of truth and it has that data
[12:04:26] * volans at lunch I can answer later
[12:04:56] I think based on secret WIP patches, moritz, has some secret orchestration of versioning per roles, maybe?
[12:04:57] the desired OS for an initial install done by DC ops can simply be added to the Phab task with the racking details
[12:05:20] not orchestration, inventory mayhaps?
[12:05:30] yes, the info was on the phab ticket for dcops. But if we only have that `--os` switch in the cookbook, and not a persisted source of truth, it could be the root cause of some human mistakes
[12:06:50] but once a server is installed, it's in our source of truth. the intended OS choice before a server gets installed needs some mind reading features which Netbox/Puppet don't (yet!) have :-)
[12:06:53] arturo, I am genuinely not understanding- things used to be on puppet
[12:06:58] now it is on command line
[12:07:09] what we had in puppet was even worse
[12:07:23] since it didn't need to reflect what was currently running
[12:07:25] on human error, both will behave the same
[12:07:53] a server might have been installed with stretch, but upgraded to buster, but the DHCP record was never updated
[12:07:58] and to track it, I think puppet facts and prometheus would be the way?
[12:08:03] that's an actual example that happened
[12:09:14] are you thinking of accidentally downgrading a server?
[12:10:18] We used the Netbox "platform" field to define this in my last place, I see here we just use a more generic "Linux" setting for most of the servers.
[12:10:22] as the workflow seems to be today, for a new server, a human has to read the information in a phab task, and type it in the cookbook cmdline. That phab->cmdline travel inside the human head can cause errors. Or at least, I suspect it caused an error
[12:10:54] I see no change
[12:11:19] before, dcops had to add those fields onto puppet manually too!
[12:12:48] yeah but that puppet patch could be reviewed
[12:13:02] * arturo sorry, irccloud a bit flaky here
[12:13:04] ah, ok, that I can see- there is less accountability, I can see that
[12:13:41] anyway, no big deal. I was just curious how the workflow was today
[12:14:36] I don't think there are any action items unless we see the same issue happening over and over again
[12:15:09] I think dcops operations will benefit from increased automation, I think that is a known gap, but not super-easy and fast to fix
[12:15:24] you cannot just automate racking, sadly :-)
[12:16:34] on the win side- usually you will save one puppet patch on upgrade, and prevent accidental reimages, which is a big plus
[12:18:24] hi all i just added a new function to puppet wmflib::argparse which takes a Hash of arguments and parses them into a string of --long arguments based on their value type
[12:18:28] see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/wmflib/functions/argparse.pp
[12:18:45] and the chain starting at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/741110/4 for examples
[12:19:58] oh nice
[12:20:06] jbond: clever!
[12:21:07] * arturo looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/741114
[12:21:10] should quotes be added for string values? I wonder if I can inject code there :-D
[12:21:33] or on the more trivial side, if strings with spaces will work
[12:21:53] jynus: yes i think adding quotes would be a welcome improvement :)
[12:22:21] jbond: slightly related, we often have long lists of input vars in several puppet manifests. Would you say that a better pattern would be to collapse them into hash/struct data types when possible?
[12:22:46] I will try with a test first
[12:23:01] jynus: ack thanks and bing me if you need pointers
[12:23:20] I prefer duckduckgo, but I will :-)
[12:23:48] -_-
[12:23:56] lol
[12:24:27] arturo: im not sure there is a hard and fast rule. i have definitely used both patterns in the past. i generally use a combination, often it depends on how you want to reference the variables in other contexts.
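As a rough illustration of the hash-to-`--long`-arguments idea discussed above (12:18-12:21), including the quoting of string values that was asked about, here is a minimal Python sketch. It is not the Puppet `wmflib::argparse` implementation: the function name `hash_to_long_args` and the per-type rules (booleans become bare flags, lists are comma-joined, everything else is shell-quoted) are assumptions made only for this example.

```python
import shlex
from typing import Any, Mapping


def hash_to_long_args(args: Mapping[str, Any]) -> str:
    """Render a mapping as a string of --long options.

    Hypothetical rules, chosen for illustration only:
    - None / False: the option is omitted
    - True: bare flag (--key)
    - list / tuple: values comma-joined into one argument
    - anything else: str(value), shell-quoted
    """
    parts = []
    for key, value in args.items():
        flag = f"--{key}"
        if value is None or value is False:
            continue
        if value is True:
            parts.append(flag)
        elif isinstance(value, (list, tuple)):
            joined = ",".join(str(v) for v in value)
            parts.append(f"{flag} {shlex.quote(joined)}")
        else:
            # shlex.quote leaves safe strings untouched and wraps anything
            # containing spaces or shell metacharacters in single quotes.
            parts.append(f"{flag} {shlex.quote(str(value))}")
    return " ".join(parts)


if __name__ == "__main__":
    print(hash_to_long_args({"os": "bullseye", "force": True, "retries": 3}))
    print(hash_to_long_args({"msg": "a b; rm -rf /"}))
```

Quoting at render time is what keeps a value with spaces as a single argument and stops a value like `a b; rm -rf /` from being interpreted by the shell.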
[12:24:51] ok, that matches my experience as well
[12:25:44] but looking at some puppet code I wrote some time ago (last month), it could definitely benefit from more abstracted datatypes
[12:26:22] I tend to prefer to define a custom type most places where i would have a hash where the keys are known
[12:26:52] similarly, if a parameter takes a string with a known set of values i prefer to define an ENUM
[12:27:07] but it can become tedious so got to use your judgment
[12:27:14] fair
[13:25:03] * volans back, reading backlog
[13:31:41] arturo: to recap, previously we were not really hardcoding the OS, in the DHCP config there was a global default OS and then hosts that needed a different OS were hardcoding it into the host-specific stanza. At some point that default was moved to the next OS, potentially causing much more unwanted installs with a different OS. Not to mention that the hardcoded value in the DHCP config could go out
[13:31:47] of date, for example for those rare ...
[13:31:49] ... cases in which we had upgraded in place the host. The OS was communicated in the provisioning Phab task, and that bit has not changed.
[13:32:09] What changes now is that all the installs are explicitly stating which OS to install
[13:33:39] After that the information is present in multiple places, directly or indirectly (puppetdb facts, debmonitor, SAL, provisioning Phab task, Phab task updated by the cookbook)
[13:34:15] (..., cookbook logs, and I'm sure I'm forgetting something else)
[13:36:47] For context topranks, it was discussed in the past if we should save that information into Netbox but ended up deciding it would have probably caused more harm with potential data drift than benefits. We can of course revisit that choice and in case we'll go that route the reimage cookbook would read that information from Netbox and not allow to override it. And then we could add a check to the
[13:36:53] Puppetdb-Netbox report to ensure reality ...
[13:36:56] ... and theory match each other.
[13:37:50] EOF, sorry for the long reply :)
[14:16:07] moritzm, may I ask you for a quick look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/740815 ? I only need your ok on the releases yaml, to make sure I haven't broken it 0:-)
[14:18:19] looking
[14:20:33] followed up :-)
[14:24:54] thank you very much
[14:25:27] those refactoring patches are not super important, but if left for a long time rebasing becomes more and more difficult
[14:33:06] volans: ack
[16:53:25] fyi all just hit this using the most recent version of git (sid) with git-review https://opendev.org/opendev/git-review/commit/25c2d3fe9678bb467952e8203cfc8c40f4a86a87
[16:54:07] <_joe_> question: who owns wikireplicas?
[16:54:34] <_joe_> specifically, I have a patch to deduplicate their definitions in service::catalog and I am seeking reviewers
[16:54:50] <_joe_> (patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/741703)
[16:58:29] _joe_: if https://phabricator.wikimedia.org/project/view/2874/ is to be trusted, I'd add bd80-8
[16:59:38] echoing in an Americas-friendly TZ: we have no presentations lined up for our upcoming SRE Monday meeting -- is anyone interested in presenting?
[17:12:32] _joe_: best to confirm from andrewbogott, but these days it's data engineering team I think
[17:37:58] _joe_: I can confirm we're transitioning at least some parts of wiki replica ownership to data engineering. That being said, that particular patch could probably benefit from a joint review between WMCS/DE/you :-P
[17:38:49] <_joe_> arturo: that patch is a noop, I just want someone to tell me "oh no stop we have plans that would make that patch not useful" if that's the case
[17:39:11] that's not the case, I think that proxy layer will stay that way for the time being
[17:39:37] so the refactor LGTM
[17:39:51] that `<<` syntax is new to me though
[19:16:16] any reason dbctl and cookbook print normal output to stderr?
[19:16:52] my script goes "ALERT THINGS FAILED STDERR"
[20:52:18] Amir1: which cookbook?
[20:53:17] dbctl depends IIRC
[20:55:04] to answer your next question cumin sends output to stdout while progressbars and report of success/failure to stderr, that's also useful when using -o/--output :)
[20:56:33] dbctl too should print to stdout the content and messages to stderr IIRC, but I would need to check
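
The stdout/stderr split described above (content on stdout, progress and success/failure summaries on stderr) can be sketched as follows. This is an illustrative Python sketch only, not code from cumin, dbctl or spicerack; the function name `report` and the host names in the usage example are hypothetical.

```python
import sys


def report(results, out=None, err=None):
    """Write per-host content to stdout and the summary to stderr.

    `results` maps a host name to True (success) or False (failure).
    Returns a process-style exit code: 0 if every host succeeded, else 1.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    failed = [host for host, ok in results.items() if not ok]

    # Content goes to stdout so it can be piped or captured cleanly.
    for host, ok in results.items():
        print(f"{host}: {'ok' if ok else 'FAILED'}", file=out)

    # Human-oriented status goes to stderr and does not pollute the content.
    succeeded = len(results) - len(failed)
    print(f"{succeeded}/{len(results)} hosts succeeded", file=err)

    return 1 if failed else 0


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    sys.exit(report({"db1001": True, "db1002": True}))
```

With a tool written this way, a wrapper script that treats any stderr output as a failure will raise false alarms; checking the exit code is likely the more reliable signal.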