[06:08:03] XioNoX, mmandere: when around, I believe you should update this week's schedule (A still shows as assigned this week), but I wanted you to be around before breaking things
[06:12:39] jynus: ack
[06:13:15] I believe it was because multiple switches happened, not reflected on victorops
[06:13:17] mmm, is it me or is the sync to our puppet repo on GitHub not working? the last commit there is from 6h ago: https://github.com/wikimedia/puppet
[06:14:40] mmandere: you should, I think, coordinate with alex to remove yourself from next week too
[06:14:49] marostegui: let me see
[06:15:08] jynus: I can create a task and ping releng if you like
[06:15:43] marostegui: I see your 2 commits on github
[06:15:58] RhinosF1: Oh wow, it just got there indeed
[06:16:07] jynus: ^ seems to be fixed
[06:16:07] there must be some lag
[06:16:11] yeah
[06:16:24] I saw it there too
[06:16:38] great!
[06:22:37] do you often use github (out of curiosity)?
[06:24:52] jynus: yeah, I use it to browse the repo
[06:50:47] same
[07:50:40] Unless it conflicts with the plans of someone else: I'm going to disable Puppet fleet-wide in a little bit. The goal is to move Puppet from crontab to systemd timers.
[07:52:52] how will failures be handled, will they generate systemctl alert spam?
[07:54:05] Yes, we can disable alerting if it turns out to be an issue
[07:54:35] sorry, I didn't understand, do you mean "Yes, it will generate systemctl alert spam?"
[07:54:57] Sorry, yes, a failure will generate a systemctl alert
[07:55:06] It's this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/807118
[07:55:12] but we disabled puppet failure alerting, that will be a regression
[07:55:24] we only alert on widespread failures, give a warning otherwise
[07:55:55] Fair point, I'll amend the patch to not alert on failures
[07:56:12] what's worse, it may hide legitimate systemctl alerts
[07:56:17] I will comment on the ticket
[07:56:32] Thank you, I'll abort the mission for now :-)
[07:57:49] I'll just check, it's the same script, so maybe it won't generate spammy alerts
[07:58:31] jynus: good catch!
[08:00:20] I mentioned it because I have pending changes to a script along the same lines
[08:01:35] I don't want systemd to alert if 1 backup out of 10 fails, for example, so I have to tune the exit status to only alert on fatal failures (e.g. config file not found)
[08:02:36] (I have backup monitoring separately, of course)
[08:02:40] so it is a similar case
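To make the exit-status tuning described above concrete, here is a minimal sketch of such a wrapper. It is not the actual backup script: the config path and the list-backup-jobs / run-backup-job helpers are hypothetical, and only the exit-code policy is the point.

    #!/bin/bash
    # Sketch only: per-job failures are left to the separate backup monitoring,
    # so they do not fail the systemd unit; only fatal "cannot even start"
    # errors (e.g. missing config) exit non-zero and surface as a unit failure.
    CONFIG=/etc/backups/backups.conf               # hypothetical path

    if [[ ! -r "$CONFIG" ]]; then
        echo "fatal: config file not found: $CONFIG" >&2
        exit 1                                     # fatal -> unit fails -> alert
    fi

    failed=0
    for job in $(list-backup-jobs "$CONFIG"); do   # hypothetical helper
        run-backup-job "$job" || failed=$((failed + 1))   # hypothetical helper
    done

    echo "$failed backup job(s) failed; details are in the backup monitoring"
    exit 0                                         # partial failures are not fatal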
[08:03:24] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/792113 <- We have this one, I think you commented on that one as well, but yes, same issue
[08:03:41] yes, that is the one I referred to
[08:03:45] I will apply it
[08:03:54] just need time to update the script
[08:04:21] I'll see what we can do with the run-puppet script.
[08:04:45] maybe update the systemctl monitoring to ignore puppet?
[08:05:11] as puppet has its own "monitoring stack"
[08:05:27] or force a 0 exit after running from the systemd timer
[08:05:30] ?
[08:05:32] unsure
[08:05:52] maybe there is a timer config?
[08:07:39] We could just disable monitoring for the Puppet service, or were you concerned that it might hide too much?
[08:10:25] but the monitoring is a bit complex - it uses aggregation to decide whether to alert
[08:11:02] e.g. alertmanager only alerts if it fails on more than 1 host
[08:11:23] per-host puppet failure alerting (not monitoring) was disabled due to spam
[08:11:43] oh, or you mean the systemctl one, specifically?
[08:11:54] Oh, yes, just the systemctl one
[08:12:08] sorry, I misunderstood
[08:12:12] The alertmanager alerts are fine, they aren't being influenced
[08:12:15] yeah, that was one of my suggestions indeed
[08:12:44] In that case I misunderstood :-)
[08:12:56] I'll amend the patch and let people review it again
[08:12:58] "maybe update the systemctl monitoring to ignore puppet?" <- here
[08:13:44] Perfect, I just put way more meaning into that
[08:14:26] let's make sure people are aware of it, though on ticket and maybe an email or SRE meeting comment
[08:14:36] *trough the patch
[08:14:40] *through
[08:14:43] argh, I cannot write
[08:16:45] slyngs: I think you'd be better off just running bash -c "run-puppet-agent || /bin/true" from the timer if you don't want it to fail on puppet failures
[08:17:27] that is the other option I mentioned :-), whatever most people prefer (not sure which one would cause the least surprise?)
[08:17:38] I would not modify run-puppet-agent tbh, the way it returns status codes is somewhat related to how we run stuff in cumin
[08:18:02] ah, makes sense
[08:18:03] I don't think systemd will be too happy about the ||
[08:19:07] We could do "Yet another wrapper" that will run bash -c "run-puppet-agent || /bin/true"
[08:20:45] the last option is to downgrade systemctl to a warning for the puppet timer only
[08:20:55] *systemctl monitoring
[08:21:27] slyngs: why would systemd not be happy about that? if you quote the command correctly it will work
[08:22:16] Is it just the redirects it doesn't like?
[08:23:48] just single quotes should do the trick according to this: https://unix.stackexchange.com/a/496370
[08:25:00] I'll just do a few tests. The whole thing is wrapped in another script, to handle sending email
[08:25:08] he he
[08:25:19] maybe that could be changed instead
[08:26:04] Yeah, but that's equally dangerous, as that wrapper wraps a ton of custom systemd timers and jobs
[08:27:14] something like
[08:27:19] ExecStart=/bin/bash -c "/bin/false || echo TEST"
[08:27:20] works
[08:27:26] so I'm not sure what the worry is here
[08:28:58] It will end up as ExecStart=/usr/local/bin/systemd-timer-mail-wrapper /bin/bash -c "/bin/false || echo TEST"
[08:29:29] why do we need to send email?
[08:29:42] Maybe... I might just be overly paranoid
[08:29:44] I don't think we do from the current cron
[08:30:02] and I don't see a good reason to mail everyone if puppet is failing to run
[08:30:07] given we're already alerting on it
[08:31:09] No, you're right, just not sending email, ignoring errors and letting AlertManager handle the alerting should work
[08:31:29] cron simply failed silently, so with the migration to systemd timers we need to figure out the cases where this would actually make sense
[08:31:44] +1
[08:31:49] but I agree that for puppet runs this would be too noisy and the other measures we have in place are effective enough already
[08:32:36] I'll just do a little tweaking to the patch and run it through review again :-)
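For reference, the bash -c option discussed above would look roughly like this in the timer's service unit. This is a sketch of the approach that was considered (and later dropped in favour of a cleaner one, see below); the unit contents are illustrative, not the actual production unit.

    [Service]
    Type=oneshot
    # Single quotes keep systemd from mangling the command line (see the
    # unix.stackexchange.com answer linked above); "|| /bin/true" makes the
    # unit report success even if run-puppet-agent exits non-zero.
    ExecStart=/bin/bash -c 'run-puppet-agent || /bin/true'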
[08:33:49] sorry if this caused you more work - I was surprised nobody else brought it up
[08:35:56] as an example, I now see up to 7 systemd unit run errors on icinga
[08:36:12] (some acked/disabled)
[08:36:13] jynus: No, this is great, much better to deal with it upfront
[08:36:55] So it turns out that someone thought about this: systemd will ignore errors if the command is prefixed with -
[08:37:16] And our puppet module actually knows about that
[08:37:41] much cleaner solution :-D
[08:38:32] Whoever wrote the puppet module had an amazing amount of foresight
[08:39:49] From the systemd manual: If the executable path is prefixed with "-", an exit code of the command normally considered a failure (i.e. non-zero exit status or abnormal exit due to signal) is recorded, but has no further effect and is considered equivalent to success.
[08:41:43] you get my +1 if you fix the extra space and basically copy and paste that into the patch (explaining why ignore_errors) :-)
[08:53:30] jynus: For you, I'll delete all the spaces you want.... assuming the linter is also happy :-)
[08:58:19] you should be happy to get prompt reviews, even if nitpicky! I am still waiting for some that have been blocking my main goals for over a month!
[08:59:36] I really do appreciate the reviews, and especially the prompt ones.
[09:48:46] slyngs: sounds like you found the required knob, but for clarity: if you set `ignore_errors => true` then systemd will ignore any errors (https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/timer/job.pp#L139)
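In unit-file terms, the "-" prefix from the manual quote above looks like the sketch below; per the messages above, this is the behaviour that the ignore_errors => true knob on the systemd::timer::job Puppet define is meant to turn on. The script path is illustrative only.

    [Service]
    Type=oneshot
    # The leading "-" tells systemd to record a non-zero exit status but
    # treat the run as successful, so no failed-unit alert is generated.
    # Path is illustrative, not necessarily the real location of the script.
    ExecStart=-/usr/local/sbin/run-puppet-agent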
[10:48:07] Now that everyone is happy: Puppet will be disabled in 15 minutes
[10:53:51] Oooh, that's a scary number of servers :-)
[10:58:07] Argh, I forgot about cloud, I want to disable Puppet there as well.
[10:59:57] people might be running local puppet masters, you
[11:00:06] won't be able to disable it fully
[11:00:35] but the majority of cloud hosts are running against a central puppet master running in cloud vps itself
[11:00:55] but I currently don't know the procedure to disable it, best to check in #wikimedia-cloud-admin
[11:01:48] there is a cloud cumin master; if you have global root in cloud or are part of that project you should be able to do it for all hosts
[11:01:52] including self-hosted puppetmasters
[11:04:45] volans: is that just another host I can access?
[11:05:06] https://wikitech.wikimedia.org/wiki/Cumin#WMCS_Cloud_VPS_infrastructure
[11:05:08] yes
[11:05:19] Perfect, thank you
[11:05:30] if you are in that project or have global cloud root
[11:05:36] (not all SREs have it)
[11:06:33] I don't even think I have ssh set up for wikimedia.cloud
[11:12:49] I don't appear to be allowed to log in. I'll ask in cloud-admin for help
[12:13:48] moritzm: I recall you said the decom script needed some patches on Friday? Was that done already?
[12:14:50] marostegui: only for VMs, baremetal servers are fine
[12:15:54] for VMs I think it will be fixed today
[12:16:26] ah cool
[12:16:27] thanks
[12:29:47] yes, I'm working on it, today/tomorrow at ma
[12:29:48] *max
[12:30:42] Yeah, no rush volans, I was just asking to see if we can use the decom script
[13:47:14] Do we have anything resembling a runbook/cookbook/automation for rotating a puppet cert?
[13:47:45] sre.puppet.renew-cert
[13:48:25] awesome! thx, will read
[13:48:29] it has been a while since I last used it andrewbogott, so not 100% sure it's all still up-to-date, but check its --help message for details on what it does
[13:54:25] I used it about a month ago and it worked totally fine
[13:55:52] feature request: store and display when the last successful run of a script was
[14:52:50] XioNoX: as in a cookbook? they all log to /var/log/spicerack at least
[14:53:39] rzl: yeah I know, half-joking with the above :)
[14:53:58] haha sorry 👍
[14:55:25] something like https://sal.toolforge.org/production?p=0&q=%22sre.puppet.renew-cert%22&d= ? :-P
[15:07:46] volans: you should create a cookbook to run that command :P
[15:07:53] lol
[17:01:03] mutante, rzl taking over, right?
[17:05:10] jynus: yea
[17:05:22] (nothing to report, we had a user reporting errors, but it was quite localized and not widespread)
[17:05:30] alright
[17:06:00] had a look at my latest updates about T303534 (no more comments here)
[17:06:21] *have, if you can
[17:11:25] ACK, TIL that is another way to get a page by ID instead of name
[17:20:18] 👍
[18:10:37] klausman: hi, do you know about the ml-cache2 hosts? any ongoing work?
[18:11:00] elukey: or if you are still around maybe
[18:11:13] I think elukey has more state on that than I do
[18:11:27] arnoldokoth is running cookbooks for unrelated work but he gets a bunch of ml- hosts in the DNS diff
[18:11:36] he needs some help to check what to do
[18:11:48] can he pastebin the diff?
[18:13:08] arnoldokoth: ml-cache2001-a, ml-cache2002-a, ml-cache2003-a .. right?
[18:13:20] https://usercontent.irccloud-cdn.com/file/pW6bH88G/Screenshot%202022-06-27%20at%2021.05.32.png
[18:13:30] Here's the diff.
[18:13:49] so we need to figure out whether someone just forgot to run the DNS cookbook or there is an issue with DNS sync
[18:13:55] or something else
[18:14:17] if he cancels the DNS change then it will mean the entire decom cookbook also fails
[18:14:39] I can't really answer what may have been Luca's intent, but IME, adding DNS records (that don't duplicate others) is least likely to break stuff.
[18:15:15] do I understand correctly that these changes are in git, but never made it to the servers?
[18:16:35] not git, but "they are in netbox but never made it to DNS"
[18:16:42] but netbox and DNS are supposed to be in sync
[18:17:05] and these are special types because of that "-a" suffix
[18:17:13] not regular server names
[18:17:29] https://netbox.wikimedia.org/search/?q=ml-cache2001-a&obj_type=
[18:17:36] [mwdebug1001:~] $ host ml-cache2001-a.codfw.wmnet
[18:17:36] Host ml-cache2001-a.codfw.wmnet not found: 3(NXDOMAIN)
[18:18:02] note how https://netbox.wikimedia.org/search/?q=ml-cache2001&obj_type= is not https://netbox.wikimedia.org/search/?q=ml-cache2001-a&obj_type=
[18:18:22] Then I'm drawing a complete blank, sorry
[18:18:56] I know Luca was working on the ml-cache hosts, but I don't know how far he'd gotten / what state things were/are in
[18:18:56] arnoldokoth: well, we tried. Afraid you'll have to abort the cookbook a second time
[18:19:21] and then report the issue to both Luca and Riccardo, ideally
[18:19:43] klausman: ACK, thank you
[18:20:56] klausman: Thank you.
[18:21:01] mutante: Got it.
[18:21:03] np at all
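As an aside, a quick way to spot-check which names from a Netbox/DNS diff like the one above actually resolve is a small loop around the same host command used above. The hostnames are the ones from the diff; this is only a convenience sketch, not part of any cookbook.

    # Prints an NXDOMAIN error for every name that Netbox knows about but
    # that has not yet been pushed to the authoritative DNS.
    for h in ml-cache2001-a ml-cache2002-a ml-cache2003-a; do
        host "${h}.codfw.wmnet"
    done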
[18:59:18] arnoldokoth: ==^
[18:59:26] I'm running the cookbook now
[19:00:13] was about to anyway (I just ran the test run to check whether there were multiple diffs)
[19:00:25] thanks!
[19:00:31] np
[19:02:17] volans: Thanks.
[19:03:57] {done}
[19:26:49] informal poll: do "runbook" and "playbook" mean different things to you personally? if yes, what's the difference?
[19:27:46] rzl: personally, no
[19:28:46] a runbook is a set of instructions that you follow manually, playbooks are automated
[19:30:02] taavi: interesting, that one's new to me!
[19:53:45] !log cp5012 shutting down and removing power via T311264
[19:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:51] T311264: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264
[19:57:07] ok, removed power via PDU commands on the power strips.. gonna let it sit that way for a couple of minutes to ensure as much discharge and reset as I can, remotely...
[19:57:16] even better would be someone pressing the power button on the front, but that's not possible ;D
[20:01:18] ok cp5012, time to rise from the dead
[20:04:08] wooo, idrac now responsive to ssh and https, power removal fixed it
[20:04:24] woho!
[20:04:37] * sukhe joins the dance
[20:04:43] always have that moment of 'well, if it decides to be dead there is shit all I can do about it'
[20:04:59] glad it's coming back to life, the OS will be back online shortly but may as well leave it out of the pool until I complete the updates
[20:05:08] the idrac update shouldn't affect the OS, but why not leave it out heh
[20:05:14] take your time please!
[20:05:21] and thanks of course
[20:05:34] I have an OKR to close out all my fiscal year tasks
[20:05:37] so I'm happy to close this today hehe
[20:07:27] :P
[20:08:38] plus even shorter SLAs for hw repair like this than 'fiscal year'
[20:08:53] I think the SLA for hw repair is very short, we're past it on this one cuz it wasn't critical
[20:10:22] cp502 idrac firmware updating now.
[20:10:32] cp5012
[20:12:29] bah, upload failed the signature check, mehhhh, retrying
[20:16:27] I was out earlier but I read all the backlog and thanks for that, Luca and Riccardo. (never sure whether a ping is good or bad when just wanting to say thanks or ack later in the day in PST, heh)
[20:21:18] who knew it was slow to send a large firmware file via ssh tunnel from SF to TX, then over to Singapore, then down through our mgmt network...
[20:21:23] oh wait, I knew.
[20:21:45] same file, no signature error this time so far.... just one little bit is all it takes
[20:23:49] and update complete, idrac is resetting.
[21:09:20] rzl: for the survey: "playbook is an ansible term" - maybe that's why it's associated with automation. but asking Wiktionary, interestingly "run book" is really IT-specific, "procedures prepared by an IT admin..." and all that (https://en.wiktionary.org/wiki/run_book#English), while playbook can be anything from a children's book to American football apparently (https://en.wiktionary.org/wiki/playbook)