[06:54:27] hello folks, if anybody with logstash experience has time for a quick review - https://gerrit.wikimedia.org/r/c/operations/puppet/+/704746
[07:03:03] also, a question about planet1002
[07:03:16] the planet-update-en.service is failing due to
[07:03:23] Jul 15 07:01:00 planet1002 rawdog[11680]: An error occurred while reading state from /etc/rawdog/en/state.
[07:03:26] Jul 15 07:01:00 planet1002 rawdog[11680]: This usually means the file is corrupt, and removing it will fix the problem.
[07:04:06] I'd be inclined to cp the file to my home and delete it, to see if the unit works again, but I've never done it, so if anybody has experience please let me know :D
[07:04:31] does it update en.planet.wikimedia.org?
[07:14:26] * elukey tries
[07:18:00] looks like it is working
[07:18:49] (brb)
[09:11:59] hello! I have a couple of puppet patches pending review and could use a pair of eyes to push them please :]
[09:12:32] The first drops a Hiera yaml file that is not having any effect; it is a mistake I made in an earlier patch and the file should just be deleted ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/673286 )
[09:13:11] the second is to add a motd for all deployment-prep / beta cluster instances to mention the WMCS terms. That would fix a task from 2015! :] https://gerrit.wikimedia.org/r/c/operations/puppet/+/699207
[09:33:35] Amir1: Hey, that all looks good to me. Let me know if you want me to merge it.
[09:33:38] Thanks!
[09:33:51] And sorry for the delayed response, had some IT issues here :(
[09:34:14] topranks: technology is terrible
[09:36:13] I'm moving to the wilderness to live in a hut.
[09:48:48] 👩‍🌾
[09:50:33] topranks: Hi, it's okay to merge it any time you want, just make sure the crons are properly gone from netmon :D
[09:51:13] Ok I will do :)
[09:56:09] very minor script to check kernel versions across the fleet: https://phabricator.wikimedia.org/P16823
[09:56:59] kormat: what don't you like about https://debmonitor.wikimedia.org/kernels/ ? :-P
[09:57:39] volans: if you can tell me how debmonitor can take a cumin host spec and check if a given host meets a minimum kernel requirement or not, i'm all ears :)
[09:58:02] the paste has an example usage and output
[09:58:28] * volans was joking
[09:58:32] it's a different use case
[09:58:38] quite :)
[09:59:49] to be fair, it was proposed to add to debmonitor the detection of a needed reboot if the installed kernel didn't match the currently running one
[10:00:10] but it was deemed not really necessary / had some complexity for corner cases
[10:00:34] different use-case again
[10:00:34] I have a vague recollection though, I might be mis-remembering it all, so don't quote me on this :)
[10:00:39] ack
[10:03:45] volans: that sort of metric would in general show most DB hosts as being in need of a reboot, fwiw
[10:03:58] we're uptime-fanatics
[10:04:32] eheheh
[10:34:22] jbond: https://wikitech.wikimedia.org/w/index.php?title=Help%3APuppet-compiler&type=revision&diff=1918665&oldid=1917337
[10:36:24] I guess that's where the distinction between service uptime and server uptime should come into play xd
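
The paste itself (P16823) is not reproduced in the log. As a rough illustration of the idea being discussed — checking whether a host's running kernel meets a minimum version — a per-host sketch could look like the following. The minimum version, the Cumin alias in the comment, and the version-stripping are assumptions for illustration, not the contents of the paste:

    # Sketch only: run per host, e.g. fanned out across a Cumin host spec
    # with something like `sudo cumin 'A:db-all' '<this snippet>'`.
    min="4.19.181"                   # hypothetical minimum kernel version
    run="$(uname -r | cut -d- -f1)"  # e.g. "4.19.0-16-amd64" -> "4.19.0"
    # sort -V is a version-aware sort: if $min sorts last and differs from
    # $run, the running kernel is older than the required minimum.
    if [ "$(printf '%s\n%s\n' "$min" "$run" | sort -V | tail -n1)" = "$min" ] \
        && [ "$run" != "$min" ]; then
        echo "KERNEL TOO OLD: $run < $min"
    else
        echo "ok: $run"
    fi

This sidesteps the debmonitor question above: debmonitor reports installed kernel packages, while a check like this compares the *running* kernel against a caller-supplied minimum, which is the different use case volans and kormat agree on.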
[10:45:26] I'm getting spammed (~50 emails so far today) by some cron on thanos-fe2001, is anyone looking into that?
[10:46:05] godog: ^^^
[10:47:02] the host was reimaged this morning AFAIK
[10:47:57] and thanos store failed to start
[10:48:44] can't connect to thanos-swift.discovery.wmnet
[10:49:23] < HTTP/1.1 503 Service Unavailable
[10:49:50] it tries to connect to itself
[10:53:12] the host is still downtimed from the reimage downtime, but is not healthy
[10:53:59] puppet is disabled, I'm commenting out the crontab entries
[10:56:43] {done} !log-ged it to the related task ( T285835 ) - godog
[10:56:44] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835
[10:57:38] thanks
[11:02:02] thanks for pointing that out, I had just noticed it myself in my inbox
[11:07:07] I've added 3 more downtime hours to the host as the original one from the reimage is about to expire.
[12:12:03] gah, sorry about the spam and thank you volans for the fix
[12:45:24] godog: it's just a stopgap, I didn't know what the status was so I didn't try to fix the service that was not starting
[12:46:50] lmk if you need any more details on what I did, to revert it once all works fine
[12:47:51] Quick question. I'm getting access denied when trying to access netbox. I get past the SSO login, but then access denied. Is there anything else I need to do in order to assign the correct rights please?
[12:48:12] volans: thank you, will do! I'll move the crons to another buster host
[12:48:53] btullis: do you know already if you are in the LDAP group called "ops"?
[12:49:12] mutante: Yes, I am already in ops.
[12:49:15] docs say "either wmf or ops"
[12:49:23] ok, then it's not that.. hmm
[12:49:30] the groups were not caught
[12:49:59] btullis: could you try to log out from netbox and then log in again (should show the IDP login page)
[12:50:11] we changed something there not too long ago and might have a bug in a corner case
[12:50:28] I did try that in a private browser window, but I'll try again now.
[12:51:00] if that doesn't work tell me and I'll just delete the user and you can re-login and we will dig into the details later
[12:52:37] Yep, didn't work. Logged out of SSO, then back to netbox, logged in with SSO and then access denied from the netbox interface. Thanks volans.
[12:53:09] btullis: user deleted, try again please, sorry about that
[12:55:07] Bingo. I am in. I logged out of IDP and then back into netbox. Received one transient error (want to see it?) and now I'm in.
[12:57:13] yes please, that would be useful
[12:59:53] This appeared just once, during my login with a new user https://usercontent.irccloud-cdn.com/file/dG42ab9O/netbox%20error.png
[13:00:22] Many thanks again.
[13:15:49] * volans in a meeting, will look shortly, sorry
[13:38:15] volans: It was just an FYI. All is fine with my netbox now, thanks.
[13:41:37] btullis: thanks for your patience, I pinged you in query, when you have a second
[13:42:44] We're about to test a changeover between maps clusters in codfw - no impact is expected and rollback is a confctl change, but just to give advance notice.
[13:43:52] \o/ good luck!
[13:50:53] Reverted, it did not go well.
[13:50:57] :]
[13:51:36] all apergos fault! :-P
[13:52:11] you can't jinx it so blatantly :D
[13:52:30] dang it
[13:52:34] sorry hnowlan :-(
[13:52:43] next time I'll use the obligatory "break a leg" phrasing...
[13:53:30] haha
[16:36:37] trying the maps switchover again, there might be some noise.
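
For context on the maps changeover above: the log does not show the actual commands, but a confctl-based depool with the easy rollback hnowlan mentions might look roughly like this. The dc/cluster tags and the hostname are illustrative guesses, not the real maps configuration:

    # Inspect the current pooled state for the cluster (tag values are guesses):
    sudo confctl select 'dc=codfw,cluster=maps' get
    # Shift traffic by depooling a host on one side of the switchover:
    sudo confctl select 'dc=codfw,cluster=maps,name=maps2001.codfw.wmnet' set/pooled=no
    # Rollback is simply the inverse confctl change:
    sudo confctl select 'dc=codfw,cluster=maps,name=maps2001.codfw.wmnet' set/pooled=yes

Because the change is just flipping the pooled flag in etcd, reverting (as happened at 13:50) is a single command rather than a deploy, which is why only "some noise" is expected.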
[19:01:24] Hi all, I get emails that look like [Cloud VPS alert][deployment-prep] Puppet failure on deployment-logstash03.deployment-prep.eqiad.wmflabs (172.16.1.184) regularly (I'm on the root@ alias). Does anybody know what this is about?
[19:01:24] Logging in to the node and running `sudo -i puppet agent -t`, I got: `Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /etc/puppet/modules/confluent/manifests/kafka/broker.pp, line: 332, column: 11) on node deployment-logstash03.deployment-prep.eqiad.wmflabs`
[19:02:38] I feel like the puppet code referenced is outdated and potentially the instance itself is obsolete
[19:09:57] razzi: https://phabricator.wikimedia.org/T286567#7214310
[19:17:06] razzi: and those alerts go to all admins of the deployment-prep cloud vps project, production root@ has nothing to do with them
[20:18:37] Thanks for the links majavah !
[20:18:50] link / tips
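
For what it's worth, the error razzi pasted means the Puppet code at broker.pp line 332 applies `[]` to a value that resolved to undef — typically a Hiera lookup that returns nothing for this node, which fits the "outdated code / obsolete instance" theory. A diagnostic sketch, assuming access to the node and its puppetmaster; the Hiera key name below is purely illustrative, not taken from the log:

    # Show the code around the failing line to see which variable is indexed:
    sudo sed -n '325,335p' /etc/puppet/modules/confluent/manifests/kafka/broker.pp
    # On the relevant puppetmaster, check whether the suspect Hiera key
    # resolves for this node (key name is a guess for illustration):
    sudo puppet lookup --node deployment-logstash03.deployment-prep.eqiad.wmflabs \
        --explain profile::kafka::broker::kafka_cluster_name

If `puppet lookup --explain` shows the key resolving to nothing at every hierarchy level, the fix is either to define it for the project or, as suggested above, to retire the instance that still pulls in the stale role.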