[08:48:07] good morning! docker-pkg has a few patches pending which would be great to get reviewed/merged please ;)
[08:48:18] 1st is to fix mypy; without it I believe any build fails https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/747104/
[08:49:10] there is another one that would make it fail whenever seed_image is empty, that saves us from generating a `FROM \n` line https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/747060/
[08:50:50] and finally there is https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/692995 to use the user `PATH` when invoking verify commands. Typically useful when docker is under /usr/local/bin or similar
[10:05:14] It seems that golang-cfssl 1.6.1 (updated 22 hours ago) needs a newer libc6 than we provide for debian 9 (ms-be2045.codfw.wmnet), what is the process there?
[10:27:53] I'm guessing that this is it https://sal.toolforge.org/log/qb7fKX4B1jz_IcWuz4eU, jbond?
[11:56:01] dcaro: hmm one sec that shouldn't be the case let me take a look
[13:54:58] dcaro: the cfssl issues should be resolved now. let me know if you see any further issues (however it's a national holiday here so may be a not be an instant response)
[13:56:29] (*may not respond instantly)
[17:00:13] Hi! Could someone (with the rights) look up my uid in LDAP on mwmaint1002: /usr/bin/ldapsearch -x "uid=Aqu*"? Please
[17:01:54] aqu1: cn=Aqu has uid 36836
[17:02:15] you can do that from any Cloud VPS instance too if you (like me) don't have the access to do that on mwmaint
[17:14:00] taavi: Thanks! Good to know.
[17:14:07] You can also use https://ldap.toolforge.org/user/bd808 (replace my shell account name with yours) or search with https://contact.toolforge.org/
[17:14:59] bd808: Thanks!
[20:20:48] What's the best way to do a longer suppression? I have a host that will be down for a few hrs
[20:23:45] inflatador: you mean to tell monitoring (icinga) about planned downtime?
the best way is using the cookbook on a cumin host
[20:23:55] alternatively you can click in the icinga.wikimedia.org web UI
[20:26:07] inflatador: ssh cumin1001.eqiad.wmnet and then, real-world example: [cumin1001:~] $ sudo cumin alert100* 'icinga-downtime -h mx2001 -d 2 -r "kernel downgrade"'
[20:26:32] here alert100* is the icinga server, mx2001 is the host that has maintenance, 2 is the duration in hours and r is the reason
[20:26:48] in this case, it is for a failed disk, so probably multiple weeks of downtime.
[20:26:58] Thanks mutante!
[20:26:59] just one host?
[20:27:04] Indeed
[20:27:06] then go to https://icinga.wikimedia.org instead
[20:27:09] Is there a way to make sure we don't run into an icinga downtime expired?
[20:27:12] find it there, click schedule downtime
[20:27:16] and select a date on calendar
[20:27:19] submit
[20:27:34] select a date one year in the future
[20:27:39] but don't forget to delete it again
[20:27:54] more common problem is forgotten downtimes that stick around
[20:28:08] ideally you know the end date but yea..
[20:29:51] I vaguely remember we had discussed a way to tag them in netbox for longer failures
[20:30:27] inflatador: login at https://icinga.wikimedia.org/icinga/ with LDAP credentials, then use the search box to find the host name, then there is a drop-down menu for commands, one of them is schedule downtime
[20:30:39] it's also a good way to test if you have privileges to run commands there
[20:30:49] should be one thing on onboarding checklist somewhere
[20:31:12] it has a calendar thingie there
[20:31:31] mutante: in any case, we should still tag that host as failed in netbox, right?
[20:31:35] mutante you must be psychic!
tried to suppress alerts for elastic2051 but got "not authorized"
[20:31:58] phab link at https://phabricator.wikimedia.org/T298674
[20:32:04] inflatador: same thing that I did this morning here, most likely: https://gerrit.wikimedia.org/r/c/operations/puppet/+/751980
[20:32:21] unless you already got added
[20:32:31] and it's the capitalization of the user name
[20:32:41] gehel: yes, that sounds about right
[20:32:49] Do I need to be BKing or Bking?
[20:33:00] I'm already Bking, let me logout and try BKing
[20:33:03] afaik that does not influence monitoring yet
[20:33:08] might be wrong
[20:33:28] inflatador: ever seen modules/icinga/files/cgi.cfg in the puppet repo?
[20:33:53] * inflatador checks
[20:33:54] that is the global override
[20:34:02] we should add you there
[20:34:11] even though there is technically a more proper way
[20:34:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/751980/1/modules/icinga/files/cgi.cfg
[20:35:13] Gotcha. Gimme a few mins to get a PR going
[20:35:15] you need a) icinga contact in private repo b) be in the cgi.cfg above and c) it must all match the LDAP user :p
[20:35:47] and then the login screen lets you use bking or Bking
[20:36:01] but it also has to match the icinga user exactly :o
[20:36:20] which is why these are always a bit tricky
[20:39:11] inflatador: so.. you know how to search LDAP?
there are different ways too
[20:39:19] then you can compare the existing users in that icinga list
[20:39:45] in LDAP there is "cn:" and "sn:" and "uid:" and they can be different or the same
[20:40:07] and I think it has to match sn: but I also have to check again every single time
[20:40:40] but don't worry, we will just see if it works and fix it if not
[20:42:28] Thanks, I'll check LDAP
[20:42:31] [mwmaint1002:~] $ ldapsearch -x uid=dzahn*
[20:42:35] for example
[20:42:49] or sn= cn= mail=
[20:43:35] you can see for me personally it's dzahn in icinga and that list, but others have their full name there, even with spaces
[20:44:04] some users added themselves twice, with and without capitalization
[20:44:26] to avoid the issue, so it allows both versions
[20:45:50] what/where is the private repo?
[20:45:59] btw, i can also just downtime that one host for you.. if this slows you down
[20:46:09] just seemed good to do in general
[20:46:28] the private repo is on puppetmaster1001
[20:46:28] yeah, I agree. No hurry here, better for me to get the privs I need
[20:46:53] [puppetmaster1001:~] $ cd /srv/private/
[20:47:07] ah, that sounds familiar. You may have helped me with this before ;)
[20:47:35] modules/secret/secrets/nagios/contacts.cfg
[20:47:41] check if you are in there
[20:48:20] I don't think you have this yet
[20:49:07] want to add yourself or should I? I would copy lines 347 through 357.
that is the one I just added earlier
[20:49:31] yeah, I don't have it yet, was thinking of the pwstore stuff
[20:49:42] ack, this is separate
[20:49:53] so.. there are different levels here as well
[20:50:04] historically this would either send SMS or only email
[20:50:14] and you can configure a timezone for work hours
[20:50:35] and options _what_ notifies you, a host going down, a host coming back, a service recovering and so on
[20:50:52] but for now, I would suggest just copying a simple one like the one for jgleeson
[20:50:58] I see a big warning about not copying off puppetmaster, do I just make the edit in-place?
[20:51:05] and the paging part is elsewhere
[20:51:28] yes, this is a git repo but you edit it in place
[20:51:29] use sudo
[20:51:38] sudo git add, sudo git commit
[20:51:43] ACK
[20:51:44] don't amend or rebase
[20:52:02] then you should see an email to root@
[20:52:06] that tells us you did a commit there
[20:53:05] then... ssh to the icinga prod machine, currently should be alert1001.wikimedia.org (note this is not in .wmnet but has a public IP)
[20:53:12] and run puppet there
[20:53:20] and see icinga adding a new contact
[20:58:52] cool, still working thru this carefully ;)
[20:59:32] yea, take your time
[21:00:40] i'm going to merge my own change to icinga config so later you can rebase on that
[21:01:03] because herron reviewed it, thanks:)
[21:02:30] jhathaway: conflict in puppet-merge with yours, but I saw that one, deployment-prep :)
[21:02:35] typing "multiple"
[21:02:52] mutante: thanks
[21:03:15] merged. np
[21:10:30] OK, private repo update is done, moving to alert1001.wikimedia.org
[21:10:35] inflatador: one more comment for now. when I just added an icinga contact and ran puppet on the icinga server, I then also did this to really make sure nothing is broken: "[alert1001:~] $ sudo icinga -v /etc/icinga/icinga.cfg" that is not a restart but it's a syntax check of the config and it showed me "warnings/errors: 0".
it's possible for that to not be the case for reasons like "there is a
[21:10:41] group with that contact but the contact is not found" for example. and if undetected it would mean nothing happens but some surprise day in the future when it gets hard restarted.. suddenly icinga is down. so for that reason I run the extra check
[21:11:02] AH yes, I was going to ask about a syntax check but forgot
[21:11:23] it will output a bunch of stuff, incl. Running pre-flight check on configuration data...
[21:11:32] but it's all about the 2 lines at the bottom
[21:11:53] 'Things look okay - No serious problems were detected during the pre-flight check'
[21:13:01] first though, puppet must run there once, and that can take a while on icinga because exported resources are used
[21:14:29] how do you run puppet on the icinga server? I'm a puppet n00b (mostly used ansible previously)
[21:14:42] btw puppet, I have alias pa="sudo puppet agent -tv" in my .bash_profile so I literally just type "pa" all the time
[21:14:47] ^
[21:14:54] sudo puppet agent -tv
[21:15:12] OK, and then do the syntax check?
[21:15:17] you can puppetize your .bash_profile later and it will be the same on all remote hosts
[21:15:22] afaik you should be using `sudo run-puppet-agent` instead of that directly
[21:15:28] yes, watch puppet agent run
[21:15:38] and see icinga gets refreshed by puppet
[21:15:42] after it adds to the config
[21:15:50] and once it's finished, you check it
[21:16:07] ok, doing 'run-puppet-agent' in a tmux window now
[21:16:18] (with sudo)
[21:16:37] run-puppet-agent is mostly relevant if you do this via cumin
[21:16:54] but yea, it's good
[21:22:47] looks like my 'run-puppet-agent' didn't actually pick up my commit, running again...
[21:24:49] used 'puppet agent -tv' this time, same results. I can see my commit on top of the git log on the puppetmaster, any suggestions?
[21:25:35] I saw the email sent by your commit. seemed all normal.
hmm.. looking
[21:26:46] inflatador: so when you ran puppet it should say which change it is applying
[21:27:22] yeah, it says "Info: Applying configuration version '(cca1f8125a) Dzahn - icinga: let Jack Gleeson run commands for any host or service'"
[21:28:10] hmm, looking at the actually generated icinga config
[21:28:24] it's possible there have been changes due to this migrating over to alertmanager
[21:28:30] that I don't know about yet
[21:29:26] no worries, it actually looks like I am in the contacts.cfg on alert1001
[21:30:03] ah! ok
[21:30:08] I am running it too right now
[21:30:32] ok, so if you go back to the public repo then
[21:30:48] and that cgi.cfg file https://gerrit.wikimedia.org/r/c/operations/puppet/+/751980/1/modules/icinga/files/cgi.cfg
[21:30:59] add your new contact there now
[21:31:36] either use only exactly bking or Bking or add both :p
[21:31:45] or whatever you used in LDAP
[21:33:21] and yes, I can confirm you are in the contacts and the config is ok
[21:33:26] Probably do both, looks like uid is 'bking' and sn/cn is 'Bking'
[21:33:27] I just used lowercase everywhere which worked for me
[21:33:45] it happens to me all the time that I am logged in as Dzahn but only dzahn has the privs
[21:33:50] and there is no logout button :)
[21:34:00] but if you can stick to one.. just do that
[21:34:13] Oh. Maybe I'll just stick to lowercase then
[21:43:59] you can go to https://idp.wikimedia.org/login to log out if needed
[21:44:52] ah:) thx, the single-sign-on is still modern to me, hehe
[22:03:16] OK, my PR for icinga access is up if you are able to look it over! https://gerrit.wikimedia.org/r/c/operations/puppet/+/752005
[22:05:02] inflatador: +1 :)
[22:05:27] once again I am in your debt, good sir!
[22:07:17] hehe, no problem. btw, if I was suspicious who even is bking. there is "the other LDAP" as well. [ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=bking*"
[22:09:01] This is the first place I've worked where 'bking' wasn't taken.
I've actually worked at multiple places where I wasn't even the only 'Brian King'
[22:09:44] heh! sounds like big enterprises
[22:10:48] Oh, DEFINITELY... ;P
[22:12:50] just checked if you could technically have brian@ but sorry, taken by https://www.wikidata.org/wiki/Q14956410
[22:12:59] hehe
[22:16:42] Hey, I'm happy just not to be 'bking1' or whatever
[22:18:50] like an auto-generated reddit user name, incredible_bking_sre
[22:22:31] somewhere in /usr/share/dict/words, is the adjective for me...
[22:22:50] anyway, I guess the next step is puppet-merge from puppetmaster?
[22:23:07] heheh, nice
[22:23:36] yes, merge in gerrit, then puppet-merge on puppetmaster1001
[22:23:44] then run puppet on alert1001
[22:24:40] run config check, make sure you are logged in on Icinga web ui with the exact same user, try running a random command on a random host
[22:25:07] it can be downtime of a few minutes for example
[22:31:02] https://wikitech.wikimedia.org/wiki/Icinga#How_to_handle_active_alerts the commands in the Icinga UI are behind a drop-down menu you get when selecting a host or service
[22:31:55] OK, puppet stuff done, logging back into icinga web UI now
[22:35:56] ok, great. you can find your specific host. optionally break out of the frameset
[22:36:43] then you can do this: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=test and just replace "test" in the address bar
[22:39:54] I need to run for an errand and go afk for like half an hour or so
[22:41:39] No problema. I suppressed 'elastic2051,' no errors this time!
[22:41:50] perfect! then it works :)
[22:42:05] ok, great, bbiaw
[22:43:26] Heck yeah! thanks again
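
Note on the capitalization issue discussed above (logged in as Dzahn while only dzahn has the privileges): a minimal Python sketch of how one might check a login name against the `authorized_for_*` lists in a cgi.cfg-style file before trying a command in the web UI. The sample config text and the names in it are hypothetical and not the contents of the real modules/icinga/files/cgi.cfg; the helper names `parse_authorized` and `check_login` are made up for illustration. It only assumes the comma-separated `directive=user1,user2` layout and exact, case-sensitive matching as described in the conversation.

```python
# Hypothetical excerpt in the style of cgi.cfg; NOT the real file contents.
SAMPLE_CGI_CFG = """\
# global override
authorized_for_all_host_commands=dzahn,Jane Doe,bking
authorized_for_all_service_commands=dzahn,Jane Doe
"""


def parse_authorized(cfg_text):
    """Map each authorized_for_* directive to its set of user names."""
    table = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.startswith("authorized_for_"):
            # Names may contain spaces, so split on commas only.
            names = {n.strip() for n in value.split(",") if n.strip()}
            table.setdefault(key, set()).update(names)
    return table


def check_login(cfg_text, login):
    """Report 'ok', 'case-mismatch' or 'missing' per directive for login."""
    report = {}
    for key, names in parse_authorized(cfg_text).items():
        if login in names:
            report[key] = "ok"  # exact, case-sensitive match
        elif login.lower() in {n.lower() for n in names}:
            report[key] = "case-mismatch"  # differs only by capitalization
        else:
            report[key] = "missing"
    return report
```

Calling `check_login(SAMPLE_CGI_CFG, "Bking")` on the sample flags the host-commands list as a case mismatch, which is the situation where logging out and back in with the lowercase name (or adding both spellings) would be the fix suggested in the log.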