[07:27:56] <hashar>	 hi!  I could use a review/deploy for an AlertManager rule addition. Namely to send our team alerts to both IRC and email
[07:28:44] <hashar>	 it is typically g.odog doing them but he is not around this week
[09:45:43] <btullis>	 hashar: I'll happily look at it for you.
[09:59:09] <hashar>	 btullis: hi!  I forgot to link to the change, my bad:  https://gerrit.wikimedia.org/r/c/operations/puppet/+/738381 :]
[10:00:06] <btullis>	 hashar: No worries. I'll have a look now.
[10:16:32] <btullis>	 hashar: Done. Hope that helps.
[10:16:45] <hashar>	 btullis: OHH I completely missed that :]
[10:21:08] <hashar>	  amended :]
[10:34:17] <hashar>	 btullis: are you suggesting I should go with placeholder receivers for later?  Aka  releng-irc + releng-mail + releng-ircmail ?
[10:34:42] <hashar>	 currently there is a single alarm passing through that system. But surely we will later extend and take full advantage of alertmanager
[10:37:39] <btullis>	 hashar: Yes, I was just pointing out the flexibility if you want it. But I'm not aware of what your needs are in terms of alerts (coverage, plus level: warning, critical, page etc.) so it's totally up to you to decide whether you think you might need these multiple receivers in future.
[10:38:51] <hashar>	 they are very limited, I merely moved a Graphite/Icinga alarm to Grafana/AlertManager 
[10:39:12] <hashar>	 it alerted on both irc and email. I first made it to solely use irc to avoid spamming the team list
[10:39:15] <hashar>	 and now adding the email 
[10:39:29] <hashar>	 but I guess I can copy the same pattern others used as you suggested. This way we are set for later
[10:39:35] <btullis>	 No worries. If you currently have one alert and its state is on/off then perhaps one receiver is all you need for now. If you're planning to add more alerts then you might wish for the placeholders to be in place.
[10:39:57] <btullis>	 👍 All good.
[10:40:31] <hashar>	 I really like the Phabricator task creation idea
[10:55:12] <hashar>	 btullis: I think it is good enough as is. We only have one alarm right now and I am not sure what others we could use
[10:55:31] <hashar>	 so we can keep it simple for now :]
[10:56:50] <btullis>	 Cool. Would you like me to +2 and merge, or do you just need a +1?
[10:57:10] <arturo>	 ema: it seems we're merging at the same time
[10:57:24] <ema>	 arturo: go ahead!
[10:57:28] <arturo>	 Ema: prometheus:ops: add varnishmtail-internal jobs (8b2ed7a446)
[10:57:28] <arturo>	 Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management (d5c45d4021)
[10:57:29] <arturo>	 ok!
[10:57:53] <hashar>	 btullis: I could use a +2 and deployment. I lack puppet-merge access :]
[10:57:55] <arturo>	 merged
[10:57:58] <ema>	 arturo: ty :)
[10:58:07] <arturo>	 ema: np
[10:58:15] <btullis>	 hashar: Noted, thanks. Will do.
[10:58:20] <hashar>	 \o/
[11:02:21] <btullis>	 hashar: Done.
[11:02:34] <hashar>	 awesome!
[11:05:01] <hashar>	 maybe I can get our Jenkins jobs to emit some stuff to AlertManager
[11:21:12] <btullis>	 hashar: Yes. There is also a way to get Grafana *itself* to send alerts to Alertmanager: https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts - but I think that this would only be viewed as an interim solution, until you can get the metrics that you need into Prometheus instead of Graphite.
[11:22:58] <XioNoX>	 gehel, btullis, could you have a look at https://phabricator.wikimedia.org/T295118#7509294 ? we need to replace a switch sooner rather than later and some of your servers will be impacted
[11:23:43] <btullis>	 XioNoX: Looking now.
[11:39:30] <XioNoX>	 thanks!
[12:17:37] <vgutierrez>	 I got haproxy-mtail@tls.socket and haproxy-mtail@tls.service units defined and the socket has an implicit Before=haproxy-mtail@tls.service (can be seen with systemctl show ...), but if for some reason (puppet) the service starts before the socket, systemd refuses to start the socket unit
[12:17:52] <vgutierrez>	 would an explicit After=haproxy-mtail@tls.socket on the service unit fix that?
[12:20:59] <arturo>	 vgutierrez: sounds reasonable
[12:23:49] <arturo>	 I wonder if Wants= may also help here
[12:24:52] <vgutierrez>	 After= seems to be working
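For reference, the ordering discussed above could be written as a drop-in for the templated service. This is a minimal sketch, not the change that was actually deployed: the drop-in path and file name are assumptions, and Wants= is included only because it was floated in the discussion.

```ini
# /etc/systemd/system/haproxy-mtail@.service.d/ordering.conf  (hypothetical path)
[Unit]
# Order this service instance after its matching socket unit.
# %i expands to the instance name, e.g. "tls" for haproxy-mtail@tls.service.
After=haproxy-mtail@%i.socket
# Optional, as suggested in the chat: also pull the socket in whenever
# the service is started, so the ordering above has something to act on.
Wants=haproxy-mtail@%i.socket
```

After adding a drop-in, systemctl daemon-reload is needed; systemctl show haproxy-mtail@tls.service -p After -p Wants can then confirm which dependencies are in effect.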
[12:51:52] <hashar>	 btullis: I have followed that documentation to migrate the check_graphite Icinga monitoring ;)
[13:44:16] <gehel>	 XioNoX: I commented on the ticket: there should be no issue on the elasticsearch side.
[13:44:36] <XioNoX>	 gehel: cool, thanks!
[13:44:49] <gehel>	 cc: ryankemper ^^ (T295118)
[13:44:50] <stashbot>	 T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118
[15:54:00] <jbond>	 herron: moritzm: let us know when you are happy for me to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/605568/24 
[15:55:01] <moritzm>	 let's go ahead. we can first keep puppet disabled on mx1001 and have it only apply to mx2001 for tests
[15:55:08] <herron>	 hey jbond, ready
[15:55:39] <jbond>	 ack disabled puppet on mx1001 and merging changes now
[15:55:52] <moritzm>	 sgtm
[15:56:29] <jbond>	 ok applying to mx2001 now
[15:57:57] <jbond>	 this was the diff but currently got an error preventing exim reload
[15:57:59] <jbond>	 https://phabricator.wikimedia.org/P17756
[15:58:07] <jbond>	 Exim configuration error in line 282 of /etc/exim4/exim4 option "data" unknown
[15:59:08] <jbond>	 hmm seems otrs_aliases_file is empty
[16:00:45] <mutante>	 it's supposed to be /etc/exim4/otrs_emails
[16:01:55] <mutante>	 it sets that in the middle of the class in profile::mail::mx  not a parameter.. ehmm
[16:04:11] <jbond>	 mutante: thanks https://gerrit.wikimedia.org/r/c/operations/puppet/+/739561
[16:04:23] <jbond>	 however still getting an error with the data keyword
[16:06:51] <moritzm>	 oh, independent of the syntax error the template also needs updating! it's no longer mendelevium, but otrs1001.eqiad.wmnet
[16:07:54] <jbond>	 moritzm: ack corrected that
[16:08:35] <moritzm>	 not really sure about the syntax error, maybe that's fallout from the empty "require_files = "
[16:08:49] <moritzm>	 so that it misparses the next entry or so?
[16:09:31] <jbond>	 moritzm: no still got the same error
[16:09:42] <mutante>	 it's require_files = CONFDIR/otrs_emails now .. hrmm
[16:09:47] <mutante>	 weird error
[16:10:10] <jbond>	 i think it may be that `data` is not valid with `driver = manualroute`
[16:11:00] <mutante>	 that would explain why it doesn't already fail on line 230
[16:11:09] <mutante>	 in the "eat" section.. 
[16:11:19] <herron>	 does it work with driver redirect?
[16:11:25] <mutante>	 where it also uses data but with driver = redirect
[16:11:33] <mutante>	 apparently yes, if it parses it from top to bottom
[16:12:09] <jbond>	 herron: with that we get an error on option "route_list" unknown
[16:13:00] <moritzm>	 I'd say let's flip back the Hiera flag for mx2001 and revisit after digging some more in the docs? we can still validate that  605568 works as expected with profile::mail::mx::enable_ldap: true as an interim step
[16:13:21] <jbond>	 ack I'll revert the hiera flip now
[16:13:31] <herron>	 hmm, yeah +1 let's test more
[16:16:58] <jbond>	 ok change reverted
[16:19:45] <moritzm>	 sent a test mail via mx2001 to my wikimedia.org address and that arrived fine
[16:20:02] <moritzm>	 doing a quick test via the OTRS test alias
[16:20:15] <jbond>	 ack thanks
[16:21:11] <herron>	 jbond: is there a test host you have been using?  wondering about switching 'data' to 'condition'
[16:21:59] <moritzm>	 sent a mail to otrs-test@w.o via mx2001
[16:22:12] <jbond>	 herron: no i was live hacking on mx2001, however you can at the very least check if exim likes the syntax with exim -C exim.conf
[16:23:07] <herron>	 ok
[16:25:44] <moritzm>	 the otrs-test@ ticket seemed to have arrived fine
[16:34:12] <moritzm>	 jbond, herron: want to poke at this further or shall we re-enable puppet on mx1001?
[16:35:24] <jbond>	 moritzm: oh sorry I had already re-enabled puppet on both MXes
[16:35:34] <jbond>	 from my side I will likely take another look at it tomorrow
[16:36:42] <moritzm>	 ah ok :-) sounds good, I'll dig into this tomorrow as well
[16:37:16] <jbond>	 ack thanks
[16:38:10] <herron>	 sounds good, I'm thinking a test host would be helpful for this.  too bad mx2002 isn't around anymore
[16:41:51] <moritzm>	 I had the same thought, we probably should have just kept it after the bullseye update...
[16:43:32] <herron>	 I'm spinning up a bullseye deployment-mx03 now, hopefully that's not too horribly painful to get running
[16:47:27] <moritzm>	 ack, it's probably just fine for cloud vps, given that we mostly need to figure out the syntax and not full blown routing
[16:49:21] <majavah>	 herron: note that deployment-mx02 has been failing puppet for a while, so the bullseye one might do that too
[16:50:20] <herron>	 majavah: thanks yeah hopefully can get that sorted and up and running here pretty soon
[16:51:36] <majavah>	 deployment-mx'es aren't really doing anything either, since someone broke the wmcs base config a while ago and now mediawiki mail is routed like any other mail
[17:37:57] <herron>	 majavah: deployment-mx03 is running now, shall I terminate deployment-mx02?
[18:35:11] <majavah>	 herron: I see a few uses in https://codesearch.wmcloud.org/search/?q=deployment-mx02&i=nope&files=&excludeFiles=&repos= that might need to be updated, but I'm not very familiar with it