[07:27:56] hi! I could use a review/deploy for an AlertManager rule addition. Namely to send our team alerts to both IRC and email
[07:28:44] it is typically g.odog doing them but he is not around this week
[09:45:43] hashar: I'll happily look at it for you.
[09:59:09] btullis: hi! I forgot to link to the change, my bad: https://gerrit.wikimedia.org/r/c/operations/puppet/+/738381 :]
[10:00:06] hashar: No worries. I'll have a look now.
[10:16:32] hashar: Done. Hope that helps.
[10:16:45] btullis: OHH I completely missed that :]
[10:21:08] amended :]
[10:34:17] btullis: are you suggesting I should go with placeholder receivers for later? Aka releng-irc + releng-mail + releng-ircmail ?
[10:34:42] currently there is a single alarm passing through that system. But surely we will later extend and take full advantage of alertmanager
[10:37:39] hashar: Yes, I was just pointing out the flexibility if you want it. But I'm not aware of what your needs are in terms of alerts (coverage, plus level: warning, critical, page etc.) so it's totally up to you to decide whether you think you might need these multiple receivers in future.
[10:38:51] they are very limited, I merely moved a Graphite/Icinga alarm to Grafana/AlertManager
[10:39:12] it alerted on both irc and email. I first made it solely use irc to avoid spamming the team list
[10:39:15] and now adding the email
[10:39:29] but I guess I can copy the same pattern others used as you suggested. This way we are set for later
[10:39:35] No worries. If you currently have one alert and its state is on/off then perhaps one receiver is all you need for now. If you're planning to add more alerts then you might wish for the placeholders to be in place.
[10:39:57] 👍 All good.
[10:40:31] I really like the Phabricator task creation idea
[10:55:12] btullis: I think it is good enough as is. We only have one alarm right now and I am not sure what others we can use
[10:55:31] so we can keep it simple for now :]
[10:56:50] Cool. Would you like me to +2 and merge, or do you just need a +1?
[10:57:10] ema: it seems we're merging at the same time
[10:57:24] arturo: go ahead!
[10:57:28] Ema: prometheus:ops: add varnishmtail-internal jobs (8b2ed7a446)
[10:57:28] Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management (d5c45d4021)
[10:57:29] ok!
[10:57:53] btullis: I could use a +2 and deployment. I lack puppet-merge access :]
[10:57:55] merged
[10:57:58] arturo: ty :)
[10:58:07] ema: np
[10:58:15] hashar: Noted, thanks. Will do.
[10:58:20] \o/
[11:02:21] hashar: Done.
[11:02:34] awesome!
[11:05:01] maybe I can get our Jenkins jobs to emit some stuff to AlertManager
[11:21:12] hashar: Yes. There is also a way to get Grafana *itself* to send alerts to Alertmanager: https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts - but I think that this would only be viewed as an interim solution, until you can get the metrics that you need into Prometheus instead of Graphite.
[11:22:58] gehel, btullis, could you have a look at https://phabricator.wikimedia.org/T295118#7509294 ? we need to replace a switch sooner rather than later and some of your servers will be impacted
[11:23:43] XioNoX: Looking now.
[11:39:30] thanks!
[12:17:37] I have haproxy-mtail@tls.socket and haproxy-mtail@tls.service units defined and the socket has an implicit Before=haproxy-mtail@tls.service (can be seen with systemctl show ...), but if for some reason (puppet) the service starts before the socket, systemd refuses to start the socket unit
[12:17:52] would an explicit After=haproxy-mtail@tls.socket on the service unit fix that?
[12:20:59] vgutierrez: sounds reasonable
[12:23:49] I wonder if Wants= may also help here
[12:24:52] After= seems to be working
[12:51:52] btullis: I have followed that documentation to migrate the check_graphite Icinga monitoring ;)
[13:44:16] XioNoX: I commented on the ticket: there should be no issue on the elasticsearch side.
[13:44:36] gehel: cool, thanks!
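Note on the haproxy-mtail@tls exchange (12:17 to 12:24): After= only controls start order when both units are queued in the same transaction; it does not by itself pull the socket in. Wants= is the directive that adds the dependency. A minimal sketch of a drop-in that combines the two ideas raised above, assuming a template unit named haproxy-mtail@.service (the drop-in path and file name are hypothetical, not taken from the log):

```ini
# Hypothetical drop-in:
#   /etc/systemd/system/haproxy-mtail@.service.d/socket-order.conf
# The path and file name are assumptions for illustration only.
[Unit]
# Ordering only: when both units are started together, the socket goes first.
After=haproxy-mtail@%i.socket
# Dependency: starting the service also queues the matching socket unit.
Wants=haproxy-mtail@%i.socket
```

After a `systemctl daemon-reload`, the resulting dependencies can be inspected with `systemctl show -p After -p Wants haproxy-mtail@tls.service`.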
[13:44:49] cc: ryankemper ^^ (T295118)
[13:44:50] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118
[15:54:00] herron: moritzm: let us know when you are happy for me to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/605568/24
[15:55:01] let's go ahead. we can first keep puppet disabled on mx1001 and have it only apply to mx2001 for tests
[15:55:08] hey jbond, ready
[15:55:39] ack disabled puppet on mx1001 and merging changes now
[15:55:52] sgtm
[15:56:29] ok applying to mx2001 now
[15:57:57] this was the diff but currently got an error preventing exim reload
[15:57:59] https://phabricator.wikimedia.org/P17756
[15:58:07] Exim configuration error in line 282 of /etc/exim4/exim4 option "data" unknown
[15:59:08] hmm seems otrs_aliases_file is empty
[16:00:45] it's supposed to be /etc/exim4/otrs_emails
[16:01:55] it sets that in the middle of the class in profile::mail::mx, not as a parameter.. ehmm
[16:04:11] mutante: thanks https://gerrit.wikimedia.org/r/c/operations/puppet/+/739561
[16:04:23] however still getting an error with the data keyword
[16:06:51] oh, independent of the syntax error the template also needs updating! it's no longer mendelevium, but otrs1001.eqiad.wmnet
[16:07:54] moritzm: ack corrected that
[16:08:35] not really sure about the syntax error, maybe that's fallout from the empty "require_files = "
[16:08:49] so that it misparses the next entry or so?
[16:09:31] moritzm: no still got the same error
[16:09:42] it's require_files = CONFDIR/otrs_emails now .. hrmm
[16:09:47] weird error
[16:10:10] i think it may be that `data` is not valid with `driver = manualroute`
[16:11:00] that would explain why it doesn't already fail on line 230
[16:11:09] in the "eat" section..
[16:11:19] does it work with driver redirect?
[16:11:25] where it also uses data but with driver = redirect
[16:11:33] apparently yes, if it parses it from top to bottom
[16:12:09] herron: with that we get an error on option "route_list" unknown
[16:13:00] I'd say let's flip back the Hiera flag for mx2001 and revisit after digging some more in the docs? we can still validate that 605568 works as expected with profile::mail::mx::enable_ldap: true as an interim step
[16:13:21] ack I'll revert the hiera flip now
[16:13:31] hmm, yeah +1 let's test more
[16:16:58] ok changes reverted
[16:19:45] sent a test mail via mx2001 to my wikimedia.org address and that arrived fine
[16:20:02] doing a quick test via the OTRS test alias
[16:20:15] ack thanks
[16:21:11] jbond: is there a test host you have been using? wondering about switching 'data' to 'condition'
[16:21:59] sent a mail to otrs-test@w.o via mx2001
[16:22:12] herron: no i was live hacking on mx2001, however you can at the very least check if exim likes the syntax with exim -C exim.conf
[16:23:07] ok
[16:25:44] the otrs-test@ ticket seemed to have arrived fine
[16:34:12] jbond, herron: want to poke at this further or shall we re-enable puppet on mx1001?
[16:35:24] moritzm: oh sorry i had already re-enabled puppet on both MX's
[16:35:34] from my side i will likely take another look at it tomorrow
[16:36:42] ah ok :-) sounds good, I'll dig into this tomorrow as well
[16:37:16] ack thanks
[16:38:10] sounds good, I'm thinking a test host would be helpful for this. too bad mx2002 isn't around anymore
[16:41:51] I had the same thought, we probably should have just kept it after the bullseye update...
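Note on the Exim errors above: both messages fit the way Exim scopes router options by driver. `data` is an option of the redirect router and `route_list` is an option of the manualroute router, so each is reported as unknown under the other driver. A rough sketch of what the 'data' to 'condition' idea from 16:21 might look like follows. The router name, the lookup key and the transport name are assumptions made for illustration; this is not the content of the real template.

```
# Sketch only: router name, lookup key and transport are assumptions.
otrs_sketch:
  driver = manualroute
  require_files = CONFDIR/otrs_emails
  # Gate the router with a generic precondition instead of the
  # redirect-only "data" option.
  condition = ${lookup{$local_part@$domain}lsearch{CONFDIR/otrs_emails}{true}{false}}
  route_list = * otrs1001.eqiad.wmnet
  transport = remote_smtp
```

A candidate file can be syntax-checked without reloading the daemon by adding -bV to the command mentioned at 16:22, for example `exim -C /path/to/candidate.conf -bV` (the path is a placeholder). This catches unknown options like the ones above, but it does not exercise lookups or routing.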
[16:43:32] I'm spinning up a bullseye deployment-mx03 now, hopefully that's not too horribly painful to get running
[16:47:27] ack, it's probably just fine for cloud vps, given that we mostly need to figure out the syntax and not full blown routing
[16:49:21] herron: note that deployment-mx02 has been failing puppet for a while, so the bullseye one might do that too
[16:50:20] majavah: thanks yeah hopefully can get that sorted and up and running here pretty soon
[16:51:36] deployment-mx'es aren't really doing anything either, since someone broke the wmcs base config a while ago and now mediawiki mail is routed like any other mail
[17:37:57] majavah: deployment-mx03 is running now, shall I terminate deployment-mx02?
[18:35:11] herron: I see a few uses in https://codesearch.wmcloud.org/search/?q=deployment-mx02&i=nope&files=&excludeFiles=&repos= that might need to be updated, but I'm not very familiar with it