[07:19:43] <arnaudb>	 good morning o11y :) you should have received a share notification of an action plan coming from data-persistence, feel free to ping me if there is something weird about it!
[15:27:06] <mutante>	 https://phabricator.wikimedia.org/P61217
[15:27:21] <mutante>	 ^ this compares existing envoy config between a main prometheus and a POP prometheus.
[15:27:50] <mutante>	 one has 'non-SNI support' and the other does not. I think this somehow explains the issue with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023917
[15:28:15] <mutante>	 but not sure about all the details yet. just pointing out POP is different before the cert provider change.
[15:41:29] <herron>	 IMO a good next step is to set pop to match the envoy/cfssl hiera configs from main exactly
[15:41:41] <herron>	 pop has profile::tlsproxy::envoy::sni_support: 'strict' set in hiera
[15:42:11] <lmata>	 Thanks arnaudb, we'll take a look
[15:42:32] <arnaudb>	 thanks!
[16:01:43] <mutante>	 herron: agreed, sounds good to do that first
[16:36:04] <denisse>	 mutante herron: good idea, I'll make the PoP configs match the configs from main and test it on a PoP host. :)
[16:39:10] <mutante>	 :)
[16:49:48] <denisse>	 Hi herron , I have a question regarding this patch. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019066
[16:49:48] <denisse>	 Do you know why this line is added? - "%{facts.networking.fqdn}"
[16:50:45] <herron>	 denisse: that's to include the system hostname in the cert/config
[16:51:19] <denisse>	 herron: Thanks, sorry if this is a dumb question but why do we need to add it to the cert/config?
[16:56:26] <herron>	 denisse: in case we need it, basically. We introduced it when switching the eqiad/codfw proms over to cfssl, since the dynamic approach is now supported
[16:57:58] <denisse>	 Thanks! :)
[16:58:16] <herron>	 np!
[17:08:06] <mutante>	 maybe let's use this moment to test if it works without them
[17:08:22] <mutante>	 we removed a bunch of names like that on other services to simplify
[17:08:37] <mutante>	 though it's already nice that it's not hardcoded host names that way :)
[18:20:25] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_vopsbot.service on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:21:55] <denisse>	 ^ Taking a look.
[18:23:28] <denisse>	 I think that may be the result of a configuration error as the main alert host is alert1001.
[18:25:44] <herron>	 mutante it is there intentionally.  it's useful for looking at a specific server e.g. https://prometheus1005.eqiad.wmnet/ops/ working vs https://prometheus6002.drmrs.wmnet/ops/ not yet, and we do use these for the main sites
[19:16:00] <denisse>	 Regarding the alert for the 'wmf_auto_restart_vopsbot.service' unit failing on alert2001. I think this may be the result of a configuration error because the main alert host is alert1001. I think that clearing the unit from the list of failed units would resolve the alert, what do you think?
[19:16:57] <denisse>	 I didn't notice any anomalies on the systemd logs regarding that alert, I think it's expected that it's not running on the passive host...
[19:22:41] <moritzm>	 I'll fix the vopsbot config tomorrow. I missed that while the class gets applied to both alert hosts, it's only running on one of them, and the wmf-auto-restart config needs to be adapted for that
[19:30:54] <denisse>	 moritzm: Thank you! 
[22:47:58] <mutante>	 herron: gotcha! ack