[07:19:43] good morning o11y :) you should have received a share notification of an action plan coming from data-persistence, feel free to ping me if there is something weird about it!
[15:27:06] https://phabricator.wikimedia.org/P61217
[15:27:21] ^ this compares existing envoy config between a main prometheus and a POP prometheus.
[15:27:50] one has 'non-SNI support' and the other does not. I think this somehow explains the issue with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023917
[15:28:15] but not sure about all the details yet. just pointing out POP is different before the cert provider change.
[15:41:29] IMO a good next step is to set pop to match the envoy/cfssl hiera configs from main exactly
[15:41:41] pop has profile::tlsproxy::envoy::sni_support: 'strict' set in hiera
[15:42:11] Thanks arnaudb we’ll take q look
[15:42:23] s/q/a/
[15:42:32] thanks!
[16:01:43] herron: agreed, sounds good to do that first
[16:36:04] mutante herron: good idea, I'll make the pop configs match the configs from main and test it on a PoP host. :)
[16:39:10] :)
[16:49:48] Hi herron, I have a question regarding this patch. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019066
[16:49:48] Do you know why this line is added? - "%{facts.networking.fqdn}"
[16:50:45] denisse: that's to include the system hostname in the cert/config
[16:51:19] herron: Thanks, sorry if this is a dummy question but why do we need to add it to the cert/config?
[16:56:26] denisse: in case we need it basically, we introduced it when switching eqiad/codfw proms over to cfssl since the dynamic approach is now supported
[16:57:58] Thanks! :)
[16:58:16] np!
[17:08:06] maybe let's use this moment to test if it works without them
[17:08:22] we removed a bunch of names like that on other services to simplify
[17:08:37] though it's already nice that it's not hardcoded host names that way :)
[18:20:25] (SystemdUnitFailed) firing: wmf_auto_restart_vopsbot.service on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:21:55] ^ Taking a look.
[18:23:28] I think that may be the result of a configuration error as the main alert host is alert1001.
[18:25:44] mutante it is there intentionally. it's useful for looking at a specific server e.g. https://prometheus1005.eqiad.wmnet/ops/ working vs https://prometheus6002.drmrs.wmnet/ops/ not yet, and we do use these for the main sites
[19:16:00] Regarding the alert for the 'wmf_auto_restart_vopsbot.service' unit failing on alert2001. I think this may be the result of a configuration error because the main alert host is alert1001. I think that clearing the unit from the list of failed units would resolve the alert, what do you think?
[19:16:57] I didn't notice any anomalies in the systemd logs regarding that alert, I think it's expected that it's not running on the passive host...
[19:22:41] I'll fix the vopsbot config tomorrow, I missed that while the class gets applied to both alert hosts, it's only running on one of them and the wmf-auto-restart config needs to be adapted for that
[19:30:54] moritzm: Thank you!
[22:47:58] herron: gotcha! ack