[09:06:40] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[09:11:40] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[09:18:50] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:23:50] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:12:25] (SystemdUnitFailed) firing: thanos-query.service on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:30] sigh, that was thanos-query on titan1001 consuming a lot of CPU, restarted
[10:17:25] (SystemdUnitFailed) resolved: thanos-query.service on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:36:48] (PuppetFailure) firing: Puppet has failed on prometheus2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:41:48] (PuppetFailure) firing: (4) Puppet has failed on prometheus1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:28:58] ^ taking a look.
[16:40:15] I think this change broke Puppet on Prometheus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018255
[16:41:50] Relevant output: https://phabricator.wikimedia.org/P60271
[16:45:20] ack
[16:47:46] I think I spotted the issue with the patch: the label matcher {job=~"cache_.*"} is in the wrong place.
[16:48:57] * denisse working on a patch.
[16:51:22] cc mutante
[16:51:32] herron: we are talking about it in -sre
[16:51:47] (I am not involved, just passing the message :)
[16:51:49] ha! classic thanks
[17:10:18] I sent a patch to fix the issue, it's resolved now. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018745
[17:57:18] (PuppetFailure) resolved: (2) Puppet has failed on prometheus2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
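
A minimal sketch of the misplacement described at 16:47:46, with a hypothetical metric name since the offending rule isn't quoted in the log: in PromQL, a label matcher belongs inside the braces of the vector selector itself, not after the range selector or the enclosing function. The corrected rules file can be validated locally with promtool before pushing.

  # broken: the matcher trails the range selector, which PromQL rejects:
  #   rate(node_network_receive_bytes_total[5m]){job=~"cache_.*"}
  # fixed: the matcher sits inside the selector's braces:
  #   rate(node_network_receive_bytes_total{job=~"cache_.*"}[5m])
  promtool check rules rules_ops.yml
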
[17:58:02] Hey team, I think I broke Prometheus in PoP during the cfssl migration, my apologies for that.
[17:58:02] I sent a patch; I'd greatly appreciate it if you could take a look ASAP: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018749
[17:58:10] ^ cwhite herron
[17:59:23] denisse: looking
[17:59:24] That patch should create the TLS certificates required to get Puppet working again. I'm not sure if we need to specify the port in the cfssl_options.
[17:59:26] Thanks!
[18:00:12] have you run PCC for this already?
[18:00:56] I tried, but the job said it failed without any other explanation. :( https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1853/
[18:03:31] herron: I saw you sent a PCC job for that, thank you. Looking at your job, the parameters for mine were incorrect.
[18:03:53] ahh yeah running that now
[18:08:36] The build failed, but I think it doesn't provide useful information on the issue...
[18:13:12] denisse: should be useful I think, looks like a syntax issue in the patch. For example, compare to profile::tlsproxy::envoy::cfssl_options in hieradata/role/common/prometheus.yaml. I'll comment on the patch
[18:15:03] Thank you.
[18:27:47] I sent a new PCC job and it failed again. https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1856/console
[18:28:05] It's very strange that it doesn't say why it failed, I recall PCC used to do that.
[18:28:26] how did you submit that one?
[18:29:03] After submitting my new patch, I triggered a build for that gerrit change and used re:prometheus.*.wmnet
[18:29:35] using utils/pcc?
[18:29:46] No, I did it from the WebUI.
[18:29:59] maybe it's the gerrit change number vs the gerrit private change number?
[18:30:37] utils/pcc is nice though, it's in the puppet repo, just need to set up the jenkins creds once
[18:31:29] Thanks, sending one with utils/pcc
[18:31:34] https://puppet-compiler.wmflabs.org/output/1018749/1857/ just finished
[18:31:57] as all noops
[18:32:29] I think because we don't need profile::tlsproxy::envoy::services: any longer
[18:32:30] Thanks Keith, I see a list of hosts that have failed to compile completely. Do you know if that's expected?
[18:32:53] those are from stale facts afaik
[18:33:32] as in the hosts are not around any more in production
[18:33:52] I think we need it, moritzm suggested taking a look at the hieradata/role/common/piwik.yaml config as an example for it.
[18:34:12] (the profile::tlsproxy::envoy::services).
[18:35:20] I see it's also used in chartmuseum but both have - server_names: ['*'].
[18:37:33] huh interesting, it's not the case looking at prometheus.yaml as an example https://gerrit.wikimedia.org/r/c/operations/puppet/+/930187/4/hieradata/role/common/prometheus.yaml
[18:38:44] That's interesting indeed. QQ regarding that patch, do you know why the 'profile::tlsproxy::envoy::sni_support:' key was removed?
[18:39:57] ha! an excellent advertisement for thorough commit messages
[18:40:27] from memory we didn't need it set as strict, the default was fine. but it's fuzzy
[18:42:31] Thanks, I've sent a new build: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1858/console
[18:48:44] It only fails on decommissioned hosts. https://puppet-compiler.wmflabs.org/output/1018749/1858/
[18:50:42] But I see this on change errors: Warning: Unknown variable: 'service_modules'. (file: /srv/jenkins/puppet-compiler/1858/change/src/modules/prometheus/manifests/blackbox/modules/service_catalog.pp, line: 55, column: 15)
[18:52:03] Though I think it's unrelated, as the job says Compilation results for prometheus3003.esams.wmnet: No change
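
A rough sketch of the utils/pcc flow recommended above; the exact invocation is an assumption (check the script's help in the puppet repo), with the change number and host selector taken from the chat.

  # from a checkout of operations/puppet, after setting up the Jenkins API creds once:
  ./utils/pcc 1018749 're:prometheus.*.wmnet'
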
[19:01:55] hmm considering noops I'm wondering if we need this at all for the PoP?
[19:03:44] since cfssl certs are already present for envoy on the PoP hosts
[19:04:15] I think we do, because Puppet is failing on the PoP hosts: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/prometheus.wikimedia.org.key (file: /srv/puppet_code/environments/production/modules/sslcert/manifests/certificate.pp, line: 108, column: 26) (file: /srv/puppet_code/environments/production/modules/profile/manifests/tlsproxy/envoy.pp, line: 110) on node prometheus3003.esams.wmnet
[19:06:52] hah right, please disregard my last message, I checked prom2006 instead of prom6002
[19:11:49] No worries, I also confused the prom hosts with the prom on PoP hosts.
[19:11:56] And that's how I caused this issue. 🙈
[21:29:45] hey, do you need help to get the cert back?
[21:29:58] I can recreate it in the private repo as it was before
[21:30:03] just do NOT use git revert
[21:30:58] could be more obvious which hosts exist if it wasn't for the regexes in site.pp I guess
[21:35:16] I am re-creating the private key
[21:43:14] re-creating it in the _other location_ :p
[21:43:36] puppet on prometheus6002 is fixed now
[21:44:34] which applies the rules_ops.yml file changes...
[21:45:19] next we need the cert back too or envoy restart fails. the good part is it doesn't try to restart it with a broken config, so it's still up
[21:55:59] recreated the key, pubkey, csr and cert that were in the non-standard location before
[22:03:15] Thanks a lot mutante!
[22:13:40] well, yw, but it's still not working yet
[22:14:34] we can either run cergen to recreate it all with the normal workflow or try to move forward to cfssl
[22:15:30] Can we move forward to cfssl without the chained crt?
[22:21:27] we should be able to, but there must be some issue with the patch for that
[22:22:00] Yes, PCC shows there's a NOOP. https://puppet-compiler.wmflabs.org/output/1018749/1863/
[22:22:51] denisse: so prometheus has this:
[22:23:12] profile:envoy::ensure: present
[22:23:18] I think the error may have to do with this: modules/prometheus/manifests/blackbox/modules/service_catalog.pp
[22:23:30] compare this to:
[22:23:32] profile::tlsproxy::envoy::ensure: present
[22:23:35] which others use
[22:23:52] Interesting, let me see...
[22:24:02] also prometheus had a cert in a location where not a single other service had a crt file
[22:39:33] fixed it now!
[22:39:45] puppet no errors on prometheus6002
[22:39:47] Thanks a lot for restoring the certs mutante!
[22:40:02] so.. you can't use git revert in the private repo
[22:40:10] so what I did was just look at git log
[22:40:20] and copy/paste certs/keys from that into a new file and a new commit
[22:40:29] but when I did that.. because it's a diff view
[22:40:38] there was one extra - at the beginning of every line
[22:40:41] because it removed them
[22:40:56] so the second fix was: 0,$s/^-//g in vi
[22:41:07] to remove the extra - from every line
[22:41:42] Thanks a lot!
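
A sketch of the recovery workflow mutante describes, avoiding git revert; the secret's path is illustrative, and <commit> stands for whichever commit deleted the file.

  # find the commit that removed the key, then read the pre-deletion blob directly:
  git log --diff-filter=D --oneline -- secrets/ssl/prometheus.wikimedia.org.key
  git show '<commit>^:secrets/ssl/prometheus.wikimedia.org.key' > restored.key
  # alternatively, paste the removed lines out of `git log -p` and strip the
  # leading diff "-" from every line in vi, e.g. :%s/^-// (the 0,$s/^-//g
  # quoted above is equivalent).
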