[09:06:40] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[09:11:40] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[09:18:50] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:23:50] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[10:12:25] (SystemdUnitFailed) firing: thanos-query.service on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:30] sigh, that was thanos-query on titan1001 consuming a lot of CPU, restarted
[10:17:25] (SystemdUnitFailed) resolved: thanos-query.service on titan1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:36:48] (PuppetFailure) firing: Puppet has failed on prometheus2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:41:48] (PuppetFailure) firing: (4) Puppet has failed on prometheus1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:28:58] ^ taking a look.
[16:40:15] I think this change broke Puppet on Prometheus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018255
[16:41:50] Relevant output: https://phabricator.wikimedia.org/P60271
[16:45:20] ack
[16:47:46] I think I spotted the issue with the patch: the label matcher {job=~"cache_.*"} is in the wrong place.
[16:48:57] * denisse working on a patch.
[16:51:22] cc mutante
[16:51:32] herron: we are talking about it in -sre
[16:51:47] (I am not involved, just passing the message :)
[16:51:49] ha! classic thanks
[17:10:18] I sent a patch to fix the issue, it's resolved now. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018745
[17:57:18] (PuppetFailure) resolved: (2) Puppet has failed on prometheus2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
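
A minimal sketch of the misplacement described at 16:47:46, with a hypothetical metric name since the offending rule isn't quoted in the log: in PromQL, a label matcher belongs inside the braces of the vector selector itself, not after the range selector or the enclosing function. The corrected rules file can be validated locally with promtool before pushing.

  # broken: the matcher trails the range selector, which PromQL rejects:
  #   rate(node_network_receive_bytes_total[5m]){job=~"cache_.*"}
  # fixed: the matcher sits inside the selector's braces:
  #   rate(node_network_receive_bytes_total{job=~"cache_.*"}[5m])
  promtool check rules rules_ops.yml
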
[17:58:02] Hey team, I think I broke Prometheus in PoP during the cfssl migration, my apologies for that.
[17:58:02] I sent a patch; I'd greatly appreciate it if you could take a look ASAP: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018749
[17:58:10] ^ cwhite herron
[17:59:23] denisse: looking
[17:59:24] That patch should create the TLS certificates required to get Puppet working again. I'm not sure if we need to specify the port in the cfssl_options.
[17:59:26] Thanks!
[18:00:12] have you run PCC for this already?
[18:00:56] I tried, but the job said it failed without any other explanation. :( https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1853/
[18:03:31] herron: I saw you sent a PCC job for that, thank you. Looking at your job, the parameters for mine were incorrect.
[18:03:53] ahh yeah running that now
[18:08:36] The build failed, but I think it doesn't provide useful information on the issue...
[18:13:12] denisse: should be useful I think, looks like a syntax issue in the patch. For example, compare to profile::tlsproxy::envoy::cfssl_options in hieradata/role/common/prometheus.yaml. I'll comment on the patch
[18:15:03] Thank you.
[18:27:47] I sent a new PCC job and it failed again. https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1856/console
[18:28:05] It's very strange that it doesn't say why it failed, I recall PCC used to do that.
[18:28:26] how did you submit that one?
[18:29:03] After submitting my new patch, I triggered a build for that gerrit change and used re:prometheus.*.wmnet
[18:29:35] using utils/pcc?
[18:29:46] No, I did it from the WebUI.
[18:29:59] maybe it's the gerrit change number vs the gerrit private change number?
[18:30:37] utils/pcc is nice though, it's in the puppet repo, just need to set up the jenkins creds once
[18:31:29] Thanks, sending one with utils/pcc
[18:31:34] https://puppet-compiler.wmflabs.org/output/1018749/1857/ just finished
[18:31:57] as all noops
[18:32:29] I think because we don't need profile::tlsproxy::envoy::services: any longer
[18:32:30] Thanks Keith, I see a list of hosts that have failed to compile completely. Do you know if that's expected?
[18:32:53] those are from stale facts afaik
[18:33:32] as in the hosts are not around any more in production
[18:33:52] I think we need it, moritzm suggested taking a look at the hieradata/role/common/piwik.yaml config as an example for it.
[18:34:12] (the profile::tlsproxy::envoy::services).
[18:35:20] I see it's also used in chartmuseum but both have - server_names: ['*'].
[18:37:33] huh interesting, it's not the case looking at prometheus.yaml as an example https://gerrit.wikimedia.org/r/c/operations/puppet/+/930187/4/hieradata/role/common/prometheus.yaml
[18:38:44] That's interesting indeed. QQ regarding that patch, do you know why the 'profile::tlsproxy::envoy::sni_support:' key was removed?
[18:39:57] ha! an excellent advertisement for thorough commit messages
[18:40:27] from memory we didn't need it set as strict, the default was fine. but it's fuzzy
[18:42:31] Thanks, I've sent a new build: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1858/console
[18:48:44] It only fails on decommissioned hosts. https://puppet-compiler.wmflabs.org/output/1018749/1858/
[18:50:42] But I see this on change errors: Warning: Unknown variable: 'service_modules'. (file: /srv/jenkins/puppet-compiler/1858/change/src/modules/prometheus/manifests/blackbox/modules/service_catalog.pp, line: 55, column: 15)
[18:52:03] Though I think it's unrelated, as the job says Compilation results for prometheus3003.esams.wmnet: No change
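
A rough sketch of the utils/pcc flow recommended above; the exact invocation is an assumption (check the script's help in the puppet repo), with the change number and host selector taken from the chat.

  # from a checkout of operations/puppet, after setting up the Jenkins API creds once:
  ./utils/pcc 1018749 're:prometheus.*.wmnet'
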
[19:01:55] hmm considering noops I'm wondering if we need this at all for the PoP?
[19:03:44] since cfssl certs are already present for envoy on the PoP hosts
[19:04:15] I think we do, because Puppet is failing on the PoP hosts: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssl/prometheus.wikimedia.org.key (file: /srv/puppet_code/environments/production/modules/sslcert/manifests/certificate.pp, line: 108, column: 26) (file: /srv/puppet_code/environments/production/modules/profile/manifests/tlsproxy/envoy.pp, line: 110) on node prometheus3003.esams.wmnet
[19:06:52] hah right, please disregard my last message, I checked prom2006 instead of prom6002
[19:11:49] No worries, I also confused the prom hosts with the prom on PoP hosts.
[19:11:56] And that's how I caused this issue. 🙈
[21:29:45] hey, do you need help to get the cert back?
[21:29:58] I can recreate it in the private repo as it was before
[21:30:03] just do NOT use git revert
[21:30:58] could be more obvious which hosts exist if it wasn't for the regexes in site.pp I guess
[21:35:16] I am re-creating the private key
[21:43:14] re-creating it in the _other location_ :p
[21:43:36] puppet on prometheus6002 is fixed now
[21:44:34] which applies the rules_ops.yml file changes...
[21:45:19] next we need the cert back too or envoy restart fails. the good part is it doesn't try to restart it with a broken config, so it's still up
[21:55:59] recreated the key, pubkey, csr and cert that were in the non-standard location before
[22:03:15] Thanks a lot mutante!
[22:13:40] well, yw, but it's still not working yet
[22:14:34] we can either run cergen to recreate it all with the normal workflow or try to move forward to cfssl
[22:15:30] Can we move forward to cfssl without the chained crt?
[22:21:27] we should be able to, but there must be some issue with the patch for that
[22:22:00] Yes, PCC shows there's a NOOP. https://puppet-compiler.wmflabs.org/output/1018749/1863/
[22:22:51] denisse: so prometheus has this:
[22:23:12] profile:envoy::ensure: present
[22:23:18] I think the error may have to do with this: modules/prometheus/manifests/blackbox/modules/service_catalog.pp
[22:23:30] compare this to:
[22:23:32] profile::tlsproxy::envoy::ensure: present
[22:23:35] which others use
[22:23:52] Interesting, let me see...
[22:24:02] also prometheus had a cert in a location where not a single other service had a crt file
[22:39:33] fixed it now!
[22:39:45] puppet no errors on prometheus6002
[22:39:47] Thanks a lot for restoring the certs mutante!
[22:40:02] so.. you can't use git revert in the private repo
[22:40:10] so what I did was just look at git log
[22:40:20] and copy/paste certs/keys from that into a new file and a new commit
[22:40:29] but when I did that.. because it's a diff view
[22:40:38] there was one extra - at the beginning of every line
[22:40:41] because it removed them
[22:40:56] so the second fix was: 0,$s/^-//g in vi
[22:41:07] to remove the extra - from every line
[22:41:42] Thanks a lot!
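
A sketch of the recovery workflow mutante describes, avoiding git revert; the secret's path is illustrative, and <commit> stands for whichever commit deleted the file.

  # find the commit that removed the key, then read the pre-deletion blob directly:
  git log --diff-filter=D --oneline -- secrets/ssl/prometheus.wikimedia.org.key
  git show '<commit>^:secrets/ssl/prometheus.wikimedia.org.key' > restored.key
  # alternatively, paste the removed lines out of `git log -p` and strip the
  # leading diff "-" from every line in vi, e.g. :%s/^-// (the 0,$s/^-//g
  # quoted above is equivalent).
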