[12:25:55] volans, Amir1 retrying IPIP encapsulation on ncredir@ulsfo
[12:27:15] vgutierrez: ack, sending all pages to you then :D
[12:27:29] https://www.irccloud.com/pastebin/0tDRFa9r/
[12:27:33] seems to be happy :)
[12:27:48] yay
[12:27:48] port 443 too
[12:27:55] 🍾
[12:29:15] awesome
[12:32:28] https://grafana.wikimedia.org/goto/YvrDD0NSk?orgId=1 --> ipip-multiqueue-optimizer metrics are being scraped as expected
[12:32:34] I'll create a dashboard after lunch
[12:32:39] tcp_mss_clamper aren't there for some reason
[12:32:45] ok
[12:33:31] tcp_mss_clamper_packets_total{interface="ens13",state="clamped"} 2082
[12:33:51] but curl seems to report them properly
[12:45:43] Folks, I made an error in Netbox a moment ago - I am currently in the process of restoring from DB. Can I ask anyone doing work that needs to modify Netbox to pause for the next few mins?
[12:48:29] ok
[13:32:29] Netbox should now be back up and working - apologies for the interruption
[14:42:34] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/765257 broke Puppet on some Cloud VPS VMs, fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/978611/
[14:49:52] taavi: sorry about that, thanks
[16:30:43] https://labs.ripe.net/author/emileaben/does-the-internet-route-around-damage-edition-2023/
[16:52:26] if anyone has time for a quick +1, this just moves hosts out of insetup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978634
[16:54:15] inflatador: Done.
[16:54:42] btullis: excellent! Once again I am in your debt ;)
[16:59:51] A pleasure.
[17:16:01] cdanis: that's a good read, thanks :)
[17:48:06] godog: If I understood correctly, we do still retain Prometheus metrics in Thanos for more than 1 year, but only at longer intervals (e.g. 5min, 1h, instead of 1m). How do I query these in Grafana?
I can't seem to make them show up in https://grafana-rw.wikimedia.org/d/000000066/resourceloader?orgId=1&from=1663232400000&to=1673643600000&viewPanel=39&forceLogin&editPanel=39
[17:48:24] the query uses irate[5m] currently, although if I change it to 1h it similarly ends 1 year ago
[17:48:35] ref https://wikitech.wikimedia.org/wiki/Prometheus#FAQ
[17:52:00] My understanding of Prometheus/Thanos is that I wouldn't need to change the query in this way, but it's something random I tried in case it mattered.
[18:50:16] Krinkle: I suspect we shortened the retention in https://phabricator.wikimedia.org/T311690 - profile::thanos::retention::raw in Puppet looks to be set to 54w, which lines up reasonably with where your query cuts off.
[18:51:32] hmm, the wiki page clearly says that's only for 1-minute retention though, so I dunno
[19:03:58] yeah... there are 3 separate retentions
[19:04:07] but it seems Thanos only serves the raw one, best I can tell
[19:26:39] Krinkle: yeah, will be good to sync up w/ godog about this. FWIW downsampled metrics can be seen using thanos.wm.o, and I know the Grafana datasource can be configured with the max_source_resolution param to get at the downsampled metrics too. There's also a Thanos query option to enable auto-downsampling. In any event I added an "Experimental Thanos Downsample" datasource for the time being, but would be good to sync up on it
[19:43:25] Hey there, is there a way to find individual Envoy logs? For T352328 we're trying to debug why Envoy doesn't like our new image version, but I can't find any errors in https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s for us (wikifunctions) in eqiad staging (and we're not defined for codfw staging?).
[19:43:26] T352328: Cannot deploy new Evaluator images to prod; calls fail with `upstream connect error or disconnect/reset before headers. reset reason: connection termination` - https://phabricator.wikimedia.org/T352328
[19:49:59] James_F: I do see metrics for function-orchestrator and function-evaluator in eqiad/k8s-staging on that dashboard.
[19:57:59] James_F: https://logstash.wikimedia.org/goto/73e57e05a25f033dc005c321929b8846 the EPIPE from orchestrator seems related?
[20:12:20] taavi: Interesting, yeah, maybe?
[21:03:50] hi folks, including on-callers: Traffic has made some changes to the DNS hosts today, summarized in https://phabricator.wikimedia.org/T347054#9368653
[21:04:20] no action is required from your side, but if you see any issues, please let us know. Such an issue can be: failure running authdns-update, DNS failures when running a cookbook
[21:04:21] sukhe: coooool
[21:18:42] we will send out a detailed email when the changes are complete so people not on IRC are also aware. thanks!
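[Editor's note] The Thanos downsampling exchange above ([17:48:24] through [19:26:39]) can be illustrated with a small sketch. Thanos extends the Prometheus HTTP API with a `max_source_resolution` parameter on `query_range`, which controls whether downsampled blocks (5m, 1h) may be read; this also shows why `irate[5m]` goes blank past raw retention, since a 5m window can never contain two samples of a 1h-resolution series. The base URL and metric name below are placeholders, not the actual production endpoints:

```python
from urllib.parse import urlencode

def thanos_range_query_url(base, query, start, end, step,
                           max_source_resolution="auto"):
    # Thanos's /api/v1/query_range accepts an extra max_source_resolution
    # parameter selecting which blocks may be read: "0s" (raw only),
    # "5m", "1h", or "auto" (derived from step).
    params = urlencode({
        "query": query,
        "start": start,   # unix seconds
        "end": end,       # unix seconds
        "step": step,     # seconds between evaluation points
        "max_source_resolution": max_source_resolution,
    })
    return f"{base}/api/v1/query_range?{params}"

# Placeholder host and metric. Note the rate() window is widened to 2h
# so it always spans at least two samples of the 1h downsampled series;
# irate[5m] cannot, which matches the panel cutting off at raw retention.
url = thanos_range_query_url(
    "https://thanos.example.org",
    "sum(rate(some_requests_total[2h]))",
    1663232400, 1673643600, 3600,
    max_source_resolution="1h",
)
print(url)
```

In Grafana the equivalent is what was done above: a separate Prometheus datasource pointing at the Thanos querier with `max_source_resolution` set (or auto-downsampling enabled on the querier itself).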