[00:10:28] belatedly: nothing to report from Americas oncall today
[12:35:15] taavi: it is set on hieradata/role/eqiad/wmcs/openstack/eqiad1/cloudweb.yaml
[12:35:49] that does not apply to cloudweb2002-dev, which is in codfw and uses the codfw1dev role and not the eqiad1 one
[12:36:03] I see
[12:36:28] i'll send a patch
[12:37:09] that is ok, I'll sort it
[12:42:44] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034483
[13:20:23] taavi: anything outstanding?
[13:26:55] effie: nothing I can see, thanks!
[13:27:10] cheers
[15:02:40] cwhite, arnoldokoth: Nothing to report from EU oncall
[15:11:29] Thanks!
[15:22:15] eoghan: thank you.
[16:33:02] I'm seeing changes for the manufacturer attribute for ps1-b3-magru when running sre.puppet.sync-netbox-hiera - safe to merge?
[16:33:38] for ps[1234]-b4-magru actually
[16:33:42] hnowlan: yep
[16:33:46] cool, thanks
[16:33:50] robh: ^
[17:50:18] oh, I had no idea that affected hiera
[17:50:23] apologies hnowlan
[17:50:33] was fine to merge yes i modified the netbox entries
[17:50:54] no worries
[17:52:37] had to run out for doggo walkin while it wasnt 80 degrees F
[17:53:14] cuz now its 80F/26.6C and the sun is killer.
[17:57:03] I'd kill for 80F. Summer in Houston has finally really begun this week!
[17:58:48] the worst day in our short-term forecast is 6 days out on memorial day: 98F high, and 79F for the overnight low :P
[18:09:27] jayme: claime: apologies for the potentially dumb question. About T359640, T365265 - I suppose we have ruled out installing statsd-exporter on a normal host, i.e. one like where we host legacy statsite relays today?
[18:09:27] T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops - https://phabricator.wikimedia.org/T359640
[18:09:28] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265
[18:10:16] I suppose one reason, besides general k8s complexity/benefits, is that maybe the exporter is too slow/inefficient to handle all of a given data center, unlike the C implementation for statsite which is presumably a lot more efficient.
[18:10:31] but I don't actually know that for a fact, so I thought I'd mention it just in case.
[18:12:00] If it does scale to taking in all (new) statsd messages from MW pods in a given DC, like the old statsite was able to handle, then that might offer a simple solution that should save several orders of magnitude in label explosion.
[20:28:21] Krinkle: Limiting ourselves to one UDP receiver instance is simpler, but still a SPOF. We cannot perform maintenance without losing data in the current statsite arrangement.
[20:29:01] In addition, the exporter instance(s) have to be partitioned by MW version (T359497). Changes to the metric signature lead to dropped metrics.
[20:29:02] T359497: StatsD Exporter: gracefully handle metric signature changes - https://phabricator.wikimedia.org/T359497
[20:31:59] cwhite: I believe maintenance like OS upgrades does happen today without data loss. I'm guessing we switch the canonical/service DNS name to a standby/replacement. Anyway, I don't mean to literally suggest a single node, as much as to explore something that isn't multiplied/distributed by a large N. Indeed, you'd want multiple nodes, which is fair.
[20:32:34] The signature break is a good point though, that's something where Prometheus is inherently different and perhaps justified the big change.
[20:32:48] justifies* such a big change.
[20:32:54] Thanks for pointing that out.
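For context on the signature break cwhite raises above (T359497), a minimal, hypothetical sketch of why a long-lived exporter ends up dropping samples when a metric's label set changes. The class, metric name, and labels below are invented for illustration; this is not statsd-exporter's actual code.

```python
# Hypothetical sketch of the "metric signature" problem (T359497).
# NOT statsd-exporter's real implementation; names are placeholders.
class Registry:
    def __init__(self):
        self.label_names = {}   # metric name -> label-name set it was first registered with
        self.samples = []

    def observe(self, name, labels, value):
        signature = tuple(sorted(labels))
        registered = self.label_names.setdefault(name, signature)
        if registered != signature:
            # A new release added/removed a label: the sample no longer
            # matches the registered metric, so it is dropped.
            return False
        self.samples.append((name, dict(labels), value))
        return True

reg = Registry()
print(reg.observe("mediawiki_resourceloader_build_seconds",
                  {"module": "startup"}, 0.4))                    # True
print(reg.observe("mediawiki_resourceloader_build_seconds",
                  {"module": "startup", "wiki": "enwiki"}, 0.5))  # False: dropped
```

Running one exporter instance per MW release (T365265) keeps a single instance from ever seeing two signatures for the same metric at once, which is the partitioning concern mentioned above.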
[20:33:59] cwhite: so this is specific to clients and not an issue with the prometheus server/storage layer, right? That layer handles new labels gracefully?
[20:34:20] i.e. queries over time that don't specify the new label get continuity?
[20:37:46] Correct. statsd-exporter creates a prometheus metric instance in memory and adds new samples to those metrics based on the signature.
[20:39:11] Prometheus server accepts the metrics exposition format and turns them into timeseries data.
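A rough sketch of the two layers described in that last exchange: the exporter serves plain-text exposition lines, and the Prometheus server turns each unique label combination into a timeseries. The metric name is taken from T359640, but the label names and values here are placeholders, not what MW actually emits.

```python
# Illustrative only: exposition-format lines as the exporter would serve them,
# before and after a hypothetical new "wiki" label appears on the series.
exposition_before = 'mediawiki_resourceloader_build_seconds_count{module="startup"} 120'
exposition_after = 'mediawiki_resourceloader_build_seconds_count{module="startup",wiki="enwiki"} 7'

# Example PromQL (a string here; it runs on the Prometheus server, not in Python).
# Summing over all series ignores labels the query does not mention, so a graph
# of this expression stays continuous across the label change.
promql = 'sum(rate(mediawiki_resourceloader_build_seconds_count[5m]))'

print(exposition_before, exposition_after, promql, sep="\n")
```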