[00:52:21] serviceops, MW-on-K8s, Observability-Metrics, Grafana: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885#10050015 (colewhite) I couldn't find any gaps in the data, but please let me know if you do! For awareness: https://grafana.com/blog/2020/09/28/new-in-grafana-7....
[06:13:41] serviceops, MW-on-K8s, Observability-Metrics, Grafana: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885#10050063 (Joe) >>! In T371885#10049340, @Scott_French wrote: > **Edit**: Whoops, I completely missed T371885#10048618 onward before posting this. In any case, que...
[06:57:22] serviceops, MW-on-K8s, Observability-Metrics, Grafana, Patch-For-Review: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885#10050121 (Joe) Now prometheus only reports scraping the correct ports https://prometheus-eqiad.wikimedia.org/k8s/targets?scrapePool=k8s-pods...
[08:08:12] serviceops, MW-on-K8s, Observability-Metrics, Grafana: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885#10050207 (daniel) >>! In T371885#10050015, @colewhite wrote: > Possibly `$__rate_interval` is calculating some interval that yields no data (or not enough data to...
[08:24:23] serviceops, MW-on-K8s, Observability-Metrics, Grafana: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885#10050256 (fgiunchedi) Thank you all for the investigation and help on this -- appreciate it! To recap, this problem is actually the same as what's discussed at {...
[13:06:27] swfrench-wmf: o/ I haven't forgot about https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1056001, I was just lagging a bit due to other tasks.. How soon would you need it? I'll hopefully work on it next week
[14:41:38] elukey: no problem at all, in fact I did as well :) it's not blocking anything, as the change it would have supported simply resolved the cname from within the cookbook (like a handful of other cookbooks do). I'll follow up on the patch :)
[14:44:03] ooook! Let's try to find a solution, then we can cut a new release
[15:04:21] serviceops, DC-Ops, ops-codfw, SRE: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10051598 (Jhancock.wm)
[15:04:24] serviceops, DC-Ops, ops-codfw, SRE: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10051599 (Jhancock.wm) @JMeybohm drives are installed. Lemme know if it all looks good or if you need anything else.
[15:51:22] serviceops, Data Products, Data-Platform-SRE, Dumps-Generation, and 2 others: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650#10051792 (dr0ptp4kt) Following up on some discussions: - @BTullis confirmed that Data Platform SRE is ok...
[15:59:00] serviceops, MW-on-K8s, wikitech.wikimedia.org, Patch-For-Review: MVP: Privately serve wikitech via mw-on-k8s - https://phabricator.wikimedia.org/T371537#10051856 (jijiki) Open→In progress p:Triage→High
[16:17:08] serviceops, LPL Essential, MinT, Community Wishlist (Translations), Community-Tech (Ezo Red Fox (July 29 - Aug 9, 2024)): Caching service request for MinT - https://phabricator.wikimedia.org/T370755#10051930 (jijiki) @santhosh we don't have any updates for you yet, and most of our team will b...
[16:47:41] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052036 (Catrope)
[16:54:14] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052110 (Catrope)
[17:26:07] Can anyone here help me unravel the mystery of a helmfile lint failure that seems unrelated to the change under test? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060899 https://integration.wikimedia.org/ci/job/helm-lint/19680/console
[17:31:01] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052252 (LGoto)
[17:31:34] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052254 (LGoto)
[17:33:54] brouberol: your https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060828 looks suspiciously related to my linter failure of `unexpected type of key in map: expected string, got int: value=15, map=map[15:f46f55fc526b298a079dd95e818d6035427f4ded-postgresql@sha256:43ad731e316913a6e4b8d3bb61eb35dee8e2353f641b64cfd23318c2056908f4]`
[17:39:41] yeah I agree that's suspicious
[17:39:48] btullis: are you around by any chance? ^
[17:40:03] Yes I'm here.
[17:40:04] I know it's late in the day for both, I don't want to roll back if I can avoid it
[17:40:33] Looking now.
[17:40:39] btullis: sorry, I know you're just the reviewer not the author :) thanks
[17:41:10] I approved it, but feel free to revert. This is pre-prod for us.
[17:41:45] cool, thank you
[17:42:03] giving brouberol a few more minutes just in case he's around after all, then I'll revert
[17:44:57] rzl: thanks for looking! I'm going to stop for lunch and hope things are less confusing when I get back. :)
[17:44:59] in the meantime I can't actually see where that failure comes from immediately -- an integer key would definitely be No Good but all the quoting is right afaict
[17:45:10] bd808: 👍
[17:45:51] yeah, the linter output is not helpful. I know I once knew how to run the linter locally and poke for more info, but I seem to have forgotten how to do that too.
[17:47:19] the postgresql version string that is just 15 without quotes? like 75 of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060828/6/charts/cloudnative-pg-cluster/values.yaml
[17:47:25] line
[17:48:54] mutante: if you can find somewhere that's included as a map key, you're on to something
[17:49:53] I mean, and clearly it is somewhere, I just haven't been able to track it down yet
[17:49:57] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052316 (Catrope)
[17:50:12] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052318 (Catrope)
[17:50:33] serviceops, Charts, Wikimedia-Extension-setup, Epic, Wikimedia-extension-review-queue: Epic: Deploy Chart extension in production - https://phabricator.wikimedia.org/T369944#10052321 (Catrope)
[17:52:49] I'm about to step away from the keyboard, but we really don't mind waiting until next week to debug this. Happy if you want to fix forward or revert.
[17:54:40] btullis: okay, thanks! I'll give up on figuring it out and revert shortly :) no need to hang around, thanks for the quick response
[17:54:42] have a good weekdn
[17:54:45] weekend, also
[17:57:47] Thanks. Same to you all, too.
[17:59:40] can't find it being used as a map key
[18:09:08] https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&refresh=1m&from=now-24h&to=now&viewPanel=84
[18:09:11] mw-web is quite hot
[18:09:19] since about 15:45
[18:36:11] hmm, reverting that cloudnative-pg-cluster chart patch (with a version bump) didn't resolve the lint error
[18:41:22] does the patch need rebased now that the potential trigger is reverted at head?
[18:43:06] maybe! I thought CI was pulling from the chart repo, which I thought should be updated independently, but that's a good thing to try
[18:43:34] I don't want to step on bd808's toes so I'll let him rebase at will when he's back at keys
[18:44:46] also ... this is an interesting problem ... IIRC helmfile does some weird things internally where it round-trips through template substitution, unmarshalling the resulting yaml to map types
[18:45:15] I wonder if that's not interacting well with having a "string looking" map key?
[18:46:22] yeah that's super plausible
[19:22:11] Having yaml keys that need special quoting and the twisted mess of templating layers that is helm seems like a recipe for sadness.
[19:26:05] Rebasing on the revert did not make the linter happier. The exact same lint error persists. The "f46f55fc526b298a079dd95e818d6035427f4ded-postgresql" string shown in the failure message doesn't exist in the deployment-charts.git repo. It instead comes from operations/puppet.git's hieradata/role/common/deployment_server/kubernetes.yaml as far as I can tell.
[19:26:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060827/2/hieradata/role/common/deployment_server/kubernetes.yaml -- and that 15 key is not quoted.
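Editor's note: the lint message quoted above ("expected string, got int: value=15") matches what happens when a bare `15:` YAML key is loaded as an integer rather than a string. The following is a minimal sketch of that check, not the actual helmfile or CI code. The helper `non_string_keys`, the key name `postgresql_images`, and the image values are hypothetical, and the dicts stand in for what a typical YAML loader would return.

```python
from typing import Any, Iterator, Tuple


def non_string_keys(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, key) for every mapping key that is not a str."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if not isinstance(key, str):
                yield here, key
            # Recurse so nested maps and lists are checked too.
            yield from non_string_keys(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from non_string_keys(item, f"{path}[{index}]")


# Stand-ins for parsed YAML (assumption: a typical loader turns a bare
# `15:` key into the integer 15 and a quoted `"15":` key into the string "15").
unquoted = {"postgresql_images": {15: "example-postgresql@sha256:abc"}}
quoted = {"postgresql_images": {"15": "example-postgresql@sha256:abc"}}

bad = list(non_string_keys(unquoted))
good = list(non_string_keys(quoted))
print(bad)   # the integer key is reported with its path
print(good)  # nothing is reported once the key is quoted
```

Running a walk like this over the merged values, rather than over a single repository, is one way to catch a key that originates in a different repo from the change under test.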
[19:27:24] oh! yes, clearly
[19:27:41] rzl: I will upload a patch for you to review
[19:28:02] I think it's fine to just revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060827 but happy to review something else if you like
[19:29:13] rzl: I have a patch that makes it "15" which might be the needed magic
[19:29:53] entirely possible
[19:29:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060915
[19:30:56] the action at a distance here is unfortunate, but we seem to use this part helm, part puppet pattern fairly extensively
[19:31:45] I am not an expert in this setup and can't respond usefully to criticism of it, I'm just trying to help unbreak it :)
[19:40:16] bd808: passes locally for me now, give it another try
[19:40:39] nm, I see you're on it :)
[19:44:28] well dang...thanks for catching that. Maybe we can store all this stuff in etcd or someday ;)
[19:45:15] delete "or" from my last msg ;(
[19:52:47] inflatador: having a bad key imported from etcd would have been worse for my debugging ;)
[19:55:14] rzl: changes are merged and in process of deploy. thank you very much for your assistance
[19:55:34] yeah, but then it would only affect the bad key writer...who shall remain nameless ;)
[19:56:32] bd808: glad to hear!
[23:41:30] serviceops, MW-on-K8s, SRE, Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#10053269 (Scott_French) Though mainly focused on supporting the php 8.1 migration, there's ongoing work to support multiple base-image “flavors” and a helm-rele...