[09:12:15] gehel: how are grizzly dashboards accessible?
[09:12:40] https://grafana.wikimedia.org/dashboards/f/SLOs/slos
[09:12:52] thanks!
[09:20:46] looking at https://grafana.wikimedia.org/d/yCBd7Tdnk/wdqs-wcqs-lag-slo?orgId=1 vs https://grafana.wikimedia.org/d/slo-WDQS/wdqs-slo-s?orgId=1&from=now-30d&to=now the grizzly dashboard does seem to be in line with the "all servers" SLO, but inspecting the query I can see that it's using the same "active servers" filter
[09:22:18] the grizzly dashboard does use org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries which should exclude depooled servers, not sure I understand why we're red there :/
[09:31:44] dcausse: not sure I understand :/
[09:32:38] no worries
[10:11:54] lunch
[13:32:37] inflatador, ryankemper: I know I'm late, but I had a look at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/937535. There is at least one bug in how we handle the data_loaded file that should be corrected.
[13:34:04] gehel ACK, will review later today
[13:49:36] errand
[14:26:30] err, doh. we ended up with two tickets for the same wiki filter :P https://phabricator.wikimedia.org/T345651 and https://phabricator.wikimedia.org/T345634
[14:26:58] so chances are david did most of the same thing i did yesterday
[14:32:27] sigh...
[14:33:12] it's ok, you probably did it in a way that makes java people happy :)
[14:33:31] well I don't know :)
[14:41:08] I can push the patch I have but I don't mind throwing it away tbh, it's very small
[14:45:30] shrug, mine isn't very impressive either :) +132 -19, filters by wiki in the ingester and by index prefix in the indexer
[14:47:01] this seems smarter than what I've done, I only added a filter on the ingester, so we might prefer your approach I think?
[14:50:14] ok, i can finish cleaning it up then. It was passing tests when i signed off last night, just need to look over it
[14:50:29] sure
[14:52:32] pfischer, ebernhardson: going to work on T346015, but are there objections to dropping java8?
[14:52:33] T346015: [Search Update Pipeline] Consider dropping support for java8 - https://phabricator.wikimedia.org/T346015
[14:52:50] main thing we lose is the ability to test/debug in yarn
[14:53:25] sounds reasonable, i suppose we don't know if we'll need yarn.. but without it i'm sure we'll come up with something :)
[14:54:33] we relied on yarn with the wdqs updater to salvage some checkpoints, but I believe that for the search update pipeline the state is less important and more easily "droppable"
[14:55:02] and certainly a lot smaller and simpler as we won't use rocksdb
[14:55:20] hmm, yea i don't think the state here is super important. Do we really have much beyond the windows?
[14:55:46] the inflight requests made to MW
[14:56:14] but that's at most the capacity set on the AsyncIO operator, so probably less than a hundred events
[14:57:10] yea sounds like we don't need to worry
[15:03:23] inflatador: SRE meeting in https://meet.google.com/rnb-jtio-dcy
[16:01:28] workout, back in ~40
[16:36:52] back
[16:48:26] ebernhardson updated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/956474/ per your suggestion if you have time to look
[17:01:22] whooohoo brouberol! Just saw your name in a git log for the first time ;)
[17:02:23] inflatador: with the addition of thanos-swift to the listeners, you should be able to remove the thanos-swift entries from dst_nets. You can see with a test render that we get auto-magic egress rules
[17:03:13] ebernhardson nice, how are you test rendering?
[17:03:24] helm3 template charts/flink-app --dry-run --debug -f .fixtures/general-eqiad.yaml -f helmfile.d/dse-k8s-services/rdf-streaming-updater/values.yaml -f helmfile.d/dse-k8s-services/rdf-streaming-updater/values-dse-k8s-eqiad.yaml
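(Aside on the "auto-magic egress rules" mentioned above: a minimal sketch of the kind of egress section such a test render produces once thanos-swift is declared as a listener. The exact resource names, labels, ports, and addresses depend on the chart's networkpolicy template and the fixture data, so this is illustrative only, not the real rendered output.)

```yaml
# Illustrative only -- not the actual manifest rendered by the flink-app chart.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rdf-streaming-updater   # hypothetical name
spec:
  podSelector: {}               # the real chart selects the release's pods
  policyTypes:
    - Egress
  egress:
    # A block like this is generated per declared listener; the fixtures give
    # every service the same placeholder IP, prod renders the real addresses.
    - to:
        - ipBlock:
            cidr: 10.2.2.1/32   # placeholder for the thanos-swift service IP
      ports:
        - protocol: TCP
          port: 443             # placeholder port
```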
[17:03:51] there might be better ways that parse through the helmfile and figure out which values files to use, but this seems to work :)
[17:04:55] you won't get the real ip addresses in the test render, the fixtures file gives all services the same ip address, but you can see that it renders the appropriate section in the network policy, and in prod it would have real ip addresses
[17:05:14] no worries, this is super useful compared to merging/looking at helmfile apply
[17:05:32] i suppose you could scp the real general-eqiad.yaml from deployment.eqiad.wmnet if it's really important
[17:06:40] ah, so what are you using to populate .fixtures/general-eqiad.yaml?
[17:07:20] that comes from the Rakefile refresh_fixtures task, i didn't know about it until it was pointed out on gerrit the other day :) sec
[17:08:06] you can perhaps use this to refresh fixtures if you don't have all the rake/ruby bits installed: docker run -it --security-opt seccomp=unconfined -u $(id -u):$(id -g) -v $PWD:/src --init --rm docker-registry.wikimedia.org/releng/helm-linter:0.5.0 refresh_fixtures
[17:09:25] does it merge listeners declared in helmfile.d/dse-k8s-services/rdf-streaming-updater/values.yaml (mw-api-int-async-ro & schema)?
[17:09:35] I have rake but it doesn't seem to be working. Should be just `rake refresh-fixtures` from the repo root, right?
[17:09:53] dcausse: listeners won't be merged since it's an array, we have to declare all of them at once
[17:09:59] (kinda annoying)
[17:10:07] inflatador: `rake refresh_fixtures` i think
[17:12:03] ah OK, that works but gives no output, then templating errors when I try to run helm3. Let me try with docker instead
[17:12:43] inflatador: mine also doesn't output anything, but it populates .fixtures/general-*.yaml and .fixtures/service_proxy.yaml
[17:14:26] we might simply update values.yaml and add thanos-swift there and drop the egress rules completely from values.yaml?
[17:15:32] dcausse: doesn't that go against what you were aiming for though, with values.yaml being agnostic to production? I guess i would expect the listeners to move the other way into the values-dse-*
[17:16:24] what version of helm3 are you using? Docker doesn't work, and I'm getting templating errors with helm3.10
[17:16:29] i guess i don't know how well any of that works though, the cirrus-streaming-updater helmfile stuff is completely unusable without the prod values
[17:16:55] same here
[17:17:04] https://phabricator.wikimedia.org/P52480
[17:17:33] ^^ errors I'm seeing. I can boot up a cloud server and run docker from x86 if need be
[17:17:37] inflatador: i'm on helm v3.12.3. Checking the output
[17:18:18] inflatador: that's not a helm issue, that's basically trying to execute the template and it saying the values files don't have everything defined that it needs. hmm
[17:18:21] if we put the listeners in values-dse-k8s-eqiad.yaml this means we have to repeat them for all jobs/envs
[17:20:13] dcausse: could layer in another values-prod.yaml :) But yea perhaps we take the more pragmatic approach.
[17:20:14] Yeah, Erik mentioned that. No problem putting it everywhere, but would like to test in values-dse-k8s-eqiad.yaml first. Since I'm having such problems with helm etc, let me revise the patch to just use the listeners and merge it for now
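(To illustrate why the listeners can't simply be split across files: Helm replaces array values wholesale rather than merging them across values files, so an environment-specific override has to restate the entire list. The top-level key below is a placeholder, not necessarily the chart's actual schema; the listener names are the ones discussed above.)

```yaml
# helmfile.d/dse-k8s-services/rdf-streaming-updater/values.yaml (shared defaults)
listeners:              # placeholder key name
  - mw-api-int-async-ro
  - schema
---
# values-dse-k8s-eqiad.yaml (environment override)
# The whole array replaces the shared one, so every existing listener must be
# repeated alongside the new thanos-swift entry:
listeners:
  - mw-api-int-async-ro
  - schema
  - thanos-swift
```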
[17:20:20] while writing these yaml files I was more concerned about DRY than local testing
[17:21:02] inflatador: if you test with values-dse-k8s-eqiad.yaml don't forget to copy all the other listeners
[17:21:25] inflatador: oh! i know why it's failing, because you would need to run the refresh_fixtures from my other patch
[17:21:34] inflatador: i hadn't regenerated my fixtures since working on that patch. sec
[17:21:56] when i added the fixtures for zookeeper i also added the missing fixtures for kafka, and you're failing on kafka
[17:22:12] this one: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/955032
[17:23:33] ebernhardson perfect, that did the trick
[17:25:55] I guess we should merge that one too, Janis gave it the +1
[17:26:04] will probably wait till early tomorrow though
[17:27:06] that one should be quite safe, it adds a new version of base.networkpolicy but doesn't use it anywhere. The next patch, which updates flink-app to use it, at least has the potential to change any deployed flink-app due to the updated dependencies
[17:32:19] ebernhardson ah OK, I was concerned since it is a change to base, but if it's not going to cause any problems I'll go ahead and merge.
[17:33:49] yup it'll be safe. In this repo the files in modules/ never get used directly at runtime. They only get used by sextant on the developers' instances
[17:36:04] cool, merging
[17:42:21] lunch, back in time for pairing
[17:46:02] dinner
[17:58:56] ebernhardson: hello, do you know whether wikimedia/discovery/analytics got moved to Gitlab? :)
[17:59:57] hashar: hmm, yes but not as an entire repo. The important bits were split into two different repos for new infrastructure and this repo is now unused
[18:00:17] ah so I can get it archived in Gerrit and removed from CI? :) I will file a task
[18:00:36] hashar: yup that works, thanks!
[18:01:17] hashar: search/analytics-integration falls into the same bucket. It was the integration environment for the old system, now unused
[18:12:12] https://phabricator.wikimedia.org/T346176 filed for wikimedia/discovery/analytics
[18:13:39] ebernhardson: I have marked search/analytics-integration read only, it was not even in CI :)
[18:13:57] that is all for tonight!
[18:20:08] back
[18:31:32] inflatador, ryankemper: we're in https://meet.google.com/eki-rafx-cxi with Erik
[20:35:24] Flink-app is back up in dse-k8s, although it still doesn't seem to be using ZK at all
[20:42:04] rebooting search-loader2001 for security updates
[20:44:35] Is this the best dashboard for search-loader? https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1
[20:48:36] inflatador: yup that's the most important one
[20:49:10] the other mjolnir dashboard also comes from those, but only runs once a week
[20:53:25] Nice. We're going to replace the current search-loaders soon, any preference as far as bullseye or bookworm?
[20:54:40] shouldn't matter much, i suppose the main thing they would need is the right python version. Newer might be ok, but might need to be verified
[20:55:05] ouch, they use 3.7.10 today
[20:55:13] Yeah, it's in a venv though?
[20:55:13] that's not going to be in any new os :)
[20:55:25] yes, but the venv is built locally on the instance
[20:55:26] Oh well, bullseye it is, and hope for the best ;)
[20:56:34] plausibly could move it into a helm service instead, it's a pretty straightforward python app that reads kafka and swift and writes to elasticsearch. But maybe we don't want to get into all that :)
[20:57:17] I'm good with that. I like looking busy ;)
[21:03:34] https://phabricator.wikimedia.org/T346189
[21:03:55] ^^ For the kubernetes/mjolnir discussion