[11:17:30] <dcausse>	 lunch
[11:18:28] <hnowlan>	 heads-up, CirrusSearch jobs are getting 503s atm https://logstash.wikimedia.org/goto/dff64f5cf1e57a1f18f31d966ba9de2c 
[11:48:44] <gehel>	 dcausse: ^
[13:01:28] <dcausse>	 looking
[13:04:54] <dcausse>	 related to T249745, it's MW and/or eventbus having issues talking to each other
[13:04:55] <stashbot>	 T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745
[13:12:19] <dcausse>	 it was not caused by cirrus, it's just that cirrus is a big user of the jobqueue, other jobs were affected 
[14:17:13] <inflatador>	 <o/
[14:23:20] <dcausse>	 o/
[14:47:06] <inflatador>	 dcausse FYI, I'm about to apply new rdf-streaming-updater charts to make savepoints. Re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1004199
[14:47:15] <inflatador>	 if you want me to hold up LMK
[14:50:05] <dcausse>	 inflatador: looking
[14:52:06] <inflatador>	 dcausse ACK, I just did commons eqiad and it worked
[14:52:22] <dcausse>	 it created a savepoint?
[14:52:27] <inflatador>	 `@timestamp":"2024-02-20T14:50:21.101Z","log.level": "INFO","message":"Disposing s3://rdf-streaming-updater-eqiad/commons/savepoints/savepoint-0690a4-063
[14:52:27] <inflatador>	 0a3654160`
[14:52:59] <inflatador>	 Y, about to check from the swift side but I'm pretty sure we're good...also did this on staging last wk w/no problems
[14:53:09] <dcausse>	 nice
[14:56:51] <inflatador>	 OK, commons is good. Moving on to wikidata...
[15:02:54] <inflatador>	 OK, savepoints look good for commons/wikidata in both DCs...taking off for a quick workout
[15:07:36] <topranks>	 hey folks, how are we looking for today's network maintenance?  we've a bunch of elastic hosts as well as wdqs2009 and wdqs2020
[15:10:26] <topranks>	 T355867
[15:10:53] <dcausse>	 stashbot disappeared...
[15:11:13] <dcausse>	 topranks: unsure if we prepared anything, might have to wait for Brian, he should be back soon
[15:11:40] <topranks>	 dcausse: ok no probs thanks 
[15:12:48] <dcausse>	 if it's only node at a time it should be just fine tho
[15:13:05] <dcausse>	 *one
[15:13:21] <topranks>	 ok, yeah they are one-by-one, and usually very short (like 5 seconds or so)
[15:16:00] <dcausse>	 topranks: please feel free to do the two wdqs* nodes, I just depooled them
[15:16:11] <topranks>	 ok great thanks :)
[15:57:30] <inflatador>	 topranks sorry about that, I had that on my calendar for yesterday
[15:57:43] <topranks>	 ah not to worry 
[15:57:55] <topranks>	 is that a problem then?  we can skip the es hosts if needs be 
[15:58:06] <inflatador>	 nah, feel free to go ahead. I'll keep my eye out
[15:58:56] <topranks>	 ok great, thanks!
[15:59:37] <ebernhardson>	 \o
[16:00:19] <inflatador>	 <o/
[16:01:28] <dcausse>	 o/
[16:11:47] * ebernhardson goes back to guessing at how to integrate backfills
[16:13:17] <ebernhardson>	 maybe they need a dedicated namespace? I tried to create consumer-{search,cloudelastic}-backfill instances that we can use --set to apply start/end/wiki filters.  But helm wants to deploy them (or undeploy a running backfill, really). So i made them conditional on `--set backfill=true`, but now helm wants to undeploy them :P
[16:13:59] <inflatador>	 yeah, the helm paradigm seems non-optimal for this use case
[16:14:45] <ebernhardson>	 thinking about some sort of helmfile-backfill.yaml that we invoke directly or something...i dunno
[16:15:19] <ebernhardson>	 or maybe something sillier ... change the name of the consumer-search-backfill so it has an unrelated name when backfill != true
[16:15:29] <inflatador>	 just got an alert for morelike in CODFW...looks to be clearing but we had a nasty spike for a min or so
[16:15:35] <inflatador>	 https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39&from=now-15m&to=now
[16:17:17] <dcausse>	 ebernhardson: you plan to have a backfill per wiki? if so what about having the wikiname in the release name as well?
[16:19:18] <ebernhardson>	 dcausse: i was intending a backfill per cluster, matching how we run a reindex loop per cluster. The annoyance with per-wiki is flink doesn't shut down the job when it finishes.  The taskmanagers shut down, but not the jobmanager.  I guess it could monitor and then `helmfile destroy` the release when done
[16:19:30] <topranks>	 inflatador, dcausse: we're all done with the move in codfw 
[16:19:59] <ebernhardson>	 i'm already doing an insane method to monitor..(kubectl exec .. -- python3 -c 'import urllib; urllib.request....'
[16:20:15] <dcausse>	 ebernhardson: hm... this might be a limitaton of the flink-operator I guess
[16:20:20] <inflatador>	 topranks ACK, sorry for missing that ban. Will be ready for Thurs 
[16:20:39] <dcausse>	 topranks: thanks will repool wdqs machines
[16:20:41] <topranks>	 np, thanks!
[16:20:52] <ebernhardson>	 dcausse: yes, best i could find was about batch job's in the operator which it says is unsupported but should generally work.  It seems they haven't considered finished jobs in the operator
[16:22:16] <dcausse>	 so maybe it's better to directly talk to k8s directly and skip helm somehow...
[16:22:29] <ebernhardson>	 it still seems like the truly k8s way would be a custom operator where we apply a resource that says "reindex this wiki for these times" and the operator manages related flinkdeployment resources somehow.  But still not going there :P
[16:22:58] <dcausse>	 :)
[16:23:11] <dcausse>	 flink-cirrus-k8s-operator
[16:23:23] <ebernhardson>	 hmm, skipping helm seems tedious.  As a hack, i was thinking of naming the release `consumer-seaach-backfill{{ if not .Values.backfill }}-fake{{ end }}`
[16:23:41] <ebernhardson>	 in that case on a normal deploy it's called consumer-search-backfill-fake, there is no matching release deployed so helm should do nothing
[16:24:58] <ebernhardson>	 maybe, i think helmfile only tries to undeploy resources that are named in its config...more testing
[16:25:33] * ebernhardson wishes the structure of helmfile.yaml was templatable, and not just the strings
[16:25:35] <dcausse>	 not I understood why it would undeploy tho
[16:25:38] <dcausse>	 *sure
[16:25:52] <ebernhardson>	 if there is a release named X but it's not in the environments section, helmfile wants to undeploy it
[16:26:09] <ebernhardson>	 but i think thats only if it is named in the releases section, but not environments
[16:26:40] <dcausse>	 it's when deploying without the --release?
[16:28:14] <ebernhardson>	 dcausse: in a helmfile apply the end will say: Affected releases are: consumer-cloudelastic (wmf-stable/flink-app) DELETED
[16:28:55] <ebernhardson>	 i think thats because the release is defined in the releases section but not declared as something that should exist in this environment
[16:29:26] <ebernhardson>	 it all gets quite messy if i try and template all these names though ...
[16:30:15] <dcausse>	 so adding a new "-backfill" release will mess everything's up?
[16:30:42] <ebernhardson>	 dcausse: for example, see https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/cirrus-streaming-updater/helmfile.yaml#L105
[16:31:02] <dcausse>	 oh
[16:31:04] <ebernhardson>	 here i tried to have it always declare the regular and backfill releases, but only but the only correct for current context in the environments sectoin
[16:31:14] <ebernhardson>	 but when i try and deploy one, it wants to undeploy the other
[16:31:33] <ebernhardson>	 can of course use `--selector name=consumer-search-backfill` or whatever, but that makes life tedious
[16:32:26] <ebernhardson>	 and we can't let -backfill exist during a normal deployment, because the values files don't have the start/end/wiki filters
[16:32:51] <ebernhardson>	 (it would undeploy a running backfill)
[16:33:22] <ebernhardson>	 i'm sure i can hack something into place, just not convinced i will want to deal with the fallout :P
[16:33:40] <dcausse>	 what if we deploy a non running (suspended) backfill and use use kubectl to start it with proper wiki, start and end?
[16:34:26] <ebernhardson>	 dcausse: hmm, will that still require going through flinkdeployment resources?
[16:35:04] <dcausse>	 if it's suspended it should only deploy the flinkdeployment resource and the operator should do nothing until we change its values with kubectl
[16:35:27] <ebernhardson>	 right, but i mean if kubectl modifies the flinkdeployment resources, wont helm try and modify it back/
[16:36:20] <dcausse>	 if someone tried to deploy during a backfill most probably yes
[16:37:02] <ebernhardson>	 i guess thats part of my concern, reindexing can run for a few days across all wikis and i wanted to avoid  having helmfile act differently sometimes
[16:37:22] <dcausse>	 but perhaps same if we don't follow gitops by abusing of the helmfile --set flag
[16:38:07] <ebernhardson>	 yes, this is also awkward the other way around :)
[16:39:55] <dcausse>	 so we want backfills to bypass git, but we still want to take benefit from the helm templates
[16:40:27] <ebernhardson>	 the main thing we get from git is to use all the exact same configuration as a regular consumer, so the -backfill is only adding some extra properties
[16:41:58] <dcausse>	 perhaps somewhat related: https://github.com/roboll/helmfile/issues/49 (haven't read it fully yet)
[16:44:06] <inflatador>	 Ooh, interesting
[16:44:13] <inflatador>	 https://github.com/helmfile/helmfile/blob/main/examples/README.md
[16:45:16] <ebernhardson>	 hmm, so the solution there seems to be to put custom markers, and then invoke helmfile in such a way that it only considers the selected things
[16:50:54] <ebernhardson>	 i guess can investigate that route.  I kinda wish i had a local deployment i could better play with helmfile than shipping patches
[16:55:52] <inflatador>	 I sometimes test deploying from my homedir on the deploy server
[16:55:53] * ebernhardson still prefers the idea where the flink app recieves a message and reconfigures the live kafka client to read the right stuff...but that was somehow even harder :P
[16:56:16] <dcausse>	 :)
[18:02:15] <dcausse>	 dinner
[18:15:02] <inflatador>	 lunch, back in ~1h
[18:34:22] <dr0ptp4kt>	 i added some stuff to https://etherpad.wikimedia.org/p/wdqs-T336443 re: blazegraph import speed as an early check of if there's really any difference in their machine classes, thinking ahead on any server purchases for our DCs. nothing relevatory there.
[18:34:22] <stashbot>	 T336443: Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443
[18:48:58] <dr0ptp4kt>	 "their" being "aws"
[19:02:18] <ebernhardson>	 interesting, it does seem like more iops isn't doing that much (although the ramdisk test could make that more definitive). 
[19:02:49] <ebernhardson>	 cpu seems the limiter, which makes sense. My understanding is that blazegraph is a bit of a mess with locks and it has a hard time using more cores, preferring faster individual cores
[19:08:34] <inflatador>	 back
[19:24:05] <gehel>	 Thanks dr0ptp4kt ! That's confirming my suspicion. Always good to have hard data.
[19:25:35] <gehel>	 We might want to prioritize higher CPU speed for the next batch of servers. Do you have a sense of how much we might expect to gain on that side?
[19:29:57] <inflatador>	 I thought more cores was preferred, so we could service more queries at the same time? Whereas the data reload process isn't used all that much?
[19:30:07] <gehel>	 I'm also slightly worried that optimizing for data load might degrade the normal operation. We know that data load is single threaded (or at least single CPU - it's probably technically multiple thread but synchronizing on a single lock), but query load is quite highly parallelized. So trading cores for speed is probably not a good idea for query load.
[19:30:31] <gehel>	 yeah, it's a question of what we are optimizing for...
[19:30:53] <gehel>	 query load can also be addressed by adding more servers in the cluster, which can't be done for data load...
[19:35:46] <ebernhardson>	 in theory, since we copy the data around, you just need like 2 high-clock servers instead of the whole cluster. But still might be the wrong direction to optimize for
[21:21:16] <ebernhardson>	 random thought: We have to backfill even after a failed reindexing run, since the reindex could have completed on _content but failed on _general
[21:56:37] <ebernhardson>	 hmm, in theory thats a completed reindex+backfill.  Doesn't show up very well in the grafana dashboards though
[22:05:09] <inflatador>	 Looking at porting this check script over to prometheus-land ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/query_service/monitor/categories.pp#9 ) . Does this just become an exporter, or am I missing something?
[22:06:11] <ebernhardson>	 hmm, my typical way of moving something to prometheus is with a long running daemon that gets polled. Not sure if thats the plan here?
[22:21:40] <inflatador>	 Yeah, that's my thought too, is that it would be an exporter that formats the health check as prometheus metrics and runs as a daemon. Just wondering if that's the way to migrate these type of  local-script-based health checks to prometheus-land
[22:26:51] <ebernhardson>	 i guess it has to, the prometheus way is pull-based. There has to be something for prom to pull from.  They aren't too bad to write in python, there are some examples to look at
[22:32:35] <inflatador>	 Yeah, we have quite a few. Good to know that's I'm on the right track though ;)