[11:17:30] lunch
[11:18:28] heads-up, CirrusSearch jobs are getting 503s atm https://logstash.wikimedia.org/goto/dff64f5cf1e57a1f18f31d966ba9de2c
[11:48:44] dcausse: ^
[13:01:28] looking
[13:04:54] related to T249745, it's MW and/or eventbus having issues talking to each other
[13:04:55] T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745
[13:12:19] it was not caused by cirrus, it's just that cirrus is a big user of the jobqueue, other jobs were affected
[14:17:13] o/
[14:47:06] dcausse FYI, I'm about to apply new rdf-streaming-updater charts to make savepoints. Re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1004199
[14:47:15] if you want me to hold up LMK
[14:50:05] inflatador: looking
[14:52:06] dcausse ACK, I just did commons eqiad and it worked
[14:52:22] it created a savepoint?
[14:52:27] `@timestamp":"2024-02-20T14:50:21.101Z","log.level": "INFO","message":"Disposing s3://rdf-streaming-updater-eqiad/commons/savepoints/savepoint-0690a4-0630a3654160`
[14:52:59] Y, about to check from the swift side but I'm pretty sure we're good...also did this on staging last wk w/no problems
[14:53:09] nice
[14:56:51] OK, commons is good. Moving on to wikidata...
[15:02:54] OK, savepoints look good for commons/wikidata in both DCs...taking off for a quick workout
[15:07:36] hey folks, how are we looking for today's network maintenance? we've a bunch of elastic hosts as well as wdqs2009 and wdqs2020
[15:10:26] T355867
[15:10:53] stashbot disappeared...
[15:11:13] topranks: unsure if we prepared anything, might have to wait for Brian, he should be back soon
[15:11:40] dcausse: ok no probs thanks
[15:12:48] if it's only one node at a time it should be just fine tho
[15:13:21] ok, yeah they are one-by-one, and usually very short (like 5 seconds or so)
[15:16:00] topranks: please feel free to do the two wdqs* nodes, I just depooled them
[15:16:11] ok great thanks :)
[15:57:30] topranks sorry about that, I had that on my calendar for yesterday
[15:57:43] ah not to worry
[15:57:55] is that a problem then? we can skip the es hosts if needs be
[15:58:06] nah, feel free to go ahead. I'll keep my eye out
[15:58:56] ok great, thanks!
[15:59:37] \o
[16:00:19] o/
[16:11:47] * ebernhardson goes back to guessing at how to integrate backfills
[16:13:17] maybe they need a dedicated namespace? I tried to create consumer-{search,cloudelastic}-backfill instances that we can use --set to apply start/end/wiki filters. But helm wants to deploy them (or undeploy a running backfill, really). So i made them conditional on `--set backfill=true`, but now helm wants to undeploy them :P
[16:13:59] yeah, the helm paradigm seems non-optimal for this use case
[16:14:45] thinking about some sort of helmfile-backfill.yaml that we invoke directly or something...i dunno
[16:15:19] or maybe something sillier ... change the name of the consumer-search-backfill so it has an unrelated name when backfill != true
[16:15:29] just got an alert for morelike in CODFW...looks to be clearing but we had a nasty spike for a min or so
[16:15:35] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39&from=now-15m&to=now
[16:17:17] ebernhardson: you plan to have a backfill per wiki? if so what about having the wikiname in the release name as well?
[16:19:18] dcausse: i was intending a backfill per cluster, matching how we run a reindex loop per cluster. The annoyance with per-wiki is flink doesn't shut down the job when it finishes. The taskmanagers shut down, but not the jobmanager. I guess it could monitor and then `helmfile destroy` the release when done
[16:19:30] inflatador, dcausse: we're all done with the move in codfw
[16:19:59] i'm already doing an insane method to monitor..(kubectl exec .. -- python3 -c 'import urllib; urllib.request....'
[16:20:15] ebernhardson: hm... this might be a limitation of the flink-operator I guess
[16:20:20] topranks ACK, sorry for missing that ban. Will be ready for Thurs
[16:20:39] topranks: thanks will repool wdqs machines
[16:20:41] np, thanks!
[16:20:52] dcausse: yes, best i could find was about batch jobs in the operator which it says is unsupported but should generally work. It seems they haven't considered finished jobs in the operator
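(To illustrate the monitoring approach mentioned at 16:19:59 — the actual one-liner is elided above — here is a rough sketch of polling the Flink JobManager REST API for job completion. This is not the script in use; the service URL, port and poll interval are assumptions, only the /jobs/overview endpoint and the terminal job states come from Flink's public REST API.)

```python
# Rough sketch (not the production script): poll the Flink JobManager REST API
# until every job reaches a terminal state, e.g. from inside the cluster via
# `kubectl exec`. The service name/port below are illustrative assumptions.
import json
import time
import urllib.request

JOBMANAGER = "http://flink-app-consumer-search-backfill-rest:8081"  # assumed name
TERMINAL = {"FINISHED", "FAILED", "CANCELED"}

def job_states(base_url: str) -> dict:
    """Return {job name: state} from the Flink REST API's /jobs/overview."""
    with urllib.request.urlopen(f"{base_url}/jobs/overview") as resp:
        overview = json.load(resp)
    return {job["name"]: job["state"] for job in overview["jobs"]}

def wait_for_completion(base_url: str, poll_seconds: int = 60) -> dict:
    """Block until all jobs reported by the jobmanager are in a terminal state."""
    while True:
        states = job_states(base_url)
        if states and all(state in TERMINAL for state in states.values()):
            return states
        time.sleep(poll_seconds)

if __name__ == "__main__":
    print(wait_for_completion(JOBMANAGER))
```

Once this reports FINISHED, a wrapper could run `helmfile destroy` (or similar) for the backfill release, which is the follow-up step discussed below.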
[16:22:16] so maybe it's better to talk to k8s directly and skip helm somehow...
[16:22:29] it still seems like the truly k8s way would be a custom operator where we apply a resource that says "reindex this wiki for these times" and the operator manages related flinkdeployment resources somehow. But still not going there :P
[16:22:58] :)
[16:23:11] flink-cirrus-k8s-operator
[16:23:23] hmm, skipping helm seems tedious. As a hack, i was thinking of naming the release `consumer-search-backfill{{ if not .Values.backfill }}-fake{{ end }}`
[16:23:41] in that case on a normal deploy it's called consumer-search-backfill-fake, there is no matching release deployed so helm should do nothing
[16:24:58] maybe, i think helmfile only tries to undeploy resources that are named in its config...more testing
[16:25:33] * ebernhardson wishes the structure of helmfile.yaml was templatable, and not just the strings
[16:25:35] not sure I understood why it would undeploy tho
[16:25:52] if there is a release named X but it's not in the environments section, helmfile wants to undeploy it
[16:26:09] but i think that's only if it is named in the releases section, but not environments
[16:26:40] it's when deploying without the --release?
[16:28:14] dcausse: in a helmfile apply the end will say: Affected releases are: consumer-cloudelastic (wmf-stable/flink-app) DELETED
[16:28:55] i think that's because the release is defined in the releases section but not declared as something that should exist in this environment
[16:29:26] it all gets quite messy if i try and template all these names though ...
[16:30:15] so adding a new "-backfill" release will mess everything up?
[16:30:42] dcausse: for example, see https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/cirrus-streaming-updater/helmfile.yaml#L105
[16:31:02] oh
[16:31:04] here i tried to have it always declare the regular and backfill releases, but only the correct ones for the current context in the environments section
[16:31:14] but when i try and deploy one, it wants to undeploy the other
[16:31:33] can of course use `--selector name=consumer-search-backfill` or whatever, but that makes life tedious
[16:32:26] and we can't let -backfill exist during a normal deployment, because the values files don't have the start/end/wiki filters
[16:32:51] (it would undeploy a running backfill)
[16:33:22] i'm sure i can hack something into place, just not convinced i will want to deal with the fallout :P
[16:33:40] what if we deploy a non-running (suspended) backfill and use kubectl to start it with the proper wiki, start and end?
[16:34:26] dcausse: hmm, will that still require going through flinkdeployment resources?
[16:35:04] if it's suspended it should only deploy the flinkdeployment resource and the operator should do nothing until we change its values with kubectl
[16:35:27] right, but i mean if kubectl modifies the flinkdeployment resources, won't helm try and modify it back?
[16:36:20] if someone tried to deploy during a backfill most probably yes
[16:37:02] i guess that's part of my concern, reindexing can run for a few days across all wikis and i wanted to avoid having helmfile act differently sometimes
[16:37:22] but perhaps same if we don't follow gitops by abusing the helmfile --set flag
[16:38:07] yes, this is also awkward the other way around :)
[16:39:55] so we want backfills to bypass git, but we still want to benefit from the helm templates
[16:40:27] the main thing we get from git is to use the exact same configuration as a regular consumer, so the -backfill is only adding some extra properties
[16:41:58] perhaps somewhat related: https://github.com/roboll/helmfile/issues/49 (haven't read it fully yet)
[16:44:06] Ooh, interesting
[16:44:13] https://github.com/helmfile/helmfile/blob/main/examples/README.md
[16:45:16] hmm, so the solution there seems to be to put custom markers, and then invoke helmfile in such a way that it only considers the selected things
[16:50:54] i guess i can investigate that route. I kinda wish i had a local deployment where i could play with helmfile, rather than shipping patches
[16:55:52] I sometimes test deploying from my homedir on the deploy server
[16:55:53] * ebernhardson still prefers the idea where the flink app receives a message and reconfigures the live kafka client to read the right stuff...but that was somehow even harder :P
[16:56:16] :)
[18:02:15] dinner
[18:15:02] lunch, back in ~1h
[18:34:22] i added some stuff to https://etherpad.wikimedia.org/p/wdqs-T336443 re: blazegraph import speed as an early check of whether there's really any difference in their machine classes, thinking ahead on any server purchases for our DCs. nothing revelatory there.
[18:34:22] T336443: Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443
[18:48:58] "their" being "aws"
[19:02:18] interesting, it does seem like more iops isn't doing that much (although the ramdisk test could make that more definitive).
[19:02:49] cpu seems the limiter, which makes sense. My understanding is that blazegraph is a bit of a mess with locks and it has a hard time using more cores, preferring faster individual cores
[19:08:34] back
[19:24:05] Thanks dr0ptp4kt ! That's confirming my suspicion. Always good to have hard data.
[19:25:35] We might want to prioritize higher CPU speed for the next batch of servers. Do you have a sense of how much we might expect to gain on that side?
[19:29:57] I thought more cores was preferred, so we could service more queries at the same time? Whereas the data reload process isn't used all that much?
[19:30:07] I'm also slightly worried that optimizing for data load might degrade the normal operation. We know that data load is single threaded (or at least single CPU - it's probably technically multiple threads but synchronizing on a single lock), but query load is quite highly parallelized. So trading cores for speed is probably not a good idea for query load.
[19:30:31] yeah, it's a question of what we are optimizing for...
[19:30:53] query load can also be addressed by adding more servers in the cluster, which can't be done for data load...
[19:35:46] in theory, since we copy the data around, you just need like 2 high-clock servers instead of the whole cluster. But still might be the wrong direction to optimize for
[21:21:16] random thought: We have to backfill even after a failed reindexing run, since the reindex could have completed on _content but failed on _general
[21:56:37] hmm, in theory that's a completed reindex+backfill. Doesn't show up very well in the grafana dashboards though
[22:05:09] Looking at porting this check script over to prometheus-land ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/query_service/monitor/categories.pp#9 ). Does this just become an exporter, or am I missing something?
[22:06:11] hmm, my typical way of moving something to prometheus is with a long running daemon that gets polled. Not sure if that's the plan here?
[22:21:40] Yeah, that's my thought too: it would be an exporter that formats the health check as prometheus metrics and runs as a daemon. Just wondering if that's the way to migrate these types of local-script-based health checks to prometheus-land
[22:26:51] i guess it has to, the prometheus way is pull-based. There has to be something for prom to pull from. They aren't too bad to write in python, there are some examples to look at
[22:32:35] Yeah, we have quite a few. Good to know I'm on the right track though ;)
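(A minimal sketch of the exporter pattern discussed at the end of the log: a long-running daemon that wraps the existing check and exposes the result for prometheus to pull. This is not the actual migration; the script path, port and metric names are illustrative assumptions, and only the prometheus_client calls are the library's real API.)

```python
# Rough sketch (assumptions marked): run the legacy health-check script on a
# timer and expose its result as Prometheus metrics on a /metrics endpoint.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

CHECK_SCRIPT = "/usr/local/lib/nagios/plugins/check_categories"  # assumed path
SCRAPE_PORT = 9184   # assumed, pick an unused exporter port
INTERVAL = 300       # seconds between check runs

check_ok = Gauge("query_service_categories_check_ok",
                 "1 if the categories health check passed, 0 otherwise")
check_last_run = Gauge("query_service_categories_check_last_run_seconds",
                       "Unix timestamp of the last check run")

def run_check() -> None:
    """Run the existing check script and publish its exit status."""
    result = subprocess.run([CHECK_SCRIPT], capture_output=True)
    check_ok.set(1 if result.returncode == 0 else 0)
    check_last_run.set(time.time())

if __name__ == "__main__":
    start_http_server(SCRAPE_PORT)   # serves /metrics for prometheus to pull
    while True:
        run_check()
        time.sleep(INTERVAL)
```

In practice the shelling-out could be replaced by doing the check natively in Python, but wrapping the existing script keeps the exporter a thin, pull-friendly daemon as described above.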