[01:12:26] When you submit a query to the Wikidata Query Service, the result is cached. Is this result cached for a fixed period of time?
[06:31:28] hare: yes I believe so, for 300 sec according to https://gerrit.wikimedia.org/g/operations/puppet/+/c7534b382f2180da55ab7f67aa7fafa4d4fd6c24/modules/query_service/templates/nginx.erb#222 and what I see in response headers
[06:44:35] at a glance it seems a lot easier to me if all elastic instances see one big disk; not sure the failure scenario where a couple of instances still survive when one disk fails is worth the complexity? if a disk fails we'd depool this host anyway, no?
[06:45:35] ah, because it's "hotswappable" we could in theory salvage this elastic instance without totally depooling the host...
[06:49:00] reading T231010 I'm a bit confused about the multi es datapath approach tho, I'm not sure I understand what value it brings in the end compared to raid0
[06:49:01] T231010: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010
[06:56:13] if all three elastic instances share the same 3 disks, in the end they'll all have to be restarted; I doubt that elastic can work properly with one of its datapaths broken (could be tested tho)
[07:25:45] o/ dcausse: I’m about to release event-utilities, just noticed that your change would warrant a minor version bump per semver, but I already merged it. Is there any way to instruct zuul to bump that version in the release process? https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/build
[07:27:41] pfischer: o/ looking
[07:28:08] apparently no :/
[07:28:20] Okay, I’ll create a patch then…
[07:28:31] thanks!
[07:44:36] dcausse: If you have a moment: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/959946
[07:44:46] sure
[07:48:56] Thx!
[09:55:22] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-09-22
[10:03:54] dcausse: config options are now verified by default, re-render can be requested via command line flag: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/24
[10:05:05] pfischer: thanks! will take a look
[10:10:06] lunch
[13:53:02] dcausse based on what you and Erik are saying, it sounds like we need to move forward with RAID-0 instead of multiple data paths. Too many unknowns doing it the other way...
[13:55:58] o/
[13:58:00] inflatador: yeah... but since we run 3 elastic instances, could we set up one disk per instance?
[13:58:24] would not work well if one instance requires a lot more space than the others tho
[13:58:33] Y, was just about to say that
[13:59:32] I like the idea of being able to survive the loss of one disk, but not enough to change everything around completely. Losing a disk really just means we'd be down a host until they replace the disk and reimage...probably a few days at worst
[14:08:39] And speaking of cloudelastic...we have a proposal to change its networking around https://phabricator.wikimedia.org/T346946
[14:53:13] ryankemper Let us know your thoughts on https://phabricator.wikimedia.org/T231010 once you get in...sounds like we should probably use RAID0 as opposed to JBOD
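For reference, the only Elasticsearch-side difference between the two layouts discussed above is how `path.data` is set for each instance. A minimal sketch with hypothetical mount points (not the actual cloudelastic partitioning), just to illustrate the trade-off:

    # elasticsearch.yml -- hypothetical mount points, for illustration only
    # JBOD / multiple data paths (the T231010 proposal): shards are spread
    # across the paths, so losing one disk only loses the shards stored on
    # it, but every instance sharing that disk is still affected.
    path.data:
      - /srv/elasticsearch/disk0
      - /srv/elasticsearch/disk1
      - /srv/elasticsearch/disk2

    # RAID0 alternative: the array is striped at the block level and mounted
    # once, so a single disk failure takes out the whole instance's data.
    # path.data: /srv/elasticsearch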
[14:54:55] dcausse I'm going to apply https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959790/ unless you think we should wait. I'm still not clear about how a restore will work though
[14:55:38] inflatador: thanks, will check how it looks
[14:55:57] for a restore it's just setting the initialSavepointPath in the jobspec
[14:56:04] but here it won't stop anything
[14:59:54] flinkdeployment is in restarting state, pods/deploy hasn't changed yet
[15:00:49] \o
[15:01:08] getting some failures in the logs...hmm
[15:03:49] o/
[15:03:51] looking
[15:05:31] hm, it tries "Triggering stop-with-savepoint for job ad699cfb0eb2c53365df1d982b806b70."
[15:05:38] not what I'd have expected
[15:05:59] cool, I copied the logs to your people.wikimedia.org homedir ...just in case they have secrets ;)
[15:06:19] :)
[15:08:03] stop-with-savepoint does not work because of https://issues.apache.org/jira/browse/FLINK-28758
[15:08:32] we have to understand why the operator decided to run a stop-with-savepoint rather than a savepoint
[15:09:40] probably should've done the savepoint separately from the path change...my fault, sorry
[15:10:13] ah, that might explain it, yes
[15:11:07] I do see the new paths in ZK
[15:11:47] would restoring from a checkpoint work?
[15:12:10] yes I think so, we need to identify the checkpoint first
[15:13:45] might be 19444 for job ad699cfb0eb2c53365df1d982b806b70
[15:13:56] here it seems to be in a desperate restart loop, stop-with-savepoint -> fail -> restart
[15:14:18] we should cancel it (by undeploying it I suppose)
[15:14:45] OK, destroyed
[15:14:52] do we need to roll back the chart?
[15:15:07] what do you mean?
[15:15:09] so it doesn't try to take a savepoint
[15:15:21] if the deployment is gone it should not
[15:15:54] seems to be gone now
[15:16:11] It still has that `savepointTriggerNonce` in the values file
[15:16:24] that's why I was thinking it might try to take a savepoint when we deploy again
[15:17:03] this is just a flag used by the operator to know if it has changed or not; if the initial deploy has a savepointTriggerNonce I don't think that matters
[15:18:35] 19444 does seem to be the right checkpoint indeed, k8s_op_test_dse/wikidata_test/checkpoints/ad699cfb0eb2c53365df1d982b806b70/chk-19444
[15:19:27] * inflatador starts looking up how to restore from checkpoint
[15:21:38] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/job-management/#restart-failed-job-deployments
[15:22:58] inflatador: yes, the "Manual Recovery" section
[15:24:03] OK, will start a patch for that using ^^ path
[15:26:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/960080
[15:30:21] OK merged, deploying now
[15:32:26] Looks like it worked!
[15:34:27] nice
[15:36:10] ZK paths have changed too, they look good
[15:36:13] inflatador: I think you can try again the manual savepoint by incrementing the Nonce flag
[15:36:37] dcausse OK, will get a patch ready for that
[15:36:40] thanks!
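The `savepointTriggerNonce` mechanism mentioned above is an operator-level trigger: flink-kubernetes-operator takes a manual savepoint whenever the value changes. A rough sketch of the values change being prepared here, assuming the flink-app chart exposes the operator's job spec under an `app.job` key (the exact key layout is an assumption):

    # deployment-charts values sketch -- key layout under "app" is assumed
    app:
      job:
        # Incrementing this (e.g. 1 -> 2) tells flink-kubernetes-operator to
        # trigger a manual savepoint for the running job; the value itself is
        # only compared against the previously seen one.
        savepointTriggerNonce: 2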
[15:38:29] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/960087
[15:39:59] I goofed on the commit message and the Gerrit WebUI doesn't seem to be working for that, 1 sec
[15:46:00] OK, deployed
[15:46:36] [pod/flink-app-wdqs-d9cfbd46d-mtfgf/flink-main-container] {"@timestamp":"2023-09-22T15:45:53.553Z","log.level": "INFO","message":"Triggering checkpoint 19474 (type=SavepointType{name='Savepoint', postCheckpointAction=NONE, formatType=NATIVE}) @ 1695397553545 for job 434830c8020a6fe160224465a127517b.", "ecs.version": "1.2.0","process.thread.name":"Checkpoint
[15:46:37] Timer","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
[15:46:54] seems to work (SavepointType{name='Savepoint')
[15:47:02] Y, seeing the same thing
[15:47:23] do we want to try and restore from savepoint, or is our checkpoint test sufficient?
[15:48:23] we'll restore manually from a savepoint when we deploy the next version of the job that hopefully fixes the stop-with-savepoint issue
[15:48:43] here I'm curious to see where it saved this savepoint
[15:49:50] do we want to set any of those auto recovery settings such as `kubernetes.operator.job.restart.failed`?
[15:51:29] kubectl get flinkdeployment -o json | jq -r '.items[].status.jobStatus.savepointInfo.lastSavepoint.location'
[15:51:35] inflatador: I don't know
[15:51:50] I'm trying to think of a scenario where it would really help
[15:52:01] can't think of any so far
[15:53:03] I mean, in theory that would be nice, but it doesn't seem likely a failed job would just fix itself after a redeploy
[15:53:48] and that could potentially interfere with deploys anyway
[15:57:04] yes, I think we lack operational knowledge; if Gabriele has run into a situation where he thinks that could be useful I'm all for it
[15:57:40] but from my pov it might be premature as we don't really understand in what situations this would be useful to us
[15:59:22] Y, same here. I'm going to start documenting some of the steps for savepoint/checkpoint recovery under the operator at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Recover_from_a_checkpoint
[16:02:51] thanks!
[16:15:42] going offline, have a nice weekend
[16:25:37] you too
[18:19:43] inflatador: re raid0 vs jbod, yeah I think raid0 sounds acceptable…we should do some math though to see if needing to add another server will wipe out our cost savings
[18:21:44] ryankemper it probably would for cloudelastic itself, but based on what ebernhardson was telling me, we should be able to lose 2 cloudelastic hosts without losing the whole cluster
[18:23:22] Weathering up to 33% host loss is pretty decent
[18:23:50] If that’s the case we’re probably fine not scaling up host count then
[18:36:04] ryankemper ACK. I'll update the ticket and we'll start the RAID0 approach. I've got a patch for partman that's ready to review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/960114/ . My first crack at partman, probably won't work, but there's no blast radius AFAIK
[18:40:05] lunch, back in ~45
[18:43:22] kk will take a look
[18:55:28] back, forgot we have prometheus training in 5
[20:13:36] break, back in ~20
[20:32:50] back
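As a companion to the recovery steps being written up above, this is a rough outline of the manual recovery performed earlier in the day, following the operator's "Manual Recovery" documentation: destroy the stuck FlinkDeployment, then redeploy with `initialSavepointPath` pointing at the last complete checkpoint. The `app.job` key layout and the storage prefix are assumptions; the checkpoint path is the one identified during this incident:

    # deployment-charts values sketch -- key layout and storage prefix assumed
    app:
      job:
        # Only honoured on the initial deploy of the FlinkDeployment, which is
        # why the stuck deployment had to be destroyed before redeploying.
        initialSavepointPath: <storage-prefix>/k8s_op_test_dse/wikidata_test/checkpoints/ad699cfb0eb2c53365df1d982b806b70/chk-19444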