[08:14:37] <gehel>	 I'm re-reading T345015, trying to understand what is asked. I think I misunderstood this before. The ask seem to be that searching for `{{some_template}}` would behave like a search for `template:some_template`. Which is kind of reasonable.
[08:14:37] <stashbot>	 T345015: Interpret search term surrounded by {{ }} as template search, and strip [[ ]] around entered page names - https://phabricator.wikimedia.org/T345015
[08:21:36] <dcausse>	 I suspect they want to use the "go" feature, with: copy/paste {{template_name}} top-right and press enter then enter directly the page "Template:Template_name"
[08:22:37] <dcausse>	 skipping the search result page directly
[08:25:00] <dcausse>	 imo we have to be carefull about possible ambiguities with other search features and ponder whether this syntactic sugar is worthwhile or not
[08:26:13] <gehel>	 I don't think this is something I would like to see as a "go-bar" feature. But for the SRP, why not.
[08:26:41] <gehel>	 We don't have time to work on this anyway, so the question is more should we close this as a bad idea or are we open to do it in a distant future.
[08:28:36] <dcausse>	 well... it's not entirely a bad/broken idea and it's a matter of "taste" I suppose
[08:29:41] <gehel>	 There is also a question of adding even more feature and complexity to our Search DSL, and how users can understand it.
[08:30:01] <gehel>	 This specific feature could be implemented as a client side query rewrite in a gadget...
[08:31:48] <dcausse>	 indeed
[11:00:08] <pfischer>	 dcausse: I picked up your late fetch PR and continued your work. Would you have a minute (probably more) to review it? https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/11
[11:00:37] <dcausse>	 pfischer: sure, just saw it and was planning to have a look soonish
[11:58:36] <dcausse>	 break
[12:04:47] <pfischer>	 I’m off, back 2h
[13:15:32] <inflatador>	 <o/
[13:19:39] <dcausse>	 o/
[13:20:37] <inflatador>	 welcome back dcausse !
[13:20:42] <dcausse>	 thx!
[14:33:08] <inflatador>	 rebooting elastic eqiad for security updates
[14:37:14] <inflatador>	 dcausse I'm working on merging https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/dse-k8s-services/rdf-streaming-updater/values.yaml and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/dse-k8s-services/rdf-streaming-updater/values-dse-k8s-eqiad.yaml as there seems to be some contradictory config. LMK if you have any
[14:37:14] <inflatador>	  objections/feedback
[14:38:24] <dcausse>	 inflatador: you mean removing one of these file?
[14:38:55] <inflatador>	 dcausse Y, moving everything into values.yaml
[14:40:17] <inflatador>	 e-lukey suggested this, looks values.yaml is overriding with some configs we don't like...re https://phabricator.wikimedia.org/T344614#9136175
[14:41:08] <dcausse>	 inflatador: I thought that by convention we had to have a values.yaml that is "prod-agnostic"
[14:41:43] <dcausse>	 also we'll have a values-codfw.yaml that'll differ once we move out of dse-k8s
[14:43:39] <inflatador>	 dcausse I do see staging-specific values.yaml repeated in a few places...hmm. Shouldn't values-dse-k8s-eqiad.yaml override values.yaml? Looks like the opposite is happening based on Luca's comment
[14:44:35] <dcausse>	 you have "high-availability.type" in values-dse-k8s-eqiad.yaml but "high-availability" in values.yaml resulting in both being present I guess
[14:45:49] <dcausse>	 I think this is what Luca is referring to? we should have only one way to set this config value
[14:46:42] <dcausse>	 if "high-availability: SOMETHING" works we should perhaps set "high-availability: ZOOKEEPER" in values-dse-k8s-eqiad.yaml instead of high-availability.type: ZOOKEEPER
[14:48:21] <dcausse>	 but I have no objection to change values.yaml and set directly "high-availability: ZOOKEEPER" there
[14:48:47] <inflatador>	 Yeah...I think the config changed between flink versions too...I think "type" is the preferred way now, but let me verify
[14:50:36] <dcausse>	 indeed https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/config/#high-availability
[14:55:34] <inflatador>	 OK, I'll get a patch up to move high-availability.type into values.yaml
[14:57:06] <dcausse>	 sounds good to me!
[14:57:30] <ebernhardson>	 \o
[14:57:33] <dcausse>	 o/
[14:59:48] <inflatador>	 <o/
[15:00:06] <inflatador>	 Here's the patch for flink zk stuff: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/954957
[15:06:02] <dcausse>	 ebernhardson: marco pinged me to do a reload of analytics_platform_eng.image_suggestions_search_index_full/snapshot=2023-08-21 (T345545), wanted to check with you if this was not already, as far as I can see only the corresponding "delta" was shipped
[15:06:02] <stashbot>	 T345545: Search indices image suggestion tags differ from the dataset used to update - https://phabricator.wikimedia.org/T345545
[15:07:02] <ebernhardson>	 dcausse: oh, thats my mistake. I asked if it was in the same partition/place as it always is and he said yes, so i re-ran the existing bit.  But full dumps are in another table
[15:07:25] <dcausse>	 ok, prepped a couple airflow patch to help with this
[15:08:14] <dcausse>	 trying to repurpose the "image_suggestions_manual" dag into something that can be configured via dag run params
[15:12:42] <ebernhardson>	 dcausse: it must be too early in the morning, how does this work? You pass an additional hint parameter to str.format but i'm not sure how it get used
[15:15:03] <dcausse>	 ebernhardson: this uses the ability to pass a json config when triggering a dag
[15:15:45] * dcausse trying to find some airflow doc
[15:16:16] <dcausse>	 https://stackoverflow.com/questions/53663534/for-apache-airflow-how-can-i-pass-the-parameters-when-manually-trigger-dag-via
[15:16:42] <ebernhardson>	 dcausse: oh, i hadn't seen you changed the template on image_suggestions_manual to '{partition_spec_hint}'
[15:16:53] <ebernhardson>	 so, basically it works only when calling a dag specifically configured for it
[15:17:03] <dcausse>	 yes
[15:18:23] <ebernhardson>	 should be ok i suppose
[15:19:02] <dcausse>	 this to avoid having to create a new specific config and a specific dag for the fixup like image_suggestions_fixup_T320656
[15:22:02] <dcausse>	 damn seems like I pushed directly to discolytics@main while I wanted to update my MR :/
[15:22:34] <ebernhardson>	 hmm, we could disable that probably
[15:22:52] <ebernhardson>	 well, hmm. maybe :)
[15:24:14] <ebernhardson>	 dcausse: i dunno if that will work...i left 'allowed to merge' as maintainers, but 'Allowed to push and merge' to 'instance admins'
[15:24:15] <dcausse>	 ah no you merged my MR while I was pushing a cleanup
[15:25:01] <dcausse>	 lemme try to push this fixup to main directly to test
[15:27:06] <dcausse>	 no I can't seeing: ! [remote rejected] HEAD -> main (pre-receive hook declined)
[15:28:14] <ebernhardson>	 sounds good, maybe we should set that on the rest of our repos.
[15:28:32] <ebernhardson>	 i wonder if there is any more convenient management than clicking around each project
[15:33:45] <dcausse>	 no clue :/
[15:54:13] <dcausse>	 hm.. the discolytics trigger_release now fails with  ! [remote rejected] HEAD -> main (pre-receive hook declined) :/
[15:54:23] <ebernhardson>	 sigh...the bot is a maintainer
[15:54:30] <dcausse>	 :/
[15:56:27] <ebernhardson>	 apparently there is some way of setting ACL's, but now getting more complex per-project config :P
[15:58:57] <ebernhardson>	 meh probably best to leave it as it was before and try to remember to not push
[15:59:22] <dcausse>	 sure
[16:00:11] <ebernhardson>	 done
[16:00:31] <inflatador>	 OK, rolling out the flink-app changes on DSE
[16:09:05] <inflatador>	 HA related validation errors: https://logstash.wikimedia.org/goto/a4c8565c3cf36179b1130211b26d20af . Taking a look
[16:12:32] <dcausse>	 hm... not sure we asked for multiple jobmanagers
[16:13:02] <dcausse>	 the operator might be confused if you attempted to change the H/A mode of a running app
[16:13:22] <dcausse>	 did you stop with a savepoint before doing all this?
[16:15:04] <inflatador>	 dcausse no, I wasn't too concerned with breaking it but it looks like I broke it anyway ;)
[16:15:50] <inflatador>	 I can try to restore from a save point if that works, or restart the operator
[16:16:19] <inflatador>	 as far as multiple job managers, that's required for Zookeeper HA
[16:22:44] <inflatador>	 I'm restarting flink operator a la https://phabricator.wikimedia.org/T340059
[16:33:11] <inflatador>	 well, that didn't work...flink-operator failed with error when I tried to restart. Looking in to it now
[16:33:24] <inflatador>	 https://logstash.wikimedia.org/goto/d6ac9e78aa74a40505d3119edf4daa49
[16:43:11] <inflatador>	 getting validation errors when I try to redeploy , it doesn't like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/flink-operator/helmfile.yaml#18
[16:43:32] <inflatador>	 "<.Values.chartVersions>: map has no entry for key "chartVersions"
[16:50:57] <ebernhardson>	 hmm, chartVersions is certainly referenced in a bunch of places
[16:51:21] <ebernhardson>	 and it's defined in values/common.yaml as an empty map.  Something odd if it's not seeing that
[16:54:55] <inflatador>	 yeah, I'm asking in #wikimedia-k8s-sig but also wondering if chartmuseum URL needs to be explicitly set in the flink operator charts
[16:55:37] <inflatador>	 a la https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/eventgate/Chart.yaml#13
[16:57:46] <ebernhardson>	 also...wow. I hadn't looked at this thing but it sure takes a lot to run flink...
[16:58:46] <ebernhardson>	 well not that bad, but more than i expected :P 
[16:59:08] <inflatador>	 no kidding
[16:59:27] <inflatador>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-kubernetes-operator/Chart.yaml#13 interestingly, the chartmuseum URL is in a comment for the other flink operator chart
[17:02:51] <ebernhardson>	 This error is from helmfile, rather than helm right? Must be since it's the helmfile.yaml.  I suspect helmfile doesn't care about the repository and only references it by name, helm would be the one complaining if the repo was a problem iiuc
[17:04:08] <inflatador>	 Y, I think so
[17:05:11] <ebernhardson>	 i wonder about how it's sourcing the values, what it thinks it's base directory is, etc.  This chartVersions stuff is copy/pasted to most (all?) the admin_ng charts, seems it's supposed to read the empty map from values/common.yaml
[17:06:13] <ebernhardson>	 inflatador: could you paste full output somewhere?
[17:08:39] <inflatador>	 ebernhardson it's here https://phabricator.wikimedia.org/P52261 . jayme is helping me in #wikimedia-k8s-sig
[17:09:19] <ebernhardson>	 oh, you probably can't run it from that directory. but we can continue on in the other 
[17:09:20] <dcausse>	 inflatador: did you pass the release name?
[17:09:42] <dcausse>	 I think this one does not use the default "main" release
[17:09:43] <ebernhardson>	 admin_ng is a base helmfile, i think it has to be run from there
[17:09:57] <ebernhardson>	 thats where the path references for values/... are from at least
[17:10:04] <dcausse>	 oops sorry did not see that you were deploying admin_ng
[17:30:49] <inflatador>	 elastic eqiad reboots are done...heading to lunch, back in ~40
[17:32:31] <dcausse>	 inflatador: might be best to throw away the current state in the dse-ks8-eqiad@rdf-streaming-updater namespace and start fresh (possibly using the latest known checkpoint if not too old or I can build a new fresh savepoint if we need to)
[17:41:02] <dcausse>	 k8s H/A did not require multiple jobmanagers, did it complain that we had a single one? my understanding that multiple jobmanagers help with recovery times as it does not have to wait for a pod restart but it should not be strictly required 
[17:51:16] <dcausse>	 going to reimport the image suggestions dataset (if my patch works..), this is ~90M tags but hopefully a lot of them are already there
[18:04:32] <gehel>	 inflatador, ryankemper: I'm just out of my last meeting, I might be late for the pairing session, or not even show up
[18:04:46] <inflatador>	 gehel ACK
[18:05:19] <inflatador>	 dcausse I was thinking we need HA job managers because of https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/ha/overview/ , but I guess it's not a hard requirement? 
[18:06:43] <inflatador>	 As far as the dse flink-app, probably throw away current state. I'll take a savepoint next time if it makes things easier for you. Sorry for any trouble
[18:07:26] <dcausse>	 no worries at all!
[18:09:08] <dcausse>	 for jobmanagers replica being 1 vs 3 it's a matter of how fast we want to recover vs how much resource we want to utilize, if we have the machines I'm all for having standby replicas but imo this should not be a strong requirement for us, a pod restart should take a couple minutes which is totally affordable within our SLO
[18:12:10] <inflatador>	 Ah OK, then we can probably skip it
[18:32:51] <inflatador>	 ryankemper we're in pairing if you wanna join
[20:18:16] <inflatador>	 OK, we rolled back to a single job manager. It took a destroy/apply cycle , but we're good. Working on restoring from checkpoint next
[20:41:40] <inflatador>	 Simple PR if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/954134/
[20:41:51] <inflatador>	 just marking some hosts back to insetup in Puppet
[20:42:18] <ebernhardson>	 +1
[20:46:01] <inflatador>	 looks like this kibana query for savepoints isn't returning anything https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Recover_from_a_checkpoint . Tried a few permutations but nothing so far
[20:46:12] <inflatador>	 also, thanks for the +1!
[20:46:29] <ebernhardson>	 :) looking
[20:50:39] <inflatador>	  kubernetes.master_url:"https://dse-k8s-ctrl.svc.eqiad.wmnet:6443" AND kubernetes.namespace_name:rdf-streaming-updater might be closer
[20:52:31] <ebernhardson>	 even if we get that, a simple search for "Completed checkpoint" over the last 14 days give no results :( 
[20:52:43] <ebernhardson>	 either the logs aren't making it into logstash, or it's not happening.
[20:56:18] <inflatador>	 yeah, I'm seeing the same thing. Let me check what the prod logs say
[20:58:51] <inflatador>	  `kubernetes.master_url:"https://kubemaster.svc.codfw.wmnet:6443" AND kubernetes.namespace_name:rdf-streaming-updater AND "Completed checkpoint"` doesn't show anything, looking an arbitrary wk in August
[21:02:22] <inflatador>	 the `kubectl logs` from prod shows the messages, but still not sure where they end up in logstash
[21:03:09] <ebernhardson>	 sounds like they aren't making it then, should be able to choose an arbitrary string from the log and find it
[21:04:30] <inflatador>	 that's what it looks like from kubectl https://phabricator.wikimedia.org/P52264
[21:08:35] <ebernhardson>	 yea not making it into logstash :(
[21:08:56] <ebernhardson>	 i'm not even sure where to look, k8s logs to logstash should be magic :P
[21:09:17] <inflatador>	 smells like a new ticket ;)
[21:14:23] * inflatador wonders if some of that info is in Swift metadata
[21:24:08] <inflatador>	 https://phabricator.wikimedia.org/T345668
[21:42:58] <inflatador>	 so the job ID from kube logs does appear as a path in swift, but no relevant data besides timestamp. I wonder if there's a way to tell when checkpoints are too old and drop them
[21:44:12] <inflatador>	 ah, it's all in the docs https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Recover_from_a_checkpoint