[07:30:59] pfischer: would you be able to help ryankemper on the errors above?
[07:32:02] ryankemper: otherwise, Data Engineering might have some Flink experience as well. Maybe ottomata or btullis ?
[08:33:23] ryankemper: I can try to have a look, if you would like. I've not much first-hand experience of Flink yet, but I'm happy to try to help.
[08:34:18] Errand, back in a few
[08:38:07] btullis: Ryan is probably asleep at this time. I don't think this is an emergency yet, so don't spend time on it!
[08:38:26] David is our main Flink expert, but he is out this week.
[08:41:16] gehel: Ah yes, I forgot about the timezones :) OK, feel free to ping again if it looks more serious in the meantime.
[09:55:56] weekly status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-05-05
[12:56:10] gehel: I'll look into it.
[12:56:20] pfischer: thanks!
[14:43:27] \o
[14:59:27] going to guess no luck on the rdf-streaming-updater? graphs for codfw are all flat
[15:28:59] at a general level, it seems like the taskmanager is trying to connect to the jobmanager and failing, since sometime around 10:20 UTC on May 4th. There are a few logs from that point in time, but nothing that looks too different from the previous day. I would probably guess at restarting the jobmanager; it seems stuck
[15:49:11] Time to start the weekend. Have fun and see you next week!
[16:05:48] am left wondering a bit about the log situation. Our docs make it seem like logstash and kubectl should show the same set of logs, but the kubectl logs update a couple times a minute and the logstash ones can go hours without a single message
[16:17:22] not sure how related it is, but kubemaster was rebooted ~2 minutes before everything fell over. On coming back up we get a "No master state to restore" log message, which seems suspicious, but I don't really know what a normal startup looks like
[16:48:38] seems like logging broke around Apr 18th; prior to that, logstash has regular messages (10k+/hr) from the instances, but since then only a few rare messages make it through.
[16:49:44] the difficulty is I'm supposed to get the most recent checkpoint from those logs to restart the thing, but they are missing. The backup procedure is to look in swift,
[16:50:54] inflatador: do you perhaps know where to find the swift credentials for rdf-streaming-updater? It's in the deployment templates as .Values.config.private.swift_api_key
[18:45:19] ebernhardson: do you still need those creds? think I found it in `/srv/private` on puppetmaster
[18:52:45] ryankemper: yea, I didn't end up getting any further without creds
[19:04:28] (creds delivered)
[19:04:48] I restarted (deleted) the `flink-session-cluster-main-` pod to see if that does anything
[19:07:40] Stuff's moving again, looks like it just needed the restart
[19:50:38] Interestingly the `(RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable` alert is still firing, but I see the codfw wcqs hosts working overtime to catch up on lag... gonna just keep watching for now
[19:52:34] ryankemper: thanks! ebernhardson: (or maybe inflatador: ?) is the RDF streaming updater already running with the new Flink k8s operator by now?
[19:58:03] pfischer: hmm, I don't know. I suspect not
[20:07:47] the runner in codfw does still seem a bit unstable, but I can't say for sure. Since restarting the main bit ~70m ago, two of the task managers have been restarted 3 times.
[20:07:50] but maybe that's normal, I dunno :P
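For reference, the checks and the restart described above correspond roughly to the kubectl commands below. This is a sketch only: the `rdf-streaming-updater` namespace and the `flink-session-cluster-main` pod name prefix are assumptions taken from the conversation, not verified values.

```bash
# Rough sketch of the checks discussed above; namespace and pod name
# prefix are assumptions from the chat, not verified values.
NS=rdf-streaming-updater   # assumed namespace

# Restart counts for the jobmanager/taskmanager pods
kubectl -n "$NS" get pods | grep flink

# Tail the jobmanager logs straight from the container; kubectl reads the
# live pod output, so it keeps working even when shipping to logstash breaks
JM=$(kubectl -n "$NS" get pods -o name | grep flink-session-cluster-main | head -n1)
kubectl -n "$NS" logs "$JM" --tail=200

# Force a restart by deleting the jobmanager pod; its controller recreates it
kubectl -n "$NS" delete "$JM"
```

Deleting the pod only helps because the controller recreates it; whether the job then resumes cleanly depends on the checkpoint/HA state discussed above.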
[20:54:34] ottomata: Is there a way to generate schema updates iteratively? I wanted to amend my change request, but now I run into an error: `Dereferenced current schema does not equal latest schema version 1.1.0`. How can I fix this?
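One possible way to resolve that last error, assuming the schema repository is built with @wikimedia/jsonschema-tools and that its `materialize-modified` subcommand is available: the check typically fails when `current.yaml` has been edited without bumping its version and regenerating the materialized, dereferenced version files. A hedged sketch, with a hypothetical schema path and version:

```bash
# Sketch only: assumes @wikimedia/jsonschema-tools is the materialization
# tool used by the schema repo. The schema path/version are hypothetical.

# 1. Bump the version in the schema's current.yaml, e.g.
#      $id: /hypothetical/schema/1.2.0
# 2. Regenerate the materialized version files for any modified current.yaml
#    so the dereferenced current schema matches the latest version again:
npx jsonschema-tools materialize-modified

# 3. Include both current.yaml and the newly generated files when amending
git add -A && git commit --amend --no-edit
```

If the repo wires this up differently (e.g. via an npm script or a git hook), running that instead should have the same effect.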