[07:30:59] pfischer: would you be able to help ryankemper on the errors above?
[07:32:02] ryankemper: otherwise, Data Engineering might have some Flink experience as well. Maybe ottomata or btullis ?
[08:33:23] ryankemper: I can try to have a look, if you would like. I've not much first-hand experience of Flink yet, but I'm happy to try to help.
[08:34:18] Errand, back in a few
[08:38:07] btullis: Ryan is probably asleep at this time. I don't think this is an emergency yet, so don't spend time on it!
[08:38:26] David is our main Flink expert, but he is out this week.
[08:41:16] gehel: Ah yes, I forgot about the timezones :) OK, feel free to ping again if it looks more serious in the meantime.
[09:55:56] weekly status update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-05-05
[12:56:10] gehel: I'll look into it.
[12:56:20] pfischer: thanks!
[14:43:27] \o
[14:59:27] going to guess no luck on the rdf-streaming-updater? graphs for codfw are all flat
[15:28:59] at a general level, it seems like the taskmanager is trying to connect to the jobmanager and failing, since sometime around 10:20 UTC on May 4th. There are a few logs from that point in time, but nothing that looks too different from the previous day. I would probably guess at restarting the jobmanager; it seems stuck
[15:49:11] Time to start the weekend. Have fun and see you next week!
[16:05:48] am left wondering a bit about the log situation. Our docs make it seem like logstash and kubectl should show the same set of logs, but the kubectl logs update a couple times a minute and the logstash ones can go hours without a single message
[16:17:22] not sure how related it is, but kubemaster was rebooted ~2 minutes before everything fell over. On coming back up we get a "No master state to restore" log message, which seems suspicious, but I don't really know what a normal startup looks like
[16:48:38] seems like logging broke around Apr 18th; prior to that, logstash has regular messages (10k+/hr) from the instances, but since then only a few rare messages make it through.
[16:49:44] the difficulty is I'm supposed to get the most recent checkpoint from those logs to restart the thing, but they are missing. The backup procedure is to look in swift,
[16:50:54] inflatador: do you perhaps know where to find the swift credentials for rdf-streaming-updater? It's in the deployment templates as .Values.config.private.swift_api_key
[18:45:19] ebernhardson: do you still need those creds? think I found it in `/srv/private` on puppetmaster
[18:52:45] ryankemper: yea, I didn't end up getting any further without creds
[19:04:28] (creds delivered)
[19:04:48] I restarted (deleted) the `flink-session-cluster-main-` pod to see if that does anything
[19:07:40] Stuff's moving again, looks like it just needed the restart
[19:50:38] Interestingly the `(RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable` alert is still firing, but I see the codfw wcqs hosts working overtime to catch up on lag... gonna just keep watching for now
[19:52:34] ryankemper: thanks! ebernhardson: (or maybe inflatador: ?) is the RDF streaming updater already running with the new Flink k8s operator by now?
[19:58:03] pfischer: hmm, I don't know. I suspect not
[20:07:47] the runner in codfw does still seem a bit unstable, but I can't say for sure. Since restarting the main bit ~70m ago, two of the task managers have been restarted 3 times.
[20:07:50] but maybe that's normal, I dunno :P
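For reference, the checks and the restart described above correspond roughly to the kubectl commands below. This is a sketch only: the `rdf-streaming-updater` namespace and the `flink-session-cluster-main` pod name prefix are assumptions taken from the conversation, not verified values.

```bash
# Rough sketch of the checks discussed above; namespace and pod name
# prefix are assumptions from the chat, not verified values.
NS=rdf-streaming-updater   # assumed namespace

# Restart counts for the jobmanager/taskmanager pods
kubectl -n "$NS" get pods | grep flink

# Tail the jobmanager logs straight from the container; kubectl reads the
# live pod output, so it keeps working even when shipping to logstash breaks
JM=$(kubectl -n "$NS" get pods -o name | grep flink-session-cluster-main | head -n1)
kubectl -n "$NS" logs "$JM" --tail=200

# Force a restart by deleting the jobmanager pod; its controller recreates it
kubectl -n "$NS" delete "$JM"
```

Deleting the pod only helps because the controller recreates it; whether the job then resumes cleanly depends on the checkpoint/HA state discussed above.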
[20:54:34] ottomata: Is there a way to generate schema updates iteratively? I wanted to amend my change request, but now I run into an error: `Dereferenced current schema does not equal latest schema version 1.1.0`. How can I fix this?
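One possible way to resolve that last error, assuming the schema repository is built with @wikimedia/jsonschema-tools and that its `materialize-modified` subcommand is available: the check typically fails when `current.yaml` has been edited without bumping its version and regenerating the materialized, dereferenced version files. A hedged sketch, with a hypothetical schema path and version:

```bash
# Sketch only: assumes @wikimedia/jsonschema-tools is the materialization
# tool used by the schema repo. The schema path/version are hypothetical.

# 1. Bump the version in the schema's current.yaml, e.g.
#      $id: /hypothetical/schema/1.2.0
# 2. Regenerate the materialized version files for any modified current.yaml
#    so the dereferenced current schema matches the latest version again:
npx jsonschema-tools materialize-modified

# 3. Include both current.yaml and the newly generated files when amending
git add -A && git commit --amend --no-edit
```

If the repo wires this up differently (e.g. via an npm script or a git hook), running that instead should have the same effect.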