[09:02:39] dcausse: regarding https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1033386/comments/3c1a2c91_8bfd4cde - did that IllegalArgumentException get thrown when you initiated a deployment?
[09:12:14] pfischer: yes, but I was able to reproduce it locally, working on a small fix; the reason is the getOrElse at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1033386/25/streaming-updater-producer/src/main/scala/org/wikidata/query/rdf/updater/UpdaterPipeline.scala line 104
[09:12:52] adding some config validation code
[10:00:17] lunch
[12:29:38] pfischer: sorry it took longer than expected (got some issues when adapting the integration tests): https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1033386/25..26
[12:32:52] will test this in yarn now
[12:40:58] o/
[12:49:54] Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka wdqs_streaming_updater_test_T361935_full-0@-1 with FlinkKafkaInternalProducer{transactionalId='wdqs_streaming_updater_test_T361935_full:0-0-1', inTransaction=true, closed=false}
[12:50:10] I think we need to alter the transaction per subgraph output
[12:56:05] Okay
[12:59:12] Why do we need Kafka transactions after all?
[13:00:41] pfischer: because we use transactional Kafka producers (exactly_once semantics)
[13:05:31] Ah, got it, so right now we still share the transaction ID among *all* sinks
[13:06:02] Shall we just append the name/UUID suffix?
[13:09:31] I’ll patch it.
[13:12:55] pfischer: oops, sorry, just patched it; actually removed this var from UpdaterPipelineOutputStreamConfig and compute it on the fly based on topic:partition
[13:13:12] seems to run now
[13:13:19] Just saw it. Thank you!
[13:13:30] checking the output
[13:14:35] pfischer: you have the flink ui at ssh -L 8888:an-worker1107.eqiad.wmnet:38739 stat1009.eqiad.wmnet
[13:18:06] \o
[13:18:36] o/
[13:19:20] o/
[13:19:41] dcausse: that still shows me the Exception mentioned above. Is this the latest deployment?
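[Editor's note: the fix discussed at [13:12:55] — deriving the Kafka transactional-id on the fly per sink instead of sharing one ID across all subgraph outputs — can be sketched roughly as below. This is a minimal illustration, not the actual UpdaterPipeline code; the class and method names are made up, only the topic:partition idea comes from the log.]

```java
public class TransactionalIds {
    // Hypothetical sketch: with Flink's exactly-once Kafka sinks, each sink
    // needs its own transactional-id namespace, otherwise two producers
    // sharing one id fence each other and the job fails (as in the
    // FlinkRuntimeException above). Deriving the prefix from topic:partition
    // guarantees uniqueness across subgraph outputs.
    public static String prefixFor(String jobName, String topic, int partition) {
        return jobName + ":" + topic + ":" + partition;
    }

    public static void main(String[] args) {
        String full = prefixFor("wdqs_streaming_updater", "wdqs_streaming_updater_test_T361935_full", 0);
        String scholarly = prefixFor("wdqs_streaming_updater", "wdqs_streaming_updater_test_T361935_scholarly", 0);
        // Distinct topics must yield distinct transactional-id prefixes.
        if (full.equals(scholarly)) {
            throw new AssertionError("transactional-id prefixes collide");
        }
        System.out.println(full);
    }
}
```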
[13:20:41] pfischer: yes, I see it running, 2024-07-08 15:10:20, Duration 10m 9s
[13:22:56] scholarly subgraph seems to be populated: kafkacat -b kafka-jumbo1009.eqiad.wmnet:9092 -t wdqs_streaming_updater_test_T361935_scholarly -o end
[13:24:03] dcausse: Ah, now I see it, too. That’s good news!
[13:27:03] but it did not properly backfill :/
[13:27:07] dcausse: Some cast fails
[13:27:35] ah indeed
[13:28:36] ah, might be because we skipped the chunk operation for deletes in the past?
[13:36:29] Hm, it’s trying to cast DeleteOp -> EntityPatchOp since the operation is a Diff. But we also issue deletes (of stubs) resulting from a Diff operation. So we have to consider that scenario.
[13:37:11] dropping off my son at camp, back in ~30
[13:42:37] dcausse: both diff and reconcile now have to be aware of deletes, I’ll create a patch
[14:01:09] dcausse: I undid your patchset by accident
[14:01:39] no worries, gerrit should still have it I think
[14:24:39] * ebernhardson has apparently forgotten how to remove cindy votes...
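[Editor's note: the cast failure at [13:36:29] is the classic shape of this bug — code assumes every Diff-originated operation is an EntityPatchOp, but a Diff can also emit a DeleteOp for stubs. A minimal, hypothetical model of the fix (branch on the concrete type instead of casting) is sketched below; the type names mirror the log, everything else is illustrative, not the repository's actual class hierarchy.]

```java
interface Op {}
final class EntityPatchOp implements Op {}
final class DeleteOp implements Op {}

public class OpRouting {
    // Instead of a blind (EntityPatchOp) cast, dispatch on the runtime type,
    // so stub deletes produced by a Diff are routed to delete handling
    // rather than throwing a ClassCastException.
    static String route(Op op) {
        if (op instanceof DeleteOp) {
            return "delete";
        } else if (op instanceof EntityPatchOp) {
            return "patch";
        }
        throw new IllegalArgumentException("unknown op: " + op);
    }

    public static void main(String[] args) {
        System.out.println(route(new DeleteOp()));
        System.out.println(route(new EntityPatchOp()));
    }
}
```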
[14:31:28] :)
[14:38:38] pfischer: running the split updater again with your patch
[14:44:58] it's properly backfilling now; I had input_idleness set to 2s (which is way too low and suddenly considered all historical events as late)
[14:46:13] unsurprisingly most updates are on the main graph
[14:54:06] pfischer: I'd suggest letting the pipeline run for a day or two; the last thing I believe we might want to fix is org.wikidata.query.rdf.tool.rdf.RdfRepositoryUpdater#applyPatch so that it does not fail if it accumulates only a set of stub patches
[14:54:41] might be slightly late to mtg
[14:54:49] it currently disallows "empty" patches
[15:01:38] dcausse: triage meeting: https://meet.google.com/eki-rafx-cxi
[16:12:18] dinner
[16:27:41] ryankemper if you're around today, we might want to look over T369521 at pairing
[16:27:42] T369521: Clean up sre.wdqs.reboot cookbook - https://phabricator.wikimedia.org/T369521
[16:45:28] * ebernhardson isn't really sure why cindy keeps failing, they don't all fail the same :(
[17:11:58] dinner
[17:13:37] maybe try updating nodejs again, it complains on install that a variety of packages declare compatibility for >= 18
[17:14:08] * ebernhardson is just trying to get through code review, but cindy isn't helping :P
[18:03:26] rebooted search-loader1002 and 2002 for security updates (not at the same time), LMK if you notice any issues
[18:04:13] ebernhardson: I've seen your "maybe" to the WE3.1 meeting tomorrow. Don't feel like you have to wake up at 6am! We'll keep you posted
[18:07:49] lunch, back in ~40
[18:08:40] gehel: I might be up, depends. I started about 6:20 this morning since I was up and had nothing better to do
[18:09:26] but I certainly won't be setting any alarms
[20:03:22] dr0ptp4kt ryankemper will it mess anything up if I reboot the graph split hosts for security updates?
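[Editor's note: the applyPatch change proposed at [14:54:06] — don't reject a patch batch just because it adds nothing, since a batch of stub deletes is still real work — amounts to a guard like the hypothetical sketch below. The predicate name and list-based signature are made up for illustration; only the "reject truly empty, accept delete-only" behavior comes from the log.]

```java
import java.util.List;

public class PatchGuard {
    // A patch batch should only be rejected when it carries no work at all.
    // One that consists solely of deletes (e.g. accumulated stub deletes)
    // must still be applied, which is what the current applyPatch disallows.
    static boolean shouldApply(List<String> addedTriples, List<String> deletedTriples) {
        return !(addedTriples.isEmpty() && deletedTriples.isEmpty());
    }

    public static void main(String[] args) {
        // Delete-only batch: applicable.
        System.out.println(shouldApply(List.of(), List.of("<s> <p> <o> .")));
        // Genuinely empty batch: rejected.
        System.out.println(shouldApply(List.of(), List.of()));
    }
}
```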
(ref T366555 )
[20:04:27] inflatador: dr0ptp4kt: those are safe to reboot, no active reloads and they’re not serving non-experimental traffic
[20:09:10] ryankemper sounds good, will start reboots once I'm done w/ current batch (internal in wdqs)
[20:33:08] ryankemper OK, rebooting the test servers now
[20:44:31] ah, looks like the current wdqs reboot cookbook expects streaming-updater... which is fine, will reboot with the reboot-host cb instead
[22:31:05] ryankemper forgot to mention: I'm rebooting wdqs-internal eqiad in a tmux on cumin2002. If you want to get prod wdqs, that's almost all of our hosts besides apifeatureusage (I think)
[22:31:35] also, wdqs2012 is alerting for ping, but that's not one I've tried to reboot
[22:33:11] also, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052819 is ready for review. I gave envoy its own port, so it should be able to coexist with nginx