[10:31:20] errand+lunch
[11:04:44] dcausse, dr0ptp4kt: I can't make it to the Graph Split log analysis meeting this afternoon. I'll read the notes and I'm sure you can move forward without me!
[11:32:40] 👍
[13:42:53] dcausse, inflatador: we have a space usage alert for the WDQS streaming updater that seems to have been open for quite a long time. Should we do something about it?
[13:43:26] See #wikimedia-operations : 2:38 PM <• jinxer-wm> (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[13:43:48] gehel: yes, we should definitely clean this up and start learning how we could automate this process
[13:44:32] Ok, I'll create a ticket
[13:45:12] thanks!
[13:57:59] ryankemper: per yesterday's discussion I filed T355888
[13:57:59] T355888: Enable cross federation between experimental WDQS endpoints - https://phabricator.wikimedia.org/T355888
[14:12:33] gehel Y I just saw that one yesterday... perfect example of an alert that should create a ticket. Anyway, I'll take a look
[14:14:35] o/
[14:38:34] inflatador: shall I go ahead and switch wdqs::internal to Puppet 7 next? or do you prefer to first go with a single canary for that role?
[15:05:27] moritzm Y, sorry I haven't been able to move that fwd. Feel free to migrate wdqs::internal to Puppet 7
[15:08:36] ok, I'll do that now, then
[15:21:12] inflatador: wdqs::internal is done, I'll also proceed with the wcqs role, ok? the canary host (wcqs2001) has been on Puppet 7 since Nov 22
[15:24:57] moritzm excellent, feel free to proceed
[15:25:12] ack, on it
[15:35:05] wcqs is now also done, I'll do wdqs::public on Monday; the switch maintenance is upcoming and I don't want to run the migration (which takes a bit for 24 hosts) while it's ongoing
[15:39:43] moritzm understood. Feel free to give me a heads-up if you run into issues. So far it's been pretty seamless (at least for us, you might have a different take ;) )
[15:41:03] yeah, I wouldn't expect any major issues for search-related systems at this point
[15:41:39] for the cirrus main cluster it would be good to switch a few more canaries before we migrate the entire role, let's catch up on Monday for that
[15:58:57] inflatador: If you can get The Wordhord from your library, then it's worth checking out (in both senses!)
[16:02:56] \o
[16:11:48] hmm, patch 992974... is it time to start planning how to capture patch 1000000? :)
[16:30:44] o/
[16:30:46] :)
[16:39:30] hmm, i think we're ready to turn on SUP in prod.. wondering if i'm forgetting anything
[16:43:59] sorry, haven't looked at gerrit today, looking at it now
[16:44:15] should be just https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/979147
[16:45:56] so it's the first time we'll write to the update topic in kafka main I think
[16:46:06] yes i believe so. Do we need to repartition it first?
[16:46:19] might already be partitioned, looking
[16:47:25] looks like we have update.rc0 and fetch_error.rc0 topics in kafka-main with 5 partitions each
[16:47:43] ok :)
[16:48:35] i guess i should point the fetch-error-topic at .rc0 then... fixing
[16:49:45] ? should it be referencing the stream name?
[16:50:02] oh, indeed it does a lookup to decide it's rc0
[16:50:38] we have rc0 specified in the update-topic for other things though. Perhaps that shouldn't be?
[16:51:29] yes... hm.. perhaps it's because the job cannot decide which topic to use, the codfw or the eqiad one, and we have to help it
[16:51:58] hm I remember we implemented some kind of topic filtering for this
[16:54:03] yes we have topic-prefix-filter: 'eqiad' but it might not be used for all the streams?
[16:55:05] hm but that's only for the producer...
[16:55:20] oh, i need to deploy the producer in both DCs
[16:55:42] ebernhardson: I think you're right, values-consumer-cloudelastic-eqiad.yaml should be updated with fetch-error-topic: eqiad.cirrussearch.update_pipeline.fetch_error.rc0
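A minimal sketch of the values change discussed just above, assuming the consumer chart nests its application settings under a `config:` block (that nesting is an assumption); only the fetch-error-topic key and value come from the conversation itself.

```yaml
# Hypothetical excerpt from values-consumer-cloudelastic-eqiad.yaml; the
# surrounding "config:" nesting is assumed, the key and value below are
# the ones suggested in the discussion above.
config:
  fetch-error-topic: eqiad.cirrussearch.update_pipeline.fetch_error.rc0
```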
[16:58:48] ok, going to merge and start the producers first
[17:11:58] hmm. ran `kube_env cirrus-streaming-updater eqiad` and then `helmfile -e eqiad --selector name=producer -i apply`. `kubectl get pods` says nothing is running, but helmfile diff doesn't say there is anything remaining to deploy
[17:13:11] the operator has to pick it up...
[17:13:15] might take time?
[17:13:44] it's been a couple minutes :S wonder if we have to whitelist the namespace somehow
[17:14:21] oh indeed hm..
[17:15:03] in e.g. helmfile.d/admin_ng/values/eqiad/flink-operator-values.yaml
[17:15:24] oh! somehow i didn't notice there was a separate values directory and was wondering where it came from :)
[17:17:14] inflatador: if you have some time: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/992983
[17:19:36] ebernhardson :eyes
[17:20:33] ebernhardson +1'd, I can help deploy to admin-ng sometime this afternoon if you like
[17:20:59] inflatador: ok
[17:21:23] I'm heading to lunch but will reach out when I get back
[17:58:34] err, interesting... the "accepted" way to throw an exception from a session provider is to register a hook to either ApiBeforeMain or BeforeInitialize and throw from the closure you register
[18:02:19] * ebernhardson is tempted to try and move the exception-throwing code from OAuth into core, so it can at least be re-used instead of copied, but suspects that would mean dealing with additional concerns that aren't (currently) relevant
[18:13:07] back
[18:19:17] https://gerrit.wikimedia.org/r/c/operations/dns/+/993014 small CR for the cloudelastic migration if anyone has time to look
[18:48:15] inflatador: would appreciate deploying the admin-ng patch when you have a chance
[18:49:07] ebernhardson ACK, will ping you in 10-15 if that works
[18:50:29] sure, thanks
[19:19:31] ebernhardson I'm in https://meet.google.com/eki-rafx-cxi if you wanna join. Starting the process now
[20:46:15] * ebernhardson should have started on the producer yesterday, it's giving a fun `java.lang.OutOfMemoryError: Direct buffer memory` from the kafka producer.
[20:46:55] curiously, we configure `kafka-source-config.security.protocol: SSL` (and no kafka-sink-config, so it's copied) but the stacktrace includes org.apache.kafka.common.network.PlaintextTransportLayer.read
[20:49:37] huh, actually something doesn't work there. In the logs for taskmanager-1-1 it dumps the ProducerConfig before connecting, and the producer has `security.protocol = PLAINTEXT`. hmm
[21:00:14] setting that explicitly made things work; double-checking the app config, we only copy the bootstrap servers, not all kafka properties
[21:05:29] ERITF
[21:05:50] or should I say, "weird"
[21:05:53] :)
[21:06:24] i guess i had a wrong understanding of the way we handle kafka config. I was thinking all kafka properties are copied from the source to the sink if we don't define the sink, but it only copies the bootstrap list
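A minimal sketch of the explicit sink override that resolved the PLAINTEXT fallback above, assuming the producer values use the same dotted kafka-*-config keys quoted in the log and sit under a `config:` block (the nesting is an assumption).

```yaml
# Hypothetical producer values excerpt; the "config:" nesting is assumed,
# the dotted keys mirror the ones quoted in the log above.
config:
  kafka-source-config.security.protocol: SSL
  # Only the bootstrap servers are copied from source to sink, so without
  # an explicit setting the sink falls back to PLAINTEXT.
  kafka-sink-config.security.protocol: SSL
```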
[21:07:12] also noticed i didn't manage to properly limit the wikiids (only applied to eqiad, which is idle), so good thing i only turned the producer on :) fixing...
[21:22:02] * ebernhardson now twiddles thumbs for 5 minutes waiting for the aggregation window
[21:34:10] sigh... 5 minutes passed and no events coming out :S
[21:34:59] and the answer is something i should have known... the configuration parser uses ; to separate lists, not ,
[21:36:17] inflatador: patch to remove old cloudelastic masters => https://gerrit.wikimedia.org/r/c/operations/puppet/+/993038
[21:38:05] ryankemper ACK, +1'd
[21:45:06] running queries serially without a pause against wdqs1022 and wdqs1024 from stat1006. doing a simple 5.5 minute test per node right now. i'll see what rps looks like, then i want to ratchet it up a bit.
[21:52:50] inflatador: alright, gearing up to start working on the old masters
[21:53:00] inflatador: I can update you on IRC or we can hop on a meet
[21:53:35] ryankemper meet is good, I'm up at https://meet.google.com/fde-tbpf-wqh
[22:23:31] alrighty, consumer-cloudelastic is up and running, writing to testwiki and mediawikiwiki. Expecting it to basically do nothing, since the old updater is also running and has lower end-to-end latency
[22:57:30] (off-topic) I'm going to see Lotus (jam band) in LA for the first time tonight. Here's a song of theirs I like (also good for work since it has minimal vocals): https://www.youtube.com/watch?v=iMS8k7pkmlM
[23:20:02] it's smooooth
[23:20:40] at first i was expecting Flying Lotus, but no, this is Lotus. cool! have fun!
[23:53:30] Flying Lotus has great stuff too, for sure!
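On the wiki-id limiting and the ';' list separator noted earlier ([21:07:12] and [21:34:59]), a hypothetical illustration: the `wikiids` key name and the `config:` nesting are made up for the example, only the semicolon-vs-comma behaviour comes from the log.

```yaml
# Hypothetical values excerpt; the "wikiids" key and the "config:" nesting
# are illustrative only.
config:
  # The configuration parser splits list values on ';', not ',', so a
  # comma-separated value would be read as one single item.
  wikiids: testwiki;mediawikiwiki
```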