[07:01:03] pfischer: (if not done already) should we file a ticket to ask for topic creation?
[07:10:12] iiuc we'll need 3 topics per DC: $DC.cirrussearch.update_pipeline.fetch_error.rc0, $DC.cirrussearch.update_pipeline.update.rc0 and $DC.mediawiki.cirrussearch.page_rerender.v1
[07:24:38] dcausse: good point, I’ll do that. Should that be tagged with service ops? From the versioning scheme, I guess we keep page_rerender out of the development space (for schemas).
[07:25:13] yes, added the v1 suffix
[07:26:10] I thought that, the schema being simple enough, minor and compatible changes would be possible
[07:26:32] Sure, I’ll request deployment for this afternoon then: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230926T1300
[07:27:00] the page_rerender one requires a patch to CirrusSearch sadly
[07:27:10] so will have to wait for the train
[07:27:35] pfischer: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/961000
[07:30:51] but yours does not have to wait for the train I think
[07:33:29] also we might have to request a restart of eventgate-main after new streams are deployed to mw-config (cf. https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_Configuration_deployment)
[08:01:03] Yes, although I do not understand the limitation to streams “that will be produced through a destination_event_service” in the end. Is this relevant? To my understanding both SUP streams (update and fetch_error) are not produced from EventGate, therefore we do not need to specify destination_event_service.
[08:02:33] Or do we replicate them to HDFS and need canary events to make sure we don’t have empty buckets?
[08:02:47] pfischer: indeed, canary events would have been the exception but we don't activate them apparently
[08:04:09] hm.. thinking about canary events I think that only the fetch_error stream might be consumed from HDFS
[08:05:22] we might need the canary events for cases when there are no fetch errors within an hour
[08:05:44] so perhaps we should enable canary events for all our streams?
[08:07:49] and set destination_event_service to eventgate-main (for canary events) even though we're producing directly to Kafka?
[08:12:19] Sure, I’ll update my patch
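
A minimal sketch of the stream configuration discussed above, written as a Python dict purely for illustration: the real entries live in mediawiki-config, and the exact key names and schema titles here are assumptions rather than the deployed values. The idea is that destination_event_service points at eventgate-main only so canary events have a producer, even though the SUP itself writes directly to Kafka.

```python
# Illustrative shape only; the real declarations are made in mediawiki-config.
# Key names and schema titles below are assumptions, not the deployed values.
sup_streams = {
    "cirrussearch.update_pipeline.update.rc0": {
        "schema_title": "cirrussearch/update",          # assumed schema title
        "destination_event_service": "eventgate-main",  # for canary events only;
                                                        # the SUP produces straight to Kafka
        "canary_events_enabled": True,  # keeps hourly HDFS buckets from being empty
    },
    "cirrussearch.update_pipeline.fetch_error.rc0": {
        "schema_title": "cirrussearch/update_pipeline/fetch_error",  # assumed
        "destination_event_service": "eventgate-main",
        "canary_events_enabled": True,
    },
}
```
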
[09:22:47] pfischer: I'd like to make a release of the rdf repo, possibly with https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/960554 (thanks for the +1!), are there things you're not sure about that we could discuss before +2'ing this patch? (totally fine to wait for Erik's review if you prefer tho)
[09:32:43] errand + lunch
[09:59:09] Lunch
[10:37:38] dcausse: 👀
[10:38:29] dcausse: +2
[10:41:54] errands + lunch
[12:09:06] thanks for the +2!
[12:41:25] gehel: There is some low-hanging SUP-related ticket fruit on our work board: https://phabricator.wikimedia.org/project/board/1849/. Since we skipped triage: how can we pull in tickets mid-week? Should I just estimate them myself?
[12:41:48] yep, just estimate them yourself and pull them in!
[13:09:10] pfischer: deploy window started in #wikimedia-operations
[13:12:45] pfischer: amended your patch based on latest comments
[13:13:32] o/
[14:03:44] dcausse, inflatador: SUP meeting in https://meet.google.com/pup-xwxi-oqw
[14:39:40] inflatador: any objections if I try to deploy https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/961101 and https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/959218 to wikikube-staging (context is T326914)
[14:39:41] T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914
[14:40:20] dcausse no, feel free to deploy. I'm free for the next 20m if you need anything
[14:40:29] thanks!
[15:01:52] \o
[15:06:03] o/
[15:19:31] o/
[15:21:14] forgot something regarding moving away from FlinkKafkaConsumer with the WDQS Flink job, if anyone has a couple minutes: https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/961137
[15:22:28] dcausse: i don't know what it does, but it looks plausible :)
[15:22:47] ebernhardson: this is from this procedure: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/kafka/#upgrading-to-the-latest-connector-version
[15:23:00] thanks! :)
[15:37:01] dcausse: just found a nit
[15:37:05] sigh... seems like I forgot something else... KafkaSource defaults to earliest, not group offsets :(
[15:37:36] pfischer: thanks!
[15:39:16] Off, back in 60’
[15:48:46] workout, back in ~40
[15:55:56] and another one: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/961144, sorry about that, I should have paid more attention here and not assumed that the defaults would be the same...
[15:57:02] yup, seems sane. Curious they changed that default
[15:57:14] i suppose, i think pykafka does the same with defaulting to earliest
[15:57:25] perhaps to avoid failure when starting a fresh app?
[15:57:37] yea perhaps
[15:58:03] i guess pykafka only defaults to earliest if there are no committed offsets though, it's odd it would ignore them
[15:58:28] yes true
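
The WDQS updater is a JVM Flink job, but the offset gotcha is easy to demonstrate with a PyFlink sketch of the same KafkaSource builder API (the broker, topic, and group id below are placeholders): where FlinkKafkaConsumer resumed from the consumer group's committed offsets by default, KafkaSource defaults to earliest, so the old behaviour has to be requested explicitly.

```python
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetResetStrategy,
    KafkaOffsetsInitializer,
    KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka-main.example.org:9092")  # placeholder broker
    .set_topics("eqiad.rdf-streaming-updater.mutation")    # placeholder topic
    .set_group_id("wdqs_streaming_updater")                # placeholder group id
    .set_starting_offsets(
        # Resume from the group's committed offsets, falling back to earliest
        # for a brand-new group; this mirrors the old FlinkKafkaConsumer
        # default. Without this, KafkaSource silently starts from earliest.
        KafkaOffsetsInitializer.committed_offsets(KafkaOffsetResetStrategy.EARLIEST)
    )
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
```
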
[16:30:49] back
[16:44:23] if you still need reviews for the WDQS deploy stuff LMK
[16:45:20] inflatador: thanks, still waiting for CI so might not get to it today (will still have to make a new release after that)
[16:45:55] dcausse np, I know it's late for you
[16:51:27] yes, so the /srv/deployment/wdqs/wdqs state is not great at that point as it does not reflect what's running; if there's a need to re-deploy this Flink job (I highly doubt we'll need to, but...) the few patches merged to the deploy repo today will have to be reverted.
[17:01:07] updated partman recipe ready for review if anyone has time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/960114/
[17:01:31] Not sure if it'll work, but it shouldn't mess with anyone else's servers at least
[17:02:34] inflatador: seems plausible
[17:03:21] ebernhardson thanks! Will give it a shot shortly
[17:15:10] and one more patch for adding cloudelastic to site.pp: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961167
[17:16:06] inflatador: is your cloudelastic patch intending to change the cloudcontrol ones?
[17:17:35] ebernhardson no... not sure WTF happened there... will fix
[17:17:50] dinner
[17:23:22] hmm, getting `empty range in char class: /^cloudelastic10[07-12]\.wikimedia\./`... bad regex, or more likely I missed a lifecycle step. Checking...
[17:23:56] hmm
[17:24:16] oh, because 07-12 is not 07,08,09,10,11,12
[17:24:27] it's 0, 2 and 7-1 (which is invalid)
[17:24:41] inflatador: ^
[17:25:04] ebernhardson got it, looking at other host regexes
[17:25:51] i think we've often kept it simple: make two rules, one for cloudelastic100[7-9] and another for cloudelastic101[0-2]
[17:26:02] otherwise regex alternation can do it, but it looks funky
[17:36:12] I was pinging in dcops as well, looks like rob-h went ahead and merged my fixed patch
[17:44:52] oops, added too many hosts... there's only 07-10. New patch up to fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961173
[17:45:18] lol :)
[17:45:48] lunch, back in time for SRE pairing
[18:21:01] back
[20:28:51] break, back in ~15
[20:45:50] sigh.. Spark is just odd sometimes. It was reading the input as 3 partitions and OOM'ing, so I told it to limit partitions to 16MB when reading, and it increased to 12 partitions, great! Except 8 of those partitions are empty :P
[20:46:45] and it still has the same 2 large partitions as before... sigh.
[20:53:54] back
[20:54:10] sounds strangely similar to partman ;)
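
A small PySpark sketch of the knob being described (the input path is a placeholder): spark.sql.files.maxPartitionBytes caps how large each planned input split may be, but splits still follow file and Parquet row-group boundaries, which is one plausible reason for ending up with empty partitions next to the same two oversized ones. An explicit repartition() is the heavier fix, trading a full shuffle for evenly sized partitions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-sizing-sketch")
    # Ask Spark to plan input splits of at most ~16MB. This only changes how
    # splits are planned: a Parquet row group larger than the cap still lands
    # in a single partition, and some planned splits may hold no rows at all.
    .config("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024)
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///path/to/input")  # placeholder path
print(df.rdd.getNumPartitions())

# repartition() performs a full shuffle and spreads rows evenly across the
# requested number of partitions, at the cost of moving the whole dataset.
balanced = df.repartition(12)
```
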