[08:51:14] o/ dcausse: Do we have a ticket for enabling the page_rerender topics? Besides the retention period, kafka also supports log compaction to reduce storage requirements. By configuring it to only keep the latest event (by key) we would see all page_rerenders at least once, but after some time kafka would only keep the latest record and discard the rest, so if you restore at a later point in time, you only see a subset of the original events.
[08:51:23] https://medium.com/swlh/introduction-to-topic-log-compaction-in-apache-kafka-3e4d4afd2262#4e8e
[08:53:16] pfischer: I was creating it, do you want me to mention this? I think Andrew would like to experiment with it at some point
[08:53:46] unsure if they'd agree to do such experimentation on kafka-main tho, nor if the kafka version we use has it
[08:54:53] I was reviewing the sheet with topic sizes and realized that we might have accounted for kafka replication twice...
[08:55:04] https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=0
[08:55:58] we multiply column I by 3 and reuse I to calculate J but also multiply by 3 again
[08:56:38] column I is the topic size (replication included)
[08:58:47] Hm, I’m not sure I follow. Did you already change it?
[08:59:09] yes, reverted it now
[08:59:51] 327Gb is the total size, replication included, for the page_rerender topic
[09:01:49] with 5 partitions we're more on 65Gb per broker
[09:02:05] not 196
[09:03:53] Why do we have two lines (22 and 27) for replication? There is only one replication factor and the replicas get distributed among the brokers, right?
[09:04:25] yes, not sure why we added this twice, probably a mistake
[09:05:16] Okay, so in the first place 300 rerenders get split in 5 partitions and those are replicated, therefore the share of each broker is less than what I calculated originally?
[09:07:31] yes, either we stop applying the replication factor to I or stop applying it to J
[09:08:00] Sure, now I got it. Thanks!
[09:09:29] BTW: log compaction is documented for 1.1.x: https://kafka.apache.org/11/documentation.html#compaction
[09:10:52] good to know
[09:15:44] can it clean up based on retention as well or will it keep keys forever? (The use case Andrew has in mind was to keep all keys)
[09:18:50] Retention and compaction are independent of each other.
[09:20:08] ok
[09:23:49] Hm, maybe not, looking into that: https://developer.confluent.io/courses/architecture/compaction/
[09:28:50] seems like you to be explicit by sending a null value message
[09:28:56] *have
[09:29:21] Yes, we could automate that.
[09:31:55] ok I'll add a note as something to possibly explore if size is a concern
[09:36:43] Thanks! I thought of it in the context of a restore: when we have to catch up on 4 days worth of events, then a compacted topic could be processed faster. Especially for search, where we are primarily interested in the latest version of a document.
[09:38:35] dcausse / inflatador: is T326409 completed? I see there is a follow-up task (T350784), but stand-up notes are saying that the migration itself is completed. I'm not sure how to validate that myself, could one of you check?
[09:38:36] T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409
[09:38:36] T350784: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784
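For reference, the arithmetic behind the topic-size correction above, using only the figures given in the conversation (327 GB total for page_rerender with replication already included, 5 partitions, replication factor 3):

    327 GB / 5 partitions ≈ 65 GB per broker holding a partition replica
    65 GB × 3 ≈ 196 GB is what you get by applying the replication factor a second time,
    double-counting the replication that column I already includes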
[09:40:48] pfischer: oh indeed, hadn't thought about it that way but you're right, in the backfill use-case this would be useful as well
[09:41:50] gehel: it's not complete
[09:42:35] Oh right, it's deployed in staging only!
[09:43:42] inflatador: the wdqs updater in staging died, might perhaps be because of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972725, we can fine-tune staging rather than relying on increasing quotas
[09:44:44] weekly update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-11-17
[11:08:30] lunch
[11:48:45] lunch
[12:56:07] hello, may one of you review my change to the parent pom to bump maven-javadoc-plugin to 3.3.0? ;) https://gerrit.wikimedia.org/r/c/wikimedia/discovery/discovery-parent-pom/+/975003
[12:56:32] bonus if a release can get cut, not necessarily today
[13:15:48] hashar: 👀
[13:15:53] :-]
[13:16:06] I don't know much about pom/maven so it is entirely possible I screwed up something
[13:16:27] notably I don't know whether maven plugins should all be aligned to the same minor version (eg 3.3.XXX)
[13:20:20] I’m not aware of such a convention. I ran a build locally against the latest snapshot. Gehel
[13:20:38] gehel: Do we have any semantics in the version of the parent pom?
[13:22:09] I don't see any reason why the javadoc plugin should be aligned with anything
[13:22:31] hashar: why upgrade to 3.3.0 and not 3.6.2 (which seems to be the latest)?
[13:22:54] I meant: does major.minor.increment of the parent have any semantics? When do we bump which?
[13:23:09] cause I am conservative and upgrade to the next version that has the fix I am looking for (detect the `javadoc` binary under java 9+)
[13:23:12] Otherwise I’d just release the next increment.
[13:24:07] I could not even find the changelog for 3.2.0 > 3.3.0 (maybe that is in Jira) to gauge what might potentially break
[13:24:10] we don't really have any semantics on the version of the parent pom. It would get complicated to define what is or isn't a breaking change.
[13:25:04] hashar: I usually bump maven plugins to latest, unless we have a good reason not to (and document that reason with an inline comment)
[13:26:19] but then that can introduce a bunch of other breaking changes which would need more adjustments
[13:27:22] I’ll bump it, so it’s on me. ;-)
[13:28:16] Breaking changes are somewhat rare, and child projects don't have to upgrade right now, they can continue to use the previous version until they are ready to move forward.
[13:28:55] yeah
[13:29:18] at least that one goes in the direction of upgrading to Java 11 and moving the `site:site` CI jobs to java 11
[13:29:50] Every now and then (and it's been a long while), I go through all plugins, upgrade to latest, and wait for things to break. The Maven versions plugin is helpful in those cases: https://www.mojohaus.org/versions/versions-maven-plugin/display-plugin-updates-mojo.html
[13:31:49] speaking of updates, I have an outdated old repo that is using org.sonatype.oss.oss-parent 7 as a parent pom and I'd like to switch to our parent-pom
[13:32:11] is there a way to review the difference?
[13:32:37] or I guess I am looking for a way to resolve the parent pom
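Going back to the compaction thread from this morning: compaction on its own keeps the latest record per key forever, so expiring a key means explicitly producing a tombstone (a record with a null value). A minimal sketch with the plain Kafka producer API; the bootstrap servers, topic and key below are placeholders, not the real cluster or stream names, and the topic is assumed to have cleanup.policy=compact.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TombstoneSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder broker address; real values would come from service config.
            props.put("bootstrap.servers", "kafka-placeholder:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A null value is a tombstone: after compaction (and the configured
                // delete.retention.ms) the key disappears from the topic entirely.
                producer.send(new ProducerRecord<>("page_rerender_compacted", "enwiki:12345", null)).get();
            }
        }
    }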
[13:33:20] gehel: https://phabricator.wikimedia.org/P53544
[13:33:51] hashar: you can change the parent and look at the effective pom afterwards
[13:34:38] pfischer: looks like it is time to bump a number of those dependencies! Good job for a Friday!
[13:34:50] pfischer: you want to do it? Or should I?
[13:35:55] may you first cut a release for just the javadoc update and update the other plugins in another release? ;)
[13:35:57] hashar: moving to our parent pom from the sonatype parent will change a LOT of things. We take care not only of the publication to central (which is what the sonatype parent pom is about), but also of a lot of static analysis.
[13:36:48] It's likely to break the build until you fix all reported errors. But it is unlikely to break anything if the build passes.
[13:36:59] pfischer: you do the parent pom release?
[13:51:43] gehel: sorry, I can bump the versions and release
[13:51:50] pfischer: thanks!
[13:59:20] hashar: What is the minimum version of maven we can rely on to be present on CI servers?
[14:01:45] dcausse $#$#%$#! I kept messing that one up. I'll fix it
[14:02:50] .o/
[14:04:55] pfischer: the Docker images are bootstrapped with Maven 3.5.2 (regardless of the java flavor 8, 11, 17)
[14:05:25] pfischer: and the entrypoint invokes mvnw, which is expected to be at the root of all of our repositories and lets developers pick whatever maven version they like
[14:05:31] we should be using maven wrapper in most cases
[14:05:45] so we should not even need to have maven installed in the docker image
[14:06:02] and semi recent versions of maven will delegate to maven wrapper if found
[14:06:36] o/
[14:07:13] inflatador: lol, no worries :)
[14:39:16] 91% done with the data reload going by number of TTL files (1009/1104)
[14:43:36] crossing fingers
[14:48:43] pfischer: would you do a new release for the parent-pom? ;)
[14:53:37] dcausse https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/975289 is up for quota restore if you are OK with re-enabling
[14:59:05] inflatador: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229/53/helmfile.d/services/rdf-streaming-updater/values-staging.yaml#15 would be a different approach not requiring a quota bump, if you're OK testing it
[14:59:25] I'm fine to increase quota as well but feels like we'll have to revert them at some point so...
[15:01:35] dcausse FWIW I'd prefer to have resources and related settings as close as possible between staging and prod...but will defer to you. I'm used to throwing hardware at the problem
[15:02:22] anyway, in the interest of moving fwd I'll get a patch started with your suggestion
[15:03:01] inflatador: if serviceops are fine with the quota increase staying there I'm all for it
[15:04:54] dcausse all good for now, I will add your suggested values file to my patchset and we can work from there
[15:05:31] it's just a source of frustration for me...used to having basically unlimited hw resources ;P
[15:05:43] :)
[15:21:15] hashar: I’ll release, once https://gerrit.wikimedia.org/r/c/wikimedia/discovery/discovery-parent-pom/+/975297 has +2
[15:21:37] gehel: ^ 🙏
[15:21:49] can those be made two releases? :)
[15:22:34] hashar: if it’s that urgent, it won’t do harm.
[15:23:00] not urgent
[15:23:27] and maybe some of those upgrades also fix a few things for java 11
[15:24:32] pfischer: I've +1'ed the plugin upgrades, feel free to +2 once you've released the parent
[15:24:42] s/parent/parent commit/
[15:30:56] ebernhardson see discussion in security channel when you get in...looks like they're going to detune that high RX alert, so we should be able to use LVS after all
[15:31:02] hashar: intermediate release is on its way
[15:31:46] awesome, thank you pfischer! :)
[15:34:35] hashar: done
[15:36:45] one last thing, how can one see the effective pom to review a diff? :)
[15:38:15] `mvn help:effective-pom`
[15:38:17] :)
[15:39:38] pfischer: looks like you need to publish the parent-pom 1.69 to Maven Central
[15:39:47] it complains about not being able to retrieve https://repo.maven.apache.org/maven2/org/wikimedia/discovery/discovery-parent-pom/1.69/discovery-parent-pom-1.69.pom
[15:39:54] but maybe there is some cache in effect
[15:40:25] dcausse how did you see that the updater was broken? Did you fix it already? It looks OK to me
[15:40:57] I went to grafana and did not see it
[15:41:29] perhaps I overlooked, checking
[15:43:10] inflatador: ah my bad, I think I selected the wrong prometheus datasource with codfw
[15:44:22] hashar: I released/deployed to central. Maybe it takes some time to invalidate their caches?
[15:44:28] dcausse np, glad you caught that quota regression
[15:44:41] I think I then saw your revert and jumped rapidly to concluding that the removal of the quota increase did fail the updater, but apparently not
[15:45:03] might be that quotas are only checked when creating new pods?
[15:45:03] pfischer: yes possibly. I will retry on Monday. Thank you very much for the release!
[15:50:58] dcausse not sure, but I'm going to self-merge the CR just so we get back to where we were before
[15:51:14] ok
[16:01:42] \o
[16:02:34] o/
[16:15:43] hashar: publication to central can take some time. In my experience, usually less than 30 minutes
[16:16:20] Looks like it is now available
[16:17:36] what is the right process for updating the consumer offset? I'm rolling back the consumer to start when i took the itwiki/frwiki snapshots yesterday, but it seems like i need to issue two helmchart updates in short succession, first to add the parameter and start the consumer, and then a second to remove the parameter so any future restarts don't accidentally rewind?
[16:21:11] ebernhardson: offsets given in the job args should be ignored if the job has a checkpoint
[16:22:28] dcausse: hmm, so how do i move the consumer backwards?
[16:22:43] you have to drop the state
[16:23:03] oh :S I mean i guess our state isn't super important, but i wasn't expecting it to be so drastic
[16:24:34] flink tries hard to make your state coherent with kafka offsets
[16:26:29] one easy way is to do helm destroy (will tell flink to start fresh) and then helm deploy using --set some.config.value=offset
[16:27:24] oh, i didn't realize you could override values like that. I've already destroy'd and re-applied the helmchart, but will try and remember next time
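For context on the exchange above: Flink's Kafka source only applies a configured starting position when the job starts without restored state; once a checkpoint or savepoint exists, the offsets committed in that state win, which is why the state had to be dropped to rewind. A rough sketch of a timestamp-based start with the stock KafkaSource API (which may not be what the job actually uses); the brokers, topic and group id are placeholders:

    import java.time.Instant;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

    public class RewindSketch {
        public static void main(String[] args) {
            long startMillis = Instant.parse("2023-11-16T19:20:00Z").toEpochMilli();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka-placeholder:9092")   // placeholder
                    .setTopics("placeholder.page_rerender")          // placeholder
                    .setGroupId("placeholder-consumer")              // placeholder
                    // Only honored on a fresh start; a restored checkpoint overrides it.
                    .setStartingOffsets(OffsetsInitializer.timestamp(startMillis))
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();
            // The source would then be wired into the streaming job as usual.
        }
    }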
[16:29:11] although it didn't work, because while it says iso8601, apparently it doesn't accept 20231116T192000Z as a timestamp
[16:29:45] * ebernhardson will have to play with test cases for a moment
[16:50:57] sigh...so changing the timestamp to `2023-11-16T19:20:00Z` causes the helmfile yaml exporter to quote the string in the output, but we aren't writing yaml, we are actually writing java properties and the quotes get parsed as part of the string
[16:51:30] :(
[16:52:20] nothing is ever easy :P
[16:54:00] it's adding double quotes?
[16:54:17] yea
[16:54:34] results in: kafka-source-start-time: "2023-11-16T19:20:00Z"
[16:54:40] i could hax the config to handle that, but seems awkward
[16:55:28] we could load a yaml file actually?
[16:56:01] instead of assuming that yaml -> java properties is OK
[16:56:40] flink config apis might be able to load that directly
[16:56:43] hmm, i don't see why not. Might have to add something that enforces it to be a strict Map parse
[16:57:07] i guess we don't have to specifically restrict it, it just won't work if you do silliness
[16:57:29] will poke a bit
[16:57:34] sure
[17:11:18] Workout, back in ~40
[17:30:09] heh, the CVE section at the top here is amusing: https://bitbucket.org/snakeyaml/snakeyaml/src/master/
[17:50:14] LOL. Another instance of "people that understand don't need a sign. people that need a sign won't look at it anyway"
[18:16:12] pfischer: gehel: the publication to Maven Central eventually worked :)
[18:44:37] small CR for ldf endpoint alerts if anyone has time to look...otherwise I'll probably self-merge after lunch https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281
[18:50:15] lunch, back in ~40
[19:44:15] back, but going to my appointment a little early, back in ~90
[20:36:05] * ebernhardson has no clue what the right way is to depend on a library that is also transitively depended on (snakeyaml via jackson via wmf-event-utils) and ensures versions match...i just stuffed the same version into pom.xml and hope for the best :P
[21:49:32] back
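A small illustration of the quoting problem discussed around 16:50 above: helmfile renders the value as YAML, where the double quotes are syntax, but java.util.Properties has no notion of quoting and keeps them as part of the value; parsing the rendered file as YAML instead would return the unquoted string. A minimal sketch, assuming SnakeYAML is available (it is pulled in transitively via Jackson) and using an inline example string rather than the real config file:

    import java.io.StringReader;
    import java.util.Map;
    import java.util.Properties;
    import org.yaml.snakeyaml.Yaml;

    public class QuotingSketch {
        public static void main(String[] args) throws Exception {
            String rendered = "kafka-source-start-time: \"2023-11-16T19:20:00Z\"\n";

            // Interpreted as java properties, the quotes survive as literal characters.
            Properties props = new Properties();
            props.load(new StringReader(rendered));
            System.out.println(props.getProperty("kafka-source-start-time"));
            // -> "2023-11-16T19:20:00Z"   (including the quote characters)

            // Interpreted as YAML, the quotes are syntax and get stripped.
            Map<String, Object> yaml = new Yaml().load(rendered);
            System.out.println(yaml.get("kafka-source-start-time"));
            // -> 2023-11-16T19:20:00Z
        }
    }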