[05:58:49] Sitting out tomorrow's retro
[08:13:45] o/ dcausse: have you seen Gabriele's suggestion on keeping schemas under development/ until they are hardened? I think that's a good idea, as the schema still changes fundamentally (thinking of getting rid of change_type.REDIRECT_UPDATE). I rushed ahead with fetch_error https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/854572 and update https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/951829
[08:14:32] If you're fine with that, I'd ping Gabriele to finally +2 this.
[08:18:05] I'm not sure about page_rerender though. Has the producing extension already been deployed? I can't find that stream: https://stream-beta.wmflabs.org/v2/ui/#/
[08:36:14] pfischer: +1 for the development approach!
[08:37:20] for the page_rerender stream we have to ship a mw config patch to add it
[08:38:30] for this stream, to avoid doubling the rate of kafka messages, we might also want to enable it on a per-wiki basis
[08:42:22] re getting rid of REDIRECT_UPDATE: I'm a bit confused, can you elaborate?
[09:59:15] Sure, during Wednesday's meeting we discussed merging redirect updates with the redirects that come inside a cirrus doc. As far as I understood, we decided to drop redirect updates if they are related to a revision-based update in the same time window.
[10:00:24] Now that I write it: we might still see redirect updates that come alone and cannot be merged.
[10:01:53] pfischer: yes, exactly, I was confused about this
[10:01:56] lunch
[13:19:29] o/
[13:33:45] !issync
[13:56:00] sigh... I'm hit by https://issues.apache.org/jira/browse/FLINK-28758 while trying to re-deploy the rdf-streaming-updater job :(
[14:14:36] wow, seems like a long time for such a big bug to be open
[14:15:39] moving T326914 to current work to get rid of FlinkKafkaConsumer
[14:15:40] T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914
[14:16:15] sadly this might affect our testing if stop-with-savepoint does not work
[14:29:05] the fix is being reviewed now https://github.com/apache/flink-connector-kafka/pull/48 but not sure how long we'll have to wait for a flink 1.16.3 release
[14:31:28] Interesting... looks like the fix should be coming soon, but in the meantime if I can help with T326914, LMK
[14:31:29] T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914
[14:32:50] inflatador: thanks, it's mostly code to write at this point, will ping you once we have to think about the migration procedure
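To illustrate the T326914 migration discussed above: a minimal sketch of the builder-style KafkaSource API that replaces the deprecated FlinkKafkaConsumer. It is written against PyFlink for readability (the real rdf-streaming-updater is a JVM job), and the broker, topic, and group names are hypothetical placeholders, not the production configuration.

```python
# Hedged sketch of the FlinkKafkaConsumer -> KafkaSource migration (T326914).
# Assumes PyFlink 1.16; broker/topic/group names are made-up placeholders.
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()

# KafkaSource is the supported replacement for the deprecated
# FlinkKafkaConsumer affected by FLINK-28758.
source = (
    KafkaSource.builder()
    .set_bootstrap_servers("localhost:9092")
    .set_topics("example.mutations")
    .set_group_id("example-updater")
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
stream.print()
env.execute("kafka-source-sketch")
```

The producer side of the migration is analogous: a builder-style KafkaSink replaces FlinkKafkaProducer.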
[14:32:50] dcausse: I vaguely remember we had an issue with the image recommendation pipeline. Do you remember what it was and if we fixed it?
[14:33:30] gehel: should be fixed by now, lemme find the ticket
[14:35:21] the top-level issue was T345188 and the cause of the problem is described here: T345141#9140582
[14:35:22] T345188: Add Image: all wikis ran out of image recommendations - https://phabricator.wikimedia.org/T345188
[14:35:22] T345141: No ALIS for 2023-08-14 snapshot - https://phabricator.wikimedia.org/T345141
[14:36:23] tl;dr: the cause is not related to anything the search team owns, but they asked us to do a full reload to fix the production dataset
[14:36:24] if I remember correctly, part of the underlying issue is that our interface is a hive table, which is probably not the most descriptive way to do it.
[14:36:51] We might want to either improve the monitoring (on the Growth side), or define a better / higher-level interface?
[14:37:28] dcausse: ok, thanks! That's probably all I need at the moment.
[14:38:27] the fact that it's a hive table plus airflow jobs makes it hard to do fixups; they often require manual actions which are not easy to track
[14:39:29] probably nothing we'll do right now, but it might make sense to dig a bit into how we want other teams to interact with Search
[14:39:42] or how we could make their lives simpler
[14:40:17] unrelated: please leave your comments on our decision record: https://docs.google.com/document/d/1YFdImvbe2LXBYrzFJpJnMSyWt-nGVGX__OOOmQmBb5g/edit#heading=h.9be6v3q5a7sr (in particular pfischer)
[15:01:22] \o
[15:01:34] o/
[15:01:37] looks like a netsplit
[15:28:53] ebernhardson I'm working on replacing the search-loader hosts. Will it be a problem if we have 2 search-loaders running at the same time in the same DC?
[15:29:30] (They're in insetup so they shouldn't get their roles applied yet)
[15:36:24] * inflatador reads https://wikitech.wikimedia.org/wiki/Search/MLR_Pipeline again
[15:36:54] inflatador: no problem, it will naturally do the right thing
[15:37:11] ebernhardson ACK, thanks
[15:37:52] the short answer is that it's a kafka consumer application, and kafka consumers have a protocol for figuring out how N consumers map to M topic partitions; we don't need to worry too much about the specifics, it will just work it out
[15:38:29] works for me
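A minimal sketch of the consumer-group behavior described above: Kafka consumers that share a group_id automatically split a topic's partitions between them and rebalance when an instance joins or leaves, which is why two search-loaders can safely run side by side. This uses the kafka-python client; the topic and group names are hypothetical placeholders, not the real search-loader configuration.

```python
# Consumers sharing a group_id divide the topic's partitions among
# themselves; the broker triggers a rebalance whenever a member joins or
# leaves the group. All names below are hypothetical placeholders.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "mjolnir.msearch-requests",            # placeholder topic
    group_id="search-loader",              # same group on both hosts
    bootstrap_servers=["localhost:9092"],
)

# With one instance running, this consumer owns every partition; start a
# second process with the same group_id and Kafka reassigns partitions so
# each message is delivered to only one consumer in the group.
for message in consumer:
    print(message.topic, message.partition, message.offset)
```

Because each partition is owned by exactly one group member at a time, running the old and new search-loader hosts concurrently just triggers a rebalance rather than duplicate processing.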
[15:38:50] inflatador, ryankemper I'll skip our pairing session, conflicting meeting
[15:39:08] gehel ACK, np
[15:44:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/957762 patch is up to move the search-loaders into the production role
[15:56:50] should we bet on whether puppet works on the first try? Based on past experience i wouldn't bet much :P
[16:07:07] probably not
[16:07:20] although the zookeeper role did, it was a nice surprise
[16:07:25] anyway, W/O, back in ~40
[16:21:57] "funny" https://phabricator.wikimedia.org/T338189#9167424 :)
[16:30:36] back
[16:46:30] dinner
[16:47:32] search-loader says: `Package 'python3.7' has no installation candidate`... as expected
[17:16:30] lunch, back in ~40
[17:25:15] i wonder what version of python we use for mjolnir in yarn... sec, checking
[17:26:08] oh, i was lazy, it's also using 3.7.10
[17:26:38] inflatador: i suppose i could at least run the test suite with the new version, what version of python is available?
[17:47:36] back
[17:48:01] ebernhardson bullseye comes with 3.9. https://phabricator.wikimedia.org/T346373
[17:48:22] I forked the mjolnir repo and I've been tweaking a few things, was gonna try to run tox
[17:51:21] bookworm ships with 3.11.2, not sure how hard it would be to jump that far though
[17:51:35] inflatador: in theory, change the conda-environment.yaml to define a different version of python and let it go. But there might be more dependency changes
[17:54:21] Y, I'll check it out. I can make venvs with the best of 'em ;P
[17:55:18] because python completely lost its "there should be one, and preferably only one, obvious way to do it" principle as it grew... these aren't venvs but conda envs :P
[17:56:23] ah yes, I've never used conda, but let's just say package/dep management is not a strength of python's ;P
[17:57:06] https://www.anaconda.com/download#downloads is this what I'd need for Conda?
[17:57:58] yea that's the thing, i use a docker image from releng
[18:01:06] actually i guess i use a custom image
[18:03:24] wow, 4.6 GB... this guy's beefy
[18:04:57] i think that's mostly spark, but 4.6G sounds pretty hefty
[18:06:21] yea, looking at the exported artifact it's only 450M with all dependencies. That conda download must include a bunch of other stuff
[18:07:48] it has a silly GUI portal app
[18:10:46] another fun thing with conda is that sometimes its dependency resolution gets stuck and runs for days: https://github.com/conda/conda/issues/11919
[18:11:39] they claim to be resolving it with a replacement resolver, with it becoming the default starting this month. Maybe that will help; i guess i have to figure out if this is using an old version of conda or how that works
[18:12:00] i mention it because after changing the python version to 3.11 it's just spinning in that step :P
[18:12:35] interesting. I'm using 3.9 and just got a couple of errors for missing pkgs (probably too old)
[18:12:41] - xgboost=0.90 - networkx==1.11
[18:13:24] but I'm just running tox, not the setup cmd from that github issue
[18:14:18] hmm, yea, xgboost 0.9 is ancient. networkx is a dependency of hyperopt, we might have to upgrade both of those to use a newer python version. On the one hand, neither of those is used in the daemons, so we could hack around it somehow
[18:14:34] I just removed the version constraint and things are happening
[18:18:46] `Exception: Java gateway process exited before sending its port number` hmmm
[18:18:51] curious. I have conda 4.11.0. The release talks about conda 23.9.0. I'm going to hope they changed their version numbering at some point to be based on calendar dates :)
[18:20:32] i guess i'm using https://apt.wikimedia.org/wikimedia/pool/thirdparty/conda/c/conda/
[18:24:45] the internet seems to think my error is related to JAVA_HOME, I'm dubious
[18:25:03] I am using Java 17 though, hmmm
[18:25:08] hmm, this is probably java 8
[18:25:13] it runs in the yarn cluster
[18:25:40] I can try w/Java 8, I'd just expect it to bail out earlier if that envvar was missing
[18:28:35] i'm going to guess you're getting the libmamba solver in the newest conda, because mine still simply spins trying to resolve dependencies (i had to manually figure out versions last time instead of letting the resolver figure it out). Oddly i can't even figure out when 4.11.0, which we use, was released. 4.4.0 was released in may 2017, then they went to v5. Not seeing a 4.11.0 :P
[18:29:06] ;(
[18:29:35] I installed openjdk8 and it bombs out on start. Probably due to mixing old java with the M1 mac
[18:29:43] will give it a shot on a WMCS VM
[18:31:39] oh, apparently you don't need anaconda, just conda. It seems anaconda is a default set of packages for data science installed through conda
[18:32:12] SRE pairing in https://meet.google.com/eki-rafx-cxi if you wanna play along. Promise I won't bait and switch this time ;)
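The `Java gateway process exited before sending its port number` error discussed above comes from PySpark failing to launch its driver JVM, which is usually a JAVA_HOME problem: unset, or pointing at an incompatible JVM (e.g. Java 17 against an older Spark build). A minimal sketch of checking this, assuming a local Java 8 install; the JDK path is a hypothetical placeholder.

```python
# Quick check for the "Java gateway process exited" failure mode: PySpark
# spawns a JVM via JAVA_HOME, so point it at a compatible JDK (java 8 here)
# before building the session. The path is a hypothetical placeholder.
import os

os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")            # run locally, no yarn cluster needed
    .appName("java-gateway-check")
    .getOrCreate()                 # raises the gateway error if the JVM won't start
)
print("spark", spark.version, "using java at", os.environ["JAVA_HOME"])
spark.stop()
```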
[18:32:22] conda does have a 4.11.0 release, from 2021-11-22, so not that old but not too recent
[19:10:19] yea, turns out switching to libmamba (via conda install -n base conda-libmamba-solver; conda config --set solver libmamba) on the latest (23.5.2) gets past the resolver step quickly
[19:11:29] oh, perhaps i spoke too soon :P
[19:15:15] curiously there are only 2 packages without explicit version numbers... how much work could this be :P
[19:16:04] I got java8 running using the azul version
[19:16:29] it's running, I see a few errors though
[19:17:51] xgboost.py is complaining, not too surprising
[19:18:45] https://phabricator.wikimedia.org/P52512 full output
[19:19:48] passing pytest is a surprise, gives hope
[19:20:23] the mypy and flake8 errors are generally about either lints related to the new version of python, or type changes in the libraries we use
[19:22:07] I think I can fix https://phabricator.wikimedia.org/P52512$61, might need some help w/the others
[19:23:50] "fix" as in getting flake8 happy, no idea if it will work after the change though
[19:25:33] looks like the vespa server finished as well, with similar results
[19:27:43] OK, everything is happy except mypy
[19:28:12] ahh, perhaps i should have been more patient. I kept killing it after ~5 minutes; in the past i let it run for hours and it got nowhere
[19:36:28] I guess stuff changed in xgboost... maybe we just need to update the tests as you mentioned
[19:37:30] yea, i'm currently figuring out what versions we should update to for hyperopt/xgboost. xgboost is probably not too bad, but i think we integrated reasonably deeply into hyperopt and that might be harder (at the time hyperopt didn't support spark, so i wrote my own thing)
[19:38:39] the test suite passing is hopeful, but then i wonder if the tests are complete enough :)
[19:42:17] No rush, I'm just happy I fixed the dependency hell on my Mac and got Java 8 going
[20:19:20] meh, made everything pass with 3.9 locally, but CI is of course spinning on the dependency resolution because it's using the conda version from apt.wikimedia.org
[20:19:36] i guess i just have to pin down enough versions that it doesn't have to think, easier with a new version that can resolve them
[20:53:32] If anyone has the time, I'd appreciate another set of eyes on a recent patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/957806 .. jenkins seems to be failing with a phan error in unrelated code. I rebased and the error persists. I can't find any updates that explain the problem, and I don't know the code with the error, so I'm hesitant to suppress or try to fix the phan error. Thoughts?
[20:54:48] Trey314159: looks like a problem with the master branch, plausible guess is it needs IDatabase changed to IReadableDatabase
[20:55:16] to verify, i usually submit an empty change (add a new file to the repo called DO_NOT_MERGE) to gerrit
[20:56:22] Thanks, ebernhardson .. I thought of trying the empty change, but didn't want to spam changes if there was an obvious problem or solution.
[22:09:30] there was a follow-on IDatabase that seems to need to be IReadableDatabase, too. So far phan/jenkins are happy. We'll see. Thanks, Erik!
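On the hyperopt upgrade question at 19:37: newer hyperopt releases ship their own Spark integration (SparkTrials), which did not exist when mjolnir's custom layer was written. A minimal, hedged sketch of that API follows; the search space and objective are toy placeholders rather than mjolnir's actual xgboost tuning setup, and whether SparkTrials could actually replace the custom integration is an open question.

```python
# Hedged sketch of hyperopt's built-in Spark support (SparkTrials), shown only
# as a possible direction for the upgrade discussed above. The objective and
# search space are toy placeholders, not mjolnir's real tuning setup.
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

def objective(params):
    # placeholder loss; mjolnir would train/evaluate an xgboost model here
    loss = (params["eta"] - 0.1) ** 2
    return {"loss": loss, "status": STATUS_OK}

space = {"eta": hp.uniform("eta", 0.01, 0.5)}

best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=20,
    trials=SparkTrials(parallelism=4),  # distributes trials over Spark executors
)
print(best)
```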