[08:02:53] pfischer: thanks for testing the serialization bits, I'm a bit puzzled by this one... and not sure how to reproduce it locally
[08:03:06] so far I see 3 possible causes
[08:04:38] 1/ reading a checkpoint that's now incompatible (we should have the checkpoints in swift so it might be testable, they should be small so hopefully a single _metadata file to read)
[08:05:55] 2/ deserializing events between operators, this one might be annoying as it suggests that in some condition a given serializer is unable to read the bytes it generated
[08:07:13] 3/ a consequence of another failure, same as the above but if the upstream component failed it might not have fully written the serialized record and the downstream operator has not yet been notified that the pipeline has failed and thus might fail reading incomplete bytes
[08:21:35] the serialization code is coming from StreamElementSerializer which is used by WindowReaderOperator, AsyncWaitOperator (but this one can be discarded since it's happening in the producer) and AbstractStreamTaskNetworkInput
[08:22:23] the stack mentions AbstractStreamTaskNetworkInput, so that might be an indication that it's either problem 2 or 3
[08:22:53] if it's 2 it's going to happen again for sure
[08:28:30] o/ dcausse: thank you for looking deeper into this! Looking at your code for testing the serialisation via flink I was wondering what is stored with the serialised DTO: If first extracts/restores the serializer from the DataInputView and after that, the actual dto extracted from the same DataInputView. Does that mean, flink stores the serialiser with every DTO? Do you know of any general exception handler that would allow us to
[08:28:30] side-output error messages in case of such failure?
[08:28:54] s/If first/It first/
[08:32:03] o/, I think there are 2 main use-cases: serialization happening for in-between operators, this one assumes that the serializer is from the exact same TypeInfo so no need to store/restore the serializer itself (via this Snapshot thing)
[08:32:24] for durable states in checkpoints/savepoints
[08:33:06] this one requires storing the serializer config (SerializerSnapshot) but it's stored per data file not per record
[08:34:32] capturing serialization failures sounds difficult and that would have to happen in the serialization itself I suspect, but I'm not sure that we should let the pipeline run if the serialization code is bogus
[08:36:20] when restoring a state flink should double-check that the serializer snapshot is compatible (main purpose of the UpdateEvent serializer test) so I'm assuming that the failure would have been different
[08:37:18] and reading the stack and seeing that AbstractStreamTaskNetworkInput is being mentioned, I'm tempted to assume that the problem happened during in-between operators serialization
[08:38:03] if it's a bug there it'll most certainly happen again soon and that should tell us to investigate more on this side
[08:38:25] Okay.
[08:38:59] of course, I might be all wrong as well :)
[08:39:34] I still can’t figure out why I can’t enforce that deserialisation issue: we introduce a new field (boolean) right before weighted_tags (List) and this should lead to a problem when either the old serialiser tries to read a boolean where a list starts, or the other way round.
[08:42:06] pfischer: if you read the bytes with the exact same serializer it should fail indeed, if you read the bytes using a "restored serializer" it should work
[08:47:03] e.g. instead of reading the bytes with objectTypeSerializerSnapshot.restoreSerializer().deserialize(deser) you read them with UPDATE_EVENT_ROW_TYPEINFO_1_0_0.createSerializer(null).deserialize(deser)
[08:47:15] this should trigger a bug
[08:47:33] s/bug/deserialization issue/
[08:48:03] if not then I'm puzzled too
[08:51:16] s/UPDATE_EVENT_ROW_TYPEINFO_1_0_0/UPDATE_EVENT_ROW_TYPEINFO_LATEST/
[08:53:08] bah... no, sorry, should be CURRENT_TYPE_INFO
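A minimal sketch of the experiment suggested above, assuming a placeholder POJO type and placeholder TypeInformation arguments rather than the actual identifiers from P53063:

```java
// Hypothetical sketch, not the actual P53063 test: the type parameter T and
// the oldTypeInfo / currentTypeInfo arguments stand in for the pipeline's
// real POJO and TypeInformation constants.
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

final class SerializerMismatchSketch {
    static <T> void compareReads(TypeInformation<T> oldTypeInfo,
                                 TypeInformation<T> currentTypeInfo,
                                 T event) throws java.io.IOException {
        TypeSerializer<T> oldSerializer = oldTypeInfo.createSerializer(new ExecutionConfig());

        // write the event with the "old" schema
        DataOutputSerializer out = new DataOutputSerializer(256);
        oldSerializer.serialize(event, out);
        byte[] bytes = out.getCopyOfBuffer();

        // checkpoint-style read: restore a serializer from its snapshot, then
        // deserialize; this path is meant to survive schema evolution
        T restoredRead = oldSerializer.snapshotConfiguration()
                .restoreSerializer()
                .deserialize(new DataInputDeserializer(bytes));

        // network-style read: read the same bytes with the *current* serializer,
        // as happens between operators; with a newly added field this is the
        // case expected to misbehave
        T currentRead = currentTypeInfo.createSerializer(new ExecutionConfig())
                .deserialize(new DataInputDeserializer(bytes));
    }
}
```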
[09:07:38] hm, testing the above, the deserialization code does not fail indeed; it's rather the resulting object that's corrupted... the weighted_tags field is null instead of [mytag/value|10] and redirectRemove is [null] instead of null, so weird...
[09:14:58] I tested exactly that: https://phabricator.wikimedia.org/P53063
[09:19:50] test_v2_cannot_restore_v1 is the one I wanted to fail
[09:24:06] All I see are unexpected default values: If I decode v2.bytes with the v1 typeserializer, then rawFieldsFetchedAt is set to the lowest timestamp (1970-…) and the weightedTags are empty
[09:25:37] If I decode v1.bytes with the v2 typeserializer, then rawFieldsFetchedAt is null and weightedTags are still empty
[09:28:32] Looks like nulls in general are not properly restored but are rather filled with defaults (string -> empty string; instant -> 1970-…; etc.)
[10:05:09] I think field serializers are being mixed up; null fields are managed via a mask, so I would not be surprised that some fields are mistreated as null and it thus attempts to read a lot less than what it's supposed to
[10:11:04] yes, if I assert that the number of remaining bytes in the stream is 0, I get a failure
[10:11:49] I think the defaults you see are just by luck, 0 -> epoch, 0 string length => empty string
[10:12:47] we don't set any defaults other than null or "empty" Rows
[10:17:06] but in some other conditions, instead of not fully reading the stream I suppose that the opposite could happen
[10:20:12] pfischer: a small improvement to the test: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/50
[10:52:47] lunch
[12:15:06] Thanks! Enforcing a discrepancy between the DataInputView used for extracting the TypeSerializer and the one used for extracting the data is obviously not possible, due to the fact that flink keeps the read-position state inside the DataInputView itself. Hence, the independent TypeSerializer with an independent DataInputView underneath reads an arbitrary byte, assuming it’s the POJO mask, but in fact it isn’t. As a consequence the POJO gets created
[12:15:07] with the sub-serialiser’s default values, but no attempt is made to deserialise the field values from the DataInputView.
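A minimal sketch of the "no leftover bytes" guard discussed above — an assumed shape, not the actual diff in merge request 50:

```java
// Sketch only: deserialize and fail loudly if the serializer did not consume
// every byte it was given, instead of silently returning a POJO full of
// default-looking field values.
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.core.memory.DataInputDeserializer;

final class StrictDeserialize {
    static <T> T deserializeFully(TypeSerializer<T> serializer, byte[] bytes) throws java.io.IOException {
        DataInputDeserializer in = new DataInputDeserializer(bytes);
        T value = serializer.deserialize(in);
        // a mis-read POJO null mask typically leaves unread trailing bytes
        if (in.available() > 0) {
            throw new java.io.IOException(
                in.available() + " unread byte(s) left after deserialization");
        }
        return value;
    }
}
```

Running the mismatched read from the earlier sketch through a check like this should turn the silent corruption into a hard failure.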
[13:14:55] o/
[14:21:41] o/
[14:32:07] small patch to bring the new search-loader hosts online if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969761
[14:34:27] thanks!
[14:38:50] ebernhardson: I looked into the is_redirect: false issue and it’s indeed weird. This flag is persisted as page_is_redirect but defaults to 0. On create, it is only updated once the parser is done and a revision is linked to the page. Since the event/kafka message has information about the revision, the revision should be around at event time. Also, the timestamps match: According to quarry, page_links_updated has the same time as
[14:38:50] the event’s dt and the db record has page_is_redirect = 1. However, the revision information gets passed separately from the wikiPage to PageChangeHooks#onPageSaveComplete so it might be the case that the wikiPage object is out of date.
[14:56:58] re: search-loader... looks like we never migrated its deploy repo to gitlab. The gerrit repo is read-only, so I think I'm just going to one-off the scap config on the deploy host. If anyone thinks that's a bad idea LMK
[15:00:16] inflatador: it's definitely moved, /srv/deployment/search/mjolnir/deploy contains a conda_venv.zip and we only ever built that in gitlab
[15:00:35] inflatador: i suppose what happened is the deploy repo itself didn't move, just changed
[15:01:30] ebernhardson Ah, I was expecting to see the scap stuff in the gitlab repo, is it just a different deploy process now?
[15:06:19] pfischer: do you know if this is safe to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/969761
[15:06:41] it was merged to tip ~30 mins ago but not puppet-merged and bking is not responding to pings
[15:06:49] if not will have to revert
[15:07:13] jbond: not sure why it's not deployed, generally that's safe and won't do much of anything
[15:07:23] jbond: bking is in a meeting now
[15:08:04] if i merge this now then on the next puppet run it will start installing all the classes related to search::loader, which is probably safe but i don't know
[15:08:29] jbond the puppet-merge UI told me it merged? Or maybe I missed it
[15:08:40] anyway, feel free to merge
[15:08:58] ok merging now thanks
[15:09:39] inflatador: no worries, and also noticed i pinged bking not +inflatador
[15:58:11] SRE meeting conflicts w/ triage, will be at SRE
[15:58:22] ^^ same here
[15:58:59] gehel ^^ is this OK? I can go to the search triage mtg if you prefer
[16:00:03] May only be able to join triage by phone
[16:01:46] pfischer: search triage in https://meet.google.com/eki-rafx-cxi
[16:33:25] we're seeing a lot of mjolnir reduced availability alerts in AM, going to talk to observability about this after SRE mtg
[16:35:30] inflatador: that basically says it's failing to collect metrics from the prometheus thing, guessing the new instance isn't coming up?
[16:35:41] or the service is in a reboot loop
[16:36:29] ebernhardson ah, if it's just https://phabricator.wikimedia.org/T348222 then we can ignore
[16:38:09] Although we do need to set up notifications for this on our team
[16:38:41] Probably need prometheus failed scrapes to go to the owning team in general as opposed to the observability team
[16:39:17] dcausse is this still needed as a follow-up from the Sept ES incident? https://phabricator.wikimedia.org/T347034 . I'll go ahead and poke data persistence if so
[16:45:22] inflatador: yup that's still relevant
[16:58:07] ACK thanks
[17:10:30] i hadn't noticed, but of course the redirect problem is also in reverse. Editing a redirect into a non-redirect gets is_redirect: true on the edit
[17:13:34] lunch, back in ~40
[17:19:56] looking into the invalid surrogate pair problem, elastic claims it is our problem :P The `fvh` highlighter has this problem, the `unified` highlighter does not and became the default in es 6.
[17:55:22] dinner
[17:59:52] back
[18:04:57] Re: mjolnir deploy, looks like we need to use gitlab instead of scap? Going to try this process: https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/tree/main#deployment
[18:05:31] inflatador: it should be scap iirc. The deployment process there builds a new zip file, but scap should fetch that zip file
[18:06:01] inflatador: oh! i see what happened. We accidentally archived the deploy repo with the regular repo
[18:06:45] inflatador: so, i guess we either look into un-archiving, or we create a new deploy repo on the gitlab side with the same content as the archived repo
[18:06:52] https://gerrit.wikimedia.org/r/admin/repos/search/MjoLniR/deploy,general
[18:10:00] ebernhardson ACK... I opened https://phabricator.wikimedia.org/T350043 for it, will migrate the deploy repo to gitlab (seems less confusing) and probably one-off for this
[18:31:32] need to go meet w/ water utility, back in 1.5 hrs
[18:31:47] gehel: ^ see above, can't make our 1:1 today
[18:34:24] i suspect our redirect problem is related to replication lag? Hacked up WikiPage::getRedirectTarget to always query the primary and now redirect->plain gets the correct answer, and plain->redirect gets an error about not finding the related row in the redirects table
[18:34:32] doesn't explain why it works from shell.php
[18:36:58] 'nother minor puppet change for search-loader if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969947/
[18:38:50] thanks!
[18:41:24] ebernhardson mjolnir is now running on search-loader1002, 2001, 2002. (I stopped it on 1001 in case we need to roll back.) Is there anything we need to do to verify the new hosts (1002, 2002) are working?
[18:45:12] appointment, back in ~2h
[18:47:32] inflatador: should see them show up in the mjolnir grafana dashboards
[18:51:55] i wonder... WikiPage::insertRedirect defers inserting into redirect until DeferredUpdates::POSTSEND, but the code that runs the PageSaveComplete hook is in PRESEND
[18:52:06] still doesn't explain how it works in shell.php :P
[19:53:25] hmm, so i think i'm reasonably confident the issue is reading from replicas in the request that wrote to the primary db. Calling $wikiPage->loadPageData( 'fromdbmaster' ) before calling $page->getRedirectTarget(), along with changing WikiPage::getRedirectTarget to read from DB_PRIMARY when querying the redirects table, seems to make it work as expected
[19:53:28] but i can't just do that as a fix :P
[19:53:35] Hi, all. I'm working on my MW 1.39 upgrade testing, currently creating Elasticsearch indexes from scratch since these are in a brand new Amazon OpenSearch cluster due to the version upgrade requirement from Elasticsearch 6.x to 7.10.x. I haven't had to create new indexes since the originals almost three years ago (just post-upgrade updates of the
[19:53:36] existing indexes), and I have some questions about the procedure and performance, especially since I can only run the job queue runners with a single process (i.e. maxprocs=1). The bootstrap commands (ForceSearchIndex.php) create 10s of thousands of jobs (42k for the 2nd smallest wiki just now), so the job queue processing will take ages. My plan
[19:53:36] is to do the initial index creation of the new production wikis using DBs created from snapshots prior to the actual migration downtime so that I just have to do index updates using fresh DBs from snapshots. Kind of an open question, but if anyone has any thoughts and/or recommendations regarding this process, I'd appreciate hearing them.
[19:55:41] justinl: indeed creating new indexes from scratch is a big thing, I don't think we've ever done full indexes from the database on big databases before and it's a bit of a scary thought (would expect it to take a month+ for us)
[19:56:02] justinl: usually the bottleneck there is wikitext parsing
[19:56:58] justinl: i don't know that it's viable for you, depends on the upgrade schedule, but one way would be to take snapshots from one cluster, load them into the other clusters, and then use cirrus multi-cluster capabilities to write to both?
[19:57:03] planning a switchover gets more complicated though
[19:59:37] that's how we do the switchover at least (minus the snapshots). For the transition there will be two clusters, one on old code and one on new. Both will be configured to write to both clusters. When the code deploys it switches reads to the new version
[20:00:12] How did you handle the requirement to go from Elasticsearch 6.x to 7.x?
[20:01:03] And I am effectively a one-man shop with little knowledge of Elasticsearch itself. I basically create the Amazon OpenSearch clusters, follow the CirrusSearch documentation procedures, and hope it works. The multi-cluster thing you mention probably wouldn't be feasible for me.
[20:01:32] justinl: we never do reads to the wrong version, only writes. For the writes from the 7.10 code to the 6.8 cluster we needed this bit: https://www.mediawiki.org/wiki/Extension:CirrusSearch/ES6CompatTransportWrapper
[20:01:43] I have five wikis of various sizes but the two largest are huge, at least by my standards.
[20:01:44] justinl: iirc, writes from 6.8 to 7.10 "just worked"
[20:01:51] obviously test that :)
[20:02:51] Currently my setting for that variable is just this:
[20:02:52] # Use the local Nginx proxy that forwards port 9200 to AWS Elasticsearch on port 443.
[20:02:53] $wgCirrusSearchServers = [ 'localhost' ];
[20:03:29] So I'd just need to add that transport setting, and yeah, test.
[20:05:23] Part of my problem is that I still have this issue, which is what forces me to run the job runners with a single process: https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/5392
[20:05:32] for doing multi-cluster there are a few bits that are probably not greatly documented. docs/multi_cluster.txt has some design info and examples. Ignore the parts about multiple groups per cluster, that's only relevant when running many hundreds or thousands of wikis
[20:08:11] justinl: that does look like a tedious one to resolve
[20:09:58] Yeah, I am unable to diagnose the problem myself, not being a MW or PHP developer, so all I could do at best would be to test after each major upgrade to see if the problem still exists.
[20:16:29] (back)
[20:25:57] back
[20:37:48] new search loaders aren't talking to kafka-jumbo... checking FW rules now
[20:39:41] ebernhardson The multi-cluster procedure looks complicated (and expensive since Amazon OpenSearch clusters aren't cheap). Plus I am nowhere near enough of an expert with MW, let alone Elasticsearch, to trust that I could properly carry out that procedure. I think my best option, then, would be to stage the new live 1.39 environment with copies of
[20:39:42] the database from a point in time, create the new cluster and indexes and allow that to complete (probably a week, I'd bet), and then just do the index update procedure during the actual downtime migration to the new environment.
[20:45:22] Btw, I don't know how much of a difference this makes, but we're on Elasticsearch 6.5.4, not 6.8.x, so I don't know if that version difference would impact the multi-cluster approach.
[20:48:48] justinl: that's fair, i don't think anyone other than us has used the multi cluster bits so it's certainly not polished up and easy. Your approach sounds reasonable, you can certainly build the index outside of live production.
[20:48:54] I have another unrelated question as well. When testing the Amazon Aurora MySQL upgrade procedure from 5.7 to 8 (Aurora 2 to 3 since Aurora 2 support ends next October), I got the following errors (a while back):
[20:48:55] Index si_text from en_wikidb_gw2.searchindex has Index length 16777215 > 767 bytes, which will cause error after upgradation. Consider changing the row_format of the tables to dynamic and restart the upgrade.
[20:48:55] The auxiliary tables of FTS indexes on the table 'en_wikidb_gw2.searchindex' are created in system table-space due to https://bugs.mysql.com/bug.php?id=72132. In MySQL8.0, DDL queries executed on this table shall cause database unavailability. To avoid that, drop and recreate all the FTS indexes on the table or rebuild the table using ALTER TABLE
[20:48:56] query before the upgrade.
[20:48:56] Index o_text from en_wikidb_gw2.smw_ft_search has Index length 196605 > 767 bytes, which will cause error after upgradation. Consider changing the row_format of the tables to dynamic and restart the upgrade.
[20:48:57] The auxiliary tables of FTS indexes on the table 'en_wikidb_gw2.smw_ft_search' are created in system table-space due to https://bugs.mysql.com/bug.php?id=72132. In MySQL8.0, DDL queries executed on this table shall cause database unavailability. To avoid that, drop and recreate all the FTS indexes on the table or rebuild the table using ALTER TABLE
[20:48:57] query before the upgrade.
[20:49:12] justinl: i wonder in that case, does it even need to rebuild? I suspect you can use the s3 snapshot plugin to write the indexes to s3 and then load them into a second cluster
[20:49:41] i suppose we haven't done that across revisions, but i would expect it to be forward compatible
[20:50:07] s/revisions/elastic versions/
[20:50:15] Going from MW 1.38 to 1.39 and ES 6.5.4 to 7.10.2, wouldn't that require a completely new index? I don't know anything about the details under the hood.
[20:51:23] justinl: not really, in terms of cirrus/mediawiki the structure we store in elasticsearch is largely unchanged. There might be a few fields that were added 2 or 3 years ago
[20:51:24] I do, however, have a Python script I wrote that can create and restore snapshots from an S3 bucket registered as an index snapshot repository. So if it's possible I could just use existing indexes that way, that would be amazing!
[20:52:06] justinl: so then it's just elasticsearch, they can always open indexes from the previous major version in the current, it's how they support online version upgrades. Elasticsearch does sometimes require you to recreate an index, that can be done in-place though
[20:52:48] I moved to CirrusSearch in August 2021.
[20:53:29] I can get a meeting with an Amazon OpenSearch specialist to discuss that possibility.
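A rough sketch of that snapshot-to-S3 route using the Elasticsearch low-level Java REST client; the repository, bucket, snapshot, and index names are placeholders, and the authentication/IAM setup an Amazon OpenSearch domain needs is omitted:

```java
// Rough sketch: register an S3 bucket as a snapshot repository, snapshot the
// wiki indexes on the old cluster, then restore them on the new one. All
// names are placeholders; Amazon OpenSearch also requires an IAM role and
// signed requests for the repository, which are not shown here.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public final class SnapshotMigrationSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient oldCluster = RestClient.builder(new HttpHost("old-cluster.example", 443, "https")).build();
             RestClient newCluster = RestClient.builder(new HttpHost("new-cluster.example", 443, "https")).build()) {

            // the same S3 bucket is registered as a repository on both clusters
            String repoBody = "{\"type\":\"s3\",\"settings\":{\"bucket\":\"my-wiki-search-snapshots\"}}";
            for (RestClient cluster : new RestClient[] {oldCluster, newCluster}) {
                Request register = new Request("PUT", "/_snapshot/migration");
                register.setJsonEntity(repoBody);
                cluster.performRequest(register);
            }

            // snapshot the indexes on the old (6.x) cluster
            Request snapshot = new Request("PUT", "/_snapshot/migration/pre-upgrade");
            snapshot.addParameter("wait_for_completion", "true");
            snapshot.setJsonEntity("{\"indices\":\"mywiki_*\",\"include_global_state\":false}");
            oldCluster.performRequest(snapshot);

            // restore the same snapshot on the new (7.10.x / OpenSearch) cluster
            Request restore = new Request("POST", "/_snapshot/migration/pre-upgrade/_restore");
            restore.setJsonEntity("{\"indices\":\"mywiki_*\",\"include_global_state\":false}");
            newCluster.performRequest(restore);
        }
    }
}
```

Whether a 6.5.x snapshot restores cleanly into a 7.10.x/OpenSearch cluster is exactly the compatibility question raised below, so this would need to be tried against throwaway indexes first.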
[20:57:44] justinl: oh, for the earlier question about 6.8.23, you probably don't need to worry about it. Elastic makes special changes to the last releases of a major version to include lots of warnings about things that will break when you upgrade, we used that to inform development
[20:58:18] another tiny puppet patch for search-loader if anyone can take a look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969968
[20:58:39] Good to know, thanks! 6.5.4 was the recommended version at the time (when it wasn't even sure if MW would work with a managed service) and I never tried upgrading beyond that point release.
[20:58:53] oops, spotted a typo... patching
[20:59:11] gave my +1
[20:59:14] OK fixed
[21:00:57] I'd also like to follow up later on my other question above about the db errors I got during MySQL 8 upgrade testing. That's not urgent and I need to step away for a bit, but I'd also appreciate the opportunity to help determine how to resolve the issues. I've been told that deleting the searchindex and smw_ft_index tables and running update.php
[21:00:57] should fix those, but there are the other tables mentioned as well, plus other warnings about some table columns being utf8 and needing to be upgraded to utf8mb4.
[21:01:08] stepping away for a bit
[21:01:29] thanks again
[21:17:36] re: search-loader, looks like the connection errors went away after the FW rules were updated: https://logstash.wikimedia.org/goto/ca2dd480835e60da6edb024f35e079e6
[21:21:00] back - Forget about the utf8 columns, that'd be another team I should ask.
[21:45:18] ebernhardson According to the MW upgrade docs, they don't have a straight upgrade path from < 6.8 to 7.x due to breaking changes. Based on your comments about possibly updating to 7 from a 6.5 index, what do you think about what this page shows? https://docs.aws.amazon.com/opensearch-service/latest/developerguide/version-migration.html
[22:08:44] justinl: at least in elasticsearch the banner (url path of /) reports its compatibility, our elastic 7.10.2 says minimum_index_compatibility_version: 6.0.0-beta1. I'm not sure how much that changes when using opensearch though.
[22:09:30] justinl: there may still be practical issues, indeed the generally suggested path is 6.5->6.8->7.x, i suspect that's more important when evaluating if things are going to work, as opposed to having a codebase already prepared for 7.10, but hard to say
[22:11:52] So testing upgrading our 6.5.4 indexes directly to 7.10.2 probably won't work, then? If that's the case, I'd need to really think through a comparison of the processes and time requirements that would be required to do 6.5 -> 6.8 -> 7.10 vs. fresh new 7.10 indexes.
[22:14:23] Combined with the 1.39 change of merging the revision_actor_temp table into revision (which took 13.5 hours on our biggest wiki until I tested multiple batch size increases, getting it down to < 4 hours), this has really been complicating and lengthening the upgrade testing for this major version upgrade. I'll take your further feedback to our AWS support team
[22:14:23] to follow up regarding talking with an OpenSearch specialist.
[22:55:35] The more I think about it, the more I lean towards the multiple in-place upgrade procedure of 6.5 -> 6.8 -> 7.10. That's a few more steps but likely much faster than creating new indexes from scratch.
[22:58:32] Actually, that may not be true. I need to think on this a lot more. Thank you for the feedback and advice! I may come back tomorrow once I have had time to think and sleep on this.