[07:43:25] dcausse, zpapierski: good morning!
[07:43:37] gehel: welcome back!
[07:43:42] would you have a bit of time to catch me up on what happened while I was away?
[07:43:47] sure
[07:43:54] In 5min
[07:43:55] meet.google.com/ege-hhgd-yyj
[07:44:00] ok, let's wait 5'
[07:47:59] I'm ready
[07:48:26] dcausse: we're in!
[08:41:50] wonderful, the context suggester implementation greets me with throw new NotImplementedYet (only that one)
[08:41:54] in Elastica
[08:43:27] fortunately it's only a bit different than completion
[08:44:35] sometimes you can alter raw params of the elastica object, perhaps that'll be sufficient to inject what you need
[08:44:51] yeah, hence fortunately
[08:45:09] I can easily add the param on my own
[08:45:31] other approaches (both exist in cirrus): work directly with arrays (the phrase suggester does that) or implement your own Elastica subclass
[08:46:11] I was thinking about the second one, but adding an array param (since I only need one) seems simpler to read and to write
[08:47:13] where is the safest place to get the current wiki from during a search request?
[08:52:06] SearchConfig::getWikiId()
[09:21:28] thx
[09:23:49] flink died after the row A failure and it could not resume from the checkpoint as the _metadata file grew past the max size allowed by akka
[09:24:26] after increasing the akka max message size + giving more heap it restarted again, but the _metadata file kept growing
[09:34:40] why was it affected?
[09:35:01] I have no clue
[09:36:50] errand + lunch
[09:38:36] surprisingly the state extraction job is able to run on that big _metadata file without failing
[09:40:55] I downloaded the _metadata file locally (stat1004:/home/dcausse/flink-1.12.1-wdqs/wdqs_streaming_updater/checkpoints/b4d1cd3eb1ab4002a63b7c229a8c3542/chk-140815)
[09:41:11] will file a task to inspect it and understand what's in there
[09:41:13] are you sure the row A failure was the cause here?
[09:41:18] good idea
[09:41:32] we actually do have some consultation hours left with Ververica, right?
[09:41:43] we haven't yet released those to other teams, AFAIR
[09:42:53] no, I'm not sure, it's likely it caused the app to fail (the times correspond), but whether the _metadata growth is related to the row A failure is hard to tell
[09:43:07] yes, we could ask for help indeed
[09:43:51] might be timers, if all of a sudden all the events start to be out-of-order
[09:50:59] it would be good to capture the context of the failure for this - do you know how the dropped operations in Flink looked around that time? I don't know if we can verify the actual out-of-orderness of events directly
[09:57:09] out-of-orderness should create more "state-inconsistencies" tracked in event.rdf_streaming_updater_state_inconsistency if the pipeline did not fail too early
[09:57:44] well it captured only one inconsistency on that day, at 10am UTC ...
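
A minimal sketch of the "work directly with arrays" approach from the 08:45-09:21 exchange above: building the completion clause by hand (as the phrase suggester does) and scoping it to the current wiki via SearchConfig::getWikiId(). The field and context names are hypothetical placeholders, and this is not the actual CirrusSearch change.

```php
<?php
// Rough sketch only: build the suggest clause as a raw array, since Elastica's
// context suggester is the part that throws NotImplementedYet. Field and
// context names below are made up for illustration.

use CirrusSearch\SearchConfig;

/**
 * Build a completion suggest clause restricted to the current wiki.
 * $config is the SearchConfig instance already available to Cirrus query builders.
 */
function buildContextSuggest( SearchConfig $config, string $userInput ): array {
	return [
		'my_context_suggest' => [
			'prefix' => $userInput,
			'completion' => [
				'field' => 'suggest_context', // hypothetical context-enabled field
				'size' => 10,
				'contexts' => [
					// SearchConfig::getWikiId() is the safest source of the
					// current wiki during a search request (see 08:52 above)
					'wiki' => [ $config->getWikiId() ],
				],
			],
		],
	];
}
```

The resulting array can then be attached to the request body next to the parts Elastica does know how to build, which matches the "adding an array param" option chosen above.
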
[09:57:45] right, ignored mutations are saved too
[09:58:00] select hour, count(*) from rdf_streaming_updater_state_inconsistency where year=2021 and month=7 and day = 16 group by hour order by hour desc limit 24;
[09:58:28] if the pipeline failed because of a growing _metadata file, then out-of-orderness isn't to blame
[09:58:42] it wouldn't (I assume) grow beyond the limit immediately
[10:00:07] well it failed with a 70mb _metadata file, I resumed it after increasing some limits and then took a savepoint, the resulting metadata file is 800mb
[10:00:25] it was backfilling, so consuming faster than realtime
[10:01:30] still, if it was inconsistencies I assume we'd see more of them
[10:02:30] anyway - about the recovery - you wrote that the state extraction job works correctly
[10:02:47] so it's simply a matter of reapplying the state now?
[10:36:38] to resume the pipeline, hopefully; as for understanding what caused this, we'll have to spend some more time
[10:36:44] lunch
[10:36:48] yeah
[10:39:59] mpham: I promised to have more about the documentation needs for today's triage/planning, but I expected fewer distractions this week, sorry for that, I'll make sure to have it ready next time
[10:45:29] huh, it was surprisingly easy to add context
[10:50:23] (to a query completion)
[10:51:34] break
[12:27:01] thanks for the update zpapierski
[12:34:14] Damn, I messed up the daycare schedule for Oscar during this holiday week.
[12:34:40] I need to get him right now. I expect spotty availability this afternoon
[14:17:40] mpham and the rest invited: can we move the Streaming Updater meeting by an hour? I have a personal errand to attend to exactly at the original time
[14:18:13] (still waiting for the doc, I probably won't make it to the WDQS sync and triage/planning)
[14:31:56] gehel: meeting?
[14:32:14] sorry
[14:36:38] does anyone know about "search-loader1001"? It is going to be impacted by a network maintenance on row D switches
[14:39:33] that's the model loader, so it should not be an issue if it loses network connectivity for a short while
[14:39:58] still, we should keep an eye on it during the maintenance (cc ryankemper)
[14:55:40] mpham: sorry, I meant even half an hour later - 5:30pm UTC, I won't be back in time
[14:57:47] search-loader1001's actual impact will be visible on https://grafana-rw.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1
[14:58:11] I think it will just be fine
[15:01:53] if it's not, we'll learn something as well!
[15:36:03] zpapierski: can you check 2fa on your phab account: T285947
[15:36:04] T285947: Security Issue Access Request for Search Platform team members - https://phabricator.wikimedia.org/T285947
[15:44:24] zpapierski: meeting moved to 17:30 UTC
[15:44:30] thx!
[15:50:29] gehel: The other thing to consider for the maintenance is that we're going to lose 9 servers in eqiad (see https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=6&status=active&role=server); normally that would drop us into red cluster status, but since this is all in a single row we should just hit yellow, right?
[15:55:20] Also we'll lose master `elastic1036` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L12), and for `omega` we'll lose `elastic1038` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L85)
[15:55:23] and for `psi` we'll lose `elastic1050` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L131)
[15:55:45] So we may want to switch out the masters beforehand so that we keep 3 masters up during the whole maint window
[15:57:23] Per https://phabricator.wikimedia.org/T286061 we should plan for up to a 5 minute outage on the "happy path", although it will likely be even shorter than that, unless something goes wrong, in which case it will be more than 5 minutes naturally
[15:57:57] Anyway, so TL;DR I think we'll probably want to switch the masters up, otherwise we shouldn't really have to do anything else for elastic-eqiad AFAICT
[16:17:10] ryankemper: as far as I know, the only way to switch the master is to kill the current one, which is the same impact as losing it during maintenance.
[16:17:27] It would be nice to be able to proactively switch masters!
[16:17:59] I guess the main benefit is we can switch masters when we're not already down 9 normal nodes
[16:18:23] er, 9 total nodes, so fewer than 9 worker nodes, but yeah
[16:18:37] The switch reconfiguration should be a very short network outage, shards will start to relocate, and probably a lot of the nodes will be back online before relocation completes
[16:19:11] I'd rather use this as a test of the robustness. We **should** be robust to losing a full row. If we're not, we need to know.
[16:19:25] dinner!
[16:26:58] Sounds good to me
[18:02:41] dinner
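
As a rough illustration of the pre-maintenance checks discussed above (which node is the elected master, and the expectation of yellow rather than red status while a single row is down), here is a hedged Elastica sketch. The host and port are placeholders, and in practice this would more likely be a curl against the cluster; treat the exact client calls as assumptions depending on the Elastica version in use.

```php
<?php
// Hypothetical pre-maintenance check (placeholder host/port, not production
// endpoints): report overall cluster health and the elected master. Losing a
// single row should leave the cluster yellow (unassigned replicas), not red
// (lost primaries).
require __DIR__ . '/vendor/autoload.php';

$client = new \Elastica\Client( [ 'host' => 'localhost', 'port' => 9200 ] );

$health = $client->getCluster()->getHealth();
printf(
	"status: %s, nodes: %d, relocating: %d, unassigned: %d\n",
	$health->getStatus(),
	$health->getNumberOfNodes(),
	$health->getRelocatingShards(),
	$health->getUnassignedShards()
);

// Which node is currently master, to see whether it sits in the affected row.
$master = $client->request( '_cat/master', \Elastica\Request::GET, [], [ 'format' => 'json' ] )->getData();
print_r( $master );
```
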