[07:43:25] dcausse, zpapierski: good morning!
[07:43:37] gehel: welcome back!
[07:43:42] would you have a bit of time to catch me up on what happened while I was away?
[07:43:47] sure
[07:43:54] In 5min
[07:43:55] meet.google.com/ege-hhgd-yyj
[07:44:00] ok, let's wait 5'
[07:47:59] I'm ready
[07:48:26] dcausse: we're in!
[08:41:50] wonderful, the context suggester implementation greets me with throw new NotImplementedYet (only that one)
[08:41:54] in Elastica
[08:43:27] fortunately it's only a bit different than completion
[08:44:35] sometimes you can alter raw params of the elastica object, perhaps that'll be sufficient to inject what you need
[08:44:51] yeah, hence fortunately
[08:45:09] I can easily add the param on my own
[08:45:31] other approaches (both exist in cirrus): work directly with arrays (the phrase suggester does that) or implement your own Elastica subclass
[08:46:11] I was thinking about the second one, but adding an array param (since I only need one) seems simpler to read and to write
[08:47:13] where is the safest place to get the current wiki from during a search request?
[08:52:06] SearchConfig::getWikiId()
[09:21:28] thx
[09:23:49] flink died after the row A failure and it could not resume from the checkpoint as the _metadata file grew past the max size allowed by akka
[09:24:26] after increasing the akka max message size + giving more heap it restarted again, but the _metadata file kept growing
[09:34:40] why was it affected?
[09:35:01] I have no clue
[09:36:50] errand + lunch
[09:38:36] surprisingly the state extraction job is able to run on that big _metadata file without failing
[09:40:55] I downloaded the _metadata file locally (stat1004:/home/dcausse/flink-1.12.1-wdqs/wdqs_streaming_updater/checkpoints/b4d1cd3eb1ab4002a63b7c229a8c3542/chk-140815)
[09:41:11] will file a task to inspect it and understand what's in there
[09:41:13] are you sure the row A failure was the cause here?
[09:41:18] good idea
[09:41:32] we actually do have some consultation hours left with Ververica, right?
[09:41:43] we haven't yet released those to other teams, AFAIR
[09:42:53] no, I'm not sure, it's likely it caused the app to fail (the times correspond), but whether the _metadata growth is related to the row A failure is hard to tell
[09:43:07] yes, we could ask for help indeed
[09:43:51] might be timers, if all of a sudden all the events start to be out-of-order
[09:50:59] it would be good to capture the context of the failure for this - do you know how the dropped operations in Flink looked around that time? I don't know if we can verify the actual out-of-orderness of events directly
[09:57:09] out-of-orderness should create more "state-inconsistencies" tracked in event.rdf_streaming_updater_state_inconsistency if the pipeline did not fail too early
[09:57:44] well it captured only one inconsistency on that day, at 10am UTC ...
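
A minimal sketch of the "work directly with arrays" approach from the 08:45-09:21 exchange above: building the completion clause by hand (as the phrase suggester does) and scoping it to the current wiki via SearchConfig::getWikiId(). The field and context names are hypothetical placeholders, and this is not the actual CirrusSearch change.

```php
<?php
// Rough sketch only: build the suggest clause as a raw array, since Elastica's
// context suggester is the part that throws NotImplementedYet. Field and
// context names below are made up for illustration.

use CirrusSearch\SearchConfig;

/**
 * Build a completion suggest clause restricted to the current wiki.
 * $config is the SearchConfig instance already available to Cirrus query builders.
 */
function buildContextSuggest( SearchConfig $config, string $userInput ): array {
	return [
		'my_context_suggest' => [
			'prefix' => $userInput,
			'completion' => [
				'field' => 'suggest_context', // hypothetical context-enabled field
				'size' => 10,
				'contexts' => [
					// SearchConfig::getWikiId() is the safest source of the
					// current wiki during a search request (see 08:52 above)
					'wiki' => [ $config->getWikiId() ],
				],
			],
		],
	];
}
```

The resulting array can then be attached to the request body next to the parts Elastica does know how to build, which matches the "adding an array param" option chosen above.
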
[09:57:45] right, ignored mutations are saved too
[09:58:00] select hour, count(*) from rdf_streaming_updater_state_inconsistency where year=2021 and month=7 and day = 16 group by hour order by hour desc limit 24;
[09:58:28] if the pipeline failed because of a growing _metadata file, then out-of-orderness isn't to blame
[09:58:42] it wouldn't (I assume) grow beyond the limit immediately
[10:00:07] well it failed with a 70mb _metadata file, I resumed it after increasing some limits and then took a savepoint, the resulting metadata file is 800mb
[10:00:25] it was backfilling, so consuming faster than realtime
[10:01:30] still, if it was inconsistencies I assume we'd see more of them
[10:02:30] anyway - about the recovery - you wrote that the state extraction job works correctly
[10:02:47] so it's simply a matter of reapplying the state now?
[10:36:38] to resume the pipeline, hopefully; as for understanding what caused this, we'll have to spend some more time
[10:36:44] lunch
[10:36:48] yeah
[10:39:59] mpham: I promised to have more about the documentation needs for today's triage/planning, but I expected fewer distractions this week, sorry for that, I'll make sure to have it ready next time
[10:45:29] huh, it was surprisingly easy to add context
[10:50:23] (to a query completion)
[10:51:34] break
[12:27:01] thanks for the update zpapierski
[12:34:14] Damn, I messed up the daycare schedule for Oscar during this holiday week.
[12:34:40] I need to get him right now. I expect spotty availability this afternoon
[14:17:40] mpham and the rest invited: can we move the Streaming Updater meeting by an hour? I have a personal errand to attend to exactly at the original time
[14:18:13] (still waiting for the doc, I probably won't make it to the WDQS sync and triage/planning)
[14:31:56] gehel: meeting?
[14:32:14] sorry
[14:36:38] does anyone know about "search-loader1001"? It is going to be impacted by a network maintenance on row D switches
[14:39:33] that's the model loader, so it should not be an issue if it loses network connectivity for a short while
[14:39:58] still, we should keep an eye on it during the maintenance (cc ryankemper)
[14:55:40] mpham: sorry, I meant even half an hour later - 5:30pm UTC, I won't be back in time
[14:57:47] search-loader1001's actual impact will be visible on https://grafana-rw.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1
[14:58:11] I think it will just be fine
[15:01:53] if it's not, we'll learn something as well!
[15:36:03] zpapierski: can you check 2fa on your phab account: T285947
[15:36:04] T285947: Security Issue Access Request for Search Platform team members - https://phabricator.wikimedia.org/T285947
[15:44:24] zpapierski: meeting moved to 17:30 UTC
[15:44:30] thx!
[15:50:29] gehel: The other thing to consider for the maintenance is that we're going to lose 9 servers in eqiad (see https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=6&status=active&role=server); normally that would drop us into red cluster status, but since this is all in a single row we should just hit yellow, right?
[15:55:20] Also we'll lose master `elastic1036` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L12), and for `omega` we'll lose `elastic1038` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L85)
[15:55:23] and for `psi` we'll lose `elastic1050` (https://github.com/wikimedia/puppet/blob/b95d22dec6d86cfa94fea24c56c662c5654196d7/hieradata/role/eqiad/elasticsearch/cirrus.yaml#L131)
[15:55:45] So we may want to switch out the masters beforehand so that we keep 3 masters up during the whole maint window
[15:57:23] Per https://phabricator.wikimedia.org/T286061 we should plan for up to a 5 minute outage on the "happy path", although it will likely be even shorter than that, unless something goes wrong, in which case it will be more than 5 minutes naturally
[15:57:57] Anyway, so TL;DR I think we'll probably want to switch the masters up, otherwise we shouldn't really have to do anything else for elastic-eqiad AFAICT
[16:17:10] ryankemper: as far as I know, the only way to switch the master is to kill the current one, which is the same impact as losing it during maintenance.
[16:17:27] It would be nice to be able to proactively switch masters!
[16:17:59] I guess the main benefit is we can switch masters when we're not already down 9 normal nodes
[16:18:23] er, 9 total nodes, so fewer than 9 worker nodes, but yeah
[16:18:37] The switch reconfiguration should be a very short network outage, shards will start to relocate, and probably a lot of the nodes will be back online before relocation completes
[16:19:11] I'd rather use this as a test of the robustness. We **should** be robust to losing a full row. If we're not, we need to know.
[16:19:25] dinner!
[16:26:58] Sounds good to me
[18:02:41] dinner
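
As a rough illustration of the pre-maintenance checks discussed above (which node is the elected master, and the expectation of yellow rather than red status while a single row is down), here is a hedged Elastica sketch. The host and port are placeholders, and in practice this would more likely be a curl against the cluster; treat the exact client calls as assumptions depending on the Elastica version in use.

```php
<?php
// Hypothetical pre-maintenance check (placeholder host/port, not production
// endpoints): report overall cluster health and the elected master. Losing a
// single row should leave the cluster yellow (unassigned replicas), not red
// (lost primaries).
require __DIR__ . '/vendor/autoload.php';

$client = new \Elastica\Client( [ 'host' => 'localhost', 'port' => 9200 ] );

$health = $client->getCluster()->getHealth();
printf(
	"status: %s, nodes: %d, relocating: %d, unassigned: %d\n",
	$health->getStatus(),
	$health->getNumberOfNodes(),
	$health->getRelocatingShards(),
	$health->getUnassignedShards()
);

// Which node is currently master, to see whether it sits in the affected row.
$master = $client->request( '_cat/master', \Elastica\Request::GET, [], [ 'format' => 'json' ] )->getData();
print_r( $master );
```
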