[09:34:50] errand, back after lunch most probably
[12:52:12] i.nflatador i pinged on that task. out for today (reminder, out on friday also).
[14:04:31] cloudelastic CCS users are now complaining of duplicate results...any ideas what might cause that? ref https://phabricator.wikimedia.org/T358541#9599341
[14:37:09] dcausse: would you have a few minutes to jump in https://meet.google.com/jzv-jpsd-yqw
[14:37:11] dcausse ^^ any ideas?
[14:37:23] sure
[15:16:35] quick break, back in ~20
[15:42:56] sorry, been back
[15:50:45] Dr appointment/workout, back in ~2h
[15:59:24] \o
[16:00:24] hmm, cloudelastic reindexing is up to avwiktionary. I don't know if that's slow or not. Feels slow
[16:12:42] o/
[16:14:18] flink start/stop overhead for small wikis might not be ideal :/
[16:14:48] indeed. Looking at trey's stuff, he started maybe an hour or two earlier, he is up to bawiki in eqiad and codfw, and cloudelastic is on azbwiki. So really not that far apart
[16:15:57] could the reindexer tell if a change happened during the reindex?
[16:16:08] bawiki is the 79th, azbwiki is the 68th, so a bit slower but not crazy
[16:16:17] oh ok
[16:16:19] hmm, not easily
[16:17:07] 10 wikis behind but with one hour less of runtime; perhaps not slow enough to justify optimizing something yet?
[16:17:23] i mean, i guess kafkacat + jq could figure out if anything for that wiki happened in the time period
[16:17:36] probably not, but it would be fun :)
[16:17:43] :)
[16:17:51] I was thinking about something with async futures, and a quota for the number of shards that can be reindexing/backfilling in parallel
[16:20:04] can't remember how the elastic reindexer works, do we tune it in any way?
[16:20:59] we do a few tweaks regarding replicas and refresh rates while reindexing, but that's about it.
[16:22:31] so the limit would be on the number of reindex tasks running, which can be approximated by the number of shards of the input index?
[16:22:37] i also realized while writing this...i have some error handling in the wrong order :P If the script bails out and you re-run a reindex it will first reindex the wiki, then bail because the backfill isn't in a good state. will fix
[16:37:22] for performing the http requests for the checker, should i just dupe what we are doing with fetching? We will probably need parallel async requests (although not nearly as many)
[16:40:46] for example, my calcs say that with a batch size of 100 commonswiki needs a batch every 800ms
[16:40:58] ebernhardson: depends, for the mw-enrichment they just batch synchronously and it's ok
[16:41:43] dcausse: does that still result in parallel requests? I'm expecting we need to engage multiple application server threads
[16:42:06] no, unless you use flink parallelism
[16:42:22] hmm, i was trying to avoid flink parallelism here because then i have to decide how to partition the wikis :P
[16:42:35] it's possible though, much of the source supports parallelism i guess.
[16:42:41] yes, and also that requires fine-tuning all the operator parallelism
[16:43:05] i guess the parallelism doesn't have to come from the source, the source could emit one thing and we shuffle it (or whatever flink calls it)
[16:43:06] here we just use the default parallelism everywhere
[16:43:56] generally for throughput and http calls the async operator should be preferred, but that implies a small state for in-flight requests and thus some serialization bits to write
[16:44:15] so what's more complicated...async http or managing parallelism, and i guess i wonder how much we need. The CheckerJob runs at a parallelism of 20-40, but it doesn't spread out
[16:45:07] i guess async seems more appropriate
[16:45:37] yes I think so... here we don't really need an error side output, so might be simpler than what we have?
[16:45:56] perhaps a little bit
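A minimal sketch of the async-operator option being weighed above (16:43:56-16:45:37), assuming Flink's async I/O API and a Java 11 HttpClient; the class name, the batch-URL input, and the timeout/capacity values are placeholders, not the real SUP code. The capacity argument is the cap on in-flight requests, which is also the small per-request state that has to be checkpointed. For scale, the rate quoted at 16:40:46 (a 100-document batch every 800ms) works out to roughly 125 documents per second for commonswiki, so a modest capacity should suffice.

```java
// Sketch only: an async HTTP fetch stage for the checker, modeled on Flink's
// async I/O operator. Names, URL shape, and tuning values are hypothetical.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class CheckerHttpFetch extends RichAsyncFunction<String, String> {
    private transient HttpClient client;

    @Override
    public void open(Configuration parameters) {
        // One shared client per subtask; requests run on its internal executor.
        client = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String batchUrl, ResultFuture<String> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(batchUrl)).GET().build();
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
              .thenAccept(resp -> resultFuture.complete(Collections.singleton(resp.body())))
              .exceptionally(err -> {
                  // No error side output here; fail the record and let Flink's restart handle it.
                  resultFuture.completeExceptionally(err);
                  return null;
              });
    }

    public static DataStream<String> attach(DataStream<String> batchUrls) {
        // capacity=20 bounds in-flight requests; that bounded buffer is the
        // operator state that gets checkpointed.
        return AsyncDataStream.unorderedWait(
                batchUrls, new CheckerHttpFetch(), 30, TimeUnit.SECONDS, 20);
    }
}
```

unorderedWait emits results as they complete, which is the higher-throughput choice; orderedWait preserves input order at the cost of head-of-line blocking.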
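And for the kafkacat + jq idea at 16:17:23 (checking whether any update events for a wiki landed during its reindex window), a rough sketch with the plain Kafka consumer API; the topic name, the wiki_id field, and the bootstrap servers are placeholders and would need to match whatever stream the reindexer actually cares about.

```java
// Sketch only: did any update events for a given wiki land between two timestamps?
// Topic name, wiki_id field, and bootstrap servers are placeholders.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;

public class WikiEventCheck {
    public static boolean anyEventsFor(String wiki, long startMs, long endMs) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            String topic = "cirrussearch.update_pipeline.update";  // placeholder topic
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Jump each partition to the first offset at or after the reindex start time.
            Map<TopicPartition, Long> byTime = new HashMap<>();
            partitions.forEach(tp -> byTime.put(tp, startMs));
            consumer.offsetsForTimes(byTime).forEach((tp, offset) -> {
                if (offset != null) consumer.seek(tp, offset.offset());
                else consumer.seekToEnd(Collections.singleton(tp));  // nothing after startMs
            });

            // Scan forward; stop at the first match or once we pass the reindex end time.
            // Good enough for a quick check; a careful version would track partitions separately.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) return false;
                for (ConsumerRecord<String, String> rec : records) {
                    if (rec.timestamp() > endMs) return false;
                    if (rec.value().contains("\"wiki_id\":\"" + wiki + "\"")) return true;
                }
            }
        }
    }
}
```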
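Likewise, the "few tweaks" from 16:20:59 usually amount to something like the sketch below: drop replicas and disable the refresh interval for the duration of the reindex, then restore them. Host, index name, and the restored values are placeholders; the real reindexer presumably drives this from its own tooling rather than a standalone client like this.

```java
// Sketch only: the usual replica/refresh dance around a reindex. Host, index,
// and restored values are placeholders.
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ReindexSettingsTweaks {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            String index = "testwiki_content_1234";  // placeholder target index

            // Before: no replicas, no periodic refresh, so the bulk writes go faster.
            Request tune = new Request("PUT", "/" + index + "/_settings");
            tune.setJsonEntity("{\"index\":{\"number_of_replicas\":0,\"refresh_interval\":\"-1\"}}");
            client.performRequest(tune);

            // ... reindex runs here ...

            // After: restore replicas and a normal refresh interval.
            Request restore = new Request("PUT", "/" + index + "/_settings");
            restore.setJsonEntity("{\"index\":{\"number_of_replicas\":1,\"refresh_interval\":\"1s\"}}");
            client.performRequest(restore);
        }
    }
}
```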
[17:24:51] ebernhardson & dcausse: I didn't see you discussing reindexing here.. I have been heads down on monitoring the reindexing for a few hours now.. Be careful comparing my runtime to yours, Erik. I only ran the wikis starting with a, let both finish, then I took a break to estimate some stuff, then restarted wikis starting with b through d (skipping commons).
[17:24:56] I was running batches so things would be easier to pause, and so that eqiad and codfw would be in similar states if anything went haywire. So far, though, codfw has been reindexing much, much faster than eqiad.
[17:25:01] A typical wiki takes 15% longer on eqiad, though the largest wiki in the a's, arwiki, took a little more than twice as long, so all the a's together took almost exactly 1.5x on eqiad (13.8 vs 20.6 hours).
[17:25:06] I'm not sure if it's something about the larger wikis that slows them down.. bigger index means more swapping or whatever. The tiny wikis (based on diffs in timestamps of the log files) all take the same amount of time (~0.5 min on codfw, ~0.55 min on eqiad), so the startup overhead dominates there.
[17:26:48] Trey314159: due to hardware switches, the codfw cluster has 69 nodes, and eqiad has 50. codfw will be back down to 50 soon-ish
[17:26:53] but that might explain codfw working faster
[17:27:05] Hmmm.. interesting!
[17:27:51] That makes everything in my spreadsheet a bit suspect!
[17:29:21] yea, probably a bit
[17:33:28] back
[17:40:09] !log bking@prometheus1006 reload prometheus service as part of troubleshooting T358029
[17:40:12] inflatador: Not expecting to hear !log here
[17:40:13] T358029: Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team - https://phabricator.wikimedia.org/T358029
[17:56:02] ebernhardson just saw your update on T359136 , if you want to revert the CCS settings LMK. If that is indeed what fixed the dupes, they should show up again right away
[17:56:03] T359136: Global-search is showing duplicate results - https://phabricator.wikimedia.org/T359136
[17:59:40] inflatador: no it's fine, if it's not showing problems now then everything should be reasonable
[18:02:59] ACK, will work on getting some monitoring up for that
[18:03:13] * inflatador did not realize the importance of codesearch
[18:21:17] dinner
[18:43:19] lunch, back in time for pairing
[19:18:21] back
[19:35:18] ebernhardson we're in if you feel like discussing the SUP notifications in #wikimedia-operations. Not urgent or anything though
[19:55:50] inflatador: doh! i was distracted. can show up now if available
[20:30:45] created T359213 for the flink alerts
[20:30:47] T359213: Adapt Flink-related rdf-streaming-updater alerts for Cirrus Streaming Updater - https://phabricator.wikimedia.org/T359213
[22:26:45] * ebernhardson hmms...a 2 minute backfill has been running for 20 minutes
[22:35:08] we're getting a few alerts for Puppet on the wdqs graph split hosts...going to set a new downtime
[22:38:22] err, 2 hours 20 minutes :S From logs it looks like it made 4 checkpoints, then lost connection to zk. On reconnect it shut down all the tasks, redeployed them from the 4th checkpoint, and just kinda didn't finish. Not really sure what happened :S
[22:41:11] ebernhardson: ryankemper rebooted ZK hosts while we were on the call earlier...didn't lose cluster quorum AFAIK. Hmm
[22:45:06] heading over to the flink dashboard to take a look at rdf-streaming-updater
[22:45:58] ebernhardson: inflatador: can we just restart the backfill job, or is there manual cleanup we need?
[22:49:06] uptime matches our reboots https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata&var-flink_job_name=WDQS_Streaming_Updater&var-operator_name=All
[22:53:33] reboot time ~2000 UTC. Going AFK but logstash might have something https://logstash.wikimedia.org/goto/c3797f6cdacc79b106ddf01c89a70cc9
[23:22:25] back. Not sure what to think about the ZK stuff. ZK is pretty low-touch in my experience...but I'm def not an expert. Might ask b-rouberol for his opinion tomorrow
[23:27:47] OK, made a ticket so I don't forget ( T359226 )... feel free to update if you find anything interesting, otherwise I'll take a closer look tomorrow
[23:27:48] T359226: Investigate Zookeeper failure modes for flink operator applications - https://phabricator.wikimedia.org/T359226