[10:20:25] <gehel>	 Lunch
[10:38:05] <ejoseph>	 I can't seem to run git review
[11:19:30] <gehel>	 ejoseph: what error do you get?
[11:19:40] <ejoseph>	 no error
[11:20:02] <ejoseph>	 Just stalled
[11:20:28] <gehel>	 gerrit might be slow / stuck, it happens
[11:23:13] <gehel>	 want to jump in a meet to see if we can sort this out?
[11:46:20] <ejoseph>	 To join the video meeting, click this link: https://meet.google.com/psx-jjoj-xpv
[11:46:20] <ejoseph>	 Otherwise, to join by phone, dial +27 10 823 0593 and enter this PIN: 843 256 474#
[11:46:20] <ejoseph>	 To view more phone numbers, click this link: https://tel.meet/psx-jjoj-xpv?hs=5
[11:46:58] <ejoseph>	 gehel
[12:30:06] <tarrow>	 Hey! o/ I have some questions about running Elasticsearch / Cirrus: How long does it take for a node to "start" for you after restarting? e.g. time for all shards to be ready?
[12:31:36] <tarrow>	 We're running a small ES cluster with 388 shards (e.g 97 wikis but little data <1.7GB) and it's taking more than an hour to start. Is that normal?
[12:44:05] <gehel>	 tarrow: that seems surprisingly long. I don't have an exact number for our case (ryankemper or inflatador might have better / more recent estimates). But our restart time for a node is in minutes, not hours
[12:47:10] <gehel>	 The restart time is probably more dependent on the number of shards than on the overall data size. Not sure what your number of shards per index is (given 388 shards for 97 wikis, I suppose you have 1 primary + 1 replica and 2 shards per index?). You might want to try to reduce the number of shards per index. But I doubt it will make any significant difference if your restart time is in hours.
[13:09:09] <inflatador>	 greetings
[13:40:53] <inflatador>	 tarrow I would also agree that is a really long time. Our main cluster has ~1500 shards and ~600 GB of data, 1 Gbps ethernet and it takes 10 minutes at most, usually less
[13:41:30] <tarrow>	 We have 4 shards, one per index
[13:41:39] <tarrow>	 4 shards per wiki*
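The shard figures quoted above are consistent with each other. A minimal arithmetic sketch in Python, using only the numbers from the conversation (388 shards, 97 wikis, 4 single-shard indices per wiki); the variable names are illustrative, and the log does not say whether the 388 figure includes replicas:

```python
# Figures quoted in the conversation above.
total_shards = 388
wikis = 97

# Shards per wiki, plus any remainder if the numbers did not divide evenly.
shards_per_wiki, remainder = divmod(total_shards, wikis)

# tarrow: "4 shards per wiki", "one per index", so 4 indices per wiki.
indices_per_wiki = 4
shards_per_index = shards_per_wiki // indices_per_wiki

print(shards_per_wiki, remainder, shards_per_index)
```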
[13:49:27] <addshore>	 and also currently a 3 master / primary setup I think
[14:00:49] <cbogen_>	 ebernhardson: do you have what you need for T304954 and T305851? Hoping to get them wrapped up this week
[14:00:49] <stashbot>	 T305851: Import has-suggestions flags to search indices - https://phabricator.wikimedia.org/T305851
[14:00:50] <stashbot>	 T304954: Import data from hdfs to commonswiki_file - https://phabricator.wikimedia.org/T304954
[14:35:41] <inflatador>	 Just added AAAA records to relforge100[34].eqiad.wmnet. I don't think there should be a problem, but do let me know if there is
[14:45:11] <tarrow>	 do you ever manually trigger flushes of the indices? My googling suggested that this might have some impact on startup time. 
[14:45:46] <tarrow>	 Right now (while I'm very slowly waiting for it to start up) I see lots of things like `281 21.7s URGENT shard-started StartedShardEntry{shardId [[mwdb_wbstack_7c328819fe_general_first][0]], allocationId [hb3RW3aBRDeONdMLHddAfw], primary term [12], message [after existing store recovery; bootstrap_history_uuid=false]}` in the pending tasks
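To see which indices are still queued, lines like the one quoted above can be broken into fields. The following is a rough Python sketch: it assumes the four leading columns are insert order, time in queue, priority, and source (the layout the line appears to follow), and the function and field names are hypothetical, not part of any Elasticsearch client.

```python
import re

# Sample line copied from the message above.
LINE = (
    "281 21.7s URGENT shard-started StartedShardEntry{shardId "
    "[[mwdb_wbstack_7c328819fe_general_first][0]], allocationId "
    "[hb3RW3aBRDeONdMLHddAfw], primary term [12], message "
    "[after existing store recovery; bootstrap_history_uuid=false]}"
)

# Assumed layout: <order> <time queued> <priority> <source> ... shardId [[index][n]]
PATTERN = re.compile(
    r"^(?P<order>\d+)\s+(?P<queued>\S+)\s+(?P<priority>\S+)\s+(?P<source>\S+)"
    r".*?shardId \[\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\]"
)


def parse_pending_task(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["order"] = int(fields["order"])
    fields["shard"] = int(fields["shard"])
    return fields


task = parse_pending_task(LINE)
print(task["index"], task["shard"], task["queued"], task["priority"])
```

Counting the parsed entries per index over time would give a rough view of how quickly the queue of shard-started tasks drains.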
[14:58:03] <ebernhardson>	 cbogen_: i still need to find out if there is a second dataset to import, or what's going on there
[14:58:13] <cbogen_>	 okay, I pinged Cormac to make sure you get an answer
[15:01:37] <gehel>	 triage meeting: https://meet.google.com/eki-rafx-cxi
[15:01:52] <gehel>	 cc: inflatador, ryankemper
[15:45:51] <ebernhardson>	 random thoughts, if the SERP is trying to justify why a search result is there, should popularity be represented somehow in the results UI?  But how do you represent that it isn't a linear "more popular is better"?
[17:05:43] <inflatador>	 FYI, I just added AAAA records to elastic202[5-9]. I don't anticipate any issues, but I'm watching the percentiles dashboard and https://config-master.wikimedia.org/pybal/codfw/search-https just in case. Let me know ASAP if you notice anything amiss
[17:40:35] <inflatador>	 OK, it's been 30 minutes and the DNS changes have propagated. No issues I'm aware of based on looking at cluster health, load balanced pool, and dashboard. I'm going to fast-track adding the rest of the records in CODFW once I get back from lunch. Please let me know if you have any objections.
[18:17:58] <inflatador>	 back
[18:18:45] <ryankemper>	 inflatador: great! no objections here
[18:22:19] <inflatador>	 ryankemper cool, will update the ticket shortly
[18:28:03] <gehel>	 ryankemper: I might be 2' late
[18:29:14] <gehel>	 ebernhardson: what is SERP ?
[18:32:17] <ryankemper>	 gehel: search engine results page IIRC
[18:32:26] <gehel>	 right, that makes sense!
[18:40:27] <ebernhardson>	 hmm, airflow filled its default of 16 executors, which means new things don't get executed until something finishes. But they are almost entirely wait_for_* tasks. I wonder if we should switch those all to re-execute on poke instead of sleeping
[18:40:57] <ebernhardson>	 for the moment i increased it to 24, will keep things going for now
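A toy Python model of the slot exhaustion described above, not Airflow code: `free_slots` is a hypothetical helper, and the count of 16 waiting tasks is an assumption (the log only says the slots were "almost entirely" wait_for_* tasks). Airflow sensors do accept a `mode` argument ("poke" or "reschedule"), which looks like the built-in equivalent of re-executing on each check, though the log does not name it.

```python
def free_slots(total_slots, waiting_sensors, hold_slot_while_waiting):
    """Slots left for real work while sensors are between checks."""
    held = waiting_sensors if hold_slot_while_waiting else 0
    return max(total_slots - held, 0)


# Sensors sleep inside their slot: nothing else can start.
print(free_slots(16, 16, True))
# Sensors give the slot back between checks: all slots stay available.
print(free_slots(16, 16, False))
# The temporary bump to 24 slots, with sensors still sleeping in place.
print(free_slots(24, 16, True))
```

The trade-off in this model is that releasing the slot costs a fresh task start on every check, so it suits long waits with infrequent checks better than short ones.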
[19:34:09] <inflatador>	 quick break, back in ~20
[23:20:55] <ebernhardson>	 ejoseph: hopefully these instructions will work, this documents how i set up mw docker + docker-compose + phpstorm 2022.1 to run tests from phpstorm, and be able to set breakpoints and step through the code: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/797598