[07:28:17] Errand, back in a few
[13:12:28] greetings
[13:20:50] o/
[13:21:11] ejoseph: around? Would you have time to demo T309689 and T259442?
[13:21:15] T309689: Inappropriate error message when search users don't provide a regular expression to insource - https://phabricator.wikimedia.org/T309689
[13:21:15] T259442: insource and intitle regular expression search doesn't allow final escaped slash - https://phabricator.wikimedia.org/T259442
[13:21:48] yes i am
[13:21:49] as discussed in our previous retro, I'm trying to test that things are fixed before closing tickets
[13:22:04] ejoseph: meet.google.com/snj-utht-pnz
[13:23:05] I can validate T309689 myself, all good!
[13:24:20] I'll join in 5 mins
[13:25:00] ack
[13:26:48] o/
[14:57:27] quick break, back in ~15
[14:58:10] hi ebernhardson, welcome back. once you've had a chance to catch up, can you take a look at Marco's comment in T304954?
[14:58:11] T304954: Import data from hdfs to commonswiki_file - https://phabricator.wikimedia.org/T304954
[15:01:02] inflatador, ryankemper: triage meeting: https://meet.google.com/eki-rafx-cxi
[16:07:34] checked jobqueue / cloudelastic related tickets, it looks like the work was successful and we are now seeing sufficient throughput writing to the cloudelastic cluster. It has the downside that edit+delete in quick succession could potentially arrive out of order, but that was a tradeoff we accepted
[16:08:02] That's good news. I bumbled my way thru a deploy last wk
[17:04:03] lots of airflow errors last week :S seems some of the input data stopped being ingested, but i still see it being produced into kafka. Looking into why
[17:07:42] cirrus also seems to have been alerting about high failure rates, ran my `check_cross_cluster.py` script and verified the chi->omega cross-cluster connection has broken itself again. ran fix-cross-cluster.sh again and it should quiet down
[17:25:47] lunch, back in 30-45. Reimage in flight for elastic1052.eqiad.wmnet but don't foresee any problems
[17:57:23] not finding anything useful about why no data is showing up in hadoop for the search events, making a ticket and hoping analytics can find something
[18:08:54] back
[18:15:17] inflatador & ryankemper: with everything going on right now... is it a good time or not to do a little light reindexing? A couple Bengali-language wikis are not small (~1M across all pages), but not huge (enwiki is ~60M), so it *should* all finish the same day it starts. OTOH, I can wait until whenever is convenient if now is a bad time.
[18:18:12] Trey314159: reindexing can sometimes cause the cluster to go red (when a new index is created but it's not aliased to the main index yet, and doesn't have any shards). There's no data loss but it can trigger alarms and stop the reimage process
[18:18:41] So if you can wait until Weds, that'd be ideal. Otherwise we can keep an eye out for bad indices and delete as encountered
[18:19:09] No worries. I'll check the status again on Wednesday!
[18:19:14] (Thanks)
[18:19:20] well, kinda mixed up shards and indices there, but you get the drift...thanks!
[18:32:27] gehel: slightly late, will brt in a few mins
[18:32:35] ryankemper: ack
[20:02:06] tracked down (with help from analytics) what happened with various other airflow failures last week: a bad deploy of refinery caused one hour's worth of canary events to not be delivered. Without canary events in codfw most partitions are empty and thus don't get created. Airflow then waits around for partitions that aren't going to be created
[20:07:44] yay!
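The [20:02:06] message above describes DAGs waiting on hourly partitions that will never be created because the canary events were dropped. A minimal sketch of capping such a wait with a sensor timeout, assuming Airflow 2.x; the DAG id, schedule, and partition check below are hypothetical, not the actual search DAGs:

```python
# Minimal sketch, assuming Airflow 2.x. The DAG id, schedule, and partition
# check are hypothetical -- this is not the production search DAG.
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def hourly_partition_exists():
    # Placeholder: a real check would ask the metastore whether the hourly
    # partition for this data interval exists. Returning False here simulates
    # an hour whose canary events were never delivered.
    return False


with DAG(
    dag_id="wait_for_search_events",      # hypothetical
    start_date=datetime(2022, 6, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_partition = PythonSensor(
        task_id="wait_for_hourly_partition",
        python_callable=hourly_partition_exists,
        poke_interval=300,        # re-check every 5 minutes
        timeout=6 * 60 * 60,      # stop waiting after 6 hours...
        soft_fail=True,           # ...and mark the task skipped, not failed
        mode="reschedule",        # free the worker slot between pokes
    )
```

With a bounded timeout and soft_fail, a partition that is never going to appear results in a skipped task rather than a sensor that waits (and alerts) indefinitely.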
Maybe we won't get an airflow alert email every day now! [20:09:02] i also increased the SLA on the subgraph query/mapping dags from 1 day to 2 days, and undeployed the tasks that drop old data from subgraph query/mapping to reduce the spam. Hopefully less now :)
[20:09:42] i was initially letting the drop-old-data tasks keep going as a reminder to fix it ... but it's simply too spammy and clearly the reminder wasn't enough. We still have the task in the blocked column on the phab board, hopefully that's reminder enough
[20:42:31] 72% of the way to Bullseye in eqiad
[20:46:37] woot
[21:00:52] took longer than i would care to admit to figure this out (it's probably documented somewhere i didn't look), but the wdqs hosts in codfw are reporting their `schema:dateModified ?date` value as current, so they probably could be pooled if necessary
[21:01:10] but if not necessary, probably no harm leaving them unpooled
[21:22:20] school run, back in 45m or so
[22:05:23] back
[23:13:01] hmm, probably not worth doing much about, but realized the times when the mjolnir bulk loader runs slow are when it's going through a bunch of tiny wikis. It uses a bunch of parallelism when piping bulk updates into elasticsearch, but each wiki is run sequentially, and when a wiki only has 20 updates to apply the overhead of the rest of the process makes it go slowly
[23:13:12] (where slow is hundreds of updates/sec, instead of thousands)
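To illustrate the overhead described in the last two messages: a minimal sketch (not the actual mjolnir code) of a sequential per-wiki loop that uses elasticsearch-py's parallel_bulk helper within each wiki. The cluster URL, index naming, and the updates_by_wiki mapping are hypothetical; the point is that a fixed per-wiki setup cost dominates when a wiki only has a handful of updates.

```python
# Minimal sketch, not the actual mjolnir bulk loader. Cluster URL, index
# names, and the example updates are hypothetical.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")  # assumption: a local test cluster

# Hypothetical per-wiki updates: {wiki: [(doc_id, partial_doc), ...]}
updates_by_wiki = {
    "bnwiki": [(1, {"popularity_score": 0.42})],
    "enwiki": [(1, {"popularity_score": 0.17}), (2, {"popularity_score": 0.03})],
}


def apply_updates(wiki, updates):
    """Push one wiki's updates with parallel bulk requests."""
    actions = (
        {"_op_type": "update", "_index": f"{wiki}_content", "_id": doc_id, "doc": doc}
        for doc_id, doc in updates
    )
    # Plenty of parallelism *within* a wiki, but client setup, index lookups,
    # etc. are a fixed cost paid per wiki no matter how few updates it has.
    for ok, item in parallel_bulk(es, actions, thread_count=6, chunk_size=1000):
        if not ok:
            print("failed:", item)


# Wikis are processed sequentially, so a run dominated by wikis with only a
# couple dozen updates spends most of its time on per-wiki overhead -- which
# is why throughput drops to hundreds of updates/sec instead of thousands.
for wiki, updates in updates_by_wiki.items():
    apply_updates(wiki, updates)
```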