[07:04:33] o/
[07:34:04] o/
[09:39:16] errand+lunch
[13:12:37] o/
[13:46:31] \o
[13:46:48] o/
[13:46:50] hmm, two out of three re-indexes claim completion
[13:47:15] not too bad :)
[13:47:44] might be a few failures, i apparently got rid of the final report before exiting, but not terrible
[13:48:03] hmm, the last one is waiting on brwikisource_content for 2d 18h...i suspect that one is probably done :P
[13:57:40] hmm, best guess is it's because brwikisource_content_nnnn is red, and we are waiting for green
[13:57:58] i guess i probably don't want it to auto-fix that by nuking the red index
[13:58:50] we have a red index in production? :(
[13:59:17] inflatador_: not anymore, i just deleted it :P It was from reindexing though, we reindex into 0 replica indices which makes it easy(-ier) to go red
[14:00:05] Yeah, we run into that a lot. Although I thought we had alerts for red status
[14:00:16] * inflatador_ should probably make one
[14:13:13] annoyingly, the reindexer that finished is with the old-style backfill. I reworked the backfiller into the state transitions but separately made the version installed on deployment host work
[14:13:16] which to keep? :P
[14:13:54] the new one requires a small modification to the flink chart, to add a `comment` field like mwscript that we can inject an orchestration id into
[14:28:12] I'm on my hotspot until the cable repair folks get here, hopefully soon
[14:51:12] OK! Looks like they fixed the Internet
[15:00:55] Trey314159 dcausse ebernhardson I'm in a maintenance, will not make standup
[15:31:50] Looks like we are getting logs into logstash again. As we discussed last week, the shape is not ideal, but it's a start: https://logstash.wikimedia.org/goto/36d80c81763008234ddacc36c56db14c
[15:33:28] success!
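A minimal sketch of the red-status check discussed above (13:57 to 14:00), assuming the JSON shape returned by the cluster health API with `level=indices`: a top-level `status` plus an `indices` map whose entries each carry their own `status`. The function name `find_red_indices` and the sample payload are hypothetical and are not taken from the existing checks or alerts.

```python
def find_red_indices(health: dict) -> list[str]:
    """Return the names of indices reported as red, sorted for stable output."""
    indices = health.get("indices", {})
    return sorted(
        name for name, info in indices.items() if info.get("status") == "red"
    )


# Made-up payload for illustration only; the index names are placeholders.
sample = {
    "status": "red",
    "indices": {
        "example_content_1": {"status": "red"},
        "example_general_1": {"status": "green"},
    },
}

print(find_red_indices(sample))
```

An alert built on this would fire whenever the list is non-empty. Whether to exempt in-progress reindex targets (which run with 0 replicas) is left open, since the chat does not settle it.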
[16:40:37] OK, kicking off CODFW restart
[17:01:09] forgot to mention it in the triage meeting: don't forget to do your stats homework
[17:06:29] Looks like we ignore ES cluster status in our one check that I've found so far: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/monitoring/base_checks.pp#18
[17:06:52] which is fine, but I'm wondering if we should have another one for red...or maybe we do? Haven't found it yet
[17:09:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/opensearch.yaml there are alerts for mainline SRE, but not for us. I'll get a ticket started
[17:12:36] dinner
[18:00:53] Making lunch, will be away for 15m. codfw is about 3/4 done. cc: ryankemper
[18:01:50] inflatador_: ack
[18:35:40] Looks like CODFW finished OK. Taking a short break, will ping when I restart eqiad
[19:15:42] back, starting on eqiad cirrussearch now. I just depooled
[19:18:23] hmm, things mostly work but jawiki_general fails with [{"index":"jawiki_general_1755143237","type":"_doc","id":"5112606","cause":{"type":"execution_exception","reason":"execution_exception: java.lang.IndexOutOfBoundsException: Index -1024 out of bounds for length 87966","caused_by":{"type":"index_out_of_bounds_exception","reason":"index_out_of_bounds_exception: Index -1024 out of bounds
[19:18:25] for length 87966"}},"status":500}]
[19:19:54] looks to be https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:%E4%B8%87%E6%AD%B3%E5%B8%9D%E5%9B%BD translated as `User:Banzai Empire`
[19:20:18] Trey314159: looks like ja analysis might still not like emoji ^^
[19:20:38] ebernhardson: looking
[19:20:48] personally, i would change the page :P But we have to accept insanity
[19:25:18] sadly not seeing a better stack trace in the log files
[19:26:33] heh, the template slapped on it says "The removal of some or all versions of this user page is currently under consideration, in accordance with our removal policy."
[19:31:09] per d.causse suggestion from last week, if we lose cluster quorum I'm gonna go ahead and let it ride for 15m
[19:45:47] looks like reindex almost done. Failed jawiki_general on all clusters, and needs one more round for nlwiki_general in codfw (it kept restarting due to the cluster restart and hit retry limit)
[19:45:55] otherwise, everything else is done
[19:48:29] ebernhardson: it's definitely the sudachi tokenizer that's falling down. 16K emojis each expanded to 12 characters internally is too much. Oddly, it breaks up long strings of ascii characters okay.
[19:49:22] Trey314159: not the end of the world, but does make searching ja a little surprising since it has the different analysis chains on content/general :) I suppose it also means there is opportunity to see the same error in commons/wikidata. Probably not a rush, but should make a ticket i guess?
[19:54:31] https://phabricator.wikimedia.org/T402220
[20:55:02] ebernhardson: I was finally able to successfully run an end-to-end AirFlow/Spark/Kafka roundtrip, but that requires a tweaked version of discolytics. Could you have a look at https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/54. Once that is released, I can finalise the DAG artifacts.
[20:56:33] pfischer: sure. I'm about to leave for a school run in ~5 minutes, but will look after
[21:28:36] ebernhardson: thanks!
[21:37:49] OK, so we have our expected quorum breakage
[21:38:01] success? :P
[21:38:22] success in that we have logs for the failure now ;P https://logstash.wikimedia.org/goto/5e24483de76a37e0fdf925e1cedcef32
[21:38:42] nice
[21:41:29] looks like it's been down for ~20m. I'm ready to fix it but LMK if y'all think it would be better to hang on a little longer cc: ryankemper ebernhardson
[21:41:48] i'd probably lean on getting it back to a happy state
[21:47:54] welp...it just recovered on its own
[21:48:28] takes too long, but promising!
[22:04:15] Didn't quite finish eqiad today, we'll pick it up again tomorrow. We've repooled/re-enabled shard allocations
[22:12:20] ryankemper heading out, but wdqs1011 just alerted so I restarted its BG. I depooled it but you might wanna keep an eye on https://grafana.wikimedia.org/goto/Tj9i5muNR?orgId=1 and repool once it goes back down
[22:12:37] ack
[23:52:12] looks like there is a not-too-expensive hack to chop really long strings of non-Japanese characters into manageable chunks.... more tomorrow
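The chat does not say what the hack looks like, so the following is only a rough sketch of the general idea: break any overly long run of non-Japanese characters by inserting spaces before the text reaches the tokenizer. Everything here is an assumption, including the function name `chop_long_runs`, the `max_len` default, and the choice to treat hiragana, katakana, and CJK ideographs as "Japanese". The real fix would presumably live in the analysis chain rather than in Python.

```python
import re

# Assumption: "Japanese" means hiragana, katakana, and CJK ideographs.
# A run is any stretch of characters outside those ranges with no whitespace.
NON_JA_RUN = re.compile(r"[^\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\s]+")


def chop_long_runs(text: str, max_len: int = 256) -> str:
    """Insert a space every max_len code points inside long non-Japanese runs."""

    def split(match: re.Match) -> str:
        run = match.group(0)
        if len(run) <= max_len:
            return run  # short runs pass through untouched
        # Slicing is per code point, so an emoji made of several code points
        # (e.g. a ZWJ sequence) could still be cut in the middle.
        return " ".join(run[i:i + max_len] for i in range(0, len(run), max_len))

    return NON_JA_RUN.sub(split, text)


print(chop_long_runs("a" * 10, max_len=4))
```

Inserting spaces shifts character offsets, so anything that relies on positions in the original text (highlighting, for instance) would need the offsets mapped back.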