[10:42:03] <gehel>	 dcausse: thanks for the examples of how to split WDQS queries! https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples
[10:51:48] <dcausse>	 lunch
[14:01:38] <dcausse>	 errand
[14:24:54] <inflatador>	 <o/
[14:40:17] <inflatador>	 FYI, we had an incident where mailman was down for a few days ( T358020 ), which means we didn't get any email-based alerts . This is going to move up the migration timetable for mailman -> Google groups. No action necessary for anyone at this point, but just an FYI
[14:40:17] <stashbot>	 T358020: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020
[15:31:03] <dcausse>	 o/
[16:03:14] <ebernhardson>	 \o
[16:04:05] <ebernhardson>	 janis had a curious question, re backfilling. " Isn't this what a Flink session cluster is for? Having just one Jobmanager that controls multiple Jobs (e.g. the generic one plus backfill) that can be submitted at runtime?"
[16:04:24] <ebernhardson>	 i suppose i was ignoring the session cluster because i thought it would be easier if everything was run the same way. But is there a reasonable idea there?
[16:04:50] <ebernhardson>	 i wouldn't put the generic one there though, only the backfill
[16:49:09] <dcausse>	 indeed, perhaps it's possible? the flink-k8s-operator might allow creating a session cluster since I see that a "FlinkSessionJob" resource exists
[16:56:06] <dcausse>	 but if deploying a FlinkSessionJob via helm has similar drawbacks than a classic FlinkDeployment I'm unsure what's the benefit 
[16:57:27] <dcausse>	 and deploying the job directly to the flink API with something a la https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/983452/1/flink/flink-job.py then we have to find a way to pull the job config from the helm values file 
[16:57:34] <inflatador>	 interesting idea
[16:58:41] <inflatador>	 headed to the doc, back in ~60-90
[17:02:50] <gehel>	 heading out, enjoy the weekend!
[17:18:24] <dcausse>	 heading out too, have a nice week-end
[18:24:31] <inflatador>	 back
[18:54:02] <inflatador>	 Looking at T355795 ... does anyone know where to find the Elastica errors in logstash? Specifically this https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/890ea5ff1acc9bac57aa8bf08b9008a1e8ebe469/includes/ElasticaErrorHandler.php#189 
[18:54:03] <stashbot>	 T355795: Fix "requests triggering circuit breakers" Elastic alert - https://phabricator.wikimedia.org/T355795
[18:55:14] <ebernhardson>	 inflatador: you are unlikely to find any of those errors. IIRC we had them when we initially deployed elastic 7, but once we got the configuration squared away we haven't seen them since
[18:56:16] <ebernhardson>	 i forget what exactly we changed, but 7 didn't like how we were configured in 6 and we had to adjust something to prevent the memory issues
[19:02:29] <inflatador>	 ebernhardson ACK. Do you think we still need to alert on them? Regardless, I'd still like to know where the Elastica errors live in Logstash
[19:04:00] <ebernhardson>	 inflatador: it's a curious thing. If those errors are occuring it's a pretty bad thing, we would want to know. We look to have solved the issue and never see them, but would want to know if they come back. We would probably have less direct warnings (high level of failed queries i guess)
[19:04:11] <ebernhardson>	 inflatador: in general, the elastica logs are all in logstash with the general mediawiki logging
[19:04:20] <ebernhardson>	 sec finding a link
[19:05:01] <inflatador>	 ebernhardson ACK, I was looking there but my kibana/lucene querying skills aren't up to snuff
[19:05:34] <inflatador>	 was looking at https://logstash.wikimedia.org/goto/1b586ba9acee9941a75791f108d8035d
[19:08:56] <ebernhardson>	 inflatador: That's generally the right place.  You can filter with `channel:CirrusSearch AND "backend error"`. I was trying to find a better filter, but it's being elusive
[19:10:32] <ebernhardson>	 The records it responds with, of the format 'Search backend error during {logType} after {tookMs}: {error_message}', should have a cirrus_error_type field which contains the result of the error classification you linked earlier. That field sadly doesn't seem to be searchable though
[19:11:22] <ebernhardson>	 would could plausibly rename that to something other than cirrus_error_type that is searchable (thats the point of ECS i suppose, a common schema). But would have to look into what exactly
[19:12:59] <ebernhardson>	 that error is generated by our generic "elasticsearch intermediary" which utilized for (almost?) every request between cirrus and elasticsearch
[19:15:32] <ebernhardson>	 curiously, there are some messages about missing wikinews / wikisource indices in there right now :S
[19:15:33] <inflatador>	 Thanks, that's helpful. Sounds like I need to learn more about ECS
[19:16:42] <ebernhardson>	 it makes me wonder if we should also alert on the ones we call config_issue
[19:17:01] <ebernhardson>	 i would have to review all the occurances, but it sounds like something that requires manual intervention :)
[19:17:19] <ebernhardson>	 alert would be a bit strong maybe...phab task?
[19:17:26] <inflatador>	 Sounds plausible
[19:18:01] <inflatador>	 team-sre/mediawiki.yaml in the alerts repo has some alerts based on number of errors reported by logstash
[19:18:07] <ebernhardson>	 but i guess i better look into why these indexes are reporting missing then :P
[19:18:52] <inflatador>	 cool, I'll add a bit about logging for CirrusSearch
[19:19:00] <inflatador>	 to the search wikitech page that is
[19:23:18] <ebernhardson>	 something up with cross-cluster search
[19:29:47] <inflatador>	 related to the cloudelastic migration?
[19:32:31] <inflatador>	 `curl -XGET https://cloudelastic.wikimedia.org:8243/omega:ttwikiquote_general/_search?q=example?pretty` worked for me, the example on https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_elasticsearch_replicas did not
[19:40:08] <ebernhardson>	 inflatador: not cloudelastic, but the errors i'm seeing about an index not existing only seem to come from cross-wiki queries, not the wiki itself
[19:59:11] <ebernhardson>	 i dunno..having a hard time reproducing. Tried running a cross-cluster search against one of the ones that logs a failure against all servers pybal lists for the cluster, but they al
[19:59:17] <ebernhardson>	 but they are clearly happening
[20:06:41] <ebernhardson>	 ran a couple hundred simple cross-cluster searches against every node in the cluster, no failures :S
[20:09:21] <ebernhardson>	 oh well..not finding anything and it's not critical, ignoring :(
[20:19:31] <inflatador>	 only so much time in a day ;) . Will look into creating phab tasks for that issue, though
[20:26:56] <inflatador>	 are those missing index errors in the CirrusSearch channel? I can't seem to find them
[20:27:34] <ebernhardson>	 inflatador: try https://logstash.wikimedia.org/goto/d053028b6cfa12e73b414de242656040
[20:29:43] <inflatador>	 ebernhardson ah thanks, I was only looking at the normalized_message and not scrolling down ;(
[21:41:04] <inflatador>	 created T358389 to kick around the missing search indices stuff...I'll look at it next week
[21:41:05] <stashbot>	 T358389: Determine cause/fix cross-cluster search missing index errors - https://phabricator.wikimedia.org/T358389