[06:02:23] ryankemper: looking at the number of triples before the incident there were still ~30k triples diff, it's probable that eqiad & codfw drifted a bit over time, we don't guarantee that both DCs receive the same stream of data (two separate and independent flink jobs are running)
[06:04:21] re throttling of monitoring queries: I don't think we do anything special regarding these queries, they might go through the throttling filter I suppose, still a bit surprising that they get throttled tho...
[06:21:02] They definitely go through the filter per https://logstash.wikimedia.org/goto/97818dbfb41e586fa27edc55f4891d6f
[06:21:37] But yeah I still don't have a great idea of why they get throttled, especially in codfw
[09:31:50] errand
[10:13:36] lunch
[12:58:48] gehel forgot I have a doctor visit this morning...cancelled our 1x1. Should be back in ~2h
[14:14:59] o/
[14:28:51] o/
[15:01:22] \o
[15:01:52] o/
[15:02:30] o/
[15:03:46] \o
[15:29:26] .o/
[16:13:11] ebernhardson I wanna say we were talking about missing logs last week? This might be part of the problem https://phabricator.wikimedia.org/T357616
[16:13:49] inflatador: hmm, i can certainly say i get different logs in kubectl than in logstash for the streaming apps
[16:18:19] yea, still not getting them. So for example flink-app-consumer-cloudelastic-taskmanager-1-1 has emitted at least 10 warns in the last 15 minutes, but logstash has none
[16:19:39] interesting...I'm still in the K8s sig mtg but will take a look once we get out
[17:07:16] ebernhardson seeing similar stuff on the SUP's k8s node as noted in https://phabricator.wikimedia.org/T357616#9735153
[17:10:58] OK, restarted rsyslog on SUP's k8s node (mw1461)...let's check logstash again
[17:13:16] looks like stuff is coming thru now, but LMK if not
[17:26:49] inflatador: i see logs from envoy, but not from our app.
There should be some messages about 504 gateway timeout
[17:26:56] but from our side, not envoy
[17:31:16] ebernhardson maybe I'm looking in the wrong place? I see logs from the job manager at least https://logstash.wikimedia.org/goto/54f16bc3933666ac5400856cd1465d3e
[17:33:58] dinner
[17:35:29] inflatador: oh, i'm looking at 'App Logs (Kubernetes)' and you are on 'App Logs - K8s - Flink ECS'
[17:35:39] why aren't we found in the normal logs?
[17:36:35] also that dashboard doesn't seem linked from the logstash homepage (where i usually click through to various dashboards)
[17:39:31] yeah, I wish I could recall how I found that one. Probably from d-causse. I wonder if I can add it to the homepage
[17:41:52] * inflatador always feels like I'm missing something when I use logstash
[17:45:16] lunch, back in time for pairing
[17:49:06] * ebernhardson suspects the searchsatisfaction data is wholly incomplete... < 10k fulltext clickthroughs in an hour. Survival analysis looks almost the same as one from 2016. Which tells me this graph isn't very clear :P
[17:51:04] compare https://phabricator.wikimedia.org/F48313867 vs page 4 of https://upload.wikimedia.org/wikipedia/commons/3/3b/Swap2and3_Search_Test_Analysis.pdf
[18:18:03] back
[18:30:55] Blubber drives me nuts. I'm trying to work around the ca-certificates issue by splitting the build into two stages, but blubber/docker buildkit starts to parallelise presumably unrelated build steps. Hence, I cannot install a GPG key first to install from another apt source in a later stage.
[18:47:14] oh wonderful
[20:44:44] * ebernhardson assumes either the data or the analysis is wrong. it finds 97.58% of search sessions with clickthroughs are successful (submitted a checkin >= 10s)
[20:51:18] i wonder if 10s is simply being too generous, i'm not really seeing the same bimodal distribution i remember in the past.
Mostly a smooth decline in checkins, except that somehow there are 2x as many checkins at 60s as at 50s, even though you have to fire the 50s event to fire the 60s one
[20:53:07] although i'm also realizing, this checkin code is all wrong :( It has checkins at [10, 20, 30, ...], which is supposed to be time since page load. But what actually happens is sleep 10s, fire event, sleep 20s, fire event, sleep 30s, fire event.
[20:54:13] basically this data is a mess :P
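[Editor's note] The cumulative-delay bug ebernhardson describes at 20:53:07 can be sketched as follows. This is a hypothetical illustration in Python (the real SearchSatisfaction instrumentation is client-side JavaScript, and the function names here are invented): each checkin label is meant to be an offset from page load, but sleeping each delay back-to-back makes the fire times accumulate, so the "30s" checkin actually fires 60 seconds in.

```python
# Hypothetical sketch of the checkin-timing bug, assuming a
# [10, 20, 30, ...] schedule as described in the chat.

CHECKIN_SCHEDULE = [10, 20, 30, 40, 50, 60]

def intended_fire_times(schedule):
    # Intended behavior: each checkin fires at its labelled offset
    # from page load (10s, 20s, 30s, ...).
    return list(schedule)

def buggy_fire_times(schedule):
    # Actual behavior described: sleep 10s, fire; sleep 20s, fire;
    # sleep 30s, fire. Each sleep starts only after the previous
    # event, so fire times accumulate.
    fire_times, elapsed = [], 0
    for delay in schedule:
        elapsed += delay  # sleep(delay) after the previous event
        fire_times.append(elapsed)
    return fire_times

print(intended_fire_times(CHECKIN_SCHEDULE))  # [10, 20, 30, 40, 50, 60]
print(buggy_fire_times(CHECKIN_SCHEDULE))     # [10, 30, 60, 100, 150, 210]
```

Under this reading, an event labelled "30s" was really recorded a full minute after page load, which would skew any survival analysis that treats the labels as true dwell times.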