[20:13:15] Krinkle: do you know why we are not allowing sampling of Logstash logs?
[20:13:53] sampling seems like the least bad option for T395899
[20:13:53] T395899: Some subset of MediaWiki Logstash events are capped to 100/s - https://phabricator.wikimedia.org/T395899
[20:14:07] tgr: what kind of sampling? are you referring to a specific logstash or monolog feature?
[20:14:33] monolog / WMF config
[20:14:58] you can set a sample rate in $wmgMonologChannels but doing so automatically disables the Logstash backend
[20:17:21] I guess it's because we generally want to keep all errors, and for any request we have logs from, we'd want consistency: that other messages from the same reqId weren't lost/sampled away separately. That doesn't leave much. And historically that's been fine.
[20:17:38] Shared/multi-tenancy means that when ML servers overload Logstash, MW logs get trashed, it seems.
[20:18:03] I'm not sure the current sampling is entirely by design/intentional. I hadn't heard of it before.
[20:18:19] what have you observed so far? does it appear to throttle by individual message fingerprint?
[20:19:02] or only when logstash is generally under load, i.e. not constantly capped at 100/s all the time
[20:19:47] "all errors" as in: find out about new errors before deciding to sample (i.e. not holistically sample 1/N requests); for a given error signature, we will inevitably throttle at the intake if it's very noisy.
[20:20:56] individual messages for sure, I get exactly 100/sec of certain kinds of messages in Logstash
[20:21:31] on average anyway, the bucket size seems to be a few minutes
[20:21:37] the task has more details
[20:23:00] but that seems like a reasonable precaution, e.g. the session write logs would probably have caused problems if not throttled
[20:23:17] but we have this thing in the logging config: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/2e4ab14aa15bb95568f9c07dd777065901eb2126/wmf-config/logging.php#278
[20:23:27] and I don't see any reason for that
[20:24:31] https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002
[20:24:36] > for ($i = 0; $i < 2000; $i++) { $logger->info('Hello from noisy Krinkle'); if ($i % 42 === 0) { $logger->info('Hello from not-noisy Krinkle every 42nd iteration'); } }
[20:24:46] produces 2,049 messages
[20:24:51] arrived in Logstash
[20:25:22] in one 30s window
[20:25:43] but yeah, that's within a potential 3-5min burst allowance
[20:27:11] > for ($i = 0; $i < 20_000; $i++) { $logger->info('My noisy message'); if ($i % 420 === 0) { $logger->info('My not-noisy 1-in-420 message'); } }
[20:27:22] all 20,048 arrived in Logstash
[20:28:15] no idea what it depends on
[20:29:28] > for ($i = 0; $i < 80_000; $i++) { $logger->info('My other noisy message'); if ($i % 420 === 0) { $logger->info('Some other not-noisy 1-in-420 message'); } }
[20:29:35] only 30,191 made it
[20:29:54] 191 non-noise, and 30K of the 80K
[20:30:22] so your second guess is spot on
[20:30:39] 30K burst, not sure what the period is but 5min seems likely indeed
[20:39:11] posted on https://phabricator.wikimedia.org/T395899#10881727
[20:39:56] thanks for testing
[20:41:03] https://gerrit.wikimedia.org/g/operations/puppet/+/9478f801d8850d23ff056ffc02128061ee4d52d2/modules/profile/files/logstash/filters/75-filter_throttle.conf#2
[20:41:23] I don't know if this applies to these messages, I got it through a crude codesearch in Puppet for "normalized_message"
[20:41:41] that paints different numbers, but it mentions the throttler is local to each logstash instance, and we have ~3 servers
[20:42:09] This drops after 5K/5min
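For reference, a per-message 5K/5min drop like the one described above is the kind of rule the stock logstash-filter-throttle plugin expresses. The sketch below is illustrative only: the key, counts, and period are assumptions based on this chat, not copied from the deployed 75-filter_throttle.conf.

    filter {
      # Sketch only: values assumed from the discussion above, not from
      # the actual WMF 75-filter_throttle.conf.
      throttle {
        key         => "%{normalized_message}"  # one counter per message fingerprint
        after_count => 5000                     # tag everything past 5000 events...
        period      => 300                      # ...within a 300s (5 min) window
        add_tag     => [ "throttled" ]
      }
      if "throttled" in [tags] {
        drop { }                                # tagged events are discarded
      }
    }

The throttle counters live in-process, so each Logstash instance enforces its cap independently, which matters for the per-server discussion that follows.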
[20:42:48] that explains the slightly irregular sawtooth pattern
[20:43:55] If the code comments are wrong and we actually have (30/5 =) 6 logstash servers, then this might be it.
[20:44:05] And yeah, the servers that get a smaller proportion of the messages would keep going for a bit longer
[20:44:51] https://grafana.wikimedia.org/d/000000561/logstash
[20:45:16] or their 5min windows just aren't in sync
[20:45:19] I see exactly 6 instances each for "filtered", "in" and "out" there
[20:45:53] logstash1023:9198, logstash1024:9198, logstash1025:9198, logstash1030:9198, logstash1031:9198 - filtered, logstash1032:9198 - filtered
[20:46:47] tgr: right yeah, that makes more sense if we assume kafka/logstash distributes it randomly/evenly.
[20:47:11] maybe 2-3 of the servers are closely aligned, and the others slightly off
[21:05:17] duesen: can you have another look at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1153092? I think there may be a misunderstanding there of what https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1109780 intended.
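As a closing sanity check on the 30K figure from the 80K test above, here is the arithmetic in PHP, in the spirit of the shell.php snippets earlier. It assumes 6 instances each enforcing an independent 5,000-per-5-min cap with Kafka spreading events evenly; both are inferences from this chat, not verified against the deployed config.

    // Assumed: 6 Logstash instances, 5,000 events per message per 5-min window each.
    $instances   = 6;
    $perInstance = 5000;
    $globalCap   = $instances * $perInstance;  // = 30,000 per 5 min
    // The 80K test saw 30,191 arrive: 30,000 noisy messages (the combined cap)
    // plus all 191 of the 1-in-420 messages, which never hit any per-instance cap.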