[20:13:15] Krinkle: do you know why we are not allowing sampling of Logstash logs?
[20:13:53] sampling seems like the least bad option for T395899
[20:13:53] T395899: Some subset of MediaWiki Logstash events are capped to 100/s - https://phabricator.wikimedia.org/T395899
[20:14:07] tgr: what kind of sampling? are you referring to a specific logstash or monolog feature?
[20:14:33] monolog / WMF config
[20:14:58] you can set a sample rate in $wmgMonologChannels but doing so automatically disables the Logstash backend
[20:17:21] I guess it's because we generally want to keep all errors, and for any request we have logs from, we'd want consistency: that other messages from the same reqId weren't lost/sampled away separately. That doesn't leave much. And historically that's been fine.
[20:17:38] Shared/multi-tenancy means that when ML servers overload Logstash, MW logs get trashed, it seems.
[20:18:03] I'm not sure the current sampling is entirely by design/intentional. I hadn't heard of it before.
[20:18:19] what have you observed so far? does it appear to throttle by individual message fingerprint?
[20:19:02] or only when logstash is generally under load, i.e. not constantly capped at 100/s all the time
[20:19:47] "all errors" as in: find out about new errors before deciding to sample (i.e. not holistically sample 1/N requests); for a given error signature, we will inevitably throttle at the intake if it's very noisy.
[20:20:56] individual messages for sure, I get exactly 100/sec of certain kinds of messages in Logstash
[20:21:31] on average anyway, the bucket size seems to be a few minutes
[20:21:37] the task has more details
[20:23:00] but that seems like a reasonable precaution, e.g. the session write logs would probably have caused problems if not throttled
[20:23:17] but we have this thing in the logging config: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/2e4ab14aa15bb95568f9c07dd777065901eb2126/wmf-config/logging.php#278
[20:23:27] and I don't see any reason for that
[20:24:31] https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002
[20:24:36] > for ($i = 0; $i < 2000; $i++) { $logger->info('Hello from noisy Krinkle'); if ($i % 42 === 0) { $logger->info('Hello from not-noisy Krinkle every 42nd iteration'); } }
[20:24:46] produces 2,049 messages
[20:24:51] arrived in Logstash
[20:25:22] in one 30s window
[20:25:43] but yeah, that's within a potential 3-5min burst allowance
[20:27:11] > for ($i = 0; $i < 20_000; $i++) { $logger->info('My noisy message'); if ($i % 420 === 0) { $logger->info('My not-noisy 1-in-420 message'); } }
[20:27:22] all 20,048 arrived in Logstash
[20:28:15] no idea what it depends on
[20:29:28] > for ($i = 0; $i < 80_000; $i++) { $logger->info('My other noisy message'); if ($i % 420 === 0) { $logger->info('Some other not-noisy 1-in-420 message'); } }
[20:29:35] only 30,191 made it
[20:29:54] 191 non-noise, and 30K of the 80K
[20:30:22] so your second guess is spot on
[20:30:39] 30K burst, not sure what the period is but 5min seems likely indeed
[20:39:11] posted on https://phabricator.wikimedia.org/T395899#10881727
[20:39:56] thanks for testing
[20:41:03] https://gerrit.wikimedia.org/g/operations/puppet/+/9478f801d8850d23ff056ffc02128061ee4d52d2/modules/profile/files/logstash/filters/75-filter_throttle.conf#2
[20:41:23] I don't know if this applies to these messages, I got it through a crude codesearch in Puppet for "normalized_message"
[20:41:41] that paints different numbers, but it mentions the throttler is local to each logstash instance, and we have ~3 servers
[20:42:09] This drops after 5K/5min
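For reference, a per-message 5K/5min drop like the one described above is the kind of rule the stock logstash-filter-throttle plugin expresses. The sketch below is illustrative only: the key, counts, and period are assumptions based on this chat, not copied from the deployed 75-filter_throttle.conf.

    filter {
      # Sketch only: values assumed from the discussion above, not from
      # the actual WMF 75-filter_throttle.conf.
      throttle {
        key         => "%{normalized_message}"  # one counter per message fingerprint
        after_count => 5000                     # tag everything past 5000 events...
        period      => 300                      # ...within a 300s (5 min) window
        add_tag     => [ "throttled" ]
      }
      if "throttled" in [tags] {
        drop { }                                # tagged events are discarded
      }
    }

The throttle counters live in-process, so each Logstash instance enforces its cap independently, which matters for the per-server discussion that follows.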
[20:42:48] that explains the slightly irregular sawtooth pattern
[20:43:55] If the code comments are wrong and we actually have (30/5 =) 6 logstash servers, then this might be it.
[20:44:05] And yeah, the servers that get a smaller proportion of the messages would keep going for a bit longer
[20:44:51] https://grafana.wikimedia.org/d/000000561/logstash
[20:45:16] or their 5min windows just aren't in sync
[20:45:19] I see exactly 6 instances each for "filtered", "in" and "out" there
[20:45:53] logstash1023:9198, logstash1024:9198, logstash1025:9198, logstash1030:9198, logstash1031:9198 - filtered, logstash1032:9198 - filtered
[20:46:47] tgr: right yeah, that makes more sense if we assume kafka/logstash distributes it randomly/evenly.
[20:47:11] maybe 2-3 of the servers are closely aligned, and the others slightly off
[21:05:17] duesen: can you have another look at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1153092? I think there may be a misunderstanding there of what https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1109780 intended.
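As a closing sanity check on the 30K figure from the 80K test above, here is the arithmetic in PHP, in the spirit of the shell.php snippets earlier. It assumes 6 instances each enforcing an independent 5,000-per-5-min cap with Kafka spreading events evenly; both are inferences from this chat, not verified against the deployed config.

    // Assumed: 6 Logstash instances, 5,000 events per message per 5-min window each.
    $instances   = 6;
    $perInstance = 5000;
    $globalCap   = $instances * $perInstance;  // = 30,000 per 5 min
    // The 80K test saw 30,191 arrive: 30,000 noisy messages (the combined cap)
    // plus all 191 of the 1-in-420 messages, which never hit any per-instance cap.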