[12:24:43] FYI about to merge a refactor of monitoring classes 725045
[12:36:08] reverting, there was a minor bug
[12:47:08] ack, thanks jbond
[12:48:24] fyi some checks lost all their parents for drmrs, also the anycast monitoring broke for about 10 mins
[12:48:37] should all be fixing itself now
[13:11:15] *nod*, is https://gerrit.wikimedia.org/r/c/operations/puppet/+/744787 good to be reviewed jbond ?
[13:11:47] also what was wrong with the first iteration?
[13:11:54] godog: I'm just running a pcc now https://puppet-compiler.wmflabs.org/compiler1002/32843/
[13:12:18] godog: specifically https://gerrit.wikimedia.org/r/c/operations/puppet/+/744787/4/hieradata/common/profile/monitoring.yaml was missing the drmrs parent
[13:12:39] the first patch in monitoring::host #23 had
[13:12:58] `$nagios_address = pick($ip_address, $host_fqdn)` and not `$nagios_address = pick($host_fqdn, $ip_address)`
[13:14:06] ack, I'll take a look at the new patch shortly
[13:14:27] ack thanks
[13:38:45] godog: fyi going for a second attempt at that patch. this time I'll disable puppet and roll it out a bit more gradually
[13:39:36] jbond: +1
[15:18:06] did the prometheus servers in eqiad just oom and restart...? https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?orgId=1&refresh=1m
[15:22:08] looks like it
[15:22:21] codfw too
[15:22:50] codfw had 3 months uptime (!) well deserved I'd say
[15:27:23] chanced across this post, not sure if it's a worry for us?
[15:27:24] https://discord.com/channels/766613591994007562/766613591994007565/917749600717258773
[15:27:32] Not that sorry
[15:27:48] https://twitter.com/Dinosn/status/1468144253239021574
[15:30:29] topranks: interesting! thanks for the heads up, I'm taking a look
[15:31:14] thread suggests it affects all plugins, but unsure if we have any? Also don't see mention of a CVE or other post yet.
[15:36:19] yeah, unclear how to reproduce it, good to keep an eye on it heh!
[15:41:28] We can't reproduce on 8.3 godog topranks
[15:41:40] I just asked our team to check Miraheze's
[15:42:31] ack, thanks RhinosF1! appreciate it
[15:44:29] RhinosF1: thanks, apologies for the false alarm
[15:46:59] topranks: I'd rather a false alarm than an xmas day RCE
[16:20:01] (the bug is real btw, only 8.x is affected though)
[16:20:12] we're running 7 still
[16:22:39] cc moritzm ^
[16:23:56] ack, thx
[16:24:31] is there a formal announcement/confirmation by Grafana yet?
[16:25:16] not afaik
[17:28:38] moritzm: I sent security@ an email 40 minutes ago
[17:29:18] RhinosF1: thanks :-)
[17:31:03] moritzm: https://grafana.com/blog/2021/12/07/grafana-8.3.1-8.2.7-8.1.8-and-8.0.7-released-with-high-severity-security-fix/
[17:31:08] about 10 seconds ago
[17:31:25] cc godog
[17:32:22] Thanks for the update!
[17:55:26] o/ we just saw a ~30 min gap (missing data) in some graphite metrics coming from MW
[17:57:43] dcausse: there were some issues with graphite1004, I believe herron was looking into it
[17:58:05] thanks!
[18:00:00] yes, the host had a very high load avg and I power cycled it, it is back online now though still looking into it
[19:39:04] Hello observability folks. Can someone tell me what the size limit is that will result in a log message going to the jsonTruncated channel?
[19:39:19] asking in regard to T297219
[20:16:14] dancy: the jsonTruncated tag gets applied when type=syslog, program=mediawiki, and the message does not match `^{.*}$`. Truncation could happen within MediaWiki itself (monolog), in the UDP transport, or somewhere in rsyslog. On the output side, Elasticsearch caps keyword fields at 32k bytes because of Lucene's term byte-length limit.
[21:09:15] thx
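
For context on the 13:12 exchange: Puppet's pick() returns the first argument that is defined and non-empty, so swapping the argument order changes which value ends up as the Nagios address. Below is a rough Python analogue of that behaviour; the sample IP and FQDN values are hypothetical and only there to show why the order matters.

```python
def pick(*args):
    """Rough analogue of Puppet's pick(): return the first argument that is
    not undef/empty; Puppet's version raises an error if none qualifies."""
    for value in args:
        if value not in (None, ""):
            return value
    raise ValueError("pick(): no suitable value found")

# Hypothetical values, for illustration only.
ip_address = "10.192.0.17"
host_fqdn = "an-example1001.eqiad.wmnet"

print(pick(ip_address, host_fqdn))  # "10.192.0.17" -- the IP wins when both are set
print(pick(host_fqdn, ip_address))  # "an-example1001.eqiad.wmnet" -- the FQDN wins
```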
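
To make the 20:16 explanation concrete, here is a minimal sketch of the tagging condition as described in the chat. The field names, the helper function, and the sample record are assumptions for illustration, not the actual rsyslog/Logstash configuration.

```python
import re

# A message that is a complete JSON object starts with "{" and ends with "}".
JSON_RE = re.compile(r"^{.*}$")

def is_json_truncated(event: dict) -> bool:
    """Mirror of the condition described above: syslog events from mediawiki
    whose message is not a complete JSON object get tagged jsonTruncated."""
    return (
        event.get("type") == "syslog"
        and event.get("program") == "mediawiki"
        and not JSON_RE.match(event.get("message", ""))
    )

# Hypothetical example: a JSON log line cut off in transport loses its closing brace.
truncated = {
    "type": "syslog",
    "program": "mediawiki",
    "message": '{"channel": "exception", "msg": "...',
}
print(is_json_truncated(truncated))  # True
```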