[13:39:04] So... Cassandra uses a logback encoder to send to logstash, via rsyslog on port 11514. I upgraded Java two days ago on a canary, and it hasn't logged since. There are no errors, and with tcpdump I can see what looks like well-formatted JSON being delivered to port 11514. Any suggestions as to where to start looking next?
[13:51:35] urandom: o/ is it aqs1010? Where did you see that it stopped logging?
[13:51:46] (if there is a dashboard to check etc..)
[13:52:52] elukey: https://logstash.wikimedia.org/goto/7c157dd0ba4fb6b67564000ecb919929
[13:53:14] the last thing logged is the shutdown before restarting on Java 11
[13:55:07] ah I see, udp
[13:55:33] the only thing that comes to mind is an indexing error on the logstash front
[13:55:44] maybe the format of the message changed
[13:56:09] I don't see anything on aqs1010 tcpdumping though
[13:56:10] It shouldn't, but something is different
[13:56:14] yeah, same
[13:56:19] `tcpdump udp port 11514 -A -i lo`
[13:56:26] right, I did the same
[13:56:54] but I've read above that you do see something logged via tcpdump, right?
[13:57:07] ah yes, now I see some msgs
[13:57:35] "I don't see anything on aqs1010 tcpdumping though" <-- oh, I read that as you didn't see anything wrong
[13:58:04] italians and english
[13:58:07] :D
[13:58:15] yeah, it's logging, and the output looks reasonable (at least reading it via tcpdump output)
[13:58:39] and if you compare the tcpdump with, say, aqs1011, do you see anything different?
[13:58:54] I don't recall how to check for indexing errors, but they are on logstash
[13:59:29] o/ I see no indexing errors from aqs1010
[13:59:38] :/
[13:59:55] also not seeing anything from tcpdump either
[13:59:59] cwhite: where should I look for those? So I'll save the bookmark :)
[14:00:02] but I'll be patient
[14:00:17] (/me meeting)
[14:00:18] cwhite: you're monitoring?
[14:00:19] elukey: https://logstash.wikimedia.org/app/discover#/view/6086dd90-85dd-11eb-99a9-c1243d7de186
[14:00:22] <3
[14:00:27] yes
[14:00:57] I can bump cassandra and make it noisy, just a sec...
[14:00:59] TIL dead letters
[14:01:56] cwhite: it's noisy now
[14:03:08] in meetings too
[14:04:30] urandom: the host field changed from fqdn to just the host name `aqs1010`
[14:05:22] oooh, it did
[14:05:27] * urandom groans
[14:14:13] cwhite: thank you :)
[14:16:15] No problem! Glad to help :)
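A minimal sketch of the kind of side-by-side check discussed above: pipe `sudo tcpdump -l -A -i lo udp port 11514` into it on each host and it prints every distinct `host` value found in the JSON events, which makes a change from the FQDN to the bare hostname easy to spot. The `host` field name comes from the conversation; that it appears in the raw UDP payload (rather than being added later in the rsyslog/logstash pipeline) and that each event fits on one tcpdump output line are assumptions.

```python
#!/usr/bin/env python3
# Sketch: pipe `sudo tcpdump -l -A -i lo udp port 11514` into this script.
# It pulls JSON objects out of the ASCII payload dump and reports each new value of
# the `host` field, so a switch from the FQDN to the bare hostname stands out.
import json
import re
import sys

json_re = re.compile(r"\{.*\}")   # assumes each event sits on a single output line
seen_hosts = set()

for line in sys.stdin:
    match = json_re.search(line)
    if not match:
        continue
    try:
        event = json.loads(match.group(0))
    except json.JSONDecodeError:
        continue  # payload split across packets/lines; ignore the fragment
    host = event.get("host")
    if host and host not in seen_hosts:
        seen_hosts.add(host)
        print(f"new host value: {host!r}")
```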
[14:34:31] <_joe_> godog / cwhite do you have a query I can make to prometheus to verify it's collecting metrics from mw-debug?
[14:35:28] <_joe_> I am merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1041656 which AIUI should start making metrics flow to the exporters from the service
[14:49:14] _joe_: for sure sth on the statsd-exporter's own metrics, like https://w.wiki/ANGo
[14:49:18] for k8s/eqiad that is
[14:49:34] <_joe_> yeah I was hoping to check the mediawiki ones
[14:49:55] that should be increasing, I'm not sure re: mediawiki's own metrics though, i.e. a metric that will be there for sure
[14:50:30] _joe_: maybe `mediawiki_action_executeTiming_seconds_count{kubernetes_namespace="mw-web"}`
[14:50:39] ?
[14:51:10] <_joe_> it should be mw-debug
[14:51:19] <_joe_> that's where I enabled the service
[14:51:29] <_joe_> but uh, it seems prom is collecting from all the sidecars now?
[14:52:54] <_joe_> yep, looks like it
[14:53:19] you added the ingress rule, didn't you?
[14:54:48] <_joe_> just for the statsd deployment
[14:54:55] <_joe_> it shouldn't work for those
[14:54:58] huh
[14:55:28] <_joe_> but I'd assume we have a global ingress rule for prometheus
[14:57:10] looking at the graph, it started collecting the metrics at 14:35 ish
[14:58:08] wait no
[14:58:43] seems we only have one kubernetes_namespace collecting: mw-web
[14:59:38] <_joe_> there is no reason for that
[14:59:48] <_joe_> so cwhite in an mwdebug pod I have
[15:00:05] <_joe_> var_dump($wgStatsTarget);
[15:00:07] <_joe_> string(23) "udp://10.64.72.158:9125"
[15:00:14] <_joe_> the IP is the cluster IP of the service
[15:01:21] <_joe_> uhm, but why is port 9125 there?
[15:01:59] port 9125 is the statsd exporter port we've been using
[15:02:17] <_joe_> uh, not historically on k8s for every other service
[15:02:46] <_joe_> var_dump($wgStatsTarget);
[15:02:48] <_joe_> string(23) "udp://10.64.72.158:9125"
[15:02:52] <_joe_> sorry, wrong paste
[15:03:03] <_joe_> containerPort: 9102
[15:03:08] <_joe_> so ofc it's not working :)
[15:03:17] <_joe_> I can change the port of the service though
[15:03:31] <_joe_> now one is left to wonder how those mw-web metrics are being collected :D
[15:04:26] yeah, it seems they've been collected intermittently looking back over the last 7 days
[15:06:36] <_joe_> but they really shouldn't be :D
[15:07:08] <_joe_> anyways, I'll just change the port
[15:14:18] I'm in more meetings and lurking FWIW
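A small sketch of the two statsd_exporter ports at play in the thread above, assuming the usual defaults: 9125/udp for statsd ingest (where `$wgStatsTarget` should point) and 9102/tcp for the Prometheus `/metrics` scrape endpoint. The host value is a placeholder rather than the real Service cluster IP, and the metric name is made up for the test; if the Service maps the UDP traffic to 9102 instead of 9125, the counter never appears on the scrape side.

```python
#!/usr/bin/env python3
# Sketch of the statsd_exporter port split discussed above (assumed defaults):
#   9125/udp - statsd ingest, where $wgStatsTarget should point
#   9102/tcp - Prometheus /metrics scrape endpoint
# If the Service sends the UDP traffic to 9102 instead of 9125, the packets are
# silently dropped and nothing new ever shows up on the scrape side.
import socket
import urllib.request

EXPORTER_HOST = "127.0.0.1"   # placeholder; substitute the Service cluster IP to test in-cluster
STATSD_PORT = 9125
METRICS_PORT = 9102

# Emit a throwaway statsd counter in the "<name>:<value>|c" wire format.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.sendto(b"debug.port_check:1|c", (EXPORTER_HOST, STATSD_PORT))

# With the exporter's default mapping, dots become underscores, so look for "debug_port_check".
body = urllib.request.urlopen(f"http://{EXPORTER_HOST}:{METRICS_PORT}/metrics", timeout=5).read()
print("counter visible on /metrics:", b"debug_port_check" in body)
```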