[07:09:36] elukey: hi o/
[07:09:36] following this change https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Streams&diff=prev&oldid=2266795
[07:09:36] we were able to consume events from a stream using the command below:
```
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.page_prediction_change.rc0 -o -100 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq
```
[07:09:36] we tried the same command today and could not access the stream. has something changed?
[07:50:29] no worries. finally got the event using:
```
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.page_prediction_change.rc0 -o -1 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq
```
[08:26:36] I'm going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131052 on two hosts in magru. These hosts will first be depooled and then updated and repooled. We already tested this change, but this is the first time we are deploying it on hosts that are serving traffic, so heads up!
[08:27:14] (rollback is easy but it's even easier to depool these in case of issues)
[08:32:02] ack
[08:39:57] all checks are ok, ready to repool these 2 hosts
[08:41:37] {{done}}
[11:11:07] Following up on yesterday's message: the druid datasource webrequest_sampled_128 is gone. Please ping in the analytics chan if this creates problems
[11:15:20] thanks for the info joal <3
[11:24:09] ack, thanks
[13:04:05] I have been reviewing logstash and am seeing lots of timeouts, "database server overloaded", and "could not connect" errors. Are we aware of an issue at the moment?
[13:16:44] dwalden: timeframe?
[13:16:44] is that happening right now or are you reviewing past logs?
[13:16:50] I am not aware of anything right now fwiw
[13:17:13] There was a peak around early morning (UTC) on the 24th. I've seen a few hundred since then
[13:46:02] I'm trying to target all elastic hosts in row A in CODFW with a cumin command, not having much luck with the examples listed at https://wikitech.wikimedia.org/wiki/Cumin - tried `sudo cumin P:netbox::host%location ~ "B1.*eqiad"` and that throws an error
[13:51:41] try: sudo cumin 'P:netbox::host%location ~ "A.*codfw" and R:class@tag = role::elasticsearch::cirrus'
[13:52:59] sukhe thanks, will give it a try
[13:55:18] yeah, P{O:} and what you did is probably better.
[14:02:16] all credit to btullis for that one ;P
[15:04:11] hello on-callers
[15:04:33] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131300 in a bit, it will change the benthos config for webrequest_live
[15:04:52] this may impact our superset/turnilo data for a bit
[15:05:08] we are switching input data streams on behalf of DPE
[15:06:44] <_joe_> elukey: as long as it's just webrequest_live, it's ok
[15:07:38] _joe_ what do you mean?
[15:08:05] <_joe_> elukey: that webrequest_128 is currently quite important
[15:09:01] _joe_ have you read what joal wrote the other day? https://phabricator.wikimedia.org/T385198
[15:09:07] in this chan
[15:09:14] <_joe_> no, I missed it
[15:09:25] <_joe_> well no I mean "currently" as "today"
[15:10:00] ah okok, please add your thoughts to the task if you don't want _128 to be decommed
[15:11:07] <_joe_> elukey: thanks
[15:11:41] <_joe_> sigh it's been removed *now*
[15:11:44] <_joe_> oh dear.
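A quick reference on the offset flags used in the two kafkacat commands above: a minimal sketch assuming the same broker, topic, and CA path from the chat; the `-o` semantics in the comments are the standard kafkacat ones (negative values count back from the end of each partition) and are worth double-checking against the installed kafkacat version.

```
BROKER=kafka-main1006.eqiad.wmnet:9093
TOPIC=codfw.mediawiki.page_prediction_change.rc0
SSL_OPTS=(-X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt)

# Consume only the last message of each partition and exit at end of partition (-e);
# this is the variant that worked above.
kafkacat -C -b "$BROKER" -t "$TOPIC" -o -1 -e "${SSL_OPTS[@]}" | jq .

# Start 100 messages before the end of each partition.
kafkacat -C -b "$BROKER" -t "$TOPIC" -o -100 -e "${SSL_OPTS[@]}" | jq .

# Start from the earliest offset still retained on the topic.
kafkacat -C -b "$BROKER" -t "$TOPIC" -o beginning -e "${SSL_OPTS[@]}" | jq .
```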
[15:13:16] already been removed
[15:13:33] <_joe_> yeah that's truly unfortunate given _live has such a short retention
[15:13:44] it was the same IIRC
[15:13:59] _live is one month
[15:14:06] 128 was 3?
[15:14:07] <_joe_> you're remembering incorrectly; it was, but the one for _live was reduced to 30 days
[15:14:20] maybe with the saved space we can increase it back?
[15:14:21] <_joe_> in any case, sorry, meeting
[15:14:43] <_joe_> yeah we needed the data *now*, not in 90 days. There's always hive in any case.
[15:17:44] currently watching https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=now-30m&to=now
[15:17:58] the "validate" drop is expected, that step is not used anymore
[15:18:10] the overall rps is good
[15:18:21] trying to figure out why the "batching" config changed
[15:19:23] something to be aware of - since we switched kafka topics, we may have lost a bit of data and/or consumed something already processed, so this time window may show some inconsistency in superset/turnilo
[15:30:01] I can confirm data is kept for 30D
[15:30:18] I'll follow up with Joseph to increase it
[16:26:15] Hi folks (_joe_ in particular :) I have seen the exchange with elukey above, and we talked about retention. I have grown the retention for webrequest_sampled_live to 90 days on the cluster. The thing is that we only have 60 days of deep-storage data, so the cluster is now filled with 60 days of data, and I'm going to change the deep-storage to 90 days and it'll build up from now on. This represents
[16:26:21] quite a lot of data for druid; if we have space issues I'll let you know and possibly revert the retention back to 60 days.
[16:26:49] <_joe_> joal: that's ok, thanks <3
[16:33:33] joal: oh nice, I can see we can already access data back up to the end of Jan. in the live dashboard
[16:36:30] That's right, druid has 2 storage levels, one on the druid cluster itself and one on hadoop. We keep more data on hadoop for exactly this kind of case. We kept 60 days; we should have kept 90.
[19:14:44] I know I'm super late on this, but are there any recorded talks from the SRE offsite?
[19:14:58] I've been digging thru my email, no luck so far
[19:17:03] yes, DMed
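For context on what "growing the retention on the cluster" typically involves for a Druid datasource: a rough sketch using the stock Druid coordinator retention-rules API, with a hypothetical coordinator host and placeholder replicant count. This is illustrative only, not the actual change made here, and the deep-storage side that joal mentions (how long segments stay on hadoop) is managed separately from these rules.

```
# Hypothetical coordinator endpoint; the datasource name is the one from the chat.
COORDINATOR=http://druid-coordinator.example.org:8081
DATASOURCE=webrequest_sampled_live

# Replace the datasource's rules with: keep the last 90 days loaded on the
# cluster, drop anything older. The tieredReplicants value is a placeholder.
curl -s -X POST "$COORDINATOR/druid/coordinator/v1/rules/$DATASOURCE" \
  -H 'Content-Type: application/json' \
  -d '[
        {"type": "loadByPeriod", "period": "P90D", "tieredReplicants": {"_default_tier": 2}},
        {"type": "dropForever"}
      ]'

# Check which rules are currently applied to the datasource.
curl -s "$COORDINATOR/druid/coordinator/v1/rules/$DATASOURCE" | jq .
```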