[06:47:08] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[06:51:16] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[08:55:53] Hi mforns - are you up now? I have a question about the webrequest alert
[08:59:54] (CR) Joal: "Thank you @milimetric for finding this" [analytics/refinery] - https://gerrit.wikimedia.org/r/774535 (https://phabricator.wikimedia.org/T304884) (owner: Milimetric)
[10:06:55] PROBLEM - Check unit status of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:12] (VarnishkafkaNoMessages) firing: ...
[10:17:12] varnishkafka for instance cp1079:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp1079:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[10:18:17] --^ this is the first alert from https://phabricator.wikimedia.org/T300246 which I've just deployed. Investigating now.
[10:18:42] nice, it seems to be working fine :)
[10:19:22] Joe may be working on it, I see https://sal.toolforge.org/production?p=0&q=cp1079&d=
[10:20:31] Ah good. :-) I just need to get that confctl integration working correctly then.
[10:22:29] but it is a great start, nice job!
[11:28:22] hi joal I just joined!
[11:37:19] btullis, joal, do you know how to interpret the varnishkafka alert?
[11:37:26] Seems important
[11:37:33] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (cmooney) @Cmjohnson there is an issue with the port assigned for **an-worker1143** on lsw1-e2-eqiad, **an-worker1145** on lsw1-f2-eqiad, and *...
[11:45:28] Data-Engineering, Data-Engineering-Kanban, SRE: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (MoritzMuehlenhoff) By default only "main" and "thirdparty/hwraid" (for baremetal hosts) are added to our servers. And that's by design, so that we have full control what we...
[11:51:21] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (cmooney) an-worker1142, an-worker1144, an-worker1147 and an-worker1148 should be good to go. I'm not sure why the re-image failed on those tb...
[12:05:42] mforns: Sorry about that. Yes I do know how to interpret it. It's not a problem right now. Basically we now trigger a critical alert for this team when a varnishkafka instance sends zero messages for 5 minutes.
[12:07:30] The alert is in place as of today (https://gerrit.wikimedia.org/r/c/operations/alerts/+/773801) but at the moment it will trigger when somebody intentionally depools a cp-* server. I'm working on a way of getting the conftool pooled/depooled status into Prometheus so that we can exclude depooled hosts from the alert.
[12:07:43] btullis: thanks for the explanation :]
[12:08:08] A pleasure.
[12:08:43] btullis: can I copy your explanation in the alert email?
[12:09:28] Yes, feel free. I'm just adding another comment to this ticket too, where the approach to minimize false positives is also being discussed: https://phabricator.wikimedia.org/T300246
[12:15:29] Data-Engineering, Data-Engineering-Kanban: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (BTullis) The alert fired today, shortly after merging, when @Joe intentionally depooled cp1079 in order to re-image it. {F35028457} While this wa...
[12:20:38] btullis: sandra asked me about it in slack too
[12:26:48] ottomata: Yes, I should have done a better job of communicating this, or getting the false-positive mitigation finished before deploying. Sorry about that.
[12:27:17] mforns: Heya - to me the most pressing thing is rerunning the dataloss error hour, to unlock downstream jobs
[12:27:29] looking into that! joal
[12:28:27] mforns: there were errors on caches for that hour - I think we should rerun with high threshold
[12:28:52] for that hour? the caches alert arrived at 14h, no?
[12:28:58] the webrequest hour is 6am
[12:29:09] yes mforns - those alerts are not related
[12:29:26] ok, will re-run with high threshold
[12:29:51] mforns: wall of errors in ops chan at hour 6-UTC
[12:30:03] I see
[12:36:10] thanks mforns
[12:36:16] joal: executed a coord to re-run the thing
[12:36:20] \o/
[13:01:16] mforns: I confirm your manual run has finished and has started trickling down
[13:01:33] oh, yea, just saw that
[13:01:34] I think it'll take some time to catch up, but it's on its way
[13:01:39] k
[13:05:42] RECOVERY - Check unit status of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:07:08] yay :) --^
[13:07:41] thank you mforns :)
[13:10:20] \o/
[13:23:58] hi I have a data lake question :] We have the idea of sending Gerrit events to the data lake which could theoretically let us do some analysis on Superset / Hadoop or whatever magic tooling
[13:24:19] I am more or less aware of EventGate which I understand is an HTTP interface to which event producers can POST events
[13:24:48] o/
[13:24:50] I found out Gerrit has a Kafka producer event which can produce json formatted events. And my question is, do you have support to poll a kafka producer? ;]
[13:25:08] eventgate is an http proxy for a kafka producer
[13:25:12] or would I need to write a converter from kafka to http POST to EventGate?
[13:25:24] if you like, you can produce directly to kafka from gerrit...however
[13:25:46] event platform is a little opinionated about the events
[13:25:49] there are some required fields
[13:26:24] https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Required_fields
[13:26:43] ouch
[13:27:39] we were able to work around this in eventgate for network error logging...but to use that you'd have to HTTP post to eventgate (looking to remember workaround)
[13:28:40] ah yes, we added some special query params to the API to allow the POSTer to ask eventgate to augment the event with the required fields
[13:28:43] POST /v2/events?schema_uri=/cool/schema/1.0.0&stream=cool.stream
[13:28:57] (eventgate will already default the meta.dt field)
[13:29:15] (and dt i think)
[13:29:59] hashar: if you want to use this gerrit plugin to produce its own events to kafka without a schema, etc. (e.g. not using event platform), you can do that
[13:30:18] but there won't be any automated tooling to e.g. ingest the stream into Hive so you can easily use Superset
[13:30:25] it can be done, but it would have to be a custom job
[13:31:23] ottomata: well I am at the very start of the journey and I don't even know what Kafka is :]
[13:31:46] I guess what I am looking for is for your infra to come poll the Kafka producer embedded in Gerrit
[13:31:48] and magic to happen
[13:31:50] :D
[13:32:08] :)
[13:32:12] I am going to read the Guidelines you gave me
[13:32:20] okay that is for event schema guidelines
[13:32:23] if you are starting your journey
[13:32:33] the Gerrit plugin recognizes kafka configuration settings so maybe the required fields can be injected this way
[13:32:48] https://wikitech.wikimedia.org/wiki/Event_Platform
[13:33:17] beside that it seems the events are merely `key: