[02:05:39] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:05:43] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:05:46] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:06:01] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:06:04] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:06:09] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[02:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:43:59] hello folks, I acked the cassandra/aqs alerts in icinga
[06:49:39] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.9842 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[07:29:55] eqi aqs1004
[07:29:59] mwarf
[07:39:57] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (JAllemandou) Open→Resolved
[07:40:00] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (JAllemandou)
[07:40:35] My 2 minutes test tells me the problem comes from the new hosts - second thing is we don't have logs in logstash for those new hosts - I thought we had a task for this but there wasn't - just created one
[07:41:20] Data-Engineering: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (JAllemandou)
[07:41:48] Data-Engineering: Send cassandra3 (new hosts) logs to logstash - https://phabricator.wikimedia.org/T297460 (JAllemandou)
[07:41:52] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Cassandra3 migration for Analytics AQS - https://phabricator.wikimedia.org/T249755 (JAllemandou)
[07:46:09] joal: it is strange, do you mean aqs logstash logs?
[07:48:06] yessir
[07:48:28] in logstash I was not able to find any log from aqs1010 :( Maybe I did something wrong?
[07:50:56] it is strange, the config is the same
[07:52:44] elukey: networking maybe? I assume we'd get errors if so
[07:53:12] anyhow - no prod broken, I'm going back to day off ;)
[07:54:16] ah! enjoy :)
[07:54:37] yes yes I acked because it was new-hosts-related, nothing major
[07:55:11] Analytics: Check home/HDFS leftovers of christinedk - https://phabricator.wikimedia.org/T297461 (MoritzMuehlenhoff)
[07:55:12] It's actually not cool that our IRC alerts don't tell us about hosts - I don't know if/how we can change that
[07:56:20] ok I just checked and we have data in a table of the new cluster for yesterday - seems a node<->cassandra problem
[07:56:37] elukey: would you give aqs1010 node process a shake?
[07:56:39] please?
[08:00:39] joal: I found some logs about cassandra on aqs1011, indeed the logs in there for cassandra-a are not good
[08:01:34] like
[08:01:35] ERROR [main] 2021-12-10 02:01:16,328 CassandraDaemon.java:749 - Cannot remove temporary or obsoleted files for local_group_default_T_mediarequest_per_file.data due to a problem with transaction log files. Please check records with problems in the log messages above and fix them. Refer to the 3.0 upgrading instructions in NEWS.txt for a description of transaction log files.
[08:02:27] and
[08:02:28] ERROR [main] 2021-12-10 08:01:56,100 LogTransaction.java:492 - Unexpected disk state: failed to read transaction log
[08:02:30] elukey: right :( that keyspace was the last one we were loading with btullis
[08:02:33] etc..
[08:02:44] ok :(
[08:03:14] now the problem seems to impact all endpoints for AQS :( would it be that cassandra moved itself to a non-available mode?
[08:03:16] so aqs1011-a seems down
[08:03:32] ok
[08:03:42] down like: no cassandra process
[08:04:01] nono in a weird state, DN in nodetool status
[08:04:09] on cassandra-a aqs1010 I see Cannot achieve consistency level LOCAL_ONE
[08:04:16] (horrible paste)
[08:04:39] wow - consistency local_one is like no consistency other than self - not good :(
[08:05:26] hm - I think we have reloaded the data too fast - not letting cassandra compact in peace was a bad idea :(
[08:06:09] yeah even on 1012, same error
[08:07:38] from the error above it seems that the failing health checks are for top-by-country
[08:07:56] and top
[08:08:45] elukey: try curl -X GET --header 'Accept: application/json; charset=utf-8' 'http://aqs1010.eqiad.wmnet:7232/analytics.wikimedia.org/v1/pageviews/per-article/en.wikipedia/desktop/user/Barack_Obama/daily/20211201/20211209' on aqs1010
[08:08:53] problem is from all keyspaces I think :)
[08:09:22] ah yes I see also alerts for per-article, maybe the cluster is borked
[08:10:29] anyway, joal is your day off
[08:10:38] please stay away from a non-urgent issue :)
[08:11:14] ack elukey - doing that - later \o
[09:26:19] elukey: joal: Just seeing these messages now.
[09:41:17] I'm investigating the transaction log errors mentioned in /var/log/cassandra/system-a.log on aqs1011 - Will stop puppet trying to restart cassandra while I do so.
[09:43:21] The message is similar to what was shown once before, during the loading of pageviews per article to aqs1013-b: https://phabricator.wikimedia.org/T291472#7435832
[09:46:52] (CR) Mforns: "This is great!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[09:49:52] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) We have experienced another error during compaction. This is similar to the error that we saw on the pageview...
[09:58:36] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) I created a backup of the affected transaction log: ` root@aqs1011:/srv/cassandra-a/data/local_group_default_...
[10:01:44] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:02:00] aqs1011-a restarted and rejoined the cluster.
[10:02:26] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:02:32] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.056 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:02:48] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:02] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:06] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:20] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:32] I don't understand why one instance being down should cause the endpoints check to fail though.
[10:15:49] most of the affected nodes complained about
[10:15:50] Cannot achieve consistency level LOCAL_ONE
[10:17:19] mmmm
[10:17:21] at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:97) ~[apache-cassandra-3.11.4.jar:3.11.4]
[10:17:31] how is the system keyspace replicated?
[10:18:10] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.029 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[10:19:43] system_auth | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
[10:19:46] bingo :)
[10:19:49] btullis, joal --^
[10:20:16] I guess it is only on aqs1011-a
[10:20:40] so when the instance failed, all queries failed since the aqs user was not able to authenticate
[10:23:13] Ah, I see. I wonder if this is likely to affect other clusters, or whether it is simply the way that this one was configured.
[10:25:59] I think it is the default when you create a new cluster, IIRC we increased it in the current/old aqs cluster
[10:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:48:27] elukey: Excellent, thanks. I'll make a task to do it before putting the new cluster into production.
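The finding above (system_auth on SimpleStrategy with replication_factor 1, so credentials lived on one instance only) can be illustrated with a minimal sketch. The toy availability check and the statement builder are illustrative helpers, not code from the AQS setup; the datacenter name `eqiad` and a replication factor of 3 are assumptions, and the extra instance names in the second call are only a hypothetical layout.

```python
def local_one_available(replicas, down):
    """LOCAL_ONE needs at least one live replica holding the requested row."""
    return any(node not in down for node in replicas)


def alter_keyspace_cql(keyspace, dc_replicas):
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += ["'%s': %d" % (dc, rf) for dc, rf in sorted(dc_replicas.items())]
    return "ALTER KEYSPACE %s WITH replication = {%s};" % (keyspace, ", ".join(options))


# replication_factor 1: the aqs user's credentials live on a single instance,
# so losing that instance makes authentication (and every query) fail.
print(local_one_available(["aqs1011-a"], down={"aqs1011-a"}))

# Hypothetical layout with three replicas: one instance down is tolerated.
print(local_one_available(["aqs1011-a", "aqs1012-a", "aqs1013-a"], down={"aqs1011-a"}))

# Assumed datacenter name and replication factor, for illustration only.
print(alter_keyspace_cql("system_auth", {"eqiad": 3}))
```

After a replication change like this, Cassandra generally needs a repair of the keyspace (for example `nodetool repair system_auth` on the nodes) before the new replicas actually hold the data.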
[11:00:22] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.222 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[11:29:16] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.36 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[11:53:44] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.033 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[11:57:31] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) The service has restarted and appears to be working normally now. The loading process itself was reported to...
[11:57:57] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) a:BTullis→JAllemandou
[12:39:07] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.015 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[13:02:39] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.013 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[13:03:21] Data-Engineering, Data-Engineering-Kanban: Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster - https://phabricator.wikimedia.org/T297483 (BTullis)
[13:15:17] good catch elukey - I was sure we had talked about that when setting up the cluster in order to prevent it - Meh :S
[14:13:23] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.111 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[14:33:31] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (Ottomata) >_ Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.4:npm (npm install) on project atlas-dashboardv2: Failed to run task: 'npm install' failed. org...
[14:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[15:01:21] Hey all
[15:01:43] Hi Seddon
[15:02:38] Is waiting till Monday something that is viable for the team?
[15:03:07] Or is this disruptive enough to merit getting something deployed on a Friday?
[15:03:43] (And apologies my initial patch didn't fix this)
[15:06:25] Seddon: Are you talking about this one? https://phabricator.wikimedia.org/T297400
[15:06:33] yep
[15:08:36] If I understand it correctly, your patch will disable the event generation, right?
[15:09:08] At the moment events are being generated, but being blocked by a schema validation failure in eventgate.
[15:09:28] Yep, when I get back from leave will look at updating the schema
[15:10:22] Seddon: i think probably a quick deploy to disable would be good
[15:10:40] mw config patch, right?
[15:10:47] i can help you deploy that real quick if you have one
[15:11:25] oh, sorry, misread your question, i mean...yeah nothing bad will happen if it waits til monday
[15:11:29] OK. Just so I'm clear, the only benefit to deploying the patch would be to stop the alerts over the weekend.
[15:11:29] Not a config patch, just commenting out the eventlog emit
[15:11:34] oh in the code
[15:11:41] meh
[15:11:43] hm
[15:11:52] yeah that's slightly more complicated to deploy, buuut yeah
[15:11:55] lets wait til monday i think
[15:11:59] +1
[15:12:15] I'll try to ack the alert and note that
[15:12:45] Cool. Really sorry for the disruption and not getting it fixed first time
[15:13:13] s'ok!
[15:13:30] Seddon: We're supposed to be meeting in real life sometime this weekend, aren't we? :-)
[15:13:59] Oh! btullis i was not authorized to ack an alert in icinga UI?
[15:14:11] can you?
[15:14:26] ottomata: Yeah, I think I can. How come you're not?
[15:14:31] i dunno!
[15:14:40] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=alert1001&service=eventgate-analytics-external+validation+error+rate+too+high
[15:14:43] is the alert
[15:15:44] ACKNOWLEDGEMENT - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.134 gt 2 Btullis This is being worked on. Acked until fixed. T297400 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[15:17:42] btullis: I did message Lukasz but haven't heard from him whilst he is travelling. Only complication is that my Sunday is currently being taken up by the F1 (F1 is life) between 11 and 4
[15:18:54] Seddon: Ah, no worries. My Sunday is taken up by unicycle hockey (is also life) between 7-11pm. :-)
[15:23:52] btullis: thanks for acking that. What I don't understand is that it looks like the validation errors are happening on all streams, am I reading something wrong?
[15:27:12] milimetric: I clicked through to logstash. It seemed to me previously that it was all the `mediawiki.mediasearch_interaction` stream, but now that I look again there are a few others.
[15:27:17] https://logstash.wikimedia.org/app/discover#/view/AXMlVWkuMQ_08tQas2Xi?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=h@c2dd4ef
[15:28:43] There are a few `eventlogging_SearchSatisfaction` errors and a few `eventlogging_NavigationTiming` ones.
[15:29:12] yeah btullis I click on the "errored_stream_name" in the selected fields on the left, and filter out our main offender, there are plenty of others
[15:32:02] hm, btullis you wanna batcave about it for a sec?
[15:32:04] I'm confused
[15:33:35] Interestingly though, it's been like this for at least the last 30 days. In fact the number of validation errors excluding the prime suspect is relatively low at the moment, with about 2K per 12 hours.
[15:33:38] https://usercontent.irccloud-cdn.com/file/YOycVmJt/image.png
[15:33:46] Yeah, sure. See you in the bc.
[15:33:47] never mind btullis I figured it out, sorry
[15:34:05] the grafana visualization is an area graph, which should be outlawed as a crime against humanity. Ugh
[15:34:57] it's just this schema, so all's well and logical and we're good to wait until they roll back. I'm not sure if this is putting extra pressure on EventGate and if it'll mess it up in other ways, but seems ok so far
[15:36:00] for the record, I was looking at this and thought it looked like the error rate of all streams went up: https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=1638992404035&to=1639008430481, but because of how area graphs [don't] work with lots of lines, all the other streams are just hitching a ride up the mediasearch_interaction spike
[15:36:25] Oh I see what you mean. The Grafana graph is stacked, so it looks like they're all following the same pattern of error rates. Yes, that's horrible.
[15:37:02] to double check I isolated each of the other top error producers and none of their pattern had changed in logstash for the last three days, so we're good.
[15:37:20] oh i like stacked graphs! you easily see the total number of errors, but also see the individual contributions
[15:37:28] (which is why i made it stacked!)
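The stacked-graph confusion described above can be reproduced with a few lines of arithmetic. This is a minimal sketch with made-up error rates (the numbers are invented for illustration, not taken from the dashboard): in a stacked chart each band is drawn on top of the running total of the bands below it, so a flat series sitting above a spiking one appears to spike too.

```python
def stack(series):
    """Return the upper edge of each band in a stacked area chart."""
    running = None
    edges = {}
    for name, values in series.items():
        if running is None:
            running = list(values)
        else:
            running = [r + v for r, v in zip(running, values)]
        edges[name] = running
    return edges


# Invented error rates: only the first stream actually spikes.
rates = {
    "mediasearch_interaction": [1, 1, 50, 60],
    "SearchSatisfaction": [2, 2, 2, 2],
    "NavigationTiming": [1, 1, 1, 1],
}

edges = stack(rates)
for name in rates:
    # The raw values stay flat for the last two streams, but their drawn
    # upper edges follow the spike of the stream stacked below them.
    print(name, rates[name], edges[name])
```

Selecting a single stream, or switching the panel to unstacked lines, shows the raw values instead of the running totals.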
[15:37:32] buuut we can change that easily
[15:37:35] :-)
[15:37:46] you can select an individual stream to see just that one
[15:38:18] I agree they're useful if you know they're stacked, but that's almost never the default so it caught me off guard
[15:38:21] When I said horrible, what I reeaaly meant was, nice in its way.
[15:38:45] I stand by crimes against humanity
[15:39:11] What about those overlapping filled area graphs though? https://grafana.wikimedia.org/d/000000418/cassandra?viewPanel=7&orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_editors_bycountry&var-table=data&var-quantile=99p
[15:39:38] haha
[15:40:06] but that seems as good as a line graph but harder to read to me?
[15:40:09] btullis: *
[15:40:43] milimetric: when i make stacked graphs in grafana, i make them area fill, but when i make non stacked, they are just lines
[15:40:48] so i see them and know which is which
[15:41:41] Yes, harder to read. I didn't make that graph. I'm thinking about changing it to a line graph.
[15:41:41] Now, has anyone got a moment to help me with a maven problem? I'm trying to do a `mvn build` and the `spotbugs-maven-plugin` is telling me that I can't code. (Which is true.) How can I tell it which bugs to ignore?
[15:41:58] mvn build?
[15:42:04] Yes.
[15:42:05] btullis: what are you trying to do?
[15:42:28] Oh no. `mvn install` . Unknown lifecycle phase "build"
[15:42:35] Presto query logger.
[15:43:58] btullis: link? to pom?
[15:44:01] also, what about mvn package
[15:44:03] https://gitlab.wikimedia.org/repos/data-engineering/presto-query-logger/-/tree/log_to_file
[15:44:45] Oh, `mvn package` works. [INFO] BUILD SUCCESS
[15:45:28] nice, yeah i rarely mvn install, unless i need to use one project i'm deving as a dependency to another
[15:45:53] btullis: but also
[15:45:53] https://stackoverflow.com/a/60504443
[15:47:15] Excellent, thanks. I'm a total noob with maven, sadly. But I now have a `presto-query-logger-1.0-SNAPSHOT.jar` file anyway. Now I can see if it works on the test cluster.
[16:00:25] nice!
[16:39:46] FYI: I'm updating Olja's SSH key https://gerrit.wikimedia.org/r/c/operations/puppet/+/745908 - Feel free to check and +1
[16:41:06] btullis: if you have visual confirmation from Olja about the new key is more than enough to proceed
[16:41:36] ah yeah already merged :)
[16:41:43] Great, thanks. Yes, on chat as we speak.
[16:41:47] Meet.
[16:46:40] ottomata: in the jsonschema, I see 3 files, and either I missed it in the doc or am blind, but is there a way to generate some of them from the others?
[16:47:06] damit, found it "Now git add and git commit this new file. jsonschema-tools will automatically dereference and materialize this file for you." i guess?
[16:47:57] yes, but you can also use the jsonschema-tools CLI. Also, we might change that: https://phabricator.wikimedia.org/T290074
[16:48:49] $(npm bin)/jsonschema-tools --help
[16:49:01] *does the npm install*
[16:49:03] $(npm bin)/jsonschema-tools materialize ./path/to/current.yaml
[16:49:56] ty
[16:50:19] (PS1) Addshore: Add analytics/mwcli/command_report [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[16:51:18] (PS2) Addshore: Add analytics/mwcli/command_report [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[16:52:06] (CR) jerkins-bot: [V: -1] Add analytics/mwcli/command_report [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583) (owner: Addshore)
[17:13:06] addshore: why not just emit an event every time a command is run?
[17:13:31] might not be connected to the internet & also just results in lots of requests
[17:13:52] so the CLI batches up events in a JSON file, and then tries to emit them every X hours (im writing the X hours bit now)
[17:14:09] it could then send all of the individual events, but that would be lots? :/
[17:17:51] Oh right i forgot this is external clients
[17:18:09] :)
[17:18:11] you could send them as individual events but batched in a single HTTP POST as array
[17:18:21] eventgate takes an array of event objects in the post body too
[17:18:24] oooh interesting, i didnt think about that
[17:19:18] That could be nicer, as then If I create another event, (that isnt a generic command run) then I could emit them using the same request
[17:19:24] I'm gonna look at that now, thanks!
[17:20:18] yeah makes the schema model easier to reason about too
[18:18:59] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) I've made a little more progress on this, although for the time being I have just decided to write to files and to base my work on the AWS exampl...
[18:20:32] ottomata: Just wondering if puppet is still meant to be disabled on an-test-client1001.
[18:22:26] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) ...and now I've got a log file containing the query. ` btullis@an-test-coord1001:/var/log/presto$ cat queries-2021-12-10T18:17:21.0.log Dec 10, 2...
[18:24:44] I'm pleased with that. --^ Got my first query logged from Presto to a file on the test cluster.
[18:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[18:36:02] Have nice weekends everyone.
[18:53:41] ottomata: just an array of events, or also a map of events? (I'm guessing the former)
[18:57:30] yeah an array
[18:57:41] either an array of event objects, or a single event object
[18:58:35] addshore_^
[18:58:39] ty!
[18:58:53] just trying to figure out how to do this right in golang xD `listOfEvents := []interface{}`
[19:01:25] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (Ottomata) Very cool!
[19:02:43] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (Ottomata) If you end up wanting to produce this as an event on Event Platform, https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/#g...
[19:35:37] (PS3) Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[19:50:35] ottomata: thanks for the hint, im pretty happy with how this has turned out!
[19:50:47] great!
[19:50:49] looking
[19:51:04] oh you got a schema update?
[19:51:29] I went for a totally different event name and params
[19:51:46] so now, each command run is an event, and it comes with the named command, and version of the cli
[19:52:02] oh wait
[19:52:07] did i forget to push it? xD
[19:52:26] (PS4) Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[19:52:40] bah, and package-lock changed because node version, let me fix that
[19:53:15] (PS5) Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[19:53:25] right, thats better
[19:54:17] (CR) jerkins-bot: [V: -1] Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583) (owner: Addshore)
[19:55:51] addshore: how about a verb instead of a noun?
[19:55:53] https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Event_Data_Modeling_and_Schema_Naming
[19:55:53] (PS6) Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[19:55:57] command_execute
[19:55:57] ?
[19:56:02] ottomata: can do!
[19:56:36] awesome, aside from that lgtm! :)
[19:56:41] (PS7) Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583)
[19:56:43] oh
[19:56:46] actually
[19:56:55] addshore: you can $ref the common schema in your current.yaml
[19:56:59] so you don't have to copy paste all that in
[19:57:03] ooooo
[19:57:15] https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Schema_fragments
[19:57:32] (CR) jerkins-bot: [V: -1] Add analytics/mwcli/command_execution [schemas/event/secondary] - https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583) (owner: Addshore)
[19:57:57] https://schema.wikimedia.org/repositories//primary/jsonschema/fragment/common/1.1.0.yaml
[19:58:05] that'll get you $schema and meta
[19:58:29] ack!
[19:58:50] I can remove them from required too? [19:59:55] (03PS8) 10Addshore: Add analytics/mwcli/command_execution [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/745914 (https://phabricator.wikimedia.org/T293583) [20:00:33] Also [20:00:36] you can [20:00:36] `we want to allow meta.dt to [20:00:37] be the received time (filled in by EventGate at ingest time), as we don't [20:00:37] trust client-sent events to always set the proper date and time.` [20:00:45] So I shouldnt send that in my client payload? [20:01:02] if you are posting to eventgate, leave meta.dt blank [20:01:06] but adding a dt like you have and setting that is good [20:01:13] ack! [20:01:34] so posting to https://intake-analytics.wikimedia.org/v1/events?hasty=true i can remove meta.dt from my client data? [20:01:42] but I leave stream? [20:01:45] yes [20:01:51] great! [20:38:57] 10Analytics-Kanban, 10Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (10Ottomata) Ok, more findings, this time about how to run a python function in a packaged conda environment as a Spark job without having that conda env and python function locally on the Airflo... [20:39:26] mforns: yt? [22:31:51] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.9824 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [22:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
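Editor's sketch, not part of the channel log: the batching approach discussed above (several events in one HTTP POST as a JSON array, with meta.dt left out and a client-side dt plus meta.stream kept) might look roughly like this in Go. The struct fields command and version, the schema URI, and the stream name are hypothetical placeholders; only the intake URL is taken from the conversation. A typed slice is used here instead of `[]interface{}`, and treating any 2xx response as success is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Intake URL as quoted in the log.
const intakeURL = "https://intake-analytics.wikimedia.org/v1/events?hasty=true"

// Meta carries only the stream name. meta.dt is deliberately absent so
// that EventGate can fill in the received time at ingest.
type Meta struct {
	Stream string `json:"stream"`
}

// CommandEvent is a hypothetical shape for one CLI command run.
type CommandEvent struct {
	Schema  string `json:"$schema"`
	Meta    Meta   `json:"meta"`
	Dt      string `json:"dt"` // client-side event time
	Command string `json:"command"`
	Version string `json:"version"`
}

// newCommandEvent builds one event; schema URI and stream name are assumed.
func newCommandEvent(command, version string, at time.Time) CommandEvent {
	return CommandEvent{
		Schema:  "/analytics/mwcli/command_execute/1.0.0",
		Meta:    Meta{Stream: "mwcli.command_execute"},
		Dt:      at.UTC().Format(time.RFC3339),
		Command: command,
		Version: version,
	}
}

// encodeBatch serialises the events as a single JSON array.
func encodeBatch(events []CommandEvent) ([]byte, error) {
	if events == nil {
		events = []CommandEvent{} // encode as [] rather than null
	}
	return json.Marshal(events)
}

// postBatch sends all events in one request body.
func postBatch(client *http.Client, url string, events []CommandEvent) error {
	body, err := encodeBatch(events)
	if err != nil {
		return err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("event intake returned status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	now := time.Now()
	batch := []CommandEvent{
		newCommandEvent("docker start", "v0.0.0-example", now),
		newCommandEvent("docker stop", "v0.0.0-example", now),
	}
	body, err := encodeBatch(batch)
	if err != nil {
		panic(err)
	}
	// Print the payload instead of sending it; call
	// postBatch(http.DefaultClient, intakeURL, batch) to submit for real.
	fmt.Println(string(body))
}
```

Because the request body is just an array, a second event type could be appended to the same batch by switching the slice to a common interface type, which matches the "emit them using the same request" idea in the log.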