[04:27:41] PROBLEM - Check unit status of monitor_refine_event_sanitized_main_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:09:20] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have now deployed the change to double the number of replica pods for eventgate-an...
[09:44:51] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Setup and backups for datahub - https://phabricator.wikimedia.org/T308113 (BTullis)
[09:45:18] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Setup and backups for datahub - https://phabricator.wikimedia.org/T308113 (BTullis) p:Triage→High
[09:49:46] Data-Engineering, Airflow, Patch-For-Review: Set up backups and monitoring of airflow instances - https://phabricator.wikimedia.org/T307102 (BTullis) p:Triage→High a:BTullis
[09:50:28] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Set up backups and monitoring of airflow instances - https://phabricator.wikimedia.org/T307102 (BTullis)
[12:47:01] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) > Yup, it's there. Subtle but noticeable. Shaved off ~1s from p99 and ~80-100ms from...
[12:55:22] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have stumbled upon this issue with HAProxy, which seems to fit some of the symptom...
[13:29:40] !log restarted oozie jobs after deployment: mediarequest_top_files, pageview_top_articles, unique_devices_per_domain_monthly, unique_devices_per_project_family_monthly
[13:29:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:31:41] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) Hm, looks like there may be [[ https://lists.apache.org/list.html?users@kafka.apache.org | some bugs ]] in...
[13:44:59] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) Beginning with HAProxy 2.1 HTX is the only way to go. On another issue (https://g...
[13:51:00] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook)
[13:51:12] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook) a:rook
[13:51:42] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook) ssh not responsive on either of the workers. Trying a soft reboot quarry-worker-03 via horizon.
[14:01:21] ottomata, btullis: I'm looking at the EventgateLoggingExternalLatency alert. It seems there is a real problem, see: https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external&from=now-2d&to=now
[14:01:33] Not sure how to act here, can you guys help me?
[14:02:16] btullis: has been digging deep into this one, you can ignore it for now i think
[14:02:29] mforns: There's a big old conversation going on around this in T306181 and T294911
[14:02:29] T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181
[14:02:29] T294911: Apparent latency warning in 90th centile of eventgate-logging-external - https://phabricator.wikimedia.org/T294911
[14:03:00] ottomata, btullis: oh, thanks a lot!
[14:08:06] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook) puppet gave a segfault a little before the system was rebooted: ` May 11 13:33:46 quarry-worker-03 diamond[13068]: Took too long to run! Killed! May 11 13:33:46 quarry-worker-03 diamond[13068]: Took too long to run! Killed!...
[14:16:42] ottomata: do you have a link to the whole "reliable events" ticket again?
[14:18:28] aaah, must be https://phabricator.wikimedia.org/T120242
[15:04:54] ottomata: I'm trying to send a test event payload with curl to https://intake-analytics.wikimedia.org
[15:05:17] I thought I'd use one of the examples given here, but it's rejected. https://schema.wikimedia.org/repositories//secondary/jsonschema/analytics/test/latest.json
[15:06:13] Have you a recommendation for what event payload I can use for this?
[15:06:50] I got `"message": "event 1a958f76-3288-49fd-9593-d7bcca7d2e8d of schema at /analytics/test/2.0.0 destined to stream test.analytics.mediawiki is not allowed in stream; test.analytics.mediawiki is not configured."`
[15:08:01] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans) >>! In T307799#7916240, @JAllemandou wrote: > @Eevans : The AQS-loader is not datacenter-aware. It takes base hosts as a parameter and gets the cassandra cluster...
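Note on the 15:06:50 rejection: the error names the `meta.stream` value rather than the payload schema, so a test event needs a stream that the target eventgate instance is configured to accept. Below is a minimal Python sketch of such a payload, assuming the event shape of the `/test/event/1.0.0` examples quoted elsewhere in this log. `make_test_event` is a hypothetical helper, the id and timestamp are generated locally, and nothing is sent over the network.

```python
import json
import uuid
from datetime import datetime, timezone


def make_test_event(stream: str) -> dict:
    """Build a test event shaped like the /test/event/1.0.0 examples in this log.

    `stream` is supplied by the caller and is an assumption: it must name a
    stream that the target eventgate instance is actually configured for.
    """
    now = datetime.now(timezone.utc)
    return {
        "$schema": "/test/event/1.0.0",
        "meta": {
            "stream": stream,
            # Locally generated identifier, not taken from any real event.
            "id": str(uuid.uuid4()),
            # ISO 8601 UTC timestamp with millisecond precision and a Z suffix.
            "dt": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
        "test": "default value",
    }


if __name__ == "__main__":
    event = make_test_event("eventgate-analytics-external.test.event")
    print(json.dumps(event))
```

The printed JSON could serve as the request body for the curl test mentioned at 15:04:54; whether a given stream name is accepted still depends on the stream configuration of the instance being tested.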
[15:24:13] I've added `hasty=true` to the URL; I get a validation failure, but a 202 from eventgate, which I think is fine for my testing: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2022.05.11?id=YrOys4ABv2PODa7FVzhy
[15:25:09] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook) Doesn't really look related to puppet though. Runs fine now. and the other worker didn't show the same. Rebooting the nodes seems to have got them back in working order. Closing for now.
[15:25:18] Quarry: Quarry not running queries - https://phabricator.wikimedia.org/T308131 (rook) Open→Resolved
[15:33:52] Analytics, Data-Engineering, Event-Platform, Metrics-Platform: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (Ottomata) Here is an example of how we are already asserting that the schema title match the schema file path:...
[15:34:01] a-team anybody else coming to the PA sync?
[15:40:07] btullis: change the meta.stream name :) actually the easiest way to get a functioning example is just consume one from kafka
[15:40:47] kafkacat -b kafka-main1001.eqiad.wmnet -C -t eqiad.eventgate-main.test.event -u -c 1
[15:41:06] {"$schema":"/test/event/1.0.0","meta":{"stream":"eventgate-main.test.event","id":"d64aa522-a0fa-4ece-b30c-ca9019b38bce","dt":"2022-04-28T01:01:43.882Z","request_id":"c5434aa0-c68e-11ec-bfd9-8b1e7f9abb2d"},"test":"default value"}
[15:41:22] actually you are doing intake analytics
[15:41:53] kafkacat -b kafka-jumbo1001.eqiad.wmnet -C -t eqiad.eventgate-analytics-external.test.event -u -c 1
[15:41:56] {"$schema":"/test/event/1.0.0","meta":{"stream":"eventgate-analytics-external.test.event","id":"0b4067f8-8519-40ad-b762-e7d58d0187f7","dt":"2022-05-02T17:40:37.130Z","request_id":"f9ea0790-ca3e-11ec-93eb-25cf6e0f4be5"},"test":"default value"}
[15:52:19] ottomata: Thanks. Turned out to be a red herring anyway, I think.
[16:23:51] (PS1) Jsn.sherman: Add namespace to MobileWebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648)
[16:24:33] (CR) jerkins-bot: [V: -1] Add namespace to MobileWebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[16:44:34] (PS2) Jsn.sherman: Add namespace to MobileWebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648)
[17:14:49] (Restored) NOkafor: Created a function to get cassandra password from file [analytics/refinery/source] - https://gerrit.wikimedia.org/r/790652 (https://phabricator.wikimedia.org/T306895) (owner: NOkafor)
[17:15:13] Data-Engineering, Data-Engineering-Kanban: [POC] Use airflow-installed Spark3 for an Airflow job - https://phabricator.wikimedia.org/T308168 (JAllemandou)
[17:17:18] (Abandoned) NOkafor: Created a function to get cassandra password from file [analytics/refinery/source] - https://gerrit.wikimedia.org/r/790652 (https://phabricator.wikimedia.org/T306895) (owner: NOkafor)
[17:33:33] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:33:36] Data-Engineering, Product-Analytics: Consider not using anaconda as base conda environment - https://phabricator.wikimedia.org/T302819 (Ottomata)
[17:40:33] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:41:55] Data-Engineering, Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (Ottomata)
[17:41:57] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (Ottomata)
[17:43:43] Data-Engineering, Patch-For-Review: Upgrade Refinery Jobs to Spark 3 - https://phabricator.wikimedia.org/T291386 (Ottomata)
[17:43:47] Data-Engineering, Patch-For-Review: HiveExtensions.convertToSchema does not properly convert arrays of structs - https://phabricator.wikimedia.org/T259924 (Ottomata)
[17:44:46] Data-Engineering: Analytics-hadoop Spark3 package upgrade (production) - https://phabricator.wikimedia.org/T291466 (Ottomata)
[17:44:50] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:45:00] Data-Engineering, Airflow: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:45:55] Data-Engineering: Analytics-test-hadoop Spark3 package upgrade - https://phabricator.wikimedia.org/T291465 (Ottomata)
[17:46:07] Data-Engineering, Airflow: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:53:50] Data-Engineering, Airflow: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 (Ottomata)
[17:56:10] addshore: you know i must know why you were looking for it! :)
[17:58:06] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) Had a really helpful meeting today with @bblack, @BTullis, @ayounsi, @cmooney and @NOkafor-WMF. Took some...
[18:03:32] Data-Engineering, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) In summary, NetOps/Traffic folks are okay with this from the WAN perspective. We came to the conclusion t...
[18:06:49] !log razzi@lvs1020:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
[18:06:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:11:51] Data-Engineering, Data-Engineering-Kanban, Airflow: [POC] Use airflow-installed Spark3 for an Airflow job - https://phabricator.wikimedia.org/T308168 (JArguello-WMF)
[18:14:41] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (JArguello-WMF)
[18:15:47] !log razzi@lvs1019:~$ systemctl stop pybal.service to apply change https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915
[18:15:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:16:40] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (JArguello-WMF)
[18:16:58] Data-Engineering-Radar, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (JArguello-WMF)
[18:18:24] Data-Engineering-Radar, Cassandra: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (JArguello-WMF)
[18:18:38] (PS3) NOkafor: Updated the get Cassandra password function to; [analytics/refinery/source] - https://gerrit.wikimedia.org/r/790651 (https://phabricator.wikimedia.org/T306895)
[18:20:54] !log disregard the above log; wrote out the command but then saw there was a warning for cr2-eqiad
[18:20:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:21:19] Data-Engineering-Radar, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (JArguello-WMF)
[18:21:23] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Ozzie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (Snwachukwu)
[18:36:15] Data-Engineering, Data-Persistence, Privacy Engineering, SRE-swift-storage: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Milimetric) @Htriedman: I know you're talking to @EChetty about this, we're triaging it to this column which is like a task "inc...
[18:37:14] Data-Engineering, Data-Engineering-Kanban, Airflow: Plan spark3 migration - possibly incrementally - https://phabricator.wikimedia.org/T306955 (JArguello-WMF)
[18:39:34] Data-Engineering, Data-Engineering-Kanban, Airflow: Improvements of artifacts cache - https://phabricator.wikimedia.org/T307115 (Ottomata)
[18:41:00] Data-Engineering, Data-Engineering-Kanban, Airflow: Improvements of artifacts cache - https://phabricator.wikimedia.org/T307115 (Ottomata) Thanks Antoine! I just added another needed fix too.
[18:44:20] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (Mayakp.wiki) @BTullis/ @Milimetric : since Line Charts are deprecated and to avoid similar problems in the future, would you recommen...
[18:45:49] Analytics, Analytics-Wikistats, Data-Engineering, User-TheresNoTime, and 2 others: Wikistats Bug - easy to understand language for pageviews - https://phabricator.wikimedia.org/T263973 (Milimetric) @Kipala & @TheresNoTime: I recently updated the language here as part of another task, can you take...
[18:51:51] (CR) Jdlrobson: [C: +1] "Could you also update the desktop" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[18:54:34] ottomata, aqu : We have spark3 working in yarn mode with hive from an-launcher1002 airflow install :)
[18:55:07] Reusing spark2 config with minimal changes
[18:55:41] And, it works with the old shuffler (less performant, but works)
[18:55:44] Big win :)
[18:56:34] ottomata: depending on how busy you are tomorrow, maybe we'll find time to try to puppetize the changes in conf?
[18:57:01] Now it's all about making that work from airflow - We'll do that tomorrow with aqu
[19:04:55] Data-Engineering, Data-Persistence, Privacy Engineering, SRE-swift-storage: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Htriedman) @Milimetric Thanks for the pointers on this process! I also just talked to @gmodena and think that we're starting to...
[19:05:18] Data-Engineering-Kanban, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Reimage WMCS db proxies to Bullseye - https://phabricator.wikimedia.org/T298940 (razzi) I merged the related patch, but when I restarted pybal it caused an alert, so I'm waiting for input from the traffic team...
[19:10:32] (CR) Jsn.sherman: Add namespace to MobileWebUIActionsTracking (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[19:13:23] (CR) Jdlrobson: [C: +1] Add namespace to MobileWebUIActionsTracking (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[19:14:38] joal: nice!
[19:14:44] joal: sure sounds good
[19:17:13] (CR) Jsn.sherman: Add namespace to MobileWebUIActionsTracking (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[19:22:05] Data-Engineering, Data-Engineering-Kanban, Cassandra, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Ottomata) > Is there a mechanism in place to do this from Puppet (or otherwise via private.git)? Yes, but unfortunately, it is not smart enough...
[19:30:08] (PS3) Jsn.sherman: Add namespace to MobileWebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648)
[19:39:21] (PS4) Jsn.sherman: Add namespace to Mobile & Desktop WebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648)
[20:28:55] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata)
[20:36:45] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (Milimetric) @Mayakp.wiki I think we should build all new line charts using apache echarts (Time Series Line Chart in this case). When...
[20:59:45] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) @gmodena @lbowmaker @dcausse, @JAllemandou, FYI I [[ https://docs.google.com/...
[21:15:45] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (Ottomata) Oh, hm, we need a new ZK cluster! Or if we want to try KRaft, just another k...
[21:21:21] ottomata: someone will reach out :) I believe! :P
[21:21:34] But generally I was just checking to see if it was magically resolved yet or not ;)
[21:26:13] (VarnishkafkaNoMessages) firing: ...
[21:26:13] varnishkafka for instance cp3052:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3052:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:31:13] (VarnishkafkaNoMessages) resolved: ...
[21:31:13] varnishkafka for instance cp3052:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3052:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:34:38] (CR) Jdlrobson: [C: +2] Add namespace to Mobile & Desktop WebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[21:36:49] (Merged) jenkins-bot: Add namespace to Mobile & Desktop WebUIActionsTracking [schemas/event/secondary] - https://gerrit.wikimedia.org/r/791053 (https://phabricator.wikimedia.org/T306648) (owner: Jsn.sherman)
[22:50:51] Data-Engineering, Product-Analytics: Add editors_monthly data to Druid - https://phabricator.wikimedia.org/T256719 (kzimmerman)
[22:53:13] (VarnishkafkaNoMessages) firing: ...
[22:53:18] varnishkafka for instance cp3058:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3058:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:11:13] (VarnishkafkaNoMessages) resolved: ...
[23:11:13] varnishkafka for instance cp3058:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3058:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages