[01:40:47] 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10ori) > EventLogging is home grown, and was not designed for purposes other than low volume analytics in MySQL databa... [08:31:07] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 00): Create cassandra loading HQL files from their oozie definition - https://phabricator.wikimedia.org/T311507 (10EChetty) [08:31:43] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines: Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10EChetty) [08:36:52] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 00): Create conda-base-env with last pyspark - https://phabricator.wikimedia.org/T309227 (10EChetty) [08:49:16] 10Analytics-Wikistats, 10Data Engineering Planning, 10Data Pipelines: Wikistats in Uzbek - https://phabricator.wikimedia.org/T314477 (10EChetty) [08:49:19] 10Analytics-Wikistats, 10Data Engineering Planning, 10Data Pipelines: "Pages to date" not loading with "daily" metric - https://phabricator.wikimedia.org/T312717 (10EChetty) [08:49:34] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines, 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty) [08:50:40] 10Data-Engineering-Kanban, 10Cassandra, 10Data Engineering Planning, 10Shared-Data-Infrastructure, 10User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (10EChetty) [08:56:13] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Data Pipelines (Sprint 00), 10Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (10EChetty) [11:00:28] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Event-Platform Value Stream: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (10phuedx) I was about to write a task to request that `consumer` is... [12:47:36] 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform Value Stream, 10MediaWiki-extensions-EventLogging, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) <3 [12:51:01] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Event-Platform Value Stream: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (10Ottomata) > I'd also suggest that we shift to allowlisting stream... [12:52:13] (03CR) 10Ottomata: [C: 03+1] Clean up specific jar versions from comments [analytics/refinery] - 10https://gerrit.wikimedia.org/r/820776 (owner: 10Milimetric) [13:29:37] 10Data-Engineering-Kanban, 10Event-Platform Value Stream, 10Metrics-Platform, 10Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (10phuedx) [13:30:09] 10Data-Engineering-Kanban, 10Event-Platform Value Stream, 10Metrics-Platform, 10Wikidata, and 5 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (10phuedx) AFAICT this is Done™. 
[14:10:38] 10Analytics, 10Data-Engineering, 10Metrics-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (10Ottomata) [14:15:42] 10Data-Engineering, 10Event-Platform Value Stream, 10Generated Data Platform, 10Spike: Define how to authenticate with Cassandra and test Flink POC - https://phabricator.wikimedia.org/T313628 (10lbowmaker) [14:20:26] 10Data-Engineering, 10Event-Platform Value Stream, 10Epic: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 (10Ottomata) [14:20:34] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform] enrichment module should not depend on flink-scala - https://phabricator.wikimedia.org/T310680 (10Ottomata) [14:20:38] 10Data-Engineering-Kanban, 10Data-Engineering-Radar, 10Event-Platform Value Stream, 10Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (10Ottomata) [14:20:45] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream. - https://phabricator.wikimedia.org/T311084 (10Ottomata) [14:20:57] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform][SPIKE] investigate Flink metric reporters and prometheus integration - https://phabricator.wikimedia.org/T310805 (10Ottomata) [14:21:09] 10Data-Engineering, 10Event-Platform Value Stream, 10Spike: Define how to authenticate with Cassandra and test Flink POC - https://phabricator.wikimedia.org/T313628 (10Ottomata) [14:21:21] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform] Rename scala package and setup a multi module maven project. 
- https://phabricator.wikimedia.org/T310626 (10Ottomata) [14:21:30] 10Data-Engineering, 10Event-Platform Value Stream, 10Generated Data Platform, 10Epic: Integrate Image Suggestions Feedback with Cassandra - https://phabricator.wikimedia.org/T306627 (10Ottomata) [14:21:51] 10Data-Engineering, 10Event-Platform Value Stream, 10Generated Data Platform: [Shared Event Platform][SPIKE] Decide a format for the enriched stream schema - https://phabricator.wikimedia.org/T311600 (10Ottomata) [14:21:53] 10Data-Engineering, 10Event-Platform Value Stream, 10Epic: [Shared Event Platform] [SPIKE] Decide on page state change storing and backfill approach - https://phabricator.wikimedia.org/T311085 (10Ottomata) [14:22:21] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform][NEEDS GROOMING] We should standardize Flink app config for yarn (development) deployments - https://phabricator.wikimedia.org/T311070 (10Ottomata) [14:22:31] 10Data-Engineering, 10Event-Platform Value Stream: [Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (10Ottomata) [14:22:39] 10Data-Engineering, 10Event-Platform Value Stream, 10Epic: Event Platform - Proof of Concept - Enriched Edit History Message Creation - https://phabricator.wikimedia.org/T302834 (10Ottomata) [14:22:47] 10Data-Engineering, 10Event-Platform Value Stream, 10Generated Data Platform: [Shared Event Platform][SPIKE] Decide a format for the enriched stream schema - https://phabricator.wikimedia.org/T311600 (10Ottomata) [14:23:15] 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10Ottomata) [14:23:28] 10Data-Engineering, 10Generated Data Platform, 10Epic, 10Event-Platform Value Stream (Sprint 00): Integrate Image Suggestions Feedback with Cassandra - https://phabricator.wikimedia.org/T306627 (10lbowmaker) [14:23:49] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00), 10Spike: Define how to authenticate with Cassandra and test Flink POC - https://phabricator.wikimedia.org/T313628 (10lbowmaker) [14:24:12] 10Data-Engineering, 10Epic, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] [SPIKE] Decide on page state change storing and backfill approach - https://phabricator.wikimedia.org/T311085 (10Ottomata) [14:24:31] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00), 10Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (10lbowmaker) [14:25:54] 10Data-Engineering, 10Generated Data Platform, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (10lbowmaker) [14:25:56] 10Data-Engineering, 10Epic, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] [SPIKE] Decide on page state change storing and backfill approach - https://phabricator.wikimedia.org/T311085 (10Ottomata) [14:34:12] 10Data-Engineering-Kanban, 10Event-Platform Value Stream (Sprint 00), 10Patch-For-Review: [BUG] jsonschema-tools materializes fields in yaml in a different order than in json files - https://phabricator.wikimedia.org/T308450 (10lbowmaker) [14:34:17] 10Analytics, 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops: eventgate helm chart should use common_templates 
_tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) [14:34:29] 10Data-Engineering-Kanban, 10Data Engineering Planning, 10Event-Platform Value Stream, 10SRE, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) [14:37:02] 10Data-Engineering-Kanban, 10Data-Engineering-Radar, 10Event-Platform Value Stream (Sprint 00), 10Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (10Ottomata) [14:42:06] 10Data-Engineering-Kanban, 10Data-Engineering-Radar, 10Event-Platform Value Stream (Sprint 00), 10Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (10Ottomata) [14:44:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform Value Stream, 10Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata) [14:53:35] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream. - https://phabricator.wikimedia.org/T311084 (10lbowmaker) [14:54:17] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (10lbowmaker) [14:55:58] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 00): [Shared Event Platform] Rename scala package and setup a multi module maven project. - https://phabricator.wikimedia.org/T310626 (10lbowmaker) [15:02:26] 10Data-Engineering, 10Epic, 10Event-Platform Value Stream (Sprint 00): Integrate Image Suggestions Feedback with Cassandra - https://phabricator.wikimedia.org/T306627 (10lbowmaker) [17:02:40] ottomata: I'm not seeing any new partitions in event.mediawiki_cirrussearch_request since 2022-08-09T09:00:00Z, any ideas why? I can see events in kafka-jumbo for eqiad.mediawiki.cirrussearch-request still coming in with current timestamps [17:02:59] (i've been on vacation, just looking over my email full of errors from last week) [18:02:22] hmm, seems that's incorrect, something else going on but the data is ingesting (initial looks at sorted partitioning apparently sorted by string value instead of integer value of year/month/day/hour) [18:39:21] ebernhardson: oh soooo, sorry, is it okay then? [18:39:46] instead it was a single missing partition that caused our pipelines to stall out, sec, i can pull back up exactly which one it was. 
If you could create the partition (even simply empty) that would work [18:41:17] ottomata: event.mediawiki_cirrussearch_request/datacenter=codfw/year=2022/month=8/day=4/hour=17 is missing [18:41:46] ottomata: being codfw i would expect it typically only has the example event in it [18:42:06] (although that's soon changing with the multi-dc deployment being in progress) [19:15:34] PROBLEM - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:18:08] PROBLEM - Check unit status of analytics-dumps-fetch-geoeditors_dumps on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-geoeditors_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:18:27] ebernhardson: the canary event is filtered out from the event.* tables. looking... [19:20:13] huh indeed, even the raw data does not have an hour=17. [19:20:50] in /wmf/data/raw/event/codfw.mediawiki.cirrussearch-request/year=2022/month=08/day=04 [19:26:24] !log test [19:26:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:27:32] ottomata: unrelated, is there a convenient way to look at what partitions exist? I ended up taking `show partitions event.mediawiki_cirrussearch_request` in a pyspark shell and then manually parsing the output into a pandas dataframe, entirely error prone [19:28:09] trying to think of a better way to ask hive metastore what partitions are missing when we expect hourly ones [19:28:34] that is the best way I know of, some kind of show partitions ... query [19:28:52] you can also look at directories in hdfs, they should == show partitions output [19:29:17] hmm, actually that's not a bad idea, might be easier to walk through things with bash since it all comes split into pieces directly [19:29:20] this is def a very strange problem. still investigating. [19:29:38] i think you can also ask the metastore api directly...maybe? :) [19:30:10] probably more complicated tho [19:31:00] yea that's mostly what airflow does, for the moment i adjusted our ranged partition sensors in airflow to report the list of partitions it didn't find at the end of a poke (previously it only reported that it was checking things) which will hopefully be enough for the future [19:31:04] ebernhardson: also [19:31:05] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/RefineTarget.scala#L664 [19:31:16] that's how we do it in Refine [19:31:32] or via Hive: [19:31:32] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/RefineTarget.scala#L777 [19:32:05] getPartitionPathsFromDb [19:32:06] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/RefineTarget.scala#L912 [19:33:06] heh, yea seems quite similar, in airflow we ask it to iterate from timestamp a to b with a period of hours and generate all the expected partition specifications, and then ask metastore one at a time if that partition exists. [19:34:03] it's just entirely difficult to invoke from a shell or some other debug location
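
A rough pyspark sketch of the kind of check being described here: generate the expected hourly partition specs for a time range and diff them against `show partitions` output, rather than hand-parsing it into pandas. The table, datacenters and time range below are only illustrative, and it assumes partition strings come back in the unpadded datacenter=codfw/year=2022/month=8/day=4/hour=17 form quoted above; it is not how the airflow ranged sensor or RefineTarget actually do it.

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative values only: the table being debugged above and a guessed time range.
TABLE = "event.mediawiki_cirrussearch_request"
DATACENTERS = ["eqiad", "codfw"]
START = datetime(2022, 8, 1, 0)
END = datetime(2022, 8, 9, 9)

# SHOW PARTITIONS returns one row per partition spec,
# e.g. datacenter=codfw/year=2022/month=8/day=4/hour=17
existing = {row[0] for row in spark.sql(f"SHOW PARTITIONS {TABLE}").collect()}

missing = []
ts = START
while ts <= END:
    for dc in DATACENTERS:
        spec = f"datacenter={dc}/year={ts.year}/month={ts.month}/day={ts.day}/hour={ts.hour}"
        if spec not in existing:
            missing.append(spec)
    ts += timedelta(hours=1)

print("\n".join(missing) if missing else "no missing hourly partitions")
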
[19:34:07] i think that's sort of how our airflow stuff works too? https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/sensors/hive.py [19:34:11] hm. [19:34:12] this raw data with missing hour=17 is strange [19:34:17] two things look weird. [19:34:43] one, the gobblin job that ran at this time wrote some data for both hour 16 and 18, [19:35:11] is that weird? hm maybe not. [19:35:25] but it indicates to me that there were no canary events for codfw hour=17 [19:35:32] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:35:41] i don't see a relevant canary event problem alert in my emails... [19:35:44] looking at logs... [19:35:49] maybe the canary ran close to an hourly boundary and got put into the nearby hour? [19:35:56] it runs 4 times an hour [19:36:06] ahh, so then it really should be there somewhere [19:36:17] yeah...unless the systemd timer was stopped for that whole hour [19:36:20] or it failed each time [19:38:03] according to produce_canary_events.log, we successfully posted to https://eventgate-analytics.svc.codfw.wmnet:4592/v1/events for "stream":"mediawiki.cirrussearch-request" several times in hour=17 [19:38:15] hmm is it possible eventgate-main was depooled in codfw during that time? [19:38:32] no wait, shouldn't matter, i'm specifically addressing the codfw service [19:38:47] https://www.irccloud.com/pastebin/XXj8LJx9/ [19:40:03] (oh right, this is eventgate-analytics, but still same thing) [19:40:19] we did roll restart eventgate-analytics-external in hour=19 [19:40:39] (sorry also mistyped ^^^, meant 'several times in hour=19') [19:40:46] wait no, ah [19:40:52] i'm grepping for 19, should be grepping for 17. [19:41:01] i do the same all the time :) [19:41:09] OH ho! [19:41:19] indeed, there are canary produce errors for hour=17 [19:41:24] 4 of them, each time [19:41:36] ahhhh this was during a botched refinery deploy, jars did not get synced correctly i think. [19:42:01] HM. [19:43:17] hm, it looks like they fixed the deployment by hour=18 though. [19:43:21] https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:44:15] ahh, i should have looked at SAL. would have seen lots of related deploys in the 17:00 hour [19:44:35] but, not even, they are in the 16 hour. [19:44:40] ARGH [19:44:43] 18 hour. [19:44:53] in main SAL i see: 17:09 ebysans@deploy1002: Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288 [19:45:02] (18==6pm brain is misbehaving!) [19:45:22] AHHHHH i see [19:45:35] sandra deployed at 15:59, and it was broken until 18:19 [19:45:38] yes. [19:45:56] did we not get an email?! [19:45:57] hm [19:48:23] hm, i can't quite tell. we did get alert emails about the job failing, but we got those for many many jobs, since the jars didn't sync properly during the deploy. [19:48:48] usually we get some info about particular produce canary events failing, but because the whole job failed with [19:48:48] Aug 4 17:15:00 an-launcher1002 produce_canary_events[23815]: Error: Could not find or load main class org.wikimedia.analytics.refinery.job.ProduceCanaryEvents [19:48:53] i guess the email didn't send. [19:49:17] sigh, it would be nice to move this to airflow or k8s or something. 
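
For the workaround ebernhardson asked for earlier (creating the missing partition, even an empty one), something like the following Spark SQL would register the empty codfw hour=17 partition so the stalled ranged sensors see the hour as present. The partition values are the ones identified above; whether this is how it was actually backfilled is not recorded here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the empty partition the pipelines are waiting on (values from the
# discussion above). No data files are needed; the hour will simply read as zero rows.
spark.sql("""
    ALTER TABLE event.mediawiki_cirrussearch_request
    ADD IF NOT EXISTS PARTITION (datacenter='codfw', year=2022, month=8, day=4, hour=17)
""")
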
[19:50:08] this means that no canary events were sent for hour 17, so topics with 0 regular volume got no canaries. [19:50:25] so there are probably a lot of event tables that are missing codfw hour=17 [19:50:56] hmm, yea makes sense. Probably not too many people yet rely on detecting that but as more teams pick up airflow (including yours :) it will probably become more common [19:51:06] yeah, https://phabricator.wikimedia.org/T252585 [19:51:17] I'm going to use that ticket to add context, and also bring it up to the team [19:56:27] thanks! [19:56:31] 10Analytics, 10Data-Engineering, 10Event-Platform Value Stream: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (10Ottomata) We just encountered an issue related to this ticket. I'll add it here to... [20:09:40] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:12:08] RECOVERY - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1001 is OK: OK: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:43:44] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:48:28] RECOVERY - Check unit status of analytics-dumps-fetch-geoeditors_dumps on clouddumps1001 is OK: OK: Status of the systemd unit analytics-dumps-fetch-geoeditors_dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:04:21] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:36:57] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:47:01] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:30:26] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:38:57] (03CR) 10Gergő Tisza: [C: 03+2] analytics/mediawiki/editgrowthconfig: Add is_registered_user [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820661 (https://phabricator.wikimedia.org/T312148) (owner: 10Urbanecm) [22:39:35] (03Merged) 10jenkins-bot: analytics/mediawiki/editgrowthconfig: Add is_registered_user [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/820661 (https://phabricator.wikimedia.org/T312148) (owner: 10Urbanecm) [23:06:55] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
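
Following up on the point above that a lot of event tables are probably missing codfw hour=17, a quick audit could look roughly like this: loop over the tables in the event Hive database and report which ones lack that partition. It assumes every table there uses the same (datacenter, year, month, day, hour) partitioning and unpadded spec format, which may not hold for all of them.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

# The partition spec every event table is assumed to share (see note above).
TARGET = "datacenter=codfw/year=2022/month=8/day=4/hour=17"

for tbl in spark.catalog.listTables("event"):
    name = f"event.{tbl.name}"
    try:
        parts = {row[0] for row in spark.sql(f"SHOW PARTITIONS {name}").collect()}
    except AnalysisException:
        continue  # skip views and non-partitioned tables
    if TARGET not in parts:
        print(f"{name} is missing {TARGET}")
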
[23:28:49] 10Data-Engineering, 10Product-Analytics: Suspicious bot activity in June - https://phabricator.wikimedia.org/T315267 (10Mayakp.wiki) [23:39:43] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:50:07] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring