[00:28:07] (PS7) Snwachukwu: [WIP] Add Dynamic Pivot job for reportupdater reports [analytics/refinery/source] - https://gerrit.wikimedia.org/r/995271 (https://phabricator.wikimedia.org/T354552)
[01:54:54] Data-Engineering (Sprint 9), Patch-For-Review: [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552#9553977 (CodeReviewBot) ebysans opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/615 Add dag for browser All S...
[07:53:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:58:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:02:21] o/ can someone check the canary events subsystem? it seems down since 2024/02/14T13:00
[08:13:11] (CR) Brouberol: [C: +1] Add stat1010 and stat1011 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/1003042 (https://phabricator.wikimedia.org/T336040) (owner: Stevemunene)
[08:17:30] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794#9554212 (brouberol) I managed to integrate our IDP server with the OIDC Superset login. However, I think we're missing some data in LDAP if we want...
[08:49:42] (CR) Joal: "I would change the folder organization to not contain reportupdater_queries, and to possibly be a bit more explicit than cx - for instance" [analytics/refinery] - https://gerrit.wikimedia.org/r/1003928 (owner: Aleksandar Mastilovic)
[08:52:20] (PS4) Snwachukwu: Add Reportupdater Browser All Sites Queries. [analytics/refinery] - https://gerrit.wikimedia.org/r/995740 (https://phabricator.wikimedia.org/T354552)
[09:26:35] Morning all. I'm back from leave now, just catching up.
[09:27:33] dcausse: I'll check this. I think that the canary events publishing is already being worked on, but if it's still wedged now then I'll see if there is anything I can do.
[09:28:44] btullis: thanks & welcome back! (yes brouberol had a look already and it's a known issue that was discussed in slack)
[09:29:20] Ack
[09:40:11] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438#9554302 (BTullis)
[09:41:38] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578#9554301 (BTullis) Open→Resolved
[10:14:12] (CR) Btullis: [C: +1] Add stat1010 and stat1011 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/1003042 (https://phabricator.wikimedia.org/T336040) (owner: Stevemunene)
[10:14:22] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9554373 (taavi)
[10:17:06] (CR) Btullis: [C: +2] Add stat1010 and stat1011 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/1003042 (https://phabricator.wikimedia.org/T336040) (owner: Stevemunene)
[10:17:09] (CR) Btullis: [V: +2 C: +2] Add stat1010 and stat1011 to scap targets [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/1003042 (https://phabricator.wikimedia.org/T336040) (owner: Stevemunene)
[10:18:35] sorry to bug you again but is it known that wmf.pageview_hourly has no new partitions since year=2024/month=2/day=17/hour=6 ?
[10:37:27] dcausse: I can see airflow task failures for `pageview_actor_hourly.compute_pageview_actor_hourly` and subsequent SLA misses, so it should be known about. I'll check with sfaci who is on Ops Week, I believe.
[10:38:30] thx!
[10:45:02] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9554465 (BTullis)
[10:45:16] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9554480 (BTullis) p:Triage→High
[10:51:32] dcausse: sfaci is looking at this issue now. Initial investigation shows that there were some repeated spark out-of-memory issues on that task.
[10:53:20] btullis: ack, thanks for the update
[11:14:34] !log rerunning the compute_pageview_actor_hourly task in the pageview_actor_hourly DAG 2024-02-17 08:00:00 UTC
[11:14:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:25:46] sfaci and I have cleared the failed task in airflow, along with the downstream task, but for some reason it doesn't seem to be re-running it as we expect.
[11:27:24] https://usercontent.irccloud-cdn.com/file/r8fboZVP/image.png
[11:28:16] We're unsure why the DAG run remains in the scheduled state and the jobs do not appear to be retrying.
[11:33:53] Data-Engineering, Data-Platform-SRE, Epic: Migrate the Analytics Superset instances to our DSE Kubernetes cluster - https://phabricator.wikimedia.org/T347710#9554633 (BTullis)
[11:47:18] Data-Platform-SRE, SRE, SRE-Access-Requests: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9554681 (AndrewTavis_WMDE) Thank you @hnowlan for the check in here. Final word on this is coming from @Manuel who will be back in the office on...
[11:54:35] the reason the tasks stay in "not yet started" mode is that another dag-instance is already started, and we can only have one running at any single point in time. The already started instance was waiting for data, it's now running, we should see the backfilling task start soon
[11:57:00] I thought the parameter for allowed concurrent dag-instances per dag had been changed from 1 to 3, but looking at the code it seems not
[11:57:02] joal: Ah, of course! Thanks so much. I remember now the whole discussion around this: https://phabricator.wikimedia.org/T300870#9352906 when we switched the maximum concurrency back to 1.
[11:58:10] And indeed the task is now running. https://usercontent.irccloud-cdn.com/file/dhMmgeJx/image.png
[12:02:48] Ah! Thanks Joseph! I can see the task starting again right now
[12:06:01] btullis: I think we have not applied the decision made here: https://wikimedia.slack.com/archives/C02291Z9YQY/p1701198735026259
[12:10:46] It seems the job has failed again because of the same error: something related to org.apache.spark.memory.SparkOutOfMemoryError. There are a lot of exceptions like that. Any clues about the next step here?
[12:57:00] joal: Oops, sorry about that. That was clearly on me to do it and I failed to follow up on it. I'll make a patch now.
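The queuing behaviour described at 11:54:35 can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyScheduler`, `submit` and `finish` are hypothetical names, not Airflow code, and the field name `max_active_runs` is borrowed from Airflow's DAG argument of the same name, which is presumably the "allowed concurrent dag-instances" parameter mentioned above. With a cap of 1, a cleared (backfill) run stays queued until the already started run finishes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy model of a per-DAG cap on concurrently running DAG runs (hypothetical)."""

    max_active_runs: int = 1
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)

    def submit(self, run_id: str) -> str:
        """Start the run if a slot is free, otherwise leave it queued."""
        if len(self.running) < self.max_active_runs:
            self.running.append(run_id)
            return "running"
        self.queued.append(run_id)
        return "queued"

    def finish(self, run_id: str) -> None:
        """Mark a run finished and promote queued runs into any free slots."""
        self.running.remove(run_id)
        while self.queued and len(self.running) < self.max_active_runs:
            self.running.append(self.queued.popleft())


if __name__ == "__main__":
    sched = ToyScheduler(max_active_runs=1)
    # Run labels are illustrative: the latest run holds the only slot.
    print(sched.submit("latest-run"))
    # The cleared 2024-02-17 08:00 run cannot start yet.
    print(sched.submit("2024-02-17T08:00"))
    sched.finish("latest-run")
    print(sched.running)
```

Raising the cap to 3, as joal recalls was the intention, would let the backfill run start alongside the current one in this model.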
[13:01:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:03:06] sfaci, this is where I'm not very sure what to do. It is possible, I believe, to override the configuration parameters for the `pageview_hourly` job, from the Airflow UI. You click on Admin->Variables and find the name of the DAG in the list.
[13:03:09] https://usercontent.irccloud-cdn.com/file/Bo4T9yz5/image.png
[13:05:11] Data-Engineering, Data-Engineering-Kanban, Data Pipelines: Investigate Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263#9554901 (lbowmaker) Open→Resolved
[13:06:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[13:07:41] I'll say that I'm a bit confused though, because the pageview_hourly DAG has some default options of 4GB (driver) and 8GB (executor) values configured here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/pageview/pageview_hourly_dag.py?ref_type=heads#L66
[13:09:12] Data-Engineering, Data-Engineering-Wikistats, Data Products: A proposed modification to pageviews data protection for Wikistats - https://phabricator.wikimedia.org/T340926#9554915 (lbowmaker)
[13:09:34] I understand those to be the default values, used when no values are defined for the properties in the place you showed before, right? Higher values are present there right now. 12G for the driver, for example
[13:10:32] Oh, my mistake. I was looking at the wrong DAG. It's the pageview_actor_hourly which has these defaults configured here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/pageview/pageview_actor_hourly_dag.py?ref_type=heads#L72
[13:11:09] Data-Engineering (Sprint 9): Turn off ReportUpdater jobs no longer used - https://phabricator.wikimedia.org/T357419#9554922 (lbowmaker)
[13:12:13] It's still confusing to me. https://usercontent.irccloud-cdn.com/file/n4kAABwu/image.png
[13:17:25] Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076#9554941 (lbowmaker) Open→Resolved a:lbowmaker
[13:24:06] The last values are the ones configured in the properties. I assume they are the ones that are applied when running the job. Aren't they enough? Should we provide more memory for some of them?
[13:39:37] sfaci: Shall we pair on this again? Same meeting room?
[14:20:27] Data-Engineering, MW-on-K8s, SRE, serviceops, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555132 (Clement_Goubert)
[14:20:41] Data-Engineering, MW-on-K8s, SRE, serviceops, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555135 (Clement_Goubert) duplicate→In progress
[14:27:01] Data-Engineering, MW-on-K8s, SRE, serviceops, Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9555154 (Clement_Goubert) We're all set for this, according to [[ https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment#Upgr...
[14:28:14] Data-Engineering: [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690#9555165 (lbowmaker)
[14:29:12] Data-Engineering (Sprint 9): [Dataset Config Store] Deploy poc to dse-k8s - https://phabricator.wikimedia.org/T357434#9555168 (lbowmaker)
[14:41:39] For the record, we bumped the pageview_actor_hourly executor memory to 16 GB and the missing hour worked. Dependent jobs should have caught up now.
[14:51:17] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9555202 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/27 Configure the super...
[15:10:12] joal: as discussed this morning: https://gitlab.wikimedia.org/repos/ci-tools/wmf-jvm-parent-pom/-/merge_requests/7
[15:10:24] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794#9555238 (BTullis) >>! In T353794#9554212, @brouberol wrote: > I managed to integrate our IDP server with the OIDC Superset login. However, I think...
[15:10:33] and also https://gitlab.wikimedia.org/repos/ci-tools/wmf-jvm-parent-pom/-/merge_requests/6 if you have a bit more time.
[15:11:13] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794#9555242 (BTullis) @SLyngshede-WMF has been working on the IDP project a lot, so he may have some additional insights if you run into trouble with g...
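The defaults-versus-overrides question raised between 13:03 and 13:24, and the 16 GB bump recorded at 14:41:39, can be sketched as a simple precedence rule. Everything here is an assumption for illustration: `DAG_DEFAULTS`, `resolve_spark_memory` and the key names are hypothetical, the default values are only stand-ins for what a DAG file might hard-code, and the real airflow-dags mechanism may work differently. The sketch shows defaults from the DAG file being used unless a JSON variable (as one might set under Admin->Variables) supplies a replacement.

```python
import json
from typing import Optional

# Illustrative defaults, standing in for values hard-coded in a DAG file.
DAG_DEFAULTS = {"driver_memory": "4G", "executor_memory": "8G"}


def resolve_spark_memory(defaults: dict, variable_json: Optional[str]) -> dict:
    """Return the defaults overlaid with any overrides found in the variable.

    `variable_json` models the raw JSON text of a per-DAG variable;
    None or an empty string means no override has been set.
    """
    overrides = json.loads(variable_json) if variable_json else {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Fail loudly on misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown override keys: {sorted(unknown)}")
    # Later entries win, so overrides take precedence over defaults.
    return {**defaults, **overrides}


if __name__ == "__main__":
    # No variable set: the DAG-file defaults apply.
    print(resolve_spark_memory(DAG_DEFAULTS, None))
    # Override only the executor memory, as in the 16 GB bump.
    print(resolve_spark_memory(DAG_DEFAULTS, '{"executor_memory": "16G"}'))
```

Rejecting unknown keys is a design choice of this sketch; it would surface a typo in an override before a job runs with the old memory settings.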
[15:18:21] Data-Engineering (Sprint 9): Airflow mapped tasks UI & metrics - https://phabricator.wikimedia.org/T357430#9555282 (lbowmaker)
[15:19:37] Data-Engineering (Sprint 10): [Dataset Config Store] Setup initial CI checks - https://phabricator.wikimedia.org/T357468#9555287 (lbowmaker)
[15:19:41] Data-Engineering (Sprint 10), Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466#9555289 (lbowmaker)
[15:19:54] Data-Engineering (Sprint 10): [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690#9555291 (lbowmaker)
[15:22:43] Data-Engineering, Pageviews-API, Tool-Pageviews: No Pageviews data since 2024-02-17 - https://phabricator.wikimedia.org/T357910#9555301 (Framawiki) p:Triage→High
[15:28:38] Data-Engineering, Pageviews-API, Tool-Pageviews: No Pageviews data since 2024-02-17 - https://phabricator.wikimedia.org/T357910#9555331 (Framawiki) I don't know if it's just a temporary processing delay, or a breakage. But given the different user reports the same day, I prefer to fill a task.
[15:30:34] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466#9555350 (lbowmaker)
[15:30:36] Data-Engineering: [Dataset Config Store] Setup initial CI checks - https://phabricator.wikimedia.org/T357468#9555348 (lbowmaker)
[15:30:47] Data-Engineering: [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690#9555353 (lbowmaker)
[15:31:01] Data-Engineering: [Data Quality] decrease line width and point size in Airflow metrics dashboard - https://phabricator.wikimedia.org/T356359#9555357 (lbowmaker)
[15:31:04] Data-Engineering: [Data Quality] decrease line width and point size in Airflow metrics dashboard - https://phabricator.wikimedia.org/T356359#9504666 (lbowmaker)
[15:38:29] Data-Engineering, Data Products, Pageviews-API, Tool-Pageviews: No Pageviews data since 2024-02-17 - https://phabricator.wikimedia.org/T357910#9555397 (lbowmaker)
[15:39:30] Data-Engineering, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959#9555401 (Ottomata) @lbowmaker @gmodena Should we resolve and close this?
[15:40:18] Data-Platform-SRE: RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T357330#9555408 (Gehel)
[15:42:27] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Allow federated queries with the Iconclass sparql endpoint - https://phabricator.wikimedia.org/T357533#9555412 (Gehel) p:Triage→Medium
[15:43:53] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T357330#9555441 (Gehel) p:Triage→High
[15:44:38] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Decommission cloudelastic1001-1004 - https://phabricator.wikimedia.org/T357780#9555444 (Gehel) p:Triage→High
[15:45:41] Data-Engineering, Data Products: Modify ClickStreamBuilder pipeline to cope with pagelinks schema changes - https://phabricator.wikimedia.org/T355588#9555450 (lbowmaker)
[15:54:54] Data-Engineering, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959#9555480 (lbowmaker) Open→Resolved > Quoted Text >>! In T307959#9555401, @Ottomata wrote: > @lbowmaker @gmod...
[17:27:40] Data-Engineering, Data Products, Pageviews-API, Tool-Pageviews: No Pageviews data since 2024-02-17 - https://phabricator.wikimedia.org/T357910#9555814 (Sfaci) @BTullis and I have been working on this just a couple of hours ago. A DAG was stuck on Saturday because of a out-of-memory error. We fixe...
[19:09:03] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:09:04] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[20:09:04] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[20:09:04] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[22:09:07] Data-Engineering (Sprint 9): Turn off ReportUpdater jobs no longer used - https://phabricator.wikimedia.org/T357419#9556275 (lbowmaker)
[22:10:32] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9556276 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/27 Configure the super...
[22:11:34] Data-Engineering: [Maintenance] Migrate wmcs to Airflw - https://phabricator.wikimedia.org/T357938#9556277 (lbowmaker)
[22:11:53] Data-Engineering: [Maintenance] Migrate wmcs to Airflow - https://phabricator.wikimedia.org/T357938#9556289 (lbowmaker)
[23:08:55] (PS8) Snwachukwu: Add Dynamic Pivot job for reportupdater reports [analytics/refinery/source] - https://gerrit.wikimedia.org/r/995271 (https://phabricator.wikimedia.org/T354552)
[23:17:29] (CR) Snwachukwu: Add Reportupdater Browser All Sites Queries. (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/995740 (https://phabricator.wikimedia.org/T354552) (owner: Snwachukwu)
[23:18:13] (CR) Snwachukwu: Add Reportupdater Browser All Sites Queries. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/995740 (https://phabricator.wikimedia.org/T354552) (owner: Snwachukwu)