[00:14:55] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (BTullis) @bking - I had an idea, but I'm not sure whether or not it will work. I looked at the API spec for blazegraph here: https://github.c...
[00:36:00] Data-Engineering, Event-Platform: Implement a new notification only revision-visibility-change stream - https://phabricator.wikimedia.org/T351565 (Ottomata) [[ https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/includes/EventBusHooks.php#L245 | Here is the code ]] currently creating a...
[01:16:48] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.17% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:16:49] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 3.002% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:26:01] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (ayounsi) I think since https://gerrit.wikimedia.org/r/c/operations/puppet/+/982846 got merged, netflow1002 is faili...
[08:26:24] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (brouberol) Open→Resolved
[08:26:27] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[08:26:29] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[08:26:31] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Define the spark history service helmfile deployments - https://phabricator.wikimedia.org/T352860 (brouberol) Open→Resolved
[08:26:51] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:08:26] Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel)
[09:08:33] Data-Platform-SRE (2023/24 Q2 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[09:16:49] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 2.868% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:20:21] Data-Engineering, CirrusSearch, Image-Suggestions, Structured-Data-Backlog, Discovery-Search (Current work): Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 - https://phabricator.wikimedia.org/T353134 (Gehel...
[09:27:08] Data-Engineering (Sprint 5), Data-Platform-SRE (2023/24 Q2 Milestone 1): Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol)
[10:02:46] Data-Engineering, Data-Platform-SRE: Deprecate Hue and stop the services - https://phabricator.wikimedia.org/T341895 (Gehel)
[10:03:59] Data-Engineering, Data-Platform-SRE, SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (Gehel)
[10:04:48] Data-Engineering, Data-Platform-SRE, Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (Gehel)
[10:05:10] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (Gehel)
[10:06:55] Data-Engineering (Sprint 6), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (Gehel)
[10:07:28] Data-Engineering (Sprint 6), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (Gehel) p:Triage→Medium
[10:07:30] Data-Engineering (Sprint 6), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (Gehel) p:Medium→Low
[10:08:17] Data-Platform-SRE, Release-Engineering-Team, Scap: "scap deploy"'s config-deploy should check for broken symlinks - https://phabricator.wikimedia.org/T342162 (Gehel)
[10:08:59] Data-Platform-SRE: Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (Gehel)
[10:09:34] Data-Platform-SRE: Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (Gehel) p:High→Low
[10:09:47] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (Gehel)
[10:10:24] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (Gehel) p:High→Medium
[10:43:36] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol) a:brouberol
[12:10:34] Data-Engineering (Sprint 5), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) We have deleted the old spark-history keytabs and principals from...
[12:30:49] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[12:31:19] Data-Engineering (Sprint 5), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (brouberol) Open→Resolved The new keytabs have been added to the priva...
[12:31:46] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[12:55:10] !log deploying spark-history-analytics-test-hadoop.spark-history-test.dse-k8s-eqiad.wmnet - T351816
[12:55:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:55:13] T351816: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816
[12:57:06] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol)
[13:16:43] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis) We have decided to make the `/var/log/spark` directory on HDFS owned by the spark user, rather than hdfs....
[13:16:52] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 2.751% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:19:24] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis)
[13:19:36] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis)
[13:38:44] Data-Engineering, Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (Ottomata) AH! It does work! Just via ADD COLUMN instead of CHANGE COLUMN! https://github.com/apache/spark/pull/21012#issuecomment-1857893125
[13:53:54] !log deploying spark-history-analytics-hadoop.spark-history.dse-k8s-eqiad.wmnet - T351816
[13:53:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:53:59] T351816: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816
[14:18:10] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol) We've deployed both services, but for now, the endpoint returns a 503 at the ingress gateway level, as istio didn't seem to reconfigure the in...
[14:28:17] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) I've one-offed `wdqs2010` and I'm now testing whether or not the change above will affect requests that already have an `Accept` heade...
[14:48:58] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol) Now that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/983422 is merged, the services can be requested as expected: ` broub...
[14:49:27] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Deploy the spark history services - https://phabricator.wikimedia.org/T352861 (brouberol) Open→Resolved
[14:49:29] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[14:50:13] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[14:50:46] Data-Engineering, Data Products, Structured-Data-Backlog: DagProperties don't automatically update Airflow variables - https://phabricator.wikimedia.org/T348963 (xcollazo) > If an incoming Airflow DAGs revision changes a DAG's DagProperties, then the corresponding Airflow variable should also be upda...
[14:52:42] Data-Engineering, Data Products, Structured-Data-Backlog: DagProperties don't automatically update Airflow variables - https://phabricator.wikimedia.org/T348963 (xcollazo) > if the incoming DagProperties no longer has a given property, then the DAG can't be parsed because the given property can't be...
[15:06:53] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (BTullis)
[15:07:01] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Configure appropriate permissions for the /var/log/spark HDFS directory - https://phabricator.wikimedia.org/T352838 (BTullis) Open→Resolved I have also updated this on the prod cluster. ` btullis@an-coord1001:~$ hdfs dfs -ls...
[16:32:30] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/40 Fix a problem...
[16:43:09] Data-Engineering, Tech-Docs-Team, Goal: Redesign Data Platform docs on Wikitech - https://phabricator.wikimedia.org/T350911 (TBurmeister) Status update: * Olja created the next iteration of the information architecture draft ([[ https://docs.google.com/document/d/1vSy-D7dMqws-0aWjub6mObTT-4ynNeclV7n9...
[17:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:16:52] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 2.731% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:38:42] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Patch-For-Review: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (dcausse)
[20:31:25] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (nshahquinn-wmf) Thanks for working on this while I was on leave, @BTullis! If I'm understanding correctly, after [gerrit 709713](ht...
[20:43:42] Data-Engineering: Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (xcollazo)
[20:44:33] Data-Engineering: Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (xcollazo) @Htriedman likely hit this on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/559. Will investigate as part of OpsWeek.
[20:50:23] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (xcollazo)
[21:13:42] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:18:26] (DiskSpace) firing: Disk space an-test-worker1001:9100:/ 2.556% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-test-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[21:38:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:54:39] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 (bking) >>! In T347355#9408219, @BTullis wrote: > @bking - I had an idea, but I'm not sure whether or not it will work. > I looked at the API...
[22:38:01] Data-Platform-SRE: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (RKemper)
[22:39:04] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (matmarex) As discovered above, Wikibase (and CommonsMetadata too) actually use the ContentHandler::ge...
[22:40:05] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (matmarex)