[00:21:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[00:26:02] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9565914 (EBernhardson) On further review, simply documenting the various commands to run seemed error pron...
[01:15:30] Data-Engineering, MediaWiki-extensions-EventLogging, Patch-For-Review: Migrate EventLogging to JSDoc - https://phabricator.wikimedia.org/T357444#9566080 (apaskulin)
[09:05:53] we've gotten quite a lot of SLA miss email alerts. Is that on our radar?
[09:11:53] Hi brouberol - it is very much on my radar - I'll send slack threads about this
[09:12:05] sounds good
[09:29:09] Data-Engineering, Pageviews-API: Missed pageview data over API - https://phabricator.wikimedia.org/T358132#9566672 (Dusan_Krehel) Fixed.
[09:29:52] Data-Engineering, Pageviews-API: Missed pageview data over API - https://phabricator.wikimedia.org/T358132#9566673 (Dusan_Krehel) Open→Resolved
[09:34:54] Data-Platform-SRE, Dumps-Generation: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787#9566687 (BTullis) @ArielGlenn - Please could you have a quick look to help confirm whether or not these servers are still running any production tasks, or whether they are now ready for decommissionsi...
[09:41:06] Data-Platform-SRE, Dumps-Generation: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787#9566695 (ArielGlenn) 1 and 2 are both role dumps::generation::server::spare and have been so since at least last July.
See https://gerrit.wikimedia.org/r/c/operations/puppet/+/936379 and https://gerri...
[10:32:14] Data-Platform-SRE, Data-Platform: [Presto] Use JWT authentication instead of Kerberos for cluster-internal communication - https://phabricator.wikimedia.org/T358196#9566883 (BTullis)
[10:52:47] Data-Platform-SRE, Dumps-Generation: Decom dumpsdata100[1-2] - https://phabricator.wikimedia.org/T353787#9566987 (BTullis) Many thanks, that's great.
[11:02:41] Data-Engineering, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): [superset-k8s] Find a solution for the requestctl-generator html page - https://phabricator.wikimedia.org/T356490#9567026 (BTullis) I'm moving this into the current milestone, because we're effectively working on it together as part of {T357...
[11:36:43] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Spark history server lags behind and some tasks are not indexed in time - https://phabricator.wikimedia.org/T358206#9567181 (brouberol)
[11:52:52] !log redeploying the spark-history server with expanded egress rules for hadoop workers - T358206
[11:52:52] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Spark history server lags behind and some tasks are not indexed in time - https://phabricator.wikimedia.org/T358206#9567255 (brouberol) p:Triage→High
[11:52:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:52:54] T358206: Spark history server lags behind and some tasks are not indexed in time - https://phabricator.wikimedia.org/T358206
[12:02:44] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Spark history server lags behind and some tasks are not indexed in time - https://phabricator.wikimedia.org/T358206#9567267 (brouberol) Open→Resolved The spark history server is now catching up on its lag after a redeploy. No more tr...
[12:04:06] Data-Engineering: Remove wikidata from this historical dumps process - https://phabricator.wikimedia.org/T357438#9567290 (lbowmaker) Open→Declined Dupe of: https://phabricator.wikimedia.org/T357859
[12:04:08] Data-Engineering (Sprint 9), Data Products, Movement-Insights: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9553374 (lbowmaker)
[12:08:10] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567303 (BTullis) @JAllemandou replied to his copy of the original notification email at 10:08 AM UTC, mentioning the delay. > The job has b...
[12:15:20] Data-Engineering, Epic: Dataset Config Store - https://phabricator.wikimedia.org/T354557#9567310 (JAllemandou) > Have we looked around to see if there are existing 'dataset' config formats/specs we can already use? I have not investigated this road at all. https://datacontract.com/ is cool, I think we n...
[12:15:58] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567312 (BTullis) Looking at the headers of that original notification email would appear to match with a mailman problem on listss1001. {F4...
[12:18:26] Data-Engineering (Sprint 9): Delete reportupdater jobs data - https://phabricator.wikimedia.org/T358210#9567314 (JAllemandou)
[12:20:11] btullis or brouberol, would one of you merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005565 please?
[12:20:26] joal: Looking now.
[12:20:29] We are absenting reportupdater jobs that are not needed
[12:21:06] Done.
[12:22:08] ...and puppet running no on an-launcher1002
[12:22:11] now
[12:23:11] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567333 (BTullis) This tallies with a measurement of the mail queue on the mailman server. From [[https://grafana.wikimedia.org/d/GvuAmuuGk/...
[12:30:26] Data-Platform-SRE, SRE, SRE-Access-Requests, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567347 (MoritzMuehlenhoff) @AndrewTavis_WMDE Thanks! This is a long task and to make things explicit: Is the summary below...
[12:30:30] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567348 (BTullis) We have something of an issue in that the mailman3 server is considered an //unowned// service and support tends to be on...
[12:40:14] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567126 (BTullis)
[12:40:16] Data-Platform-SRE, observability, Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438#9567372 (BTullis)
[12:42:04] Data-Platform-SRE, SRE, SRE-Access-Requests, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567384 (AndrewTavis_WMDE) Thanks for checking in @MoritzMuehlenhoff! A correction to one of your points: - Membership in a...
[12:52:56] thanks a lot btullis - I confirm timers are gone from an-launcher1002
[12:53:44] btullis: Is it ok if we wait a few weeks before removing the code in puppet?
We plan on deleting the jobs data in 3 weeks, it would be nice to delete the puppet code at the same time - ok on your side?
[12:54:45] Data-Engineering (Sprint 9): Delete reportupdater jobs data/puppet-code - https://phabricator.wikimedia.org/T358210#9567413 (JAllemandou)
[12:56:14] Data-Platform-SRE, SRE, SRE-Access-Requests, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567420 (MoritzMuehlenhoff) >>! In T356279#9567384, @AndrewTavis_WMDE wrote: > Thanks for checking in @MoritzMuehlenhoff! A...
[12:58:29] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567427 (AndrewTavis_WMDE) Thank you for the help with this, @MoritzMuehlenhoff! Please also add in @M...
[13:01:22] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567448 (JAllemandou) Thank you for the thorough investigation @BTullis ! This example gives us more traction on the need to move toward gog...
[13:16:38] Data-Platform-SRE, SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279#9567502 (MoritzMuehlenhoff) Stalled→Open a:MoritzMuehlenhoff
[13:56:14] Data-Engineering, Data-Platform-SRE: [Iceberg Migration] P.O.C. on Iceberg sensor using Postgres table to keep status of updates - https://phabricator.wikimedia.org/T340466#9567681 (lbowmaker) @xcollazo - yes we will do it under this ticket.
[14:00:53] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:01:30] Data-Engineering, Data Products: NEW BUG REPORT - Pageviews Missing Hourly Partition - https://phabricator.wikimedia.org/T358142#9567699 (lbowmaker)
[14:01:39] Data-Platform-SRE, Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9567700 (lbowmaker)
[14:11:44] Data-Engineering, Data Products: Method for per-file cumulative total in mediarequests API - https://phabricator.wikimedia.org/T343947#9567733 (lbowmaker)
[14:18:25] Data-Engineering, Data-Engineering-Wikistats, Data Products: no view data by country for the last month (June 2023) - https://phabricator.wikimedia.org/T341523#9567750 (lbowmaker) Open→Resolved a:lbowmaker There could have been a delay in processing. Looks ok now. https://stats.wikimedia...
[14:21:07] Data-Engineering, Data Pipelines, Data Products: geoeditors public version is not available for non-Wikipedia projects - https://phabricator.wikimedia.org/T317040#9567777 (lbowmaker)
[14:27:59] Data-Engineering, Data Pipelines, Datasets-General-or-Unknown, Wikimedia Enterprise: Missing NS0 dumps in 20230420 and 20230501 and 20230520 - https://phabricator.wikimedia.org/T335887#9567799 (lbowmaker)
[14:32:35] Data-Engineering, Data Products: project-title-country missing US data in recent data, and double quote escaping - https://phabricator.wikimedia.org/T341139#9567857 (lbowmaker)
[14:32:42] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Patch-For-Review: Serve Superset static assets from an optimised container - https://phabricator.wikimedia.org/T357890#9567839 (CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/superset/-/merge_requests/29 Run the superset-fr...
[14:34:26] Data-Engineering, Movement-Insights: [Data Quality] Implement basic data quality metrics for Unique Devices datasets - https://phabricator.wikimedia.org/T357833#9567862 (lbowmaker) I moved this ticket to our Data Quality column which we will review and prioritize based on the KR’s for next year. I think...
[15:14:43] Data-Engineering, GitLab (CI & Job Runners), Performance Issue: Improve speed of Gitlab CI - https://phabricator.wikimedia.org/T311111#9568057 (Antoine_Quhen) Open→Resolved a:Antoine_Quhen Done. The last version was done with Blubber: * https://gitlab.wikimedia.org/repos/data-engineering/...
[16:00:53] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1003:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1003:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[16:14:19] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Data-Platform: Investigate late/delayed Airflow task failure notifications - https://phabricator.wikimedia.org/T358205#9568443 (BTullis)
[16:46:52] Data-Engineering, Data-Platform-SRE, Data-Persistence: Migrate dbstore* hosts to 10.6 - https://phabricator.wikimedia.org/T356961#9568674 (BTullis)
[17:44:43] Data-Engineering, Cassandra, Structured Data Engineering, Structured-Data-Backlog: image suggestions DAG should not use aqsloader Cassandra role - https://phabricator.wikimedia.org/T356446#9569129 (Eevans) p:Triage→High
[17:45:30] Data-Engineering, Cassandra, Data Pipelines: Create puppet defined type for adding/updating/deleting secrets or other small files on HDFS - https://phabricator.wikimedia.org/T323692#9569144 (Eevans) p:Triage→High
[17:47:59] Data-Platform-SRE, Data-Platform: [Presto] Use JWT authentication instead of Kerberos for cluster-internal communication - https://phabricator.wikimedia.org/T358196#9569160 (JAllemandou) I don't think this change would really affect queries performance, but I'm in favor of doing it for the benefit of rel...
[17:48:11] Data-Engineering, Cassandra, Data Pipelines: Encrypt Spark-Cassandra connection - https://phabricator.wikimedia.org/T310820#9569163 (Eevans) p:Triage→Medium
[17:59:39] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation, Patch-For-Review: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9569236 (EBernhardson) > Example query of the rest api (could be nicer if we install...
[18:21:03] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 10), Technical-Debt: Fix public documentation for mw.eventLog.submit() and dispatch() - https://phabricator.wikimedia.org/T357003#9569346 (apaskulin) I've opened a patch for {T35...
[18:29:01] (PS14) Joal: Extract RefineSingleApp code from Refine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/1003745 (https://phabricator.wikimedia.org/T356363)
[19:20:16] (PS3) Kimberly Sarabia: Adds new field to webA11y schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1005201 (https://phabricator.wikimedia.org/T356335)
[19:26:15] Data-Platform-SRE: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9569579 (odimitrijevic)
[20:00:46] Data-Platform-SRE, SRE: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9569715 (Gehel) p:Triage→High
[20:21:50] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search, Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685#9569917 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-objec...
[20:22:18] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search, Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685#9569920 (CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-objec...
[21:06:28] Data-Platform-SRE, SRE: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9570086 (Dwisehaupt) We are tracking this from the fr-tech side in T358043. No impact on your work, just adding for full knowledge.
[21:12:43] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T357330#9570108 (bking) I've cleaned up the objects and [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/1005791 | increased the alert threshold to 100Gb ]] . Closing this, but work o...
[21:12:57] Data-Platform-SRE ( 2024.02.12 - 2024.03.03): RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T357330#9570110 (bking) Open→Resolved a:bking
[21:28:14] Data-Engineering (Sprint 10), AQS2.0: Review (and fix/remove?) the pipeline in the AQS 2.0 QA test suite repository - https://phabricator.wikimedia.org/T355508#9570142 (EChukwukere-WMF)
[21:30:23] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search, Patch-For-Review: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685#9570154 (CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-objec...
[21:31:15] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685#9570152 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requ...
[21:43:53] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685#9570213 (bking) Open→Resolved The script is good enough to run as a one-off job. We may want to have it run automatically one...
[22:09:49] (CR) Bernard Wang: Adds new field to webA11y schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1005201 (https://phabricator.wikimedia.org/T356335) (owner: Kimberly Sarabia)
[22:10:25] (Merged) jenkins-bot: Adds new field to webA11y schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/1005201 (https://phabricator.wikimedia.org/T356335) (owner: Kimberly Sarabia)
[22:19:30] Data-Engineering, Canonical-Data, Movement-Insights: Automate the loading of canonical data tables to the Data Lake - https://phabricator.wikimedia.org/T339928#9570358 (nshahquinn-wmf)
[23:13:42] (PS2) Jeena Huneidi: Drop vestiges of git-fat [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[23:13:42] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911#9570546 (nshahquinn-wmf) Resolved→Open The most recent run of this job (which finished today) still had a retry. @JAllemandou and @xcollazo have been discussing this. @xcol...
[23:17:40] (CR) Thcipriani: [C: +1] "Today Jeena noticed the same thing Chad noticed about this repo a year ago: hdfs-tools doesn't look to use any large file storage anymore."
[analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[23:22:05] Data-Engineering, Data Products, Movement-Insights: Mediawiki_wikitext_history job often has long gaps between stages - https://phabricator.wikimedia.org/T357873#9570561 (nshahquinn-wmf) Thank you, @Antoine_Quhen! At a meeting yesterday, we noted the following: * We want the job not to retry the c...
[23:22:23] Data-Engineering, Data Products, Movement-Insights: Mediawiki_wikitext_history job often has long gaps between stages - https://phabricator.wikimedia.org/T357873#9570564 (nshahquinn-wmf)
[23:26:13] Data-Engineering: Remove wikidata from this historical dumps process - https://phabricator.wikimedia.org/T357438#9570583 (nshahquinn-wmf)
[23:26:16] Data-Engineering (Sprint 9), Data Products, Movement-Insights: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9570585 (nshahquinn-wmf)
[23:26:37] Data-Engineering (Sprint 9), Data Products, Movement-Insights, Movement-Metrics: Skip Wikidata when loading XML dumps to the Data Lake - https://phabricator.wikimedia.org/T357859#9553374 (nshahquinn-wmf)
[23:31:17] Data-Platform-SRE ( 2024.02.12 - 2024.03.03), Discovery-Search (Current work), Documentation, Patch-For-Review: Review wikitech:Search and write processes for k8s world - https://phabricator.wikimedia.org/T356303#9570602 (EBernhardson) To review the documentation changes (there are also two revis...