[00:36:53] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge Ks-Arab and Ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (srishakatux) @Nikerabbit I made a few more minor changes to the related patches based on your comment and review from @Winston_Su...
[03:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[06:05:00] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (Joe) >>! In T341625#9091018, @EBernhardson wrote: >>>! In T341625#9086139, @Joe wrote: >> I am uneasy suggesti...
[07:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[07:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[08:39:31] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Winston_Sung)
[08:47:51] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Winston_Sung) For renaming translation subpages, we need to solve the caching issues that appeared after using ReplaceText. It i...
[09:23:37] Data-Platform-SRE, Data-Persistence, SRE-swift-storage, Discovery-Search (Current work): Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (MatthewVernon) Open→Resolved a:MatthewVernon This is done now.
[09:23:43] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (MatthewVernon)
[10:47:25] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (JAllemandou) >>! In T342416#9091146, @EBernhardson wrote: > I looked into these, the attached p...
[10:49:20] Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I've now created a 'WMF Analyst' role in both the production and test instances of Superset. {F37566109,width=50%} I did it by copying the Alpha role from the UI and then addin...
[10:58:01] Data-Platform-SRE, Patch-For-Review: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (BTullis) I've created a patch to change the default user role to 'WMF Analyst' and I'll aim to merge this next week and switch all existing 'Alpha' users to 'WMF A...
[11:00:58] Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) a:BTullis
[11:13:10] (CR) Joal: [V: +2 C: +2] "I verified tables are present in analytics-prod replicas and in labs-replicas. Merging for next deploy (before first of month)" [analytics/refinery] - https://gerrit.wikimedia.org/r/949539 (https://phabricator.wikimedia.org/T344356) (owner: MNeisler)
[11:33:07] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (JAllemandou) +1 to add compression to aggregated application logs! We changed the log-retention with Nicolas Fraison when he was here. Our idea was that normally l...
[11:43:20] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (BTullis) >>! In T342923#9101565, @JAllemandou wrote: > +1 to add compression to aggregated application logs! > > We changed the log-retention with Nicolas Fraison...
[11:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:57:35] Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (lbowmaker)
[12:11:42] (CR) Gmodena: "Catching up with this MR / phab, so apologies if I'm missing something." [schemas/event/primary] - https://gerrit.wikimedia.org/r/854572 (https://phabricator.wikimedia.org/T317609) (owner: DCausse)
[13:13:44] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (pfischer) @Joe, thank you for your feedback! > I'm not sure I fully grasp the model, do you have any diagram...
[13:19:45] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (xcollazo) > So how about reverting to 40 days' worth of logs, but enabling lzo compression? Would that be a good compromise, or should we go right to the 90 days re...
[13:21:16] Data-Platform-SRE: [opsweek] Bump Yarn logs retention period to support debugging long running jobs - https://phabricator.wikimedia.org/T342923 (JAllemandou) Ok for me to keep 60 days - I think this will seldom be used, but eh, we have a counter example :) In terms of compression, I'd use `gzip` instead of `...
[13:24:12] btullis: would you be up to trying some presto/alluxio magic?
[13:24:25] Oh sorry - Hi btullis :)
[13:25:12] joal: you know me, I'm up for anything :-)
[13:25:19] (within reason)
[13:26:18] Batcave, or just in IRC?
[13:26:35] as you prefer btullis :)
[13:26:44] let's talk, it makes quite some time :)
[13:37:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:42] (SystemdUnitFailed) firing: presto-server.service Failed on an-test-presto1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:42] (SystemdUnitFailed) resolved: presto-server.service Failed on an-test-presto1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:25:57] (SystemdUnitFailed) firing: presto-server.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:12] (SystemdUnitFailed) firing: (2) presto-server.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:27:17] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (BTullis)
[14:30:57] (SystemdUnitFailed) resolved: (2) presto-server.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:37] Data-Engineering, Data-Engineering-Wikistats, Data Products, Data Pipelines (Sprint 12): Non-mobile UAs on mobile (2g/gprs, etc) IP-blocks - https://phabricator.wikimedia.org/T58628 (MarkAHershberger) > is it fair to assume, give this unfortunate time lag of a decade, that this request no longer...
[14:40:42] Data-Platform-SRE: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (BTullis) a:nfraison→None
[14:55:16] Data-Engineering, Data-Platform-SRE, Patch-For-Review: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (BTullis) @JAllemandou and I got this working! We substituted the alluxio-shaded-client jar from our presto-server package with version 2.9.3 from presto-server v...
[15:05:05] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (EBernhardson) >>! In T342416#9101474, @JAllemandou wrote: >>>! In T342416#9091146, @EBernhardso...
[15:14:48] (PS1) Phuedx: Remove Echo* sanitisation allowlist entries [analytics/refinery] - https://gerrit.wikimedia.org/r/950183 (https://phabricator.wikimedia.org/T344167)
[15:15:59] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 2 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx)
[15:32:47] Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Notifications, and 2 others: Decommission the EchoMail and EchoInteraction instruments - https://phabricator.wikimedia.org/T344167 (phuedx) @Mglaser @Osnard @ItSpiderman: You were added as reviewers for https://gerrit.wikimedia.org...
[15:37:07] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (xcollazo) Coming back here to report. TLDR: Spark 3.3+ solves my small files problem with MERGE INTO. Longer story over at T340861#9101939 and T340...
[15:47:52] Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (xcollazo) Given the debugging steps at T340861, I believe that I could unblock myself by building a custom conda environment with Spark 3.3 or 3.4 and...
[15:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[16:16:54] Data-Engineering, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Set data permission on new snapshot generation (discovery.wikibase_rdf) - https://phabricator.wikimedia.org/T342416 (JAllemandou) >>! In T342416#9101868, @EBernhardson wrote: > These are both generated by spark....
[16:30:52] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
[16:36:28] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors: - wdqs1010 (**FAIL**) - Removed from Pupp...
[17:13:54] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk2002.codfw.wmnet` - flink-zk200...
[17:17:38] (PS1) Btullis: Use sudo with git in refinery_deploy_to_hdfs [analytics/refinery] - https://gerrit.wikimedia.org/r/950195 (https://phabricator.wikimedia.org/T334493)
[17:23:28] Data-Platform-SRE, Patch-For-Review: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (BTullis) It seems to me that the simplest solution is to use `sudo` to run the git commands as the `analytics-deploy` user. The only git command being executed...
[17:43:20] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking) To check alerting, I removed suppressions and shut off flink-zk1001 via the ganeti master. I saw flink-zk1001 turn red in Ici...
[19:38:04] Analytics-Radar, Data-Engineering, MediaWiki-extensions-EventLogging, Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (Krinkle)
[19:39:13] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Jhancock.wm)
[19:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:53:41] Data-Engineering, All-and-every-Wikisource, ArticlePlaceholder, BetaFeatures, and 56 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Krinkle)
[21:17:17] Data-Platform-SRE, DC-Ops: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (bking)
[21:17:56] Data-Platform-SRE, DC-Ops: wdqs1010 unreachable from SSH or DRAC - https://phabricator.wikimedia.org/T344518 (bking)
[21:17:58] Data-Platform-SRE: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 (bking)
[21:32:22] Analytics-Radar, Data-Engineering-Icebox, NavigationTiming, Wikimedia-Performance-recommendation: Release performance data on a regular schedule - https://phabricator.wikimedia.org/T205342 (Krinkle)
[21:54:21] (PS1) Clare Ming: Add Metrics Platform fragments by entity, platform [schemas/event/secondary] - https://gerrit.wikimedia.org/r/950208 (https://phabricator.wikimedia.org/T343557)
[23:34:49] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (EBernhardson) >>! In T341625#9100951, @Joe wrote: > I have different numbers from [[ https://grafana.wikimedia...
[23:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability