[01:45:41] (CR) Gergő Tisza: Add analytics/mediawiki/mentor_dashboard/personalized_praise (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117) (owner: Urbanecm)
[08:03:42] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (achou) Luca has added support for mediawiki.page_change in Lift Wing, so now we can use page_change as the source event. (see T...
[08:36:47] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[08:39:19] RECOVERY - Kerberos KDC daemon on krb2001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[08:45:16] Data-Engineering, Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (dcausse) Thanks for fixing the issue and dealing with failed sensors! :) Regarding what we could do to mitig...
[09:10:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:20:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:32:10] steve_munene, nfraison o/ puppet seems broken on some analytics nodes, can you check when you have a moment?
[09:32:34] you can find the list at the bottom of https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1
[09:33:44] elukey looking into it
[09:35:31] thanks!
[10:52:14] The puppet issues on stats* servers are due to scap failing to sync/deploy some apps (e.g. hdfs-tools). Scap relies on deploy1001.eqiad.wmnet to do the git fetch: there is a .config file in /srv/deployment/analytics/hdfs-tools/deploy-cache which defines git_server as deploy1001.eqiad.wmnet.
[10:52:14] This server seems not to have existed for a long time, but the file hasn't been changed since 2020. It looks like nothing has changed in those repos since 2020/2021, which could explain why it has never failed before.
[10:52:14] Do you know how this config file is managed (manually?)?
[11:23:57] nfraison: good find! I think it is probably created when the repo is first created
[11:24:22] or it may be in the scap config of the repo, but not sure if we set it
[11:25:01] yeah I guess it is generated after https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/hdfs-tools/deploy/+/refs/heads/master/scap/scap.cfg
[11:25:13] so git_server is probably added when creating the repo..
[11:25:48] let's ask in #sre to hashar (Antoine - Releng)
[12:10:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:20:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:24:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:32:20] Thanks elukey
[12:34:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:36:29] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (jbond)
[12:38:20] (PS4) Urbanecm: Add analytics/mediawiki/mentor_dashboard/personalized_praise [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117)
[12:38:23] (CR) Urbanecm: Add analytics/mediawiki/mentor_dashboard/personalized_praise (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117) (owner: Urbanecm)
[12:41:11] (PS5) Urbanecm: Add analytics/mediawiki/mentor_dashboard/personalized_praise [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117)
[12:50:44] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Upgrade Spark to 2.4.x - https://phabricator.wikimedia.org/T222253 (jbond)
[12:50:48] Analytics, Analytics-Kanban, Patch-For-Review: Rebuild spark2 for Debian Buster - https://phabricator.wikimedia.org/T229347 (jbond) Resolved→Open
[12:50:59] Analytics, Analytics-Kanban, Patch-For-Review: Rebuild spark2 for Debian Buster - https://phabricator.wikimedia.org/T229347 (jbond) Hi all i think there may be a new variant of this issue. an-test-worker1001 is now running bullseye which uses python3.9 ([[ https://github.com/wikimedia/operations-pup...
[13:34:19] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Updating .config file in /srv/deployment/analytics/hdfs-tools/deploy-cache in order to have git_server set to deploy1002.eqiad.wmnet instead of deploy1001.eqiad.wmnet This make the scap command r...
[13:35:48] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Here is the result of the debug puppet logs for one of those scap target ` Debug: Executing: '/usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD' Debug: scap pkg [ana...
[13:39:00] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Running the `/usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD` manually as the user owning the folder (analytics-deploy) works: `scap/sync/2020-02-28/0001` While r...
[13:39:13] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) a:nfraison
[13:43:28] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (hashar) >>! In T330394#8640720, @nfraison wrote: > Running the `/usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD` manually as the user owning the folder (analytics-deploy) wo...
[13:45:30] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Or we could ensure that the first call to get state is also run as the user owning the folder?
[13:51:24] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (hashar) The git security update for `safe.directory` is intended exactly for that use case. A deployer could inject in the git repository some hook (as the deployment user), then when Puppet runs git as roo...
[13:58:37] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) Please do! :)
[14:07:03] Data-Engineering, Data-Services: Wiki replicas are not fully setup for newly created wikis - https://phabricator.wikimedia.org/T315442 (MarcoAurelio) Hello @BTullis, some wikis are still missing from `meta_p` table: ` MariaDB [meta_p]> SELECT * FROM wiki WHERE dbname IN ('kcgwiki', 'guwwiki', 'shnwikivo...
[14:08:42] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) Per chatter #wikimedia-sre , the scap deployment failure on an-airflow1005 seems to be related to a git u...
[14:14:13] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (MoritzMuehlenhoff) >>! In T330394#8640754, @nfraison wrote: > Or we could ensure that the first call to get state is also run as the user owning the folder? Agreed, the best way to fix that for such a depl...
[14:30:54] Analytics-Clusters, Scap, Patch-For-Review: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (jbond) >>! In T330394#8640720, @nfraison wrote: > Running the `/usr/bin/git -C /srv/deployment/analytics/hdfs-tools/deploy tag --points-at HEAD` manually as the user owning the folder...
[14:33:35] Data-Engineering, Data Pipelines (Sprint 09): Differential privacy airflow-dags merge request - https://phabricator.wikimedia.org/T330234 (Aklapper)
[14:33:52] Analytics-Clusters, Scap, Patch-For-Review: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) seems we mostly end up with same patch @jbond https://gerrit.wikimedia.org/r/891555 / https://gerrit.wikimedia.org/r/891557 I would be interested in reading/seeing how you d...
[14:34:39] Data-Engineering, Project-Admins, PM: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) Could someone please answer the previous comment, so I could finish this task? Thanks a lot in advance!
[14:35:47] (PS31) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[14:49:03] Analytics-Clusters, Scap, Patch-For-Review: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (jbond) >>! In T330394#8640954, @nfraison wrote: > seems we mostly end up with same patch @jbond https://gerrit.wikimedia.org/r/891555 / https://gerrit.wikimedia.org/r/891557 > > I wou...
[14:57:57] Analytics-Clusters, Scap, Patch-For-Review: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Seems to work fine Before trying to redeploy analytics/hdfs-tools/deploy ` Info: Unable to serialize catalog to json, retrying with pson Info: Applying configuration version...
[15:02:04] Data-Engineering, Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (Ottomata) > Perhaps the solution to creating these "empty" partitions could be done without relying on canar...
[15:04:51] (PS32) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[15:15:11] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Isaac) > Is it ok if I arrange a meeting (maybe next week?), including the ML team, Isaac and Ottomata to discuss the source ev...
[15:30:59] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (nfraison) Open→Resolved
[15:34:47] (CR) Aqu: "Thanks!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[15:58:18] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (Ottomata) @JMeybohm @akosiaris, we plan to deploy to wikikube by the end of this quarter (end of March). Are there any blockers i...
[16:08:36] Data-Engineering, Event-Platform Value Stream, Product-Analytics: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Dbrant)
[17:15:45] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (hashar) Very nice fix @nfraison thank you!
[19:15:23] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (Dzahn) Is T330360 a duplicate of this?
[19:18:54] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (Dzahn) Looks like it is. merging in as duplicate. also see T326668
[19:19:20] Analytics-Clusters, Scap: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394 (Dzahn)
[19:21:07] Starting build #19 for job wikimedia-event-utilities-maven-release-docker
[19:24:33] Project wikimedia-event-utilities-maven-release-docker build #19: SUCCESS in 3 min 26 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/19/
[19:37:45] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (Ottomata) Merged and released: https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities-flink/1.2.5/ I just...
[20:21:45] Data-Engineering, Codex, Design-Systems-Team, DiscussionTools, and 6 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[20:23:58] Data-Engineering, Codex, Design-Systems-Team, DiscussionTools, and 7 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[20:26:58] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09), Patch-For-Review: [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (gmodena) @lbowmaker @Ottomata @dcausse I documented today's application restart discussion at https://www....
[20:39:59] Data-Engineering: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (mpopov)
[20:43:32] (PS1) Jennifer Ebe: [WIP] Create_mediacounts_archive_hql [analytics/refinery] - https://gerrit.wikimedia.org/r/891621
[20:43:33] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [analytics/refinery] - https://gerrit.wikimedia.org/r/891621 (owner: Jennifer Ebe)
[21:05:51] Data-Engineering, Event-Platform Value Stream: Flink EventStreamCatalog should add watermark - https://phabricator.wikimedia.org/T330441 (Ottomata)
[21:51:29] Data-Engineering, Abstract Wikipedia team, Codex, Design-Systems-Team, and 8 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[22:57:57] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow - https://phabricator.wikimedia.org/T318414 (EBernhardson)
[22:57:59] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate rdf_streaming_updater_reconcile.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329879 (EBernhardson) duplicate→Open
[22:58:36] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate rdf_streaming_updater_reconcile.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329879 (EBernhardson) re-opening. Considering the Migrate RDF Tooling task to be about migrating the code and releasi...
[23:00:38] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate mediawiki_revision_recommendation_create.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T330447 (EBernhardson)
[23:06:38] Data-Engineering, Abstract Wikipedia team, Codex, Design-Systems-Team, and 8 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[23:18:30] Data-Engineering, Abstract Wikipedia team, Codex, Design-Systems-Team, and 9 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy) Open→In progress
[23:23:23] Data-Engineering, Abstract Wikipedia team, Design-Systems-Team, DiscussionTools, and 8 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[23:26:01] Data-Engineering, Abstract Wikipedia team, DiscussionTools, Growth-Team, and 7 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)
[23:28:03] Data-Engineering, Abstract Wikipedia team, DiscussionTools, Growth-Team, and 7 others: Update existing foreign-resources.yaml files to add extra fields - https://phabricator.wikimedia.org/T330432 (Reedy)