[00:05:42] Data-Engineering-Planning, Data Pipelines: Provide aggregated user device data per-country - https://phabricator.wikimedia.org/T325306 (Dreamy_Jazz) If it makes it easier, this data can be gleaned from cu_changes, cu_log_event and cu_private_event. Data is stored for 3 months. If longer is needed then th...
[00:36:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:30] Data-Engineering: Check home/HDFS leftovers of cmacholan - https://phabricator.wikimedia.org/T330121 (MoritzMuehlenhoff)
[08:57:04] (CR) DCausse: "looks great, nice cleanup!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[09:33:44] !log adding last batch of 5 nodes to the presto prod cluster
[09:33:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:39:41] (CR) Gehel: "A few comments inline. Mostly looks good, the comments are fairly minor." (6 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[10:47:55] (PS29) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[10:49:42] (CR) Aqu: "Thanks DCausse." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[10:51:04] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqia...
[12:28:25] Data-Engineering-Planning, Data Pipelines (Sprint 09), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.5.0 - https://phabricator.wikimedia.org/T315580 (Stevemunene) Puppet code updated to provide airflo...
[12:41:44] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye c...
[12:49:06] !log Reimage an-test-presto1002 to upgrade to bullseye T329361
[12:49:06] T329361: Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361
[12:50:54] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye
[12:52:01] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison)
[13:19:33] Data-Engineering-Planning, Shared-Data-Infrastructure: Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (JArguello-WMF)
[13:21:55] Data-Engineering-Planning: Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (JArguello-WMF)
[13:22:22] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (JArguello-WMF)
[13:22:34] Data-Engineering-Planning, Shared-Data-Infrastructure: Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (JArguello-WMF)
[13:24:19] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Deploy ceph mon processes to data-engineering cluster - https://phabricator.wikimedia.org/T330149 (JArguello-WMF)
[13:24:21] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1002.eqiad.wmnet with OS bullseye e...
[13:28:27] Data-Engineering-Planning, Shared-Data-Infrastructure, Epic: Install Ceph Cluster for Data Engineering - https://phabricator.wikimedia.org/T324660 (JArguello-WMF)
[13:30:51] Data-Engineering-Planning, Shared-Data-Infrastructure: Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF)
[13:31:14] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (JArguello-WMF)
[13:32:41] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison)
[13:33:21] Data-Engineering-Planning, Shared-Data-Infrastructure: Configure load-balancing approriate for ceph radosgw services on the data-engineering cluster - https://phabricator.wikimedia.org/T330153 (JArguello-WMF)
[13:36:29] Data-Engineering, Data Pipelines: Check if new airflow package is usable on both buster and bullseye - https://phabricator.wikimedia.org/T330154 (Stevemunene)
[13:37:46] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (JArguello-WMF)
[13:40:10] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (JArguello-WMF) we'll defer this task to sprint 10
[13:43:38] Data-Engineering-Planning, Data-Catalog, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Datahub user records are not being created after login - https://phabricator.wikimedia.org/T327884 (JArguello-WMF)
[13:45:09] Data-Engineering, Data-Catalog, Infrastructure-Foundations, Shared-Data-Infrastructure, CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (JArguello-WMF)
[13:47:23] Data-Engineering, Data-Catalog, Infrastructure-Foundations, Shared-Data-Infrastructure, CAS-SSO: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (JArguello-WMF) Switching DataHub to OIDC authentication (T305874) is a big job, so we'll schedule it for the next...
[13:48:28] Data-Engineering-Planning, Observability-Alerting, SRE, Traffic, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Reduce/eliminate false positives for VarnishKafkaNoMessages alert - https://phabricator.wikimedia.org/T324522 (JArguello-WMF) Open→Resolved
[13:48:38] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (JArguello-WMF)
[13:48:49] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (JArguello-WMF) Open→Resolved
[14:20:19] Data-Engineering, Shared-Data-Infrastructure, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (JArguello-WMF)
[14:27:22] ACKNOWLEDGEMENT - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service,wmf_auto_restart_airflow-webserver@search.service Brian_King T327970 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:22] ACKNOWLEDGEMENT - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:27:22] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Brian_King T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[14:29:32] Data-Engineering, Shared-Data-Infrastructure, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (nfraison) Added to previous sprint because we were failing to add nodes and to find root cause for adding them. As the root cause has been identified without it + the...
[15:06:42] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) p:Triage→Medium
[15:07:03] Data-Engineering-Planning: Check home/HDFS leftovers of cmacholan - https://phabricator.wikimedia.org/T330121 (lbowmaker)
[15:07:19] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[15:11:39] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jcrespo)
[15:13:18] Data-Engineering, Data-Persistence, Discovery-Search, Infrastructure-Foundations, and 7 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MoritzMuehlenhoff)
[15:30:44] (CR) Xcollazo: [C: +2] "Sorry I meant to +2 this on last review. LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/889766 (https://phabricator.wikimedia.org/T307569) (owner: Joal)
[16:13:56] !log Reimage an-test-presto1003 to upgrade to bullseye T329361
[16:13:56] T329361: Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361
[16:14:13] !log Reimage an-presto1003 to upgrade to bullseye T329361
[16:15:36] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1003.eqiad.wmnet with OS bullseye
[16:46:27] Data-Engineering, Pageviews-API, Tool-Pageviews: Missing Pageviews Data (projectviews-20230220-230000) - https://phabricator.wikimedia.org/T330184 (Predata-Datasci)
[16:48:36] Data-Engineering, Pageviews-API: Missing Pageviews Data (projectviews-20230220-230000) - https://phabricator.wikimedia.org/T330184 (Reedy)
[17:04:36] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate drop_old_data_daily.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329870 (EBernhardson) a:EBernhardson
[17:09:16] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1003.eqiad.wmnet with OS bullseye c...
[17:20:37] Data-Engineering, Data Pipelines (Sprint 09): Check if new airflow package is usable on both buster and bullseye - https://phabricator.wikimedia.org/T330154 (JArguello-WMF)
[17:20:45] Data-Engineering, Data Pipelines (Sprint 09): Upload new airflow package to wikimedia APT repository - https://phabricator.wikimedia.org/T330087 (JArguello-WMF)
[17:23:44] Data-Engineering, Data Pipelines (Sprint 09): Check if new airflow package is usable on both buster and bullseye - https://phabricator.wikimedia.org/T330154 (JArguello-WMF)
[17:24:03] Data-Engineering, Data Pipelines (Sprint 09): Upload new airflow package to wikimedia APT repository - https://phabricator.wikimedia.org/T330087 (JArguello-WMF)
[18:35:16] Data-Engineering, Foundational Technology Requests, Product-Analytics (Kanban): "Source of truth" dataset for pageviews - https://phabricator.wikimedia.org/T310732 (mpopov) a:EChetty→Mayakp.wiki
[18:38:36] ottomata: wondering if you have suggestions...we run refinery-drop-older-than from our airflow instance, and i'm looking to port that over to airflow 2. analytics/refinery's python isn't packaged in a way thats pip installable, what would be the suggested method to run? I suppose easiest would be to add our airflow instance to the scap deployment of refinery? Otherwise i could perhaps
[18:38:38] copy the deployment out of hdfs to run locally or add /mnt/hdfs to mounts and run it from /mnt/hdfs
[18:39:20] ideally i suppose we would refer to a conda environment that includes refinery python
[19:00:44] !log we had another silent failure in airflow, a sensor that failed without sending an email. the logs are missing.
[19:00:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:00:47] Data-Engineering, API Platform, AQS2.0, Code-Health-Objective, and 2 others: Dashboards for AQS 2.0 - https://phabricator.wikimedia.org/T288667 (JArguello-WMF)
[19:01:03] !log re airflow silent failure: the job was pageview_actor_hourly
[19:01:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:14:35] Data-Engineering, Data Pipelines (Sprint 09): Check if new airflow package is usable on both buster and bullseye - https://phabricator.wikimedia.org/T330154 (Antoine_Quhen)
[20:23:49] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (bking) @EBernhardson Y ou can ignore the reimage failure above, the server is up and you should be able to login...
[20:39:39] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (Dbrant) Can I request a few engineers on the apps team to get the sql_lab role? Namely: dbrant (myself) sharvaniharan cooltey tsev mazevedo seddon We occasi...
[21:09:37] (PS30) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[21:11:27] o/ wondering if canary events or some refinery jobs are not working, partitions stop at hour=10 in hdfs:///wmf/data/raw/event/eqiad.rdf-streaming-updater.lapsed-action/year=2023/month=02/day=21/
[21:24:16] Data-Engineering, Event-Platform Value Stream: Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?) - https://phabricator.wikimedia.org/T330236 (dcausse)
[21:25:16] filed ^ (if someone has time to look into it), some airflow sensors might fail because of this
[21:28:55] (CR) Aqu: "Thanks for the review Gehel!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[21:47:24] Data-Engineering, Pageviews-API: Missing Pageviews Data (projectviews-20230220-230000) - https://phabricator.wikimedia.org/T330184 (Predata-Datasci) Update: data is now available at https://dumps.wikimedia.org/other/pageviews/2023/2023-02/, though still unavailable at https://dumps.wikimedia.org/other/pa...
[22:08:46] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (Aklapper) @Ottomata: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via {nav name=Add Action... > Change Status} in the dropd...
[22:08:55] Data-Engineering, Event-Platform Value Stream: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (Aklapper)
[22:09:46] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-EventLogging: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (Aklapper) @Ottomata: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via {nav...
[22:15:00] Data-Engineering, Patch-Needs-Improvement: HiveExtensions.convertToSchema does not properly convert arrays of structs - https://phabricator.wikimedia.org/T259924 (Aklapper)
[23:03:00] Data-Engineering, Pageviews-API, Data Pipelines (Sprint 09): Missing Pageviews Data (projectviews-20230220-230000) - https://phabricator.wikimedia.org/T330184 (lbowmaker) a:mforns
[23:03:32] Data-Engineering, Pageviews-API, Data Pipelines (Sprint 09): Missing Pageviews Data (projectviews-20230220-230000) - https://phabricator.wikimedia.org/T330184 (lbowmaker)
[23:07:36] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work), Patch-For-Review: Migrate drop_old_data_daily.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329870 (EBernhardson) Once the patch to discolytics is merged i can update the [[ https://gitlab.wikimedia....
[23:20:39] (CR) Gehel: "I tried to add some clarity on some of my comments. None of those are blocking, so feel free to ignore them." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)