[00:26:00] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (odimitrijevic) Is this the same issue reported in https://phabricator.wikimedia.org/T328127?
[00:58:18] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_02 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[01:18:18] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_02 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[03:08:18] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Peachey88)
[03:09:46] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Peachey88)
[03:09:55] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Peachey88)
[03:10:07] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Peachey88)
[04:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:32] (PS1) Gergő Tisza: helppanel: Add not-known editor type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727)
[04:36:20] (CR) CI reject: [V: -1] helppanel: Add not-known editor type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[05:09:18] (PS2) Gergő Tisza: helppanel: Add not-known editor type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727)
[07:29:38] !log truncate /var/log/auth.log.1 on krb1001 to free space (root partition almost filled up)
[07:29:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:30:29] moritzm, nfraison --^ the auth.log is full of presto-related entries, and the root partition was almost filled up. Maybe we could redirect the output of the krb daemons to a file that logs under /srv?
[07:42:04] Data-Engineering, Event-Platform Value Stream, Machine-Learning-Team, Research: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (elukey) Just to clarify - the ML team is happy to support any input/output schema that is reasonable. We are try...
[07:47:53] yeah, I'll open a task. that's most certainly the effect of the expansion of the Presto cluster and the resulting increase in requests
[07:53:30] moritzm indeed it should be related, as the number of TGS requests has increased. Pushing the log to /srv or updating logrotate to run hourly could indeed be a solution
[07:55:35] moving to /srv is fine, we also do this for a few other high volume services
[08:04:04] Data-Engineering, Edit-Review-Improvements-Integrated-Filters, Event-Platform Value Stream, Growth-Team, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (elukey) Thanks a lot! > Regarding the jobs, the reason ores ext does...
[08:36:05] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) For reference disk bench from an-worker1131 ` nfraison@an-worker1131:/var/lib/hadoop/data/d/test$ sudo sysbench fileio --fil...
[08:45:14] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) On an-worker1132 all disks are having same stats with no more 4MiB for read/write per sec and 238/158 iops ` /var/lib/hadoo...
[09:00:36] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) Raid disk configuration is in WriteThrough instead of WriteBack. - On an-worker1131 ` nfraison@an-worker1131:~$ sudo megacl...
[09:02:56] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) BBU looks fine ` nfraison@an-worker1132:~$ sudo megacli -AdpBbuCmd -aALL BBU status fo...
[09:18:27] (CR) Kosta Harlan: "Can we combine this with I9913001d4d2a4624846ac8ec3d38fc3d5f3de97c?" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[09:23:50] (PS1) Phuedx: Remove SpecialMuteSubmit allowlist entry [analytics/refinery] - https://gerrit.wikimedia.org/r/893998 (https://phabricator.wikimedia.org/T329718)
[09:25:06] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (nfraison) Enforcing cache to WriteBack doesn't work: `sudo megacli -LDSetProp -WB -Immediate -Lall -aAll`
[09:50:27] Data-Engineering: Check home/HDFS leftovers of toan - https://phabricator.wikimedia.org/T331100 (MoritzMuehlenhoff)
[10:13:43] Data-Engineering: Check home/HDFS leftovers of jk - https://phabricator.wikimedia.org/T331108 (MoritzMuehlenhoff)
[10:30:16] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (nfraison) @Cmjohnson this node has strange behaviour on raid/disks All disks are really slow compare to ones on other nodes. After looking at that it has indeed bad Current Cache policy set...
[11:18:56] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis)
[11:21:32] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) p:Triage→High Bringing into the current sprint with high priority, owing to the need to fix the puppet com...
[11:27:56] Data-Engineering, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) As a point of reference, `conftool-data` for aqs servers in codfw already exists and they are marked as inactive....
[12:11:48] (CR) Kosta Harlan: [C: +2] helppanel: Add not-known editor type (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[12:12:23] (Merged) jenkins-bot: helppanel: Add not-known editor type [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893842 (https://phabricator.wikimedia.org/T330727) (owner: Gergő Tisza)
[12:19:37] (PS2) Kosta Harlan: helppanel: Add support for trynewtask dialog [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893734 (https://phabricator.wikimedia.org/T330637)
[12:20:09] (CR) CI reject: [V: -1] helppanel: Add support for trynewtask dialog [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893734 (https://phabricator.wikimedia.org/T330637) (owner: Kosta Harlan)
[12:23:50] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (BTullis) Based on a quick ping, the round-trip time from the aqs servers in codfw to the druid-public...
[12:37:34] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (cmooney) a:cmooney
[13:10:59] (PS3) Kosta Harlan: helppanel: Add support for trynewtask dialog [schemas/event/secondary] - https://gerrit.wikimedia.org/r/893734 (https://phabricator.wikimedia.org/T330637)
[13:47:51] ACKNOWLEDGEMENT - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service,wmf_auto_restart_airflow-webserver@search.service Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:51] ACKNOWLEDGEMENT - Checks that the airflow database for airflow search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check did not succeed Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:47:51] ACKNOWLEDGEMENT - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed Btullis Host not yet in service: T327970 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:56:32] Data-Engineering, SRE, ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (Cmjohnson) We can replace the BBU, let's get the disk replaced first and then create a new ticket for a BBU
[14:15:02] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:17:38] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (jbond)
[14:18:00] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi)
[14:18:26] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:18:38] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Spike: Pageview Anomaly Analysis - https://phabricator.wikimedia.org/T328935 (lbowmaker) Open→Declined
[14:18:42] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (lbowmaker)
[14:19:00] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:19:12] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Spike: Pageview Anomaly Analysis - https://phabricator.wikimedia.org/T328935 (lbowmaker) Not sure why this ticket was created. We will use parent ticket to track work.
[14:23:36] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (lbowmaker) @SNowick_WMF - does this look like the same issue as you worked on in -> https://phabricator.wikimedia.org/T32...
[14:33:37] Analytics-Radar, Data-Engineering-Planning, Data-Engineering-Wikistats, Data Pipelines, and 2 others: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479 (lbowmaker) Open→Declined Marking this as declined for now. Looking at the history it seems like nothing...
[14:37:53] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (lbowmaker)
[14:38:41] Data-Engineering-Planning: Check home/HDFS leftovers of jk - https://phabricator.wikimedia.org/T331108 (lbowmaker)
[14:38:47] Data-Engineering-Planning, Shared-Data-Infrastructure, Patch-For-Review: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 (lbowmaker)
[14:39:05] Data-Engineering-Planning, Shared-Data-Infrastructure: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (lbowmaker)
[14:39:24] Data-Engineering-Planning, Shared-Data-Infrastructure: Deploy ceph radosgw processes to data-engineering cluster - https://phabricator.wikimedia.org/T330152 (lbowmaker)
[14:40:06] Data-Engineering-Planning, Event-Platform Value Stream: [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (lbowmaker)
[14:40:23] Data-Engineering-Planning, Event-Platform Value Stream: [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (lbowmaker)
[14:41:09] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 (lbowmaker)
[14:41:23] Data-Engineering-Planning, Event-Platform Value Stream: Flink + Event Platform integration for writing into streams via Table API - https://phabricator.wikimedia.org/T324114 (lbowmaker)
[14:41:26] Data-Engineering-Planning, Event-Platform Value Stream: Spark Streaming Dumps POC: Backfill content table - https://phabricator.wikimedia.org/T323641 (lbowmaker)
[14:41:36] Data-Engineering-Planning, Event-Platform Value Stream: Spark Streaming Dumps POC: Update iceberg tables - https://phabricator.wikimedia.org/T323645 (lbowmaker)
[14:41:45] Data-Engineering-Planning, Shared-Data-Infrastructure: Investigate slownesses on an-worker1132 - https://phabricator.wikimedia.org/T330979 (lbowmaker)
[14:43:26] Data-Engineering-Planning: Check home/HDFS leftovers of toan - https://phabricator.wikimedia.org/T331100 (lbowmaker)
[14:44:54] Data-Engineering-Planning, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (lbowmaker)
[14:45:27] Data-Engineering-Planning, DBA, Data Pipelines, Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (lbowmaker)
[14:48:01] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx)
[14:55:18] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (elukey)
[14:57:38] Data-Engineering-Planning, Edit-Review-Improvements-Integrated-Filters, Event-Platform Value Stream, Growth-Team, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (lbowmaker)
[14:57:56] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (fgiunchedi)
[15:02:59] Data-Engineering, Event-Platform Value Stream, MediaWiki-extensions-WikimediaEvents, Product-Analytics, Technical-Debt: Decommission the EditorActivation instrument - https://phabricator.wikimedia.org/T330766 (phuedx) @ori: You're listed as the maintainer for the EditorActivation schema. Do y...
[15:10:29] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (herron)
[15:17:37] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (BTullis)
[15:22:14] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (hnowlan)
[15:36:52] Analytics-Radar, Data-Engineering-Icebox, Discovery-Search, Reading-Admin, and 3 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (Miriam)
[15:37:16] Analytics-Radar, Data-Engineering-Icebox, Discovery-Search, Reading-Admin, and 3 others: Image Classification Working Group - https://phabricator.wikimedia.org/T215413 (Miriam)
[15:44:09] !log Deploying latest image_suggestions DAG on platform_eng Airflow instance
[15:44:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:45] !log deployed airflow analytics to unbreak edit_hourly_dag
[15:53:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:07:32] Data-Engineering-Planning, Data Pipelines: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset - https://phabricator.wikimedia.org/T330436 (mpopov) Yes, the approach you described is only available in Superset and only available to analytics-privatedata-users. That als...
[16:48:12] !log Deleted snapshot=2023-02-20 for tables image_suggestions_search_index_full, image_suggestions_search_index_delta, image_suggestions_lead_image_data and image_suggestions_wikidata_data from the analytics_platform_eng schema. This data will be regenerated. See https://phabricator.wikimedia.org/T330688.
[16:48:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:25:44] Data-Engineering-Planning, DBA, Data-Persistence, Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (colewhite)
[17:31:04] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (BTullis) >>! In T328457#8634739, @Dbrant wrote: > Can I request a few engineers on the apps team to get the sql_lab role? Namely: > dbrant (myself) > sharvan...
[17:38:16] o/ I'm having some issues with sqllab in superset,
[17:38:36] Every query I try to run against presto just says Database error Unknown error
[17:54:59] Data-Engineering: Assign Superset sql_labs access through customer roles - https://phabricator.wikimedia.org/T331160 (odimitrijevic)
[18:24:32] looks like this fixed itself
[18:44:05] (PS1) Addshore: Sanitization: Keep version for mwcli_command_execute [analytics/refinery] - https://gerrit.wikimedia.org/r/894104
[18:44:37] Regarding the above patch ^^, is there any way to repopulate the data in the event_sanitized table once it is merged and deployed?
[18:45:22] (CR) Addshore: "Follow up to https://gerrit.wikimedia.org/r/c/analytics/refinery/+/873013" [analytics/refinery] - https://gerrit.wikimedia.org/r/894104 (owner: Addshore)
[20:33:57] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate query_clicks.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329878 (EBernhardson) a:EBernhardson