[00:25:53] Data-Engineering, Data Pipelines: Wrong file names for 2 month files in pageview_complete/monthly - https://phabricator.wikimedia.org/T335685 (Milimetric)
[00:47:54] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (xcollazo) Open→In progress a:BTullis→xcollazo
[00:47:58] Data-Engineering-Planning, Epic: [Iceberg] Epic: Icebergify event_sanitized database - https://phabricator.wikimedia.org/T311743 (xcollazo)
[00:57:29] Data-Engineering-Planning, Data Pipelines, Shared-Data-Infrastructure: [Iceberg] Debianize and install iceberg support for Spark, Presto, and optionally Hive - https://phabricator.wikimedia.org/T311738 (xcollazo)
[01:38:23] (PS1) Milimetric: Fix performance regression due to disallow list [analytics/refinery] - https://gerrit.wikimedia.org/r/914031
[01:55:13] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:36:35] Data-Engineering, Data Pipelines: Wrong file names for 2 month files in pageview_complete/monthly - https://phabricator.wikimedia.org/T335685 (hashar) After a few hours, this is what I get now (4:33 UTC): For March https://dumps.wikimedia.org/other/pageview_complete/monthly/2023/2023-03/ ` pageviews-202...
[05:55:13] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:23] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[07:45:21] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:52:51] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (fgiunchedi)
[08:28:50] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Gehel)
[08:29:08] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Gehel)
[08:42:01] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[09:33:55] !log depooled schema2003 for T334049
[09:33:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:33:58] T334049: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049
[09:34:00] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (BTullis)
[09:55:13] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:15] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[09:59:45] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw ro...
[10:05:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_04 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[10:25:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_04 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[10:52:58] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw ro...
[11:31:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:46:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:11:17] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[12:20:35] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:25:55] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ssingh)
[12:27:01] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:42:05] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:45:56] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[13:03:25] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:0...
[13:03:39] Data-Engineering-Planning, Data Pipelines (Sprint 12): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (JArguello-WMF) Open→Resolved
[13:03:43] Data-Engineering-Planning, Product-Analytics, Data Pipelines (sprint 10): 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (JArguello-WMF)
[13:03:48] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 12): 2 additional new wikis - https://phabricator.wikimedia.org/T332070 (JArguello-WMF) Open→Resolved
[13:03:52] Data-Engineering-Planning, Data Pipelines (Sprint 12): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JArguello-WMF) Open→Resolved
[13:09:32] (CR) Mforns: [C: +1] "LGTM! Left a minor comment, but I'm OK to merge as is!" [analytics/refinery] - https://gerrit.wikimedia.org/r/913136 (https://phabricator.wikimedia.org/T334101) (owner: Aqu)
[13:15:34] btullis, joal: I've just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/912301 After puppet has run everywhere (in 30 mins), could you or someone from DE doublecheck that this unbreaks the Refinery deploys to HDFS?
[13:18:07] (SystemdUnitFailed) resolved: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:22] (SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:53] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:34] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[13:25:24] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[13:35:02] Data-Engineering, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team, Event-Platform Value Stream (Sprint 12): Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (Ottomata)
[13:35:14] Data-Engineering, Metrics-Platform-Planning, Product-Analytics, WMF-Architecture-Team, Event-Platform Value Stream (Sprint 12): Major (API) versioning of Event Platform streams - https://phabricator.wikimedia.org/T332212 (Ottomata) p:Triage→Medium a:Ottomata
[13:47:51] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Andrew)
[13:48:47] moritzm: Yes, will do. Many thanks for that.
[13:48:56] ack, sounds good!
[13:49:10] !log deploying updated mediawiki history snapshot to aqs
[13:49:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:00:39] Data-Engineering-Planning, Data Pipelines (Sprint 12): Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (Antoine_Quhen) a:Antoine_Quhen
[14:02:32] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:02:49] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[14:07:22] (PS2) Aqu: Migrate geoeditors monthly Druid ingestion to Airflow [analytics/refinery] - https://gerrit.wikimedia.org/r/913136 (https://phabricator.wikimedia.org/T334101)
[14:57:08] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[15:00:02] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C swi...
[15:07:00] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) Open→Resolved a:ayounsi Upgrade went fine! Thanks everybody.
[15:17:04] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C swi...
[16:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.992% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:15:03] (CR) Snwachukwu: "code looks good to me!" [analytics/refinery] - https://gerrit.wikimedia.org/r/913136 (https://phabricator.wikimedia.org/T334101) (owner: Aqu)
[16:15:38] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/914031 (owner: Milimetric)
[16:24:47] !log roll-restarting AQS
[16:24:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:03:27] (SystemdUnitFailed) firing: (20) jupyterhub-conda.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:27] (SystemdUnitFailed) firing: (20) jupyterhub-conda.service Failed on an-test-client1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:37:28] (CR) Milimetric: [C: +2] Fix performance regression due to disallow list [analytics/refinery] - https://gerrit.wikimedia.org/r/914031 (owner: Milimetric)
[17:37:32] (CR) Milimetric: [V: +2 C: +2] Fix performance regression due to disallow list [analytics/refinery] - https://gerrit.wikimedia.org/r/914031 (owner: Milimetric)
[17:48:57] Data-Engineering, Event-Platform Value Stream: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata)
[17:56:18] Data-Engineering, Event-Platform Value Stream: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata)
[17:56:39] Data-Engineering, Event-Platform Value Stream: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata)
[18:20:40] !log deployed refinery as part of weekly train
[18:20:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:26:54] (PS2) Mforns: Fix HiveToDruid to allow for non-partitioned source tables. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/910094 (https://phabricator.wikimedia.org/T334096)
[18:37:33] (PS3) Mforns: Fix HiveToDruid to allow for non-partitioned source tables. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/910094 (https://phabricator.wikimedia.org/T334096)
[19:00:56] (CR) Mforns: "Left a couple comments, but looks great to me overall!" [analytics/refinery] - https://gerrit.wikimedia.org/r/910520 (https://phabricator.wikimedia.org/T334104) (owner: Snwachukwu)
[19:14:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:34:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1002:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1002:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:08:13] (DiskSpace) firing: Disk space stat1005:9100:/ 5.658% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=stat1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:19:05] Data-Engineering, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)
[20:28:45] (CR) Milimetric: [C: +1] "The countMetricName change looks good, at first I thought DataFrameToDruid was hard-coding it but I see the constant there is just used as" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/910094 (https://phabricator.wikimedia.org/T334096) (owner: Mforns)
[20:29:36] thanks milimetric :]
[20:29:50] ++
[21:18:28] (SystemdUnitFailed) firing: (19) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:43:16] (CR) Krinkle: [C: +2] CentralNoticeTiming: Remove CentralNoticeTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/912852 (https://phabricator.wikimedia.org/T334550) (owner: Barakat Ajadi)
[23:44:00] (Merged) jenkins-bot: CentralNoticeTiming: Remove CentralNoticeTiming schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/912852 (https://phabricator.wikimedia.org/T334550) (owner: Barakat Ajadi)