[00:46:47] Data-Engineering, Event-Platform Value Stream, events: testing - https://phabricator.wikimedia.org/T336021 (JArguello-WMF)
[00:47:17] Data-Engineering, Event-Platform Value Stream, events: testing - https://phabricator.wikimedia.org/T336021 (JArguello-WMF) Open→Invalid
[01:28:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:28] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:28:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:01] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (tchin) Do we know what's turning them into ecs format in the first place?
[07:23:14] Data-Engineering, Privacy Engineering: The soon-to-be-released pageview datasets should be linked from dumps page - https://phabricator.wikimedia.org/T335958 (ayayb173)
[07:24:21] Data-Engineering, Privacy Engineering: The soon-to-be-released pageview datasets should be linked from dumps page - https://phabricator.wikimedia.org/T335958 (ayayb173)
[09:28:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:35] Data-Engineering, Shared-Data-Infrastructure: Bring stat1009 into service - https://phabricator.wikimedia.org/T336036 (BTullis)
[09:47:05] Data-Engineering, Shared-Data-Infrastructure: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (BTullis)
[09:55:16] Data-Engineering, Shared-Data-Infrastructure: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (BTullis)
[10:01:21] Data-Engineering, Shared-Data-Infrastructure: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (BTullis)
[10:03:48] Data-Engineering, Shared-Data-Infrastructure: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (BTullis)
[10:05:04] Data-Engineering, Shared-Data-Infrastructure: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (BTullis)
[10:07:34] Data-Engineering, Shared-Data-Infrastructure: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[10:40:56] Data-Engineering-Planning, Equity-Landscape: Load language data - https://phabricator.wikimedia.org/T315886 (KCVelaga_WMF)
[11:05:44] Announcing: scheduled downtime for the analytics mariadb replicas: 2023/05/09 at 09:30 UTC for 30 minutes
[11:06:25] I've sent emails to analytics@lists.w.o and analytics-announce@.w.o
[12:59:47] (PS1) Btullis: Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514)
[13:18:13] (CR) CI reject: [V: -1] Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:22:02] (PS2) Btullis: Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514)
[13:28:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:53] !log roll-rebooting presto workers for T335835
[14:26:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:29:10] (PS3) Btullis: Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514)
[15:05:14] Data-Engineering, Shared-Data-Infrastructure: Decommission an-test-coord1002 - https://phabricator.wikimedia.org/T336062 (BTullis)
[15:06:24] Analytics, Data-Engineering, Data-Engineering-Kanban, Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (BTullis)
[15:06:26] Data-Engineering: Analytics coordinator failover improvements - https://phabricator.wikimedia.org/T280905 (BTullis)
[15:06:46] !log deployed airflow analytics
[15:06:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:07:16] Analytics-Kanban, Data-Engineering, Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (BTullis) Open→Declined Declining this task, since the server hasn't been used and is due for decommissioning...
[15:24:57] (PS4) Btullis: Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514)
[15:42:59] (CR) Btullis: [C: +2] Add a datahub-upgrade container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/916483 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[15:44:33] !log re-ran projectview_hourly DAG for 2023-05-05T13
[15:44:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:50:46] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: mediawiki-event-enrichment: issue async requests from ProcessFunction - https://phabricator.wikimedia.org/T332948 (Ottomata) Backfilled the full day, job is still running. Things look okay...except Python RSS is still slowl...
[17:07:10] PROBLEM - MegaRAID on an-worker1088 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:07:59] Data-Engineering-Planning, XTools, Chinese-Sites, Data Pipelines (Sprint 12): Run maintain-views on zhwiki, newiki - https://phabricator.wikimedia.org/T334041 (BTullis) Open→Resolved a:BTullis @MusikAnimal - Apologies for the delay in carrying out this task. I believe that it the work...
[17:14:49] Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure: MegaRAID error on an-worker1088 - https://phabricator.wikimedia.org/T336077 (BTullis)
[17:15:13] ACKNOWLEDGEMENT - MegaRAID on an-worker1088 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T336077 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:28:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:35:24] Data-Engineering, Metrics-Platform-Planning, Wikimedia-production-error: EventGate Validation error: `session_id` wrong length (multiple schemas) - https://phabricator.wikimedia.org/T336078 (matmarex) Ah, I found them: https://logstash.wikimedia.org/goto/c845c75739d3a2749a59a488762ed066 The same mes...
[17:50:05] (PS1) Mforns: Fix webrequest sampled 128 druid loading queries [analytics/refinery] - https://gerrit.wikimedia.org/r/916537 (https://phabricator.wikimedia.org/T334106)
[17:58:45] (CR) Milimetric: [V: +2 C: +2] Fix webrequest sampled 128 druid loading queries [analytics/refinery] - https://gerrit.wikimedia.org/r/916537 (https://phabricator.wikimedia.org/T334106) (owner: Mforns)
[18:40:59] RECOVERY - MegaRAID on an-worker1088 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:16:21] Data-Engineering, Metrics-Platform-Planning, Wikimedia-production-error: EventGate Validation error: `session_id` wrong length (multiple schemas) - https://phabricator.wikimedia.org/T336078 (Ottomata) > one of the fields magically provided by the event platform? By Metrics Platform. cc @phuedx :)
[20:01:11] Data-Engineering, Event-Platform Value Stream (Sprint 12), Patch-For-Review: eventutilities-python manager should set up python logging with ECS format - https://phabricator.wikimedia.org/T335802 (Ottomata) Not 100% but pretty sure it is our k8s stdout -> logstash stuff. We should ask ServiceOps I gu...
[20:01:31] Data-Engineering, Product-Analytics: Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Milimetric)
[20:03:32] Data-Engineering, Product-Analytics: Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Milimetric)
[20:04:31] Data-Engineering, Product-Analytics: Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (Milimetric)
[20:23:03] Data-Engineering, Data-Persistence, IP Masking: Adding user_is_temp to the user table - https://phabricator.wikimedia.org/T333223 (Tchanders)
[21:28:31] (SystemdUnitFailed) firing: (20) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:07] (PS1) Kimberly Sarabia: References new fragment in scroll and editattemptstep [schemas/event/secondary] - https://gerrit.wikimedia.org/r/916625 (https://phabricator.wikimedia.org/T335309)