[00:03:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:29] Analytics-Radar, Editing-team, observability, Performance-Team (Radar): VE edit data stopped due to statsv falling over (?) on webperf1001 - https://phabricator.wikimedia.org/T239121 (Krinkle) Open→Resolved a:Krinkle
[01:43:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:48:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:36:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp6009 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6009%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:36:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:41:12] (VarnishkafkaNoMessages) resolved: (10) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:41:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[03:03:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:31] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:02:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:13] PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:23] PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:08:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:01] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:27] (PS1) Nmaphophe: mend [analytics/refinery] - https://gerrit.wikimedia.org/r/858369
[09:26:35] RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:54:53] (PS1) Nmaphophe: Fix temp tables [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[09:56:11] (Abandoned) Nmaphophe: mend [analytics/refinery] - https://gerrit.wikimedia.org/r/858369 (owner: Nmaphophe)
[10:38:22] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) > Yes. AFFs by geo is a count of all affiliates operating in the country whereas Official Affiliate info is those primarily located in the country only and often excludes the many themat...
[10:44:48] Data-Engineering-Planning, Data Pipelines, Foundational Technology Requests, Traffic, and 2 others: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (elukey) After a few unsuccessful tries (due to me+Friday combination), me and Filippo rolle...
[11:45:39] (PS1) Elukey: oozie: add cache_status to webrequest's druid indexations [analytics/refinery] - https://gerrit.wikimedia.org/r/858561 (https://phabricator.wikimedia.org/T314981)
[14:29:44] Data-Engineering-Radar, Cassandra: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (Eevans)
[14:52:03] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo)
[15:09:16] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 04): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (bking) Just a heads-up as I'm re-engaging. I plan to use the `stream-enrichment-poc` namespac...
[16:07:26] Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (JAnstee_WMF) >@JAnstee_WMF I am not sure if Affiliate Information 2021 (Mirror) has that information. That will require a different source to be used by @ntsako, currently all the Correct, it requires...
[16:28:40] (PS1) Mforns: Fix EventLogging sanitization allow-list to unbreak production [analytics/refinery] - https://gerrit.wikimedia.org/r/858625
[16:30:47] SandraEbele: I think there's a problem in sanitize that we have to deploy a hotfix for
[16:30:48] https://github.com/wikimedia/analytics-refinery/blame/master/static_data/sanitization/event_sanitized_analytics_allowlist.yaml#L1640
[16:30:53] cc mforns
[16:30:59] milimetric: just commented on that in slack
[16:31:15] oh ok! :) I think it just doesn't like that there's no space there right?
[16:31:18] milimetric: will sync the fix to hdfs now
[16:31:22] yes
[16:31:30] sweet
[16:31:43] thanks for looking into that! It had slipped me
[18:29:15] (PS2) Mforns: Fix EventLogging sanitization allow-list to unbreak production [analytics/refinery] - https://gerrit.wikimedia.org/r/858625
[18:48:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:55:16] (PS2) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344
[18:56:36] !log re-ran refine_event_sanitized_analytics_immediate from 2022-11-17T13 to 2022-11-18T18 to fix the issues caused by a bug (allow-list typo) deployed yesterday.
[18:56:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:56:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:13] (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (owner: Milimetric)
[20:11:31] Quarry, Cloud-Services, cloud-services-team (Kanban): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (nskaggs)
[20:46:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state