[00:03:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:29] <wikibugs>	 Analytics-Radar, Editing-team, observability, Performance-Team (Radar): VE edit data stopped due to statsv falling over (?) on webperf1001 - https://phabricator.wikimedia.org/T239121 (Krinkle) Open→Resolved a:Krinkle
[01:43:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[01:48:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp6010 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:36:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: varnishkafka on cp6009 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=drmrs%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp6009%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:36:12] <jinxer-wm>	 (VarnishkafkaNoMessages) firing: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:41:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (10) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[02:41:12] <jinxer-wm>	 (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka  - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[03:03:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:02:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:06:13] <icinga-wm>	 PROBLEM - SSH on an-coord1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:23] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1094 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:08:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:01] <icinga-wm>	 RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:27] <wikibugs>	 (PS1) Nmaphophe: mend [analytics/refinery] - https://gerrit.wikimedia.org/r/858369
[09:26:35] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1094 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:54:53] <wikibugs>	 (PS1) Nmaphophe: Fix temp tables [analytics/refinery] - https://gerrit.wikimedia.org/r/858370
[09:56:11] <wikibugs>	 (Abandoned) Nmaphophe: mend [analytics/refinery] - https://gerrit.wikimedia.org/r/858369 (owner: Nmaphophe)
[10:38:22] <wikibugs>	 Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (KCVelaga_WMF) > Yes. AFFs by geo is a count of all affiliates operating in the country whereas Official Affiliate info is those primarily located in the country only and often excludes the many themat...
[10:44:48] <wikibugs>	 Data-Engineering-Planning, Data Pipelines, Foundational Technology Requests, Traffic, and 2 others: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (elukey) After a few unsuccessful tries (due to me+Friday combination), me and Filippo rolle...
[11:45:39] <wikibugs>	 (PS1) Elukey: oozie: add cache_status to webrequest's druid indexations [analytics/refinery] - https://gerrit.wikimedia.org/r/858561 (https://phabricator.wikimedia.org/T314981)
[14:29:44] <wikibugs>	 Data-Engineering-Radar, Cassandra: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (Eevans)
[14:52:03] <wikibugs>	 Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterlab on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo)
[15:09:16] <wikibugs>	 Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 04): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (bking) Just a heads-up as I'm re-engaging. I plan to use the `stream-enrichment-poc` namespac...
[16:07:26] <wikibugs>	 Data-Engineering, Equity-Landscape: Affiliates input metric - https://phabricator.wikimedia.org/T309275 (JAnstee_WMF) >@JAnstee_WMF I am not sure if Affiliate Information 2021 (Mirror) has that information. That will require a different source to be used by @ntsako, currently all the Correct, it requires...
[16:28:40] <wikibugs>	 (PS1) Mforns: Fix EventLogging sanitization allow-list to unbreak production [analytics/refinery] - https://gerrit.wikimedia.org/r/858625
[16:30:47] <milimetric>	 SandraEbele: I think there's a problem in sanitize that we have to deploy a hotfix for
[16:30:48] <milimetric>	 https://github.com/wikimedia/analytics-refinery/blame/master/static_data/sanitization/event_sanitized_analytics_allowlist.yaml#L1640
[16:30:53] <milimetric>	 cc mforns
[16:30:59] <mforns>	 milimetric: just commented on that in slack
[16:31:15] <milimetric>	 oh ok!  :) I think it just doesn't like that there's no space there right?
[16:31:18] <mforns>	 milimetric: will sync the fix to hdfs now
[16:31:22] <mforns>	 yes
[16:31:30] <milimetric>	 sweet
[16:31:43] <mforns>	 thanks for looking into that! It had slipped past me
[18:29:15] <wikibugs>	 (PS2) Mforns: Fix EventLogging sanitization allow-list to unbreak production [analytics/refinery] - https://gerrit.wikimedia.org/r/858625
[18:48:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:55:16] <wikibugs>	 (PS2) Milimetric: [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344
[18:56:36] <mforns>	 !log re-ran refine_event_sanitized_analytics_immediate from 2022-11-17T13 to 2022-11-18T18 to fix the issues caused by a bug (allow-list typo) deployed yesterday.
[18:56:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:56:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service,monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:13] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Stream revision topics into iceberg table [analytics/refinery/source] - https://gerrit.wikimedia.org/r/858344 (owner: Milimetric)
[20:11:31] <wikibugs>	 Quarry, Cloud-Services, cloud-services-team (Kanban): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (nskaggs)
[20:46:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:48] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state