[03:43:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:40] Data-Engineering-Planning, Data Pipelines, Foundational Technology Requests, Traffic, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (elukey) Cross posting: T310997#8378791 Me and Filippo are working on a prototype wi...
[11:31:34] Data-Engineering-Planning: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (EChetty)
[11:31:57] Data-Engineering-Planning: 503 on Superset (reproducible) - https://phabricator.wikimedia.org/T322525 (EChetty)
[11:51:18] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:24:04] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:18:48] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:11:18] (PS3) Snwachukwu: [WIP] Add Custom Authentication Configuration Class for Cassandra. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/851077 (https://phabricator.wikimedia.org/T306895)
[14:32:39] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:56:03] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (tchin) I think the custom type inference in Java/Scala is really powerful, but if someone already is at a point where they're writing UDFs in...
[15:04:41] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:12:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp1075%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:15:39] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:16:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric)
[15:17:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka on cp1075 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[15:20:00] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric)
[15:29:33] PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:47] Analytics-Jupyter, Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 04), Patch-For-Review: Add support for jupyterhub on conda-analytics - https://phabricator.wikimedia.org/T321088 (xcollazo) Thanks for fixing this Ben! >Perhaps we should add libmariadb-dev as a requireme...
[15:34:36] RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:40:59] PROBLEM - AQS root url on aqs1021 is CRITICAL: connect to address 10.64.135.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:46:05] PROBLEM - AQS root url on aqs1020 is CRITICAL: connect to address 10.64.131.7 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:47:15] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:51:17] PROBLEM - AQS root url on aqs1019 is CRITICAL: connect to address 10.64.48.147 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:56:21] PROBLEM - AQS root url on aqs1018 is CRITICAL: connect to address 10.64.32.185 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:58:38] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (tchin) (When it comes to this task of making an example reading/writing with Flink SQL and a UDF, With Andrew's example and also a [[ https://...
[16:01:27] PROBLEM - AQS root url on aqs1017 is CRITICAL: connect to address 10.64.16.75 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:05:25] RECOVERY - AQS root url on aqs1017 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:08:00] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (RobH)
[16:08:07] RECOVERY - AQS root url on aqs1018 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:08:10] Data-Engineering-Planning, DC-Ops, SRE, Shared-Data-Infrastructure, ops-eqiad: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (RobH)
[16:09:11] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:10:17] RECOVERY - AQS root url on aqs1019 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:11:43] RECOVERY - AQS root url on aqs1020 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[16:12:17] RECOVERY - AQS root url on aqs1021 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[17:04:01] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:36:55] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:27:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:28:21] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04): [NEEDS GROOMING] Flink SQL queries should access Kafka topics from a Catalog - https://phabricator.wikimedia.org/T322022 (Ottomata) Some thoughts and trials of implementing a Flink Event Platform catalog [[ https://phabricator.wikimedia....
[18:32:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2033 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2033%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:24:23] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:46:21] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:09:58] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 04), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (Ottomata) > if we're deriving it from schemas then the user would have to go to the schema repo and figure out what they have to return anywa...
[20:51:14] PROBLEM - MegaRAID on an-worker1089 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:01:33] RECOVERY - MegaRAID on an-worker1089 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:32:58] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 (Ottomata)
[21:33:24] Data-Engineering, Event-Platform Value Stream (Sprint 03), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata) Resolved→Open Re-opening to discuss a schema change. In https://gerrit.wikimedia.org...
[21:33:50] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata)
[21:34:25] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata)
[21:34:57] Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (Ottomata)
[21:35:54] Data-Engineering-Planning, Event-Platform Value Stream, Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (Ottomata) @EChetty why Event Platform here?
[22:00:23] (PS1) Ottomata: development/ page change - Remove comment_html fields, bump to 2.0.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/855146 (https://phabricator.wikimedia.org/T308017)