[00:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:09] Data-Engineering, Product-Analytics: Presto returns incorrect data for an added field - https://phabricator.wikimedia.org/T321960 (nshahquinn-wmf)
[00:47:28] Data-Engineering, Product-Analytics: Presto returns incorrect data for an added field - https://phabricator.wikimedia.org/T321960 (nshahquinn-wmf)
[00:47:50] Data-Engineering, Product-Analytics: Presto returns incorrect data for an added field - https://phabricator.wikimedia.org/T321960 (nshahquinn-wmf) >>! In T321960#8356472, @Ottomata wrote: > If you select this data using Hive or Spark, it returns NULL for that column in old data. Ah, right! I should've c...
[00:51:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:24:52] Data-Engineering: Check home/HDFS leftovers of faidon - https://phabricator.wikimedia.org/T322107 (MoritzMuehlenhoff)
[08:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:31] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (gmodena) >>! In T320968#8358350, @Ottomata wrote: >> Can you elaborate on that? I thought the executable is venv/bin/python3 > > Uh hm. I ju...
[09:21:26] hm, btullis if you're around, I see icinga-wm telling us about produce_canary_events failing and recovering, but I don't have any emails. Did someone fix that alert to not email if it recovers or something? I'm just worried that the alerts are broken otherwise...
[09:22:01] I am around. Looking now. Thanks milimetric.
[09:23:13] (CR) Gehel: "Minor comment inline. Feel free to ping me on Slack or IRC if it does not make sense!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/851077 (https://phabricator.wikimedia.org/T306895) (owner: Snwachukwu)
[09:27:51] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (gmodena) >> Hm! Is this true? I would assume that Flink would have to unzip the virtualenv for every new taskmanager, but not every time you e...
[09:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:17] milimetric: I've found out why you haven't got any emails. You need to subscribe to the mailing list for data-engineering-alerts@lists.wikimedia.org here: https://lists.wikimedia.org/postorius/lists/data-engineering-alerts.lists.wikimedia.org/
[09:35:58] There is some background info here: https://wikimedia.slack.com/archives/C02291Z9YQY/p1666872051845679?thread_ts=1666106567.844199&cid=C02291Z9YQY and here: https://phabricator.wikimedia.org/T315486
[09:36:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:43] I'll make you an admin on the mailing list too, once you've subscribed.
[09:39:29] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Sprint 03): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (JMeybohm) IIRC the application deployment cluster we ditched because of missing HA capabiliti...
[09:41:09] https://usercontent.irccloud-cdn.com/file/1kvQjQUb/image.png
[10:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:48] (PS8) Btullis: Bump to version 0.9.0 of DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/851082 (https://phabricator.wikimedia.org/T321907)
[11:43:50] Data-Engineering, Event-Platform Value Stream: [NEEDS GROOMING] Improve reliability of simple stateless services - https://phabricator.wikimedia.org/T322125 (gmodena)
[11:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:33] (CR) Btullis: [C: +2] Bump to version 0.9.0 of DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/851082 (https://phabricator.wikimedia.org/T321907) (owner: Btullis)
[12:41:19] (Merged) jenkins-bot: Bump to version 0.9.0 of DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/851082 (https://phabricator.wikimedia.org/T321907) (owner: Btullis)
[12:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:45] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (tchin) The UDFs appear to be being [[ https://github.com/apache/flink/blob/6b04a50ae2182d4cdd8e44ea9a16171d1d2394ce/flink-python/src/main/java...
[13:30:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:22] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike: [SPIKE] Build simple stateless service using Flink SQL - https://phabricator.wikimedia.org/T318856 (JArguello-WMF) Open→Resolved
[13:36:28] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 03), Spike: [SPIKE] Build simple stateless service using PyFlink - https://phabricator.wikimedia.org/T318859 (JArguello-WMF) Open→Resolved
[13:36:36] Data-Engineering-Planning, Event-Platform Value Stream, Epic: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 (JArguello-WMF)
[13:36:40] Data-Engineering, Event-Platform Value Stream (Sprint 03), Patch-For-Review: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 (JArguello-WMF) Open→Resolved
[13:37:31] Data-Engineering, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), MW-1.40-notes (1.40.0-wmf.8; 2022-10-31): EventBus' stream config destination_event_service setting should move into producers.mediawikI_eventbus specific settings. - https://phabricator.wikimedia.org/T321557 (JAr...
[13:37:48] Data-Engineering-Planning, Shared-Data-Infrastructure, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)): [SPIKE] Deploy event driven stateless Flink service to DSE cluster - https://phabricator.wikimedia.org/T320812 (JArguello-WMF)
[13:37:50] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (JArguello-WMF)
[13:37:52] Data-Engineering, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), MW-1.40-notes (1.40.0-wmf.8; 2022-10-31), Platform Team Initiatives (Modern Event Platform (TEC2)): Allow disabling/enabling configured streams via wgEventStreams confi... - https://phabricator.wikimedia.org/T259712
[13:37:54] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Spike: Easy Flink Python UDF + SQL enrichment - https://phabricator.wikimedia.org/T320968 (JArguello-WMF)
[13:37:56] Data-Engineering-Planning, Machine-Learning-Team, Observability-Logging, observability, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)): Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (JArguello-WMF)
[13:38:00] Data-Engineering, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Patch-For-Review, Shared-Data-Infrastructure (Sprint 03): Create kubernetes namespace and user for the stream_enrichment PoC project - https://phabricator.wikimedia.org/T321682 (JArguello-WMF)
[13:38:50] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Patch-For-Review: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (JArguello-WMF)
[13:56:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:01:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4037 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4037%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:16:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:06] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Patch-For-Review: EventGate should support producing keyed messages for Kafka partitioning - https://phabricator.wikimedia.org/T318846 (lbowmaker)
[14:18:31] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Patch-For-Review: EventGate should support producing keyed messages for Kafka partitioning - https://phabricator.wikimedia.org/T318846 (lbowmaker) a:Ottomata
[14:24:49] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)): [Shared Event Platform] Mediawiki Stream Enrichment should consume the consolidated page-change stream. - https://phabricator.wikimedia.org/T311084 (lbowmaker)
[14:29:28] Data-Engineering-Planning, Data-Catalog, Patch-For-Review: Errors in MAE consumer after upgrade of DataHub to 0.8.43 - https://phabricator.wikimedia.org/T317053 (BTullis) I have now deployed version 0.9.0 of DataHub as per: {T321907} Now checking to see whether the ingestion issue is resolved.
[14:30:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4040 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4040%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:35:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4040 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp4040%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[14:37:06] (PS1) Btullis: Update the datahub version to 0.9.0 [analytics/refinery] - https://gerrit.wikimedia.org/r/851643 (https://phabricator.wikimedia.org/T321907)
[15:15:58] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)), Patch-For-Review: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (Ottomata) We are live in testwiki!
[15:19:09] Data-Engineering-Planning, Shared-Data-Infrastructure, Data Pipelines (Sprint 03): Create Plan for Spark 2 Deprecation - https://phabricator.wikimedia.org/T318367 (mpopov) @mforns: ✔ from Product Analytics
[16:16:20] (PS1) Ottomata: Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017)
[16:17:07] (CR) CI reject: [V: -1] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[16:23:06] (PS2) Ottomata: Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017)
[16:23:44] (CR) CI reject: [V: -1] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[16:56:07] Data-Engineering-Planning, Data-Catalog, Patch-For-Review: Errors in MAE consumer after upgrade of DataHub to 0.8.43 - https://phabricator.wikimedia.org/T317053 (BTullis) I believe that the issue is now fixed. The test case that we were using was this ingestion of the `knowledge_gaps` database in hiv...
[16:56:11] Data-Engineering-Planning, Data-Catalog, Patch-For-Review: Errors in MAE consumer after upgrade of DataHub to 0.8.43 - https://phabricator.wikimedia.org/T317053 (BTullis) Open→Resolved
[17:25:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:30:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4047 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4047%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[17:55:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4050 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:00:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4050 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4050%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:05:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:10:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4051 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4051%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:24:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:29:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp4052 is not sending enough cache_upload requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=ulsfo%20prometheus/ops&var-cp_cluster=cache_upload&var-instance=cp4052%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:31:32] (PS3) Ottomata: Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017)
[18:43:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:48:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[18:52:06] (CR) Ottomata: [C: +2] Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[18:52:39] (Merged) jenkins-bot: Add content_body to development/mediawiki/page/change schema, bump to version 1.1.0 [schemas/event/primary] - https://gerrit.wikimedia.org/r/851670 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[18:56:35] Data-Engineering, Product-Analytics: Presto returns incorrect data for an added field - https://phabricator.wikimedia.org/T321960 (Mayakp.wiki) Dan and I discovered a similar issue in T321231 where the mismatching order of fields in Parquet vs. Hive caused the query to fail for a snapshot in `wmf.mediawi...
[19:04:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:22:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:25:42] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:25:57] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:42:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[19:47:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:38:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:43:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:51:12] (VarnishkafkaNoMessages) firing: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[20:51:15] Data-Engineering-Planning, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, Notifications, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (kostajh)
[20:56:12] (VarnishkafkaNoMessages) resolved: varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-cp_cluster=cache_text&var-instance=cp2027%3A9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[21:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:24] Data-Engineering-Planning, Event-Platform Value Stream (Event-Platform Value Stream (Sprint 04)): Prototype Flink job for content Dumps - https://phabricator.wikimedia.org/T320966 (Milimetric) I was wrong to think I'd finish this by the end of the week. It's just been a series of errors with no docs to...
[22:50:16] (CR) Milimetric: [V: +2 C: +2] "Merging and I'll build and deploy the artifact to archiva" [analytics/refinery] - https://gerrit.wikimedia.org/r/851643 (https://phabricator.wikimedia.org/T321907) (owner: Btullis)
[23:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:03:51] I think we completely forgot about the weekly train
[23:06:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:12] (VarnishkafkaNoMessages) firing: (4) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:36:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:12] (VarnishkafkaNoMessages) resolved: (5) varnishkafka on cp2027 is not sending enough cache_text requests - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[23:54:55] (PS1) Neil P. Quinn-WMF: Revise Wikistories schema documentation [schemas/event/secondary] - https://gerrit.wikimedia.org/r/851735 (https://phabricator.wikimedia.org/T312262)