[00:20:49] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:41] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:03] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:41:54] Data-Engineering-Kanban, Data Engineering Planning (Sprint 01), Patch-For-Review: Build and install spark3 assembly - https://phabricator.wikimedia.org/T310578 (Antoine_Quhen) https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/98
[08:46:47] (PS1) Joal: [WIP] Update refine to use Iceberg for event_sanitize [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739)
[10:56:59] Heads-up I've pushed out updated hadoop packages to the test cluster and I'm about to carry out a rolling restart of the test cluster. T311807
[10:57:44] joal: Are you doing anything icebergy on the test cluster at the moment?
[10:58:05] nope btullis - you're free to go :)
[11:00:24] Thanks. I'll start with a rolling restart of the masters, then the workers, then presto, druid, yarn-ui, then hive.
[11:01:13] !log sudo cookbook sre.hadoop.roll-restart-masters test
[11:01:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:30:43] btullis: nice work, I am super glad that you were able to rebuild all packages with upstream
[12:31:25] btullis: we have some alerts about the test cluster - I assume it's normal and expected
[12:32:14] one nit - we are moving from 2.10.1 to 2.10.2, and packages like hadoop-hdfs-datanode are updated
[12:32:27] I am not sure if we need to also do an hdfs upgrade or not
[12:32:33] (even if super small)
[12:35:30] the changelog is not small https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-common/release/2.10.2/CHANGELOG.2.10.2.html
[12:38:16] Hmm, thanks both. I'll look into those alarms and ack them.
[12:39:01] I've only run the roll-restart-masters cookbook on test so far, so I wasn't expecting any.
[12:40:40] I would have thought that for consistency we would be updating *all* packages built from the same source package, so including the hdfs packages.
[12:45:31] The refine failures in the test cluster are showing an error like this:
[12:45:33] `Jul 02 01:15:04 an-test-coord1001 monitor_refine_event_test[28247]: Exception in thread "main" java.lang.NoClassDefFoundError: org/wikimedia/analytics/refinery/core/LogHelper`
[12:46:29] I'm a bit confused by this. Stepping afk for a while now, but back later to investigate further.
[12:50:22] Analytics, Data-Engineering, Event-Platform, EventStreams: EventStreams (via KafkaSSE) does not consume from newly added partitions in topic - https://phabricator.wikimedia.org/T173006 (Ottomata) Declined→Open @JArguello-WMF this is indeed a software bug that will affect users if/when par...
[12:51:01] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, wmfdata-python: conda-create-stacked breaks wmfdata.presto - https://phabricator.wikimedia.org/T301734 (Ottomata) {T302819}
[13:02:47] (PS1) Hashar: Migrate package-lock.json to version 2 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811301
[13:02:49] (PS1) Hashar: Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615)
[13:03:22] (CR) CI reject: [V: -1] Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[13:04:26] (CR) Hashar: "package-lock.json has a new format which I think got introduced with npm 7 (the version that should be everywhere on CI). Without this ch" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811301 (owner: Hashar)
[13:05:26] (CR) Ottomata: Spark job to create desktop page ids viewed and searches performed in each session. (1 comment) [analytics/refinery/source] (nav-vectors) - https://gerrit.wikimedia.org/r/383761 (https://phabricator.wikimedia.org/T174796) (owner: Shilad Sen)
[13:10:03] (CR) Hashar: "The canonical source of Gerrit events are the Java classes representing them ( https://cs.opensource.google/gerrit/gerrit/gerrit/+/master" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[13:18:37] (CR) Michael Große: "This change is ready for review." [analytics/refinery] - https://gerrit.wikimedia.org/r/811306 (owner: Michael Große)
[13:38:00] (CR) Joal: "The code is 2 versions spark old - this should be abandonned." [analytics/refinery/source] (nav-vectors) - https://gerrit.wikimedia.org/r/383761 (https://phabricator.wikimedia.org/T174796) (owner: Shilad Sen)
[13:38:35] (CR) Joal: [V: +2 C: +2] "Good catch! Merging" [analytics/refinery] - https://gerrit.wikimedia.org/r/811306 (owner: Michael Große)
[13:43:28] Data-Engineering, Data-Persistence (Consultation): Move Mediawiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 (Milimetric) > Very likely I'm misunderstanding how analytics infra works and sorry for that. My worry is in the input. i.e. the sqoop from mw to hadoop happens m...
[14:03:55] joal: I'm looking at the airflow-test alerts. Did we change something there today?
[14:04:05] I'm confused!
[14:09:58] (CR) Ottomata: "Hm, perhaps, as part of your schema generation automation, just convert the .json to a current.yaml file (you can read the json and serial" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[14:11:45] (CR) Ottomata: [C: +2] Migrate package-lock.json to version 2 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811301 (owner: Hashar)
[14:12:23] (Merged) jenkins-bot: Migrate package-lock.json to version 2 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811301 (owner: Hashar)
[14:14:50] (Abandoned) Ottomata: Spark job to create desktop page ids viewed and searches performed in each session. [analytics/refinery/source] (nav-vectors) - https://gerrit.wikimedia.org/r/383761 (https://phabricator.wikimedia.org/T174796) (owner: Shilad Sen)
[14:15:52] joal: I think regular failure alerts in Airflow-prod work well. We received recently a pair of them, when some Operators failed. But we still have 2 issues: SLAs and weird sensor errors (happening before Airflow even realizes). I tried yesterday to force SLA emails from dev instance, but without success. Maybe we can pair today please? :-)
[14:28:52] Data-Engineering-Kanban, Data Engineering Planning (Sprint 01), Patch-For-Review: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) Just sent this email to analytics-announce: > Hello! > > We will be upgrading Presto to version 0.273.3 on W...
[14:32:04] (CR) Hashar: "Thanks Andrew! Unfortunately there are circular dependencies which prevent the Java json schema generator I am using from generating schem" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[14:32:23] (PS2) Hashar: Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615)
[14:32:58] (CR) CI reject: [V: -1] Schemas for Gerrit [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[14:33:46] (CR) Hashar: Schemas for Gerrit (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[14:53:38] Analytics, Data-Engineering, Event-Platform, EventStreams: EventStreams (via KafkaSSE) does not consume from newly added partitions in topic - https://phabricator.wikimedia.org/T173006 (JArguello-WMF) @Ottomata More than ok! Thank you for reopening the task.
[14:54:46] (CR) Ottomata: Schemas for Gerrit (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811302 (https://phabricator.wikimedia.org/T311615) (owner: Hashar)
[15:08:42] I'm not 100% sure, now that I look closely at it, that the errors we're seeing on the test cluster are related to the hadoop pack upgrade. Some of the times of the failures are well before I rolled out the package upgrade.
[15:09:48] https://usercontent.irccloud-cdn.com/file/CG0nIWJw/image.png
[15:10:11] Could they be related to some other changes on the test cluster?
[15:13:44] joal: ottomata: If you have any ideas, that would be great, thanks. --^
[15:27:00] (PS1) Aqu: Fix done file path in HDFSArchiver [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811325 (https://phabricator.wikimedia.org/T310542)
[15:31:58] (CR) Aqu: "Joal do you think it would be possible to mock `apply`, in order to add a unit test here?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811325 (https://phabricator.wikimedia.org/T310542) (owner: Aqu)
[15:42:05] Data-Engineering, Data-Engineering-Kanban, Airflow: Spark3 migration - Currently existing airflow jobs - https://phabricator.wikimedia.org/T306955 (JArguello-WMF) Open→Resolved
[15:42:07] Data-Engineering, Epic: Upgrade analytics-hadoop to Spark 3 + scala 2.12 - https://phabricator.wikimedia.org/T291464 (JArguello-WMF)
[15:42:18] Data-Engineering-Kanban, Airflow: [Airflow] Migrate Oozie's mediawiki_history_load jobs to Airflow - https://phabricator.wikimedia.org/T309718 (JArguello-WMF) Open→Resolved
[15:42:28] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Mediawiki History delayed 2022-05 - https://phabricator.wikimedia.org/T309987 (JArguello-WMF) Open→Resolved
[15:42:37] Data-Engineering-Kanban, Data-Catalog: Custom Metadata ingestion - https://phabricator.wikimedia.org/T307714 (JArguello-WMF) Open→Resolved
[15:42:55] Data-Engineering, Data-Engineering-Kanban, SRE: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (JArguello-WMF) Open→Resolved
[15:43:03] Data-Engineering, Data-Engineering-Kanban, Epic: Operational Excellence - Q2 21/22 - https://phabricator.wikimedia.org/T288250 (JArguello-WMF)
[15:43:05] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Push Gobblin import metrics to Prometheus and add alerts on some critical imports - https://phabricator.wikimedia.org/T286503 (JArguello-WMF) Open→Resolved
[15:43:08] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic, Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (JArguello-WMF)
[15:49:38] Data-Engineering, Data-Engineering-Kanban, Airflow: [Airflow] Proof of concept of Cassandra loading - https://phabricator.wikimedia.org/T307935 (JArguello-WMF) Open→Resolved
[15:49:40] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (JArguello-WMF)
[15:49:57] Data-Engineering, Data-Engineering-Kanban, Discovery, Generated Data Platform: Agree on and adopt WMF scalastyle conventions - https://phabricator.wikimedia.org/T310143 (JArguello-WMF) Open→Resolved
[15:50:06] Data-Engineering-Kanban: The effect of sqooping large tables on mediawiki history - https://phabricator.wikimedia.org/T309806 (JArguello-WMF) Open→Resolved
[15:50:13] Data-Engineering, Data-Engineering-Kanban: Update webrequest error thresholds - https://phabricator.wikimedia.org/T310576 (JArguello-WMF) Open→Resolved
[15:50:31] Data-Engineering-Kanban, Trash, Upstream: --- DISCUSSED BELOW --- - https://phabricator.wikimedia.org/T114124 (JArguello-WMF) Stalled→Resolved
[15:51:59] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Switch off skipTrash for some data purging - https://phabricator.wikimedia.org/T270431 (JArguello-WMF) Open→Resolved
[15:52:07] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Fix api_daily job - https://phabricator.wikimedia.org/T308767 (JArguello-WMF) Open→Resolved
[15:52:18] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (JArguello-WMF) Open→Resolved
[15:52:21] Data-Engineering, Data-Engineering-Kanban: Crash of artifact-cache in scap deploy context - https://phabricator.wikimedia.org/T305868 (JArguello-WMF) Open→Resolved
[15:52:29] Data-Engineering, Data-Engineering-Kanban, Event-Platform, Generated Data Platform: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (JArguello-WMF) Open→Resolved
[15:52:34] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate datahub schema versioning support - https://phabricator.wikimedia.org/T307716 (JArguello-WMF) Open→Resolved
[15:52:36] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Upgrade Datahub - https://phabricator.wikimedia.org/T308052 (JArguello-WMF)
[15:52:39] Data-Engineering-Kanban, Data-Catalog: Spike: Evaluate interaction of manual description edits and automatic description reimport - https://phabricator.wikimedia.org/T307717 (JArguello-WMF) Open→Resolved
[15:52:45] Data-Engineering, Data-Engineering-Kanban, Airflow: Fix airflow interlanguage job - https://phabricator.wikimedia.org/T308766 (JArguello-WMF) Open→Resolved
[15:52:54] Data-Engineering, Data-Engineering-Kanban, Event-Platform, EventStreams: Expose mediawiki/revision/tags-change in stream.wikimedia.org - https://phabricator.wikimedia.org/T294391 (JArguello-WMF) Open→Resolved
[15:53:01] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Oozie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (JArguello-WMF) Open→Resolved
[15:53:04] Data-Engineering, Airflow, Epic, Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (JArguello-WMF)
[15:54:13] Data-Engineering, MediaWiki-extensions-EventLogging, Patch-For-Review: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (JArguello-WMF)
[15:55:18] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (BTullis) > The concern with waiting (I forgot to mention above) is that it prevents us from decommissioning the old AQS nodes. I'd also come down on the side of moving forward to spark3, despite the ext...
[15:57:25] Data-Engineering, Event-Platform, EventStreams: EventStreams (via KafkaSSE) does not consume from newly added partitions in topic - https://phabricator.wikimedia.org/T173006 (JArguello-WMF)
[16:15:49] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (JAllemandou) We have a duplicate of this task currently used to track the migration. @BTullis do you mind if I merge this one onto the other one?
[16:18:13] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (BTullis) >>! In T306962#8052482, @JAllemandou wrote: > We have a duplicate of this task currently used to track the migration. @BTullis do you mind if I merge this one onto the other one? I wouldn't mi...
[16:19:21] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (JAllemandou)
[16:31:56] Data-Engineering: Request for SQL Templating to be enabled in Superset - https://phabricator.wikimedia.org/T312134 (mpopov)
[16:33:22] Data-Engineering: Request for SQL Templating to be enabled in Superset - https://phabricator.wikimedia.org/T312134 (mpopov) Suppose we have a dashboard with a chart powered by Presto. For simplicity, imagine transformations have been taken care of. The underlying query is then: ` SELECT * FROM tbl ` Becaus...
[16:33:37] Data-Engineering, Product-Analytics: Request for SQL Templating to be enabled in Superset - https://phabricator.wikimedia.org/T312134 (mpopov)
[17:42:06] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (MoritzMuehlenhoff) >>! In T306962#8052401, @BTullis wrote: > Having spoken to some other members of SRE they're keen to decommission aqs100* because they're running stretch, but at the same time as long...
[17:47:40] Analytics-Wikistats, Data-Engineering: Add proper trend numbers to wikistats metrics - https://phabricator.wikimedia.org/T251813 (JArguello-WMF)
[17:47:44] Analytics-Wikistats, Data-Engineering, I18n, RTL: Support right-to-left languages in Wikistats - https://phabricator.wikimedia.org/T251376 (JArguello-WMF)
[17:47:51] Analytics-Wikistats, Data-Engineering: Javascript-less Wikistats - https://phabricator.wikimedia.org/T251979 (JArguello-WMF)
[17:52:26] Analytics-Wikistats, Data-Engineering: Country pageview breakdown by language - https://phabricator.wikimedia.org/T250001 (JArguello-WMF)
[17:53:02] Analytics-Wikistats, Data-Engineering: Trends for editor types, and new editors in particular (in Wikistats 2.0) - https://phabricator.wikimedia.org/T186791 (JArguello-WMF)
[17:53:49] Analytics, Analytics-Wikistats, Data-Engineering: Contextualize wikistats metrics - https://phabricator.wikimedia.org/T187212 (JArguello-WMF) Open→Declined
[17:54:56] Analytics-Wikistats, Data-Engineering: Use version of Lato that renders non-roman alphabets - https://phabricator.wikimedia.org/T246777 (JArguello-WMF)
[17:55:15] Analytics-Wikistats, Data-Engineering: Wiki selector: "All wikis" is not translated - https://phabricator.wikimedia.org/T246911 (JArguello-WMF)
[17:55:29] Analytics-Wikistats, Data-Engineering: Metric widget won't stretch when content overflows - https://phabricator.wikimedia.org/T246913 (JArguello-WMF)
[17:59:31] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (JAllemandou) Thank you for the head up @MoritzMuehlenhoff. The migration should be done in a matter of weeks so I'm confident we'll be done by September.
[18:00:30] Analytics-Wikistats, Data-Engineering: 'All" time range does not transfer well across metrics - https://phabricator.wikimedia.org/T227038 (JArguello-WMF)
[18:00:51] Analytics-Wikistats, Data-Engineering: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (JArguello-WMF)
[18:01:39] Analytics-Wikistats, Data-Engineering, Patch-For-Review: Create report for "articles with most contributors" in Wikistats2 - https://phabricator.wikimedia.org/T204965 (JArguello-WMF)
[18:34:05] Data-Engineering: Review iceberg settings and document choices - https://phabricator.wikimedia.org/T312151 (JAllemandou)
[18:50:36] (PS1) Urbanecm: Add analytics/mediawiki/editgrowthconfig [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811357 (https://phabricator.wikimedia.org/T312148)
[18:51:05] (CR) CI reject: [V: -1] Add analytics/mediawiki/editgrowthconfig [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811357 (https://phabricator.wikimedia.org/T312148) (owner: Urbanecm)
[18:51:27] * urbanecm misses the old jerkins message
[18:51:55] (PS2) Urbanecm: Add analytics/mediawiki/editgrowthconfig [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811357 (https://phabricator.wikimedia.org/T312148)
[18:52:34] (CR) CI reject: [V: -1] Add analytics/mediawiki/editgrowthconfig [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811357 (https://phabricator.wikimedia.org/T312148) (owner: Urbanecm)
[18:54:51] (PS3) Urbanecm: Add analytics/mediawiki/editgrowthconfig [schemas/event/secondary] - https://gerrit.wikimedia.org/r/811357 (https://phabricator.wikimedia.org/T312148)
[19:44:19] Data-Engineering: Create LVS endpoint for druid-public-overlord (for oozie job indexing) - https://phabricator.wikimedia.org/T180971 (Aklapper)
[20:07:00] (CR) Ottomata: [WIP] Update refine to use Iceberg for event_sanitize (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/811212 (https://phabricator.wikimedia.org/T311739) (owner: Joal)
[21:37:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:13] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:53:43] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers