[00:28:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-bscarone-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:09] <wikibugs>	 Analytics, API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (BPirkle) Thank you @JAllemandou, we may need to take you up on that offer.  For now, can you check for any major flaws in my thinking?  # I'm still pursuing the possibility of test...
[06:40:47] <wikibugs>	 (CR) Joal: "One thing again, and ask for double check of results being the same for the new CSV format and the old way :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: Snwachukwu)
[08:25:27] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I'm going to try updating the RAID controller firmware, then the BIOS on stat1010, to see if either of these fixes the drive ordering issue....
[08:27:43] <moritzm>	 FYI, I'm going to reboot the host running Turnilo in 5m
[08:28:26] <btullis>	 Ack. Many thanks moritzm. What's the reason, out of interest?
[08:28:39] <moritzm>	 for https://phabricator.wikimedia.org/T310483
[08:29:11] <btullis>	 👍
[08:56:28] <wikibugs>	 Analytics-Radar, DBA: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (BTullis)
[08:56:31] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (BTullis) Open→Resolved a:BTullis Thanks @Ladsgroup for...
[08:59:58] <wikibugs>	 Data-Engineering, Event-Platform, Generated Data Platform: [Shared Event Platform] Produce new mediawii.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (gmodena)
[09:19:26] <wikibugs>	 Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation): dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo) While searching for other things on MariaDB's JIRA, I saw...
[09:34:40] <wikibugs>	 Data-Engineering, Product-Analytics, SDAW-MediaSearch, Structured-Data-Backlog (Current Work): [M] No data from ptwikinews in event.mediawiki_mediasearch_interaction table - https://phabricator.wikimedia.org/T308815 (mfossati) Open→In progress
[11:47:35] <wikibugs>	 Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar)
[11:49:10] <btullis>	 I am about to attempt another failback of the namenode service from an-master1002 to an-master1001.
[11:50:15] <btullis>	 No gobblin jobs are running and it is a relatively quiet time on the cluster.
[11:50:27] <joal>	 Happy to help
[11:50:36] <joal>	 btullis: I'll be monitoring with you if you wish
[11:50:58] <btullis>	 Great, thanks. Yes please.
[11:51:58] <btullis>	 !log btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
[11:52:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:52:28] <btullis>	 https://www.irccloud.com/pastebin/hIPbnvrP/
[11:52:34] <btullis>	 Looks OK so far.
[11:53:47] <joal>	 btullis: successful change (from zkfc log)
[11:54:13] <btullis>	 Yep, looks that way to me too. The command returned successfully too.
[11:54:16] <btullis>	 https://www.irccloud.com/pastebin/6tz7bVRQ/
[11:55:20] <joal>	 that's great :)
[11:55:49] <joal>	 now this doesn't explain why we failed previously, but at least that success makes me happy
[11:56:26] <btullis>	 Yes. Phew. It does give weight to the hypothesis that it is related to load on the cluster at the time of the failover.
[11:56:37] <joal>	 yes indeed
[11:56:38] <btullis>	 Doesn't it?
[11:57:16] <btullis>	 Now, I will also want to do a restart of the namenode on an-master1002 at some point, so that it picks up its increased heap size, but I'm not in any hurry to do that.
[11:57:42] <joal>	 hm, I'm less afraid of restarts than failover nowadays :)
[11:58:14] <joal>	 I'd say let's give master1 some time, just to be sure, and if you wish you can restart namenode2 then
[11:58:30] <joal>	 And actually, the cluster was really quiet - almost no job :)
[11:58:32] <btullis>	 Yes, I suggest that we also look at scheduling it for a period of low activity and try to measure the impact of a namenode restart.
[11:58:44] <joal>	 works for me
[11:59:01] <btullis>	 👍 Will do.
[11:59:12] <joal>	 I'm not sure I understand the notion of impact of namenode restart when it's passive though
[11:59:18] <joal>	 but that's ok :)
[12:04:58] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:05] <wikibugs>	 Data-Engineering, Event-Platform, Generated Data Platform: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 (Ottomata)
[12:09:48] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (BTullis) We decided to attempt another failback from an-master1002 to an-master1001. This time we wanted to time it so that:  * No gobblin jobs were running because of {T311263} * It was...
[12:17:05] <wikibugs>	 Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (Ottomata) Very cool!  So, Event Platform is opinionated about some of the fields in the schema.  Some are required to automate validation and ingestion.  See [[ https://wikitech.wikimedia.org/wiki/E...
[12:18:07] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (Ottomata) NICE!  That is good news.  It will be interesting to test a failover during high cluster activity after we icebergify too!
[12:19:42] <wikibugs>	 Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar)
[12:24:26] <mforns>	 heya team! yesterday I tried to deploy refinery and it failed, I will resume now and try again.
[12:25:22] <btullis>	 Ack. Thanks mforns. Happy to help if there's any way I can.
[12:25:34] <mforns>	 ok, thanks btullis! :]
[12:26:47] <wikibugs>	 Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (hashar) Thanks for the links, I haven't looked yet how to format the payload or to send it to event gate.  With this task, I am wondering whether the json schemas are valid for the event gate infras...
[12:31:46] <wikibugs>	 Data-Engineering, Data³: Audit JSON schemas for Gerrit events - https://phabricator.wikimedia.org/T311615 (Ottomata) Hm, I am not sure if we have draft/2019-09 support in EventGate, but it would not be difficult to add if we don't. Since you will be generating the JSONSchemas yourself, I betcha there is a...
[12:34:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service,refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:37:29] <mforns>	 btullis: the refinery scap deployment finished successfully :], however I noticed that the regular scap deploy (not the `-e thin` or `-e hadoop-test`) was super quick. Usually it's the slowest one. Could it be because most of the git-fat jar files were  synced yesterday at my first try? And so today it's only done what was remaining?
[12:37:37] <icinga-wm>	 PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:38:32] <ottomata>	 mforns: it's possible i think yes, it would mean that most of the hosts succeeded and had a copy of the deploy and it was just a symlink change/roll forward?
[12:38:34] <btullis>	 mforns: That sounds very plausible to me. 
[12:39:22] <mforns>	 ok btullis and ottomata thanks! will check that the code is present in an-launcher1002 and continue the deploy
[12:42:01] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:45:48] <mforns>	 ottomata, btullis: I got an error when executing the refinery-deploy-to-hdfs script:
[12:45:55] <mforns>	 https://www.irccloud.com/pastebin/0KJIm0jg/
[12:46:06] <ottomata>	 k looking
[12:46:15] <mforns>	 seems related to git-fat process...
[12:46:53] <ottomata>	 yah git fat did not run here for sure.
[12:46:55] <ottomata>	 hmm
[12:47:05] <ottomata>	 going to try some deploys n things
[12:52:15] <wikibugs>	 (CR) Gmodena: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema (1 comment) [schemas/event/primary] - https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: Ottomata)
[12:54:13] <ottomata>	 mforns:  i did a -f redeploy on an-launcher1002 and it worked, am doing a full deploy -f now
[12:54:33] <mforns>	 ok, thank you a lot
[12:54:45] <ottomata>	 okay proceed now
[12:54:48] <ottomata>	 scap deploy looks better i think
[12:55:26] <mforns>	 ottomata: can I go ahead and execute refinery-deploy-to-hdfs in an-launcher?
[12:56:05] <icinga-wm>	 RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 109 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:57:17] <icinga-wm>	 PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:57:38] <ottomata>	 mforns yes proceed 
[12:57:42] <mforns>	 k
[13:00:23] <icinga-wm>	 RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:00:29] <mforns>	 ottomata: it's working now
[13:00:33] <ottomata>	 gr8
[13:04:45] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:12:55] <mforns>	 !log re-deployed refinery with scap and refinery-deploy-to-hdfs
[13:13:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:14:28] <mforns>	 joal: I just deployed https://gerrit.wikimedia.org/r/c/analytics/refinery/+/807922/, I believe it will be picked up immediately by the Airflow job. Is that OK, or is there something else to do for that job?
[13:45:49] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:27] <icinga-wm>	 RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:59:31] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has identified the cause of the reversed device n...
[14:05:05] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye executed with err...
[14:07:06] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I'm updating the BIOS as well from version 2.13.3 to version 2.14.2 since it was marked as urgent by Dell. {F35286599,width=80%}
[14:12:54] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye
[14:13:48] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis)
[15:31:28] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) > I'm worried that the Presto Iceberg connector might not have kerberos support?  Def the current problem.  Joal created a test iceberg table via spark i...
[15:46:29] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (RobH) >>! In T307399#8037215, @BTullis wrote: > The RAID controller firmware update did not make any difference, but thankfully @fgiunchedi has ident...
[15:53:52] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) It does look like Presto is just re-using its Hive and HDFS auth components in its Iceberg connector, so it should work. Investigating...
[15:57:42] <nuria_>	 I sent this to mforns but i think joal and milimetric are also going to like it: https://www.usenix.org/system/files/pepr22_slides_arvesen.pdf
[15:58:42] <btullis>	 nuria_: Many thanks. That looks really interesting.
[15:58:59] <joal>	 Thanks nuria_ - will read :)
[16:06:57] <joal>	 mforns: Thanks for the deploy - I'll rerun the job and make sure everything is as expected :)
[16:14:19] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Airflow: [Airflow] Proof of concept of Cassandra loading - https://phabricator.wikimedia.org/T307935 (JAllemandou)
[16:14:51] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Airflow: [Airflow] Proof of concept of Cassandra loading - https://phabricator.wikimedia.org/T307935 (JAllemandou) This has been tested successfully - we'll need simple HQL jobs to load cassandra from now on :)
[16:19:26] <wikibugs>	 Data-Engineering-Kanban, Data-Engineering-Radar, Generated Data Platform, Patch-For-Review: Flink output support for Event Platform events - https://phabricator.wikimedia.org/T310218 (Ottomata) a:Ottomata→dcausse David, assigning this to you!  Am here to help but don't have a lot of time...
[16:51:34] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DC-Ops, SRE, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (BTullis) I tried the `partman/custom/kafka-jumbo.cfg` partman recipe on this host, but it didn't seem to be applied.  When I checked the log I saw thi...
[17:27:16] <mforns>	 !log killed mediawiki-history-load bundle in Hue, and started corresponding mediawiki_history_load DAG in Airflow
[17:27:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:57:35] <mforns>	 joal: cool! I see the editors_daily_monthly DAG failed, is this related to the deploy?
[19:13:41] <ebernhardson>	 does a table exist in hive that contains the latest (wiki name, page_id, title) for wikis? I know i could try and pull it from wmf.mediawiki_history but there might be a simpler way
[19:51:54] <ottomata>	 hm, latest like up to date?  i don't think so, you'd have to combine a mw snapshot, either wmf_raw.mediawiki_page or mediawiki history with event.mediawiki_page_create?
[19:51:59] <mforns>	 ebernhardson: we sqoop snapshots of a set of mediawiki tables monthly into Hive, maybe wmf_raw.mediawiki_page helps?
[19:52:18] <mforns>	 arg you beat me :]
[19:52:38] <ebernhardson>	 doesn't have to be up to date, i'm reading may 2022 data anyways, anything that's mostly complete is good enough. The mediawiki tables sound perfect, thanks!
[19:53:43] <ottomata>	 however!  we are working on a streaming dataset that will have what you want: https://phabricator.wikimedia.org/T311129
[19:54:28] <ebernhardson>	 i suppose by latest i mostly meant i don't have to run an aggregation over the table and figure out which row is the current value
[19:54:57] <ottomata>	 wmf_raw.mediawiki_page will work, but it's monthly snapshots
[20:11:48] <wikibugs>	 Data-Engineering: Public Druid cluster leftovers from old mw history snapshots - https://phabricator.wikimedia.org/T311547 (Milimetric) I checked the [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Delete_segments | Coordinator UI ]] and didn't see `mediawiki_history_reduced_2022_02`, only `.....
[20:23:22] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Upgrade to latest PrestoDB and enable iceberg support - https://phabricator.wikimedia.org/T311525 (Ottomata) GOT IT!  @JAllemandou try it out on test cluster.
[20:27:03] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster
[20:37:37] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS buster exec...
[20:41:41] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster
[20:48:36] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS buster exec...
[20:49:26] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye
[21:08:28] <wikibugs>	 (PS1) Milimetric: Add world map echart [analytics/dashiki] - https://gerrit.wikimedia.org/r/809683
[21:36:03] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-presto1006.eqiad.wmnet with OS bullseye ex...
[22:09:40] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (Cmjohnson) @BTullis Robh figured out the workaround to get the right raid volume to boot first.  I tried on an-presto1006 and everything seeme...
[23:07:59] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-pageview on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:08:21] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:13:14] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:13:41] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:16:21] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-clickstream on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:16:29] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on clouddumps1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:18:47] <icinga-wm>	 PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on clouddumps1001 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:43:49] <wikibugs>	 Analytics, API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (BPirkle) Worked a bit more on this today.  I was able to execute a canned example druid query via the instructions [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid |...