[00:45:23] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:59:01] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:02:15] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[01:04:05] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[01:12:15] <jinxer-wm>	 (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[01:27:19] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[01:38:51] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[02:22:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:22:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:22:57] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:32:57] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:40:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:50:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:04:05] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:28:51] <wikibugs>	 Data-Engineering, Airflow: Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (mforns)
[05:29:34] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[05:29:38] <wikibugs>	 Data-Engineering, Airflow: Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (mforns)
[06:27:49] <wikibugs>	 (CR) Snwachukwu: Create Hql script to generate API(rest and action) metrics. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[06:30:29] <wikibugs>	 (CR) Snwachukwu: [C: +1] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[08:41:49] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (BTullis)
[08:42:01] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (BTullis) p:Triage→Medium
[08:47:41] <btullis>	 !log rebooting an-coord1002 to pick up new kernel
[08:47:43] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:04:05] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[09:14:53] <btullis>	 aqu: FYI I'm attending to this HDFS heap alert. --^ I have a patch ready to apply: https://gerrit.wikimedia.org/r/c/operations/puppet/+/789098
[09:21:50] <aqu>	 OK, thanks!
[09:30:19] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:43:35] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:16:19] <wikibugs>	 Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) Thanks @JMeybohm - Those docs are really useful. I will proceed to make the changes required.  There's one part that I'm not clear on from the docs. (T...
[10:21:12] <wikibugs>	 Data-Engineering, Equity-Landscape: Milestone: Transformation Definitions Complete: - https://phabricator.wikimedia.org/T305474 (EChetty)
[10:28:54] <wikibugs>	 Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) >>! In T303049#7902555, @BTullis wrote: > I understand that it's something to do with [[https://wikitech.wikimedia.org/wiki/DNS/Discovery#Read-only_an...
[11:05:31] <wikibugs>	 (CR) Snwachukwu: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[11:14:36] <wikibugs>	 Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) >>! In T303049#7902578, @JMeybohm wrote: > I was under the impression that datahub should only run/be used in the active datacenter because it relies o...
[11:30:48] <wikibugs>	 Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (cmooney) I
[11:37:44] <wikibugs>	 Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (cmooney) > One small downside is about traffic flows, if I understand correctly, most clients are in the analytics vlan, so traffic will do something like:  We need to bear this in mind.  The current uRPF filter on the CR r...
[12:03:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:05:07] <wikibugs>	 (CR) Joal: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[12:25:39] <wikibugs>	 (CR) Mforns: [V: +2 C: +2] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:25:46] <wikibugs>	 (PS6) Mforns: Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:25:53] <wikibugs>	 (CR) Mforns: [V: +2] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:28:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:31:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:36:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:38:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:39:08] <wikibugs>	 Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) Both workers look like they have low load. No hanging lsof processes. Queries seem to be returning at a reasonable speed.
[12:39:16] <wikibugs>	 Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) In progress→Resolved
[12:45:49] <wikibugs>	 Data-Engineering, Airflow: Migrate unique devices jobs - https://phabricator.wikimedia.org/T305841 (mforns)
[12:46:12] <wikibugs>	 Data-Engineering, Airflow: Migrate unique devices jobs - https://phabricator.wikimedia.org/T305841 (mforns)
[12:46:14] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[12:48:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:56:16] <wikibugs>	 Data-Engineering, Airflow: Migrate edit hourly job - https://phabricator.wikimedia.org/T307569 (mforns)
[12:56:41] <wikibugs>	 Data-Engineering, Airflow: Migrate edit hourly job - https://phabricator.wikimedia.org/T307569 (mforns)
[12:56:43] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:00:09] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:05:32] <wikibugs>	 Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (akosiaris) >>! In T303049#7902665, @BTullis wrote: >>>! In T303049#7902578, @JMeybohm wrote: >> I was under the impression that datahub should only run/be used...
[13:08:50] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[13:10:39] <wikibugs>	 Data-Engineering, Airflow: Migrate the ProjectView jobs to Airflow - https://phabricator.wikimedia.org/T305844 (mforns)
[13:10:50] <wikibugs>	 Data-Engineering, Airflow: Migrate the ProjectView jobs to Airflow - https://phabricator.wikimedia.org/T305844 (mforns)
[13:10:52] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:12:12] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:23:27] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:24:32] <wikibugs>	 Data-Engineering, Airflow: Migrate the Referrer job - https://phabricator.wikimedia.org/T305842 (mforns)
[13:28:55] <wikibugs>	 Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (Ottomata) FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'anal...
[13:40:13] <wikibugs>	 (Abandoned) Jdrewniak: Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[13:54:31] <mforns>	 ottomata, milimetric, joal, aqu: I put together a list of candidate tasks for the Airflow hackathon. Can 1 or 2 of you please have a quick look to see if it makes sense overall? :D https://docs.google.com/document/d/1DUoenMxv0HLWHrUazPjg46FJyvMUIhfWM3cCXVIZtBc/edit#
[13:56:37] <wikibugs>	 Data-Engineering, Airflow: Migrate the projectview jobs - https://phabricator.wikimedia.org/T305844 (mforns)
[13:56:49] <wikibugs>	 Data-Engineering, Airflow: Migrate the referrer job - https://phabricator.wikimedia.org/T305842 (mforns)
[14:12:24] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (Aklapper)
[14:12:46] <wikibugs>	 Data-Engineering, Airflow: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (Aklapper)
[14:17:13] <milimetric>	 ottomata / joal: just a heads up that this document is floating around and I don't see anything about Shared Data Platform or Event Driven anything on there yet: https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#
[14:18:58] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I had the beginnings of a theory, based on some reading around varnish, but now I don't think that it's va...
[14:24:43] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (CDanis) Hi, haven't deeply read or understood this issue (sorry!) but I wanted to point out T264021 as potentially...
[14:40:54] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, MediaWiki-extensions-EventLogging: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (Ottomata) I think doing having EventLogging code do this with filtering via EventStreamConfig API using `producers` config is...
[14:55:06] <ottomata>	 milimetric:  yes been meaning to do that
[14:55:08] <ottomata>	 thanks for reminder
[14:56:09] <ottomata>	 i have a lot of todos for shared data platform
[14:56:19] <ottomata>	 including doc updates based on feedback and then more meetings / discussions
[14:56:26] <ottomata>	 been back burnering it... :/
[14:59:01] <ottomata>	 we should probably have cloud-like in prod  AKA infra as a service? in there too, CC btullis
[14:59:07] <ottomata>	 btullis: https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#
[14:59:55] <btullis>	 ottomata: Yes, that is a good call. I will add something.
[15:05:14] <ottomata>	 milimetric:  done
[15:05:15] <ottomata>	 https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#heading=h.ptaikyoza8ju
[15:05:22] <ottomata>	 please add your 'endorsements' :)
[15:20:32] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @CDanis - Yes that looks very likely. Also I think that the latency ticket {T294911} is also probab...
[15:36:34] <wikibugs>	 (PS1) Vivian Rook: separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146)
[15:40:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146) (owner: Vivian Rook)
[15:53:05] <wikibugs>	 (PS2) Vivian Rook: separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146)
[16:03:34] <wikibugs>	 Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (Antoine_Quhen) a:razzi
[16:03:50] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[16:14:58] <wikibugs>	 Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) I believe https://gerrit.wikimedia.org/r/789196 should allow for the double click workaround to be codified. The submit button is always present, and when a query is runn...
[16:30:14] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:31:09] <ottomata>	 razzi: btullis  i'm going to talk with joseph about flink stuff in bc, lemme know if you need me for SRE sync
[16:31:20] <razzi>	 ok cool ottomata 
[16:32:15] <razzi>	 I'm going to restart the kafka brokers like I mentioned, unless anybody has objections: `sudo cookbook sre.kafka.reboot-workers jumbo-eqiad`
[16:32:32] <razzi>	 *reboot
[16:32:33] <btullis>	 razzi: All fine with me.
[16:33:22] <taavi>	 hey! just a quick reminder that pontoon-1 and an-db-1 in the 'analytics' cloud vps are still failing puppet
[16:43:54] <icinga-wm>	 PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:45:58] <btullis>	 taavi: thanks again. ottomata - do you think that we still need these two servers? Is it worth spending the time to fix up the puppet, or shall we delete them?
[17:11:07] <wikibugs>	 (CR) Snwachukwu: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[17:16:52] <wikibugs>	 (PS2) Snwachukwu: Add new wikis to sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632)
[17:21:42] <wikibugs>	 (CR) Joal: [V: +2 C: +2] Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[18:29:48] <icinga-wm>	 RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:51:42] <wikibugs>	 Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (razzi) Here's the traceback of a 500 error I got: ` [2022-05-04 17:25:52,664] ERROR in app: Exception on /api/query/stop [POST] Traceback (most recent call last):   File "/srv/...
[19:54:08] <wikibugs>	 Analytics-Radar, Data-Engineering, ChangeProp, Discovery-Search, and 4 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (Gehel) Open→Declined We don't pause writes anymore during reindex, so this isn't needed.
[20:04:05] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[20:08:22] <ottomata>	 btullis:  i don't need them!  
[20:08:29] <ottomata>	 let's delete
[20:26:34] <wikibugs>	 Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) That probably means celery isn't reporting on all the ways a query has terminated, or is somehow missing some. Rather than catch the error when the stop button is pressed...
[20:45:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:50:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:54:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:09:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:28:25] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (MusikAnimal) This also breaks #event_metrics....
[21:39:46] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) I'm actually doing it for another...
[22:10:09] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) It seems it's fixed now? ` root@cl...
[22:26:55] <wikibugs>	 Data-Engineering, DBA, Data-Services: Make linktarget table visible on cloud wiki replicas - https://phabricator.wikimedia.org/T305064 (Ladsgroup) Open→Resolved a:Ladsgroup
[22:30:24] <wikibugs>	 Data-Engineering, Data-Engineering-Kanban, DBA, Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) Open→Resolved a:Ladsgroup
[23:43:00] <wikibugs>	 Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[23:43:14] <wikibugs>	 Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans) p:Triage→Medium