[00:45:23] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:59:01] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:02:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[01:04:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[01:12:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[01:27:19] PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[01:38:51] PROBLEM - Webrequests Varnishkafka log producer on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[02:22:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:22:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:22:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:32:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[03:40:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:50:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:04:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:28:51] Data-Engineering, Airflow: Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (mforns)
[05:29:34] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[05:29:38] Data-Engineering, Airflow: Migrate 1+ reportupdater jobs - https://phabricator.wikimedia.org/T307540 (mforns)
[06:27:49] (CR) Snwachukwu: Create Hql script to generate API(rest and action) metrics. (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[06:30:29] (CR) Snwachukwu: [C: +1] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[08:41:49] Data-Engineering, Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (BTullis)
[08:42:01] Data-Engineering, Data-Engineering-Kanban: Increase Java heap for HDFS namenodes - https://phabricator.wikimedia.org/T307549 (BTullis) p:Triage→Medium
[08:47:41] !log rebooting an-coord1002 to pick up new kernel
[08:47:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:04:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[09:14:53] aqu: FYI I'm attending to this HDFS heap alert. --^ I have a patch ready to apply: https://gerrit.wikimedia.org/r/c/operations/puppet/+/789098
[09:21:50] OK, thanks!
[09:30:19] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:43:35] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:16:19] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) Thanks @JMeybohm - Those docs are really useful. I will proceed to make the changes required. There's one part that I'm not clear on from the docs. (T...
[10:21:12] Data-Engineering, Equity-Landscape: Milestone: Transformation Definitions Complete: - https://phabricator.wikimedia.org/T305474 (EChetty)
[10:28:54] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) >>! In T303049#7902555, @BTullis wrote: > I understand that it's something to do with [[https://wikitech.wikimedia.org/wiki/DNS/Discovery#Read-only_an...
[11:05:31] (CR) Snwachukwu: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[11:14:36] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) >>! In T303049#7902578, @JMeybohm wrote: > I was under the impression that datahub should only run/be used in the active datacenter because it relies o...
[11:30:48] Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (cmooney) I
[11:37:44] Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (cmooney) > One small downside is about traffic flows, if I understand correctly, most clients are in the analytics vlan, so traffic will do something like: We need to bear this in mind. The current uRPF filter on the CR r...
[12:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:05:07] (CR) Joal: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[12:25:39] (CR) Mforns: [V: +2 C: +2] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:25:46] (PS6) Mforns: Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:25:53] (CR) Mforns: [V: +2] Create Hql script to generate API(rest and action) metrics. [analytics/refinery] - https://gerrit.wikimedia.org/r/785153 (https://phabricator.wikimedia.org/T300028) (owner: Snwachukwu)
[12:28:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:31:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:36:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:38:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:39:08] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) Both workers look like they have low load. No hanging lsof processes. Queries seem to be returning at a reasonable speed.
[12:39:16] Quarry, cloud-services-team (Kanban): Quarry running very slowly - https://phabricator.wikimedia.org/T307482 (rook) In progress→Resolved
[12:45:49] Data-Engineering, Airflow: Migrate unique devices jobs - https://phabricator.wikimedia.org/T305841 (mforns)
[12:46:12] Data-Engineering, Airflow: Migrate unique devices jobs - https://phabricator.wikimedia.org/T305841 (mforns)
[12:46:14] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[12:48:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[12:56:16] Data-Engineering, Airflow: Migrate edit hourly job - https://phabricator.wikimedia.org/T307569 (mforns)
[12:56:41] Data-Engineering, Airflow: Migrate edit hourly job - https://phabricator.wikimedia.org/T307569 (mforns)
[12:56:43] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:00:09] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:05:32] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (akosiaris) >>! In T303049#7902665, @BTullis wrote: >>>! In T303049#7902578, @JMeybohm wrote: >> I was under the impression that datahub should only run/be used...
[13:08:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[13:10:39] Data-Engineering, Airflow: Migrate the ProjectView jobs to Airflow - https://phabricator.wikimedia.org/T305844 (mforns)
[13:10:50] Data-Engineering, Airflow: Migrate the ProjectView jobs to Airflow - https://phabricator.wikimedia.org/T305844 (mforns)
[13:10:52] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:12:12] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:23:27] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (mforns)
[13:24:32] Data-Engineering, Airflow: Migrate the Referrer job - https://phabricator.wikimedia.org/T305842 (mforns)
[13:28:55] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (Ottomata) FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'anal...
[13:40:13] (Abandoned) Jdrewniak: Fixing typo in desktopwebuiactionstracking schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777876 (https://phabricator.wikimedia.org/T301391) (owner: Jdrewniak)
[13:54:31] ottomata, milimetric, joal, aqu: I put together a list of candidate tasks for the Airflow hackathon. Can 1 or 2 of you please have a quick look to see if it makes sense overall? :D https://docs.google.com/document/d/1DUoenMxv0HLWHrUazPjg46FJyvMUIhfWM3cCXVIZtBc/edit#
[13:56:37] Data-Engineering, Airflow: Migrate the projectview jobs - https://phabricator.wikimedia.org/T305844 (mforns)
[13:56:49] Data-Engineering, Airflow: Migrate the referrer job - https://phabricator.wikimedia.org/T305842 (mforns)
[14:12:24] Data-Engineering, Airflow: Airflow Hackathon - https://phabricator.wikimedia.org/T307500 (Aklapper)
[14:12:46] Data-Engineering, Airflow: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (Aklapper)
[14:17:13] ottomata / joal: just a heads up that this document is floating around and I don't see anything about Shared Data Platform or Event Driven anything on there yet: https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#
[14:18:58] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I had the beginnings of a theory, based on some reading around varnish, but now I don't think that it's va...
[14:24:43] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (CDanis) Hi, haven't deeply read or understood this issue (sorry!) but I wanted to point out T264021 as potentially...
[14:40:54] Data-Engineering, Data-Engineering-Kanban, MediaWiki-extensions-EventLogging: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (Ottomata) I think doing having EventLogging code do this with filtering via EventStreamConfig API using `producers` config is...
[14:55:06] milimetric: yes been meaning to do that
[14:55:08] thanks for reminder
[14:56:09] i have a lot of todos for shared data platform
[14:56:19] including doc updates based on feedback and then more meetings / discussions
[14:56:26] been back burnering it... :/
[14:59:01] we should probably have cloud-like in prod AKA infra as a service? in there too, CC btullis
[14:59:07] btullis: https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#
[14:59:55] ottomata: Yes, that is a good call. I will add something.
[15:05:14] milimetric: done
[15:05:15] https://docs.google.com/document/d/1XY291UC-PJbUWHUEG_lVKM3-lIaT_x2U1bwE4Mr8-s0/edit#heading=h.ptaikyoza8ju
[15:05:22] please add your 'endorsements' :)
[15:20:32] Data-Engineering, Data-Engineering-Kanban, SRE, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @CDanis - Yes that looks very likely. Also I think that the latency ticket {T294911} is also probab...
[15:36:34] (PS1) Vivian Rook: separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146)
[15:40:13] (CR) jerkins-bot: [V: -1] separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146) (owner: Vivian Rook)
[15:53:05] (PS2) Vivian Rook: separate stop and submit buttons [analytics/quarry/web] - https://gerrit.wikimedia.org/r/789196 (https://phabricator.wikimedia.org/T290146)
[16:03:34] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (Antoine_Quhen) a:razzi
[16:03:50] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[16:14:58] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) I believe https://gerrit.wikimedia.org/r/789196 should allow for the double click workaround to be codified. The submit button is always present, and when a query is runn...
[16:30:14] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:31:09] razzi: btullis i'm going to talk with joseph about flink stuff in bc, lemme know if you need me for SRE sync
[16:31:20] ok cool ottomata
[16:32:15] I'm going to restart the kafka brokers like I mentioned, unless anybody has objections: `sudo cookbook sre.kafka.reboot-workers jumbo-eqiad`
[16:32:32] *reboot
[16:32:33] razzi: All fine with me.
[16:33:22] hey! just a quick reminder that pontoon-1 and an-db-1 in the 'analytics' cloud vps are still failing puppet
[16:43:54] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:45:58] taavi: thanks again. ottomata - do you think that we still need these two servers? Is it worth spending the time to fix up the puppet, or shall we delete them?
[17:11:07] (CR) Snwachukwu: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[17:16:52] (PS2) Snwachukwu: Add new wikis to sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632)
[17:21:42] (CR) Joal: [V: +2 C: +2] Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[18:29:48] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:51:42] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (razzi) Here's the traceback of a 500 error I got: ` [2022-05-04 17:25:52,664] ERROR in app: Exception on /api/query/stop [POST] Traceback (most recent call last): File "/srv/...
[19:54:08] Analytics-Radar, Data-Engineering, ChangeProp, Discovery-Search, and 4 others: Better way to pause writes on elasticsearch - https://phabricator.wikimedia.org/T230730 (Gehel) Open→Declined We don't pause writes anymore during reindex, so this isn't needed.
[20:04:05] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[20:08:22] btullis: i don't need them!
[20:08:29] let's delete
[20:26:34] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) That probably means celery isn't reporting on all the ways a query has terminated, or is somehow missing some. Rather than catch the error when the stop button is pressed...
[20:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:50:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:54:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:09:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:28:25] Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (MusikAnimal) This also breaks #event_metrics....
[21:39:46] Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) I'm actually doing it for another...
[22:10:09] Data-Engineering, Data-Engineering-Kanban, Data-Persistence (Consultation), Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) It seems it's fixed now? ` root@cl...
[22:26:55] Data-Engineering, DBA, Data-Services: Make linktarget table visible on cloud wiki replicas - https://phabricator.wikimedia.org/T305064 (Ladsgroup) Open→Resolved a:Ladsgroup
[22:30:24] Data-Engineering, Data-Engineering-Kanban, DBA, Data-Services, cloud-services-team (Kanban): View 'centralauth_p.localuser' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T304733 (Ladsgroup) Open→Resolved a:Ladsgroup
[23:43:00] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[23:43:14] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans) p:Triage→Medium