[00:13:10] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) [00:43:37] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Liz) I'm also experience problems with two bots on en.wiki which haven't issued their regularly scheduled reports and receiving "overflow" messages when I try to look at page histories earlier today. Are these... [00:43:54] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Pppery) Probably caused by {T60674} [00:59:44] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) >>! In T309570#7968876, @Pppery wrote: > Probably caused by {T60674} Perhaps an unintended side affect? I get the error on the entire table, not just that column. [01:06:17] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) [01:16:15] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:42:01] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) seems extremely similar to T309567, is there a parent outage? [02:21:03] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Ran the maintain-views again on s1 clouddbs. It should be fixed now. [03:20:04] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Izno) [04:13:58] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Liz) Looks like everything is fixed now. Thank you for getting right on this bug report! [06:26:50] !log `elukey@an-master1001:~$ sudo systemctl reset-failed hadoop-clean-fairscheduler-event-logs.service` [06:26:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:37] 10Data-Engineering, 10Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (10mfossati) >>! In T307799#7963969, @Eevans wrote: > It's been reported elsewhere that the bulk loader for Image Suggestions is using the same underlying code to interface... 
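Several of the alerts that follow are systemd timer units on an-launcher1002 flapping between CRITICAL and OK. As a rough illustration of how such a unit is usually inspected and cleared (the same pattern as the reset-failed !log above), here is a minimal sketch using standard systemctl/journalctl tooling; the unit name produce_canary_events is taken from the alerts below, and the right remediation always depends on why the unit failed:

# Inspect the failed unit and its recent logs.
sudo systemctl status produce_canary_events
sudo journalctl -u produce_canary_events --since "2 hours ago"

# Once the underlying issue is understood, clear the failed state so the
# "Check unit status" alert can recover, then optionally re-run the unit.
sudo systemctl reset-failed produce_canary_events
sudo systemctl start produce_canary_events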
[11:55:47] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:06:15] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:13:34] (03CR) 10Snwachukwu: Update the browser_general hql to use spark hints (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [12:38:32] 10Data-Engineering, 10Airflow: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (10JArguello-WMF) 05Open→03Resolved [12:39:26] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:39:39] 10Data-Engineering: Kerberos passowrd reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (10Sgs) [12:49:27] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:21:55] ottomata: would you have a minute for me please| [13:21:56] ? [13:22:20] ottomata: I'm fighting some python and could do with help [13:24:04] Ah nevermind - I found the thing :) [13:42:00] 10Data-Engineering: Kerberos password reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (10Sgs) [13:45:24] Hi team [13:55:14] Hi milimetric [14:35:18] (03PS1) 10Milimetric: Move leftover check script to refinery [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801727 [14:35:41] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Just moving from the wiki" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801727 (owner: 10Milimetric) [14:36:43] (03CR) 10Mforns: "LGTM! But I second Sandra's comments." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [14:37:55] 10Data-Engineering: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) ` ====== stat1004 ====== total 513244 drwxr-xr-x 2 26051 wikidev 4096 Jul 20 2021 hdfs-namenode-fsimage -rw-rw-r-- 1 26051 wikidev 1245367 Jan 10 16:42 part.txt -rw-r--r-- 1 26051 wikidev... 
[15:24:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [POC] Use airflow-installed Spark3 for an Airflow job - https://phabricator.wikimedia.org/T308168 (10JArguello-WMF) 05Open→03Resolved [15:24:21] 10Analytics, 10Data-Engineering, 10Epic: Upgrade analytics-hadoop to Spark 3 + scala 2.12 - https://phabricator.wikimedia.org/T291464 (10JArguello-WMF) [15:24:33] 10Data-Engineering-Kanban, 10Airflow: Medium Complexity Oozie Migration: mobile_apps-session_metrics - https://phabricator.wikimedia.org/T302874 (10JArguello-WMF) 05Open→03Resolved [15:24:47] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Adapt maxExecutors value by Dag - https://phabricator.wikimedia.org/T307447 (10JArguello-WMF) 05Open→03Resolved [15:25:06] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Organize hackathon - https://phabricator.wikimedia.org/T295204 (10JArguello-WMF) 05Open→03Resolved [15:25:19] 10Data-Engineering-Kanban, 10Airflow: Migrate the Clickstream jobs to Airflow - https://phabricator.wikimedia.org/T305843 (10JArguello-WMF) 05Open→03Resolved [15:25:51] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low RIsk Ozzie Migration: Wikidata CoEditor metric job - https://phabricator.wikimedia.org/T306177 (10JArguello-WMF) 05Open→03Resolved [15:25:54] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Ozzie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (10JArguello-WMF) [15:26:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (10JArguello-WMF) 05Open→03Resolved [15:26:16] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:26:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Oozie Migration: interlanguage - https://phabricator.wikimedia.org/T300025 (10JArguello-WMF) 05Open→03Resolved [15:27:53] 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: browser/general - https://phabricator.wikimedia.org/T302875 (10JArguello-WMF) 05Open→03Resolved [15:28:05] 10Data-Engineering-Kanban, 10Airflow: Fix use of Java LinkedHashMap caching in Spark multi-threaded environment - https://phabricator.wikimedia.org/T305386 (10JArguello-WMF) 05Open→03Resolved [15:28:22] 10Data-Engineering-Kanban, 10Airflow: Medium Risk Oozie Migration: mediarequest - https://phabricator.wikimedia.org/T302876 (10JArguello-WMF) 05Open→03Resolved [15:28:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Create Generic Hive-To-Graphite job - https://phabricator.wikimedia.org/T304623 (10JArguello-WMF) 05Open→03Resolved [15:28:56] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:28:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Define and implement archiving for Airflow - https://phabricator.wikimedia.org/T300039 (10JArguello-WMF) 05Open→03Resolved [15:29:06] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10JArguello-WMF) 05Open→03Resolved [15:29:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Oozie Migration: session length - https://phabricator.wikimedia.org/T300029 (10JArguello-WMF) 
05Open→03Resolved [15:29:37] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:29:55] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Unifying HDFS Sensor and FSSPEC Sensor - https://phabricator.wikimedia.org/T302392 (10JArguello-WMF) 05Open→03Resolved [15:30:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Troubleshoot MySQL connection issues - https://phabricator.wikimedia.org/T298893 (10JArguello-WMF) 05Open→03Resolved [15:30:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10JArguello-WMF) [15:31:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Investigate using a HiveToGraphite connector job instead of individual jobs - https://phabricator.wikimedia.org/T303308 (10JArguello-WMF) 05Open→03Resolved [15:31:16] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (10JArguello-WMF) 05Open→03Resolved [15:31:19] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:31:30] 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10JArguello-WMF) 05Open→03Resolved [15:31:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (10JArguello-WMF) 05Open→03Resolved [15:34:38] 10Data-Engineering: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) I've reviewed everything above and it can all be safely deleted. An admin needs to do this, with cumin, [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Have_any_users_left_the_Found... [15:34:43] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) Didn't work btw, turns out that eventgate also needs a service-runner bump. PR at https://... 
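The leftovers review above ends with an admin deleting razzi's databases and HDFS data, !log'd just below. A minimal sketch of what that kind of cleanup typically looks like, assuming a /user/razzi HDFS home and a personal Hive database named razzi (both assumptions for illustration) and the kerberos-run-command wrapper that appears later in this log; the authoritative procedure is the Ops week wiki page linked above:

# See what is left under the user's HDFS home.
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -s -h /user/razzi

# Remove the HDFS home directory; without -skipTrash this only moves the data
# to the HDFS trash, matching the "(in trash)" note in the !log below.
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/razzi

# Drop a leftover personal Hive database (assumes beeline on the host is
# configured with a default connection).
sudo -u hdfs kerberos-run-command hdfs beeline -e 'DROP DATABASE IF EXISTS razzi CASCADE;'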
[15:36:12] !log dropped razzi databases and deleted HDFS directories (in trash) [15:36:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:36:33] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T299398 (10JArguello-WMF) 05Open→03Resolved [15:36:38] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:36:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10JArguello-WMF) 05Open→03Resolved [15:37:15] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (10JArguello-WMF) 05Open→03Resolved [15:37:19] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:37:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Add module to easily get dependency paths in HDFS - https://phabricator.wikimedia.org/T300795 (10JArguello-WMF) 05Open→03Resolved [15:37:52] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Airflow MVP - https://phabricator.wikimedia.org/T288263 (10JArguello-WMF) 05Open→03Resolved [15:42:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) Ah sorry about that, should have realized. Docs here: https://wikitech.wikimedia.org/wiki/... [15:43:31] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Add deletion job for old anomaly detection data - https://phabricator.wikimedia.org/T298972 (10JArguello-WMF) 05Open→03Resolved [15:43:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: [Airflow] Migrate anomaly detection DAG to airflow-dags repo. 
- https://phabricator.wikimedia.org/T295201 (10JArguello-WMF) 05Open→03Resolved [15:44:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Create a tool for easily spinning up a test Airflow instance - https://phabricator.wikimedia.org/T295202 (10JArguello-WMF) 05Open→03Resolved [15:44:21] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10JArguello-WMF) 05Open→03Resolved [15:44:24] 10Data-Engineering, 10Airflow, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10JArguello-WMF) [15:44:38] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Write a job entirely in Airflow with spark and/or sparkSQL - https://phabricator.wikimedia.org/T285692 (10JArguello-WMF) 05Open→03Resolved [15:44:40] 10Analytics, 10Analytics-Kanban: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (10JArguello-WMF) [15:45:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: [Airflow] Set up scap deployment - https://phabricator.wikimedia.org/T295380 (10JArguello-WMF) 05Open→03Resolved [15:45:14] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Automate sync'ing archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10JArguello-WMF) 05Open→03Resolved [15:45:16] 10Data-Engineering, 10Airflow, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10JArguello-WMF) [15:45:25] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) Hm, I think we stopped using the github commit sha to install, and instead rely on NPM lik... [16:01:47] (03PS2) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [16:03:44] 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) a:03Milimetric [16:05:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Assess existing and in-development storage platforms for suitability - https://phabricator.wikimedia.org/T309509 (10BTullis) I have written a brief document assessing the WMCS and MOSS storage clusters as to their suitability for this project: [[https://... [16:20:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Assess existing and in-development storage platforms for suitability - https://phabricator.wikimedia.org/T309509 (10BTullis) Moving to in-review whilst seeking input and feedback from stakeholders. [16:49:29] (03CR) 10Joal: Update / fix HQL jobs (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [16:49:57] (03PS3) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [16:55:06] mforns: Heya - would you have aminute? [17:07:18] 10Data-Engineering, 10Equity-Landscape: Population input metrics - https://phabricator.wikimedia.org/T309279 (10ntsako) **Overall Engagement (Percentile)** and **Total Population Presence*Growth Percentile** depend on Affilate and Overall Engagement being completed The other columns have been calculated. 
[17:09:10] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [17:09:16] (03PS4) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [17:18:42] a-team is namenode down?? cc btullis, also looking [17:18:55] yeah we're seeing issues as well [17:19:26] ottomata: jobs seem to run [17:19:56] hdfs dfs -ls fails on stat1004 for me [17:20:45] hd full on an-master1001 [17:20:56] i think maybe 1002 is trying to come online [17:21:01] but the journal is maybe weird?! [17:22:17] hmm i don't see any full disks [17:22:20] just in namenode logs [17:22:43] 2022-05-31 17:05:24,447 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.5.27:8485, 10.64.5.29:8485, 10.64.36.113:8485, 10.64.53.29:8485, 10.64.21.116:8485], stream=QuorumOutputStream starting at txid 8233997690)) [17:22:43] org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 2 successful responses: [17:22:44] Oh sorry, just spotted this now. [17:22:46] hm could be on one of the journal nodes [17:23:39] an-worker1080 [17:23:42] /dev/mapper/an--worker1080--vg-journalnode 9.8G 9.8G 0 100% /var/lib/hadoop/journal [17:23:43] ottomata: could be journal nodes, or GC times could be too long, generating issues (we saw that) [17:23:47] https://usercontent.irccloud-cdn.com/file/ar0UWPJ4/image.png [17:23:52] samesame for 1078 [17:23:59] eesh [17:24:01] ok - full journalnodes it is [17:24:05] all full [17:24:06] um [17:24:18] maybe we should try to enter safe mode asap? [17:24:29] Agreed ottomata [17:24:37] Do you want to jump on a call together?
[17:24:44] yes [17:24:46] yes bc [17:25:16] Oh, it was not just me --- good to know [17:28:02] I'm getting this error, but I think this is what you are already talking about https://pastebin.com/Bjy9fGM0 [17:28:04] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:40] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:30:45] !log stopped the hdfs-namenode service on an-master100[1-2] [17:30:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:33:12] !log stop journalnodes and datanodes on 5 hadoop journalnode hosts [17:33:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:22] !log increasing each of the hadoop journalnodes by 20 GB [17:38:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:30] PROBLEM - Hadoop JournalNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:34] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:40] !log sudo lvresize -L+20G analytics1069-vg/journalnode [17:38:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:46] PROBLEM - Hadoop DataNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop DataNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop JournalNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:52] PROBLEM - Hadoop JournalNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:56] PROBLEM - Hadoop DataNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:12] PROBLEM - Hadoop JournalNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process 
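The recovery !log'd above and continued just below grows each of the five journal volumes and restarts the daemons. A minimal per-host sketch of that sequence, assuming the LVM naming visible in the log (e.g. analytics1069-vg/journalnode mounted at /var/lib/hadoop/journal), an ext4 filesystem (implied by the use of resize2fs), and service names following the hadoop-hdfs-namenode pattern seen later in the log:

# Confirm the journal volume is the one that is full (this is what showed 100% above).
df -h /var/lib/hadoop/journal

# Stop the Hadoop daemons on this host before growing the volume,
# matching the "stop journalnodes and datanodes" !log above.
sudo service hadoop-hdfs-journalnode stop
sudo service hadoop-hdfs-datanode stop

# Grow the logical volume by 20 GB, then grow the filesystem to match.
sudo lvresize -L +20G analytics1069-vg/journalnode
sudo resize2fs /dev/analytics1069-vg/journalnode

# Bring the daemons back once there is free space again.
sudo service hadoop-hdfs-journalnode start
sudo service hadoop-hdfs-datanode start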
[17:39:32] PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:36] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:39:58] journalnodes: [17:40:01] an-worker1080.eqiad.wmnet [17:40:01] an-worker1078.eqiad.wmnet [17:40:01] analytics1072.eqiad.wmnet [17:40:01] an-worker1090.eqiad.wmnet [17:40:01] analytics1069.eqiad.wmnet [17:40:04] btullis: ^ [17:40:10] PROBLEM - Hadoop JournalNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:40:34] PROBLEM - Check unit status of analytics-reportupdater-logs-rsync on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-reportupdater-logs-rsync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:41:39] !log resizing each journalnode with resize2fs [17:41:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:41:56] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:42:16] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:43:49] !log restarting journalnode service on each of the five hadoop workers with journals. [17:43:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:36] !log restarting the datanodes on all five of the affected hadoop workers. 
[17:44:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:44] RECOVERY - Hadoop JournalNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:44:44] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:45:22] RECOVERY - Hadoop JournalNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:26] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:38] RECOVERY - Hadoop DataNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop JournalNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:44] RECOVERY - Hadoop JournalNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:48] RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:02] RECOVERY - Hadoop JournalNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:46:22] RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:43] !log starting namenode services on am-master1001 [17:46:58] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:47:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:47:08] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit 
analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:48:48] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:53:30] PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:53:34] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:58:00] PROBLEM - Check unit status of analytics-dumps-fetch-pageview on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:00:02] PROBLEM - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:00:42] ottomata: https://community.cloudera.com/t5/Support-Questions/Multiple-edits-inprogress-files-on-one-of-the-journal-nodes/td-p/240700 [18:04:38] PROBLEM - Check unit status of analytics-dumps-fetch-pageview on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:07:05] 2022-05-27 17:50:43,101 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because there have been 31451627 txns since the last checkpoint, which exceeds the configured threshold 1000000 [18:07:30] 2022-05-27 17:50:43,104 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ... [18:07:30] 2022-05-27 17:50:43,104 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint [18:07:30] java.io.IOException: No image directories available! 
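The steps that follow put HDFS into safe mode, bring the standby namenode on an-master1002 back up, and later leave safe mode once the cluster looks healthy. A minimal sketch of the commands involved, mirroring the !log entries below; the haadmin service ids (an-master1001-eqiad-wmnet / an-master1002-eqiad-wmnet) are assumed from the alert text and may differ from the actual configuration:

# Check and toggle HDFS safe mode, run as the hdfs user as in the !log entries.
sudo -u hdfs hdfs dfsadmin -safemode get
sudo -u hdfs hdfs dfsadmin -safemode enter
# ... later, once both namenodes and the journal quorum are healthy again:
sudo -u hdfs hdfs dfsadmin -safemode leave

# Check which namenode is active and which is standby.
sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Start the namenode process on the standby, as !log'd below for an-master1002.
sudo service hadoop-hdfs-namenode start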
[18:07:52] RECOVERY - At least one Hadoop HDFS NameNode is active on an-master1001 is OK: Hadoop Active NameNode OKAY: an-master1001-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [18:08:48] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:09:04] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:10:47] !log sudo -u hdfs hdfs dfsadmin -safemode enter [18:10:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:12:48] !log sudo service hadoop-hdfs-namenode start on an-master1002 [18:12:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:14:38] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:15:40] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:15:56] 2022-05-23 00:07:11,192 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint [18:15:56] java.io.IOException: Failed to save in any storage directories while saving namespace. 
[18:15:56] at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1215) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1135) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:477) [18:15:58] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:380) [18:15:58] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:400) [18:15:59] at java.security.AccessController.doPrivileged(Native Method) [18:15:59] at javax.security.auth.Subject.doAs(Subject.java:360) [18:16:00] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1906) [18:16:00] at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) [18:16:01] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:396) [18:17:32] 2022-05-23 00:07:06,918 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on default port 8040, call Call#4759050 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.monitorHealth from 10.64.21.110:37987 [18:17:32] org.apache.hadoop.ha.HealthCheckFailedException: The NameNode has no resources available [18:17:32] at org.apache.hadoop.hdfs.server.namenode.NameNode.monitorHealth(NameNode.java:1807) [18:17:32] at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.monitorHealth(NameNodeRpcServer.java:1697) [18:17:32] at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.monitorHealth(HAServiceProtocolServerSideTranslatorPB.java:80) [18:17:33] at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5407) [18:17:33] at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507) [18:17:34] at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034) [18:17:34] at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003) [18:17:35] at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931) [18:17:35] at java.security.AccessController.doPrivileged(Native Method) [18:17:36] at javax.security.auth.Subject.doAs(Subject.java:422) [18:17:36] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926) [18:17:37] at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854) [18:18:30] PROBLEM - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:18:30] 2022-05-23 00:07:10,335 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to save image for /srv/hadoop/name [18:18:30] java.io.IOException: No space left on device [18:18:30] at java.io.FileOutputStream.writeBytes(Native Method) [18:18:41] 2022-05-23 00:07:10,923 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '/dev/mapper/an--master1002--vg-srv' is 0, which is below the configured reserved amount 104857600 [18:18:41] 2022-05-23 00:07:10,923 INFO 
org.apache.hadoop.ipc.Server: IPC Server handler 24 on default port 8040, call Call#4759056 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.monitorHealth from 10.64.21.110:37987 [18:19:34] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:20:48] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:23:56] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:15] How goes it? [18:25:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:26:14] second namenode is busy coming back. joseph and i are narrowing in on a cause. [18:26:47] we think the secondary namenode disk filled up a week and a bit ago because [18:26:47] image snapshots are getting too large. and we save too many [18:28:31] Cool. We didn't get an alert? [18:28:58] i guess not! [18:29:51] root@an-master1002:/srv/backup/hadoop/namenode# du -sh . [18:29:52] 93G . [18:30:05] it freed up space after it failed on may 23. [18:30:16] it stopped saving images, but it didn't stop clearing them.
[18:31:04] i'm pretty sure these are copied off via bacula too [18:31:07] so we don't need to save 20 days worth [18:34:09] 👍 [18:34:24] RECOVERY - Check unit status of analytics-dumps-fetch-unique_devices on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:36:03] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [18:38:35] joal /srv/hadoop/name/current/fsimage.ckpt_0000000008234624682 [18:41:08] RECOVERY - Check unit status of analytics-dumps-fetch-unique_devices on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:41:16] RECOVERY - Check unit status of analytics-dumps-fetch-mediacounts on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:48:44] !log sudo -u hdfs hdfs dfsadmin -safemode leave on an-master1001 [18:48:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:48:46] RECOVERY - Check unit status of analytics-dumps-fetch-mediacounts on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:50:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:53:26] RECOVERY - Check unit status of analytics-dumps-fetch-pageview on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:59:00] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:00:42] RECOVERY - Check unit status of analytics-dumps-fetch-pageview on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:00:44] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:02:16] 10Data-Engineering, 10Data-Engineering-Kanban: Drop UploadWizard* data - https://phabricator.wikimedia.org/T305556 (10Milimetric) a:03Milimetric [19:04:36] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:04:47] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:04:52] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit 
eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:05:06] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:06:56] RECOVERY - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:09:02] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:10:15] joal: mforns, btullis: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure [19:10:28] please read and let me know if i summarized correctly [19:10:38] Follow ups heer [19:10:40] Follow ups here [19:10:41] https://phabricator.wikimedia.org/T309649 [19:10:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:11:25] BTW, ^ is the gobblin metrics based alerts! woohoo! [19:11:51] ottomata: thanks. Looks great at first glance. I will be home in 30 minutes in case that helps. [19:11:58] i think we should be in the clear for now [19:12:09] we have some action items to keep it from happening again [19:12:14] but nothing that is unbreak now [19:12:48] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:14:48] RECOVERY - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:15:01] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:15:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:20:17] 10Data-Engineering, 10Data-Engineering-Kanban: Drop GettingStarted* data - https://phabricator.wikimedia.org/T307774 (10Milimetric) a:03Snwachukwu [19:20:51] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:25:48] (GobblinLastSuccessfulRunTooLongAgo) resolved: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:26:48] Can we add to that tag description - check alerts for /srv on an-master1002 and check that backups of /srv/backups/namenode exist ? 
I should add them in myself [19:30:20] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:32:28] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:32:56] btullis: ya add in please! according to puppet bacula is backing up /srv/backups/namenode [19:34:56] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:40:22] RECOVERY - Check unit status of analytics-reportupdater-logs-rsync on an-launcher1002 is OK: OK: Status of the systemd unit analytics-reportupdater-logs-rsync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:45:06] RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:24:20] 10Data-Engineering, 10Data-Engineering-Kanban: Drop UploadWizard* data - https://phabricator.wikimedia.org/T305556 (10Milimetric) ` drop table event_sanitized.gettingstartedredirectimpression; drop table event.gettingstartedredirectimpression; sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /... [20:24:58] 10Data-Engineering, 10Data-Engineering-Kanban: Drop GettingStarted* data - https://phabricator.wikimedia.org/T307774 (10Milimetric) ` drop table event_sanitized.uploadwizarderrorflowevent; drop table event_sanitized.uploadwizardexceptionflowevent; drop table event_sanitized.uploadwizardflowevent; drop table ev...
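Looping back to the follow-ups filed on T309649 earlier in the log: the root cause was the standby's fsimage backup directory growing to 93G and filling /srv on an-master1002, even though bacula already copies those backups off-host. A purely illustrative sketch of checking and trimming that local retention (paths taken from the log; the 7-day cutoff is an example, not the fix actually adopted in the task):

# How much space the local fsimage backups take, and which are oldest.
du -sh /srv/backup/hadoop/namenode
ls -lt /srv/backup/hadoop/namenode | tail

# List backup files older than 7 days; only uncomment the -delete once reviewed.
find /srv/backup/hadoop/namenode -type f -mtime +7 -print
# find /srv/backup/hadoop/namenode -type f -mtime +7 -delete

# Keep an eye on the partition that filled up.
df -h /srv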