[00:13:10] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) [00:43:37] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Liz) I'm also experience problems with two bots on en.wiki which haven't issued their regularly scheduled reports and receiving "overflow" messages when I try to look at page histories earlier today. Are these... [00:43:54] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Pppery) Probably caused by {T60674} [00:59:44] 10Quarry: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) >>! In T309570#7968876, @Pppery wrote: > Probably caused by {T60674} Perhaps an unintended side affect? I get the error on the entire table, not just that column. [01:06:17] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) [01:16:15] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:42:01] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Xaosflux) seems extremely similar to T309567, is there a parent outage? [02:21:03] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Ran the maintain-views again on s1 clouddbs. It should be fixed now. [03:20:04] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Izno) [04:13:58] 10Quarry, 10Wikimedia-production-error: quarry is unable to access enwiki_p.page table - https://phabricator.wikimedia.org/T309570 (10Liz) Looks like everything is fixed now. Thank you for getting right on this bug report! [06:26:50] !log `elukey@an-master1001:~$ sudo systemctl reset-failed hadoop-clean-fairscheduler-event-logs.service` [06:26:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:05:37] 10Data-Engineering, 10Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (10mfossati) >>! In T307799#7963969, @Eevans wrote: > It's been reported elsewhere that the bulk loader for Image Suggestions is using the same underlying code to interface... 
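Several of the alerts that follow are systemd timer units on an-launcher1002 flapping between CRITICAL and OK. As a rough illustration of how such a unit is usually inspected and cleared (the same pattern as the reset-failed !log above), here is a minimal sketch using standard systemctl/journalctl tooling; the unit name produce_canary_events is taken from the alerts below, and the right remediation always depends on why the unit failed:

# Inspect the failed unit and its recent logs.
sudo systemctl status produce_canary_events
sudo journalctl -u produce_canary_events --since "2 hours ago"

# Once the underlying issue is understood, clear the failed state so the
# "Check unit status" alert can recover, then optionally re-run the unit.
sudo systemctl reset-failed produce_canary_events
sudo systemctl start produce_canary_events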
[11:55:47] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:06:15] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:13:34] (03CR) 10Snwachukwu: Update the browser_general hql to use spark hints (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [12:38:32] 10Data-Engineering, 10Airflow: Airflow Hackathon (May 2022) - https://phabricator.wikimedia.org/T307500 (10JArguello-WMF) 05Open→03Resolved [12:39:26] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:39:39] 10Data-Engineering: Kerberos passowrd reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (10Sgs) [12:49:27] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:21:55] ottomata: would you have a minute for me please| [13:21:56] ? [13:22:20] ottomata: I'm fighting some python and could do with help [13:24:04] Ah nevermind - I found the thing :) [13:42:00] 10Data-Engineering: Kerberos password reset request for sgimeno - https://phabricator.wikimedia.org/T309608 (10Sgs) [13:45:24] Hi team [13:55:14] Hi milimetric [14:35:18] (03PS1) 10Milimetric: Move leftover check script to refinery [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801727 [14:35:41] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Just moving from the wiki" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801727 (owner: 10Milimetric) [14:36:43] (03CR) 10Mforns: "LGTM! But I second Sandra's comments." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [14:37:55] 10Data-Engineering: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) ` ====== stat1004 ====== total 513244 drwxr-xr-x 2 26051 wikidev 4096 Jul 20 2021 hdfs-namenode-fsimage -rw-rw-r-- 1 26051 wikidev 1245367 Jan 10 16:42 part.txt -rw-r--r-- 1 26051 wikidev... 
[15:24:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [POC] Use airflow-installed Spark3 for an Airflow job - https://phabricator.wikimedia.org/T308168 (10JArguello-WMF) 05Open→03Resolved [15:24:21] 10Analytics, 10Data-Engineering, 10Epic: Upgrade analytics-hadoop to Spark 3 + scala 2.12 - https://phabricator.wikimedia.org/T291464 (10JArguello-WMF) [15:24:33] 10Data-Engineering-Kanban, 10Airflow: Medium Complexity Oozie Migration: mobile_apps-session_metrics - https://phabricator.wikimedia.org/T302874 (10JArguello-WMF) 05Open→03Resolved [15:24:47] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Adapt maxExecutors value by Dag - https://phabricator.wikimedia.org/T307447 (10JArguello-WMF) 05Open→03Resolved [15:25:06] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Organize hackathon - https://phabricator.wikimedia.org/T295204 (10JArguello-WMF) 05Open→03Resolved [15:25:19] 10Data-Engineering-Kanban, 10Airflow: Migrate the Clickstream jobs to Airflow - https://phabricator.wikimedia.org/T305843 (10JArguello-WMF) 05Open→03Resolved [15:25:51] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low RIsk Ozzie Migration: Wikidata CoEditor metric job - https://phabricator.wikimedia.org/T306177 (10JArguello-WMF) 05Open→03Resolved [15:25:54] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Ozzie Migration: 4 wikidata metrics jobs - https://phabricator.wikimedia.org/T300021 (10JArguello-WMF) [15:26:13] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: APIs - https://phabricator.wikimedia.org/T300028 (10JArguello-WMF) 05Open→03Resolved [15:26:16] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:26:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Oozie Migration: interlanguage - https://phabricator.wikimedia.org/T300025 (10JArguello-WMF) 05Open→03Resolved [15:27:53] 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: browser/general - https://phabricator.wikimedia.org/T302875 (10JArguello-WMF) 05Open→03Resolved [15:28:05] 10Data-Engineering-Kanban, 10Airflow: Fix use of Java LinkedHashMap caching in Spark multi-threaded environment - https://phabricator.wikimedia.org/T305386 (10JArguello-WMF) 05Open→03Resolved [15:28:22] 10Data-Engineering-Kanban, 10Airflow: Medium Risk Oozie Migration: mediarequest - https://phabricator.wikimedia.org/T302876 (10JArguello-WMF) 05Open→03Resolved [15:28:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Create Generic Hive-To-Graphite job - https://phabricator.wikimedia.org/T304623 (10JArguello-WMF) 05Open→03Resolved [15:28:56] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:28:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic, 10Patch-For-Review: Define and implement archiving for Airflow - https://phabricator.wikimedia.org/T300039 (10JArguello-WMF) 05Open→03Resolved [15:29:06] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10JArguello-WMF) 05Open→03Resolved [15:29:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: Low Risk Oozie Migration: session length - https://phabricator.wikimedia.org/T300029 (10JArguello-WMF) 
05Open→03Resolved [15:29:37] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:29:55] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Unifying HDFS Sensor and FSSPEC Sensor - https://phabricator.wikimedia.org/T302392 (10JArguello-WMF) 05Open→03Resolved [15:30:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Troubleshoot MySQL connection issues - https://phabricator.wikimedia.org/T298893 (10JArguello-WMF) 05Open→03Resolved [15:30:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor jobs to not use DAG factories - https://phabricator.wikimedia.org/T302391 (10JArguello-WMF) [15:31:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Investigate using a HiveToGraphite connector job instead of individual jobs - https://phabricator.wikimedia.org/T303308 (10JArguello-WMF) 05Open→03Resolved [15:31:16] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: wikidata_item_page_link - https://phabricator.wikimedia.org/T300023 (10JArguello-WMF) 05Open→03Resolved [15:31:19] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:31:30] 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10JArguello-WMF) 05Open→03Resolved [15:31:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Airflow concurrency limits - https://phabricator.wikimedia.org/T300870 (10JArguello-WMF) 05Open→03Resolved [15:34:38] 10Data-Engineering: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) I've reviewed everything above and it can all be safely deleted. An admin needs to do this, with cumin, [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Have_any_users_left_the_Found... [15:34:43] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) Didn't work btw, turns out that eventgate also needs a service-runner bump. PR at https://... 
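The leftovers review above ends with an admin deleting razzi's databases and HDFS data, !log'd just below. A minimal sketch of what that kind of cleanup typically looks like, assuming a /user/razzi HDFS home and a personal Hive database named razzi (both assumptions for illustration) and the kerberos-run-command wrapper that appears later in this log; the authoritative procedure is the Ops week wiki page linked above:

# See what is left under the user's HDFS home.
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -du -s -h /user/razzi

# Remove the HDFS home directory; without -skipTrash this only moves the data
# to the HDFS trash, matching the "(in trash)" note in the !log below.
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/razzi

# Drop a leftover personal Hive database (assumes beeline on the host is
# configured with a default connection).
sudo -u hdfs kerberos-run-command hdfs beeline -e 'DROP DATABASE IF EXISTS razzi CASCADE;'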
[15:36:12] !log dropped razzi databases and deleted HDFS directories (in trash) [15:36:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:36:33] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: aqs - https://phabricator.wikimedia.org/T299398 (10JArguello-WMF) 05Open→03Resolved [15:36:38] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:36:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Create Custom Hdfssensor - https://phabricator.wikimedia.org/T300276 (10JArguello-WMF) 05Open→03Resolved [15:37:15] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (10JArguello-WMF) 05Open→03Resolved [15:37:19] 10Data-Engineering, 10Airflow, 10Epic, 10Patch-For-Review: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (10JArguello-WMF) [15:37:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Add module to easily get dependency paths in HDFS - https://phabricator.wikimedia.org/T300795 (10JArguello-WMF) 05Open→03Resolved [15:37:52] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Epic: Airflow MVP - https://phabricator.wikimedia.org/T288263 (10JArguello-WMF) 05Open→03Resolved [15:42:23] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) Ah sorry about that, should have realized. Docs here: https://wikitech.wikimedia.org/wiki/... [15:43:31] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Add deletion job for old anomaly detection data - https://phabricator.wikimedia.org/T298972 (10JArguello-WMF) 05Open→03Resolved [15:43:50] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: [Airflow] Migrate anomaly detection DAG to airflow-dags repo. 
- https://phabricator.wikimedia.org/T295201 (10JArguello-WMF) 05Open→03Resolved [15:44:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Create a tool for easily spinning up a test Airflow instance - https://phabricator.wikimedia.org/T295202 (10JArguello-WMF) 05Open→03Resolved [15:44:21] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10JArguello-WMF) 05Open→03Resolved [15:44:24] 10Data-Engineering, 10Airflow, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10JArguello-WMF) [15:44:38] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Write a job entirely in Airflow with spark and/or sparkSQL - https://phabricator.wikimedia.org/T285692 (10JArguello-WMF) 05Open→03Resolved [15:44:40] 10Analytics, 10Analytics-Kanban: [SPIKE] analytics-airflow jobs development - https://phabricator.wikimedia.org/T284172 (10JArguello-WMF) [15:45:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Patch-For-Review: [Airflow] Set up scap deployment - https://phabricator.wikimedia.org/T295380 (10JArguello-WMF) 05Open→03Resolved [15:45:14] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Automate sync'ing archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10JArguello-WMF) 05Open→03Resolved [15:45:16] 10Data-Engineering, 10Airflow, 10Epic, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10JArguello-WMF) [15:45:25] 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) Hm, I think we stopped using the github commit sha to install, and instead rely on NPM lik... [16:01:47] (03PS2) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [16:03:44] 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of razzi - https://phabricator.wikimedia.org/T309000 (10Milimetric) a:03Milimetric [16:05:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Assess existing and in-development storage platforms for suitability - https://phabricator.wikimedia.org/T309509 (10BTullis) I have written a brief document assessing the WMCS and MOSS storage clusters as to their suitability for this project: [[https://... [16:20:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Assess existing and in-development storage platforms for suitability - https://phabricator.wikimedia.org/T309509 (10BTullis) Moving to in-review whilst seeking input and feedback from stakeholders. [16:49:29] (03CR) 10Joal: Update / fix HQL jobs (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 (owner: 10Joal) [16:49:57] (03PS3) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [16:55:06] mforns: Heya - would you have aminute? [17:07:18] 10Data-Engineering, 10Equity-Landscape: Population input metrics - https://phabricator.wikimedia.org/T309279 (10ntsako) **Overall Engagement (Percentile)** and **Total Population Presence*Growth Percentile** depend on Affilate and Overall Engagement being completed The other columns have been calculated. 
[17:09:10] PROBLEM - At least one Hadoop HDFS NameNode is active on an-master1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [17:09:16] (03PS4) 10Joal: Update / fix HQL jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/801416 [17:18:42] a-team is namenode down?? cc btullis, also looking [17:18:55] yeah we're seeing issues as well [17:19:26] ottomata: jobs seem to run [17:19:56] hdfs dfs -ls fails on stat1004 for me [17:20:45] hd full on an-master1001 [17:20:56] i think maybe 1002 is trying to come online [17:21:01] but the journal is maybe weird?! [17:22:17] hmm i don't see any full disks [17:22:20] just in namenode logs [17:22:43] 2022-05-31 17:05:24,447 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.5.27:8485, 10.64.5.29:8485, 10.64.36.113:8485, 10.64.53.29:8485, 10.64.21.116:8485], stream=QuorumOutputStream starting at txid 8233997690)) [17:22:43] org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 3/5. 2 successful responses: [17:22:44] Oh sorry, just spotted this now. [17:22:46] hm could be on one of the journal nodes [17:23:39] an-worker1080 [17:23:42] /dev/mapper/an--worker1080--vg-journalnode 9.8G 9.8G 0 100% /var/lib/hadoop/journal [17:23:43] ottomata: could be journal nodes, or GC times could be too long, generating issues (we saw that) [17:23:47] https://usercontent.irccloud-cdn.com/file/ar0UWPJ4/image.png [17:23:52] samesame for 1078 [17:23:59] eesh [17:24:01] ok - full journalnodes it is [17:24:05] all full [17:24:06] um [17:24:18] maybe we should try to enter safe mode asap? [17:24:29] Agreed ottomata [17:24:37] Do you want to jump on a call together?
[17:24:44] yes [17:24:46] yes bc [17:25:16] Oh, it was not just me --- good to know [17:28:02] I'm getting this error, but I think this is what you are already talking about https://pastebin.com/Bjy9fGM0 [17:28:04] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:40] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:30:45] !log stopped the hdfs-namenode service on an-master100[1-2] [17:30:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:33:12] !log stop journalnodes and datanodes on 5 hadoop journalnode hosts [17:33:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:22] !log increasing each of the hadoop journalnodes by 20 GB [17:38:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:30] PROBLEM - Hadoop JournalNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:34] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:40] !log sudo lvresize -L+20G analytics1069-vg/journalnode [17:38:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:38:46] PROBLEM - Hadoop DataNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop DataNode on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:38:46] PROBLEM - Hadoop JournalNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:52] PROBLEM - Hadoop JournalNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:38:56] PROBLEM - Hadoop DataNode on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:12] PROBLEM - Hadoop JournalNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process 
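The recovery !log'd above and continued just below grows each of the five journal volumes and restarts the daemons. A minimal per-host sketch of that sequence, assuming the LVM naming visible in the log (e.g. analytics1069-vg/journalnode mounted at /var/lib/hadoop/journal), an ext4 filesystem (implied by the use of resize2fs), and service names following the hadoop-hdfs-namenode pattern seen later in the log:

# Confirm the journal volume is the one that is full (this is what showed 100% above).
df -h /var/lib/hadoop/journal

# Stop the Hadoop daemons on this host before growing the volume,
# matching the "stop journalnodes and datanodes" !log above.
sudo service hadoop-hdfs-journalnode stop
sudo service hadoop-hdfs-datanode stop

# Grow the logical volume by 20 GB, then grow the filesystem to match.
sudo lvresize -L +20G analytics1069-vg/journalnode
sudo resize2fs /dev/analytics1069-vg/journalnode

# Bring the daemons back once there is free space again.
sudo service hadoop-hdfs-journalnode start
sudo service hadoop-hdfs-datanode start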
[17:39:32] PROBLEM - Hadoop DataNode on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:39:36] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:39:58] journalnodes: [17:40:01] an-worker1080.eqiad.wmnet [17:40:01] an-worker1078.eqiad.wmnet [17:40:01] analytics1072.eqiad.wmnet [17:40:01] an-worker1090.eqiad.wmnet [17:40:01] analytics1069.eqiad.wmnet [17:40:04] btullis: ^ [17:40:10] PROBLEM - Hadoop JournalNode on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:40:34] PROBLEM - Check unit status of analytics-reportupdater-logs-rsync on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit analytics-reportupdater-logs-rsync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:41:39] !log resizing each journalnode with resize2fs [17:41:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:41:56] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:42:16] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:43:49] !log restarting journalnode service on each of the five hadoop workers with journals. [17:43:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:36] !log restarting the datanodes on all five of the affected hadoop workers. 
[17:44:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:44:44] RECOVERY - Hadoop JournalNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:44:44] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:45:22] RECOVERY - Hadoop JournalNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:26] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:38] RECOVERY - Hadoop DataNode on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop DataNode on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:45:40] RECOVERY - Hadoop JournalNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:44] RECOVERY - Hadoop JournalNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:45:48] RECOVERY - Hadoop DataNode on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:02] RECOVERY - Hadoop JournalNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process [17:46:22] RECOVERY - Hadoop DataNode on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [17:46:43] !log starting namenode services on am-master1001 [17:46:58] PROBLEM - Check unit status of analytics-dumps-fetch-unique_devices on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:47:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:47:08] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit 
analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:48:48] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [17:53:30] PROBLEM - Check unit status of refine_netflow on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:53:34] PROBLEM - Check unit status of analytics-dumps-fetch-mediacounts on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:58:00] PROBLEM - Check unit status of analytics-dumps-fetch-pageview on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:00:02] PROBLEM - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:00:42] ottomata: https://community.cloudera.com/t5/Support-Questions/Multiple-edits-inprogress-files-on-one-of-the-journal-nodes/td-p/240700 [18:04:38] PROBLEM - Check unit status of analytics-dumps-fetch-pageview on labstore1006 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:07:05] 2022-05-27 17:50:43,101 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because there have been 31451627 txns since the last checkpoint, which exceeds the configured threshold 1000000 [18:07:30] 2022-05-27 17:50:43,104 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Save namespace ... [18:07:30] 2022-05-27 17:50:43,104 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint [18:07:30] java.io.IOException: No image directories available! 
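The steps that follow put HDFS into safe mode, bring the standby namenode on an-master1002 back up, and later leave safe mode once the cluster looks healthy. A minimal sketch of the commands involved, mirroring the !log entries below; the haadmin service ids (an-master1001-eqiad-wmnet / an-master1002-eqiad-wmnet) are assumed from the alert text and may differ from the actual configuration:

# Check and toggle HDFS safe mode, run as the hdfs user as in the !log entries.
sudo -u hdfs hdfs dfsadmin -safemode get
sudo -u hdfs hdfs dfsadmin -safemode enter
# ... later, once both namenodes and the journal quorum are healthy again:
sudo -u hdfs hdfs dfsadmin -safemode leave

# Check which namenode is active and which is standby.
sudo -u hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# Start the namenode process on the standby, as !log'd below for an-master1002.
sudo service hadoop-hdfs-namenode start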
[18:07:52] RECOVERY - At least one Hadoop HDFS NameNode is active on an-master1001 is OK: Hadoop Active NameNode OKAY: an-master1001-eqiad-wmnet https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23No_active_HDFS_Namenode_running [18:08:48] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:09:04] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:10:47] !log sudo -u hdfs hdfs dfsadmin -safemode enter [18:10:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:12:48] !log sudo service hadoop-hdfs-namenode start on an-master1002 [18:12:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:14:38] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:15:40] PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:15:56] 2022-05-23 00:07:11,192 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint [18:15:56] java.io.IOException: Failed to save in any storage directories while saving namespace. 
[18:15:56] at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1215) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:1135) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:227) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1300(StandbyCheckpointer.java:64) [18:15:57] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:477) [18:15:58] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:380) [18:15:58] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:400) [18:15:59] at java.security.AccessController.doPrivileged(Native Method) [18:15:59] at javax.security.auth.Subject.doAs(Subject.java:360) [18:16:00] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1906) [18:16:00] at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) [18:16:01] at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:396) [18:17:32] 2022-05-23 00:07:06,918 INFO org.apache.hadoop.ipc.Server: IPC Server handler 48 on default port 8040, call Call#4759050 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.monitorHealth from 10.64.21.110:37987 [18:17:32] org.apache.hadoop.ha.HealthCheckFailedException: The NameNode has no resources available [18:17:32] at org.apache.hadoop.hdfs.server.namenode.NameNode.monitorHealth(NameNode.java:1807) [18:17:32] at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.monitorHealth(NameNodeRpcServer.java:1697) [18:17:32] at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.monitorHealth(HAServiceProtocolServerSideTranslatorPB.java:80) [18:17:33] at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5407) [18:17:33] at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507) [18:17:34] at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034) [18:17:34] at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003) [18:17:35] at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931) [18:17:35] at java.security.AccessController.doPrivileged(Native Method) [18:17:36] at javax.security.auth.Subject.doAs(Subject.java:422) [18:17:36] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926) [18:17:37] at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854) [18:18:30] PROBLEM - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:18:30] 2022-05-23 00:07:10,335 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to save image for /srv/hadoop/name [18:18:30] java.io.IOException: No space left on device [18:18:30] at java.io.FileOutputStream.writeBytes(Native Method) [18:18:41] 2022-05-23 00:07:10,923 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '/dev/mapper/an--master1002--vg-srv' is 0, which is below the configured reserved amount 104857600 [18:18:41] 2022-05-23 00:07:10,923 INFO 
org.apache.hadoop.ipc.Server: IPC Server handler 24 on default port 8040, call Call#4759056 Retry#0 org.apache.hadoop.ha.HAServiceProtocol.monitorHealth from 10.64.21.110:37987 [18:19:34] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:20:48] (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:23:56] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:15] How goes it? [18:25:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:26:14] second namenode is busy coming back. joseph and i are narrowing in on a cause. [18:26:47] we think the secondary namenode disk filled up a week and a bit ago because [18:26:47] image snapshots are getting too large. and we save too many [18:28:31] Cool. We didn't get an alert? [18:28:58] i guess not! [18:29:51] root@an-master1002:/srv/backup/hadoop/namenode# du -sh . [18:29:52] 93G . [18:30:05] it freed up space after it failed on may 23. [18:30:16] it stopped saving images, but it didn't stop clearing them.
[18:31:04] i'm pretty sure these are copied off via bacula too [18:31:07] so we don't need to save 20 days worth [18:34:09] 👍 [18:34:24] RECOVERY - Check unit status of analytics-dumps-fetch-unique_devices on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:36:03] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [18:38:35] joal /srv/hadoop/name/current/fsimage.ckpt_0000000008234624682 [18:41:08] RECOVERY - Check unit status of analytics-dumps-fetch-unique_devices on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-unique_devices https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:41:16] RECOVERY - Check unit status of analytics-dumps-fetch-mediacounts on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:48:44] !log sudo -u hdfs hdfs dfsadmin -safemode leave on an-master1001 [18:48:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:48:46] RECOVERY - Check unit status of analytics-dumps-fetch-mediacounts on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:50:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [18:53:26] RECOVERY - Check unit status of analytics-dumps-fetch-pageview on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:59:00] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:00:42] RECOVERY - Check unit status of analytics-dumps-fetch-pageview on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:00:44] RECOVERY - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:02:16] 10Data-Engineering, 10Data-Engineering-Kanban: Drop UploadWizard* data - https://phabricator.wikimedia.org/T305556 (10Milimetric) a:03Milimetric [19:04:36] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:04:47] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:04:52] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit 
eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:05:06] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:06:56] RECOVERY - Check unit status of refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:09:02] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:10:15] joal: mforns, btullis: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure [19:10:28] please read and let me know if i summarized correctly [19:10:38] Follow ups heer [19:10:40] Follow ups here [19:10:41] https://phabricator.wikimedia.org/T309649 [19:10:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:11:25] BTW, ^ is the gobblin metrics based alerts! woohoo! [19:11:51] ottomata: thanks. Looks great at first glance. I will be home in 30 minutes in case that helps. [19:11:58] i think we should be in the clear for now [19:12:09] we have some action items to keep it from happening again [19:12:14] but nothing that is unbreak now [19:12:48] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:14:48] RECOVERY - Check unit status of refine_event_sanitized_main_immediate on an-launcher1002 is OK: OK: Status of the systemd unit refine_event_sanitized_main_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:15:01] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:15:48] (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:20:17] 10Data-Engineering, 10Data-Engineering-Kanban: Drop GettingStarted* data - https://phabricator.wikimedia.org/T307774 (10Milimetric) a:03Snwachukwu [19:20:51] 10Data-Engineering: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10Ottomata) [19:25:48] (GobblinLastSuccessfulRunTooLongAgo) resolved: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [19:26:48] Can we add to that tag description - check alerts for /srv on an-master1002 and check that backups of /srv/backups/namenode exist ? 
I should add them in myself [19:30:20] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:32:28] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:32:56] btullis: ya add in please! according to puppet bacula is backing up /srv/backups/namenode [19:34:56] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:40:22] RECOVERY - Check unit status of analytics-reportupdater-logs-rsync on an-launcher1002 is OK: OK: Status of the systemd unit analytics-reportupdater-logs-rsync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:45:06] RECOVERY - Check unit status of refine_netflow on an-launcher1002 is OK: OK: Status of the systemd unit refine_netflow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:24:20] 10Data-Engineering, 10Data-Engineering-Kanban: Drop UploadWizard* data - https://phabricator.wikimedia.org/T305556 (10Milimetric) ` drop table event_sanitized.gettingstartedredirectimpression; drop table event.gettingstartedredirectimpression; sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /... [20:24:58] 10Data-Engineering, 10Data-Engineering-Kanban: Drop GettingStarted* data - https://phabricator.wikimedia.org/T307774 (10Milimetric) ` drop table event_sanitized.uploadwizarderrorflowevent; drop table event_sanitized.uploadwizardexceptionflowevent; drop table event_sanitized.uploadwizardflowevent; drop table ev...
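Looping back to the follow-ups filed on T309649 earlier in the log: the root cause was the standby's fsimage backup directory growing to 93G and filling /srv on an-master1002, even though bacula already copies those backups off-host. A purely illustrative sketch of checking and trimming that local retention (paths taken from the log; the 7-day cutoff is an example, not the fix actually adopted in the task):

# How much space the local fsimage backups take, and which are oldest.
du -sh /srv/backup/hadoop/namenode
ls -lt /srv/backup/hadoop/namenode | tail

# List backup files older than 7 days; only uncomment the -delete once reviewed.
find /srv/backup/hadoop/namenode -type f -mtime +7 -print
# find /srv/backup/hadoop/namenode -type f -mtime +7 -delete

# Keep an eye on the partition that filled up.
df -h /srv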