[00:02:00] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:12:52] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:08:53] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] homepagevisit: Add new referer routes [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/894699 (https://phabricator.wikimedia.org/T322435) (owner: 10Kosta Harlan)
[01:09:28] <wikibugs>	 (03Merged) 10jenkins-bot: homepagevisit: Add new referer routes [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/894699 (https://phabricator.wikimedia.org/T322435) (owner: 10Kosta Harlan)
[02:22:44] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:33:32] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:05:58] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:38:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:10:56] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:09:35] <nfraison>	 ottomata: yes btullis also performed failover on test cluster (see https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log)
[07:45:01] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[07:49:45] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10akosiaris) By the way, which eventgate (and thus which kafka cluster) will this produce to?
[07:50:17] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[08:00:41] <nfraison>	 !log Reimage an-conf1003 to upgrade to bullseye T329362
[08:00:47] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:00:48] <stashbot>	 T329362: Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362
[08:07:16] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-conf10...
[08:16:31] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:23:26] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[08:26:48] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[08:28:56] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[08:31:24] <nfraison>	 joal, btullis from discussion we had yesterday on failure during failover here is a config we had to put on the hadoop cluster @criteo due to too long quota initialization: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895127
[08:31:25] <nfraison>	 I can't be 100% sure that it is the root cause of our failure as I don't have any tracing during this previous issue but it won't have bad side effect
[08:31:27] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui)
[08:42:29] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-conf1003.e...
[08:50:25] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (10nfraison)
[08:52:35] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff)
[08:53:55] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10elukey)
[08:59:46] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data Pipelines, 10Infrastructure-Foundations, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui)
[09:08:50] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:16:51] <btullis>	 nfraison: That looks really useful. You might want to link it to this ticket and optionally reopen it: https://phabricator.wikimedia.org/T310293 This had all of our original observations on it.
[09:29:47] <wikibugs>	 (03PS2) 10Aqu: Migrate refine webrequest to Airflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073)
[09:30:38] <wikibugs>	 10Data-Engineering, 10Data Pipelines: Install conda-analytics on Airflow servers - https://phabricator.wikimedia.org/T331345 (10BTullis) >>! In T331345#8669989, @Ottomata wrote: > I think it's fine and probably the easiest solution.  I agree, I think this solution is fine, so I'd be in favour of installing `co...
[09:58:16] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) Tonight issue: ` Mar 07 00:01:31 an-launcher1002 airflow-scheduler@analytics[5803]: Process Da...
[10:04:09] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10Ottomata) Re https://gerrit.wikimedia.org/r/894740, we should ask @mforns @Milimetric @JAl...
[10:04:12] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10Ottomata) Re https://gerrit.wikimedia.org/r/894740, we should ask @mforns @Milimetric @JAl...
[10:06:10] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata) This doesn't produce to EventGate, it produces to Kafka directly.  It will produce to Kafka main clusters.
[10:30:45] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10achou) Thank you for sharing your thoughts on this. They all make sense! I agree that doing Option 1 for now and having a docum...
[10:49:11] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10achou)
[10:55:22] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Event Platform - Proof of Concept - Enriched Edit History Message Creation - https://phabricator.wikimedia.org/T302834 (10Ottomata) Can we close?
[11:00:29] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Create new mediawiki.page_links_change schema based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata)
[11:01:04] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Create new mediawiki.page_links_change schema based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata)
[11:01:13] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata)
[11:03:49] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) > having a documented plan to switch to Option 3 in the future Just created {T331399}
[11:04:43] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Create new mediawiki.page_links_change schema based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (10Ottomata) We may want to emit this change directly from EventBus, not as from a strea...
[11:13:57] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10akosiaris) >>! In T325303#8671772, @Ottomata wrote: > This doesn't produce to EventGate, it produces to Kafka directly.  It will p...
[11:15:20] <wikibugs>	 (03CR) 10DCausse: ProduceCanaryEvents: set a timeout on the http client (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/894642 (https://phabricator.wikimedia.org/T330236) (owner: 10DCausse)
[11:16:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:19:09] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata)
[11:21:40] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata) > - predicted_topics  Let's bikeshed and implement the event schema for this in {T331401}
[11:21:48] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Ottomata)
[11:21:54] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata)
[11:23:53] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata) Good question. I think so, but we should consult.    Perhaps, we should just continue producing to Kafka jumbo-eqiad fro...
[11:24:56] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10aborrero)
[11:25:36] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10aborrero)
[11:26:01] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10aborrero) Sent a ping to @Marostegui regarding clouddb[1013-1014,1021]  Also @Andrew regarding cloudservices host, but I think the...
[11:28:22] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10JMeybohm) IIRC there was an objection against using zookeeper because it is really only used by kafka and kafka does no longer requ...
[11:28:48] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops-radar: Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10JMeybohm)
[11:30:14] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10serviceops-radar: Store Flink HA metadata in Zookeeper for mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T331283 (10Ottomata) That is indeed an objection! :) But we have two choices it seems:  - k8s config maps - zookeeper  U...
[11:30:40] <btullis>	 As mentioned on the analytics@ and analytics-announce@ mailing lists, we have to pause ingestion and make HDFS read only for a while this afternoon. Apologies for the short notice, for which the blame lies with me.
[11:31:49] <btullis>	 We're proposing to stop ingestion to the Data Lake at 12:50 UTC today.
[11:32:43] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Marostegui) @aborrero regarding clouddb* hosts, it is up to your team but I think it would be nice if you could depool them. Bette...
[11:33:13] <btullis>	 We will then enable safe mode for HDFS at approximately 13:50 whilst T329073 is carried out.
[11:33:13] <stashbot>	 T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[11:48:05] <wikibugs>	 10Quarry, 10Wikidata: Quarry showing out-of-sync data in the Wikidata database - https://phabricator.wikimedia.org/T331394 (10Bugreporter)
[11:48:12] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:04:04] <wikibugs>	 10Analytics-Radar, 10Data-Engineering, 10Data-Engineering-Wikistats: WiViVi Broken in Firefox 50 (Linux only) - https://phabricator.wikimedia.org/T172304 (10Aklapper) 05Stalled→03Declined Unfortunately closing this Phabricator task as no further information has been provided and as I cannot reproduce now...
[12:39:02] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata) We did some research on this stream's volume as part of {T307944}.  > I'll start with @Milimetric's handy numbers from a...
[12:44:10] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) The driver is in charge of servicing files, jars and app jar through http file server. With th...
[12:45:46] <btullis>	 !log depooling aqs1010 for T329073
[12:45:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:45:49] <stashbot>	 T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[12:46:23] <btullis>	 !log depooling aqs1016for T329073
[12:46:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:47:31] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis)
[12:48:13] <btullis>	 a-team: We are about to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/894537 which will disable gobblin timers, effectively pausing ingestion to the Data Lake
[12:48:48] <joal>	 ack btullis - I'll keep an eye
[12:49:03] <btullis>	 joal: Many thanks.
[12:49:21] * joal is flexing his job-babysitting muscles
[12:50:27] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform Value Stream: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP - https://phabricator.wikimedia.org/T307944 (10Ottomata)
[12:51:24] <btullis>	 !log disabled gobblin timers on an-launcher1002
[12:51:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:51:29] <nfraison>	 btullis as discussed please keep an-master1002 today for active NN
[12:51:29] <nfraison>	 I would like to push https://gerrit.wikimedia.org/r/c/operations/puppet/+/895127 before and it will require NN restart
[12:51:29] <nfraison>	 I will only push it tomorrow to avoid leading to issues while we already have row A switches upgrade
[12:55:11] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10fnegri) @Marostegui @aborrero the patch above should depool clouddb1013 and clouddb1014.  I don't think clouddb1021 can be depoole...
[12:56:14] <btullis>	 !log depooled datahubsearch1001 for T329073
[12:56:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:56:21] <stashbot>	 T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[12:56:31] <joal>	 mforns: Hi! For when you come online: there is a skein yarn application for druid-load of daily navigation-timing from 2023-02-28 still alive - is that expected?
[12:57:06] <btullis>	 !log depooled druid1004 for T329073
[12:57:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:57:37] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis)
[13:01:24] <wikibugs>	 (03PS2) 10Addshore: Sanitization: Keep version for mwcli_command_execute [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894104
[13:20:04] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review: Upgrade flink-kubernetes-operator to 1.4.0 - https://phabricator.wikimedia.org/T331282 (10Ottomata)
[13:20:23] <btullis>	 joal: We still have quite a bit of HDFS write happening, including some user jobs that aren't managed by us. This one looks pretty big: https://yarn.wikimedia.org/cluster/app/application_1676068183510_150644
[13:21:00] <joal>	 btullis: sorry I pinged you on a wrong chan
[13:21:08] <btullis>	 Should we reach out, or is there anything else we should do? 
[13:21:12] <joal>	 this job is the only big one doing stuff now
[13:21:30] <joal>	 btullis: I tried to reach out, but the user is not on IRC nor slack
[13:21:34] <joal>	 I'm gonna send an email
[13:22:54] <ottomata>	 btullis: it was started ~2hs ago, well within your annoucement, so I think if you can't reach them, its okay to kill
[13:22:55] <btullis>	 Cool, nor active on phasb since September 2022, it looks like https://phabricator.wikimedia.org/p/Aroraakhil/
[13:23:16] <btullis>	 ottomata: Thanks. Will do.
[13:27:46] <joal>	 email sent btullis - let's kill when we need
[13:28:03] <btullis>	 joal: Ack, thanks.
[13:28:21] <wikibugs>	 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 09), 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata)
[13:29:47] <wikibugs>	 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 09), 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata)
[13:30:43] <mforns>	 hi joal! thank you for the heads up. I thought it had failed yesterday, sorry my bad. It can be killed
[13:31:55] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream: [Shared Event Platform][NEEDS GROOMING] We should standardize Flink app config for yarn (development) deployments - https://phabricator.wikimedia.org/T311070 (10Ottomata) Can we close this?
[13:46:27] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney)
[13:47:43] <btullis>	 joal: ottomata: I think we're going to have to kill it.
[13:47:47] <joal>	 btullis: when do we need to kill the job on yarn?
[13:47:59] <joal>	 btullis: maintenancew happens in 10 minutes?
[13:48:11] <btullis>	 Yep.
[13:48:23] <joal>	 ok - killing jobs
[13:49:26] <btullis>	 joal: Thanks. Are you killing it from an-master1001 as the yarn user, or from a UI?
[13:49:42] <joal>	 btullis: I kill them from an-launcher1002, using analytics user
[13:49:55] <btullis>	 k, thanks.
[13:53:30] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata)
[13:54:45] <btullis>	 !log entering safe mode with `sudo -u hdfs kerberos-run-command hdfs hadoop dfsadmin -safemode enter` on an-master1002
[13:54:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:54:55] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata)
[13:55:04] <btullis>	 https://www.irccloud.com/pastebin/zTN1rUha/
[13:55:13] <joal>	 btullis: there are 2 apps I can't restart
[13:55:21] <joal>	 btullis: I can't sudo as analytics-search
[13:55:32] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream: mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata) We could do this in some library-ized way in eventutilies-python, but I think the simpler thing to do would be to just add a con...
[13:55:38] <joal>	 btullis: shutting down Yarn should kill tem :)
[13:55:52] <btullis>	 Shutting down both resourcemanagers?
[13:55:59] <joal>	 yes :)
[13:56:04] <joal>	 But we can also not do that
[13:56:22] <joal>	 shutting them down will make sure no one restart jobs
[13:56:48] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis)
[13:56:58] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10achou) It seems we can remove the map field since we are now building a stream per model. Additional...
[13:57:41] <btullis>	 !log stopped `hadoop-yarn-resourcemanager.service` on both an-master100[1-2]
[13:57:42] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:57:47] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream: Flink EventStreamCatalog should not prevent creation of VIEWs - https://phabricator.wikimedia.org/T330703 (10tchin) a:03tchin
[13:58:32] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond)
[13:59:07] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond)
[13:59:09] <btullis>	 !log disabled puppet temporarily on an-master100[1-2] to avoid an automatic restart of yarn
[13:59:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:43] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MatthewVernon)
[14:00:56] <icinga-wm>	 PROBLEM - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process
[14:01:12] <icinga-wm>	 PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:02] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-resourcemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:08] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop ResourceManager on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Btullis Shut down to avoind new jobs being started during T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process
[14:07:28] <icinga-wm>	 RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:28] <icinga-wm>	 RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:30] <btullis>	 ^ I reset the failed state of the `hadoop-yarn-resourcemanager.service` on both an-master100[1-2] to reduce the alert noise. They're both still doen.
[14:09:34] <btullis>	 down
[14:09:40] <joal>	 thanks btullis 
[14:09:56] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f4ffc353-a529-4620-994f-ae7b737f3c7a) set by cmooney@cumin1001 fo...
[14:10:13] <joal>	 how about hdfs btullis?
[14:10:22] <joal>	 have we made it safemode?
[14:10:33] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09): mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata)
[14:10:45] <btullis>	 Yes.
[14:10:49] <btullis>	 https://www.irccloud.com/pastebin/oZY0ObIP/
[14:11:16] <joal>	 awesome btullis - doing ls on it doesn't tell me :)
[14:11:38] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:11:38] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:12:54] <btullis>	 Great, well let's keep an eye on those under-replicated blocks and missing blocks etc when 20% of the datanodes go away.
[14:14:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:26] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:40] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:46] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:56] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:56] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:14:58] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:00] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:11] <wikibugs>	 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 09), 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10JMeybohm) https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_...
[14:15:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1144 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:27] <btullis>	 I thought that these hosts were downtimed?
[14:15:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:15:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:16:09] <nfraison>	 all nodemanager are failing because RM is down
[14:16:25] <nfraison>	 we will have to restart them once the RM will be back
[14:16:43] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:16:43] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Nicolas Fraison T329073 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:17:07] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0a07bba2-0f50-4eec-9718-0c768add34f3) set by cmooney@cumin1001 fo...
[14:18:09] <btullis>	 Ah right, yes I see. Thanks. I was mixing it up with the datanode service.
[14:18:23] <joal>	 btullis, nfraison: should qwe silent them as well?
[14:18:48] <nfraison>	 All of those NM alerts are ack
[14:19:11] <joal>	 Ah right - sorry about that nfraison 
[14:19:16] <btullis>	 Ack, thanks.
[14:19:34] <nfraison>	 joal: for next time we will indeed need to downtime before
[14:20:48] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:20:54] <btullis>	 Yeah, stopping the resourcemanagers on the masters to prevent new jobs launching was not exactly on the checklist.
[14:21:27] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10nfraison) https://gerrit.wikimedia.org/r/c/operations/puppet/+/895228
[14:27:57] <wikibugs>	 (03CR) 10Joal: "One nit - otherwise looks good :) Ok for me once tested and resulting data validated :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621 (owner: 10Jennifer Ebe)
[14:38:45] <wikibugs>	 (03CR) 10Mforns: Create_mediacounts_archive_hql (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621 (owner: 10Jennifer Ebe)
[14:39:56] <wikibugs>	 (03CR) 10Joal: Create_mediacounts_archive_hql (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621 (owner: 10Jennifer Ebe)
[14:40:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_network_flows_internal_hourly.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitized_main_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:00] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10jbond)
[14:42:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:42:56] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:43:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:06] <btullis>	 joal: nfraison: I think we can bring HDFS out of safemode now. Would you agree?
[14:44:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:34] <icinga-wm>	 RECOVERY - Hadoop ResourceManager on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Resourcemanager_process
[14:44:53] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=eventlogging_legacy - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:45:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1066 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:45:54] <nfraison>	 btullis: are you sure all switches have been upgraded?
[14:46:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:46:47] <btullis>	 I will wait for the official all-clear in #wikimedia-sre but all looks good.
[14:46:56] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:16] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:47:48] <nfraison>	 Indeed all looks good from hdfs point of view
[14:47:48] <nfraison>	 so agreed to remove safemode once we have the go
[14:48:04] <btullis>	 :+1
[14:48:33] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:49:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1059 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:49:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:50:16] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) Happy to say the upgrade went as expected, no issues encountered.  All devices now back online running 21.4R3-S1.5.
[14:50:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:50:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:50:44] <joal>	 nfraison, btullis: I can see no blip in under-replicated blocks on grafana - is that expected?
[14:50:59] <joal>	 There is one blip in missing blocks, super short
[14:51:08] <btullis>	 It seems that puppet didn't get properly disabled on an-master100[1-2] so the `hadoop-yarn-resourcemanager` got restart automatically. Not a big issue, but interesting.
[14:51:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:51:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:51:33] <nfraison>	 yes it only happen during the restart of the switches which were quite fast
[14:51:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:51:53] <joal>	 ack nfraison - that's what I had guessed but prefered to check
[14:52:28] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10Andrew) the following hosts paged during this maintenance:   ` NodeDown wmcs cloudvirt1023:9100 (node eqiad) NodeDown wmcs cloudvi...
[14:52:54] <btullis>	 I agree. In fact those graphs look a bit better than I had dared hope, I suppose :-)
[14:53:00] <joal>	 :)
[14:53:00] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:53:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:53:30] <btullis>	 OK, we have the go-code from #wikimedia-sre
[14:53:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:53:47] <btullis>	 !log leaving safe mode on hdfs
[14:53:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:51] <nfraison>	 It is also due to datanode being considered dead only after 10 min (default value) by NN. Only seen as stale
[14:54:01] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09): mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata) Hm, on second thought. Perhaps we don't need this?  eventutilities-python only supports using the same source and de...
[14:54:07] <joal>	 yup I remember that nfraison - thanks for clarifying
[14:54:17] <btullis>	 https://www.irccloud.com/pastebin/K5ruWhTq/
[14:54:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:54:23] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream (Sprint 09): mediawiki-page-content-change-enrichment should configurably sample events - https://phabricator.wikimedia.org/T331417 (10Ottomata) 05Open→03Declined
[14:54:26] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Epic: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata)
[14:54:35] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff)
[14:54:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:55:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:55:40] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:56:28] <btullis>	 !log pooling datahubsearch1001
[14:56:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:56:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1099 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:57:16] <btullis>	 !log pooling aqs1010 and aqs1016
[14:57:17] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1058 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:21] <btullis>	 !log pooled druid1004
[14:58:22] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:58:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:32] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:40] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:45] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:58:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:54] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis)
[14:59:00] <icinga-wm>	 PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:16] <nfraison>	 !log force startup of nodemanager on analytics_cluster
[14:59:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:31] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:37] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:41] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1108 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1145 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:54] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1084 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1088 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:17] <icinga-wm>	 RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1094 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:01:02] <joal>	 nfraison, btullis: It seems that the RM sees no NodeManager (from the UI at least)
[15:01:18] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:01:20] <nfraison>	 yes I've tried to restart them but no change looking at logs
[15:01:26] <joal>	 Mwarf :(
[15:01:30] <btullis>	 It does that when it's on running with an-master1002 as the primary.
[15:01:50] <btullis>	 We can restart it on an-master1002 to force it back to an-master1001
[15:01:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:25] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:39] <btullis>	 OK, it is standby now. I didn't change anything.
[15:02:43] <btullis>	 https://www.irccloud.com/pastebin/FeZSo0oz/
[15:03:08] <btullis>	 YARN web ui back up.
[15:03:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1077 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:38] <joal>	 btullis: the yarn UI back since a while - but it says 0 active nodes
[15:03:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:54] <icinga-wm>	 PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:05] <nfraison>	 still facing issue to connect
[15:04:05] <nfraison>	 2023-03-07 15:03:40,823 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1001-eqiad-wmnet
[15:04:05] <nfraison>	 2023-03-07 15:03:40,825 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: java.net.ConnectException: Call From an-worker1123/10.64.5.12 to an-master1001.eqiad.wmnet:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over an-master1001-eqiad-wmnet after 4 
[15:04:05] <nfraison>	 failover attempts. Trying to failover after sleeping for 1077ms. Current retry count: 4.
[15:04:05] <nfraison>	 2023-03-07 15:03:41,902 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet
[15:04:05] <nfraison>	 2023-03-07 15:03:41,904 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: java.net.ConnectException: Call From an-worker1123/10.64.5.12 to an-master1002.eqiad.wmnet:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over an-master1002-eqiad-wmnet after 5 
[15:04:05] <nfraison>	 failover attempts. Trying to failover after sleeping for 1767ms. Current retry count: 5.
[15:04:06] <btullis>	 Oh I see, sorry. Misunderstood the problem.
[15:04:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:04:37] <icinga-wm>	 PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:43] <joal>	 hm
[15:05:08] <joal>	 Could it be certifs again, since those apps had not been restarted for quite a while?
[15:05:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:40] <wikibugs>	 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) > Is that equivalent in k8s (having a CPU request...
[15:05:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1098 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:06:05] <joal>	 I need to get my kids from school - will be back in ~1/2h - sorry to leave you in the middle of the fire :S
[15:06:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:14] <joal>	 nfraison, btullis --^
[15:06:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:27] <btullis>	 No worries. Catch you later joal.
[15:06:27] <wikibugs>	 10Data-Engineering-Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 09), 10Patch-For-Review, 10Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (10Ottomata) https://nightlies.apache.org/flink/flink-kubernete...
[15:06:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1064 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:07:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1079 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:07:09] <icinga-wm>	 PROBLEM - Check systemd state on analytics1073 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:19] <icinga-wm>	 PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:08:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:08:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1109 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:08:48] <btullis>	 nfraison: I'm also going to restart the two resourcemanagers, OK?
[15:09:11] <nfraison>	 agreed the 8031 socket is not open on both RM
[15:09:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1060 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:09:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:10:04] <wikibugs>	 (03CR) 10Nmaphophe: GDI Equity Landscape Tables/Scripts (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/889512 (owner: 10Nmaphophe)
[15:10:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:10:42] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) >>! In T329073#8672931, @Andrew wrote: > the following hosts paged during this maintenance: >  >  > ` > NodeDown wmcs clo...
[15:10:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:10:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:11:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:11:45] <icinga-wm>	 PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:13:09] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] error/2.0.0 - add dt field [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/893518 (https://phabricator.wikimedia.org/T330918) (owner: 10Ottomata)
[15:13:17] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:20] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10MoritzMuehlenhoff)
[15:13:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:13:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:45] <nfraison>	 btullis are you restarting them?
[15:13:56] <wikibugs>	 (03Merged) 10jenkins-bot: error/2.0.0 - add dt field [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/893518 (https://phabricator.wikimedia.org/T330918) (owner: 10Ottomata)
[15:14:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:11] <btullis>	 I've restarted the two resourcemanagers on an-master100[1-2]
[15:14:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1094 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:23] <icinga-wm>	 PROBLEM - Check systemd state on analytics1064 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:25] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1084 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:43] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:43] <icinga-wm>	 PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:43] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1133 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:43] <btullis>	 Now I'm looking at an individual worker node. Are those ports still not open? Shall we jump on a call?
[15:14:44] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1101 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:47] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1127 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:48] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:14:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1142 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:00] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1145 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1101 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:06] <icinga-wm>	 PROBLEM - Check systemd state on analytics1060 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:08] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1138 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:15] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:16] <nfraison>	 seems that restart command on master1001 has not being taken in account
[15:15:16] <nfraison>	 RM process started at 14:43
[15:15:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:17] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1094 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:17] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:19] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:20] <icinga-wm>	 PROBLEM - Check systemd state on analytics1065 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:20] <nfraison>	 force killing it
[15:15:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:22] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1145 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:24] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1092 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:25] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:26] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:15:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:16:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:16:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1142 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:16:15] <nfraison>	 an-master1002 is the active one now and port is well open
[15:16:17] <wikibugs>	 (03CR) 10Nmaphophe: [V: 03+2] GDI Equity Landscape Tables/Scripts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/889512 (owner: 10Nmaphophe)
[15:16:19] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1062 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:01] <icinga-wm>	 PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:18] <btullis>	 Great. Are you doing the same for an-master1001?
[15:17:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:35] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:35] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:46] <nfraison>	 I force kill an-master1001 whch is why master1002 is active
[15:17:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:14] <nfraison>	 unless specific requirement I would keep 1002 as active
[15:18:25] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:39] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:41] <icinga-wm>	 RECOVERY - Check systemd state on analytics1071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:42] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:43] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:44] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:44] <nfraison>	 all NM are up
[15:18:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1101 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:46] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:19:04] <btullis>	 Yep, the only reason would be the web ui for yarn.wikimedia.org 
[15:19:22] <btullis>	 Maybe we should reboot an-master1001 while we have a chance. It's been up for 316 days.
[15:20:28] <nfraison>	 Don't think the issue is related
[15:20:28] <nfraison>	 So if the UI only point to 1001 we must fallback to it
[15:20:40] <nfraison>	 Doing it
[15:21:02] <btullis>	 Yep, the ui only points to 1001
[15:21:28] <btullis>	 Whenever you think we're ready, here is the patch to re-enable gobblin.  https://gerrit.wikimedia.org/r/c/operations/puppet/+/895239
[15:21:57] <nfraison>	 yarn RM back on 1001
[15:22:47] <btullis>	 Would be nice to make that depend on either host being the active RM. It's been that way since before my time here.
[15:23:11] <nfraison>	 fine to me for enabling back gobblin
[15:23:47] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:23:52] <btullis>	 !log re-enabling ingestion via gobblin.
[15:23:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:24:41] <btullis>	 I think those four timers will probably run at 16:00 UTC
[15:25:17] <nfraison>	 let's create a ticket for that
[15:25:17] <nfraison>	 not completely sure to know how to do this without some service discovery mechanism but we can investigate it
[15:25:39] <nfraison>	 ^ <- Would be nice to make that depend on either host being the active RM. It's been that way since before my time here.
[15:26:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1144 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:27:22] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10calbon) a:03achou
[15:28:16] <btullis>	 Yep, agreed. It would be nice if we could make it yarn.wm.o properly HA it without adding another two servers.
[15:28:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:29:10] <btullis>	 https://www.irccloud.com/pastebin/JzXpBVh3/
[15:29:26] <nfraison>	 https://phabricator.wikimedia.org/T331446
[15:29:26] <nfraison>	 Yep it is just that RM do some redirect to the host being active so would require to think about it
[15:31:13] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10BTullis)
[15:37:18] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) Nice!  Quick naming bikeshed: is `score` the best name for this field?  Is that a generall...
[15:38:57] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10EBernhardson) >>! In T327970#8671765, @Ottomata wrote: > Re https://gerrit.wikimedia.org/r...
[15:39:17] <wikibugs>	 10Data-Engineering-Planning: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448 (10BTullis)
[15:39:57] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata)
[15:40:01] <btullis>	 nfraison: I created T331448 quickly. Feel free to add more detail and we can discuss how/when to fit it in.
[15:40:02] <stashbot>	 T331448: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448
[15:43:44] <wikibugs>	 10Data-Engineering-Planning: Make YARN web interface work with both primary and standby resourcemanager - https://phabricator.wikimedia.org/T331448 (10nfraison) FI dns repo:  ` yarn             1D IN CNAME dyna.wikimedia.org. `  With that dyna abstraction to be discussed with traffic team  ` ; "dyna.wikimedia.or...
[15:45:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:21] <nfraison>	 btullis: everything looks fine. We can probably communicate or there is still few things to start/check?
[15:46:18] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:47:04] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Isaac) > Do we like this scores field? We have the opportunity to do whatever we want here, so let's...
[15:48:10] <btullis>	 nfraison: I think it's good as well. 2 of the gobblin timers have fired already, including webrequest.
[15:48:15] <btullis>	 https://www.irccloud.com/pastebin/ZdscT8lD/
[15:50:28] <joal>	 thanks a lot folks for making the thing work
[15:50:35] <joal>	 What was the issue with nodemanagers?
[15:52:21] <btullis>	 The resourcemanagers got automatically restarted on an-master100[1-2] because my disabling of puppet didn't stick for some reason.
[15:52:42] <joal>	 ok  - and that broke NM?
[15:52:48] <nfraison>	 Not completely clear for now
[15:52:48] <nfraison>	 the an-master1001 was "active" but never open the 8031 socket which is normally open once all app are loaded from zk
[15:52:48] <nfraison>	 I suspect that having the NN in safemode while the RM restart leads to this bad state
[15:53:05] <btullis>	 While they were running, they didn't open port 8031 properly. When we tried to restart them, one of the processes was stuck.
[15:53:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:53:26] <btullis>	 We had to `kill -9` it (I think) and then it came back OK.
[15:53:37] <joal>	 There normally is no link between NodeManagers and NameNode - weird
[15:54:06] <btullis>	 Yeah, we haven't had a chance to dig into it in depth yet. 
[15:54:34] <btullis>	 Gobblin timers are redeployed, so hopefully ingestion should be catching up.
[15:54:35] <nfraison>	 There is one RM manage some renewal of hadoop delegation token (at least probably some other also)
[15:54:35] <nfraison>	 The NN must be available to make it work
[15:56:09] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10xcollazo) >>! In T327970#8673139, @EBernhardson wrote: >>>! In T327970#8671765, @Ottomata...
[15:59:10] <joal>	 Hm, so it'd be kerberos related nfraison - Meaning that kerberos creates a dependency from Yarn to HDFS (for delegation tokens) - is that right?
[16:01:09] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:01:19] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Isaac) > is score the best name for this field? Is that a generally used term for ML predictions?  I...
[16:03:20] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10Ottomata) > we need is the ability to ship this information in a job into the yarn cluster...
[16:06:34] <wikibugs>	 10Data-Engineering-Planning, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite)
[16:12:59] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10Ottomata) > Seems like we don't have a robust mechanism to share secrets. Airflow does pro...
[16:16:11] <nfraison>	 I suspect more an action happening when loading jobs from the zk like the RM trying to renew the hdfs delegation token (or another action like that one link with hdfs) and finally being stuck
[16:16:11] <nfraison>	 but it should have resume once NN was not anymore in active mode
[16:16:11] <nfraison>	 I didn't take time to get threaddump to see why it was stuck. looking at log files to see if I can identify the pattern
[16:16:18] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job eventlogging_legacy was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[16:21:18] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[16:22:55] <joal>	 Gobblin has finished catching up, webrequest-load jobs have started - we're back on track, with some (expected) delay
[16:23:24] <wikibugs>	 (03PS3) 10Jennifer Ebe: Create_mediacounts_archive_hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621
[16:24:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:17] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) > maybe it just makes sense to have three separate schema (one for classification models,...
[16:31:15] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10xcollazo) >>! In T327970#8673370, @Ottomata wrote: >> Seems like we don't have a robust me...
[16:31:51] <wikibugs>	 (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/891621 (owner: 10Jennifer Ebe)
[16:32:23] <joal>	 btullis: could it be true that the test-cluster is in standby mode?
[16:32:36] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:32:36] <icinga-wm>	 RECOVERY - Checks that the airflow database for airflow search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:35:08] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[16:40:03] <nfraison>	 joal: no the an-test-master1002 is well inactive mode and not in safemode
[16:43:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:46] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:55:57] <xcollazo>	 !log deployed image-suggestions hotfix to platform_eng Airflow instance. See https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/262.
[16:55:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:59:15] <wikibugs>	 10Data-Engineering, 10Product-Analytics: Allow connections to presto UI port - https://phabricator.wikimedia.org/T331455 (10Ottomata)
[17:23:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:28:33] <btullis>	 joal: Safe mode is off on the test cluster.
[17:28:39] <btullis>	 https://www.irccloud.com/pastebin/Ndlpl50O/
[17:29:06] <btullis>	 ...but we are still running with an-test-master1002 as the active namenode.
[17:34:52] <joal>	 hm - maybe there's something wrong in config, cause my oozie jobs fail on the test-cluster
[17:34:56] <joal>	 :(
[17:38:49] <btullis>	 I'll restart the resourcemanager services. Maybe they got stuck in the same way as the prod cluster.
[17:41:10] <btullis>	 So currently, an-test-master1002 is the active resourcemanager - but on this occasion port 8031 is open on this host. That doesn't match what we saw earlier on the prod cluster.
[17:41:14] <btullis>	 https://www.irccloud.com/pastebin/5FFOqLj9/
[17:41:33] <btullis>	 joal: How are your oozie jobs failing?
[17:42:13] <joal>	 btullis: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
[17:42:23] <joal>	  org.apache.hadoop.security.SaslRpcClient.saslConnect
[17:43:41] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:44:03] <btullis>	 OK, so it's definitely related to the namenode, not the resourcemanager. I won't restart them as I said above.
[17:44:41] <btullis>	 Is it possible that we have an-test-master1001 hard-coded somewhere, because that is the standby namenode at the moment - on purpose.
[17:44:50] <btullis>	 Is this on an-test-client1001 ?
[17:45:15] <joal>	 btullis: that's very possible btullis 
[17:46:34] <btullis>	 joal: /etc/hadoop/conf/hdfs-site.xml looks ok. It contains these properties.
[17:46:48] <btullis>	 https://www.irccloud.com/pastebin/tvIsi7GY/
[17:47:54] <joal>	 btullis: it feels kerberos related in same ways the prod cluster was - I wonder how we should to tackle
[17:50:44] <btullis>	 joal: Agreed. I'm sort of hesistant to  rush to fix it this afternoon though.  We were hoping to test this change before a failback to an-[test-]master1001 tomorrow: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895127/3/modules/bigtop/templates/hadoop/hdfs-site.xml.erb
[17:51:42] <joal>	 Right - no problem in keeping as it is now, it's just non-function
[17:52:41] <btullis>	 And is this just when you're submitting a job, or do you get the same when querying oozie from the CLI?
[17:52:55] <joal>	 it's jobs oozie run that fail
[17:53:10] <joal>	 the webrequest one we fixed yesterday with Andrew
[17:53:18] <joal>	 no big deal, it's test cluster
[17:54:22] <wikibugs>	 10Data-Engineering-Planning, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate DB connection issues faced from airflow on an-launcher1002 - https://phabricator.wikimedia.org/T331265 (10JAllemandou) Let's wait for Airflow 2.5 - we'll migrate those jobs to Airflow and they'll run through Sk...
[17:54:27] <wikibugs>	 (03CR) 10Joal: [C: 04-1] "-1 because of one missing field - otherwise 2 nits." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/894661 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu)
[17:54:46] <joal>	 aqu: Just finalized my review - Happy to discuss when you wish
[17:58:16] <btullis>	 joal: Would it help if I restarted to oozie service on an-test-coord1001 ? Would you be able to try a rerun then?
[17:58:36] <joal>	 sure btullis `
[17:59:52] <joal>	 btullis: this can wait tomorrow - please take some time off, we'll fix that tomorrow
[18:04:55] <nfraison>	 how do you access the oozie workflow/ui?
[18:05:10] <joal>	 no UI for this one - CLI only nfraison :S
[18:06:25] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (10Ottomata) I was able to use `trigger_release` CI job [[ https://gitlab.wikimedia.org/repos/data-engineering/e...
[18:09:11] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (10Ottomata) Can we use that in https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-...
[18:09:58] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10Machine-Learning-Team, 10Research: Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10diego)  > And for the revert-risk model: > ` > score: >     model_name: revertrisk >     model_versi...
[18:13:50] <joal>	 ok - I don't know what happened, but the latest job has succeeded - tryin to rerun one past failed instnce
[18:14:22] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:15:29] <nfraison>	 For me the standby issue is just a WARN it is expected:
[18:15:29] <nfraison>	 2023-03-07 17:13:22,810 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
[18:15:29] <nfraison>	 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[18:15:30] <nfraison>	 Hdfs first try on master1001 which indicate it is standby and then the job contact the other NN
[18:15:30] <nfraison>	 At least it is for that job application_1678199022461_0269
[18:16:04] <joal>	 makes sense nfraison
[18:16:45] <nfraison>	 but I'm no seeing the error which leads oozie to put status ERROR so far
[18:16:45] <nfraison>	 0035319-220707095645518-oozie-oozi-W@generate_sequence_statistics             ERROR     job_1678199022461_0269 FAILED/KILLED2         
[18:19:29] <joal>	 nfraison: my rerun of the previously failed instance has passed the point where it was failing
[18:19:48] <joal>	 I'm not sure if hosts have been restarted, but something has fixed it
[18:24:19] <nfraison>	 2023-03-07 17:13:31,479  WARN Hive2ActionExecutor:523 - SERVER[an-test-coord1001.eqiad.wmnet] USER[analytics] GROUP[-] TOKEN[] APP[webrequest-load-wf-test_text-2023-3-6-7] JOB[0035319-220707095645518-oozie-oozi-W] ACTION[0035319-220707095645518-oozie-oozi-W@generate_sequence_statistics] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
[18:24:19] <nfraison>	 Ok I found it we had face a Metaspace OOM linked to the change I'm testing on hive test cluste
[18:24:19] <nfraison>	 And I restarted the hiveserver2 in test in between to validate some other JVM parameter as it was also leading to some big GC
[18:24:19] <nfraison>	 I will need to retune this :(
[18:24:19] <nfraison>	 sry for the impact
[18:24:47] <nfraison>	 looki,g to it tomorrow at least to increase the max size and see how it goes
[18:25:16] <joal>	 ack nfraison - thanks for finding it
[18:25:52] <wikibugs>	 10Data-Engineering-Planning, 10Patch-For-Review, 10Shared-Data-Infrastructure (Shared-Data-Infra Sprint 09): Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10nfraison) Need to recheck the setting as it leads to OOM in Metaspace :(
[18:35:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:59:18] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Pageviews-Anomaly, 10Product-Analytics: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (10kzimmerman)
[18:59:21] <wikibugs>	 10Data-Engineering: Improve pageview automated traffic detection heuristics - https://phabricator.wikimedia.org/T280565 (10kzimmerman)
[19:05:42] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:08:35] <wikibugs>	 10Data-Engineering: Improve pageview automated traffic detection heuristics - https://phabricator.wikimedia.org/T280565 (10kzimmerman)
[19:08:37] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Pageviews-Anomaly, 10Product-Analytics: Massive spike in pageviews for a few enwiki pages beginning with "Index" - https://phabricator.wikimedia.org/T327027 (10kzimmerman) 05Open→03Declined I've associated this with the ticket about improving automated d...
[20:09:42] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work), 10Patch-For-Review: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10RKemper)
[20:11:10] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10RKemper)
[20:11:41] <wikibugs>	 10Data-Engineering-Planning, 10Data Pipelines, 10Discovery-Search (Current work): Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 (10RKemper)
[20:50:27] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:21:32] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:39:22] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 09): mediawiki-event-enrichment should support the latest eventutilities-python changes - https://phabricator.wikimedia.org/T330994 (10Ottomata) Yes we can!   https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merg...
[21:41:37] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:03:33] <mforns>	 !log deployed airflow analytics again to try and fix druid_load_edit_hourly
[22:03:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:04:26] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:15:12] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:47:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring