[03:03:28] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[03:03:28] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[04:24:44] <jinxer-wm>	 (SystemdUnitFailed) firing: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) kube-controller-manager.service Failed on dse-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:53:48] <wikibugs>	 Data-Engineering, Data Engineering and Event Platform Team (Sprint 1), Event-Platform, Patch-For-Review: mediawiki page_content_change should generate new meta.id field - https://phabricator.wikimedia.org/T341277 (CodeReviewBot) tchin opened https://gitlab.wikimedia.org/repos/data-engineering/med...
[08:43:38] <wikibugs>	 Data-Platform-SRE, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342762 (dcausse) There was a stale `/srv/query_service/aliases.map` file with some content in it (that I copied to `/root/aliases.map.T342762`) which I believe was confusing nginx causing it to r...
[08:54:56] <wikibugs>	 (CR) DCausse: Provide internal schema for CirrusSearch update-pipeline updates. (4 comments) [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[10:00:18] <wikibugs>	 (PS1) Nmaphophe: GDI Equity Landscape Tables [analytics/refinery] - https://gerrit.wikimedia.org/r/941911
[10:29:58] <wikibugs>	 Data-Engineering, Growth-Team, MediaWiki-extensions-EventLogging, Metrics Platform Icebox, and 4 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (phuedx)
[11:03:28] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[11:03:28] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[11:08:03] <wikibugs>	 Data-Engineering, Anti-Harassment, Growth-Team, MediaWiki-extensions-EventLogging, and 5 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (Dreamy_Jazz)
[11:34:58] <wikibugs>	 (PS9) Peter Fischer: Provide internal schema for CirrusSearch update-pipeline updates. [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202)
[11:40:35] <wikibugs>	 (CR) Peter Fischer: "Thanks for the review! I followed your suggestions. I'm not completely happy with the names though. Do we aim for consistency with other s" [schemas/event/primary] - https://gerrit.wikimedia.org/r/856507 (https://phabricator.wikimedia.org/T317202) (owner: Peter Fischer)
[12:37:38] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (RobH)
[12:38:10] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (RobH)
[12:49:44] <jinxer-wm>	 (SystemdUnitFailed) firing: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:50] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hdfs_rsync_analytics_hadoop_published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:41] <wikibugs>	 Data-Platform-SRE, serviceops, Discovery-Search (Current work): Requesting permission to use kafka-main cluster to transport CirrusSearch updates - https://phabricator.wikimedia.org/T341625 (elukey) The stability of the kafka main cluster is now way better, they are not totally rebalanced but this ca...
[13:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:04:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: hdfs_rsync_analytics_hadoop_published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:45] <wikibugs>	 Data-Platform-SRE, sre-alert-triage: 404 from nginx on wcqs2001 - https://phabricator.wikimedia.org/T342762 (bking)
[13:43:14] <wikibugs>	 Data-Engineering, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (jbond) p:Triage→Medium
[13:46:56] <wikibugs>	 Data-Engineering, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (jbond) p:Medium→High
[13:47:30] <jbond>	 hi all could someone take a look at T342878 and make sure the correct people are looped in.  tl;dr subscription expired for GeoIP2-Anonymous-IP
[13:47:31] <stashbot>	 T342878: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878
[13:51:52] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, and 2 others: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Marostegui) I have assigned the recipe already with the above patch.
[13:51:54] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, and 2 others: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Marostegui) @BTullis any reason why this needs AAAA records? The other hosts do not have them and it will likely give some headaches with the m...
[13:58:28] <icinga-wm>	 RECOVERY - Zookeeper Server on flink-zk1003 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[15:03:28] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[15:03:28] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[15:05:08] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (BTullis) >>! In T342862#9048212, @Marostegui wrote: > @BTullis any reason why this needs AAAA records? The other hosts do not have them and it...
[15:05:22] <wikibugs>	 Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (BTullis)
[15:10:45] <btullis>	 jbond: I'll make sure that Olja knows about it and find out if there's any way that we can expedite this and stop it happening again in future.
[15:11:31] <jbond>	 btullis: thanks 
[15:19:35] <wikibugs>	 Data-Engineering, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (odimitrijevic) This dataset is no longer subscribed to.  We should remove the download of the database.
[15:21:04] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, Puppet-Infrastructure: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (odimitrijevic)
[15:41:35] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (BTullis) I did some standard benchmarks with `rados bench` as per the guidance [[https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/administrati...
[15:43:18] <wikibugs>	 Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (MatthewVernon) >>! In T326945#9045370, @BTullis wrote: >>>! In T326945#9045226, @MatthewVernon wrote: >> Apropos your CRUSH rules, it might be worth adding rack/row as well? We have the equival...
[15:45:54] <wikibugs>	 Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) >>! In T326945#9048873, @MatthewVernon wrote: >  > In which case, the time to add those to the CRUSH rules is now - adjusting the CRUSH rule later often ends up involving a log of data...
[15:53:48] <wikibugs>	 Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) I did some standard benchmarks with `rados bench` as per the guidance [[https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/administration_guide/benchmarking_pe...
[15:56:22] <wikibugs>	 Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis) I then did a little bit of testing with `rbd bench` First writing 10 GB in 4 MB chunks and using 16 threads to the SSDs. `  btullis@cephosd1001:~$ sudo rbd bench --io-type write --io-t...
[15:59:55] <wikibugs>	 Data-Engineering, Data-Platform-SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: GeoIP2-Anonymous-IP Subscription expired - https://phabricator.wikimedia.org/T342878 (jbond) Open→Resolved a:jbond >>! In T342878#9048742, @odimitrijevic wrote: > This dataset...
[16:23:58] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (bking) The cluster is up and all nodes appear to have joined correctly; my compliments to whoever wrote the puppet code.  The next step is to get metrics...
[16:26:52] <wikibugs>	 (CR) Sharvaniharan: "Hi Dmitry... Please review when you get a chance :-)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/940266 (owner: Sharvaniharan)
[17:07:39] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (CodeReviewBot) btullis merged https://gitlab.wiki...
[17:17:54] <wikibugs>	 Data-Platform-SRE: Decide on installation details for new ceph cluster - https://phabricator.wikimedia.org/T326945 (BTullis)
[17:20:57] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (CodeReviewBot) dancy opened https://gitlab.wikime...
[17:21:30] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (CodeReviewBot) dancy merged https://gitlab.wikime...
[17:32:47] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (BTullis) Thank you for adding trusted runners to this repo. On first ru...
[17:36:00] <wikibugs>	 Data-Platform-SRE, Data Engineering and Event Platform Team, GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥): Migrate analytics/datahub pipeline to GitLab - https://phabricator.wikimedia.org/T341194 (dancy) >>! In T341194#9049353, @BTullis wrote: > # It looks like there...
[18:28:07] <wikibugs>	 (CR) Milimetric: [V: +2 C: +2] Create a wiki list for Wikifunctions' call to sqoop-mediawiki-tables (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/941985 (https://phabricator.wikimedia.org/T342199) (owner: David Martin)
[18:48:48] <milimetric>	 !log done deploying some simple stuff to refinery (static files and script comment updates)
[18:48:52] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:03:28] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[19:03:28] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability
[19:14:44] <jinxer-wm>	 (SystemdUnitFailed) firing: hadoop-yarn-nodemanager.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:40] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:17:12] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:19:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:24:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:29:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:30:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:34:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:34:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1083:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:36:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:36:41] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1091 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:39:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1115 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:39:35] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:39:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:39:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:44:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-worker1085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:51:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1115 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1115 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:56:55] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:59:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) hadoop-yarn-nodemanager.service Failed on an-worker1115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:44] <wikibugs>	 Data-Engineering: Don't pollute skein logs. Part II. - https://phabricator.wikimedia.org/T342926 (xcollazo)
[23:03:28] <jinxer-wm>	 (MediawikiPageContentChangeEnrichAvailability) firing: ...
[23:03:28] <jinxer-wm>	 Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability