[00:00:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:13:22] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:41:31] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye
[01:28:09] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans)
[01:34:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:35:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:46] <wikibugs>	 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:07:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:08:46] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:06] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:24:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:44:10] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) a:03Eevans
[04:13:23] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[06:03:04] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10CodeReviewBot) stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/544  switch druid host to index to the druid-public cluster and datahub ingestion.
[06:24:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:23] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:27:27] <wikibugs>	 (03Abandoned) 10Brouberol: Replace an-druid1001 by an-druid1002 in druid connection strings [analytics/refinery] - 10https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol)
[08:27:56] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:28:55] <brouberol>	 btullis: FYI we can see the effect of renewing the skein certificate every week https://grafana-rw.wikimedia.org/d/980N6H7Iz/skein-certificate-expiry?orgId=1&from=1699622203125&to=1700641712251
[08:29:06] <brouberol>	 that's one less thing to worry about
[09:16:41] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Antoine_Quhen)
[09:23:16] <jinxer-wm>	 (EventgateValidationErrors) firing: ...
[09:23:16] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[09:23:43] <wikibugs>	 10Data-Platform-SRE: Service implementation for wdqs1017-1020 - https://phabricator.wikimedia.org/T351671 (10Gehel) p:05Triage→03High
[09:24:28] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel) p:05Triage→03High
[09:24:37] <wikibugs>	 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (10Gehel) p:05Triage→03High
[09:24:39] <wikibugs>	 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10Gehel) p:05Triage→03High
[09:26:02] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10Gehel) p:05High→03Medium
[09:32:15] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10Gehel) p:05Triage→03Low
[09:32:53] <wikibugs>	 10Data-Platform-SRE, 10Scap, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Priority Backlog 📥): wdqs: replace git-fat with git-lfs - https://phabricator.wikimedia.org/T316876 (10Gehel) p:05Triage→03Low
[09:35:55] <wikibugs>	 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen) The puppet configuration is now merged, and statsd_exporter is running on an-test-client1002. Analytics Prometheus is scraping from it, as it should....
[09:37:17] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Antoine_Quhen)
[09:37:21] <wikibugs>	 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen)
[09:38:16] <jinxer-wm>	 (EventgateValidationErrors) resolved: ...
[09:38:16] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[09:41:33] <wikibugs>	 10Analytics-Radar, 10Data-Engineering-Radar, 10VisualEditor, 10VisualEditor-MediaWiki-Templates, and 7 others: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (10thiemowmde)
[09:41:35] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Migrate MjoLniR deploy repo to Gitlab - https://phabricator.wikimedia.org/T350043 (10Gehel) p:05Medium→03Low
[09:43:25] <wikibugs>	 10Data-Platform-SRE: Upgrade hadoop standby master to bullseye - https://phabricator.wikimedia.org/T332578 (10Gehel) p:05Triage→03High
[09:43:54] <wikibugs>	 10Data-Platform-SRE: [DataHub] Users are redirected to the wrong screen on logout and from certain urls. - https://phabricator.wikimedia.org/T347149 (10Gehel) p:05Triage→03Low
[09:44:49] <wikibugs>	 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) p:05Triage→03High
[09:46:18] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10HinMar) I would like to share the information that our project will expire at the end of 2023. We would like to include federated queries in...
[09:48:30] <wikibugs>	 10Data-Platform-SRE: Confirm TLS certificate monitoring is in place for Search Platform-owned domains - https://phabricator.wikimedia.org/T343761 (10Gehel) p:05Triage→03Medium
[09:49:45] <jinxer-wm>	 (EventgateValidationErrors) firing: ...
[09:49:46] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[09:49:48] <wikibugs>	 10Data-Platform-SRE: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (10Gehel) p:05Triage→03Medium
[09:50:04] <wikibugs>	 10Data-Platform-SRE, 10Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (10Gehel) p:05Triage→03Medium
[09:50:30] <wikibugs>	 10Data-Platform-SRE, 10Observability-Alerting, 10observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (10Gehel) p:05Triage→03Low
[09:50:51] <wikibugs>	 10Data-Platform-SRE: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10Gehel) p:05Triage→03High
[09:50:56] <wikibugs>	 10Data-Platform-SRE: Upgrade an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 (10Gehel) p:05Triage→03High
[09:51:04] <wikibugs>	 10Data-Platform-SRE: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (10Gehel) p:05Triage→03High
[09:54:16] <brouberol>	 good news! We *already* have replication factor exported as a metric
[09:54:16] <brouberol>	 (https://thanos.wikimedia.org/graph?g0.expr=max+by+%28topic%29+%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%2C+topic%3D%22webrequest_text%22%7D%29&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D), making
[09:54:16] <brouberol>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/975291 obsolete
[09:55:43] <brouberol>	 so the topics with RF=1 are visible via https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28topic%29+%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%7D%29%29+%3C+2&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
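The Thanos link above carries its PromQL expression URL-encoded in the `g0.expr` query parameter. As a reading aid, here is a minimal Python sketch that decodes it using only the standard library. The helper name `extract_promql` is hypothetical and is not part of any Wikimedia tooling.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied verbatim from the message above.
RF1_URL = (
    "https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28topic%29+"
    "%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%7D%29%29"
    "+%3C+2&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s"
    "&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D"
)


def extract_promql(url: str, panel: str = "g0") -> str:
    """Return the decoded PromQL expression of one graph panel in a Thanos/Prometheus URL."""
    query = parse_qs(urlsplit(url).query)  # decodes %XX escapes and '+' as space
    return query[f"{panel}.expr"][0]


if __name__ == "__main__":
    print(extract_promql(RF1_URL))
```

Decoded, the expression reads `(max by (topic) (kafka_cluster_Partition_ReplicasCount{cluster="kafka_jumbo"})) < 2`, which selects the kafka_jumbo topics whose maximum replica count is below 2.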
[09:55:49] <elukey>	 brouberol: yep please always check the JMX config before filing python changes :D
[09:56:10] <elukey>	 JMX/MBeans/Prometheus-exporter I meant
[09:56:38] <brouberol>	 haha yep. I got bitten by automatism here. older versions of kafka didn't export this, meaning I just assumed it wasn't there
[09:59:41] <elukey>	 brouberol: yes yes np :) Another bit of info to be aware of - we don't export, IIRC, everything that the Kafka MBeans expose, we explicitly only allow what's necessary
[09:59:56] <elukey>	 it is all stored in the prometheus jmx exporter's config in puppet
[10:00:04] <elukey>	 so in case a metric is not there, what I usually do is
[10:00:10] <elukey>	 1) check if any mbean exposes it
[10:00:22] <elukey>	 2) check the prometheus exporter config allowed list
[10:00:23] <elukey>	 etc..
[10:00:38] <elukey>	 it is probably a little cumbersome, but it is the current workflow
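Step 2 of the workflow above can be sketched in Python, under assumptions. The pattern below is illustrative only and is not taken from the actual puppet config. The bean string follows the `domain<properties><keys>attribute` shape that the Prometheus JMX exporter matches its rule patterns against. `ALLOWED_PATTERNS` and `is_exported` are hypothetical names.

```python
import re

# Hypothetical allowlist, written in the style of jmx_exporter `rules[].pattern`
# entries. The real list lives in the exporter config in puppet.
ALLOWED_PATTERNS = [
    r"kafka\.cluster<type=Partition, name=ReplicasCount, topic=(.+), partition=(.+)><>Value",
]


def is_exported(bean_attribute: str, patterns=ALLOWED_PATTERNS) -> bool:
    """Return True if any allowlist pattern matches the given bean attribute string.

    re.search is used because exporter patterns are assumed to be unanchored.
    """
    return any(re.search(pattern, bean_attribute) for pattern in patterns)


if __name__ == "__main__":
    candidate = (
        "kafka.cluster<type=Partition, name=ReplicasCount, "
        "topic=webrequest_text, partition=0><>Value"
    )
    print(is_exported(candidate))
```

If the MBean exists but no pattern matches it, the metric will not show up in Prometheus until the allowlist is extended.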
[10:02:06] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) My work on adding a metric exporter script was useless, as we already have [[ https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28cluster%2C+topic%29+%28kaf...
[10:02:40] <brouberol>	 ack, thanks
[10:03:05] <stevemunene>	 o/ brouberol we're at the sync
[10:04:16] <brouberol>	 oops omw
[10:16:26] <volans>	 hello! Would it be possible to get some support in testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976163 once merged? Maybe with DRY-RUN runs?
[10:24:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:08] <wikibugs>	 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10BTullis) a:05BTullis→03None
[10:25:32] <wikibugs>	 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10BTullis) a:05BTullis→03None
[10:25:52] <wikibugs>	 10Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (10BTullis) a:05BTullis→03None
[10:29:46] <jinxer-wm>	 (EventgateValidationErrors) resolved: ...
[10:29:46] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[10:31:41] <brouberol>	 speaking of topics with RF=1 elukey, do you know whether we rely on the ksql-* topics for anything, or if they were created at some point and are now unused?
[10:32:32] <elukey>	 brouberol: IIRC it was a test made by Andrew a long time ago
[10:32:39] <elukey>	 I am not aware of any usage of ksql
[10:33:23] <brouberol>	 thanks! Before deleting anything, I'll make sure he sends a +1
[10:36:40] <btullis>	 volans: Sure, we can help with that testing. The lowest risk cookbook of those to run is `cookbooks/sre/presto/reboot-workers.py` which we can do without a DRY-RUN.
[10:38:42] <volans>	 btullis: great, lmk what's the best way to coordinate and feel free to take the time to review the changes
[10:42:17] <btullis>	 Cool, I'm reviewing now. I think that b.king and r.kemper would probably be best placed to check the wdqs data transfer cookbook.
[10:45:47] <volans>	 yep I added ryan to the CR too for that one
[10:48:41] <btullis>	 I'll happily run some of them when it's merged and let you know. I can't see us being in a position to run the `hadoop.upgrade-bigtop-distro` nor `hadoop.change-distro-from-cdh` live for a while, but the others should be easy enough to fit in soon.
[10:49:38] <volans>	 a dry-run should in most cases be enough (assuming it doesn't bail out earlier for some reason)
[10:50:04] <btullis>	 Ack.
[10:51:52] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) p:05Triage→03Medium
[10:54:07] <volans>	 btullis: thanks! So the plan is to wait for feedback on the wdqs ones, merge and test together?
[10:56:15] <jinxer-wm>	 (EventgateValidationErrors) firing: ...
[10:56:16] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[10:56:17] <btullis>	 volans: Yes, I think so, unless you're in a hurry. r.kemper is in the US West Coast timezone, so should be able to approve later today.
[10:56:31] <volans>	 nah no hurry at all
[10:57:50] <btullis>	 Cool. I'll make sure to mention it to them later, if I can.
[11:01:28] <volans>	 thanks
[11:06:16] <jinxer-wm>	 (EventgateValidationErrors) resolved: ...
[11:06:16] <jinxer-wm>	 eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[11:14:24] <wikibugs>	 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10Clement_Goubert) >>! In T351716#9349220, @Ottomata wrote: > I think the two namespaces two helmfile option is better in this case.  I thought there was a example of different he...
[11:44:46] <wikibugs>	 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) Great, thanks @Ottomata and @Clement_Goubert for this advice, that makes perfect sense.  So we'll start out with **two namespaces**, **two helmfile service directories*...
[11:45:04] <wikibugs>	 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) 05Open→03Resolved
[11:45:08] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis)
[11:51:19] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis)
[12:09:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:23] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:17:13] <brouberol>	 heads-up, I'm going to reimage an-druid1001 (the last one to go)
[12:17:32] <btullis>	 brouberol: Ack, thanks.
[12:19:50] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye
[12:21:43] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10Patch-For-Review, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) I have added the required configuration to our apache2 site template for matomo in 976686: Add another p...
[12:22:44] <btullis>	 !log applying security patches to postgres13 on an-db1001
[12:22:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:26:30] <wikibugs>	 10Data-Platform-SRE: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (10BTullis) a:05BTullis→03None
[12:35:06] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) I have merged the [[https://gerrit.wikimedia.org/r/896363|patch to the spark images]] and I've triggered a manual rebuild of the production images repo, as pe...
[12:53:41] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye completed: - an-druid1001 (**PASS**)...
[13:06:27] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10brouberol) 05In progress→03Resolved
[13:06:30] <wikibugs>	 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol)
[13:06:51] <brouberol>	 an-druid is now fully running on Bullseye
[13:09:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:50] <wikibugs>	 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10brouberol) a:03brouberol
[13:25:08] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:10] <wikibugs>	 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10brouberol) a:03brouberol
[13:31:52] <wikibugs>	 10Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (10gmodena) > As a low hanging fruit, we could take the union of both tables (in their current shape) and generate a single data q...
[13:34:08] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10brouberol)
[13:34:58] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol)
[13:38:43] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10brouberol) @BTullis @Stevemunene I was reading https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos/Adminis...
[13:57:23] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10Patch-For-Review, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) Oh, it looks like that's not working. I deployed that change, but it seems that I am still receiving 302...
[14:19:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:55] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2001.codfw.wmnet with OS bullseye
[14:20:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1087 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:28] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:54] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:24:16] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:32:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:32:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:32:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:38:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:38:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:39:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:41:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:32] <wikibugs>	 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537  Ad...
[14:51:48] <wikibugs>	 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537  Add statsd as a dependency to our setup
[14:51:56] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537  Add statsd as a depende...
[14:52:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:56:45] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2001.codfw.wmnet with OS bullseye completed: - aqs2001 (**PASS**)   - Downtimed on Icinga/Alertmanage...
[14:56:57] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans)
[14:57:49] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2002.codfw.wmnet with OS bullseye
[14:58:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:04] <wikibugs>	 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen) The workaround for our GitLab CI problem is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537
[15:04:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:29] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:07] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (10Ladsgroup) The new term tables in commons (wbt_*) should be empty to my knowledge. Is there a reason to make them visible? Or d...
[15:09:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:10] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10brouberol) Since we started receiving alerts for every puppet failure in `#wikimedia-analytics`, the channel is really starting to be almost bot/alerts-only, whi...
[15:14:44] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:19:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:19:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:22:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:22:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:23:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:01] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10brouberol) a:03brouberol
[15:26:27] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10BTullis) Sounds good. Here are my own knee-jerk responses :-) Personally, I would vote for `#wikimedia-data-platform` - it needn't be so focused on SREs that we h...
[15:27:02] <wikibugs>	 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10brouberol) @gmodena let's sync when you want to increase the partition count, and I'll happily oblige!
[15:27:13] <wikibugs>	 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10brouberol)
[15:29:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:43] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:31:00] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10brouberol) > This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale,...
[15:33:03] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:36:04] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2002.codfw.wmnet with OS bullseye completed: - aqs2002 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[15:39:22] <btullis>	 !log updating default airflow configuration with https://gerrit.wikimedia.org/r/c/operations/puppet/+/976700
[15:39:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:40:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:42:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:43:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:44:03] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:32] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye
[15:49:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:53:58] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) I have installed the new test airflow 2.7.3 deb to an-test-client1002 with the following commands. ` btullis@an-test-client1002:...
[15:56:25] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) Executed the database migrations with: ` btullis@an-test-client1002:~$ sudo -u analytics airflow-analytics_test db upgrade /usr/...
[15:56:39] <wikibugs>	 10Data-Engineering, 10Data-Services, 10cloud-services-team: Surface Temporary user information to Cloud Wiki Replicas - https://phabricator.wikimedia.org/T346679 (10taavi) 05Open→03Resolved
[15:57:47] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (3) airflow-kerberos@analytics_test.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:57:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:58:30] <wikibugs>	 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) Now I see metrics being sent. ` btullis@an-test-client1002:~$ sudo tcpdump -i lo port 9125 tcpdump: verbose output suppressed, u...
[15:58:31] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:54] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye executed with errors: - aqs2003 (**FAIL**)   - Downtimed on Icinga/...
[16:02:15] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye
[16:07:47] <jinxer-wm>	 (SystemdUnitCrashLoop) resolved: (3) airflow-kerberos@analytics_test.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:09:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:10:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:10:14] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:13:23] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:14:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:21:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:21:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:30] <joal>	 We currently have 2 running spark jobs doing the same thing (XMLDumpsConverter) - This is due to the skein launcher having died, but not the child job, and airflow is retrying
[16:26:37] <joal>	 I'm killing the job with no launcher
[16:27:39] <joal>	 !log Kill duplicated XMLDumpsConverter 
[16:27:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:36:48] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans)
[16:40:24] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye completed: - aqs2003 (**WARN**)   - Removed from Puppet and PuppetD...
[16:43:17] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2004.codfw.wmnet with OS bullseye
[16:56:22] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:48] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:59:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:25] <wikibugs>	 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis)
[17:20:48] <wikibugs>	 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis)
[17:23:40] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2004.codfw.wmnet with OS bullseye completed: - aqs2004 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[17:26:46] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans)
[17:27:43] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2005.codfw.wmnet with OS bullseye
[17:28:57] * brouberol waves good evening 
[17:57:33] <wikibugs>	 Data-Platform-SRE, Discovery-Search (Current work): CirrusSearch: make p95 alerts more granular - https://phabricator.wikimedia.org/T349340 (RKemper) Here's a [[ https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=(filters:!(),refreshInterval:(pause:!t,value:0),tim...
[18:03:49] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2005.codfw.wmnet with OS bullseye completed: - aqs2005 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[18:16:41] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye
[18:26:17] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH) attempting to reimage an-worker1160 it sticks at requesting a lease for boot, host shows the MAC of the eth0 attempting to request a dhcp lease for boot.  on insta...
[18:29:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:29:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1152 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:32:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:06:40] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2006.codfw.wmnet with OS bullseye
[19:26:17] <wikibugs>	 Analytics, Data-Engineering, SRE, Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (brouberol) I wonder if something as simple as round robin DNS implemented with multiple A records with the same subdomain would suffice  to substantially improve the...
[19:36:56] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye executed with errors: - an-worker1...
[19:48:12] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2006.codfw.wmnet with OS bullseye completed: - aqs2006 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[19:56:37] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2007.codfw.wmnet with OS bullseye
[19:57:14] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans)
[20:13:23] <jinxer-wm>	 (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:33:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:58] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2007.codfw.wmnet with OS bullseye completed: - aqs2007 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[20:34:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:35:53] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2008.codfw.wmnet with OS bullseye
[20:37:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:38:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:35] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye
[21:28:58] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2008.codfw.wmnet with OS bullseye completed: - aqs2008 (**WARN**)   - Downtimed on Icinga/Alertmanage...
[21:30:10] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2009.codfw.wmnet with OS bullseye
[21:31:13] <wikibugs>	 (CR) Jgreen: Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging  (1 comment) [analytics/kafkatee] - https://gerrit.wikimedia.org/r/961174 (owner: Jgreen)
[22:05:36] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2009.codfw.wmnet with OS bullseye completed: - aqs2009 (**PASS**)   - Downtimed on Icinga/Alertmanage...
[22:06:33] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:07:19] <wikibugs>	 Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[22:20:22] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye executed with errors: - aqs2010 (**FAIL**)   - Downtimed on Icinga/...
[22:20:40] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:43:05] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye executed with errors: - aqs2010 (**FAIL**)   - Removed from Puppet...
[22:43:23] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:43:37] <wikibugs>	 Data-Engineering, Product-Analytics, Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Sharvaniharan) @Ottomata  Thank you for looping me in.  @SNowick_WMF , @mpopov Thank you for summarizing our update issues with Migration. Our complete m...
[23:20:56] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans)
[23:24:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:26:12] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye completed: - aqs2010 (**WARN**)   - Removed from Puppet and PuppetD...
[23:26:28] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:48] <icinga-wm>	 PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:08] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:42] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:59] <wikibugs>	 Data-Engineering, Product-Analytics, Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (mpopov) @Sharvaniharan: once the legacy EL system is turned off, the tables associated with non-migrated legacy EL schemas will stop getting data. The da...
[23:41:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:14] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:26] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:52] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:56] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:22] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:46:56] <icinga-wm>	 PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (19) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:52:04] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:50] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:54:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (20) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed