[00:00:44] (SystemdUnitFailed) resolved: hadoop-yarn-nodemanager.service Failed on an-worker1078:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:22] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.293% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:41:31] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye [01:28:09] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [01:34:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:35:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:46] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[02:07:32] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:08:46] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:09:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:06] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:14] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:24:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:10] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) a:03Eevans [04:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [06:03:04] 
10Data-Platform-SRE, 10Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (10CodeReviewBot) stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/544 switch druid host to index to the druid-public cluster and datahub injestion. [06:24:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:27:27] (03Abandoned) 10Brouberol: Replace an-druid1001 by an-druid1002 in druid connection strings [analytics/refinery] - 10https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [08:27:56] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:55] btullis: FYI we can see the effect of renewing the skein certificate every week https://grafana-rw.wikimedia.org/d/980N6H7Iz/skein-certificate-expiry?orgId=1&from=1699622203125&to=1700641712251 [08:29:06] that's one less thing to worry about [09:16:41] 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Antoine_Quhen) [09:23:16] (EventgateValidationErrors) firing: ... 
[09:23:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:23:43] 10Data-Platform-SRE: Service implementation for wdqs1017-1020 - https://phabricator.wikimedia.org/T351671 (10Gehel) p:05Triage→03High [09:24:28] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (10Gehel) p:05Triage→03High [09:24:37] 10Data-Platform-SRE, 10serviceops, 10Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (10Gehel) p:05Triage→03High [09:24:39] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10Gehel) p:05Triage→03High [09:26:02] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10Gehel) p:05High→03Medium [09:32:15] 10Data-Platform-SRE, 10Discovery-Search (Current work): Migrate search-loader into Kubernetes - https://phabricator.wikimedia.org/T346189 (10Gehel) p:05Triage→03Low [09:32:53] 10Data-Platform-SRE, 10Scap, 10Wikidata, 10Wikidata-Query-Service, 10Release-Engineering-Team (Priority Backlog 📥): wdqs: replace git-fat with git-lfs - https://phabricator.wikimedia.org/T316876 (10Gehel) p:05Triage→03Low [09:35:55] 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen) The puppet configuration is 
now merged, and statsd_exporter is running on an-test-client1002. Analytics Prometheus is scraping from it, as it should.... [09:37:17] 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10Antoine_Quhen) [09:37:21] 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen) [09:38:16] (EventgateValidationErrors) resolved: ... [09:38:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:41:33] 10Analytics-Radar, 10Data-Engineering-Radar, 10VisualEditor, 10VisualEditor-MediaWiki-Templates, and 7 others: eventlogging_VisualEditorTemplateDialogUse: '.event.template_names[0]' should be string - https://phabricator.wikimedia.org/T299779 (10thiemowmde) [09:41:35] 10Data-Platform-SRE, 10Discovery-Search: Migrate MjoLniR deploy repo to Gitlab - https://phabricator.wikimedia.org/T350043 (10Gehel) p:05Medium→03Low [09:43:25] 10Data-Platform-SRE: Upgrade hadoop standby master to bullseye - https://phabricator.wikimedia.org/T332578 (10Gehel) p:05Triage→03High [09:43:54] 10Data-Platform-SRE: [DataHub] Users are redirected to the wrong screen on logout and from certain urls. 
- https://phabricator.wikimedia.org/T347149 (10Gehel) p:05Triage→03Low [09:44:49] 10Data-Platform-SRE: Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (10Gehel) p:05Triage→03High [09:46:18] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (10HinMar) I would like to share the information that our project will expire at the end of 2023. We would like to include federated queries in... [09:48:30] 10Data-Platform-SRE: Confirm TLS certificate monitoring is in place for Search Platform-owned domains - https://phabricator.wikimedia.org/T343761 (10Gehel) p:05Triage→03Medium [09:49:45] (EventgateValidationErrors) firing: ... [09:49:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:49:48] 10Data-Platform-SRE: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (10Gehel) p:05Triage→03Medium [09:50:04] 10Data-Platform-SRE, 10Observability-Alerting: Migrate zookeeper prometheus checks from Icinga to Alertmanager - https://phabricator.wikimedia.org/T309012 (10Gehel) p:05Triage→03Medium [09:50:30] 10Data-Platform-SRE, 10Observability-Alerting, 10observability: Create VictorOps config for new Data Platform SRE team - https://phabricator.wikimedia.org/T344202 (10Gehel) p:05Triage→03Low [09:50:51] 10Data-Platform-SRE: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261 (10Gehel) p:05Triage→03High [09:50:56] 10Data-Platform-SRE: Upgrade 
an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 (10Gehel) p:05Triage→03High [09:51:04] 10Data-Platform-SRE: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (10Gehel) p:05Triage→03High [09:54:16] good news! We *already* have replication factor exported as a metric [09:54:16] (https://thanos.wikimedia.org/graph?g0.expr=max+by+%28topic%29+%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%2C+topic%3D%22webrequest_text%22%7D%29&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D), making [09:54:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/975291 obsolete [09:55:43] so the topics with RF=1 are visible via https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28topic%29+%28kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%7D%29%29+%3C+2&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D [09:55:49] brouberol: yep please always check the JMX config before filing python changes :D [09:56:10] JMX/MBeans/Prometheus-exporter I meant [09:56:38] haha yep. I got bitten by automatism here. Older versions of kafka didn't export this, meaning I just assumed it wasn't there [09:59:41] brouberol: yes yes np :) Another bit of info to be aware of - we don't export, IIRC, everything that the Kafka MBeans expose, we explicitly only allow what's necessary [09:59:56] it is all stored in the prometheus jmx exporter's config in puppet [10:00:04] so in case a metric is not there, what I usually do is [10:00:10] 1) check if any mbean exposes it [10:00:22] 2) check the prometheus exporter config allowed list [10:00:23] etc.. 
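The Thanos links shared at 09:54 and 09:55 carry their PromQL in the URL-encoded `g0.expr` query parameter. A minimal sketch for decoding it with the Python standard library is below; the helper name `promql_from_thanos_url` is hypothetical, and the URL is shortened to the parameters that matter here (the remaining `g0.*` display options from the original link are omitted).

```python
from urllib.parse import urlparse, parse_qs


def promql_from_thanos_url(url: str, panel: str = "g0") -> str:
    """Return the decoded PromQL expression of one panel in a Thanos graph URL.

    Hypothetical helper for illustration: it only reads the `<panel>.expr`
    query parameter and does not talk to Thanos at all.
    """
    params = parse_qs(urlparse(url).query)  # decodes both '+' and %XX escapes
    return params[f"{panel}.expr"][0]


# Shortened copy of the 09:55 link (topics with a replication factor of 1).
RF1_URL = (
    "https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28topic%29+%28"
    "kafka_cluster_Partition_ReplicasCount%7Bcluster%3D%22kafka_jumbo%22%7D%29%29"
    "+%3C+2&g0.tab=0"
)

print(promql_from_thanos_url(RF1_URL))
```

The decoded expression takes the maximum replica count per topic on the `kafka_jumbo` cluster and keeps only topics where it is below 2, which is how the RF=1 topics show up.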
[10:00:38] it is probably a little cumbersome, but it is the current workflow [10:02:06] 10Data-Platform-SRE, 10Patch-For-Review: Monitor kafka topics with a replication factor of 1 - https://phabricator.wikimedia.org/T346887 (10brouberol) My work on adding a metric exporter script was useless, as we already have [[ https://thanos.wikimedia.org/graph?g0.expr=%28max+by+%28cluster%2C+topic%29+%28kaf... [10:02:40] ack, thanks [10:03:05] o/ brouberol we're at the sync [10:04:16] oops omw [10:16:26] hello! Would it be possible to get some support in testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976163 once merged? Maybe with DRY-RUN runs? [10:24:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:08] 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10BTullis) a:05BTullis→03None [10:25:32] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10BTullis) a:05BTullis→03None [10:25:52] 10Data-Platform-SRE: Create a helm chart for the spark-history service - https://phabricator.wikimedia.org/T351722 (10BTullis) a:05BTullis→03None [10:29:46] (EventgateValidationErrors) resolved: ... 
[10:29:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [10:31:41] speaking of topics with RF=1 elukey, do you know whether we rely on the ksql-* topics for anything, or if they were created at some point and are now unused? [10:32:32] brouberol: IIRC it was a test made by Andrew a long time ago [10:32:39] I am not aware of any usage of ksql [10:33:23] thanks! Before deleting anything, I'll make sure he sends a +1 [10:36:40] volans: Sure, we can help with that testing. The lowest risk cookbook of those to run is `cookbooks/sre/presto/reboot-workers.py` which we can do without a DRY-RUN. [10:38:42] btullis: great, lmk what's the best way to coordinate and feel free to take the time to review the changes [10:42:17] Cool, I'm reviewing now. I think that b.king and r.kemper would probably be best placed to check the wdqs data transfer cookbook. [10:45:47] yep I added ryan to the CR too for that one [10:48:41] I'll happily run some of them when it's merged and let you know. I can't see us being in a position to run `hadoop.upgrade-bigtop-distro` or `hadoop.change-distro-from-cdh` live for a while, but the others should be easy enough to fit in soon. [10:49:38] a dry-run should in most cases be enough (assuming it doesn't bail out earlier for some reason) [10:50:04] Ack. [10:51:52] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) p:05Triage→03Medium [10:54:07] btullis: thanks! 
So the plan is to wait for feedback on the wdqs ones, then merge and test together? [10:56:15] (EventgateValidationErrors) firing: ... [10:56:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [10:56:17] volans: Yes, I think so, unless you're in a hurry. r.kemper is in the US west coast timezone, so should be able to approve later today. [10:56:31] nah no hurry at all [10:57:50] Cool. I'll make sure to mention it to them later, if I can. [11:01:28] thanks [11:06:16] (EventgateValidationErrors) resolved: ... [11:06:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [11:14:24] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10Clement_Goubert) >>! In T351716#9349220, @Ottomata wrote: > I think the two namespaces two helmfile option is better in this case. I thought there was a example of different he... [11:44:46] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) Great, thanks @Ottomata and @Clement_Goubert for this advice, that makes perfect sense. So we'll start out with **two namespaces**, **two helmfile service directories*... 
[11:45:04] 10Data-Platform-SRE: Decide how to handle the spark-history service for the test cluster - https://phabricator.wikimedia.org/T351716 (10BTullis) 05Open→03Resolved [11:45:08] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [11:51:19] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) [12:09:41] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:17:13] headsup, I'm going to reimage an-druid1001 (the last one to go) [12:17:32] brouberol: Ack, thanks. 
[12:19:50] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye [12:21:43] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10Patch-For-Review, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) I have added the required configuration to our apache2 site template for matomo in 976686: Add another p... [12:22:44] !log applying security patches to postgres13 on an-db1001 [12:22:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:26:30] 10Data-Platform-SRE: Add optional TLS encryption to the druid-public-broker - https://phabricator.wikimedia.org/T331631 (10BTullis) a:05BTullis→03None [12:35:06] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10BTullis) I have merged the [[https://gerrit.wikimedia.org/r/896363|patch to the spark images]] and I've triggered a manual rebuild of the production images repo, as pe... [12:53:41] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye completed: - an-druid1001 (**PASS**)... 
[13:06:27] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10brouberol) 05In progress→03Resolved [13:06:30] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10brouberol) [13:06:51] an-druid is now fully running on Bullseye [13:09:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:50] 10Data-Platform-SRE: Add a namespace (or namespaces) for the spark-history service - https://phabricator.wikimedia.org/T351713 (10brouberol) a:03brouberol [13:25:08] RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:10] 10Data-Platform-SRE: Set up kubeconfig files for spark-history - https://phabricator.wikimedia.org/T351711 (10brouberol) a:03brouberol [13:31:52] 10Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (10gmodena) > As a low hanging fruit, we could take the union of both tables (in their current shape) and generate a single data q... 
[13:34:08] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10brouberol) [13:34:58] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10brouberol) [13:38:43] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE: Create a keytab for each spark-history-server and add it to the puppet secret hieradata - https://phabricator.wikimedia.org/T351816 (10brouberol) @BTullis @Stevemunene I was reading https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Kerberos/Adminis... [13:57:23] 10Data-Engineering, 10Data-Platform-SRE, 10Privacy Engineering, 10Patch-For-Review, 10SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (10BTullis) Oh, it looks like that's not working. I deployed that change, but it seems that I am still receiving 302... 
[14:19:43] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:55] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2001.codfw.wmnet with OS bullseye [14:20:10] PROBLEM - Check systemd state on an-worker1087 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:28] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:14] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:38] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:38] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:54] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:56] PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:28] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:54] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:10] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:24:16] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:34] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:24:43] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:43] (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:55] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:30:07] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:01] PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:21] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:29] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:37] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:37] RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:07] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: 
PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:34:43] (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:19] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:38:21] RECOVERY - Check systemd state on an-worker1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:43] (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:41:07] RECOVERY - Check systemd state on an-worker1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:15] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:43] (SystemdUnitFailed) firing: (11) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:45] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:43] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:32] 10Data-Engineering, 10Release-Engineering-Team, 10GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Ad... [14:51:48] 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a dependency to our setup [14:51:56] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a depende... 
[14:52:49] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:29] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:11] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:43] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:56:45] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2001.codfw.wmnet with OS bullseye completed: - aqs2001 (**PASS**) - Downtimed on Icinga/Alertmanage... 
[14:56:57] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [14:57:49] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2002.codfw.wmnet with OS bullseye [14:58:09] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:29] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:45] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:43] (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:43] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:43] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:07] PROBLEM - Check systemd 
state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:17] PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:21] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:37] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:05] PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:45] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:02:55] PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:03] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:05] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process 
with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:13] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:57] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:04] 10Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (10Antoine_Quhen) The workaround to our GitLab CI problem is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 [15:04:43] (SystemdUnitFailed) firing: (14) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:29] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:33] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:35] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:45] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:07] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (10Ladsgroup) The new term tables in commons (wbt_*) should be empty to my knowledge. Is there a reason to make them visible? Or d... [15:09:37] RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:43] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:10] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10brouberol) Since we started receiving alerts for every puppet failure in `#wikimedia-analytics`, the channel is really starting to be almost bot/alerts-only, whi... 
[15:14:44] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:13] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:29] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:31] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:19:35] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:19:43] (SystemdUnitFailed) firing: (13) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:11] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:22:51] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: 
CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:53] RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:11] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:23:29] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:29] RECOVERY - Check systemd state on an-worker1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:43] (SystemdUnitFailed) firing: (12) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:01] 10Data-Engineering, 10Data-Platform-SRE: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10brouberol) a:03brouberol [15:26:27] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10BTullis) Sounds good. Here are my own knee-jerk responses :-) Personally, I would vote for `#wikimedia-data-platform` - it needn't be so focused on SREs that we h... [15:27:02] 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. 
- https://phabricator.wikimedia.org/T345806 (10brouberol) @gmodena let's sync when you want to increase partition count, and I'll happily oblige! [15:27:13] 10Data-Engineering (Sprint 5), 10Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10brouberol) [15:29:43] (SystemdUnitFailed) firing: (10) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:43] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:03] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:31:00] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10brouberol) > This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale,... 
[15:33:03] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:21] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:34:43] (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:04] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2002.codfw.wmnet with OS bullseye completed: - aqs2002 (**WARN**) - Downtimed on Icinga/Alertmanage... 
[15:39:22] !log updating default airflow configuration with https://gerrit.wikimedia.org/r/c/operations/puppet/+/976700 [15:39:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:40:15] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:42:49] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:43:57] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:44:03] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:32] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye [15:49:43] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:57] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:35] RECOVERY - Hadoop NodeManager on 
an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:53:58] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) I have installed the new test airflow 2.7.3 deb to an-test-client1002 with the following commands. ` btullis@an-test-client1002:... [15:56:25] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (10BTullis) Executed the database migrations with: ` btullis@an-test-client1002:~$ sudo -u analytics airflow-analytics_test db upgrade /usr/... [15:56:39] 10Data-Engineering, 10Data-Services, 10cloud-services-team: Surface Temporary user information to Cloud Wiki Replicas - https://phabricator.wikimedia.org/T346679 (10taavi) 05Open→03Resolved [15:57:47] (SystemdUnitCrashLoop) firing: (3) airflow-kerberos@analytics_test.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:57:55] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:27] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:58:30] 10Data-Engineering (Sprint 5), 10Data-Platform-SRE, 10Observability-Metrics: Configure Airflow to send metrics to Prometheus - 
https://phabricator.wikimedia.org/T343232 (10BTullis) Now I see metrics being sent. ` btullis@an-test-client1002:~$ sudo tcpdump -i lo port 9125 tcpdump: verbose output suppressed, u... [15:58:31] PROBLEM - Check systemd state on an-presto1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:13] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:43] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:54] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye executed with errors: - aqs2003 (**FAIL**) - Downtimed on Icinga/... 
[16:02:15] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye [16:07:47] (SystemdUnitCrashLoop) resolved: (3) airflow-kerberos@analytics_test.service crashloop on an-test-client1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:09:40] RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:43] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:02] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:10:14] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:42] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - 
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:14:43] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:18] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:21:36] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:43] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:30] We currently have 2 running spark jobs doing the same thing (XMLDumpsConverter) - This is due to the skein launcher having died, but not the child job, and airflow is retrying [16:26:37] I'm killing the job with no launcher [16:27:39] !log Kill duplicated XMLDumpsConverter [16:27:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:36:48] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [16:40:24] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2003.codfw.wmnet with OS bullseye completed: - aqs2003 (**WARN**) - Removed 
from Puppet and PuppetD... [16:43:17] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2004.codfw.wmnet with OS bullseye [16:56:22] RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:48] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:25] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [17:20:48] 10Data-Platform-SRE, 10Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (10BTullis) [17:23:40] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2004.codfw.wmnet with OS bullseye completed: - aqs2004 (**WARN**) - Downtimed on Icinga/Alertmanage... 
[17:26:46] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [17:27:43] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2005.codfw.wmnet with OS bullseye [17:28:57] * brouberol waves good evening [17:57:33] 10Data-Platform-SRE, 10Discovery-Search (Current work): CirrusSearch: make p95 alerts more granular - https://phabricator.wikimedia.org/T349340 (10RKemper) Here's a [[ https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=(filters:!(),refreshInterval:(pause:!t,value:0),tim... [18:03:49] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2005.codfw.wmnet with OS bullseye completed: - aqs2005 (**WARN**) - Downtimed on Icinga/Alertmanage... [18:16:41] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye [18:26:17] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10RobH) attempting to reimage an-worker1160 it sticks at requesting a lease for boot, host shows the MAC of the eth0 attempting to request a dhcp lease for boot. on insta... 
[18:29:37] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:29:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:29:45] PROBLEM - Check systemd state on an-worker1152 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:21] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:32:29] RECOVERY - Check systemd state on an-worker1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:40] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2006.codfw.wmnet with OS bullseye [19:26:17] 10Analytics, 10Data-Engineering, 10SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10brouberol) I wonder if something as simple as 
round robin DNS implemented with multiple A records with the same subdomain would suffice to substantially improve the... [19:36:56] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye executed with errors: - an-worker1... [19:48:12] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2006.codfw.wmnet with OS bullseye completed: - aqs2006 (**WARN**) - Downtimed on Icinga/Alertmanage... [19:56:37] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2007.codfw.wmnet with OS bullseye [19:57:14] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [20:13:23] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.291% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:33:51] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:58] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2007.codfw.wmnet with OS bullseye completed: - aqs2007 (**WARN**) - Downtimed on Icinga/Alertmanage... 
[20:34:21] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:34:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:35:53] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2008.codfw.wmnet with OS bullseye
[20:37:07] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:38:01] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:35] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye
[21:28:58] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2008.codfw.wmnet with OS bullseye completed: - aqs2008 (**WARN**) - Downtimed on Icinga/Alertmanage...
[21:30:10] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2009.codfw.wmnet with OS bullseye
[21:31:13] (CR) Jgreen: Fix mismatched allocation error from fdopen/pclose to fdopen/fclose. This is to resolve a "mismatched-dealloc" error that blocked packaging (1 comment) [analytics/kafkatee] - https://gerrit.wikimedia.org/r/961174 (owner: Jgreen)
[22:05:36] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2009.codfw.wmnet with OS bullseye completed: - aqs2009 (**PASS**) - Downtimed on Icinga/Alertmanage...
[22:06:33] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:07:19] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[22:20:22] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye executed with errors: - aqs2010 (**FAIL**) - Downtimed on Icinga/...
[22:20:40] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:43:05] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye executed with errors: - aqs2010 (**FAIL**) - Removed from Puppet...
[22:43:23] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye
[22:43:37] Data-Engineering, Product-Analytics, Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (Sharvaniharan) @Ottomata Thank you for looping me in. @SNowick_WMF , @mpopov Thank you for summarizing our update issues with Migration. Our complete m...
[23:20:56] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans)
[23:24:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:26:12] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs2010.codfw.wmnet with OS bullseye completed: - aqs2010 (**WARN**) - Removed from Puppet and PuppetD...
[23:26:28] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:18] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:20] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:43] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:48] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:08] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:36] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:42] PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:43] (SystemdUnitFailed) firing: (11) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:56] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:59] Data-Engineering, Product-Analytics, Event-Platform: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (mpopov) @Sharvaniharan: once the legacy EL system is turned off, the tables associated with non-migrated legacy EL schemas will stop getting data. The da...
[23:41:10] PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:14] PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:26] PROBLEM - Check systemd state on an-presto1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:52] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:56] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:02] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:22] PROBLEM - Check systemd state on an-presto1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:43] (SystemdUnitFailed) firing: (17) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:46:56] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:43] (SystemdUnitFailed) firing: (19) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:52:04] PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:50] PROBLEM - Check systemd state on an-presto1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:54:43] (SystemdUnitFailed) firing: (20) export_smart_data_dump.service Failed on an-conf1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed