[00:18:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:54] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:00] (03PS1) 10David Martin: Update hql LOCATIONs to accord with corresponding sqoop job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939785 (https://phabricator.wikimedia.org/T341729)
[01:50:06] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (10Milimetric) I believe the new query I left in T342267#9029101 goes most of the way towards addressing the problems...
[01:52:57] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update hql LOCATIONs to accord with corresponding sqoop job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939785 (https://phabricator.wikimedia.org/T341729) (owner: 10David Martin)
[02:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:07] 10Data-Engineering, 10Data-Engineering-Wikistats: Increase topojson resolution: Singapore does not appear on wikistats map - https://phabricator.wikimedia.org/T199571 (10Robertsky) 05Open→03Resolved a:03Robertsky This seems to be resolved with T338033
[06:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:02] (03PS2) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765)
[07:14:19] (03PS2) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765)
[07:17:21] (03CR) 10TChin: [C: 03+2] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:28] (03CR) 10TChin: [C: 03+2] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:50] (03Merged) 10jenkins-bot: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:59] (03Merged) 10jenkins-bot: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:49:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:55] btullis: Good morning :)
[07:56:31] btullis: I got contacted by Alluxio folks, and look what they showed me: https://prestodb.io/docs/current/cache/local.html
[07:57:13] btullis: I vaguely remember you needed a review on a CR, but can't find which one that was. Would you have a link?
[08:00:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:17] gehel: Many thanks. It's this one: https://gerrit.wikimedia.org/r/c/896116
[08:42:34] for java engineers around: I've started a slack thread in #wmf-java about serialVersionUID (based on https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/5/diffs). your inputs would be welcomed. (cc: joal, xcollazo)
[08:42:43] joal: Oh! Hang on. Wow!
[08:50:49] joal: Should we have a meeting after the unmeeting?
[08:51:52] btullis: I'll have to run for an errand after the unmeeting - after lunch?
[08:54:40] joal: Yes, sure thing.
[08:54:49] btullis: I'll send an invite
[09:00:10] gehel: Will read about SUUIDs :)
[09:00:51] joal: it seems like a simple subject, but it is way more complex than initially expected :)
[09:01:14] gehel: as are so many topics :)
[09:25:26] joal: This built-in alluxio caching was enabled in version 0.236 of presto: https://prestodb.io/docs/current/release/release-0.236.html#hive-changes
[09:26:06] When I started (a little over two years ago) we were already on 0.246: https://gerrit.wikimedia.org/g/operations/debs/presto
[09:28:15] I struggle to believe that when I was doing all of this T266641 we (I) missed this option. https://prestodb.io/docs/current/cache/local.html
[09:28:15] T266641: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641
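For reference on the serialVersionUID thread mentioned at 08:42: the usual pattern is to declare the field explicitly on every Serializable class so the serialized form stays stable across recompiles. A minimal Java sketch, using a hypothetical PageChangeEvent class rather than anything from the cirrus-streaming-updater merge request:

    import java.io.Serializable;

    // Hypothetical event class for illustration; not the actual model class
    // from the merge request under review.
    public class PageChangeEvent implements Serializable {

        // Pinning serialVersionUID explicitly keeps the serialized form stable:
        // without it, the JVM derives an ID from the class structure, so adding
        // a field or method can break deserialization of previously written data.
        private static final long serialVersionUID = 1L;

        private String pageTitle;
        private long revisionId;

        public String getPageTitle() { return pageTitle; }

        public long getRevisionId() { return revisionId; }
    }

When the field is omitted, the JVM computes a default serialVersionUID from the class layout, which is why otherwise-compatible refactors can make old serialized instances unreadable.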
[10:02:05] 10Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Manuel)
[10:06:25] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (10BTullis)
[10:07:14] 10Data-Engineering, 10Data-Platform-SRE: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis)
[10:07:15] btullis: We (I) missed the fact that kerberos+impersonation for global Alluxio cache was not open source... This presto story is a long miss-and-hit one :)
[10:07:24] 10Data-Engineering, 10Data-Platform-SRE: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) 05Resolved→03Open Reopening this ticket, as we may have a way forward with using Alluxio to optimize Presto and improve performance for Superset and stat machine users. P...
[10:49:42] 10Data-Platform-SRE, 10Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis)
[10:57:00] 10Data-Platform-SRE, 10Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis) Moving to deploy this change now. I expect that we might receive some alerts from Icinga re: disk-space etc so I have added a week's downtime on the servers in...
[11:11:34] 10Data-Platform-SRE: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis) There are a few errors. It's trying to disable the write cache on the NVMe drive, which I'm not sure was intentional. ` Notice: /Stage[main]/Ceph::Osds/Exec[Disable write cache on de...
[11:18:50] (03Abandoned) 10Jennifer Ebe: T340880 Merge visibility changes into hourly target table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937047 (owner: 10Jennifer Ebe)
[11:19:30] (03Abandoned) 10Jennifer Ebe: T335860 Implement job to transform mediawiki revision_visibility_change Hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/935442 (owner: 10Jennifer Ebe)
[12:04:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:13] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.282 - https://phabricator.wikimedia.org/T342343 (10BTullis)
[12:59:53] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @btullis replaced cable on analytics1073 & analytics1075
[13:44:37] 10Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10BTullis) p:05Triage→03High a:03Stevemunene
[13:45:19] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[13:48:04] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10gmodena)
[13:57:28] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:01:53] (03PS1) 10Joal: Remove unused cassandra module [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/940154
[14:10:08] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking) Per Tuesday's pairing session, if we use the `--force` flag when we `scap deploy`, we can fix the issues with incomplete deployment and fi...
[14:15:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[14:32:49] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:32:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:37:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:38:03] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:38:11] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:39:04] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:50:26] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Removed from Puppet...
[14:51:10] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[14:52:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:58:49] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye completed: - analytics1075 (**PASS**) - Removed from Puppet and Puppet...
[15:02:23] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) 05Open→03Resolved Many thanks to all concerned. These hosts now have regained connectivity and have been upgraded to 10 G...
[15:16:16] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm completed:...
[15:30:14] Hey a-team! What's going on with stat1006? My jupyterhub server was stopped overnight even though it wasn't doing much, and now I'm unable to spawn it again, with both an existing environment and a new one; they all hit the 2 minute timeout
[15:31:21] Hi Nettrom - I'm not aware of anything special - I think it will be decommissioned in favor of stat1009, but that was not imminent as far as I know
[15:31:46] btullis, stevemunene - any info on stat1006?
[15:31:59] joal: yeah, I'm planning to move to stat1009 once my current analysis project is completed, but didn't think it needed to happen this quickly :)
[15:32:37] Nettrom: The host seems quit, no load or big job - weird
[15:32:50] quit +e / quiet
[15:33:23] Hey Nettrom, joal - looking into it.
[15:33:32] Thanks stevemunene
[15:33:35] stevemunene: thank you! :)
[15:34:57] 10Data-Platform-SRE, 10Discovery-Search: Examine/refactor WDQS categories update scripts - https://phabricator.wikimedia.org/T342361 (10bking)
[15:44:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:31] Nettrom: It's not busy now, but there was a lot of pressure on the memory overnight.
[16:01:40] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=stat1006&var-datasource=thanos&var-cluster=analytics&from=now-24h&to=now https://usercontent.irccloud-cdn.com/file/lLpFhgy9/image.png
[16:04:15] https://www.irccloud.com/pastebin/odssCfQH/
[16:09:17] Both processes that were mentioned though (26464 and 26202) seemed to belong to gmodena and not to you, Nettrom
[16:10:11] So I haven't got an explanation for what's happened to your jupyterhub. Are you still unable to log into it?
[16:10:26] o/ btullis I can spawn a server with my user, but I am getting some errors when I try to use Nettrom's server env - I get an error https://www.irccloud.com/pastebin/42o0uKY7/
[16:11:13] oh oops missed your last message before posting that btullis
[16:13:03] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Downtimed on Icinga...
[16:13:14] based on https://discourse.jupyter.org/t/keyerror-missing-required-environment-jupyterhub-service-url/19096/9 I thought this might be related to the jupyterhub-singleuser server version that the failing env is running. With the failing env showing `Starting jupyterhub single-user server extension version 4.0.1` instead of the expected `Starting jupyterhub-singleuser server version 1.5.0`
[16:15:27] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[16:22:23] stevemunene, Nettrom - Yes, I think it might be something like this. I've checked in the base conda environment which version of jupyter we have:
[16:23:35] https://www.irccloud.com/pastebin/lI9Hhx4p/
[16:24:20] Then I have compared this with the conda environment that is currently running in nettrom's active jupyter-singleuser environment:
[16:24:49] https://www.irccloud.com/pastebin/xOytc4gi/
[16:27:24] So it looks like your jupyterhub has been upgraded in this conda environment from version 1.5 to version 4.0. But something about the jupyterhub-singleuser plugin isn't happy.
[16:29:07] Do you need the upgraded version in particular nettrom? Could you downgrade it yourself within this conda environment? You said that you also had trouble spawning a new environment, that seems odd.
[16:30:19] btullis: Oh, good catch! I've noticed that installing/updating packages through conda inside Jupyter results in updates to lots of installed packages
[16:30:36] I'll try downgrading to the base install then and it'll probably fix the problem
[16:31:53] I did notice being able to spawn a different environment that I think has less updates, so downgrading the one we've been looking at hopefully solves things
[16:32:06] if this continues, I'll be back, or file a phab task :)
[16:32:15] thanks for looking into it, btullis & stevemunene !
[16:32:19] Nettrom: Great! Let us know how it goes.
[16:33:00] btullis and stevemunene for the win ! <3
[16:42:57] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm
[16:48:06] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye completed: - analytics1073 (**PASS**) - Downtimed on Icinga/Alertmanag...
[17:18:04] PROBLEM - Zookeeper Server on flink-zk1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[17:22:21] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm completed:...
[17:25:30] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[18:00:25] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm completed:...
[18:11:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Papaul finished double checking that I got everything like we discussed. All the firmware is up to date and the NIC issues have been solved. Can you pleas...
[18:47:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[18:47:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[18:52:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[18:52:27] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[18:54:03] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10RKemper) All of these hosts except `wdqs202[1-2]` are in service. Those last two hosts will...
[19:00:50] PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[19:38:09] we have alerts! I just downtimed ;)
[19:44:57] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[20:27:27] (2) The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[20:27:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[20:27:33] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[20:32:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[20:32:27] (2) The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[20:34:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:34:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:39:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[20:39:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:42:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:42:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:47:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[20:47:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:49:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:49:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:50:19] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm that step was already done on June 15, see link below, so you should be good to proceed with the OS install. Thanks https://gerrit.wikimedia.org/r/c...
[21:07:06] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye
[21:27:36] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: events larger than max.request.size should be produced - https://phabricator.wikimedia.org/T342399 (10gmodena)
[21:29:47] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (10gmodena)
[21:33:06] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (10gmodena) We did implement this filter in the old Scala PoC: https://gitlab.wikimedia.org/repos/data-en...
[21:41:34] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[21:55:42] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye completed: - an-worker1156 (*...
[21:56:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[22:00:29] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye
[22:48:10] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye completed: - an-worker1155 (*...
[22:51:04] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[22:54:59] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye
[23:33:34] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye completed: - an-worker1154 (*...
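On T342399 above (filtering out events larger than max.request.size in mw-page-content-change-enrich): a rough Java sketch of what such a size guard could look like with Flink's DataStream filter, assuming the enriched events are serialized JSON strings. The 4 MB threshold, the class name and the stream variable are illustrative assumptions, not the job's actual configuration:

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.api.common.functions.FilterFunction;

    // Drops events whose UTF-8 serialized size exceeds the producer limit,
    // so the Kafka sink does not fail the job on a RecordTooLargeException.
    public class OversizedEventFilter implements FilterFunction<String> {

        // Illustrative threshold; in practice this would track the Kafka
        // producer's max.request.size, minus headroom for record overhead.
        private static final int MAX_EVENT_BYTES = 4 * 1024 * 1024;

        @Override
        public boolean filter(String eventJson) {
            // Keep only events small enough to be produced.
            return eventJson.getBytes(StandardCharsets.UTF_8).length <= MAX_EVENT_BYTES;
        }
    }

    // Hypothetical usage inside the enrichment job:
    //   DataStream<String> producible = enriched.filter(new OversizedEventFilter());

In practice the threshold would be derived from the producer's configured max.request.size, and dropped events would more likely be counted or routed to a side output for monitoring rather than silently discarded.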
[23:38:15] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[23:42:01] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye
[23:49:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed