[00:18:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:54] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:00] (03PS1) 10David Martin: Update hql LOCATIONs to accord with corresponding sqoop job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939785 (https://phabricator.wikimedia.org/T341729)
[01:50:06] 10Data-Engineering, 10Data-Engineering-Dashiki, 10Performance-Team (Radar): Investigate surprising "10% Other" portion of Analytics Browsers report - https://phabricator.wikimedia.org/T342267 (10Milimetric) I believe the new query I left in T342267#9029101 goes most of the way towards addressing the problems...
[01:52:57] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Update hql LOCATIONs to accord with corresponding sqoop job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/939785 (https://phabricator.wikimedia.org/T341729) (owner: 10David Martin)
[02:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:07] 10Data-Engineering, 10Data-Engineering-Wikistats: Increase topojson resolution: Singapore does not appear on wikistats map - https://phabricator.wikimedia.org/T199571 (10Robertsky) 05Open→03Resolved a:03Robertsky This seems to be resolved with T338033
[06:24:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:02] (03PS2) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765)
[07:14:19] (03PS2) 10TChin: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765)
[07:17:21] (03CR) 10TChin: [C: 03+2] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:28] (03CR) 10TChin: [C: 03+2] Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:50] (03Merged) 10jenkins-bot: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/939366 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:17:59] (03Merged) 10jenkins-bot: Skip schema test cases that will fail validation in new jsonschema-tools version [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/939367 (https://phabricator.wikimedia.org/T340765) (owner: 10TChin)
[07:49:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:55] btullis: Good morning :)
[07:56:31] btullis: I got contacted by Alluxio folks, and look what they showed me: https://prestodb.io/docs/current/cache/local.html
[07:57:13] btullis: I vaguely remember you needed a review on a CR, but can't find which one that was. Would you have a link?
[08:00:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:17] gehel: Many thanks. It's this one: https://gerrit.wikimedia.org/r/c/896116
[08:42:34] for java engineers around: I've started a slack thread in #wmf-java about serialVersionUID (based on https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/5/diffs). your inputs would be welcomed. (cc: joal, xcollazo)
[08:42:43] joal: Oh! Hang on. Wow!
[08:50:49] joal: Should we have a meeting after the unmeeting?
[08:51:52] btullis: I'll have to run for an errand after the unmeeting - after lunch?
[08:54:40] joal: Yes, sure thing.
[08:54:49] btullis: I'll send an invite
[09:00:10] gehel: Will read about SUUIDs :)
[09:00:51] joal: it seems like a simple subject, but it is way more complex than initially expected :)
[09:01:14] gehel: as are so many topics :)
[09:25:26] joal: This built-in alluxio caching was enabled in version 0.236 of presto: https://prestodb.io/docs/current/release/release-0.236.html#hive-changes
[09:26:06] When I started (a little over two years ago) we were already on 0.246: https://gerrit.wikimedia.org/g/operations/debs/presto
[09:28:15] I struggle to believe that when I was doing all of this T266641 we (I) missed this option. https://prestodb.io/docs/current/cache/local.html
[09:28:15] T266641: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641
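For reference on the serialVersionUID thread mentioned at 08:42: the usual pattern is to declare the field explicitly on every Serializable class so the serialized form stays stable across recompiles. A minimal Java sketch, using a hypothetical PageChangeEvent class rather than anything from the cirrus-streaming-updater merge request:

    import java.io.Serializable;

    // Hypothetical event class for illustration; not the actual model class
    // from the merge request under review.
    public class PageChangeEvent implements Serializable {

        // Pinning serialVersionUID explicitly keeps the serialized form stable:
        // without it, the JVM derives an ID from the class structure, so adding
        // a field or method can break deserialization of previously written data.
        private static final long serialVersionUID = 1L;

        private String pageTitle;
        private long revisionId;

        public String getPageTitle() { return pageTitle; }

        public long getRevisionId() { return revisionId; }
    }

When the field is omitted, the JVM computes a default serialVersionUID from the class layout, which is why otherwise-compatible refactors can make old serialized instances unreadable.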
[10:02:05] 10Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Manuel)
[10:06:25] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (10BTullis)
[10:07:14] 10Data-Engineering, 10Data-Platform-SRE: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis)
[10:07:15] btullis: We (I) missed the fact that kerberos+impersonation for global Alluxio cache was not open source... This presto story is a long miss-and-hit one :)
[10:07:24] 10Data-Engineering, 10Data-Platform-SRE: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (10BTullis) 05Resolved→03Open Reopening this ticket, as we may have a way forward with using Alluxio to optimize Presto and improve performance for Superset and stat machine users. P...
[10:49:42] 10Data-Platform-SRE, 10Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis)
[10:57:00] 10Data-Platform-SRE, 10Patch-For-Review: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis) Moving to deploy this change now. I expect that we might receive some alerts from Icinga re: disk-space etc so I have added a week's downtime on the servers in...
[11:11:34] 10Data-Platform-SRE: Deploy ceph osd processes to data-engineering cluster - https://phabricator.wikimedia.org/T330151 (10BTullis) There are a few errors. It's trying to disable the write cache on the NVMe drive, which I'm not sure was intentional. ` Notice: /Stage[main]/Ceph::Osds/Exec[Disable write cache on de...
[11:18:50] (03Abandoned) 10Jennifer Ebe: T340880 Merge visibility changes into hourly target table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/937047 (owner: 10Jennifer Ebe)
[11:19:30] (03Abandoned) 10Jennifer Ebe: T335860 Implement job to transform mediawiki revision_visibility_change Hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/935442 (owner: 10Jennifer Ebe)
[12:04:43] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:13] 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.282 - https://phabricator.wikimedia.org/T342343 (10BTullis)
[12:59:53] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @btullis replaced cable on analytics1073 & analytics1075
[13:44:37] 10Data-Platform-SRE: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10BTullis) p:05Triage→03High a:03Stevemunene
[13:45:19] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[13:48:04] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10gmodena)
[13:57:28] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:01:53] (03PS1) 10Joal: Remove unused cassandra module [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/940154
[14:10:08] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate WDQS categories update failures on Bullseye hosts - https://phabricator.wikimedia.org/T342060 (10bking) Per Tuesday's pairing session, if we use the `--force` flag when we `scap deploy`, we can fix the issues with incomplete deployment and fi...
[14:15:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye
[14:32:49] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:32:51] (HdfsFSImageAge) firing: The HDFS FSImage on analytics-test-hadoop:an-test-master1002:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:37:51] (HdfsFSImageAge) firing: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:38:03] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:38:11] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[14:39:04] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[14:50:26] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Removed from Puppet...
[14:51:10] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[14:52:51] (HdfsFSImageAge) resolved: (2) The HDFS FSImage on analytics-test-hadoop:an-test-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-test-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[14:58:49] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1075.eqiad.wmnet with OS bullseye completed: - analytics1075 (**PASS**) - Removed from Puppet and Puppet...
[15:02:23] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) 05Open→03Resolved Many thanks to all concerned. These hosts now have regained connectivity and have been upgraded to 10 G...
[15:16:16] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm completed:...
[15:30:14] Hey a-team! What's going on with stat1006? My jupyterhub server was stopped overnight even though it wasn't doing much, and now I'm unable to spawn it again, with both an existing environment and a new one; they all hit the 2 minute timeout
[15:31:21] Hi Nettrom - I'm not aware of anything special - I think it will be decommissioned in favor of stat1009, but that was not imminent as far as I know
[15:31:46] btullis, stevemunene - any info on stat1006?
[15:31:59] joal: yeah, I'm planning to move to stat1009 once my current analysis project is completed, but didn't think it needed to happen this quickly :)
[15:32:37] Nettrom: The host seems quit, no load or big job - weird
[15:32:50] quit +e / quiet
[15:33:23] Hey Nettrom, joal - looking into it.
[15:33:32] Thanks stevemunene
[15:33:35] stevemunene: thank you! :)
[15:34:57] 10Data-Platform-SRE, 10Discovery-Search: Examine/refactor WDQS categories update scripts - https://phabricator.wikimedia.org/T342361 (10bking)
[15:44:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:01:31] Nettrom: It's not busy now, but there was a lot of pressure on the memory overnight.
[16:01:40] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=stat1006&var-datasource=thanos&var-cluster=analytics&from=now-24h&to=now https://usercontent.irccloud-cdn.com/file/lLpFhgy9/image.png
[16:04:15] https://www.irccloud.com/pastebin/odssCfQH/
[16:09:17] Both processes that were mentioned though (26464 and 26202) seemed to belong to gmodena and not to you, Nettrom
[16:10:11] So I haven't got an explanation for what's happened to your jupyterhub. Are you still unable to log into it?
[16:10:26] o/ btullis I can spawn a server with my user, but I am getting some errors when I try to use Nettrom's server env - I get an error https://www.irccloud.com/pastebin/42o0uKY7/
[16:11:13] oh oops missed your last message before posting that btullis
[16:13:03] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye executed with errors: - analytics1073 (**FAIL**) - Downtimed on Icinga...
[16:13:14] based on https://discourse.jupyter.org/t/keyerror-missing-required-environment-jupyterhub-service-url/19096/9 I thought this might be related to the jupyterhub-singleuser server version that the failing env is running. With the failing env showing `Starting jupyterhub single-user server extension version 4.0.1` instead of the expected `Starting jupyterhub-singleuser server version 1.5.0`
[16:15:27] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye
[16:22:23] stevemunene, Nettrom - Yes, I think it might be something like this. I've checked in the base conda environment which version of jupyter we have:
[16:23:35] https://www.irccloud.com/pastebin/lI9Hhx4p/
[16:24:20] Then I have compared this with the conda environment that is currently running in nettrom's active jupyter-singleuser environment:
[16:24:49] https://www.irccloud.com/pastebin/xOytc4gi/
[16:27:24] So it looks like your jupyterhub has been upgraded in this conda environment from version 1.5 to version 4.0. But something about the jupyterhub-singleuser plugin isn't happy.
[16:29:07] Do you need the upgraded version in particular nettrom? Could you downgrade it yourself within this conda environment? You said that you also had trouble spawning a new environment, that seems odd.
[16:30:19] btullis: Oh, good catch! I've noticed that installing/updating packages through conda inside Jupyter results in updates to lots of installed packages
[16:30:36] I'll try downgrading to the base install then and it'll probably fix the problem
[16:31:53] I did notice being able to spawn a different environment that I think has less updates, so downgrading the one we've been looking at hopefully solves things
[16:32:06] if this continues, I'll be back, or file a phab task :)
[16:32:15] thanks for looking into it, btullis & stevemunene !
[16:32:19] Nettrom: Great! Let us know how it goes.
[16:33:00] btullis and stevemunene for the win ! <3
[16:42:57] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm
[16:48:06] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host analytics1073.eqiad.wmnet with OS bullseye completed: - analytics1073 (**PASS**) - Downtimed on Icinga/Alertmanag...
[17:18:04] PROBLEM - Zookeeper Server on flink-zk1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[17:22:21] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1001.eqiad.wmnet with OS bookworm completed:...
[17:25:30] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[18:00:25] 10Data-Platform-SRE, 10SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm completed:...
[18:11:28] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm) @Papaul finished double checking that I got everything like we discussed. All the firmware is up to date and the NIC issues have been solved. Can you pleas...
[18:47:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[18:47:27] The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[18:52:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ...
[18:52:27] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[18:54:03] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Patch-For-Review: Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10RKemper) All of these hosts except `wdqs202[1-2]` are in service. Those last two hosts will...
[19:00:50] PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[19:38:09] we have alerts! I just downtimed ;)
[19:44:57] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) firing: ...
[20:27:27] (2) The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[20:27:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) resolved: ...
[20:27:33] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning
[20:32:27] (MediawikiPageContentChangeEnrichTaskManagerNotRunning) resolved: ...
[20:32:27] (2) The mw-page-content-change-enrich Flink cluster in eqiad has no registered TaskManagers - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=All - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichTaskManagerNotRunning
[20:34:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:34:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:39:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[20:39:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:42:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:42:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:47:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) resolved: ...
[20:47:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:49:27] (MediawikiPageContentChangeEnrichHighKafkaConsumerLag) firing: ...
[20:49:27] High Kafka consumer lag for mw_page_content_change_enrich in eqiad - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichHighKafkaConsumerLag
[20:50:19] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Papaul) @Jhancock.wm that step was already done on June 15, see link below, so you should be good to proceed with the OS install. Thanks https://gerrit.wikimedia.org/r/c...
[21:07:06] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye
[21:27:36] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: events larger than max.request.size should be produced - https://phabricator.wikimedia.org/T342399 (10gmodena)
[21:29:47] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (10gmodena)
[21:33:06] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich: filter out events larger than max.request.size - https://phabricator.wikimedia.org/T342399 (10gmodena) We did implement this filter in the old Scala PoC: https://gitlab.wikimedia.org/repos/data-en...
[21:41:34] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[21:55:42] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye completed: - an-worker1156 (*...
[21:56:14] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[22:00:29] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye
[22:48:10] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye completed: - an-worker1155 (*...
[22:51:04] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[22:54:59] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye
[23:33:34] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye completed: - an-worker1154 (*...
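On T342399 above (filtering out events larger than max.request.size in mw-page-content-change-enrich): a rough Java sketch of what such a size guard could look like with Flink's DataStream filter, assuming the enriched events are serialized JSON strings. The 4 MB threshold, the class name and the stream variable are illustrative assumptions, not the job's actual configuration:

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.api.common.functions.FilterFunction;

    // Drops events whose UTF-8 serialized size exceeds the producer limit,
    // so the Kafka sink does not fail the job on a RecordTooLargeException.
    public class OversizedEventFilter implements FilterFunction<String> {

        // Illustrative threshold; in practice this would track the Kafka
        // producer's max.request.size, minus headroom for record overhead.
        private static final int MAX_EVENT_BYTES = 4 * 1024 * 1024;

        @Override
        public boolean filter(String eventJson) {
            // Keep only events small enough to be produced.
            return eventJson.getBytes(StandardCharsets.UTF_8).length <= MAX_EVENT_BYTES;
        }
    }

    // Hypothetical usage inside the enrichment job:
    //   DataStream<String> producible = enriched.filter(new OversizedEventFilter());

In practice the threshold would be derived from the producer's configured max.request.size, and dropped events would more likely be counted or routed to a side output for monitoring rather than silently discarded.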
[23:38:15] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jhancock.wm)
[23:42:01] 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye
[23:49:43] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@analytics_meta.service Failed on db1108:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed