[09:08:07] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (nfraison) Some zookeeper from other teams are already relying on bullseye with zookeepe...
[09:46:01] (PS20) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[10:07:41] (PS21) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[10:15:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:20:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:29:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:34:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:34:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:35:32] (PS22) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[10:36:56] (CR) CI reject: [V: -1] Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[10:39:42] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[10:44:50] PROBLEM - Kerberos KDC daemon on krb2001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[10:46:28] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[11:08:57] (PS23) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[11:09:26] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (gmodena) a:gmodena
[12:23:15] (Abandoned) Thiemo Kreuz (WMDE): Limit HTTP status code to 100…599 [schemas/event/secondary] - https://gerrit.wikimedia.org/r/836736 (owner: Thiemo Kreuz (WMDE))
[12:29:08] !log restart presto coordinator on an-coord1001 to take in account new configs T329525
[12:29:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:29:10] T329525: Create Presto test clusters with 10 new nodes and try reproduce issue - https://phabricator.wikimedia.org/T329525
[12:32:33] !log roll-restart presto workers on an-presto100[1-5] to take in account new configs T329525
[12:32:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:45:11] !log adding 5 nodes to the presto prod cluster
[12:45:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:11:01] !log Reimage an-presto1001 to upgrade to bullseye T329361
[13:11:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:11:03] T329361: Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361
[13:12:49] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye
[13:38:30] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Presto is unstable with more than 5 worker nodes - https://phabricator.wikimedia.org/T325809 (nfraison) a:nfraison
[13:40:12] Data-Engineering, Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Add a presto query logger - https://phabricator.wikimedia.org/T269832 (nfraison) a:nfraison→None
[13:51:09] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye e...
[13:55:45] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye
[14:01:55] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) Reboot of an-presto1001.eqiad.wmnet seems stuck nothing displayed on the IPMI console Enforcing manual reset of the node ` ipmitoo...
[14:36:07] Data-Engineering, Data Pipelines: Upload new airflow package to wikimedia APT repository - https://phabricator.wikimedia.org/T330087 (Stevemunene)
[14:51:58] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye e...
[14:52:09] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) Again stuck on [138/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Unable to get...
[14:58:39] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye
[15:02:23] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) DHCP configuration looks fine root@install1004:/etc/dhcp/automation/proxies# cat ttyS0-115200.conf # Automatically generated by dh...
[15:04:23] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) retrying to run the reimage this time the console move forward but the debian installer is stuck on some missing firmware ` ┌──...
[15:07:14] (PS24) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[15:37:28] (Abandoned) Aqu: Java Hive UDF thread safety [analytics/refinery/source] - https://gerrit.wikimedia.org/r/886800 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[15:41:28] (PS25) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[15:54:52] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye e...
[15:56:31] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) Looking at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006500 it seems that those firmware are missing on the bullseye setu...
[16:05:58] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) From https://phabricator.wikimedia.org/T308106 this is a known issue that require manual ack on the host
[16:09:47] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye
[16:22:37] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (nfraison) reimage relaunched and prompt manually ack. But again blocked due to ` ┌───────────────────────┤ [!] Partition disks ├──────────...
[16:26:53] Data-Engineering-Planning, Data Pipelines, Discovery-Search (Current work): Migrate import_ttl.py from airflow 1 to airflow 2 - https://phabricator.wikimedia.org/T329874 (MPhamWMF)
[16:28:49] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by nfraison@cumin1001 for host an-presto1001.eqiad.wmnet with OS bullseye e...
[16:49:04] (PS26) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[17:04:23] (PS27) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[17:24:36] Data-Engineering-Planning, Data Pipelines (Sprint 09), Patch-For-Review, SecTeam-Processed, Vuln-VulnComponent: Upgrade Puppet code to make Airflow configuration files compatible with version 2.5.0 - https://phabricator.wikimedia.org/T315580 (JArguello-WMF)
[17:35:12] (PS28) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[17:49:00] (CR) Aqu: "I've switched to a thread safe version of the singletons. I've removed serialization. And I've performed some fixes requires by Sonar. One" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[20:09:05] Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (Antoine_Quhen) The last round of review is needed. * I've made sure that the use of our singletons is thread-safe + lazy-loaded. * I've ensured that Kryo d...