[01:08:27] 10Data-Platform-SRE, 10Discovery-Search (Current work): Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (10RKemper) [01:08:57] 10Data-Platform-SRE, 10Discovery-Search (Current work): Retune enwiki_content shard settings - https://phabricator.wikimedia.org/T343820 (10RKemper) Will be in blocked/waiting for a few days while a reindex of all wikis completes to apply the newest settings. [02:41:06] 10Data-Engineering, 10Movement-Insights: Canonical-data ownership, definition and update - https://phabricator.wikimedia.org/T339928 (10nshahquinn-wmf) [03:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ... [03:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [05:10:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [05:15:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [05:28:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [05:29:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [05:33:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [05:34:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [06:03:32] 10Data-Engineering, 10DBA, 10Data-Services: Prepare and check storage layer for suwikisource - https://phabricator.wikimedia.org/T343547 (10Marostegui) Views and grants are added. A data check looks good, no PII. This is ready for #data-engineering to create the views. [06:04:05] 10Data-Engineering, 10DBA, 10Data-Services: Prepare and check storage layer for blkwiktionary - https://phabricator.wikimedia.org/T343541 (10Marostegui) Views and grants are added. A data check looks good, no PII. 
This is ready for #data-engineering to create the views. [06:44:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [06:47:38] 10Data-Engineering, 10All-and-every-Wikisource, 10ArticlePlaceholder, 10BetaFeatures, and 59 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (10WMDE-Fisch) [06:54:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [07:38:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [07:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[07:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [07:58:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [08:03:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [08:03:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [08:08:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [08:13:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [08:13:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [08:23:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [08:48:56] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1099.eqiad.wmnet with OS bullseye [09:08:47] (03CR) 10Hnowlan: [C: 03+1] Update aqs scap targets with new hosts and tidy up [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/947385 (https://phabricator.wikimedia.org/T342213) (owner: 10Btullis) [09:14:30] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Puppet: consider skipping SPDX enforcement on text files - https://phabricator.wikimedia.org/T344291 (10jbond) p:05Triage→03Low [09:15:42] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Puppet: consider skipping SPDX enforcement on text files - https://phabricator.wikimedia.org/T344291 (10jbond) ill leave this open for a bit before sending a CR but i dont see an issue (cc @MoritzMuehlenhoff) [09:29:12] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1099.eqiad.wmnet with OS bullseye completed: - an-worker1099 (**PASS**) - Downtimed on Icinga/Alertmanag... 
[09:32:04] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:26] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10BTullis) Just for reference, I checked to make sure that both pyspark 3.3.2 and 3.4.1 are available via conda-forge, since this is where we currently... [10:00:10] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:07] 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Puppet CI, 10Patch-For-Review: Puppet: consider skipping SPDX enforcement on text files - https://phabricator.wikimedia.org/T344291 (10MoritzMuehlenhoff) >>! In T344291#9095654, @jbond wrote: > ill leave this open for a bit before sending a CR but i dont... 
[10:19:27] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:27] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:31] awa [11:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[11:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [12:16:08] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1100.eqiad.wmnet with OS bullseye [12:25:09] 10Data-Engineering: Check home/HDFS leftovers of bmansurov - https://phabricator.wikimedia.org/T320367 (10BTullis) 05Open→03Resolved a:03BTullis Dropped the hive table with: ` hive (default)> DROP DATABASE bmansurov CASCADE; OK Time taken: 4.036 seconds hive (default)> ` Should you need it, the user's arch... [12:25:11] 10Data-Engineering: Cleanup User Hive Databases - https://phabricator.wikimedia.org/T323884 (10BTullis) [12:27:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:32:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:38:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:41:13] Hello team :) [12:41:37] joal: Welcome back! [12:41:43] \o/ [12:41:51] Hi btullis - thank you :) [12:50:17] 10Data-Engineering: Check home/HDFS leftovers of echetty - https://phabricator.wikimedia.org/T330834 (10BTullis) 05Open→03Resolved a:03BTullis Thanks @odimitrijevic - I have now removed the home directories with: ` sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile:... [12:53:32] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for blkwiktionary - https://phabricator.wikimedia.org/T343541 (10BTullis) [12:53:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:53:51] 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for suwikisource - https://phabricator.wikimedia.org/T343547 (10BTullis) Thanks @Marostegui - I believe that since the recent reorg, it's best to add the #data-platform-sre tag to these tickets, unless there's... [12:54:13] 10Data-Engineering, 10Data-Platform-SRE: Request DataHub edit access for David Martin (dmartin) - https://phabricator.wikimedia.org/T344217 (10BTullis) [12:56:31] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) Happened upon a problem with an-worker1100, relating to the GPU support. This will probably require a patch. ` root@an-worker1100:~# puppet agent -tv Info: Using configured environment 'production' Inf... [13:07:12] 10Data-Engineering, 10Data-Platform-SRE: Request DataHub edit access for David Martin (dmartin) - https://phabricator.wikimedia.org/T344217 (10lbowmaker) 05Open→03Resolved a:03lbowmaker Granted general editing rights. 
Resolving [13:15:25] 10Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (10Gehel) a:03Gehel [13:18:50] 10Data-Platform-SRE: Grant all authenticated users access to SQL Lab in Superset - https://phabricator.wikimedia.org/T328457 (10BTullis) a:03BTullis [13:21:03] 10Data-Platform-SRE, 10sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342247 (10Gehel) 05Open→03Resolved a:03Gehel [13:28:40] 10Data-Platform-SRE, 10Data-Catalog: DataHub rights assignment is case-sensitive - https://phabricator.wikimedia.org/T309382 (10BTullis) Moving this to blocked, whilst we implement {T305874}, which we believe will fix this issue. [13:44:11] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1101.eqiad.wmnet with OS bullseye [13:44:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [13:45:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [13:49:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [13:54:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [13:56:19] I think this error comes from a job from xcollazo [13:56:41] xcollazo: your job uses 7TB of RAM - we usually ask users not to go above 2.5TB [13:57:26] joal: Yes, I believe so too. Also probably the high number of files is related to the iceberg/spark issues identified in T338057 [13:57:27] T338057: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 [13:57:42] reading [13:59:11] ack btullis - thanks for the pointer [13:59:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [14:00:21] joal: I've just discussed with gehel and we've agreed to prioritize the spark upgrade as requested. If you have any input on spark 3.3.3 vs 3.4.1 (or both), feel free to weigh in.
[14:00:27] joal: my bad, was getting greedy with my tests, will tone it down [14:00:42] thanks xcollazo :) [14:00:56] btullis: I don't have any specifics on spark 3.X versions - I shall read [14:04:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [14:07:04] ^^^ this one is inevitable though, I will clean up, but until we bump Spark, my MERGE INTOs will generate lots of small files. More info at https://phabricator.wikimedia.org/T344266 [14:07:11] So we will have bursts of files [14:22:58] ack xcollazo - thank you for letting me know [14:23:46] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1100.eqiad.wmnet with OS bullseye completed: - an-worker1100 (**WARN**) - Downtimed on Icinga/Alertmanag... [14:30:08] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1101.eqiad.wmnet with OS bullseye completed: - an-worker1101 (**PASS**) - Downtimed on Icinga/Alertmanag... 
[15:00:28] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1102.eqiad.wmnet with OS bullseye [15:00:44] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1103.eqiad.wmnet with OS bullseye [15:08:19] (03PS1) 10MNeisler: Add wikifunctionswiki to the sqoop list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/949539 (https://phabricator.wikimedia.org/T344356) [15:41:20] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1103.eqiad.wmnet with OS bullseye completed: - an-worker1103 (**PASS**) - Downtimed on Icinga/Alertmanag... [15:43:31] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1102.eqiad.wmnet with OS bullseye completed: - an-worker1102 (**PASS**) - Downtimed on Icinga/Alertmanag... [15:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ... 
[15:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [15:59:28] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo) > We would need to update both the spark docker image version Funny, I thought the docker image was running 3.3.x? > It might be pretty tri... [16:06:30] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.16; 2023-07-04): Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10bking) [16:24:02] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1104.eqiad.wmnet with OS bullseye [16:24:07] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1105.eqiad.wmnet with OS bullseye [16:24:23] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Spike: Explore Containerization Solutions for DE Applications - https://phabricator.wikimedia.org/T288254 (10odimitrijevic) The work related to this has been done as part of standing up the [[ https:... 
[16:24:47] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Spike: Explore Containerization Solutions for DE Applications - https://phabricator.wikimedia.org/T288254 (10odimitrijevic) 05Open→03Resolved a:03odimitrijevic [16:28:27] 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10BTullis) >>! In T338057#9096980, @xcollazo wrote: >> We would need to update both the spark docker image version > Funny, I thought the docker image w... [16:29:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [16:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:57] milimetric: I'm planning on doing an aqs deploy with this fix: https://gerrit.wikimedia.org/r/c/analytics/aqs/deploy/+/947385 for this: T342213 [16:34:59] T342213: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 [16:36:05] 10Data-Platform-SRE, 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10xcollazo) Wanted to pass by and +1 this effort. 
In the context of T338057, T338057 and T342587, having the ability to look back would have significantly improved the development experience. [16:36:07] I've never done an aqs deploy before and you said that you had some issues that you fixed recently. Any guidelines for me beyond https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/AQS#Step_2:_Deploy_using_scap ? [16:36:21] 10Data-Platform-SRE, 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10xcollazo) [16:38:04] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update aqs scap targets with new hosts and tidy up [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/947385 (https://phabricator.wikimedia.org/T342213) (owner: 10Btullis) [16:38:04] 10Data-Platform-SRE, 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10xcollazo) Also related: {T342923}. [16:43:42] !log deploying aqs [16:43:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:30] (03PS1) 10Btullis: Fix scap targets [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/949557 (https://phabricator.wikimedia.org/T342213) [16:51:28] (03CR) 10Btullis: [V: 03+2 C: 03+2] Fix scap targets [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/949557 (https://phabricator.wikimedia.org/T342213) (owner: 10Btullis) [16:52:15] !log deploying aqs again [16:52:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:04:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host
an-worker1104.eqiad.wmnet with OS bullseye completed: - an-worker1104 (**PASS**) - Downtimed on Icinga/Alertmanag... [17:05:24] * btullis !log re-ran refine_eventlogging_analytics failed job and sent follow-up email. [17:05:30] !log re-ran refine_eventlogging_analytics failed job and sent follow-up email. [17:05:35] !log re-ran refine_eventlogging_analytics failed job and sent follow-up email. [17:05:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:05:42] Fat fingers. [17:06:21] !log aqs deploy completed successfully. [17:06:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:06:58] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1105.eqiad.wmnet with OS bullseye completed: - an-worker1105 (**PASS**) - Downtimed on Icinga/Alertmanag... [17:09:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [18:27:08] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data-Catalog, 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10tchin) Since Datahub has the concept of platforms, I think the best way forward is to have a separate platform called `Even...
[19:30:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [19:31:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [19:48:39] (MediawikiPageContentChangeEnrichAvailability) firing: ... [19:48:39] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability [20:30:46] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.16; 2023-07-04): Rollout Elasticsearch extra plugins package and restart cluster to apply - https://phabricator.wikimedia.org/T344366 (10bking) Package builds a... [20:37:42] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bo... 
[21:43:20] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data-Catalog, 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10odimitrijevic) @Tchin as discussed today, that sounds like a good approach. Before deploying to production, let's wipe out... [21:45:55] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data-Catalog, 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10odimitrijevic) [21:46:58] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Discovery-Search (Current work), 10Event-Platform, 10MW-1.41-notes (1.41.0-wmf.16; 2023-07-04): Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (10RKemper) Built `wmf-elasticsearch-search-plugins_7.10.... [21:52:36] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookwo... [22:08:27] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data-Catalog, 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10odimitrijevic) [22:11:49] 10Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (10bking) [22:13:04] 10Data-Platform-SRE: Automate elastic plugin pkg build process - https://phabricator.wikimedia.org/T303011 (10bking) Moving to "in progress," as I did have an opportunity to work on this again today, in support of T325315 . 
[22:25:17] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Data-Catalog, 10Event-Platform: Event Platform and DataHub Integration - https://phabricator.wikimedia.org/T318863 (10odimitrijevic) Here are some considerations that we discussed, that we need to further explore and decide on: - Creating a... [23:48:40] (MediawikiPageContentChangeEnrichAvailability) firing: ... [23:48:40] Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=codfw%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichAvailability