[01:38:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:38:57] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:45:52] * brouberol waves good morning!
[07:46:01] * brouberol and a happy new year
[08:41:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:51:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:05:02] Good morning all and a happy new year too.
[09:14:55] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) a:BTullis
[09:24:23] !log adding three days' downtime to dbstore1008, prior to switching its role to `mariadb::analytics_replica` for T351921
[09:24:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:24:26] T351921: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921
[09:35:37] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (SGupta-WMF)
[09:38:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:18] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q3 Milestone 1): Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol) Open→Resolved After some unsuccessful testing, I found https://github.com/apache/spark/pull/34326 which indicates that the...
[09:57:20] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:57:43] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Monitor the availability of the spark history server deployments - https://phabricator.wikimedia.org/T353717 (brouberol) Open→Resolved
[09:57:45] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[10:39:51] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) a:brouberol
[10:46:21] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:50:24] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer→None
[10:53:48] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I am running the following in a screen session on cumin1001 to recover the latest snapshot of s1 to dbstore1008. ` sudo transfer.py --type=decompress dbprov1...
[10:56:33] !log configuring [eqiad,codfw].mediawiki.cirrussearch.page_rerender.v1 as compacted topics on jumbo-eqiad - T353715
[10:56:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:56:36] T353715: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715
[10:56:49] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.me...
[10:58:22] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:58:28] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) We can see the impact on the overall topic size {F41648651}
[10:59:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[11:13:35] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.me...
[11:17:53] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) 22% of the topic segments were compacted and deleted: {F41648664}
[11:18:30] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[11:19:41] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[11:41:19] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I am creating a backup of the grants from dbstore1003 with the following command: ` root@dbstore1003:~# sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /...
[12:17:54] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[12:18:01] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) Open→Resolved The change has been applied an hour ago (at the line). We don't obs...
[12:41:59] (PuppetZeroResources) firing: Puppet has failed generate resources on dbstore1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:47:02] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) The data transfer completed successfully. ` 2024-01-02 11:52:04 WARNING: Original size is 475259205252 but transferred size is 1334541723635 for copy to dbs...
[13:10:38] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) Zarcillo database updated.
[13:38:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:42] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Interesting! Curious, so the reason for using compaction here is just to save space, not...
[14:20:58] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm
[14:26:20] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) a:BTullis
[14:27:00] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1009.eqiad.wmnet with OS bookworm
[14:43:55] Data-Engineering (Sprint 6): [Event Platform] Review analytics switch approach VarnishKafka -> HAProxy - https://phabricator.wikimedia.org/T353454 (Ahoelzl) Will close this as part of the Sprint review on 01/08.
[14:58:12] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1009.eqiad.wmnet with OS bookworm completed: - dbstore1009 (**PA...
[14:59:08] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm completed: - dbstore1008 (**WA...
[15:06:26] Data-Platform-SRE, Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (bking) I merged the above patch, but I noticed the unit `wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer` was not re...
[15:18:13] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) I think that we are ready to move the `analytics-hive.eqiad.wmnet` DNS CNAME from an-coord1001 to an-coord1003. I have tested by running `sudo -u analyt...
[15:25:16] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[15:25:37] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[15:27:00] joal: brouberol: have added you as reviewers to this change. https://gerrit.wikimedia.org/r/c/operations/dns/+/987152 - I think we're ready to move the hive services to a new coordinator, but I'd appreciate some other people being around to double-check things.
[15:28:37] Data-Platform-SRE, Patch-For-Review: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (BTullis)
[15:28:43] I'll have a look real quick
[15:29:15] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:29:20] Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[15:29:22] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:31:46] Ah, I think that jo.al is out this week.
[15:35:48] brouberol: Thanks. Here goes then.
[15:36:23] !log migrating analytics-hive.eqiad.wmnet to an-coord1003 for T336045
[15:36:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:26] T336045: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045
[15:38:01] The TTL on the DNS record is five minutes, so we should know pretty soon whether it's working as expected.
[15:39:10] Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (BTullis)
[15:39:16] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:39:22] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, yes, this was intended to a) save disk space and b) reduce the number of record...
[15:40:56] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Are you sure you want `delete` in the policy then? Perhaps you want to keep all the lates...
[15:47:59] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) This looks to be OK so far. I have run the same test from a stat client and I can see that the metastore connection is going to an-coord1003. ` btullis@s...
[16:12:45] Data-Engineering, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (Atieno) a:Atieno→None
[16:18:18] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I have reimaged dbstore1008 as bookworm, which caused it to pull in version 10.6 of mariadb as well. I have repeated the transfer and started the replicatio...
[16:31:03] Data-Engineering (Sprint 8), Data Products, serviceops-radar, Patch-For-Review: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineeri...
[16:32:59] Data-Engineering (Sprint 8), serviceops-radar, Data Products (Data Products Sprint 05), Patch-For-Review: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (xcollazo) a:xcollazo
[16:47:14] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) The recovery of s5 has completed. I set the replication parameters with: ` sudo cat /srv/sqldata.s5/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s5 ` Fol...
[16:52:36] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, we considered this but decided against it since a) page_rerender is only o...
[17:09:52] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[17:13:49] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Starting the recovery of s7 with: ` btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s7....
[17:14:46] Data-Engineering, Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (xcollazo) Open→Resolved a:xcollazo `meta.dt`, in practice, is always set. I'll take this as good enough since there is no practical consequences of having the s...
[17:20:43] (PS16) Btullis: Update to Superset version 3.0.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[17:38:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:44:54] Data-Platform-SRE, GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:45:47] Data-Platform-SRE, GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:54:12] Data-Platform-SRE (2023/24 Q3 Milestone 1), GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:57:51] Data-Platform-SRE (2023/24 Q3 Milestone 1), GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking) [[ https://www.mediawiki.org/wiki/GitLab/Hosting_a_project_on_GitLab#Migrating_a_project | This page ]] documents how to migrate to Gitl...
[18:08:33] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[18:08:33] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[18:54:50] Data-Engineering (Sprint 6): [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (Snwachukwu) a:Snwachukwu
[19:01:49] Data-Platform-SRE, Movement-Insights: Create a DataHub group for the Movement Insights team - https://phabricator.wikimedia.org/T354211 (nshahquinn-wmf)
[19:06:34] (PS5) TChin: Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669)
[19:07:09] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Setting recovery coordinates with: ` btullis@dbstore1008:/srv/sqldata.s7$ sudo cat /srv/sqldata.s7/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s7 ` Foll...
[19:07:53] (CR) TChin: Add iceberg version of aqs_hourly table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[19:08:32] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: (2) Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:08:32] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:16:40] Data-Engineering, Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (Ottomata) I made some progress modifying Spark to [[ https://github.com/apache/spark/pull/21012#issuecomment-1874422376 | make it support adding nested column ]]. I'll stop here and...
[19:18:32] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:18:32] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:50:04] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) NFS appears to result in a permissions issue as nfs is creating files and directories as nfsmanager/498 where quarry is trying to create files as quarry/999
[20:03:13] Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (bking)
[20:04:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:05:02] Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (RKemper)
[20:39:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:41:16] Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking)
[20:41:18] Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (bking)
[21:08:55] Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye
[21:17:18] (PS1) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[21:20:29] Data-Engineering, Product-Analytics, Patch-For-Review: Propagate field descriptions from event schemas to Hive event tables - https://phabricator.wikimedia.org/T307040 (Ottomata) > I think this would automatically just work if we could create/alter the tables through Spark directly, rather than throu...
[21:22:34] (CR) CI reject: [V: -1] spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040) (owner: Ottomata)
[21:22:40] Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) +1 k!
[21:28:17] (PS2) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[21:38:58] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:09] (PS3) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[22:01:27] Data-Engineering, Product-Analytics, Patch-For-Review: Propagate field descriptions from event schemas to Hive event tables - https://phabricator.wikimedia.org/T307040 (Ottomata) Wow it...kinda...works~ `lang=sql CREATE TABLE otto.mw_page_change0 LIKE event.mediawiki_page_change_v1; ` Then I ran o...
[22:29:34] Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye executed with errors: - elastic2087 (**FAIL**...
[23:08:32] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[23:08:32] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected