[01:38:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[05:38:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:25:28] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:45:52] * brouberol waves good morning!
[07:46:01] * brouberol and a happy new year
[08:41:40] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[08:51:40] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[09:05:02] <btullis>	 Good morning all and a happy new year too.
[09:14:55] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) a:BTullis
[09:24:23] <btullis>	 !log adding three days' downtime to dbstore1008, prior to switching its role to `mariadb::analytics_replica` for T351921
[09:24:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:24:26] <stashbot>	 T351921: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921
[09:35:37] <wikibugs>	 Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (SGupta-WMF)
[09:38:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:18] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q3 Milestone 1): Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol) Open→Resolved After some unsuccessful testing, I found https://github.com/apache/spark/pull/34326 which indicates that the...
[09:57:20] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:57:43] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Monitor the availability of the spark history server deployments - https://phabricator.wikimedia.org/T353717 (brouberol) Open→Resolved
[09:57:45] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[10:39:51] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) a:brouberol
[10:46:21] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:50:24] <wikibugs>	 Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work), Patch-For-Review: SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer) a:pfischer→None
[10:53:48] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I am running the following in a screen session on cumin1001 to recover the latest snapshot of s1 to dbstore1008. ` sudo transfer.py --type=decompress dbprov1...
[10:56:33] <brouberol>	 !log configuring [eqiad,codfw].mediawiki.cirrussearch.page_rerender.v1 as compacted topics on jumbo-eqiad - T353715
[10:56:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:56:36] <stashbot>	 T353715: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715
[10:56:49] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'eqiad.me...
[10:58:22] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer)
[10:58:28] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) We can see the impact on the overall topic size {F41648651}
[10:59:40] <jinxer-wm>	 (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[11:13:35] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) ` brouberol@kafka-jumbo1010:~$ kafka configs --entity-type topics --entity-name 'codfw.me...
[11:17:53] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) 22% of the topic segments were compacted and deleted: {F41648664}
[11:18:30] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[11:19:41] <jinxer-wm>	 (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for mediawiki_history_reduced_2023_12 on the druid_public Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_public&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable
[11:41:19] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I am creating a backup of the grants from dbstore1003 with the following command: ` root@dbstore1003:~# sudo pt-show-grants -S /run/mysqld/mysqld.s1.sock > /...
[12:17:54] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol)
[12:18:01] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (brouberol) Open→Resolved The change has been applied an hour ago (at the line). We don't obs...
[12:41:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on dbstore1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:47:02] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) The data transfer completed successfully. ` 2024-01-02 11:52:04  WARNING: Original size is 475259205252 but transferred size is 1334541723635 for copy to dbs...
[13:10:38] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) Zarcillo database updated.
[13:38:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:42] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Interesting!  Curious, so the reason for using compaction here is just to save space, not...
[14:20:58] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm
[14:26:20] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) a:BTullis
[14:27:00] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbstore1009.eqiad.wmnet with OS bookworm
[14:43:55] <wikibugs>	 Data-Engineering (Sprint 6): [Event Platform] Review analytics switch approach VarnishKafka -> HAProxy - https://phabricator.wikimedia.org/T353454 (Ahoelzl) Will close this as part of the Sprint review on 01/08.
[14:58:12] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1009.eqiad.wmnet with OS bookworm completed: - dbstore1009 (**PA...
[14:59:08] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbstore1008.eqiad.wmnet with OS bookworm completed: - dbstore1008 (**WA...
[15:06:26] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (bking) I merged the above patch, but I noticed the unit `wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer` was not re...
[15:18:13] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) I think that we are ready to move the `analytics-hive.eqiad.wmnet` DNS CNAME from an-coord1001 to an-coord1003.  I have tested by running `sudo -u analyt...
[15:25:16] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[15:25:37] <wikibugs>	 Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[15:27:00] <btullis>	 joal: brouberol: have added you as reviewers to this change. https://gerrit.wikimedia.org/r/c/operations/dns/+/987152 - I think we're ready to move the hive services to a new coordinator, but I'd appreciate some other people being around to double-check things.
[15:28:37] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4] - https://phabricator.wikimedia.org/T332572 (BTullis)
[15:28:43] <brouberol>	 I'll have a look real quick
[15:29:15] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:29:20] <wikibugs>	 Data-Platform-SRE, Epic: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[15:29:22] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:31:46] <btullis>	 Ah, I think that jo.al is out this week.
[15:35:48] <btullis>	 brouberol: Thanks. Here goes then.
[15:36:23] <btullis>	 !log migrating analytics-hive.eqiad.wmnet to an-coord1003 for T336045
[15:36:26] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:26] <stashbot>	 T336045: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045
[15:38:01] <btullis>	 The TTL on the DNS record is five minutes, so we should know pretty soon whether it's working as expected.
[15:39:10] <wikibugs>	 Data-Platform-SRE: Decom an-coord100[1-2] - https://phabricator.wikimedia.org/T353774 (BTullis)
[15:39:16] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[15:39:22] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, yes, this was intended to a) save disk space and b) reduce the number of record...
[15:40:56] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) Are you sure you want `delete` in the policy then?  Perhaps you want to keep all the lates...
[15:47:59] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis) This looks to be OK so far. I have run the same test from a stat client and I can see that the metastore connection is going to an-coord1003. ` btullis@s...
[16:12:45] <wikibugs>	 Data-Engineering, API Platform, GraphQL, Pageviews-API: Responses on pageview API should be lighter - https://phabricator.wikimedia.org/T145935 (Atieno) a:Atieno→None
[16:18:18] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I have reimaged dbstore1008 as bookworm, which caused it to pull in version 10.6 of mariadb as well.  I have repeated the transfer and started the replicatio...
[16:31:03] <wikibugs>	 Data-Engineering (Sprint 8), Data Products, serviceops-radar, Patch-For-Review: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (CodeReviewBot) xcollazo opened https://gitlab.wikimedia.org/repos/data-engineeri...
[16:32:59] <wikibugs>	 Data-Engineering (Sprint 8), serviceops-radar, Data Products (Data Products Sprint 05), Patch-For-Review: Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters - https://phabricator.wikimedia.org/T338796 (xcollazo) a:xcollazo
[16:47:14] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) The recovery of s5 has completed. I set the replication parameters with: ` sudo cat /srv/sqldata.s5/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s5 ` Fol...
[16:52:36] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (pfischer) @Ottomata, we considered this but decided against it since  a) page_rerender is only o...
[17:09:52] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (BTullis)
[17:13:49] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Starting the recovery of s7 with: ` btullis@cumin1002:~$ sudo transfer.py --type=decompress dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s7....
[17:14:46] <wikibugs>	 Data-Engineering, Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (xcollazo) Open→Resolved a:xcollazo `meta.dt`, in practice, is always set.  I'll take this as good enough since there is no practical consequences of having the s...
[17:20:43] <wikibugs>	 (PS16) Btullis: Update to Superset version 3.0.2 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[17:38:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:44:54] <wikibugs>	 Data-Platform-SRE, GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:45:47] <wikibugs>	 Data-Platform-SRE, GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:54:12] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking)
[17:57:51] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), GitLab (Project Migration): Migrate Elasticsearch plugins repo to gitlab - https://phabricator.wikimedia.org/T353275 (bking) [[ https://www.mediawiki.org/wiki/GitLab/Hosting_a_project_on_GitLab#Migrating_a_project | This page ]] documents how to migrate to Gitl...
[18:08:33] <jinxer-wm>	 (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[18:08:33] <jinxer-wm>	 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[18:54:50] <wikibugs>	 Data-Engineering (Sprint 6): [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (Snwachukwu) a:Snwachukwu
[19:01:49] <wikibugs>	 Data-Platform-SRE, Movement-Insights: Create a DataHub group for the Movement Insights team - https://phabricator.wikimedia.org/T354211 (nshahquinn-wmf)
[19:06:34] <wikibugs>	 (PS5) TChin: Add iceberg version of aqs_hourly table [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669)
[19:07:09] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Setting recovery coordinates with: ` btullis@dbstore1008:/srv/sqldata.s7$ sudo cat /srv/sqldata.s7/xtrabackup_slave_info | grep GLOBAL | sudo mysql.s7 ` Foll...
[19:07:53] <wikibugs>	 (CR) TChin: Add iceberg version of aqs_hourly table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[19:08:32] <jinxer-wm>	 (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: (2) Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:08:32] <jinxer-wm>	 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:16:40] <wikibugs>	 Data-Engineering, Data Pipelines: Refine: Use Spark SQL instead of Hive JDBC - https://phabricator.wikimedia.org/T209453 (Ottomata) I made some progress modifying Spark to [[ https://github.com/apache/spark/pull/21012#issuecomment-1874422376 | make it support adding nested column ]].  I'll stop here and...
[19:18:32] <jinxer-wm>	 (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:18:32] <jinxer-wm>	 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:50:04] <wikibugs>	 Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) NFS appears to result in a permissions issue as nfs is creating files and directories as nfsmanager/498 where quarry is trying to create files as quarry/999
[20:03:13] <wikibugs>	 Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (bking)
[20:04:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:05:02] <wikibugs>	 Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (RKemper)
[20:39:28] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:41:16] <wikibugs>	 Data-Platform-SRE: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking)
[20:41:18] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (bking)
[21:08:55] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye
[21:17:18] <wikibugs>	 (PS1) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[21:20:29] <wikibugs>	 Data-Engineering, Product-Analytics, Patch-For-Review: Propagate field descriptions from event schemas to Hive event tables - https://phabricator.wikimedia.org/T307040 (Ottomata) > I think this would automatically just work if we could create/alter the tables through Spark directly, rather than throu...
[21:22:34] <wikibugs>	 (CR) CI reject: [V: -1] spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040) (owner: Ottomata)
[21:22:40] <wikibugs>	 Data-Platform-SRE (2023/24 Q3 Milestone 1), CirrusSearch, Discovery-Search, SRE, serviceops: Enable kafka log compaction for page_rerender on jumbo - https://phabricator.wikimedia.org/T353715 (Ottomata) +1 k!
[21:28:17] <wikibugs>	 (PS2) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[21:38:58] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:09] <wikibugs>	 (PS3) Ottomata: spark HiveExtensions now support column COMMENTs in DDL and merge helpers [analytics/refinery/source] - https://gerrit.wikimedia.org/r/987195 (https://phabricator.wikimedia.org/T307040)
[22:01:27] <wikibugs>	 Data-Engineering, Product-Analytics, Patch-For-Review: Propagate field descriptions from event schemas to Hive event tables - https://phabricator.wikimedia.org/T307040 (Ottomata) Wow it...kinda...works~   `lang=sql CREATE TABLE otto.mw_page_change0 LIKE event.mediawiki_page_change_v1; `  Then I ran o...
[22:29:34] <wikibugs>	 Data-Platform-SRE, Patch-For-Review: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2087.codfw.wmnet with OS bullseye executed with errors: - elastic2087 (**FAIL**...
[23:08:32] <jinxer-wm>	 (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[23:08:32] <jinxer-wm>	 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=codfw.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected