[00:04:15] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:32] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:30:33] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T354499 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:04:33] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:05] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) prometheus exporters are failing with access denied, so I guess some users were not correctly migrated: ` root@dbstore1008:~# system...
[08:04:33] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:31] Data-Platform-SRE, SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (Gehel)
[10:29:21] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9440216, @Marostegui wrote: > prometheus exporters are failing with access denied, so I guess some users were not correc...
[10:44:02] Data-Engineering, Data-Engineering-Wikistats, Data Pipelines, Data Products, and 4 others: Merge ks-Arab and ks-Deva to ks - https://phabricator.wikimedia.org/T314476 (Pginer-WMF)
[10:47:57] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search: Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel)
[10:48:21] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.13; 2024-01-09), and 2 others: Remove EventLogging::submitMetricsEvent() - https://phabricator.wikimedia.org/T354419 (phuedx)
[10:48:23] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, MW-1.42-notes (1.42.0-wmf.13; 2024-01-09), and 2 others: Remove EventLogging::submitMetricsEvent() - https://phabricator.wikimedia.org/T354419 (phuedx) Nice. Thanks!
[11:09:12] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I've added the `REPLICA MONITOR` permission to the nagios and prometheus users for all sections on dbstore1008. I've subsequently re-ena...
[11:22:51] \o/ I'm back from my temporary (4 months) house !
[11:22:56] Hi folks :)
[11:23:14] Hi joal, welcome back! Happy New Year.
[11:23:54] Well thank you btullis! Happy new year to you and the rest of the team as well, all my best wishes <3
[11:30:46] Data-Engineering, Data-Engineering-Wikistats: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (Sfaci) Hi, You are right. The bug we have fixed related to the Latvian Wiki is not specific for that Wiki. It's a bug related to how we were filtering the r...
[12:04:33] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:31] Data-Platform-SRE (2023/24 Q3 Milestone 1), SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (Gehel)
[12:21:52] Data-Platform-SRE: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 (Gehel) p:Triage→High
[12:22:20] Data-Platform-SRE, Movement-Insights: Create a DataHub group for the Movement Insights team - https://phabricator.wikimedia.org/T354211 (Gehel) p:Triage→High
[12:26:58] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (BTullis)
[12:28:04] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring stat1011 into service - https://phabricator.wikimedia.org/T354526 (BTullis)
[12:34:07] Data-Engineering, Data-Platform-SRE, Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (BTullis) I believe that we're on the verge of finishing the migration of all legacy eventlogging components. See {T259163} and {T238230} for further det...
[12:53:14] (PS26) Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[13:13:07] Data-Engineering, Data Products (Data Products Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (gmodena) Looks like there were duplicate x_analytics keys in December data. The followin...
[13:39:55] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (CodeReviewBot) xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/572 Make global confs immutable.
[13:47:15] (EventgateValidationErrors) firing: ...
[13:47:16] eventgate-analytics-external stream eventlogging_UniversalLanguageSelector validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[13:50:07] Data-Platform-SRE: Configure OIDC Authentication for Superset on K8S - https://phabricator.wikimedia.org/T353794 (Gehel) p:Triage→High
[13:51:36] Data-Engineering, Data Products (Data Products Sprint 05): Make defaults immutable for Airflow confs - https://phabricator.wikimedia.org/T325014 (xcollazo)
[14:02:15] (EventgateValidationErrors) resolved: ...
[14:02:16] eventgate-analytics-external stream eventlogging_UniversalLanguageSelector validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[15:10:06] Data-Engineering, Data Products (Data Products Sprint 04): Duplicate keys in x_analytics header corrupt some wmf_raw.webrequest rows and break refinement of wmf.webrequest - https://phabricator.wikimedia.org/T351909 (gmodena) > The following wmf.webrequest partitions have been flagged as buggy (detected...
[15:17:46] Data-Engineering, Data-Platform-SRE, Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (Ottomata) Decommissioning probably won't get done until after I'm back from leave in late April. Can we wait that long?
[15:19:24] Data-Platform-SRE (2023/24 Q3 Milestone 1): Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) I had initially thought to take a backup of the staging database on dbstore1005 and restore it to dbstore1008, then set up dbstore1008 as a replication clien...
[15:44:16] (SystemdUnitFailed) resolved: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:47:22] (KafkaReplicationFactorTooLow) firing: (833) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[15:48:42] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:49] (KafkaReplicationFactorTooLow) resolved: (833) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[15:56:01] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9441050, @BTullis wrote: > Therefore, I think that dbstore1008 is ready to receive traffic. The only reason that I'm hes...
[15:56:53] !log migrating s7-analytics-replica to dbstore1008 for T351921
[15:56:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:56:57] T351921: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921
[15:58:42] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@staging.service Failed on dbstore1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:49] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) That didn't work as hoped. ` btullis@stat1004:~$ analytics-mysql ukwiki ERROR 1698 (28000): Access denied for user 'research'@'2620:0:8...
[16:06:15] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) >>! In T351921#9442090, @BTullis wrote: > That didn't work as hoped. > ` > btullis@stat1004:~$ analytics-mysql ukwiki > ERROR 1698 (...
[16:14:50] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (Gehel) Still needs to be done: * HTTP timeout * Documentation * Validate elasticsearch ingestion throughput once we're on Cloudelastic
[16:17:46] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9442097, @Marostegui wrote: > dbstore hosts should not have ipv6 AAAA records otherwise you'd be running into https://ph...
[16:18:56] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Cirrus-streaming-updater test: validate relforge indices are correctly updated - https://phabricator.wikimedia.org/T350186 (Gehel) We can also use Saneitizer on Cloudelastic to see how much it diverges from other clusters (will...
[16:21:35] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05), Patch-For-Review: Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Sfaci) At this moment the fix is already deployed to producti...
[16:26:46] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) That is indeed the right procedure, but double check with them to be sure. There's absolutely no need to reimage and/or edit the int...
[16:30:14] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel)
[16:31:36] Data-Engineering, Data-Engineering-Wikistats: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (Sfaci) By the way, the fix for the Latvian wiki bug has just been deployed in production environment. From now on you should see right results for any Wiki....
[16:35:11] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T353712 (Gehel)
[16:38:54] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel)
[16:40:38] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs - https://phabricator.wikimedia.org/T354289 (Gehel)
[16:52:30] Data-Platform-SRE, Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (pfischer) One of the key findings of the backfill tests was a lower-than-expected throughput, see T353460. That is mainly caused by a bug inside the flink consume...
[16:56:49] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Ugh, still not working. ` btullis@stat1004:~$ analytics-mysql ukwiki ERROR 1044 (42000): Access denied for user 'research'@'10.%' to da...
[17:00:29] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) Hmm. `research_role` has a `mysql_native_password` on dbstore1003, whereas it has a `unix_socket` on dbstore1008. I wonder if that is...
[17:00:41] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) >>! In T351921#9442458, @BTullis wrote: > Ugh, still not working. > ` > btullis@stat1004:~$ analytics-mysql ukwiki > ERROR 1044 (420...
[17:02:01] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) Fixed: ` btullis@stat1004:/root$ analytics-mysql ukwiki Reading table information for completion of table and column names You can t...
[17:03:18] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence (work done), Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui)
[17:08:54] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui)
[17:11:25] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9442480, @Marostegui wrote: > Fixed: > ` > btullis@stat1004:/root$ analytics-mysql ukwiki > Readin...
[17:11:55] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis)
[17:12:16] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) pt-show-grants doesn't show roles grants as far as I remember. This is essentially what you need: ` root@dbsto...
[17:19:34] !log migrated s5-analytics-replica to dbstore1008 for T351921
[17:19:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:19:38] T351921: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921
[17:20:59] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9442529, @Marostegui wrote: > pt-show-grants doesn't show roles grants as far as I remember. This...
[17:21:13] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis)
[17:22:52] !log migrated s1-analytics-replica to dbstore1008 for T351921
[17:22:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:23:50] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) >>! In T351921#9442548, @BTullis wrote: >>>! In T351921#9442529, @Marostegui wrote: >> pt-show-grants doesn't...
[17:23:57] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) s5 and s1 sections are now also being served from dbstore1008. ` btullis@stat1004:~$ analytics-mysql avkwiki Rea...
[17:24:59] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis)
[17:36:15] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis)
[17:36:38] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence, Patch-For-Review: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) I've checked that the `wmfdata-python` library is working for wikis on s1, s5, and s7 as well. e.g. {F41659147,w...
[17:38:51] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Identify/complete post-migration tasks after rdf-streaming-updater migrates to flink operator - https://phabricator.wikimedia.org/T350784 (bking)
[17:47:01] Data-Platform-SRE: PuppetZeroResources alert on elastic2083.codfw.wmnet - https://phabricator.wikimedia.org/T354543 (bking)
[18:08:32] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) firing: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[18:08:32] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[18:41:09] Data-Platform-SRE: PuppetZeroResources alert on elastic2083.codfw.wmnet - https://phabricator.wikimedia.org/T354543 (bking) The above CR appears to have fixed the issue. Closing...
[18:41:19] Data-Platform-SRE (2023/24 Q3 Milestone 1): PuppetZeroResources alert on elastic2083.codfw.wmnet - https://phabricator.wikimedia.org/T354543 (bking)
[18:48:59] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Spnq) The "Content page" count now looks to be correct 👍 But now there's another...
[19:06:00] Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (ssingh) [EDITED, wrong task ID for authdns-update].
[19:09:10] (GobblinKafkaRecordsExtractedNotEqualRecordsExpected) resolved: Gobblin job event_default ingested an unexpected number of records for a Kafka topic partition. ...
[19:09:10] - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default&var-kafka_topic=eqiad.mediawiki.cirrussearch.page_rerender.v1&viewPanel=4 - https://alerts.wikimedia.org/?q=alertname%3DGobblinKafkaRecordsExtractedNotEqualRecordsExpected
[19:34:49] (PS1) Snwachukwu: Migration of browser General table to iceberg format. 1. Add an iceberg create table statement hql file for browser_general table 2. Add hql file to update browser_general iceberg table with values. [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670)
[19:35:26] Data-Engineering, Movement-Insights, Traffic, Patch-For-Review: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (dr0ptp4kt) I'll amend the patch.
[19:36:54] Data-Engineering (Sprint 6), Patch-For-Review: [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (Snwachukwu) Update a patch containing 2 hql files required to create and update iceberg version of browser_general tables respectively.
[19:44:10] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (Gehel)
[19:44:26] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (Gehel)
[19:44:32] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service: Allow federated queries with the NFDI4Culture Knowledge Graph - https://phabricator.wikimedia.org/T346455 (Gehel)
[19:45:53] Data-Engineering (Sprint 7): Migrate browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (lbowmaker)
[19:49:56] Data-Engineering (Sprint 6): [Event Platform] Review analytics switch approach VarnishKafka -> HAProxy - https://phabricator.wikimedia.org/T353454 (lbowmaker) Open→Resolved
[19:50:03] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Adopt iceberg as the data quality metrics table backend - https://phabricator.wikimedia.org/T352687 (lbowmaker) Open→Resolved
[19:50:06] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[19:50:10] Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (lbowmaker) Open→Resolved
[19:50:12] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (lbowmaker) Open→Resolved
[19:50:14] Data-Engineering, Event-Platform: [SPIKE] Should we enable compression on kafka jumbo? - https://phabricator.wikimedia.org/T345657 (lbowmaker)
[19:50:17] Data-Engineering, Epic: [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents - https://phabricator.wikimedia.org/T345912 (lbowmaker)
[19:50:19] Analytics, Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review, User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (lbowmaker)
[19:50:21] Data-Engineering (Sprint 6), Patch-For-Review: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (lbowmaker) Open→Resolved
[19:50:23] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q3 Milestone 1), Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (lbowmaker)
[19:50:26] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (lbowmaker) Open→Resolved
[19:50:28] Data-Engineering (Sprint 6), Data Pipelines, Event-Platform, MW-1.41-notes (1.41.0-wmf.28; 2023-09-26): [Event Platform] eventgate-wikimedia occasionally fails to produce events due to stream config fetch errors - https://phabricator.wikimedia.org/T326002 (lbowmaker) Open→Resolved
[19:50:30] Data-Engineering (Sprint 6), Event-Platform: [Event Platform] mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (lbowmaker)
[19:50:35] Analytics, Analytics-Kanban, Data-Engineering, MediaWiki-extensions-EventLogging, and 3 others: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 (lbowmaker)
[19:50:38] Analytics, Data-Engineering (Sprint 6), Event-Platform, Patch-For-Review, User-notice: [Event Platform] Enable canary events for all MediaWiki streams - https://phabricator.wikimedia.org/T266798 (lbowmaker) Open→Resolved
[19:51:05] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1), Observability-Metrics, Patch-For-Review: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (lbowmaker)
[19:51:21] Data-Engineering (Sprint 7): [Data Quality] Finalize Data Quality Metrics Schema - https://phabricator.wikimedia.org/T352683 (lbowmaker)
[19:51:40] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (lbowmaker)
[19:51:43] Data-Engineering (Sprint 7), Patch-For-Review: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables - https://phabricator.wikimedia.org/T347879 (lbowmaker)
[19:51:45] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Metrics Alerting - https://phabricator.wikimedia.org/T352685 (lbowmaker)
[19:51:59] Data-Engineering (Sprint 7), Patch-For-Review: [Data Quality] Move MetricsExporter to refinery-spark - https://phabricator.wikimedia.org/T352688 (lbowmaker)
[19:52:02] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (lbowmaker)
[19:52:04] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate interlanguage tables to Iceberg - https://phabricator.wikimedia.org/T352671 (lbowmaker)
[19:52:10] Data-Engineering (Sprint 7): [Data Quality] [Needs Grooming] Collect requirements to define prioritized data pipeline and data metrics - https://phabricator.wikimedia.org/T350409 (lbowmaker)
[19:52:13] Data-Engineering (Sprint 7): [Data Quality] Define concept for Alerting in coordination with SRE - https://phabricator.wikimedia.org/T351093 (lbowmaker)
[19:52:16] Data-Engineering (Sprint 7), Data Pipelines, Discovery-Search, Java-Scala-Standardization, Patch-For-Review: [Maintenance] We should have a top level maven parent pom based on wikimedia-discovery-discovery-parent-pom, - https://phabricator.wikimedia.org/T309097 (lbowmaker)
[19:52:54] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (lbowmaker)
[19:52:56] Data-Engineering (Sprint 7), Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Configure the spark event dir in the spark3 defaults - https://phabricator.wikimedia.org/T352849 (lbowmaker)
[19:53:02] Data-Engineering (Sprint 7), Data-Platform-SRE, SRE Observability: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query - https://phabricator.wikimedia.org/T347430 (lbowmaker)
[19:53:08] Data-Engineering (Sprint 7), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (lbowmaker)
[19:53:20] Data-Engineering (Sprint 7), Data Products, Structured-Data-Backlog: [Maintenance] Set up deletion jobs for Structured Data's data pipelines - https://phabricator.wikimedia.org/T347561 (lbowmaker)
[19:53:25] Data-Engineering (Sprint 7), Data-Platform-SRE, Event-Platform: [Event Platform] Define Flink k8s operator SLO - https://phabricator.wikimedia.org/T345914 (lbowmaker)
[19:53:27] Data-Engineering (Sprint 7): [Iceberg Migration] Migrate pageview tables to Iceberg - https://phabricator.wikimedia.org/T347690 (lbowmaker)
[19:53:29] Data-Engineering (Sprint 7): [Iceberg Migration] Migrate session length tables to Iceberg - https://phabricator.wikimedia.org/T352672 (lbowmaker)
[19:53:31] Data-Engineering (Sprint 7), Event-Platform: [Event Platform] eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (lbowmaker)
[19:53:34] Data-Engineering (Sprint 7), Data-Platform-SRE, Patch-For-Review: [Data Platform] Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (lbowmaker)
[19:54:05] Data-Engineering (Sprint 7), Machine-Learning-Team, Wikimedia Enterprise, Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (lbowmaker)
[19:56:33] Data-Platform-SRE, Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and improve hadoop access - https://phabricator.wikimedia.org/T354555 (bking)
[19:59:10] (SystemdUnitFailed) firing: user-runtime-dir@43623.service Failed on stat1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:47] Data-Engineering, Epic: Dataset Config Store - https://phabricator.wikimedia.org/T354557 (lbowmaker)
[20:10:06] Data-Engineering: [Dataset Config Store] [SPIKE] Investigate existing solutions - https://phabricator.wikimedia.org/T354558 (lbowmaker)
[20:10:32] Data-Engineering (Sprint 7): [Dataset Config Store] [SPIKE] Investigate existing solutions - https://phabricator.wikimedia.org/T354558 (lbowmaker)
[20:28:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[20:32:41] Data-Engineering, Spike: [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (lbowmaker)
[20:33:04] Data-Engineering (Sprint 7), Spike: [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (lbowmaker)
[20:36:55] Data-Engineering (Sprint 7): [Data Quality][Webrequest] Log severity level of alerts generated by refinery - https://phabricator.wikimedia.org/T354568 (gmodena)
[21:02:46] Data-Engineering: Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (Milimetric) @VirginiaPoundstone this issue came up again (thanks very much to @xcollazo who remembered this task). I support option b) in Xabriel's plan above, and I think this sho...
[22:51:12] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Ensure Elastic stack works on bookworm - https://phabricator.wikimedia.org/T353392 (bking) Per today's pairing session with @RKemper , it looks like our current version of Elastic (7.10.2) does not support the version of Java provided by Debia...
[23:28:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:32:04] Analytics, Data-Engineering, EventStreams, Privacy Engineering, and 2 others: Create Mediawiki "oversightprotect" action - https://phabricator.wikimedia.org/T354577 (Htriedman)