[00:00:16] (SystemdUnitFailed) firing: user-runtime-dir@43623.service Failed on stat1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:10] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:32] Analytics, Data-Engineering, EventStreams, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action - https://phabricator.wikimedia.org/T354577 (DannyS712) a:DannyS712 Going to try and implement this. I'm assuming that this is all meant to be oversight-level suppression,...
[05:40:16] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:01:52] Data-Engineering (Sprint 7), Patch-For-Review: [Iceberg Migration] Migrate browser_general tables to Iceberg - https://phabricator.wikimedia.org/T352670 (CodeReviewBot) ebysans opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/576 Update browser_general dag to gene...
[07:15:29] (PS33) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[07:16:03] (CR) CI reject: [V: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[07:21:43] (PS34) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[07:22:14] (CR) CI reject: [V: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[07:23:14] (CR) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users (2 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[07:42:36] (PS35) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[07:43:09] (CR) CI reject: [V: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[07:47:46] (PS36) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[07:48:18] (CR) CI reject: [V: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[08:21:40] Data-Platform-SRE (2023/24 Q3 Milestone 1), SRE, serviceops, Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[08:22:20] (PS37) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[08:42:20] (PS1) Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568)
[08:46:33] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Sfaci) Let us take a look. I think you're right and we are discarding always "redir...
[08:47:48] (CR) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[09:33:47] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action - https://phabricator.wikimedia.org/T354577 (Aklapper) > to implement this functionality into Mediawiki Please add the MediaWiki project tag (and review tags/subscribe...
[09:40:17] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:48:17] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (DAlangi_WMF)
[09:48:46] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (DAlangi_WMF) @Krinkle, is this good to be resolved now or should we wait till end of week?
[09:50:09] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Krinkle) With the last patch merged, feel free to close it.
[09:53:09] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 9 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (DAlangi_WMF) Open→Resolved
[10:07:00] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (gmodena) > Without the possibility to update our images on the docker registry. We will regenerate the whole environ...
[10:39:00] !log roll-restarting kafka-jumbo to pick up new JRE
[10:39:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:40:17] (KafkaReplicationFactorTooLow) firing: (20) Kafka topic codfw.android.customize_toolbar_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[10:43:52] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Sfaci) After pushing a new fix and testing it in the staging environment (where we...
[10:45:17] (PS3) Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568)
[10:45:18] (KafkaReplicationFactorTooLow) resolved: (20) Kafka topic codfw.android.customize_toolbar_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[10:48:17] (KafkaReplicationFactorTooLow) firing: (32) Kafka topic codfw.android.app_appearance_settings_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[10:53:19] (KafkaReplicationFactorTooLow) resolved: (52) Kafka topic codfw.android.app_appearance_settings_interaction replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:02:49] (KafkaReplicationFactorTooLow) firing: (55) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:03:06] (KafkaReplicationFactorTooLow) firing: (76) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:07:50] (KafkaReplicationFactorTooLow) resolved: (89) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:19:05] (KafkaReplicationFactorTooLow) firing: (209) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:19:59] (KafkaReplicationFactorTooLow) resolved: (209) Kafka topic ContentTranslationAbuseFilter replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:36:06] !log disable puppet on hadoop masters both test and production to test/implement new net_topology script
[11:36:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:37:51] (KafkaReplicationFactorTooLow) firing: (243) Kafka topic codfw.eventgate-analytics.error.validation replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:39:08] (KafkaReplicationFactorTooLow) resolved: (191) Kafka topic codfw.eventgate-analytics.error.validation replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTooLow
[11:48:52] !log roll restarting hadoop test masters to pick up new net_topology script and new JRE
[11:48:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:50:22] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis) The staging database has now been migrated. ` btullis@stat1004:~$ analytics-mysql staging Reading table information for completion of t...
[11:52:39] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Bring dbstore1009 into service to replace dbstore1005 - https://phabricator.wikimedia.org/T351924 (BTullis)
[14:42:14] Data-Platform-SRE (2023/24 Q3 Milestone 1), Data-Persistence: Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (BTullis) >>! In T351921#9445600, @Marostegui wrote: > This host is not yet visible on orchestrator due to the migration of puppet7 (T352974), ca...
[14:59:53] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and improve hadoop access - https://phabricator.wikimedia.org/T354555 (Gehel)
[15:01:18] I'm wondering which platform to target for new UI instrumentation. It looks like "metrics platform" / mw.eventLog.dispatch is the current thing?
[15:03:42] awight: Yes, I believe that's the thing to go for. https://phabricator.wikimedia.org/project/profile/5324
[15:06:57] Thanks! I'm trying to figure out which mailing list I should be watching for this sort of change--maybe there hasn't been an announcement yet?
[15:20:05] awight: Good question. The best answer I have is probably the #metrics-platform channel in Slack, which may have more useful information for you.
[15:24:56] Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master1001 with an-master1003 - https://phabricator.wikimedia.org/T332573 (BTullis) a:BTullis
[15:26:39] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T353712 (bking) a:bking This alert has cleared. Per IRC conversation in #wikimedia-observability , the task notifier does not post follow-up messages or close the ticket when the alert clears. See T351389 and T352079 for discussions about...
[15:26:55] Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master1002 with an-master1004 - https://phabricator.wikimedia.org/T332578 (BTullis) a:BTullis
[15:27:15] Data-Platform-SRE (2023/24 Q3 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T353712 (bking) Open→Resolved
[15:31:08] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): Migrate yarn.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349399 (BTullis)
[15:31:40] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1): Migrate hue.wikimedia.org to bullseye - https://phabricator.wikimedia.org/T349400 (BTullis)
[15:31:44] Data-Engineering, Data-Platform-SRE (2023/24 Q3 Milestone 1), Data Products: Migrate an-web1001 to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349398 (BTullis)
[15:35:11] Data-Platform-SRE (2023/24 Q3 Milestone 1), DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (BTullis) Just for reference, I think that we are still undecided on whether this roll-back is neces...
[15:36:00] Data-Platform-SRE (2023/24 Q3 Milestone 1), DBA, Infrastructure-Foundations, Puppet-Core, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (BTullis) Open→Stalled p:Triage→High
[15:37:39] Data-Engineering, Data-Platform-SRE, Event-Platform: Upgrade eventlogging VM to bullseye (or bookworm) - https://phabricator.wikimedia.org/T349289 (BTullis) >>! In T349289#9441870, @Ottomata wrote: > Decommissioning probably won't get done until after I'm back from leave in late April. Can we wait t...
[15:38:58] Data-Platform-SRE (2023/24 Q3 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T352807 (bking)
[15:39:10] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:40] Data-Platform-SRE (2023/24 Q3 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T352807 (bking) This alert has cleared. Per IRC conversation in #wikimedia-observability , the task notifier does not post follow-up messages or close the ticket when the alert clears. See T351389 and T352079 for discu...
[15:43:00] Data-Platform-SRE (2023/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene)
[15:43:18] Data-Platform-SRE (2023/24 Q3 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T352807 (bking) Open→Resolved
[15:47:13] Data-Platform-SRE (2023/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Stevemunene) Open→Resolved
[15:48:50] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Create helmfile deployment files for superset and superset-next - https://phabricator.wikimedia.org/T353790 (BTullis) p:Triage→High
[15:51:11] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper)
[15:51:53] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper)
[15:59:06] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (RKemper)
[16:01:05] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 (RKemper)
[16:03:18] (PS1) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[16:03:47] (CR) CI reject: [V: -1] Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[16:10:38] Analytics, Data-Engineering, Data-Engineering-Wikistats, Data Products (Data Products Sprint 05): Wikistats - incorrect number of content articles for Latvian Wikipedia - https://phabricator.wikimedia.org/T354074 (Sfaci) a:Sfaci
[16:16:25] (PS2) Cyndywikime: Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273)
[16:17:07] (CR) CI reject: [V: -1] Add UserIsTemp property [schemas/event/primary] - https://gerrit.wikimedia.org/r/989205 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[16:21:17] Data-Platform-SRE (2023/24 Q3 Milestone 1): Ensure Data Platform SREs have a contact group in puppet/alerting - https://phabricator.wikimedia.org/T342578 (Gehel)
[16:21:22] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Generate TLS certs for new WDQS endpoints - https://phabricator.wikimedia.org/T354661 (BTullis) I wonder whether should look at using the PKI for these certificates, rather than `cergen`...
[16:23:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Create 3 microsites for wdqs full graph, main graph, & scholarly articles - https://phabricator.wikimedia.org/T354658 (Gehel)
[16:27:03] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07), and 2 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (phuedx)
[16:27:54] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07), and 3 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (phuedx) p:Triage→Medium
[16:38:38] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07), and 3 others: EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (phuedx)
[16:46:06] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and improve hadoop access - https://phabricator.wikimedia.org/T354555 (bking)
[16:51:44] Data-Engineering, Release-Engineering-Team, GitLab (CI & Job Runners): Unblock Dockerfile syntax to build images with Gitlab trusted runner - https://phabricator.wikimedia.org/T351792 (Antoine_Quhen) 1/ Splitting the CI Yes, it's possible to split the image building job from the rest of the CI pipel...
[16:55:18] Data-Platform-SRE: cleanup apifeatureusage indices on the CIrrus elasticsearch cluster - https://phabricator.wikimedia.org/T354670 (Gehel)
[16:55:22] Data-Platform-SRE: cleanup apifeatureusage indices on the CIrrus elasticsearch cluster - https://phabricator.wikimedia.org/T354670 (Gehel) p:Triage→Medium
[16:56:52] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and improve hadoop access - https://phabricator.wikimedia.org/T354555 (bking)
[17:04:42] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (phuedx)
[17:04:50] Data-Engineering, Data-Platform, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Epic: Deprecate and remove MetricsClient#dispatch() - https://phabricator.wikimedia.org/T352969 (phuedx)
[17:05:46] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and improve hadoop access - https://phabricator.wikimedia.org/T354555 (bking)
[17:09:55] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and investigate external connectivity - https://phabricator.wikimedia.org/T354555 (bking)
[17:10:28] Data-Engineering, EventStreams, MediaWiki-General, Privacy Engineering, and 3 others: Create Mediawiki "oversightprotect" action - https://phabricator.wikimedia.org/T354577 (Htriedman) > I'm assuming that this is all meant to be oversight-level suppression, rather than admin-level (unless we wan...
[17:14:19] Data-Platform-SRE (2023/24 Q3 Milestone 1), Wikidata: WDQS graph split hosts: Remove throttling/banning mechanisms and investigate external connectivity - https://phabricator.wikimedia.org/T354555 (bking) Open→In progress Per pairing session with @BTullis , the wdqs test hosts are not running env...
[17:20:39] Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master1001 and an-master1002 with an-master1003 and an-master1004 - https://phabricator.wikimedia.org/T332573 (BTullis)
[17:21:36] Data-Platform-SRE (2023/24 Q3 Milestone 1): Refresh an-master1002 with an-master1004 - https://phabricator.wikimedia.org/T332578 (BTullis)
[17:30:24] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[17:41:59] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:43:44] Data-Engineering (Sprint 7): [Maintenance] Migrate ReportUpdater browser queries to Airflow - https://phabricator.wikimedia.org/T354552 (Ahoelzl)
[17:44:19] Data-Engineering (Sprint 7), Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (Ahoelzl)
[17:49:53] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis)
[17:58:12] Data-Engineering (Sprint 7): [Data Quality] [Needs Grooming] Collect requirements to define prioritized data pipeline and data metrics - https://phabricator.wikimedia.org/T350409 (Ahoelzl) Open→Resolved Q3 priorities are defined.
[17:59:36] Data-Platform-SRE (2023/24 Q3 Milestone 1), Patch-For-Review: Refresh an-master100[1-2] with an-master100[3-4] - https://phabricator.wikimedia.org/T332573 (BTullis) I have created the kerberos principals and keytabs for the new hosts with the following file: ` btullis@krb1001:~$ cat T332573_new_hadoop_ma...
[18:06:54] Data-Engineering, Data Products (Data Products Sprint 07): Data Quality Issue: Wikitext History Job fail / rerun in Airflow - https://phabricator.wikimedia.org/T342911 (WDoranWMF)
[18:07:11] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Data Products (Data Products Sprint 07): Add and export MetricsClient#isStreamInSample() - https://phabricator.wikimedia.org/T352966 (WDoranWMF)
[18:15:36] Data-Engineering (Sprint 6), Privacy Engineering, Event-Platform, MW-1.42-notes (1.42.0-wmf.7; 2023-11-28), and 3 others: [Event Platform] Actor performing suppression revealed publicly - https://phabricator.wikimedia.org/T342487 (sbassett)
[18:23:29] Data-Engineering-Radar, Data Products, MediaWiki-extensions-EventLogging, Metrics Platform Backlog, Technical-Debt: Non-deterministic unit test "streamInSample() - session sampling resets" - https://phabricator.wikimedia.org/T304379 (WDoranWMF)
[18:27:37] (CR) Joal: "One nit about partitioning, otherwise looks good!" [analytics/refinery] - https://gerrit.wikimedia.org/r/988711 (https://phabricator.wikimedia.org/T352670) (owner: Snwachukwu)
[18:37:43] Data-Engineering (Sprint 7): [Data Quality] Implement basic data quality metrics for MW history - https://phabricator.wikimedia.org/T354692 (Ahoelzl)
[18:38:26] Data-Platform-SRE (2023/24 Q3 Milestone 1), SRE-OnFire, Wikidata, Wikidata-Query-Service, and 3 others: Update WDQS Runbook following update lag incident - https://phabricator.wikimedia.org/T336577 (bking) Open→Resolved a:bking
[18:38:48] Data-Engineering, Product-Analytics, Wmfdata-Python: Retrieve host & port info when connecting to MariaDB replicas on the cluster - https://phabricator.wikimedia.org/T340472 (mpopov) Xabriel published https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/6743db0e987a...
[18:42:09] Data-Engineering (Sprint 7): [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition - https://phabricator.wikimedia.org/T354694 (Ahoelzl)
[18:43:41] Data-Engineering, Product-Analytics, Wmfdata-Python: Let user specify cnf to use when connecting to MariaDB - https://phabricator.wikimedia.org/T340469 (mpopov) In https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/6743db0e987a4567352eec4277e5a7f4092de423/notebook...
[18:43:57] Data-Engineering (Sprint 7): [Iceberg Migration] Define sensor concept - https://phabricator.wikimedia.org/T354695 (Ahoelzl)
[18:46:37] Data-Engineering (Sprint 7): [Dataset Config Store] [SPIKE] Investigate existing backend solutions - https://phabricator.wikimedia.org/T354558 (Ahoelzl)
[18:49:24] Data-Engineering (Sprint 7): [Maintenance] Define a Refine system refactoring concept - https://phabricator.wikimedia.org/T354696 (Ahoelzl)
[19:25:56] Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (RKemper) Finished adding the SLO dashboards to https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?orgId=1&from=now-90d&to=now. Rema...
[19:40:17] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:45:17] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:47:18] Data-Engineering, Data-Engineering-Wikistats, Patch-For-Review: Incorrect number of content pages on stats.wikimedia.org - https://phabricator.wikimedia.org/T354489 (Marki354) Yes, it appears to be fixed now. Thanks.
[20:08:04] (CR) Gmodena: refinery-job: add WebrequestMetrics. (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: Gmodena)
[20:25:04] (PS27) Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[20:28:31] (CR) CI reject: [V: -1] refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: Gmodena)
[20:35:03] (CR) Gmodena: refinery-job: add WebrequestMetrics. (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763) (owner: Gmodena)
[20:46:53] (PS28) Gmodena: refinery-job: add WebrequestMetrics. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/979359 (https://phabricator.wikimedia.org/T349763)
[20:48:32] !log about to deploy analytics/refinery - weekly train
[20:48:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:07:51] !log Deployed refinery using scap, then deployed onto hdfs
[21:07:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:08:53] (PS5) Gmodena: refinery: log data quality alert severity [analytics/refinery/source] - https://gerrit.wikimedia.org/r/988985 (https://phabricator.wikimedia.org/T354568)
[21:17:39] Data-Engineering, Data-Platform-SRE: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 (Antoine_Quhen)
[21:18:30] !log analytics/refinery not deployed fully on test cluster. Ticket for the bug here: https://phabricator.wikimedia.org/T354703
[21:18:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:28:14] !log airflow-dags/analytics(_test) are both deployed
[21:28:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:42:14] (PuppetFailure) firing: Puppet has failed on snapshot1014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:31:40] Data-Engineering (Sprint 7), Spike: [Data Quality] [SPIKE] Can we migrate the anomaly detection job to DeeQu checks - https://phabricator.wikimedia.org/T354566 (Ahoelzl) a:gmodena
[23:38:36] Data-Engineering (Sprint 7): [Refine System] Define a concept and an approach for refactoring the Refine system - https://phabricator.wikimedia.org/T354696 (Ahoelzl)
[23:45:18] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed