[04:27:42] (SystemdUnitFailed) firing: monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:29:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:29] Data-Engineering (Sprint 6), Patch-For-Review: [Iceberg Migration] Migrate aqs hourly tables to Iceberg - https://phabricator.wikimedia.org/T352669 (tchin) Tested to see if the `COALESCE` hints still work in Iceberg by creating 2 tables and filling them with/without the hint. It still seems to work. `la...
[07:33:05] (CR) TChin: Add iceberg version of aqs_hourly table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[08:12:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:23] * brouberol waves good morning!
[08:40:28] Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (dcausse)
[08:44:18] Data-Platform-SRE: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24 - https://phabricator.wikimedia.org/T352878 (dcausse)
[09:26:11] Morning all.
[09:39:52] Data-Engineering, CommonsMetadata, DiscussionTools, Growth-Team, and 10 others: Phase out Title::getPageViewLanguage in favour of ParserOutput metadata - https://phabricator.wikimedia.org/T350806 (Tacsipacsi)
[09:41:23] (CR) Gmodena: Add iceberg version of aqs_hourly table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[09:41:25] Quarry: Timer that counts up as the query is running - https://phabricator.wikimedia.org/T353690 (Novem_Linguae)
[09:54:27] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[09:54:30] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Configure the YARN resource manager with the spark history service URL - https://phabricator.wikimedia.org/T352863 (brouberol) Open→Resolved We had to slightly tweak the spark UI config as well as the a...
[09:55:04] Data-Engineering (Sprint 6), Data-Platform-SRE: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[10:24:14] !log deploying version 0.0.27 of conda-analytics
[10:24:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:48] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) I am pushing out version 0.0.27 of conda-analytics to production now, with: ` btullis@cumin1001:~$ sudo debdeploy deploy -u...
[10:56:09] Data-Engineering (Sprint 6), Data-Platform-SRE: Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol)
[10:57:11] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[11:06:15] Data-Platform-SRE (2023/24 Q2 Milestone 1): Create a helm chart for Superset - https://phabricator.wikimedia.org/T352166 (BTullis) a:BTullis
[11:10:47] !log restarted the jupyterhub-conda service on stat servers.
[11:10:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:16:30] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) All hosts where conda-analytics is deployed now have version 0.0.27 installed. {F41613111,width=50%}
[12:02:58] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (BTullis) I have sent out a mail requesting that users upgrade wmfdata to version 2.2.0. The proposed timescale for us to switch the...
[12:12:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:17] (KafkaReplicationFactorTooLow) firing: ...
[12:14:17] Kafka topic codfw.mediawiki.page_prediction_change.rc0 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.page_prediction_change.rc0&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTo
[12:19:17] (KafkaReplicationFactorTooLow) resolved: ...
[12:19:17] Kafka topic codfw.mediawiki.page_prediction_change.rc0 replication factor is too low on jumbo-eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Increase_a_topic's_replication_factor - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=codfw.mediawiki.page_prediction_change.rc0&viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DKafkaReplicationFactorTo
[13:14:25] (CR) TChin: Add iceberg version of aqs_hourly table (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[13:31:09] btullis: I'm running experiments on adding metrics collection to the spark history pod, and it seems that the metrics described at https://spark.apache.org/docs/latest/monitoring.html#metrics are available on the master, the executor, etc, but not on the history server level
[13:41:10] scratch that, I was being obtuse
[14:03:03] Data-Platform-SRE, Infrastructure-Foundations, Machine-Learning-Team: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey)
[14:23:59] Data-Platform-SRE (2023/24 Q2 Milestone 1), Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (dcausse) Numbers look correct: | host | graph | # entities | # triples | | wdqs1022|full| 111,514,880| 15,320,277,615| | wdqs1023|scholar...
[14:28:56] Data-Platform-SRE, Infrastructure-Foundations, Machine-Learning-Team: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey)
[14:33:44] (CR) Xcollazo: Add iceberg version of aqs_hourly table (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/982869 (https://phabricator.wikimedia.org/T352669) (owner: TChin)
[14:35:27] Data-Platform-SRE (2023/24 Q2 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer) @elukey, we have an updated estimate of the expected topic size increment per wiki we publ...
[14:37:38] Data-Engineering (Sprint 6), Data-Platform-SRE: Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol) Experimentation has shown that the history server does _not_ expose metrics at all. The metrics referred to [[ https://spark.apache.org/docs/latest/monito...
[14:40:29] Data-Platform-SRE (2023/24 Q2 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (pfischer)
[14:42:21] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Lucas_Werkmeister_WMDE) Happened again in [this Wikibase build](https://integration.wi...
[14:46:26] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Lucas_Werkmeister_WMDE) > The test should be wary of this and potentially fix the curr...
[14:49:25] Data-Platform-SRE, Infrastructure-Foundations, Machine-Learning-Team: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey) New ML staging range: https://netbox.wikimedia.org/ipam/prefixes/887/ New ML Serve codfw range: https://netbox.wikimedia.o...
[14:51:23] Hello SREs ! Could someone help me deploy a slight conf change for the test cluster, about Airflow metrics: https://gerrit.wikimedia.org/r/c/operations/puppet/+/984200
[14:51:23] (We may need a statsd-exporter restart + an Airflow restart). Thank you!
[14:53:37] Data-Engineering (Sprint 6), Data-Platform-SRE: Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol) Metrics collection seems to really be happening at the spark master UI level: ` brouberol@an-worker1080:~$ sudo lsof -i tcp:4040 COMMAND PID USER...
[15:10:28] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Collect metrics from the spark-history server - https://phabricator.wikimedia.org/T353694 (brouberol)
[15:11:51] Data-Engineering (Sprint 6), Data-Platform-SRE (2023/24 Q2 Milestone 1): Monitor the availability of the spark history server deployments - https://phabricator.wikimedia.org/T353717 (brouberol)
[15:14:54] Data-Engineering (Sprint 6), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (brouberol)
[15:19:20] Data-Platform-SRE (23/24 Q3 Milestone 1): Bring dbstore1008 into service to replace dbstore1003 - https://phabricator.wikimedia.org/T351921 (Marostegui) ` root@dbstore1003:~# df -hT /srv Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 4.4T 4.0T 385G 92% /srv ` I h...
[15:50:37] Data-Platform-SRE, Infrastructure-Foundations, Patch-For-Review: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey)
[16:02:33] btullis , stevemunene , elukey do you have some time for this quick puppet fix on the test cluster ?
[16:05:47] o/ aqu on a call is it ok to deploy in 45 mins or so?
[16:08:07] Data-Platform-SRE: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (Stevemunene) a:Stevemunene
[16:12:05] (PS1) Xcollazo: For wikifunctions_ui sanitization, keep performer.name instead of performer.id [analytics/refinery] - https://gerrit.wikimedia.org/r/984247 (https://phabricator.wikimedia.org/T349121)
[16:12:43] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:27] stevemunene Thank you ! I can wait.
[16:27:05] Data-Engineering, Event-Platform, Patch-For-Review: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 (Ottomata)
[16:33:29] Data-Platform-SRE (2023/24 Q2 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T353065 (bking) Open→In progress p:Triage→Low a:bking
[16:34:23] Data-Engineering, Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (Gehel)
[16:35:23] Data-Engineering (Sprint 6), Data-Platform-SRE (23/24 Q3 Milestone 1), Patch-For-Review: Configure the spark event dir in the spark3 defaults - https://phabricator.wikimedia.org/T352849 (Gehel)
[16:36:23] Data-Platform-SRE (23/24 Q3 Milestone 1), Patch-For-Review: Wmfdata should connect to Presto using the analytics-presto CNAME - https://phabricator.wikimedia.org/T345482 (Gehel)
[16:38:25] Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (Gehel)
[16:39:51] Data-Platform-SRE (23/24 Q3 Milestone 1): [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (Gehel)
[16:41:04] Data-Platform-SRE: Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (Gehel)
[16:42:54] Data-Platform-SRE: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 (Gehel)
[16:43:44] Data-Platform-SRE (2023/24 Q2 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper) Current status: - Met again with Traffic team last week and got approval for our proposal to...
[16:44:13] Data-Platform-SRE (23/24 Q3 Milestone 1), Patch-For-Review: Bring an-coord100[3-4] into service - https://phabricator.wikimedia.org/T336045 (Gehel)
[16:44:51] Data-Platform-SRE (23/24 Q3 Milestone 1): Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Gehel)
[16:44:53] Data-Platform-SRE (2023/24 Q2 Milestone 1): Update elasticsearch instance extra plugin - https://phabricator.wikimedia.org/T353270 (bking) Open→Resolved AC is complete...closing.
[16:45:36] Data-Platform-SRE (23/24 Q3 Milestone 1), serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel)
[16:46:30] Data-Platform-SRE (23/24 Q3 Milestone 1), Discovery-Search (Current work), Patch-For-Review: Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009 (Gehel)
[16:47:09] Data-Platform-SRE (23/24 Q3 Milestone 1), observability: Change data platform-related IRC channels to improve communication - https://phabricator.wikimedia.org/T352783 (Gehel)
[16:51:49] Data-Platform-SRE, Patch-For-Review: Expose Prometheus Blackbox Exporter's ability to add http headers in puppet module - https://phabricator.wikimedia.org/T353672 (bking) Open→In progress p:Triage→Medium a:bking
[16:52:43] Data-Platform-SRE: ProbeDown - https://phabricator.wikimedia.org/T353652 (bking) Open→Resolved The changes above have been merged/applied and I don't see any more errors in logstash. As such, I'm closing this ticket out.
[16:53:37] Data-Platform-SRE (23/24 Q3 Milestone 1): Restart Search Platform-owned services for Java 8 / Java 11 security updates - https://phabricator.wikimedia.org/T350703 (Gehel)
[16:54:19] Data-Platform-SRE (2023/24 Q2 Milestone 1): ProbeDown - https://phabricator.wikimedia.org/T353065 (bking) In progress→Resolved Fixed by changes mentioned in T353652 , so I'm closing out.
[16:54:57] Data-Platform-SRE (23/24 Q3 Milestone 1), Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel)
[16:55:56] Data-Platform-SRE: Publish Elastic-related packages for Bookworm - https://phabricator.wikimedia.org/T353481 (Gehel) p:Triage→Medium
[16:57:57] btullis: I'm doing some additional triage. It seems to me that T353705 goes to "watching" at the moment. Do you have another opinion?
[16:57:58] T353705: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705
[16:58:24] Data-Platform-SRE: Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (Gehel) p:Medium→High
[17:00:04] Data-Engineering, Data-Platform-SRE: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (Gehel) a:brouberol→None
[17:00:19] Data-Platform-SRE, Discovery-Search (Current work): Create SLI / SLO on Search update lag - https://phabricator.wikimedia.org/T328330 (Gehel) a:RKemper→None
[17:00:21] Data-Platform-SRE: Create dashboards/alerts for new Cirrus Streaming Updater - https://phabricator.wikimedia.org/T349772 (Gehel) a:bking→None
[17:01:36] aqu: ready to go
[17:01:43] stevemunene: I think you're working on T347343. Could you move it to "in progress" if that's the case?
[17:01:43] T347343: Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343
[17:01:58] Data-Platform-SRE: Archive /home/ezachte data on stat1007 - https://phabricator.wikimedia.org/T238243 (Gehel) a:BTullis→None
[17:02:11] Data-Platform-SRE: Check home/HDFS leftovers of andyrussg - https://phabricator.wikimedia.org/T338234 (Gehel) a:BTullis→None
[17:02:25] Data-Engineering, Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (Gehel) a:BTullis→None
[17:03:11] Data-Platform-SRE (2023/24 Q2 Milestone 1): Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (Stevemunene)
[17:03:19] Data-Platform-SRE (2023/24 Q2 Milestone 1): Root cause Archiva outage from 2023-09-24 - https://phabricator.wikimedia.org/T347343 (Stevemunene) p:High→Medium
[17:04:16] Data-Platform-SRE (2023/24 Q2 Milestone 1): Expose Prometheus Blackbox Exporter's ability to add http headers in puppet module - https://phabricator.wikimedia.org/T353672 (Gehel)
[17:09:18] aqu: merged and the service was reloaded with the new config on `an-test-client1002`
[17:20:56] (CR) Mforns: Remove grouping by unpredictable country name (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/982899 (https://phabricator.wikimedia.org/T353296) (owner: Milimetric)
[17:24:12] (PS3) Mforns: Remove grouping by unpredictable country name [analytics/refinery] - https://gerrit.wikimedia.org/r/982899 (https://phabricator.wikimedia.org/T353296) (owner: Milimetric)
[17:25:24] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/982899 (https://phabricator.wikimedia.org/T353296) (owner: Milimetric)
[17:36:25] Thanks stevemunene ! (was in meeting :) )
[17:37:22] Data-Engineering, Event-Platform (Sprint 12): Upgrade Flink Image to 1.17 - https://phabricator.wikimedia.org/T335408 (Ottomata) Open→Resolved
[17:38:04] Data-Engineering, Event-Platform (Sprint 12): Upgrade Flink Image to 1.17 - https://phabricator.wikimedia.org/T335408 (Ottomata)
[17:38:14] Data-Engineering: Upgrade eventutiltilies-flink Java lib to Flink 1.17 - https://phabricator.wikimedia.org/T335982 (Ottomata) Open→Resolved a:Ottomata
[17:45:19] Stevemunene did you restart prometheus-statsd-exporter.service on this machine ?
[17:46:46] config was reloaded automatically, but lemme do a manual restart
[17:47:02] done
[17:52:32] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Dreamy_Jazz) >>! In T353243#9415686, @Lucas_Werkmeister_WMDE wrote: >> The test should...
[17:52:39] Data-Platform-SRE: Expose Prometheus Blackbox Exporter's ability to add http headers in puppet module - https://phabricator.wikimedia.org/T353672 (bking) In progress→Resolved Based on the access log on `wdqs1015` and the rendered Prometheus config in `/etc/prometheus/blackbox.yml.d/query_wikidata_org...
[17:53:12] Data-Platform-SRE (2023/24 Q2 Milestone 1): Expose Prometheus Blackbox Exporter's ability to add http headers in puppet module - https://phabricator.wikimedia.org/T353672 (bking)
[17:59:12] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Dreamy_Jazz) It seems that a call to `gmdate( 'Y-m-d\TH:i:s.v\Z' )` is made in the `Me...
[18:01:28] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:02:30] Data-Engineering, MediaWiki-extensions-EventLogging, ci-test-error (WMF-deployed Build Failure): EventLoggingTest::testDispatch fails when time ticks within the test run - https://phabricator.wikimedia.org/T353243 (Dreamy_Jazz) That repository has the `wikimedia/timestamp` package as a `require-dev`....
[18:22:51] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery] - https://gerrit.wikimedia.org/r/984247 (https://phabricator.wikimedia.org/T349121) (owner: Xcollazo)
[18:28:26] Data-Engineering, Machine-Learning-Team, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (Ottomata)
[18:29:33] !log starting refinery deploy (weekly train)
[18:29:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:30:02] !log Deploy latest DAG changes to Analytics Airflow instance
[18:30:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:46:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[19:06:08] !log finished refinery deploy (weekly train)
[19:06:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:12:20] Data-Platform-SRE (2023/24 Q2 Milestone 1), Patch-For-Review: Simplify query.wikidata.org LDF endpoint config - https://phabricator.wikimedia.org/T352111 (bking) In progress→Resolved Although DNS seems simpler on its face, it actually increases complexity, as the rest of the LDF config lives in P...
[19:38:00] Data-Platform-SRE: conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (mpopov)
[19:42:17] Data-Platform-SRE: conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (mpopov)
[19:44:54] Data-Platform-SRE: conda-analytics package install ConnectTimeoutError on stat1009 - https://phabricator.wikimedia.org/T353745 (mpopov)
[19:49:37] Data-Engineering, Product-Analytics, Wmfdata-Python, Data Products (Data Products Sprint 05): Release wmfdata with ca_bundle fix - https://phabricator.wikimedia.org/T352808 (nshahquinn-wmf) In progress→Resolved
[20:12:43] (SystemdUnitFailed) firing: (2) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:44:27] !log deployed airflow analytics to modify unique devices dags
[21:44:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:44:43] !log deployed airflow wmde to unbreak their instance's config
[21:44:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:50:37] !log cleared clickstream monthly sensors in Airflow since they failed waiting for data
[21:50:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:53:30] Data-Platform-SRE, Discovery-Search (Current work), Epic: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s - https://phabricator.wikimedia.org/T340548 (bking)
[21:53:43] Data-Platform-SRE (2023/24 Q2 Milestone 1), serviceops-radar, Discovery-Search (Current work), Epic: Determine and control cirrus streaming updater's usage of MWAPI resources - https://phabricator.wikimedia.org/T349848 (bking) Open→Invalid It's hard to resource usage estimate until we've...
[22:23:06] !log reran Airflow dag unique_devices_per_project_family_monthly to fix MaxMind duplicate country name issue
[22:23:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:52:20] Data-Platform-SRE: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671 (RKemper) ### Current status New hosts added in puppet. Their weights have been set in pybal (more specifically, etcd via conftool), and they're currently marked inactive while we do data xfers. About to k...