[02:08:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[02:13:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[04:30:17] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:23:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[07:28:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[08:23:51] <_joe_> is there some job running right now?
[08:24:03] <_joe_> joal: ^^ some job is saturating the network
[08:24:18] hum
[08:24:20] will look
[08:25:14] <_joe_> started 24 minutes ago
[08:25:15] there is a heavy monthly job currently running indeed
[08:25:29] shall I kill it?
[08:25:41] <_joe_> can you do that?
[08:25:45] sure
[08:25:50] <_joe_> if it's ok to re-start it eventually
[08:25:59] let's first kill it
[08:26:15] <_joe_> yeah
[08:26:27] <_joe_> can you follow the discussion in #-operations ?
[08:27:13] Joining
[08:29:36] !kill cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 as it was probably saturating network
[08:29:45] !log kill cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 as it was probably saturating network
[08:29:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:30:22] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:44:49] !log Rerun cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 with SRE watching network
[08:44:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:06:17] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:09:45] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:11:35] !log kill cassandra-monthly-wf-local_group_default_T_mediarequest_top_files-2022-4 again
[09:11:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:22:23] (PS1) GoranSMilovanovic: minor [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789802
[13:22:42] (CR) GoranSMilovanovic: [V: +2 C: +2] minor [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789802 (owner: GoranSMilovanovic)
[13:34:49] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:46:03] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:31:11] (PS1) GoranSMilovanovic: licence_gpl_removal [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789816
[14:31:24] (CR) GoranSMilovanovic: [V: +2 C: +2] licence_gpl_removal [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789816 (owner: GoranSMilovanovic)
[15:04:01] Data-Engineering, Data-Catalog, SRE, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (akosiaris) >>! In T303049#7903184, @Ottomata wrote: > FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it is...
[15:12:00] (PS1) GoranSMilovanovic: cognate_link [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789819
[15:12:12] (CR) GoranSMilovanovic: [V: +2 C: +2] cognate_link [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/789819 (owner: GoranSMilovanovic)
[15:34:51] Data-Engineering, Cassandra: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (Eevans)
[15:35:13] Data-Engineering, Cassandra: Enable Cassandra encryption (inter-node & client) - https://phabricator.wikimedia.org/T307798 (Eevans) p:Triage→Medium
[15:36:49] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[15:46:29] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans)
[15:48:18] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans)
[15:48:32] Data-Engineering, Cassandra: Ensure AQS Cassandra client connections are multi-datacenter - https://phabricator.wikimedia.org/T307799 (Eevans) p:Triage→Medium
[15:50:22] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[16:00:39] Data-Engineering, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (Eevans)
[16:00:57] Data-Engineering, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (codfw) - https://phabricator.wikimedia.org/T307801 (Eevans) p:Triage→Medium
[16:03:12] Data-Engineering, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (Eevans) p:Triage→Medium
[16:04:55] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[16:05:13] Data-Engineering, Cassandra: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[16:05:19] Data-Engineering, Cassandra, Generated Data Platform: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 (Eevans)
[16:07:06] Data-Engineering, Cassandra, Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[17:31:06] Starting the reboot of stat clients now
[19:55:05] Quarry, Patch-For-Review: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (rook) I suspect (most of?) the underlying problem of when the stop button fails is due to the celery worker having already died, and thus the stop command has nothing to tell t...
[23:13:33] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Low RIsk Ozzie Migration: Wikidata CoEditor metric job - https://phabricator.wikimedia.org/T306177 (Snwachukwu)
[23:15:11] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: Low RIsk Ozzie Migration: Wikidata CoEditor metric job - https://phabricator.wikimedia.org/T306177 (Snwachukwu)