[01:16:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:49] RECOVERY - Check unit status of monitor_refine_event on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:06:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:57] PROBLEM - SSH on an-worker1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:24:07] PROBLEM - Host an-worker1109 is DOWN: PING CRITICAL - Packet loss = 100%
[06:08:55] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:16] (CR) Nmaphophe: [V: +1 C: +1] Update geoeditor HQL scripts for spark3 [analytics/refinery] - https://gerrit.wikimedia.org/r/806200 (owner: Joal)
[08:11:46] (CR) Nmaphophe: [V: +1 C: +1] "Looks good to me" [analytics/refinery] - https://gerrit.wikimedia.org/r/806200 (owner: Joal)
[08:36:53] !log power cycled an-worker1109 as it was stuck with CPU soft lockups
[08:36:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:38:43] RECOVERY - Host an-worker1109 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[08:39:07] RECOVERY - SSH on an-worker1109 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:12:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:14:20] Data-Engineering: Drop ArticleCreationWorkflow data - https://phabricator.wikimedia.org/T310863 (phuedx)
[09:14:32] Data-Engineering: Drop ArticleCreationWorkflow data - https://phabricator.wikimedia.org/T310863 (phuedx)
[09:15:44] Data-Engineering: Drop ArticleCreationWorkflow data - https://phabricator.wikimedia.org/T310863 (phuedx)
[09:15:49] Analytics-Kanban, Data-Engineering, Event-Platform, Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (phuedx)
[09:45:09] hi btullis - Are you working on datahub?
[09:45:47] Yes, very much so.
[09:47:31] The latest advice I have from DataHub themselves is to recreate the opensearch index for `datahub_usage_event`
[09:48:17] ack - this explains why it shows error when I try to use it :)
[09:48:20] However I have discovered that our opensearch 1.2.4 packages don't include a plugin that we may need.
[09:48:23] thanks btullis
[09:49:23] Yes, sorry. I put a note into the #data-catalog channel on Slack about the downtime. I'll keep trying to fix it for a bit on v 0.8.38 - If it doesn't work I can try rolling it back to 0.8.34
[09:49:51] no problem btullis - I was just unsure :) It's not blocking me in any ways
[09:51:56] OK, cool. Thanks joal. I'll keep updating progress here.
[11:43:04] (PS1) Gerrit maintenance bot: Add blk.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/806401 (https://phabricator.wikimedia.org/T310873)
[11:46:26] (PS1) Gerrit maintenance bot: Add pcm.wikipedia to pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/806404 (https://phabricator.wikimedia.org/T310880)
[12:19:38] Data-Engineering-Icebox, DBA, Data-Services, Patch-For-Review, cloud-services-team (Kanban): Prepare and check storage layer for kcgwiki - https://phabricator.wikimedia.org/T305280 (BTullis) I have executed `sudo maintain-views --databases kcgwiki` on clouddb1016 and clouddb1020 which are the...
[12:33:25] Quarry: Prettify Quarry's "User not found" page - https://phabricator.wikimedia.org/T134661 (rook) Open→Resolved
[12:35:07] !log deployed daily airflow dag for 3 Wikidata metrics.
[12:35:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:36:11] Quarry: 404 page for no user shows login - https://phabricator.wikimedia.org/T310888 (rook)
[12:36:25] Hi SandraEbele - just a quick reminder that it's best practice to try not to deploy on Fridays :) It's done, no big deal - I still wanted to point it out
[12:36:38] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/806401 (https://phabricator.wikimedia.org/T310873) (owner: Gerrit maintenance bot)
[12:37:18] Okay. Thanks joal
[12:37:41] (CR) Joal: [V: +2 C: +2] "Merging for next deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/806404 (https://phabricator.wikimedia.org/T310880) (owner: Gerrit maintenance bot)
[12:38:20] I will also pause killing the Oozie jobs and starting the airflow job till Monday.
[13:12:50] Data-Engineering, Equity-Landscape: Editorship Metrics Transformation - https://phabricator.wikimedia.org/T306618 (KCVelaga_WMF) ` SELECT * FROM kcv.geoeditorship_output_rank_metrics `
[13:13:06] Data-Engineering, Equity-Landscape: Editorship Metrics Transformation - https://phabricator.wikimedia.org/T306618 (KCVelaga_WMF) Open→Resolved
[13:28:41] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (cscott) This has been requested by the kiwix team multiple times over the years. Hopefully this would be parsoid-format HTML dumps.
[13:35:06] Analytics-Radar, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (cscott) HTML dumps are already available in https://dumps.wikimedia.org/other/enterprise_html/ ; see also {T302237}.
[13:35:26] Data-Engineering, MediaViewer, MediaWiki-extensions-EventLogging: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (phuedx)
[13:39:10] Data-Engineering, MediaViewer, MediaWiki-extensions-EventLogging: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (phuedx)
[13:42:38] Data-Engineering, MediaViewer, MediaWiki-extensions-EventLogging: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (phuedx) > Mark the schema as inactive | Schema | Diff | | --- | --- | | MediaViewer | https://meta.wikimedia.org/w/index.php?t...
[13:54:37] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[14:26:23] (PS1) Btullis: Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079)
[14:33:42] (CR) CI reject: [V: -1] Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[14:44:25] (PS2) Btullis: Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079)
[14:44:29] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[14:56:10] (CR) CI reject: [V: -1] Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[15:06:00] (PS3) Btullis: Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079)
[15:23:07] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[15:31:43] (CR) Btullis: [C: +2] Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[15:51:02] Data-Engineering-Radar, Cassandra, Generated Data Platform, Patch-For-Review: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (Eevans)
[15:52:18] (Merged) jenkins-bot: Update the blubber configuration for the datahub-frontend [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806427 (https://phabricator.wikimedia.org/T310079) (owner: Btullis)
[15:59:17] Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations: Re-enable CAS-SSO for hue.wikimedia.org - https://phabricator.wikimedia.org/T310686 (BTullis) Open→Resolved
[16:03:07] Data-Engineering, Data-Engineering-Kanban: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (BTullis) Open→Resolved It looks like we just wait for the ONFIRE team to review it and either ask for...
[16:03:26] Data-Engineering, Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (BTullis) Open→Resolved
[16:06:00] (CR) Btullis: Add kcgwiki to the sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/806244 (https://phabricator.wikimedia.org/T305280) (owner: Milimetric)
[16:07:14] (CR) Milimetric: [V: +2 C: +2] Add kcgwiki to the sqoop list [analytics/refinery] - https://gerrit.wikimedia.org/r/806244 (https://phabricator.wikimedia.org/T305280) (owner: Milimetric)
[16:38:52] milimetric: DataHub 0.8.38 is live. Should we make that change to the ingestion jobs now?
[16:55:16] (PS1) Btullis: Update the main logo that is used in DataHub [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/806445 (https://phabricator.wikimedia.org/T310629)
[16:58:09] Heya! I created a CR (https://gerrit.wikimedia.org/r/c/operations/alerts/+/805237) to port over varnishkafka delivery alarms before realizing that it's the traffic team that needs to be alerted. Is there any concern with me moving that alert over to team-data-engineering and assigning someone for review?
[17:02:24] Hi brett: that's fine. Please feel free to assign to me for review. We already have one varnishkafka alert (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-engineering/varnishkafka.yaml) so you can add them to the same file.
[17:04:49] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (colewhite)
[17:06:58] btullis: Thanks! I went ahead and did that
[17:07:46] Hi all, thank you for the response to the aqs logging rate the other day. At this time, we're seeing ~100/msgs/sec from `levelPath: warn/table/cassandra/driver` and wonder if we're hiding real issues. The task is: https://phabricator.wikimedia.org/T310760
[17:11:09] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (BTullis) Tagging @Eevans since he is actively working on the {T307641} and may know more about the source of these messages.
[17:21:00] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Upgrade DataHub V0.8.38 - https://phabricator.wikimedia.org/T310079 (BTullis)
[17:21:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:23:33] cwhite: Thanks again. I've tagged urandom again who probably knows more about it than me. Am I right in thinking that because you're dropping the messages there isn't going to be any issue leaving it like this for three days?
[18:09:26] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (Eevans) I assume that this is limited to earlier this week, as opposed to something more chronic, so let me know if that is incorrect. What happened...
[18:15:34] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (Eevans) >>! In T310760#8012251, @colewhite wrote: > The issue is mitigated from the logstash side, but there is still about 100 logs/sec being dropped...
[20:54:45] (PS1) Vivian Rook: Get non-coincidental history entries. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658)
[20:55:34] (CR) Vivian Rook: [C: -1] "Do not merge this. It is a draft. In its current form it will remove the sql entry from a history entry." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/806474 (https://phabricator.wikimedia.org/T306658) (owner: Vivian Rook)
[20:58:08] Quarry, Patch-For-Review: Quarry history feature not showing history - https://phabricator.wikimedia.org/T306658 (rook) It would appear that we want to reference latest_run_id (which may be oddly named, as it seems to refer to a single query, which will always be the latest run of itself). This is noted...
[23:13:54] Analytics, Observability-Logging: AQS Cassandra driver logs an incredible amount of small logs at a high rate - https://phabricator.wikimedia.org/T310760 (colewhite) >>! In T310760#8012470, @Eevans wrote: > I don't know what these are; The bigger flood from earlier this week I can explain (see my previou...