[00:31:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:27] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:52] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (ntsako) The grants changes have been re-run @KCVelaga_WMF
[08:45:15] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/868753 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:46:40] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/868754 (https://phabricator.wikimedia.org/T302500) (owner: Aqu)
[09:13:56] (CR) Aqu: [V: +2 C: +2] "Thanks @joal" [analytics/refinery] - https://gerrit.wikimedia.org/r/868753 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[09:17:33] !log About to deploy analytics/refinery (bug fix in HDFS usage pipeline)
[09:17:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:26:26] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (KCVelaga_WMF) a:ntsako→KCVelaga_WMF
[09:36:19] !log Deployed analytics/refinery using scap, then deployed onto HDFS.
[09:36:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:54:59] Data-Engineering: Check home/HDFS leftovers of toddleroux / ryanmax / afandian2 - https://phabricator.wikimedia.org/T325527 (MoritzMuehlenhoff)
[10:25:47] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) p:Triage→Unbreak! a:BTullis
[10:28:03] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Picking up this ticket and working in it with...
[10:35:45] Just to let you know, steve_munene and I will be working on T325331 today, due to user facing errors in Superset.
[10:35:46] T325331: Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331
[10:39:32] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I can replicate the errors, so I think that th...
[10:48:59] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) We can see errors in the logs for all workers,...
[10:50:52] Data-Engineering-Planning, Event-Platform Value Stream, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Upgrade the WDQS streaming updater to latest flink (1.15) - https://phabricator.wikimedia.org/T289836 (dcausse)
[10:51:03] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I think that the first thing for us to try wou...
[10:52:07] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Running this cookbook now. ` btullis@cumin1001...
[11:13:53] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:28] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Cookbook finished at: 11:28 UTC ` !log btullis...
[11:40:04] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Initial results show that we are still getting...
[11:41:13] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I suggest that we experiment with shutting dow...
[11:41:58] Data-Engineering, Equity-Landscape: Grants input metric - https://phabricator.wikimedia.org/T309276 (ntsako) a:KCVelaga_WMF→ntsako
[11:43:28] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (ops-monitoring-bot) Icinga downtime and Alertmanager si...
[11:45:22] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I have downtimed the 10 new hosts with the coo...
[11:47:20] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) The number of workers dropped as expected. I h...
[11:52:43] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) It does! All dashboards mentioned in the descr...
[11:56:36] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Booted an-presto10[06-10] using `ipmitool` ` b...
[12:01:58] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Cluster expanded to 10 nodes as expected. {F35...
[12:08:58] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I've tried a couple of queries with 10 workers...
[12:10:28] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (ops-monitoring-bot) Icinga downtime and Alertmanager si...
[12:11:33] !log systemctl start hadoop-namenode-backup-hdfs.service on an-master1002 at 11am UTC
[12:11:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:16:26] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I have shut down an-presto100[1-5] with the fo...
[12:34:28] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) p:Unbreak!→High These dashboards seem...
[13:02:48] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (JAllemandou) Thank you for the quick investigation @BTu...
[13:08:00] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (JAllemandou) A quick look at the [[ https://grafana.wik...
[13:20:30] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) >>! In T325331#8478159, @JAllemandou wrote: >...
[13:45:08] !log restart presto-server on an-coord1001 to increase heap from 4GB to 16 GB T325331
[13:45:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:45:11] T325331: Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331
[13:52:56] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), and 2 others: Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I have now deployed that change and restarted the presto-server on an-coord1001 with the incr...
[13:58:17] (PS7) Snwachukwu: [WIP] Refactor and Expand External referer classification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/864772 (https://phabricator.wikimedia.org/T309769)
[14:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[14:20:32] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), and 2 others: Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) I can see the increase in the heap size and there is plenty of headroom now. {F35876239,width...
[14:21:49] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), and 2 others: Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) ` btullis@cumin1001:~$ sudo cookbook sre.hosts.remove-downtime --force 'an-presto101[1-5].eqi...
[14:31:05] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) @Volans - many thanks for your feedback. Could you try the SQLLab features again in superset-next...
[14:35:08] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (BTullis) > I believe that the access to this feature is completely enabled/disabled by RBAC now, so I've a...
[14:43:07] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) Tests with 10 new workers only and increased h...
[14:48:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[15:00:11] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (BTullis) So we're now running the 5 original workers. T...
[15:03:28] elukey, btullis: k8s meeting? https://meet.google.com/wjt-srdx-cgq
[15:03:46] Only Brian and myself in there at the moment. If no one joins, we'll just cancel for this week
[15:04:24] gehel: Sorry, I can't make this slot most weeks because of childcare.
[15:04:47] we can try to change time at some point... we'll see next year!
[15:05:01] same thing sorry! I can join in a bit (sorry I thought I was optional)
[15:06:24] ah I see everybody left already, sorry gehel !
[15:06:56] elukey: you are optional! Since it was only Brian and myself, I pinged a few people before canceling
[15:40:56] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JAllemandou) Investigation results: The overcount affecting `unique_devices_per_project_family` when com...
[15:51:36] k8s meeting? gehel btullis should I go?
[15:53:23] Data-Engineering-Planning, Data Pipelines: Update refinery-source PageviewDefinition to better handle `Special:` pages - https://phabricator.wikimedia.org/T325544 (odimitrijevic)
[15:53:49] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (odimitrijevic) Thank you @JAllemandou! That explains things clearly. I have added the follow up work to t...
[16:01:24] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): Investigate wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (JAllemandou)
[16:22:59] Data-Engineering-Planning, Product-Analytics, Superset, Editing-team (Tracking), Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Superset: Presto backend: Unable to access some charts - https://phabricator.wikimedia.org/T325331 (JAllemandou) Indeed - same GC problems despite a way hi...
[16:38:24] Data-Engineering, CirrusSearch, Event-Platform Value Stream, Discovery-Search (Current work): EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (MPhamWMF)
[16:38:54] Data-Engineering, CirrusSearch, Event-Platform Value Stream, Discovery-Search (Current work): EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (Gehel)
[16:44:11] Data-Engineering, CirrusSearch, Event-Platform Value Stream, Discovery-Search (Current work): EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (dcausse) a:dcausse
[18:33:21] Data-Engineering-Planning, Patch-For-Review, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): NEW FEATURE REQUEST: Upgrade superset to 1.5.2 - https://phabricator.wikimedia.org/T323458 (Volans) @BTullis yes I can see the SQLLab and run queris just fine now. All good then for the first two poi...
[20:08:26] Data-Engineering-Planning, Product-Analytics, Data Pipelines (Sprint 05-06): Investigate wikimedia and wikidata unique devices per-project-family overcount offset - https://phabricator.wikimedia.org/T301403 (odimitrijevic) I believe this is the same problem discussed in https://phabricator.wikimedia....
[20:11:51] Data-Engineering-Planning, Data Pipelines: Update refinery-source PageviewDefinition to better handle `Special:` pages - https://phabricator.wikimedia.org/T325544 (odimitrijevic) @Joal can you please provide the delta between the two totals (per family and per project) and an estimate, if not the actual...
[21:04:30] Data-Engineering-Planning, Data Pipelines, Infrastructure-Foundations, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (odimitrijevic) @CDanis thanks for bubbling this up. We'll discuss when we get back...