[00:21:27] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1 looks like abstract art to me
[00:22:10] But, notice a signal through the chaos in "HDFS Datanode Heap": rolling restarts of hadoop java processes cleaned up heap usage
[00:24:27] Doing the druid test cluster java restarts manually, since there is no option in the cookbook for the 1-node druid test "cluster" yet (though I added it here: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/704443)
[00:24:37] !log razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical
[00:24:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:24:44] razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
[00:24:47] whoops
[00:24:49] !log razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
[00:24:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:28:33] !log razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager
[00:28:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:33:26] !log razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker
[00:33:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:33:54] !log razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator
[00:33:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:34:26] !log razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper
[00:34:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[00:35:15] Ok it worked, `razzi@an-test-druid1001:~$ sudo lsof -Xd DEL` came back clean
[00:35:50] Only 1 node type left for https://phabricator.wikimedia.org/T283067, the production druid cluster!!!
[00:37:20] Signing off for the day, see y'all tomorrow!
[01:25:09] PROBLEM - Check unit status of monitor_refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:02:16] 10Analytics, 10Product-Analytics: Investigate running Stan models on GPU - https://phabricator.wikimedia.org/T286493 (10mpopov)
[06:48:40] razzi: yep, exactly: the hdfs datanodes' heap consumption is lower right after the restart, then the old gen slowly fills up with objects as the datanodes get traffic (not sure exactly what ends up on the heap)
[09:03:09] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10BTullis) Dry run of `sre.aqs.roll-restart` was successful this time. ` DRY-RUN: END (PASS) - Cookbook sre.aqs.roll-restart (exit_cod...
[09:04:45] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10BTullis) p:05Triage→03Medium
[09:30:46] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[09:59:26] 10Analytics, 10Analytics-EventLogging, 10Wikimedia-production-error: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (10Ammarpad) Caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Even...
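A condensed sketch of the manual sequence razzi logged at 00:24–00:35 above: restart each Druid service on the single test node, then zookeeper, and confirm with `lsof -Xd DEL` that nothing is still holding deleted (pre-upgrade) files. The unit names and the lsof check are taken from the log; the loop and the fixed pause are illustrative only and are not the cookbook change proposed in gerrit 704443.

    # Illustrative only: same restart order as above, on an-test-druid1001.
    for unit in druid-historical druid-overlord druid-middlemanager \
                druid-broker druid-coordinator zookeeper; do
        sudo systemctl restart "$unit"
        # brief pause between units (arbitrary here; a real roll-restart
        # would wait on service health rather than a fixed sleep)
        sleep 30
    done
    # Clean (empty) output means no process still maps deleted jars/libs,
    # i.e. everything is running on the freshly installed Java packages:
    sudo lsof -Xd DEL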
[10:20:19] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[11:43:57] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[11:58:10] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[11:59:20] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[12:04:17] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[12:09:44] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[12:53:54] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[12:54:42] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[13:01:21] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[13:05:57] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[13:13:40] yeah joal we need some refine tuning eh?
[13:18:03] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[13:44:26] 10Analytics-Clusters, 10Analytics-Kanban, 10User-MoritzMuehlenhoff: Reduce manual kinit frequency on stat100x hosts - https://phabricator.wikimedia.org/T268985 (10BTullis)
[13:45:37] 10Analytics-Clusters, 10Analytics-Kanban, 10User-razzi: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10BTullis)
[13:49:26] 10Analytics-Clusters, 10Analytics-Kanban: Add analytics-presto.eqiad.wmnet CNAME for Presto coordinator failover - https://phabricator.wikimedia.org/T273642 (10BTullis)
[13:55:23] joal FYI I bumped Refine spark executor memory from 4G to 8G
[13:55:26] tasks were OOming
[13:55:35] i guess because of the gzip decompression?
[14:52:49] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 (10Ottomata) @JAllemandou These are the OOMs in Refine we are getting, def looks gzip related: ` Container: container_e18_1623774792907_150589_01_000010 on an-worker1132.eqiad.wmnet_804...
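For context on the memory bump: Spark executor heap is controlled by `spark.executor.memory` (the `--executor-memory` flag), and gzip input is not splittable, so a single task has to decompress a whole file and needs the extra headroom. Below is a hedged sketch of where the 4G → 8G setting and the 64-executor cap mentioned later in this log would sit on a generic spark-submit command line, assuming the spark2-submit wrapper; it is not the actual Refine invocation on an-launcher1002, and the class/jar names are placeholders.

    # Hedged sketch, not the real Refine launch (that is managed by systemd
    # timers); shown only to place the two settings discussed in this log.
    spark2-submit \
      --master yarn \
      --deploy-mode cluster \
      --executor-memory 8G \
      --conf spark.dynamicAllocation.maxExecutors=64 \
      --class org.example.RefineJob \
      refine-job.jar
    # org.example.RefineJob / refine-job.jar are placeholders, not the real
    # refinery-source coordinates.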
[15:06:21] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10cmooney)
[15:07:06] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[15:08:00] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10cmooney)
[15:10:43] ottomata: o/ I see in the cal that the ops sync conflicts with the monthly tech update.. should we reschedule?
[15:11:04] tech update too!
[15:11:11] it also conflicted with the event platform sync :/
[15:11:13] yes we should reschedule
[15:11:15] i'll find a place
[15:11:16] ottomata: Out of interest, where were the OOMs being logged?
[15:12:00] btullis: in the refine job output. the job runs all over the hadoop workers. after the job finishes, yarn aggregates all the worker log files into a big log file in hdfs
[15:12:02] you can get it by running
[15:12:09] sudo -u analytics yarn logs -applicationId
[15:12:28] so like
[15:12:36] sudo -u analytics yarn logs -applicationId application_1623774792907_150589
[15:12:42] from an-launcher1002.eqiad.wmnet
[15:12:44] should show you
[15:14:08] ottomata: Cool, no hurry. I was just starting to browse in `an-launcher1002:/var/log/refinery` but realized that these were probably just client-side logs.
[15:14:32] yeah when the spark job runs in yarn cluster mode
[15:14:42] the spark master process logs are also on a remote hadoop worker
[15:15:41] if it was run in yarn client mode, you'd see the spark master logs locally, buuuut still the remote spark executors run on hadoop workers, and to see their logs you'd need to either access the worker directly while the job is running, use the Yarn web UI (which doesn't work well without ssh tunnels), or just wait til the job is done
[15:15:54] we have a ticket to do log aggregation more frequently for long running jobs
[15:16:07] https://phabricator.wikimedia.org/T269616
[15:16:13] we tried...but it isn't working
[15:18:08] Awesome. Thanks for the info.
[15:18:41] Hi ottomata - today is a bank holiday in France, working from now on :)
[15:19:01] ottomata: thanks for having bumped the memory - were all the failed jobs from eventlogging_legacy?
[15:19:05] no
[15:19:10] just one from el legacy
[15:19:13] most from refine_event
[15:19:15] all large datasets
[15:20:40] and, i think one job did fail since I bumped memory
[15:36:35] joal, max executors are 64
[15:36:49] maybe more would be helpful? fewer tasks assigned per executor?
[15:37:10] ottomata: let's batcave on this for a minute?
[15:37:12] k
[15:38:29] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (10JMeybohm)
[15:40:17] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10JMeybohm)
[15:48:37] 10Analytics, 10DBA, 10Infrastructure-Foundations, 10SRE, 10netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10JMeybohm)
[16:07:25] btullis: standup?
[16:08:18] oh nm!
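Ottomata's log-retrieval recipe at 15:12 above, written out as one runnable example. Only the `yarn logs -applicationId` command, the application id, and the an-launcher1002 host come from the log; the redirect and the grep are an illustrative way to pull out the executor OOM stack traces.

    # From an-launcher1002, once the application has finished and YARN has
    # aggregated the per-container worker logs into HDFS:
    sudo -u analytics yarn logs -applicationId application_1623774792907_150589 \
      > /tmp/application_1623774792907_150589.log
    # Then search the combined log for executor OutOfMemoryError stack traces:
    grep -n "OutOfMemoryError" /tmp/application_1623774792907_150589.log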
[16:09:48] 10Analytics-Radar, 10Product-Analytics, 10Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10Milimetric)
[16:27:31] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:38:25] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:41:01] 10Analytics-Radar, 10Product-Analytics, 10Pageviews-Anomaly: Analyse possible bot traffic for ptwiki article Ambev - https://phabricator.wikimedia.org/T282502 (10Milimetric) (TODO) We could use a standard way of handling these tasks at WMF. I feel like it would be ignored if it was on a wiki, but maybe ever...
[16:51:55] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, and 2 others: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata)
[17:15:22] (03PS1) 10Ottomata: Refine - explicitly uncache DataFrame when done [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/704576 (https://phabricator.wikimedia.org/T271232)
[17:15:51] joal https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/704576
[17:16:00] reading!
[17:23:05] (03CR) 10Joal: [C: 03+1] "LGTM! Let's try" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/704576 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata)
[17:24:00] joal i wonder if there is a way we can compare
[17:24:09] some spark memory usage stats before and after?
[17:24:15] hm - not easy
[17:24:31] ottomata: we would need to use babar I think (from Criteo)
[17:24:49] ottomata: https://github.com/criteo/babar
[17:26:10] Going to go ahead with the druid java service restarts
[17:26:18] joal cool
[17:26:19] https://github.com/criteo/babar#profiling-a-spark-application
[17:26:31] Metrics are here, looks healthy https://grafana.wikimedia.org/d/000000538/druid?orgId=1&refresh=1m
[17:28:37] razzi: druid-public right?
[17:29:05] if so I'll watch metrics, curious how it goes
[17:29:27] (it is the backend of AQS so hopefully no alarm will fire)
[17:29:52] I was just going to ask what the deal is with druid public versus analytics
[17:30:03] Do we have to restart both?
[17:30:48] *searches druid on wikitech to find out about the druid clusters, should have done that first*
[17:31:16] hm ok we actually don't have clear docs on the different druid clusters!
[17:31:57] razzi: analytics is the one used by turnilo, superset etc..
[17:32:31] public is the one that gets only one dataset loaded, the mw history snapshot, and it is called by the AQS api
[17:32:44] (and it is also not in the analytics VLAN, and has a load balancer in front of it)
[17:32:57] the cookbook distinguishes between the two, especially for the pool/depool actions
[17:33:23] and yes we need to restart both :)
[17:33:43] also zookeeper on both clusters, that can be done running the zookeeper cookbook (there are options for both druid clusters)
[17:33:59] cool, that all makes sense, I'll make a personal todo to write that up on wikitech
[17:35:51] re: zookeeper, I'll use cookbooks/sre/zookeeper/roll-restart-zookeeper.py after I restart druid processes
[17:36:01] exactly yes
[17:36:32] IMO we could get away with fewer zookeeper clusters, I doubt they have that much traffic
[17:36:57] the fact that one zookeeper cluster can go down without affecting other services is kinda cool, but it's a fault-tolerant system already
[17:37:00] yep definitely, but copying znodes to an-conf* (that was created later) is not that easy :D
[17:37:36] hadoop used the zookeeper main cluster before an-conf was created, because we wanted to separate concerns with SRE
[17:37:44] but both druid clusters were already there
[17:37:56] here goes with the druid public java restarts
[17:37:58] (with zookeeper co-located)
[17:38:09] will run: sudo cookbook sre.druid.roll-restart-workers public
[17:38:12] in tmux
[17:38:15] and watch metrics :)
[17:39:23] +1
[17:39:43] !log sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067
[17:39:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:42:35] razzi: there is an extra caveat that I have never had the time to fix, namely the fact that after restarting the clusters a roll restart of the prometheus-druid-exporter services (one for each node) is needed
[17:44:10] https://github.com/wikimedia/operations-software-druid_exporter#known-limitations
[17:44:42] so the way that we collect metrics is that we force druid to push (via HTTP POST) metrics to a localhost daemon that exposes them as a prometheus exporter
[17:45:13] when a roll restart happens, the overlord and coordinator leaders will likely change
[17:45:44] so the past leaders stop emitting metrics, and due to how prometheus works the exporters keep exposing the last value of their metrics (and not zero or null etc..)
[17:45:53] I believe that some fix to the exporter may resolve this
[17:46:03] but it has been in my backlog for a long time :)
[17:56:05] ack elukey, would the fix be to add it to the cookbook?
[17:57:18] razzi: yes it could be a good addition
[18:06:14] going out for a run, things look stable, ttl!
[18:10:28] cool ttyl elukey!
[18:12:05] Druid public cookbook's still running, for anybody who's following along
[18:12:08] very slow
[18:12:17] 3/5 hosts done
[18:28:08] awesome razzi :)
[18:28:21] razzi: dumb question on java restart - have you done cassandra?
[18:28:35] joal: Have not!
[18:28:42] Maybe Moritz missed that one
[18:28:48] https://phabricator.wikimedia.org/T283067
[18:28:49] I think it should be added to the list :)
[18:30:19] razzi: I added a comment to the task
[18:43:46] 10Analytics, 10Analytics-EventLogging, 10Wikimedia-production-error: TypeError: Return value of JsonSchemaHooks::onEditFilterMergedContent() must be of the type boolean, none returned - https://phabricator.wikimedia.org/T286611 (10Krinkle) Fall out from T280312.
\cc @Func, @matmarex
[18:48:05] 10Analytics, 10Analytics-EventLogging, 10Platform Team Workboards (MW Expedition), 10Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (10Krinkle) It only happens currently on metawiki during edits for the Schema namespace with a `J...
[18:56:05] Ok druid public restarts completed without an issue, going to move on to druid analytics cluster
[19:00:24] actually I'm going to eat lunch first
[19:06:02] ottomata: looks like our spark tweaks did the job \o/
[19:06:28] I'm gonna sign off for tonight - I'll be triaging emails tomorrow (I'm back onto gsuite ! yay!)
[19:18:54] 10Analytics, 10Analytics-EventLogging, 10Platform Team Workboards (MW Expedition), 10Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (10daniel) >>! In T286610#7213244, @Krinkle wrote: > Retagging EventLogging as the issue is not i...
[19:20:32] 10Analytics, 10Analytics-EventLogging, 10Platform Team Workboards (MW Expedition), 10Wikimedia-production-error: Exception: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T286610 (10daniel) #pet-mw-expedition got tagged since PET has touched PageEditStash recently. But I don'...
[19:20:42] joal: yeehaw!
[19:20:43] laters!
[20:53:01] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, and 2 others: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) FYI a deploy earlier today caused an MW API outage. Incident report here: https://wikitech.wikim...
[20:55:38] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Services, and 2 others: EventGate should use recent service-runner (^2.8.1) with Prometheus support - https://phabricator.wikimedia.org/T272714 (10Ottomata) @colewhite Ok, I need to fork node-rdkafka-prometheus to fix this. Tomorrow I'll make a `@wikime...