[00:30:17] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:37:15] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:45:21] RECOVERY - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:57:05] Morning all. I'm going to kick off a rolling restart of the hadoop workers today. Hopefully it won't affect any running jobs, but I'll be on the lookout for anything that might be affected.
[08:58:03] !log cookbook sre.hadoop.roll-restart-workers analytics
[08:58:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:24:29] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) I have created a CR to increase the JVM heap on the namenodes. https://gerrit.wikimedia.org/r/c/operations/puppet/+/804551 Checking the memory utilization...
[09:36:53] o/ trying to use kafka-test from a stat machine and I seem to only have access to kafka-test1006 & kafka-test1010; the nodes in between (1007 -> 1009) do not seem to respond. I was wondering if this is expected?
[09:37:48] I can ping those, but I'm getting connection refused on port 9092
[09:38:12] doesn't sound right, dcausse, maybe the roll-restart that Ben's doing had those down while you were trying to access?
[09:39:36] milimetric: ah, perhaps? I'll wait a bit then, something might be in progress
[09:39:38] Woah, I just checked Icinga for kafka-test and David is 100% right. Bright lights all over the place. Looking at it now.
[09:40:02] btullis: thanks for a look!
[09:40:07] *taking
[09:40:17] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=kafka
[09:40:37] Disk space issues
[09:41:47] We've got one very big topic.
[09:41:51] https://usercontent.irccloud-cdn.com/file/6MEwuos0/image.png
[09:43:19] Since yesterday at 19:00 ish BST.
[09:43:22] https://usercontent.irccloud-cdn.com/file/XaQXkOS5/image.png
[09:43:27] btullis: oh ok, it's related to the work on the event platform, pinging them
[09:43:44] Thanks ever so much.
[09:45:55] I need to make sure that analytics is pinged for a bunch of these servers from Icinga/Alertmanager, because at the moment this just goes to #wikimedia-operations and gets missed.
[09:51:07] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis)
[09:51:30] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) p:05Triage→03Unbreak!
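A quick way to reproduce the symptom dcausse describes at 09:36-09:37 (hosts pingable, but port 9092 refusing connections) is a plain TCP check from the stat machine. This is a minimal sketch, not the commands actually run; the broker hostname is an assumption based on the node numbers mentioned.

```bash
# Hypothetical broker hostname, based on the node numbers mentioned above.
# A host that answers ping but refuses the Kafka port points at the broker
# process being down (e.g. mid roll-restart, or stopped on a full disk),
# rather than a network problem.
ping -c 1 kafka-test1007.eqiad.wmnet     # host-level reachability
nc -zv kafka-test1007.eqiad.wmnet 9092   # TCP check against the Kafka listener
```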
[09:52:37] I've created an unbreak-now ticket for recovery and follow-ups: https://phabricator.wikimedia.org/T310342
[10:25:59] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:52:37] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) We've had permission from @gmodena to prune or kill the topic. https://wikimedia.slack.com/archives/C02BB8L2S5R/p1654854733215369?thread_ts=1654852943.654799&...
[11:56:21] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) I have modified the topic to set the retention time to 1 second. ` btullis@kafka-test1006:~$ kafka configs --entity-type topics --entity-name mediawiki.page_c...
[12:25:20] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) I have now deleted the custom retention time. ` btullis@kafka-test1006:~$ kafka configs --alter --entity-type topics --entity-name mediawiki.page_content_chang...
[12:25:48] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) p:05Unbreak!→03Medium
[12:27:59] The issue with kafka-test is fixed now. I've purged the `mediawiki.page_content_change` topic as advised, by temporarily setting retention to 1 second, then reverting.
[12:28:51] Thanks again dcausse for bringing this to our attention. I'm going to make sure that we are alerted to this kind of thing first in future, so that we don't have to wait for you to tell us about it :-)
[12:29:12] btullis: thanks for taking care of the issue! :)
[12:32:08] 10Data-Engineering, 10Data-Engineering-Kanban: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) In order to free enough space for kafka to apply the new settings and purge the topic, I had to remove three old kernels from each broker with: ` sudo apt purg...
[12:36:50] joal: there are 3 more refine job failures that I think were older than the ones you responded to
[12:37:03] 2 legacy and one analytics
[12:37:30] ottomata: I think they were the same - newer ones were me rerunning jobs but still have failures
[12:37:52] Failed refinement of hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_WikipediaPortal/year=2022/month=06/day=09/hour=13 -> `event`.`wikipediaportal` /wmf/data/event/wikipediaportal/year=2022/month=6/day=9/hour=13
[12:38:31] that's legacy ^. I see you did a specialmutesubmit and a referencepreviewscite
[12:39:08] I did all 5 that were raised at the same time, including WikiPortal I think
[12:39:12] hm but i do see _SUCCESS there
[12:39:13] OHHHH
[12:39:16] okay
[12:39:31] oh i see that now in your response
[12:39:35] There were 3 emails with errors
[12:39:47] It's confusing :S
[12:39:48] OHHHHH i need to scroll down.
[12:39:49] wait.
[12:39:56] i see 5 emails
[12:40:16] yeah, 3 original emails, and me rerunning with failures
[12:40:56] huh, i guess gmail didn't thread your responses together for me with 3 of them.
[12:41:19] OHH i see
[12:41:19] right.
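For reference, the purge-by-retention trick btullis describes at 11:56-12:27 (set `retention.ms` very low so Kafka deletes the old segments, then drop the override so the topic falls back to the cluster default) looks roughly like this with the stock kafka-configs tool. This is a hedged sketch rather than a copy of the exact commands from T310342: the WMF `kafka` wrapper used there supplies the connection details, and the later patch reducing the kafka-test default retention from 7 days to 1 day is a broker-level setting, not a per-topic override.

```bash
# Sketch only: the bootstrap server below is a placeholder.
# 1. Temporarily retain data for ~1 second so old segments become eligible for deletion.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name mediawiki.page_content_change \
  --add-config retention.ms=1000

# 2. Once the disk space has been reclaimed, remove the override so the topic
#    reverts to the cluster default (log.retention.hours on the brokers).
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name mediawiki.page_content_change \
  --delete-config retention.ms
```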
[12:41:20] yeah that's exactly that
[12:41:25] you reran with failures the first time
[12:41:30] correct
[12:41:32] and that empty .gz caused more emails to be sent
[12:41:33] got it
[12:41:46] i'm going to respond to the non-responded-to ones just for posterity
[12:41:59] Sounds good - thanks for that
[12:42:23] okay perfect, thank you!
[12:42:29] thanks for checking :)
[12:56:29] (03PS1) 10Joal: Update geoeditors HQL [analytics/refinery] - 10https://gerrit.wikimedia.org/r/804574
[12:58:04] (03PS2) 10Joal: Update geoeditors HQL [analytics/refinery] - 10https://gerrit.wikimedia.org/r/804574
[13:02:34] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Out of disk space on multiple kafka-test brokers - https://phabricator.wikimedia.org/T310342 (10BTullis) We have decided to reduce the retention time on the kafka-test cluster from its default value of 7 days to 1 day. That is what this patch...
[13:20:10] (03CR) 10DCausse: Add Schema for Enriched MW Streams (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/799351 (https://phabricator.wikimedia.org/T308017) (owner: 10Luke Bowmaker)
[13:22:32] 10Data-Engineering, 10Data-Engineering-Kanban, 10Observability-Alerting: Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga - https://phabricator.wikimedia.org/T310359 (10BTullis)
[13:24:00] 10Data-Engineering, 10Data-Engineering-Kanban, 10Observability-Alerting: Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga - https://phabricator.wikimedia.org/T310359 (10BTullis) This is a follow-up ticket from this incident: {T310342}
[14:11:32] (03CR) 10Aqu: [C: 03+1] "Looks good." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/804574 (owner: 10Joal)
[14:17:06] 10Data-Engineering-Kanban: Merge the update to analytics refine job version in analytics test cluster - https://phabricator.wikimedia.org/T310362 (10Antoine_Quhen)
[14:32:51] 10Data-Engineering, 10Data-Engineering-Kanban, 10Observability-Alerting, 10Patch-For-Review: Ensure that the data-engineering team is alerted to all relevant host and service checks from Icinga - https://phabricator.wikimedia.org/T310359 (10BTullis) p:05Triage→03Medium
[14:40:24] joal: reviewed all your MRs! Thanks for doing all the work! Left some comments, but mostly LGTM all.
[14:51:32] mforns / joal: have we done any profiling on the airflow dag parsing, like https://towardsdatascience.com/profiling-and-analyzing-performance-of-python-programs-3bf3b41acd16?
[14:52:15] milimetric: no...
[14:52:31] k, I'll mess with that next week
[14:52:38] there's a reportupdater problem I don't understand
[14:52:46] the wmcs jobs have tons of errors, but the queries are fine
[14:53:11] I was debugging and was worried I might be taking too much CPU away from Airflow
[14:54:22] OK
[14:54:58] ottomata: I saw you comment on the Airflow DAG processor bump-up change. I don't understand where to define the processors['count'] though..
[14:55:33] milimetric: can I help with RU?
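On the DAG-parsing question raised at 14:51, one low-effort starting point is to run a single DAG file under cProfile, the same way the scheduler's DAG processor would import it, and see where the time goes. A minimal sketch, assuming a hypothetical DAG file path; the linked article covers snakeviz and line_profiler for a deeper look.

```bash
# Hypothetical DAG file path; sorting by cumulative time shows whether the
# cost is in the DAG's own code or in imports/setup shared by every file.
python3 -m cProfile -s cumtime /srv/airflow-analytics/dags/example_dag.py | head -n 40
```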
[14:55:56] mforns: it is defined by facter
[14:56:05] puppet should have it globally
[14:56:17] it's been a while and i remember the name being different, but i'm pretty sure it's that
[14:56:23] change it and let's run PCC to see what happens
[14:56:32] ok
[14:57:41] 10Data-Engineering, 10Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) I've merged that patch, but I will wait until next week before running the cookbook again to test it.
[15:00:49] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Analytics Data Lake - Hadoop Namenode failure - standby namenode backups filled up namenode data partition - https://phabricator.wikimedia.org/T309649 (10BTullis) I have added the following patch to add the analytics contact group to all of o...
[15:07:38] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Airflow DagProcessor not refreshing all dags - https://phabricator.wikimedia.org/T310297 (10Antoine_Quhen) The CPU pressure on an-launcher is now lower. The whole list of dags is parsed. Maybe we should: * optimize dag parsing (profiling) * a...
[15:07:42] mforns: https://phabricator.wikimedia.org/T310317#7994875
[15:07:42] 10Data-Engineering, 10Data-Engineering-Kanban, 10Cloud-Services, 10Developer-Advocacy: Data missing on the hierarchical view on the wmcs-edits tool - https://phabricator.wikimedia.org/T310317 (10Milimetric) The logs showed consistent errors since 2021-03, but I think it was just because this file had a tra...
[15:08:02] mforns: (tl;dr: I think it was just a trailing half-empty line, I'm rerunning now and everything seems ok)
[15:08:24] (currently about to finish `sudo -u analytics /usr/local/bin/kerberos-run-command analytics /usr/bin/python3 /srv/reportupdater/reportupdater/update_reports.py -l info /srv/reportupdater/jobs/reportupdater-queries/wmcs /srv/reportupdater/output/metrics/wmcs` on an-launcher1002 as a manual run)
[15:08:29] awesome! thanks a lot milimetric :]
[15:09:06] oh, one weird thing, the percent job errors even though it also seems to produce output... so weird:
[15:09:08] https://www.irccloud.com/pastebin/Z6Ai9HNO/
[15:24:11] milimetric: I think this kind of error is returned when the query returns no values
[15:24:20] *raised
[15:37:42] ottomata: will processors['count'] return an int or a string?
[15:38:32] mforns: not sure
[15:39:00] https://puppet.com/docs/puppet/7/core_facts.html#processors
[15:39:06] i guess integer
[15:39:07] !
[15:40:35] thanks!
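Since the open question above is what `processors['count']` looks like when puppet renders it into the Airflow template, a quick sanity check is to ask facter directly on the target host. A minimal sketch; the dot-notation lookup assumes Facter 3+ structured facts.

```bash
# Full structured 'processors' fact as documented at the puppet.com link above.
facter processors
# Just the logical CPU count that puppet will interpolate into the template.
facter processors.count   # assumes Facter 3+ dot-notation for structured facts
```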
[16:04:48] (03PS1) 10Btullis: Release v0.8.38 of DataHub using WMF customization [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/804611 (https://phabricator.wikimedia.org/T310079)
[16:39:41] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:46:02] btullis: analytics1068 downtime expired
[16:50:55] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:54:56] ah mforns to_s not necessary
[16:55:06] it will be rendered out as a string in the template no matter what
[16:55:16] also that is a ruby-ism, not puppet :)
[16:55:17] https://puppet-compiler.wmflabs.org/pcc-worker1002/35820/
[16:55:18] looks good
[16:55:19] merging
[17:00:00] !log applied change to airflow instances to bump scheduler parsing_processes = # of cpu processors
[17:00:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:25:53] RhinosF1: ok, sorry. I will renew it.
[17:36:59] btullis: np, just caught my eye
[18:40:22] Dan, thanks for the link about profiling. I want to share what I found myself:
[18:40:22] Could not find much information with snakeviz, cProfile, or line_profiler.
[18:40:22] Then I compared the loading times with an empty dag file. They are almost the same (~0.6s locally, 0.9s on an-launcher). So maybe there is nothing wrong with our code.
[18:40:22] `find ./analytics -type f -name '*_dag.py' -exec time python "{}" \;`
[18:40:22] (Also, I've tried with Airflow 2.3 / Python 3.9, and loading times are almost the same.)
[18:40:23] I wish you better luck and a good weekend :)
[20:06:28] 10Data-Engineering, 10Data-Engineering-Kanban, 10Cloud-Services, 10Developer-Advocacy: Data missing on the hierarchical view on the wmcs-edits tool - https://phabricator.wikimedia.org/T310317 (10Milimetric) Ok, jobs ran, dashboard looks ok again, I think it's solved, ping me again if anything seems weird.
[20:08:35] awesome work Antoine, I mean my guess is that their code is all messed up, I'd want to dig into it with a profiler. My intuition says they're doing something for every file that could be shared across all files. Like opening/closing connections or spinning up an interpreter or something.
[21:18:29] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Generated Data Platform, 10Patch-For-Review: Add better support for using Event Platform streams with the Flink DataStream API - https://phabricator.wikimedia.org/T310302 (10Ottomata) I think this is coming along nicely! Many thanks to @G...
[21:19:54] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Generated Data Platform, 10Patch-For-Review: Add better support for using Event Platform streams with the Flink DataStream API - https://phabricator.wikimedia.org/T310302 (10Ottomata) A cool thing about [[ https://gerrit.wikimedia.org/r/c...
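To confirm that the parsing_processes bump logged at 17:00 actually took effect, the value can be read back on the Airflow host and compared against the CPU count the facter fact should have produced. A minimal sketch, assuming an Airflow 2.x CLI is available on the instance.

```bash
# Effective scheduler setting after the puppet change (Airflow 2.x CLI).
airflow config get-value scheduler parsing_processes
# CPU count the facter-derived value should match.
nproc
```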
[23:23:43] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (10odimitrijevic) Completed: https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation/Rubric/data-catalog-evaluation_server_notes
[23:23:51] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Epic: Data Catalog Technical Evaluation - https://phabricator.wikimedia.org/T293643 (10odimitrijevic) 05Open→03Resolved