[00:20:01] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:07] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:31:47] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:35] PROBLEM - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:32:43] Data-Engineering, Data-Engineering-Kanban, MediaWiki-extensions-EventLogging, Patch-For-Review: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (phuedx) Thanks for the ping! I added a few more test cases to each patch, having discovered a minor bug...
[11:52:59] I'm back btullis :)
[11:54:20] OK, just getting coffee. 3 mins, then see you in the batcave?
[12:01:41] I'm there.
[12:04:20] oops, sorry, missed the ping
[12:04:23] joining
[12:37:57] Data-Engineering, SRE, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (zeljkofilipin)
[13:21:20] Hey ottomata - let me know when you have a minute
[13:27:36] joal just finished emails, was gonna check with you!
[13:27:53] ottomata: I have about 1/2 hour before getting the kids - batcave?
[13:27:56] ya
[14:22:51] Data-Engineering, Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (Ottomata) Let me try to summarize the findings Joseph just explained to me. Somewhere around 13:08 the standby namenode (an-master1001 at the time) was restarted. On restart,...
[14:33:05] joal btw, should we re-enable locks? https://github.com/wikimedia/analytics-refinery/blob/master/gobblin/common/analytics-common.properties#L56-L60
[15:14:16] Data-Engineering, Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (Ottomata) ##### Backfill Backfilling the data. Joseph had a good idea: instead of discovering the right offsets, we can just run a custom gobblin job to import the offending...
[15:14:44] !log backfilled eventlogging data lost during failed gobblin job - T311263
[15:14:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:14:46] T311263: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263
[15:21:12] ok btullis, joal, i guess there are 2 big follow-ups still.
[15:21:17] 1. what is wrong with gobblin
[15:21:24] 2. what to do about namenode issues?
[15:21:35] are we sure 2 is caused by just too many files?
[15:21:52] it does seem that the number of files has increased a lot in the past 6 months
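As a hedged illustration of the question raised at 15:21 (is the namenode strain really caused by too many files, and where is the growth concentrated?), a small check along these lines could compare per-directory file counts and average file sizes. It assumes only the standard `hdfs dfs -count` output format and an `hdfs` CLI on PATH; the candidate paths are placeholders, not a statement about the actual cluster layout.

    #!/usr/bin/env python3
    # Sketch only: report file count and average file size per HDFS directory,
    # to see whether small files dominate and which paths account for the growth.
    # The paths below are placeholders.
    import subprocess

    CANDIDATE_PATHS = [
        "/wmf/data",   # placeholder
        "/user",       # placeholder
        "/tmp",        # placeholder
    ]

    def count_files(path):
        # `hdfs dfs -count <path>` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
        out = subprocess.check_output(["hdfs", "dfs", "-count", path], text=True)
        _dirs, files, size, _name = out.split(None, 3)
        return int(files), int(size)

    for path in CANDIDATE_PATHS:
        files, size = count_files(path)
        avg_mib = (size / files / (1024 * 1024)) if files else 0.0
        print(f"{path}: {files} files, avg file size {avg_mib:.1f} MiB")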
[15:22:05] https://grafana-rw.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=28&from=1640580634172&to=1656084018180
[15:22:26] that's about a 25% increase?
[15:22:33] we don't have data in prometheus going back farther than 2022-01?
[15:25:29] i dunno if it's useful, but i wouldn't be surprised if onboarding more teams to use spark resulted in many more tiny files. It was only once, but while reviewing other teams' work i found one job emitting thousands of <1MB files because spark decided 90k partitions was the right number
[15:27:07] aye :)
[15:31:16] ... how are there fewer blocks than files?
[15:31:18] https://grafana-rw.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=28&from=1640580634172&to=1656084018180
[15:33:12] (CR) Mforns: [V: +2 C: +2] "LGTM! Thanks :]" [analytics/refinery] - https://gerrit.wikimedia.org/r/807963 (https://phabricator.wikimedia.org/T310890) (owner: Phuedx)
[15:46:54] ottomata: heya - do you need me now?
[16:01:58] ottomata: I understand the problem with the "dt not extracted" log - dt is actually correctly extracted (otherwise your backfill wouldn't have worked) - it's the logging line that is incorrectly set!
[16:05:46] (PS1) Joal: Fix logging in JsonStringTimestampExtractor [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/808270
[16:11:54] Data-Engineering: [Wikistats] Add newly translated languages - https://phabricator.wikimedia.org/T311315 (mforns)
[16:33:48] joal: hello
[16:33:54] heya
[16:33:59] joal: oh right!
[16:34:12] joal: only if you have time and want to work on solving either of those 2 problems, i'm avail!
[16:34:31] ottomata: I have a few minutes - let's chat if you wish
[16:34:35] mkay
[16:50:43] Data-Engineering, Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (JAllemandou) Thanks a lot @Ottomata for the backfill. > So, data saved this time. We likely lost some data from when we had the namenode outage a few weeks ago and didn't noti...
[16:57:01] Thanks joal and ottomata for the backfill and analysis. Sorry I couldn't cross over much with you two this afternoon.
[17:10:54] This looks like a potentially useful way to find where the largest numbers of small files are located in a cluster: https://community.cloudera.com/t5/Community-Articles/Identify-where-most-of-the-small-file-are-located-in-a-large/ta-p/247253
[17:17:57] I could look at trying this out on Monday if we think it might help with troubleshooting the namenode issue.
[17:48:20] btullis: cool, sounds good, i think joal is going to look into that too on monday, so y'all can sync up then. I'm tracing some gobblin code rn to see if i can find out what happened
[18:41:43] Data-Engineering, Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (Ottomata) After some more reading of logs and gobblin/HDFS code, I have a few more findings: In [[ https://yarn.wikimedia.org/jobhistory/logs/an-worker1135.eqiad.wmnet:80...
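On the 15:25 remark about a Spark job emitting thousands of <1MB files (and the small-files hunt linked at 17:10), a minimal sketch of the usual mitigation is to cap the number of output partitions before writing. The paths, column name, and target file count below are hypothetical and chosen only to illustrate the pattern; this is not the refinery code.

    # Sketch, assuming hypothetical input/output paths and a column named "wiki".
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cap-output-files-example").getOrCreate()

    df = spark.read.parquet("hdfs:///tmp/example/input")   # placeholder path
    result = df.groupBy("wiki").count()                    # any wide transformation

    # Without an explicit cap, the output file count follows the shuffle
    # partitioning (spark.sql.shuffle.partitions, 200 by default, or far more if
    # inherited from upstream), which is how a job ends up writing thousands of
    # tiny files. coalesce() narrows partitions without another shuffle.
    num_output_files = 8
    (result
        .coalesce(num_output_files)
        .write
        .mode("overwrite")
        .parquet("hdfs:///tmp/example/output"))            # placeholder path

If output volume varies a lot between runs, repartition(n) (which does shuffle) or deriving n from the input size are the usual alternatives to a fixed constant.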
[18:47:17] (Abandoned) Urbanecm: Add avkwiki to analytics whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/617111 (https://phabricator.wikimedia.org/T257943) (owner: Urbanecm)
[19:04:31] (CR) Ottomata: [C: +2] Fix logging in JsonStringTimestampExtractor [analytics/gobblin-wmf] - https://gerrit.wikimedia.org/r/808270 (owner: Joal)
[22:58:58] Data-Engineering, Product-Analytics: Improvements to mediawiki_geoeditors_monthly dimensions - https://phabricator.wikimedia.org/T302079 (mpopov) > I wonder whether the requested change should be done for the data in Druid only, or if it would be valuable to change the geoeditors tables on the cluster....