[01:22:52] 10Analytics, 10API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (10BPirkle)
[01:23:49] 10Analytics, 10API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (10BPirkle) p:05Triage→03Medium
[02:21:49] mforns: hm, i think there is
[02:29:05] hm, mforns the answer isn't so simple
[02:29:06] but
[02:29:07] https://github.com/wikimedia/puppet/blob/production/modules/admin/manifests/hashuser.pp#L17-L49
[02:29:39] there are system users that are outside of that range, but not any that we will ever give keytabs or use for this
[02:31:06] you could just assume that if > 1000 it is a real user, but there are some old real users that have uids < 900
[02:31:14] (users and uids are defined starting here: https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L1096)
[02:31:41] so, for your purposes, if the uid is between 900 and 950, it is a system user
[05:19:12] (VarnishkafkaNoMessages) firing: varnishkafka for instance cp5016:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=eqsin%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp5016:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:24:12] (VarnishkafkaNoMessages) firing: (2) varnishkafka for instance cp5010:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[05:29:12] (VarnishkafkaNoMessages) resolved: (2) varnishkafka for instance cp5010:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[06:51:05] RECOVERY - Check unit status of analytics-dumps-fetch-pageview on clouddumps1002 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:37:20] 10Analytics, 10Data-Engineering, 10Event-Platform: mediawiki/page/properties-change schema should use map type for added and removed page properties - https://phabricator.wikimedia.org/T281483 (10JAllemandou) Adding perspective on this: Using map instead of a structured and defined schema allows for more flexi...
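The uid heuristic described above (system users get uids in the 900–950 range reserved by hashuser.pp, while some legacy human accounts sit below 900) can be sketched as a small helper; this is an illustrative sketch of the rule stated in the conversation, not code from the actual script, and the exact bounds live in hashuser.pp.

```python
def classify_uid(uid: int) -> str:
    """Classify a POSIX uid per the heuristic discussed above.

    Assumed rule from the conversation: uids between 900 and 950
    (inclusive) are reserved for system users; everything else is a
    real (human) user, including legacy accounts with uids below 900.
    """
    if 900 <= uid <= 950:
        return "system"
    return "human"
```

Note that a simple `uid > 1000` check would misclassify those legacy human accounts, which is why the reserved-range check is the safer test here.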
[07:41:03] RECOVERY - Check unit status of analytics-dumps-fetch-mediacounts on clouddumps1002 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:50:20] Good morning team :)
[07:51:05] aqu: Would you mind please reviewing https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/94 and https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/90
[07:51:20] aqu: They are relatively small, and I wish to merge/deploy them today if possible
[07:51:58] Morning!
[07:57:01] joal: done
[07:57:08] Many thanks aqu :)
[07:58:06] Ok, merging and deploying my 2 airflow patches
[08:24:23] aqu: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/90
[08:24:26] please :)
[08:26:50] joal: perfect :)
[08:27:06] \o/
[08:27:08] thanks :)
[08:35:50] Morning all :-)
[08:37:02] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (10Antoine_Quhen) @Snwachukwu, following our talk about optimization, you may try: * to use the un-shaded job jar, which is lighter than the shaded one,...
[08:45:17] Hi btullis :)
[08:52:30] !log Deploy airflow
[08:52:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:59:06] (03CR) 10DCausse: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[09:18:51] (03PS1) 10Joal: Fix geoeditors HQL bug [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807922
[09:19:00] aqu: if you have a minute --^
[09:21:10] 10Data-Engineering: Add an-worker11[42-48] to the Hadoop cluster - https://phabricator.wikimedia.org/T311210 (10BTullis)
[09:27:19] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Refactor HDFSArchiveOperator to run in Skein - https://phabricator.wikimedia.org/T310542 (10JAllemandou) >>! In T310542#8022120, @Antoine_Quhen wrote: > @Snwachukwu, following our talk about optimization, you may try: > * to use the un-shade...
[09:28:12] joal: Hi! are you re-enabling the geoeditors jobs?
[09:28:32] yes mforns - I did that, but found yet another bug (see my CR just above)
[09:28:46] I saw! Thanks a lot for fixing the jobs!
[09:29:03] will update the spreadsheet
[09:29:16] mforns: no problem at all :) Would you mind reviewing that fix as well?
[09:32:35] (03CR) 10Mforns: [C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807922 (owner: 10Joal)
[09:34:17] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for next deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807922 (owner: 10Joal)
[09:37:57] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) I have created a patch to increase the heap once again, from 72 GB to 84 GB.
[10:19:03] (03CR) 10Vivian Rook: [C: 03+2] Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook)
[10:21:58] (03Merged) 10jenkins-bot: Introduce black formatting to quarry [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/807534 (https://phabricator.wikimedia.org/T288976) (owner: 10Vivian Rook)
[10:23:52] 10Quarry, 10Patch-For-Review: Add black formatting to quarry linter - https://phabricator.wikimedia.org/T288976 (10rook) 05Open→03Resolved
[10:23:58] 10Quarry, 10Epic, 10cloud-services-team (Kanban): Productionize quarry a bit - https://phabricator.wikimedia.org/T288982 (10rook)
[11:25:20] !log kill oozie mediawiki-geoeditors-monthly-coord in favor of airflow job
[11:25:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:26:14] 10Data-Engineering-Radar, 10Growth-Team, 10MediaWiki-extensions-GuidedTour, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), 10Patch-For-Review: Finish decommissioning the legacy GuidedTour schemas - https://phabricator.wikimedia.org/T303712 (10phuedx)
[11:44:27] mforns: Would you have a minute now?
[11:53:43] (03CR) 10Gmodena: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[12:07:01] joal: heya! now yes
[12:13:45] (03PS1) 10Phuedx: Remove MediaViewer and MultimediaViewer* allowlist entries [analytics/refinery] - 10https://gerrit.wikimedia.org/r/807963 (https://phabricator.wikimedia.org/T310890)
[12:14:30] 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), 10Patch-For-Review: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx)
[12:19:36] 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), 10Patch-For-Review: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx)
[12:27:54] Hey mforns
[12:28:02] sorry, I had to go earlier
[12:28:10] no prob!
[12:28:14] wanna talk now?
[12:28:21] sure!
[12:28:27] bc!
[12:28:27] batcave!
[12:38:42] mforns: just checking you saw my response about system vs human users
[12:41:28] aqu: looks like that wikipediaportal one has a different problem
[12:41:30] than usual!
[12:41:37] https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L1096
[12:41:40] oops
[12:41:41] not that
[12:41:51] Could not extract /$schema field from event, field does not exist
[12:44:38] yes, rerunning it with ignore_failure_flag is not solving the problem
[12:45:50] 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform, 10Fundraising-Backlog, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10phuedx)
[12:45:53] aqu want to debug it together?
[12:45:54] ottomata: I saw! thanks :] In the end I didn't use that in the script. I thought it would be more explicit to make the user execute kerberos-run-command, what do you think?
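The auto-authentication idea discussed above (if the sudo'd user's keytab under /etc/security/keytabs/ is readable, wrap the command with kerberos-run-command automatically; otherwise leave it to the user) could be sketched like this. This is a hypothetical sketch, not the actual script: the keytab filename layout is assumed, and the `kerberos-run-command <user> <command...>` invocation shape is an assumption about the WMF wrapper.

```python
import getpass
import os
import subprocess


def build_command(cmd, user, keytab_readable):
    """Prefix cmd with kerberos-run-command when the user's keytab is
    readable, so Kerberos authentication happens automatically."""
    if keytab_readable:
        return ["kerberos-run-command", user] + list(cmd)
    return list(cmd)


def run_with_kerberos(cmd):
    """Run cmd, auto-wrapping it when a readable keytab is found."""
    user = getpass.getuser()
    # Hypothetical keytab location; the real layout under
    # /etc/security/keytabs/ may differ.
    keytab = os.path.join("/etc/security/keytabs", user, f"{user}.keytab")
    readable = os.access(keytab, os.R_OK)
    return subprocess.run(build_command(cmd, user, readable), check=True)
```

Keeping the command-building logic separate from the execution makes the readability check easy to exercise without a Kerberos environment.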
[12:46:17] ottomata: yep
[12:46:23] hm, mforns if you can do it automatically that would be pretty cool
[12:46:35] mforns: what if you just check if the expected keytab is readable
[12:46:39] and then do it automatically?
[12:46:47] if they sudo to a user
[12:46:59] and then /etc/security/keytabs/$USER/... whatever is readable
[12:47:03] then do kerberos-run-command
[12:47:04] ?
[12:47:39] or, maybe you can even bypass kerberos-run-command at that point and basically just kinit with the keytab in the script? then maybe you don't have to bash -c ...?
[12:53:25] ottomata: but then the user has to pass the keytab path...
[12:53:40] o, you mean.. ok I see
[12:54:26] 10Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10BTullis) No objections from me. Do we need to seek approval from anyone else?
[12:55:52] ottomata: bash -c ?
[12:58:02] 10Data-Engineering: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (10phuedx)
[12:58:42] 10Data-Engineering: Drop MediaViewer and MultimediaViewer* tables - https://phabricator.wikimedia.org/T311229 (10phuedx)
[12:58:45] 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), 10Patch-For-Review: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx)
[12:59:12] 10Data-Engineering, 10MediaViewer, 10MediaWiki-extensions-EventLogging, 10MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), 10Patch-For-Review: Decommission the MediaViewer and MultimediaViewer* instruments - https://phabricator.wikimedia.org/T310890 (10phuedx)
[13:03:00] mforns: review sent - your views are welcome :)
[13:04:44] Heads-up, I'll shortly restart the namenode process on an-master1001 - which is currently in standby mode - increasing heap memory from 72 GB to 84 GB
[13:04:56] ack btullis - thanks for the note :)
[13:06:08] 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) I've applied the patch. I will restart an-master1001 to pick up the new settings, then wait at least 10 minutes before trying another failback operation fr...
[13:07:41] !log restarted hadoop-hdfs-namenode service on an-master1001
[13:07:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:27:17] 10Data-Engineering, 10Superset: Running into errors while adding Hive table to Superset dataset - https://phabricator.wikimedia.org/T284604 (10JArguello-WMF) a:05mforns→03None
[13:28:41] (03CR) 10Joal: Add projectview hql scripts to analytics/refinery/hql path. (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: 10Snwachukwu)
[13:30:25] 10Data-Engineering, 10Anti-Harassment, 10Product-Analytics: Distinguish between types of block events in the Mediawiki user history table - https://phabricator.wikimedia.org/T213583 (10JArguello-WMF)
[13:39:45] !log attempting failback of namenode service from an-master1002 to an-master1001
[13:39:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:39] !log The failback didn't work again.
[13:41:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:46:17] btullis: I think we're gonna need to investigate more :/ - I'll put up some time probably tomorrow - It's starting to become an emergency :(
[13:47:12] 10Data-Engineering, 10Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) Nope, even with an 84 GB heap and having waited over 30 minutes it still failed. ` btullis@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqi...
[13:47:36] joal: Thanks, I'd appreciate that.
[13:47:45] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:12] !log started the namenode service on an-master1001 after failback failure
[13:48:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:49:29] ACKNOWLEDGEMENT - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service Btullis T310293 restarted the systemd unit https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:59] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:08] (03CR) 10Ori: "This change is ready for review." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/807340 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[13:58:08] (03CR) 10Joal: [C: 03+1] "Ok to merge as is - We usually have the logic of the function in the refinery-core package and only the hive wrapping code in the refinery" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/807340 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[14:00:32] 10Data-Engineering, 10Data-Engineering-Kanban: HDFS Namenode failover failure - https://phabricator.wikimedia.org/T310293 (10BTullis) These are some entries from the health check log at the time: `an-master1001:/var/log/hadoop-hdfs/hadoop-hdfs-zkfc-an-master1001.log` (stacktraces trimmed) ` 2022-06-23 13:39:57...
[14:02:11] (03CR) 10Joal: Add projectview hql scripts to analytics/refinery/hql path. (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/797240 (https://phabricator.wikimedia.org/T309023) (owner: 10Snwachukwu)
[14:10:55] 10Analytics, 10Data-Engineering, 10Event-Platform: mediawiki/page/properties-change schema should use map type for added and removed page properties - https://phabricator.wikimedia.org/T281483 (10Ottomata) Right now, this schema doesn't have any defined properties, and is not a map type either. So, JSONSChe...
[14:16:37] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style) - https://phabricator.wikimedia.org/T301568 (10JArguello-WMF) @Ottomata @JAllemandou @mforns @Antoine_Quhen Did you reach an agreement on this one?
[14:19:43] 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Patch-For-Review: Deploy an-test-coord1002 to facilitate failover testing of analytics coordinator role - https://phabricator.wikimedia.org/T287864 (10JArguello-WMF)
[14:30:10] aqu: the other thing i'm going to do: i'm going to manually remove those bad events from the raw data
[14:30:13] and re-refine
[14:32:41] aqu: https://gerrit.wikimedia.org/r/c/operations/puppet/+/807995/
[14:32:43] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Cmjohnson) @RobH this may be a controller issue, the servers were able to go through the installation without any issue, after the install, th...
[14:35:27] ottomata: Then, we only need to refine again.
[14:44:02] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10Cmjohnson) @BTullis I don't have any real guidance for you other than all disks are controlled by the raid controller. Partman recipes are not a specialty of mine. pinging @robh...
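The map-vs-struct question in T281483 above comes down to a JSONSchema choice: a "map type" declares no fixed property names and instead constrains all values through `additionalProperties`, whereas a struct enumerates each field under `properties`. A minimal sketch of the map form (the field name is illustrative, not taken from the actual schema):

```json
{
  "added_properties": {
    "type": "object",
    "additionalProperties": { "type": "string" }
  }
}
```

In Event Platform schemas, an object with typed `additionalProperties` is what typically gets mapped to a map-typed column downstream, which is the flexibility JAllemandou refers to: new page properties need no schema change, at the cost of losing per-field typing and documentation.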
[14:51:59] (03CR) 10Ottomata: [C: 03+2] Deprecate chronology_id [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/804352 (https://phabricator.wikimedia.org/T241410) (owner: 10DCausse)
[14:52:06] (03CR) 10Ottomata: [C: 03+2] "ty" [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/804352 (https://phabricator.wikimedia.org/T241410) (owner: 10DCausse)
[14:52:32] (03Merged) 10jenkins-bot: Deprecate chronology_id [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/804352 (https://phabricator.wikimedia.org/T241410) (owner: 10DCausse)
[14:54:41] (03CR) 10Ottomata: WIP - Add new mediawiki entity fragments, and use them in new mediawiki page change schema (032 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/807565 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[15:05:05] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[15:18:58] 10Data-Engineering-Radar, 10Platform Engineering: Deploy AQS service to codfw clusters - https://phabricator.wikimedia.org/T309808 (10Eevans) The Cassandra cluster has now been expanded to codfw, and the AQS dataset is replicated there. I made the assumption when opening this ticket that -like RESTBase- the A...
[15:22:11] (03CR) 10Ori: UDF for testing uri_query for duplicate query parameters (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/807340 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[15:23:49] 10Data-Engineering-Radar, 10Platform Engineering: Deploy AQS service to codfw clusters - https://phabricator.wikimedia.org/T309808 (10BTullis) Thanks for all of your excellent work on this @Eevans. I'll discuss with the #data-engineering team what the plans are (or should be) regarding the multi-DC Druid and A...
[15:48:59] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: [Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style) - https://phabricator.wikimedia.org/T301568 (10mforns) I think the decision depends on a research that we have not yet done. We should do a time-bo...
[16:15:04] 10Data-Engineering: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10Ottomata) 05Open→03Declined After discussing more with Xabriel, it turns out he doesn't need this access for now.
[16:59:23] hmm, btullis joal: just tried to rerun a refine job and got
[16:59:24] Operation category READ is not supported in state standby.
[17:00:09] but hmm weird
[17:00:13] not for all refine tasks
[17:00:14] ??
[17:06:14] OH those are warnings
[17:06:56] huh, but the hadoop client is still failing on connection?
[17:07:00] 2022-06-23 16:57:16,473 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
[17:07:00] org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[17:08:56] hm
[17:08:57] RetryInvocationHandler: A failover has occurred since the start of call #1 ClientNamenodeProtocolTranslatorPB.getBlockLocations over an-master1002.eqiad.wmnet/10.64.21.110:8020
[17:14:18] hm okay the issue is that there are 0 size .gz files in the gobblin raw dirs
[17:20:20] (03CR) 10Joal: [C: 03+1] UDF for testing uri_query for duplicate query parameters (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/807340 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[17:21:02] ottomata: we've seen that happen last time we had failover issues
[17:21:37] same issue: empty gz files generated by gobblin, leading to failure because the gz decompressor doesn't find the magic header
[17:21:40] (03CR) 10Ori: UDF for testing uri_query for duplicate query parameters (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/807340 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[17:22:01] i kind of recall that, what did we do?
[17:22:10] ottomata: I dropped the empty files
[17:22:14] i'm trying to find out if there are gobblin dropped messages?
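The zero-size .gz files mentioned above can be located by scanning the raw data dirs; one possible sketch (not the actual cleanup procedure) parses `hdfs dfs -ls -R` output, whose fifth column is the file size:

```python
import subprocess


def parse_empty_gz(ls_output):
    """Given `hdfs dfs -ls -R` output, return paths of zero-length .gz
    files. Column 5 of each entry is the size; directory entries start
    with 'd' and are skipped."""
    paths = []
    for line in ls_output.splitlines():
        cols = line.split()
        if (len(cols) >= 8 and not line.startswith("d")
                and cols[4] == "0" and cols[-1].endswith(".gz")):
            paths.append(cols[-1])
    return paths


def empty_gz_files(hdfs_path):
    """Recursively list zero-length .gz files under an HDFS path."""
    out = subprocess.run(["hdfs", "dfs", "-ls", "-R", hdfs_path],
                         check=True, capture_output=True, text=True).stdout
    return parse_empty_gz(out)
```

The returned paths could then be reviewed and removed with `hdfs dfs -rm` before re-refining, which matches the "dropped the empty files" remedy used last time.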
[17:22:52] i don't see anything weird in gobblin logs
[17:22:57] last time I hadn't found the issue (looked at gobblin metrics)
[17:22:59] yeah
[17:23:02] 2022-06-23 13:12:13 UTC INFO [Commit-thread-0] org.wikimedia.gobblin.copy.BaseDataPublisher - Moving hdfs://analytics-hadoop/wmf/gobblin/task_working/eventlogging_legacy/job_eventlogging_legacy_1655989815701/task-output/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz to /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12/part.task_eventlogging_legacy_1655989815701_0_0.txt.gz
[17:23:05] that is a 0 size file
[17:23:51] that's weird :(
[17:24:31] i think there is data loss
[17:24:37] The other weird thing is that it is written the next hour
[17:24:48] oh? no that is normal
[17:24:57] the job runs at :10 after the hour
[17:25:18] so that 13:10 job run file should have all the 12:* events since the 12:10 job ran
[17:25:19] yes, in the /wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslationCTA/year=2022/month=06/day=23/hour=12 folder
[17:25:29] so basically all events between 12:10 and 12:59
[17:25:36] i think we are missing those
[17:25:41] yes indeed
[17:25:43] for a few other datasets too
[17:25:57] without explicit failures :(
[17:26:04] not good gobblin :(
[17:26:55] is weird?
[17:26:57] Jun 23 17:10:34 an-launcher1002 gobblin-eventlogging_legacy[9348]: 2022-06-23 17:10:34 UTC INFO [pool-18-thread-16] org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-17, groupId=kafka1] Resetting offset for partition eventlogging_ContentTranslationCTA-0 to offset 41638170.
[17:26:58] Jun 23 17:10:34 an-launcher1002 gobblin-eventlogging_legacy[9348]: 2022-06-23 17:10:34 UTC INFO [pool-18-thread-16] org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-17, groupId=kafka1] Resetting offset for partition eventlogging_ContentTranslationCTA-0 to offset 41917970.
[17:27:05] they happen basically right after each other
[17:27:38] That's weird :(
[17:27:54] the hour of the log is weird as well
[17:28:07] OH oops
[17:28:10] sorry that is a bad grep
[17:28:13] let me find the right one
[17:28:27] Jun 23 13:10:35 an-launcher1002 gobblin-eventlogging_legacy[17477]: 2022-06-23 13:10:35 UTC INFO [pool-18-thread-6] org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-7, groupId=kafka1] Resetting offset for partition eventlogging_ContentTranslationCTA-0 to offset 41638170.
[17:28:27] Jun 23 13:10:35 an-launcher1002 gobblin-eventlogging_legacy[17477]: 2022-06-23 13:10:35 UTC INFO [pool-18-thread-6] org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-7, groupId=kafka1] Resetting offset for partition eventlogging_ContentTranslationCTA-0 to offset 41911482.
[17:28:37] actually, that first offset is the earliest offset
[17:28:44] i dunno why it has to do that
[17:28:47] And with the same pattern you found, I actually realize that I missed the dataloss bit in my previous analysis
[17:28:59] 41911482
[17:29:03] is 13:10
[17:29:14] which means it looks like it is skipping 12:10 -> 13:10
[17:29:54] The task starts at 13:10, so it's expected - you should look at logs from the previous hour - we'd have a number of expected rows
[17:30:12] hm - let's batcave for a minute?
[17:30:22] ya
[17:30:29] actually
[17:30:34] let's do a slack huddle, screen share works better
[17:30:46] for me anyway
[17:47:18] 10Data-Engineering, 10Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Ottomata)
[18:47:28] PROBLEM - Check systemd state on an-worker1140 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:50:32] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:53:27] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:54:08] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:31] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:03:54] 10Data-Engineering, 10Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Ottomata) Okay, in this gobblin task attempt, I found the following logs https://yarn.wikimedia.org/jobhistory/logs/an-worker1135.eqiad.wmnet:8041/container_e44_1655808530211_1...
[19:15:05] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:36] 10Data-Engineering, 10Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Ottomata) BTW, it looks like all eventlogging analytics gobblin job logs fail to extract a timestamp, which causes a warn message to be output to the application logs for every...
[20:03:44] 10Data-Engineering, 10Airflow: [Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style) - https://phabricator.wikimedia.org/T301568 (10JArguello-WMF)
[20:04:15] 10Data-Engineering, 10Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Ottomata) Oh, but `Caught exception. Operation will be retried. Attempt #1` should not be logged if attempt # < RETRY_MAX_ATTEMPTS. If it was using the default, maxAttempts wo...
[20:14:31] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[20:36:58] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Services, 10Platform Engineering, and 2 others: Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943 (10JArguello-WMF) Hi @EChetty ! This was assigned to Razzi. Nobody is working on th...
[20:38:21] 10Data-Engineering, 10Data-Engineering-Kanban: Gobblin dataloss during namenode failure - https://phabricator.wikimedia.org/T311263 (10Ottomata) Am looking in Gobblin source for where our `SimpleStringWriter#write()` method is called from, but I'm not familiar enough with the code to find it! There are other...
[20:38:53] 10Data-Engineering, 10Data-Services, 10Platform Engineering, 10Patch-For-Review, 10cloud-services-team (Kanban): Log_param is redacted in wiki replica when only comment and/or user should be - https://phabricator.wikimedia.org/T301943 (10JArguello-WMF)
[20:55:35] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10wmfdata-python: conda-create-stacked breaks wmfdata.presto - https://phabricator.wikimedia.org/T301734 (10Ottomata) 05Open→03Declined Okay, we are aiming to one day replace anaconda-wmf with a simpler (and not stackable) base env, an...
[21:02:41] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics: Improvements to mediawiki_geoeditors_monthly dimensions - https://phabricator.wikimedia.org/T302079 (10JArguello-WMF) Hi @JAllemandou ! I'm moving this task back to DE Workboard since we depend on Airflow migration (either druid or geoeditor...
[21:02:53] 10Data-Engineering, 10Product-Analytics: Improvements to mediawiki_geoeditors_monthly dimensions - https://phabricator.wikimedia.org/T302079 (10JArguello-WMF)
[21:07:26] 10Data-Engineering: Modify HiveToDruid Job - https://phabricator.wikimedia.org/T302514 (10JArguello-WMF)
[21:16:25] 10Data-Engineering, 10Airflow: Variabilization of existing jobs - https://phabricator.wikimedia.org/T303473 (10JArguello-WMF)
[21:24:05] 10Data-Engineering, 10Platform Engineering, 10Product-Analytics: AQS `edited-pages/new` metric does not make clear that the value is net of deletions - https://phabricator.wikimedia.org/T240860 (10JArguello-WMF)
[21:30:43] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans)
[21:32:09] 10Data-Engineering-Radar, 10Cassandra, 10Generated Data Platform: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 (10Eevans) a:03Eevans
[21:35:12] 10Data-Engineering, 10Data-Engineering-Kanban, 10MediaWiki-extensions-EventLogging, 10Patch-For-Review: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (10JArguello-WMF) @Ottomata Looks like this one is resolved. Should I change the status to resolved?
[21:46:18] 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform: Remove StreamConfig::INTERNAL_SETTINGS logic from EventStreamConfig and do it in EventLogging client instead - https://phabricator.wikimedia.org/T286344 (10JArguello-WMF)
[21:55:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10Superset: Error when updating dashboard - https://phabricator.wikimedia.org/T308441 (10JArguello-WMF) 05Open→03Resolved a:03JArguello-WMF Emojis have been removed and that solved the issue.
[21:55:02] 10Data-Engineering, 10Data-Engineering-Kanban, 10Product-Analytics, 10Superset, 10Patch-For-Review: Upgrade Superset to 1.4.2 - https://phabricator.wikimedia.org/T304972 (10JArguello-WMF)
[22:02:22] 10Data-Engineering, 10Data-Engineering-Kanban, 10MediaWiki-extensions-EventLogging, 10Patch-For-Review: Generate $wgEventLoggingSchemas from $wgEventStreams - https://phabricator.wikimedia.org/T303602 (10Ottomata) Code has not been merged or deployed, but looks ready to be. @phuedx ?
[22:52:50] 10Analytics, 10API Platform: Establish testing procedure for Druid-based endpoints - https://phabricator.wikimedia.org/T311190 (10BPirkle)
[23:03:55] 10Data-Engineering, 10API Platform, 10Platform Engineering Roadmap, 10User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (10BPirkle)