[07:11:52] good morning!
[07:12:14] So I waited a bit in Hue for the log panel and I finally got
[07:12:15] org.apache.oozie.executor.jpa.JPAExecutorException: E0603: SQL error in operation, The last packet successfully received from the server was 39,601,801 milliseconds ago. The last packet sent successfully to the server was 39,601,801 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use
[07:12:22] in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
[07:12:35] this is why the text coordinator stopped
[07:13:33] !log umount/mount /mnt/hdfs on an-coord1001 to pick up java upgrades
[07:13:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:14:06] oozie itself doesn't seem to be in a weird state
[07:17:59] !log kill webrequest bundle, text coordinator failed (logs/info/etc.. https://hue.wikimedia.org/hue/jobbrowser/#!id=0024621-210701181527401-oozie-oozi-B)
[07:18:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:19:02] !log launch webrequest bundle from 2022-01-16T01:00 (first hour missing for text) - 0003712-220113112502223-oozie-oozi-B
[07:19:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:20:21] there will be some hours for upload to be recomputed I know, but in the interest of time I thought that it was good to start backfilling as early as possible
[07:20:28] in case it was the wrong move apologies :(
[07:20:46] (I could have done it yesterday)
[07:32:00] Good morning!
[07:32:04] (the alternative could be to stop the bundle, run only the text coordinator for a bit to match upload, and then restart the bundle in theory)
[07:32:08] bonjour joal :)
[07:32:10] Hi elukey
[07:32:17] if I did something wrong please don't be mad
[07:32:49] Thanks a lot for having an eye on our stuff!
[07:33:03] I'm trying to understand the flow of things
[07:33:19] I left the links to bundles etc.. for you to take a look
[07:33:34] my understanding is that oozie lost connection with the mysql db for some reason
[07:33:41] and the text coordinator failed
[07:33:46] but not the upload one
[07:33:52] ok
[07:33:55] (that kept refining)
[07:34:20] the error msg seems to suggest to use autoReconnect or similar (I guess on the jdbc oozie setting?)
[07:34:38] Oh wow - webrequest-text got completely stuck?
[07:34:40] Meh
[07:34:46] yeah since 2022-01-16T01
[07:34:51] this is the first time I see this
[07:35:01] me too, I don't recall it
[07:35:13] full rerun is fine elukey - thanks for taking actions!
[07:35:27] <3
[07:37:03] I'm gonna babysit them, just in case
[09:54:01] Thanks both. I've just caught up on the scrollback for this. Anything I can do?
[09:56:41] np! I think that we should just wait some hours for the webrequest backlog to complete
[10:00:56] Quarry, Epic, cloud-services-team (Kanban): Productionize quarry a bit - https://phabricator.wikimedia.org/T288982 (Aklapper) This task is still listed under "FY2021/2022-Q1"; removing as that's in the past.
[10:01:08] Quarry, cloud-services-team (Kanban): Switch to using prefix puppet instead of direct-on-instance puppet - https://phabricator.wikimedia.org/T289531 (Aklapper) This task is still listed under "FY2021/2022-Q1"; removing as that's in the past.
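The error at 07:12:15 lists "testing connection validity before use" as one way to avoid stale connections after the server's 'wait_timeout' expires. The sketch below illustrates only that general idea. It is not Oozie's or Connector/J's implementation: the `ValidatingConnection` class, its parameters, and the use of `sqlite3` as a stand-in for MySQL are all assumptions made for illustration.

```python
import sqlite3
import time


class ValidatingConnection:
    """Hypothetical wrapper that checks a cached connection before reuse."""

    def __init__(self, connect, max_idle_seconds, clock=time.monotonic):
        self._connect = connect            # factory that opens a new connection
        self._max_idle = max_idle_seconds  # expire connections idle longer than this
        self._clock = clock                # injectable clock, handy for testing
        self._conn = None
        self._last_used = 0.0
        self.opens = 0                     # how many times a connection was opened

    def _is_alive(self):
        # Cheap validation query; a dead connection raises sqlite3.Error.
        try:
            self._conn.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False

    def get(self):
        now = self._clock()
        expired = (
            self._conn is not None
            and now - self._last_used > self._max_idle
        )
        if self._conn is None or expired or not self._is_alive():
            if self._conn is not None:
                try:
                    self._conn.close()
                except sqlite3.Error:
                    pass
            # Open a fresh connection instead of handing out a stale one.
            self._conn = self._connect()
            self.opens += 1
        self._last_used = now
        return self._conn
```

With a real MySQL setup the same effect is usually obtained through connection-pool settings rather than hand-written code; which setting applies to the Oozie JDBC configuration is not stated in this log.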
[10:02:08] RECOVERY - Check unit status of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:07:49] yesss
[10:15:02] PROBLEM - Check unit status of check_webrequest_partitions on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:20:15] nope
[10:20:17] :D
[10:43:42] upload has caught up, webrequest still has some hours to backfill :)
[10:45:58] I also confirm that pageview-actor is following - slowly but surely :)
[11:07:13] it is nice that we can recover a day of webrequest text so quickly
[11:08:04] (text is already processing today's hours)
[11:09:43] elukey: we could even make it faster if we re-cluster webrequest using a higher value - this is the bottleneck of speed-processing
[11:14:54] joal: what do you mean by "re-cluster"? (curious)
[11:15:42] elukey: the webrequest table is clustered (https://github.com/wikimedia/analytics-refinery/blob/master/hive/webrequest/create_webrequest_table.hql#L71)
[11:16:35] elukey: this means that data is stored in 256 files, shuffled by (hostname,sequence) - This is a very nice trick to allow for fast sampling if needed (you can access single clusters)
[11:17:05] elukey: Now this forces data to be written by 256 reducers, no more :)
[11:20:38] ahhhhhhhh
[11:20:47] yes yes thanks for the explanation
[12:27:01] I need to restart the an-airflow100[1-3] today, if possible.
[12:27:02] I know that an-airflow1001 is used by the search team. an-airflow1002 is used by the research team, and an-airflow1003 is used by the platform engineering team.
[12:27:28] Any insights on the best people to ping in each team for permission to reboot them?
[12:40:39] btullis: I'd suggest fkaelin for the research team, dcausse or ebernhardson for the search team, and hnowlan and clarakosi for the platform team
[12:40:58] Great, thanks.
[14:45:32] btullis, thumbs up from research for rebooting
[14:48:32] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Kanbanana-FY-2021-22): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (ovasileva)
[14:50:59] Data-Engineering, Generated Data Platform, Platform Engineering, SRE: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I've imported 3.11 for buster and stretch, enjoy :-)
[14:51:34] fab: Many thanks. It's back up and running now.
[14:55:18] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (odimitrijevic) p:Triage→High
[15:59:15] Data-Engineering, Project-Admins: Create a workboard for Data-Catalog component - https://phabricator.wikimedia.org/T299357 (odimitrijevic)
[16:05:13] Analytics, Wikidata, Wikidata Analytics: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier) - https://phabricator.wikimedia.org/T299358 (Lucas_Werkmeister_WMDE)
[16:08:55] Analytics, Wikidata, Wikidata Analytics: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier) - https://phabricator.wikimedia.org/T299358 (Lucas_Werkmeister_WMDE) The requests prior to 2021-05-05 came from a variety of user agents, so I’m willing to rule out th...
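The clustering described at 11:16:35 (rows distributed into 256 files by (hostname,sequence), with single clusters readable on their own for sampling) can be sketched as follows. This is a minimal illustration, not Hive's implementation: `bucket_for`, `write_clustered`, and `sample_one_bucket` are hypothetical names, and `zlib.crc32` is a stand-in for whatever hash Hive actually applies to the clustering columns.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 256  # the bucket count mentioned in the chat


def bucket_for(hostname, sequence, num_buckets=NUM_BUCKETS):
    """Map a (hostname, sequence) pair to a bucket id in [0, num_buckets)."""
    # Assumption: CRC32 stands in for the real hash function.
    key = f"{hostname}\x00{sequence}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def write_clustered(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket id; each bucket corresponds to one output file."""
    buckets = defaultdict(list)
    for row in rows:
        bucket_id = bucket_for(row["hostname"], row["sequence"], num_buckets)
        buckets[bucket_id].append(row)
    return buckets


def sample_one_bucket(buckets, bucket_id):
    """Read a single bucket, roughly a 1/num_buckets sample of the data."""
    return buckets.get(bucket_id, [])
```

Because each bucket is written by one writer, the number of buckets also caps write parallelism, which matches the remark at 11:17:05 that the data is written by 256 reducers and no more.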
[16:12:04] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, SRE Observability, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (lmata)
[16:12:16] Analytics, Data-Engineering, Event-Platform, SRE, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (lmata)
[16:12:49] Analytics, Wikidata, Wikidata Analytics: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier) - https://phabricator.wikimedia.org/T299358 (Lucas_Werkmeister_WMDE) @Michael speculates that there might be multiple servers rsyncing their access logs to stat1007 an...
[16:49:40] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Evaluate Atlas - https://phabricator.wikimedia.org/T299165 (odimitrijevic) p:Triage→High
[16:51:35] Data-Engineering, Data-Engineering-Kanban, Airflow: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (odimitrijevic)
[16:51:37] Data-Engineering, Airflow, Product-Analytics, Epic: Replace Oozie with better workflow scheduler - https://phabricator.wikimedia.org/T271429 (odimitrijevic)
[16:53:41] Data-Engineering, Data-Engineering-Kanban, Airflow: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (odimitrijevic) Please create subtasks for specific jobs that are migrated. Please add additional columns risk & complexity assessed to the following spreadsheet so that we c...
[16:54:22] Data-Engineering, Data-Engineering-Kanban, Airflow: Migrate Oozie jobs to Airflow - https://phabricator.wikimedia.org/T299074 (odimitrijevic) p:Triage→High
[16:56:11] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Structured-Data-Backlog, and 3 others: Write an Airflow job converting commons structured data dump to Hive - https://phabricator.wikimedia.org/T299059 (odimitrijevic) p:Triage→High
[16:56:30] Data-Engineering, Project-Admins: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) @odimitrijevic: Regarding Analytics* herald rules, I could only find * H126: When project tags include any of { #Pageviews-API, #Analytics-EventLogging, #Analytics-Visualization, #Analytics-Das...
[17:01:46] Data-Engineering, Airflow: [Airflow] Troubleshoot MySQL connection issues - https://phabricator.wikimedia.org/T298893 (odimitrijevic) This leads to a conversation about automatic restarts. Specifically with sql connections - oozie jobs seem to have failed due to a similar issue today. The ROI in doing...
[17:02:50] Data-Engineering, Wikipedia-iOS-App-Backlog, iOS-app-feature-Analytics, Product-Analytics (Kanban): MobileWikiAppiOSUserHistory sending incompatible data - https://phabricator.wikimedia.org/T298721 (odimitrijevic) Open→Resolved a:odimitrijevic Based on Conversation in slack I think th...
[17:04:29] Data-Engineering, Superset: Create a custom Superset Role to allow for non default permissioning - https://phabricator.wikimedia.org/T298714 (odimitrijevic) p:Triage→Medium
[17:12:40] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform, Image-Suggestion-API, Image-Suggestions: Update HiveToCassandra for variable substitution and HQL from files loading - https://phabricator.wikimedia.org/T297934 (odimitrijevic) p:Triage→High
[17:14:11] Analytics, Data-Engineering, Event-Platform, SRE, Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (odimitrijevic) @Ottomata is this complete?
[17:17:21] Data-Engineering, Data-Engineering-Kanban, Generated Data Platform, Image-Suggestion-API, Image-Suggestions: Update HiveToCassandra for variable substitution and HQL from files loading - https://phabricator.wikimedia.org/T297934 (JAllemandou)
[17:19:54] Analytics, Data-Engineering, Event-Platform: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (odimitrijevic) Do we have dashboards per event stream in logstash?
[17:22:24] Data-Engineering, Data-Engineering-Kanban, SRE-swift-storage: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (odimitrijevic) p:Triage→High @Ottomata is this complete? Should we add documentation on wikitech?
[17:28:39] Data-Engineering, Infrastructure-Foundations: Netflow data pipeline - https://phabricator.wikimedia.org/T257554 (odimitrijevic) Open→Resolved a:odimitrijevic
[20:48:09] Data-Engineering, Project-Admins: Create a workboard for Data-Catalog component - https://phabricator.wikimedia.org/T299357 (Peachey88) @odimitrijevic I've added you to #trusted-contributors, Try adding/editing the work board now.
[23:12:23] Analytics-Radar, Observability-Logging, SRE, Wikimedia-Logstash, and 2 others: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (lmata)
[23:12:48] Analytics, Data-Engineering, Event-Platform, Observability-Logging, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (lmata)
[23:25:51] Analytics-Radar, Instrument-ClientError, Observability-Logging, Wikimedia-Logstash, and 3 others: Documentation of client side error logging capabilities on mediawiki - https://phabricator.wikimedia.org/T248884 (lmata)