[00:52:17] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) Findings thus far for `XXX:_Return...
[11:29:30] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (EChetty)
[11:31:28] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto clients to Bullseye - https://phabricator.wikimedia.org/T329361 (EChetty)
[11:32:22] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Zookeeper clients to Bullseye - https://phabricator.wikimedia.org/T329362 (EChetty)
[11:33:44] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade hadoop test clients to Bullseye - https://phabricator.wikimedia.org/T329363 (EChetty)
[11:45:23] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (BTullis)
[11:54:57] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (BTullis)
[11:59:56] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade the Data Engineering infrastructure to Debian Bullseye - https://phabricator.wikimedia.org/T288804 (BTullis)
[12:20:58] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[12:21:12] Data-Engineering-Planning, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 08): Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[12:29:34] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade the Data Engineering team's Zookeeper servers to Bullseye - https://phabricator.wikimedia.org/T329362 (BTullis)
[12:38:07] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08), MW-1.40-notes (1.40.0-wmf.23; 2023-02-13): mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 (EChetty)
[12:38:09] Data-Engineering-Planning, MediaWiki-extensions-EventLogging: Determine what UserBucketService::getUserEditCountBucket should return for anons - https://phabricator.wikimedia.org/T329292 (EChetty)
[12:38:11] Data-Engineering-Planning: Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (EChetty)
[12:38:13] Data-Engineering-Planning, Product-Analytics: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (EChetty)
[12:38:15] Data-Engineering-Planning, Equity-Landscape: Access output metrics - https://phabricator.wikimedia.org/T329185 (EChetty)
[12:38:18] Data-Engineering-Planning, Event-Platform Value Stream: Automated event stream throughput alerting for important state change streams - https://phabricator.wikimedia.org/T329070 (EChetty)
[12:38:20] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 08): Remove hardcoded kafka parameters - https://phabricator.wikimedia.org/T329061 (EChetty)
[12:38:22] Data-Engineering-Planning, Event-Platform Value Stream: Flink Enrichment monitoring - https://phabricator.wikimedia.org/T328925 (EChetty)
[12:38:24] Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (EChetty)
[12:38:28] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operation] How to handle app upgrades - https://phabricator.wikimedia.org/T328569 (EChetty)
[12:38:32] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 09): [Flink Operations] How to handle restarting a Flink application - https://phabricator.wikimedia.org/T328563 (EChetty)
[12:38:36] Data-Engineering-Planning, Event-Platform Value Stream: [Flink Operations] Automate Replay of Failed Events - https://phabricator.wikimedia.org/T328565 (EChetty)
[12:38:40] Data-Engineering-Planning, Event-Platform Value Stream: [NEEDS GROOMING] Fix eventutilities-python linting - https://phabricator.wikimedia.org/T328547 (EChetty)
[12:38:44] Data-Engineering-Planning, Event-Platform Value Stream, Epic: Flink Operations - https://phabricator.wikimedia.org/T328561 (EChetty)
[12:38:48] Data-Engineering-Planning, Event-Platform Value Stream: Refactor parameterization of eventutilities-python and mediawiki-event-enrichment - https://phabricator.wikimedia.org/T328478 (EChetty)
[12:38:52] Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream: Introduce EventBusSendUpdate - https://phabricator.wikimedia.org/T292123 (EChetty)
[12:38:58] Analytics-Radar, Data-Engineering-Planning, Pageviews-API, RESTBase-API, and 2 others: views error in mostread feed - https://phabricator.wikimedia.org/T267624 (EChetty)
[12:39:10] Analytics-Radar, Data-Engineering-Planning, ChangeProp, Event-Platform Value Stream, and 6 others: Run EventBus tests in MediaWiki core CI - https://phabricator.wikimedia.org/T257583 (EChetty)
[12:39:14] Analytics-Radar, Analytics-Wikistats, Data-Engineering-Planning: Negative total number of bytes for German Wikipedia in 2001? - https://phabricator.wikimedia.org/T203906 (EChetty)
[12:39:20] Analytics-Radar, Data-Engineering-Planning, Editing-team, MediaWiki-extensions-EventLogging, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (EChetty)
[12:39:26] Analytics-Wikistats, Data-Engineering-Planning: String found too soon, while searching for ' Analytics-Radar, Data-Engineering-Planning, GLAM-Tech, Pageviews-API: WMF pageview API (404 error) when requesting statistics over around 1000 files on GLAMorgan - https://phabricator.wikimedia.org/T145197 (EChetty)
[12:39:34] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Metrics-Platform-Planning: Source geolocation directly rather than using IP in schema - https://phabricator.wikimedia.org/T290014 (EChetty)
[12:39:38] Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream: mw.user.generateRandomSessionId should return a UUID - https://phabricator.wikimedia.org/T266813 (EChetty)
[12:39:42] Analytics-Radar, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Duplicated revision_create events - https://phabricator.wikimedia.org/T262203 (EChetty)
[12:39:50] Data-Engineering-Planning, DBA, Data-Persistence, Discovery-Search, and 9 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (EChetty)
[12:40:02] Analytics, Data-Engineering-Planning, MediaWiki-extensions-EventLogging, Research-Backlog: 20K events by a single user in the span of 20 mins - https://phabricator.wikimedia.org/T202539 (EChetty)
[12:40:08] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Metrics-Platform-Planning: Document in-schema who sets which fields - https://phabricator.wikimedia.org/T253392 (EChetty)
[12:40:12] Analytics, Data-Engineering-Planning, Event-Platform Value Stream: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (EChetty)
[12:40:16] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Platform Team Workboards (Clinic Duty Team): Add expiry info to mediawiki.page-restrictions-change stream - https://phabricator.wikimedia.org/T282057 (EChetty)
[12:40:20] Analytics-Radar, Data-Engineering-Planning, MediaWiki-extensions-EventLogging, Product-Infrastructure-Team-Backlog-Deprecated, Epic: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380 (EChetty)
[12:40:28] Analytics, Data-Engineering-Planning, MediaWiki-Action-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (EChetty)
[12:40:36] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Release-Engineering-Team (Radar): Stop using puppet + git pull for auto deployment of schema repos - https://phabricator.wikimedia.org/T274901 (EChetty)
[12:40:40] Analytics, Data-Engineering-Planning, Pageviews-API: Filter top pages by namespace/category - https://phabricator.wikimedia.org/T182975 (EChetty)
[12:40:44] Analytics, Data-Engineering-Planning, Pageviews-API: Yearly endpoint for the /pageviews/top API - https://phabricator.wikimedia.org/T154381 (EChetty)
[12:40:48] Analytics, Data-Engineering-Planning, Pageviews-API: Adding top counts for wiki projects (ex: WikiProject:Medicine) to pageview API - https://phabricator.wikimedia.org/T141010 (EChetty)
[12:40:54] Analytics, Data-Engineering-Planning, Pageviews-API, RESTBase-API: Pageviews Data : removes 1000 limit in the most viewed articles for a given project and timespan API - https://phabricator.wikimedia.org/T153081 (EChetty)
[12:40:58] Analytics, Analytics-Wikistats, Data-Engineering-Planning: Determine total number of external links in all Wikipedias - https://phabricator.wikimedia.org/T137984 (EChetty)
[12:41:04] Analytics-Radar, Analytics-Wikistats, Data-Engineering-Planning, Machine-Learning-Team, ORES: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479 (EChetty)
[12:41:12] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (EChetty)
[12:41:16] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Stats clients to bullseye - https://phabricator.wikimedia.org/T329360 (EChetty)
[12:41:20] Data-Engineering-Planning, Shared-Data-Infrastructure: Upgrade Presto servers to Bullseye - https://phabricator.wikimedia.org/T329361 (EChetty)
[12:41:24] Data-Engineering-Planning, Data Pipelines: Deprecate old mobile datasets - https://phabricator.wikimedia.org/T329310 (EChetty)
[12:41:36] Data-Engineering, MediaWiki-extensions-EventLogging, Metrics-Platform-Planning: Determine what UserBucketService::getUserEditCountBucket should return for anons - https://phabricator.wikimedia.org/T329292 (EChetty)
[12:42:01] Data-Engineering-Planning, Data Pipelines: Review Superset permissions and assign roles as appropriate - https://phabricator.wikimedia.org/T328457 (EChetty)
[12:43:57] Analytics-Radar, Data-Engineering-Planning, Data Pipelines, Editing-team, and 4 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (EChetty)
[12:44:11] Analytics-Radar, Analytics-Wikistats, Data-Engineering-Planning, Data Pipelines, and 2 others: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479 (EChetty)
[12:45:02] Data-Engineering-Planning, Data Pipelines, Equity-Landscape: Access output metrics - https://phabricator.wikimedia.org/T329185 (EChetty)
[12:45:06] Data-Engineering-Planning, Data Pipelines, Product-Analytics: 13 new wikis missing from mediawiki_history - https://phabricator.wikimedia.org/T329119 (EChetty)
[12:46:21] Data-Engineering-Planning, DBA, Data-Persistence, Discovery-Search, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (EChetty)
[12:48:04] Analytics-Wikistats, Data-Engineering: String found too soon, while searching for ' Hi btullis or steve_munene - Would one of you merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/887786/ today?
[14:00:47] Analytics, Data-Engineering-Planning, Event-Platform Value Stream, Release-Engineering-Team (Radar): Stop using puppet + git pull for auto deployment of schema repos - https://phabricator.wikimedia.org/T274901 (awight) I would suggest going in a slightly different direction than described in the...
[14:13:38] Hi joal, I can merge it now
[14:14:01] Hi nfraison - I had forgotten you could do that :) Would you mind going for it please?
[14:15:32] done
[14:16:59] Thanks nfraison - I'll have a second patch deleting the absented code later in the das
[14:17:04] day sorry
[15:06:18] nfraison: for when you have a minute - https://gerrit.wikimedia.org/r/c/operations/puppet/+/888228/
[15:15:03] heya joal :] can I pick your brain on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/227 ?
[15:17:05] joal it is merged
[15:19:23] joal: here's a funny story about Spark... I think...
[15:20:06] for daily pageviews dump, I think, it takes all the data, shuffles it to 3 separate reducers, and then filters it on each reducer to get the output: https://yarn.wikimedia.org/proxy/application_1663082229270_878973/stages/
[15:20:36] I'm playing around with resource allocation to see if I can find something more optimal, right now it's 32 executors with 1 core at 16G, smaller stuff failed
[15:20:48] I'm surprised, I don't remember the Hive job being so bad
[15:20:54] and I'm scared about the monthly one
[15:21:59] so - my question is - what do you think about generating a pageview daily table that we purge after 35 days? It would speed up these dump generations. Also... the spider data is the biggest, wonder if we should just stop dumping spider dumps, I bet nobody ever downloads them
[15:37:28] I'm getting hive failures when trying to "create table A as select B" in hive. Almost certainly something simple that I'm doing wrong, but I thought I would still mention it here.
[15:38:09] The stack trace looks slightly like an application error... https://hue.wikimedia.org/hue/jobbrowser/#!id=task_1663082229270_878950_m_000000
[16:05:24] Data-Engineering, Project-Admins, PM: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (JArguello-WMF) Hi! The 218 open tasks should all go to #Data-Engineering-Icebox. The 116 open in #analytics-radar do not belong to Data Engineering, the team was keeping track of them, but was not...
[16:10:48] Slightly more concerning: I tried to work around it with "create table ... like" and then "insert select", and got an error "DIGEST-MD5: IO error acquiring password"
[16:16:29] Data-Engineering-Planning, CirrusSearch, Event-Platform Value Stream, Discovery-Search (Current work): EventRowTypeInfo should support schema evolution of rows seriliazed in flink application state - https://phabricator.wikimedia.org/T325273 (Gehel) Open→Resolved
[16:38:29] awight: a link to the queries you were trying would help here :)
[16:39:23] Data-Engineering-Planning, Event-Platform Value Stream: Replace refinery-source Guava caches by Caffeine - https://phabricator.wikimedia.org/T325266 (Antoine_Quhen) 1 more test to make sure all UDFs are yielding the same results with Spark+Caffeine: https://gist.github.com/aq/056c1248e906b184a11063682740...
[16:39:35] hey milimetric - let's talk about how we can make this work efficiently - I have some questions to be sure I completely understand the state of affairs
[16:49:28] Data-Engineering, Equity-Landscape: Access input metrics - https://phabricator.wikimedia.org/T324968 (JAnstee_WMF) @KCVelaga_WMF & @ntsako Signing off on the data QA for inputs!
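Editorial note on the Spark observation at [15:20:06]: the following is a toy, pure-Python sketch (not the actual dump job) of why filtering before the shuffle moves less data than shuffling everything and then filtering on each reducer. The row layout, the agent-type values, and the round-robin `shuffle` helper are assumptions made for illustration; only the figure of 3 reducers comes from the log.

```python
# Hypothetical sample rows: (agent_type, page, views). All values are made up.
ROWS = [
    ("user", "Index", 120),
    ("spider", "Index", 900),
    ("automated", "Index", 40),
    ("user", "Main_Page", 300),
    ("spider", "Main_Page", 700),
    ("user", "XXX", 15),
]

NUM_REDUCERS = 3  # the log mentions 3 separate reducers


def shuffle(rows, num_reducers=NUM_REDUCERS):
    """Distribute rows round-robin across reducers (a stand-in for a real shuffle)."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)
    return buckets


def shuffle_then_filter(rows, wanted):
    """Move every row to a reducer first, then filter on each reducer."""
    buckets = shuffle(rows)
    moved = sum(len(bucket) for bucket in buckets)
    kept = [row for bucket in buckets for row in bucket if row[0] == wanted]
    return moved, kept


def filter_then_shuffle(rows, wanted):
    """Filter first, so only matching rows are moved to reducers."""
    filtered = [row for row in rows if row[0] == wanted]
    buckets = shuffle(filtered)
    moved = sum(len(bucket) for bucket in buckets)
    kept = [row for bucket in buckets for row in bucket]
    return moved, kept


if __name__ == "__main__":
    late_moved, late_kept = shuffle_then_filter(ROWS, "user")
    early_moved, early_kept = filter_then_shuffle(ROWS, "user")
    print("rows moved, filter after shuffle:", late_moved)
    print("rows moved, filter before shuffle:", early_moved)
    print("same result:", sorted(late_kept) == sorted(early_kept))
```

With these sample rows, the late filter moves all 6 rows while the early filter moves 3, and both keep the same rows. A pre-filtered daily table, as floated at [15:21:59], would have a similar effect by shrinking the input before any shuffle happens.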
[16:50:18] Data-Engineering-Planning, Data Pipelines, Equity-Landscape: Access output metrics - https://phabricator.wikimedia.org/T329185 (JAnstee_WMF) @KCVelaga_WMF @ntsako Signing off on the data QA with the caveat that we need to relabel one change to the output labels FROM: access_presence_growth TO: access
[16:51:11] (PS9) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[16:53:07] joal: I've pasted a hue link to one of the failing query tasks above, this is the basic idea: https://phabricator.wikimedia.org/P44249
[16:54:05] yeah, nothing special - that's weird
[16:54:22] This was the attempted workaround: https://phabricator.wikimedia.org/P44250
[16:56:14] (PS10) Aqu: Remove Guava from dependency [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072)
[16:57:34] (sorry, the password thing may have just been a stale beeline session.)
[16:58:06] No rush, I can play around a bit more and then file a task if I still need help.
[17:00:08] awight: I can replicate
[17:01:36] Very helpful to know that I'm normal :-D
[17:04:33] :)
[17:05:04] (CR) Aqu: "* I've completed the comment about Maxmind<>Caffeine." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/883118 (https://phabricator.wikimedia.org/T327072) (owner: Aqu)
[17:06:29] Well, my workaround was successful this time, although it leaves me with a slightly awkward table having 5 partition dimensions but only one shard.
[17:14:35] ok glad to know the workaround worked awight :S
[17:18:41] Data-Engineering-Icebox: WikiStats should recognize global bots - https://phabricator.wikimedia.org/T37196 (odimitrijevic) Open→Declined
[17:47:24] Data-Engineering, Project-Admins, PM: Archive Analytics tag - https://phabricator.wikimedia.org/T298671 (Aklapper) Thanks for the reply! > The 116 open in #analytics-radar do not belong to Data Engineering, the team was keeping track of them, but was not the responsible party, therefore, they should...
[18:45:51] (CR) Chad: Drop vestiges of git-fat (1 comment) [analytics/hdfs-tools/deploy] - https://gerrit.wikimedia.org/r/887000 (https://phabricator.wikimedia.org/T328473) (owner: Chad)
[19:36:05] ooh, a bunch of airflow failures
[20:04:17] joal: A slightly minimized test query for that error I mentioned, trying to drop the partitions: > create table mapdata2 as select * from awight.mapdata where uri_path='/w/api.php' and uri_query rlike "mapdata" and year=2023 and month=2 and day=1 and hour=1 and webrequest_source='text';
[21:09:21] a-team, are there any SREs present? I'm stuck troubleshooting the Airflow and Oozie failures, and they are piling up :S
[21:19:25] Data-Engineering-Planning, Data Pipelines, Pageviews-Anomaly, Wikipedia-iOS-App-Backlog, and 6 others: Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage - https://phabricator.wikimedia.org/T328127 (SNowick_WMF) Findings thus far for `Index`:...
[21:21:15] !log restarted airflow@analytics.service in an-launcher1002
[21:21:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:45:11] I'll have a look too.
[21:46:44] everything green in Icinga for an-launcher1002
[21:49:20] ok, so skein is having trouble launching the spark driver it seems.
[21:49:22] btullis: the errors happen when skein tries to connect to yarn I believe
[21:49:55] the issues started about 16:30 UTC
[21:50:16] but nothing had changed then, no airflow deploys, no job restarts,
[21:50:43] I looked at grafana and saw a peak in NodeManager node allocation
[21:50:52] Yes I see. Nothing relevant in https://sal.toolforge.org/analytics
[21:50:53] but doesn't seem out of normal
[21:50:59] nope
[21:51:13] plus oozie jobs seem unaffected
[21:51:32] I restarted airflow@analytics.service, but no difference
[21:51:57] Are other yarn jobs ok, as far as we can see? It's just airflow jobs?
[21:52:10] yes, just airflow jobs it seems
[21:52:47] Everything is green on an-master1001 and an-master1002 as well.
[21:53:45] btullis: an-launcher1002 seems to have high CPU load, but don't think it's related
[21:54:56] mforns: Looking at /var/log/syslog it looks like some jobs are reporting success, but maybe this is just some stages.
[22:00:24] hm btullis I just saw that I deployed airflow yesterday at exactly the same time of the day the error started to happen today...
[22:01:17] It wouldn't make sense that it started failing 24 hours after though... bt
[22:01:36] OK, interesting. I've checked the yarn resource manager logs on an-master100 and there don't seem to be any errors.
[22:01:40] https://www.irccloud.com/pastebin/JX1H58E6/
[22:01:54] back now, catching up in case I can help
[22:03:14] Great. We're a bit short on ideas. Do you want to jump on a call? Best guess at the moment is some kind of bad deploy of airflow-dags yesterday, causing errors from skein when launching the driver.
[22:03:46] where yall at?
[22:04:40] btullis: I'm in the batcave, you all in a different call?
[22:05:14] milimetric: can you paste the link, please?
[22:05:22] https://meet.google.com/rxb-bjxn-nip
[22:25:56] !log stopping hadoop-yarn-resourcemanager service in an-master1001 to fail over automatically to an-master1002
[22:25:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:30:15] !log starting the hadoop-yarn-resourcemanager on an-master1001 and failing back to it.
[22:30:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:09:35] Data-Engineering: Puppetize Skein certificate generation - https://phabricator.wikimedia.org/T329398 (Milimetric)
[23:22:56] !log unpaused all airflow dags and cleared all failed tasks after the incident
[23:22:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
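Editorial note on the Hive workaround discussed from [16:10:48] to [20:04:17]: the actual queries live in the linked pastes and are not reproduced in this log, so the following is only a sketch of the general "create table ... like" plus "insert select" pattern, written as a small Python helper that builds the HiveQL statements. The helper name, the table names in the usage example, and the choice of partition columns (taken from the filter columns in the [20:04:17] query) are assumptions. Because `CREATE TABLE ... LIKE` copies the source table's definition, including its partitioning, the copy keeps every partition column, which would be consistent with the "5 partition dimensions but only one shard" remark at [17:06:29].

```python
def build_like_insert_workaround(source, target, partition_cols, where):
    """Build HiveQL for the two-step workaround: CREATE TABLE LIKE, then INSERT ... SELECT.

    Hypothetical helper for illustration only; it returns statements as strings
    and does not talk to Hive.
    """
    parts = ", ".join(partition_cols)
    return [
        # Dynamic partition inserts with no static partition value need nonstrict mode.
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        # LIKE copies the schema and partitioning of the source table, but no data.
        f"CREATE TABLE {target} LIKE {source}",
        # SELECT * returns partition columns last, as dynamic partitioning expects.
        f"INSERT INTO TABLE {target} PARTITION ({parts}) "
        f"SELECT * FROM {source} WHERE {where}",
    ]


if __name__ == "__main__":
    statements = build_like_insert_workaround(
        source="some_db.source_table",   # assumed name
        target="some_db.target_table",   # assumed name
        partition_cols=["webrequest_source", "year", "month", "day", "hour"],
        where="year=2023 AND month=2 AND day=1 AND hour=1",
    )
    for statement in statements:
        print(statement + ";")
```

The statements are printed rather than executed, so they can be reviewed and pasted into a fresh beeline session; per [16:57:34], a stale session may have been behind the DIGEST-MD5 error.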