[01:24:05] Data-Engineering, SRE-Access-Requests: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (bmansurov)
[07:14:16] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[07:16:58] hello folks
[07:17:35] an-worker1115.eqiad.wmnet's yarn nm seems in a weird state
[07:18:46] org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=35: Could not create local files and directories
[07:19:16] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[07:23:39] it started around 5:50AM, the same time of the log
[07:25:56] ah yeah and others
[07:25:57] org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
[07:27:00] user analytics-search, app-id: https://yarn.wikimedia.org/jobhistory/job/job_1637058075222_520927
[07:27:51] !log restart hadoop-yarn-nodemanager on an-worker1115 (container executor reached unrecoverable exception, doesn't talk with the Yarn RM anymore)
[07:27:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:29:13] back in service, very weird, I didn't find any clear reason, and IIRC I've never seen this error before
[08:24:51] weird elukey - the application was a regular hive query
[08:44:33] no idea
[09:53:15] elukey: That is odd. Thanks for looking into it.
[10:03:23] we've had in the past one-time issues like this one, they are weird but the hadoop node manager sometimes errors out with unexpected things :D (the issue might also be fixed in recent versions of hadoop etc..)
[10:05:23] Yeah, I thought I'd check `dmesg` to see if there were any kernel or hardware issues, but it seems that there's nothing. It's all in user space.
[10:29:58] (CR) STran: Basic ipinfo instrument setup (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[11:13:04] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Connect Atlas to a Data Source - https://phabricator.wikimedia.org/T298710 (BTullis) Open→Declined We can't easily do this with our evaluation setup, due to Hive incompatibility.
[11:13:06] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (BTullis)
[11:36:07] Data-Engineering-Kanban, Data-Catalog: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (EChetty)
[11:36:53] Data-Engineering, Data-Catalog: Data Catalog Feature Matrix [Mile Stone 1] - https://phabricator.wikimedia.org/T299887 (EChetty) a:EChetty
[11:37:49] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (EChetty)
[12:18:29] (CR) Phuedx: Basic ipinfo instrument setup (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/753548 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[13:54:31] folks the dse-k8s-worker nodes are in the eqiad dc :)
[13:54:32] https://phabricator.wikimedia.org/T291579
[13:57:22] NIIIICE
[13:57:24] Nice!
Thanks for the update elukey
[13:58:01] We are moving from devicemapper to overlay fs + bullseye on the other clusters, this new one should get all the latest new stuff
[13:58:06] elukey: are you worried that naming these nodes after a team might bite you in the future?
[13:58:13] org structures change about every 2 years
[13:59:10] ottomata: didn't care about the name, I asked around and dse-k8s-worker was the name that got the majority. We can call it in any way, but if we want to change we better do it soon otherwise dcops will kill us :D
[13:59:18] elukey: Great news. I've used devicemapper for docker in the past and it's horrible.
[13:59:31] btullis: nah it is not that bad, overlay is better for usre
[13:59:33] *sure
[13:59:41] heheh, you don't care about the name?!?!
[13:59:55] :)
[14:00:05] nope as long as we are all happy we can call it in any way :D
[14:00:08] you lookin forward to explaining 'dse' for the next 8 years once DS is gone?
[14:00:19] or going through a renaming effort :)
[14:00:19] ?
[14:00:37] not that i have a better suggestion yet
[14:00:43] :)
[14:00:49] who says that I'll be the one explaining the naming?? :P :P :P
[14:00:53] haha
[14:00:54] We could just call them by their IP addresses :-)
[14:01:07] elukey: should I try to come up with something? i can also butt out
[14:02:18] There are enough variations here: https://acronyms.thefreedictionary.com/DSE We could maybe pick one after a team rename :-)
[14:02:45] ottomata: jokes aside, this is a good time to change the name in case, before dcops starts the whole import procedure
[14:03:25] data-science-engineering seems a high level enough acronym that could last some years, even if we reorg (last famous words)
[14:03:42] that's true.
[14:03:49] hm
[14:04:38] would just a unique adjective of some kind be useful? it def is hard to find something that isn't a team name and isn't stupid like just 'data'
[14:04:51] e.g.
'jumbo' for that kafka cluster was good i think
[14:05:03] can always fall back to https://en.wikipedia.org/wiki/Dark_septate_endophyte as backronym
[14:06:09] haha 'mondo'
[14:06:40] mondo-k8s-worker? do you hate it? do I hate it?
[14:10:05] I have to say, I prefer dse. It's not like dse is actually an acronym of a single team anyway, it's an amalgam of three broad terms. I don't think that we're going to stop doing data, science, or engineering any time soon.
[14:11:27] btullis: elukey got a sec for a very quick hangout?
[14:11:33] sure
[14:11:45] bc is good!
[14:11:47] Yep.
[14:12:00] https://meet.google.com/rxb-bjxn-nip
[14:37:28] So that's agreed then. Foundation Analytics and Science Teams. The *fast* kubernetes cluster :-)
[14:37:53] Maybe not.
[14:47:48] ottomata hi! I'm getting an error when using the recently updated workflow_utils:
[14:47:54] https://www.irccloud.com/pastebin/DwRXuGfa/
[14:48:13] yeah
[14:48:15] i just pushed the fix
[14:48:19] sorry about that
[14:48:24] uninstall and reinstall
[14:48:32] mforns: ^
[14:48:57] ottomata: if I recreate the conda env will that be enough?
[14:51:19] hm, no you need to reinstall workflow utils
[14:51:27] i don't think i've deployed the new airflow deb with that fix yet
[14:51:38] just uninstall workflow utils and reinstall
[14:51:40] via pip
[14:51:41] in your conda env
[14:51:52] ok ok, thanks!
[14:51:58] pip install git+https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils.git@main
[16:10:36] Data-Engineering, Data-Engineering-Kanban, Airflow, Patch-For-Review: [Airflow] Set up scap deployment - https://phabricator.wikimedia.org/T295380 (Ottomata) TODO: set up research's deployment
[17:03:24] Data-Engineering, Data-Engineering-Kanban, Epic: Finish evaluation of "other" Data Governance Options - https://phabricator.wikimedia.org/T296672 (Milimetric)
[17:05:32] Data-Engineering, Data-Engineering-Kanban, Airflow: Low Risk Oozie Migration: wikidata_json_entity - https://phabricator.wikimedia.org/T300026 (Snwachukwu)
[17:23:12] a-team - train is soon - anything else than my patch for cassandra-loader to deploy?
[17:54:07] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf)
[17:54:10] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Conda's CPPFLAGS may not be correct when pip installing a package that needs c/cpp compilation - https://phabricator.wikimedia.org/T292699 (nshahquinn-wmf)
[17:54:21] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) Blocked on T292699.
[17:55:33] Analytics, Analytics-Kanban, Data-Engineering-Kanban, wmfdata-python, Product-Analytics (Kanban): wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (nshahquinn-wmf) a:nshahquinn-wmf→Milimetric
[19:04:25] Data-Engineering-Kanban, Data-Catalog: Add alert for varnishkafka low/zero messages per second to alertmanager - https://phabricator.wikimedia.org/T300246 (razzi) I don't think this has to do with the data catalog
[19:50:59] Data-Engineering, Product-Analytics: Add Product-Analytics Announcements to the oozie job for notifications - https://phabricator.wikimedia.org/T301281 (Mayakp.wiki)
[20:15:15] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (Mayakp.wiki) Thank you Data Engineering team for the report. I used [[ https://docs.google.com/spreadsheets...
[20:40:02] Analytics, Data-Engineering, Event-Platform, EventStreams, and 2 others: Expose rdf-streaming-updater.mutation content through EventStreams - https://phabricator.wikimedia.org/T294133 (RBrounley_WMF)
[21:35:04] Data-Engineering, Data-Engineering-Kanban: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (Isaac) @Mayakp.wiki thanks for doing this analysis! Question: it looks like these numbers are for 0.0423% d...