[10:15:11] RhinosF1: is that nginx acting as a client of varnish or as a backend?
[12:02:40] out of curiosity, how easy (or hard) is it nowadays to get assigned IPv4 addresses, e.g. in Europe? IPv6 ones seem to be quite cheap
[12:08:36] vgutierrez: we've got nginx on both cp and mw backends
[12:10:16] RhinosF1: nope, the cp hosts haven't run nginx for a few years
[12:10:46] vgutierrez: they do for me, for reasons; we probably forked the wikimedia code a few years ago
[12:10:52] and haven't looked at it since
[12:11:06] so if you want to trace it from the TLS termination layer to the backend, you need to generate some sort of unique ID per request on the TLS termination layer
[12:11:33] varnish will send the XID in the x-varnish header
[12:11:51] so logging both IDs will help correlate the requests
[12:12:05] when does the XID get set?
[12:13:18] the XID is set by varnish, so after your first nginx layer
[12:14:09] right, so when varnish passes a request to the backend, the nginx on the backend would see it?
[12:14:27] yes
[12:14:38] you can log the x-varnish header on that nginx
[12:15:49] that sounds useful
[12:49:06] vgutierrez: we have a couple of acme-chief related problems in Toolforge, are you available for some assistance?
[12:54:53] in particular T301117 and T288406
[12:54:53] T288406: acme-chief-cert-sync failing on tools-acme-chief-01 - https://phabricator.wikimedia.org/T288406
[12:54:53] T301117: toolsbeta acme-chief certificate has expired - https://phabricator.wikimedia.org/T301117
[17:23:03] <_joe_> razzi: apart from the neo4j licensing issues, we should avoid having multiple technologies for the same type of functionality if possible; I know the search team is looking into blazegraph alternatives, you should work with them (cc ottomata )
[17:23:24] <_joe_> oh and cc dcausse gehel ryankemper as well :)
[17:24:04] _joe_: agreed. btw what are the licensing issues? it looks like CE is on GPL v3, at least as of mid 2020
[17:24:06] Yeah, definitely, maintaining a new backend is quite an effort
[17:24:43] <_joe_> ryankemper: that it was heavily freemium, as in the community version didn't have things we absolutely needed
[17:24:44] neo4j CE is missing key features (replication and an efficient query engine)
[17:25:13] ack, yup, that's the classic open core model for ya
[17:25:29] * ryankemper is very bearish on open core as a general concept
[17:25:56] razzi: is there some context somewhere? Neo4j and Blazegraph are solving two very different problems (triple stores vs property graphs)
[17:26:35] I'm pretty sure my team would be happy to participate in any discussion around any kind of graph database!
[17:26:49] gehel: https://docs.google.com/document/d/1laiqY9Mj3ldAdEX9yhZQeUbYU-JSwdsMUfx961NCbRc/edit#heading=h.qu2wvm4nuplw
[17:27:12] > We're starting on a data catalog that will provide visibility into metadata foundation-wide. Our choice: datahub, open sourced by Linkedin and well-integrated with the Apache ecosystem
[17:27:15] Thanks ryankemper
[17:27:17] (datahub depends on neo4j)
[17:28:28] Yes, we're still writing up notes on the technical evaluation of data catalogs, but we're planning on going with datahub, which uses neo4j to store the graph of data lineage
[17:30:35] here's an example of lineage from the datahub demo: https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)/Schema?is_lineage_mode=true
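A minimal sketch of what a lineage graph like the one in the demo above might contain, using the "where is page revision data, and what derives from it?" example that comes up later in the log. The dataset names and edges are invented for illustration, and the hard-coded dict stands in for whatever backend (neo4j or elasticsearch) the catalog actually uses.

```python
from collections import deque

# Downstream edges: dataset -> datasets directly derived from it.
# All names here are made up for the example.
LINEAGE = {
    "mariadb.revision": ["events.mediawiki_revision_create",
                         "hive.wmf_raw.mediawiki_revision",
                         "dumps.pages_meta_history"],
    "events.mediawiki_revision_create": ["hive.wmf.mediawiki_history"],
    "hive.wmf_raw.mediawiki_revision": ["hive.wmf.mediawiki_history"],
    "hive.wmf.mediawiki_history": ["hive.wmf.edit_hourly"],
    "dumps.pages_meta_history": [],
    "hive.wmf.edit_hourly": [],
}

def downstream(dataset):
    """Everything derived, directly or indirectly, from `dataset` (breadth-first)."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

if __name__ == "__main__":
    # "where is page revision data, and what derives from it?"
    print(downstream("mariadb.revision"))
```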
[17:30:53] <_joe_> razzi: is there a phab task / design doc about all this?
[17:31:17] https://phabricator.wikimedia.org/T293643 is for the technical evaluation part
[17:31:22] <_joe_> oh I see, this uses neo4j to store the local flows
[17:31:31] razzi: they seem to support elasticsearch as a backend (https://datahubproject.io/docs/how/migrating-graph-service-implementation/)
[17:31:32] <_joe_> so the dataset shouldn't be huge?
[17:32:08] yep yep, dcausse, we might be able to skip neo4j entirely using that
[17:32:52] <_joe_> razzi: just to be sure, when you say "data" here you mean "analytics datasets", right?
[17:32:55] _joe_: indeed, the dataset should not be too big; records are metadata elements, so on the order of tens of thousands
[17:33:15] <_joe_> yeah ok, then scratch any worry I might have had myself :)
[17:33:30] <_joe_> I thought this was going to be a large-scale installation
[17:34:04] That still leaves open the question of how we host non-search indices if we go the elastic route. Our current story isn't great.
[17:35:26] analytics datasets to start, and any datasets derived from them. Eventually we'd like to have a catalog of all datasets at the foundation, even things like fundraising
[17:38:42] <_joe_> razzi: ok, when I hear "datasets" I think of the wiki databases or dumps :)
[17:43:24] _joe_: eventually we'd like to have the metadata from the wiki databases and the dumps, but all the data we'd store on them would be metadata like table names and column names for the wiki databases, and the names and locations of the dumps
[17:43:24] but we'll spend many months just ingesting the metadata from the analytics cluster
[17:44:27] <_joe_> yeah sorry, the "data as a service" name can be interpreted in a very different way if you think of the production application layer
[17:46:24] _joe_: ya, emphasizing what razzi said, analytics for now, hopefully more than analytics in the future, using learnings from the analytics data catalog
[17:46:57] e.g. it'd be nice if someone could say "where is page revision data, and what derives from it?"
[17:47:06] it originally comes from mediawiki / mariadb
[17:47:27] but it gets out via events / sqoops / wiki dumps, and is also stored in systems X and Y with these schemas
[17:47:37] and some downstream analytics is calculated from it in Z
[17:47:45] and this prod system consumes it from Y
[17:47:46] etc.
[17:48:01] <_joe_> ottomata: so a couple things
[17:48:16] <_joe_> 1) "learnings", tu quoque :D
[17:48:21] <_joe_> it's "lessons"
[17:49:00] <_joe_> 2) I don't see a usage for something like this in production, really, I'd like to hear more
[17:50:47] _joe_: event stream config is an example that we have in prod right now
[17:53:18] We tried to do an inventory of all production datasets
[17:53:53] I think legal said they were the ones who wanted to manage it
[18:19:56] I remember that, that was a while ago, right?
[18:20:07] yep
[18:20:16] this would be a metadata catalog, with much of what is in it being populated automatically
[18:20:21] instead of manually maintained
[18:20:29] ok
[18:20:47] for us we'd probably need something more coarse-grained - we wanted to track backups
[18:20:47] legal would probably be involved though, especially around governance and retention, etc.
[18:21:01] that's stuff we'd like to automate too, retention and sanitization policies
[18:21:04] yeah
[18:21:31] but data engineering is too complex for us to handle at this moment
[18:21:44] I mean, I think there's more data in analytics than in the other backups
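Going back to the request-tracing discussion at the top of the log (generate a unique ID per request at the TLS termination layer, log the x-varnish XID on the backend nginx): once both layers log a shared identifier, correlating the two sets of logs is a simple join. The tab-separated layouts below are assumptions made for this sketch, not anyone's actual nginx log format.

```python
import sys

# Assumed layouts (hypothetical, tab-separated):
#   edge log:    <request_id>\t<request line>                      (TLS termination layer)
#   backend log: <request_id>\t<x-varnish xid>\t<status>\t<request line>
def correlate(edge_path, backend_path):
    """Join edge and backend access logs on the edge-generated request ID."""
    edge = {}
    with open(edge_path) as f:
        for line in f:
            req_id, request = line.rstrip("\n").split("\t", 1)
            edge[req_id] = request
    with open(backend_path) as f:
        for line in f:
            req_id, xid, status, request = line.rstrip("\n").split("\t", 3)
            if req_id in edge:
                # One line per request seen at both layers; the XID is printed
                # too so it can be cross-checked against varnish's own logs.
                print(f"{req_id}\t{xid}\t{status}\t{edge[req_id]} -> {request}")

if __name__ == "__main__":
    correlate(sys.argv[1], sys.argv[2])
```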