Themes from the Subsurface Data Lake Conference
And coming to terms with never-ending cheesy aquatic metaphors
Losing sight of the larger trends shaping the data ecosystem is easy when you're down in the trenches of the daily grind. That's why I love attending conferences that serve as an inspirational footstool from which to take in the full landscape.
The first-ever Subsurface Cloud Data Lake Conference, held virtually in July, was a great opportunity for this: it featured an impressive lineup of speakers who thoughtfully contextualized where we are with data in 2020.
Talk #1: The Future is Open — The Rise of the Cloud Data Lake
Tomer Shiran, Co-founder & CPO, Dremio
The whole event was put together by Dremio and began with an awesome opening keynote by one of its founders, who gave an overview of how data architectures have evolved over the last 10 years.
The main idea is this: we’ve gone from proprietary, monolithic analytic architectures anchored by an expensive Oracle-licensed database or Hadoop cluster to architectures defined by flexible, increasingly open source technologies across the four main layers of a data stack: the storage layer, data layer, compute layer, and client layer.
The base of the whole stack — the storage layer — is supported by the crucial development of cloud technologies like S3 and ADLS that offer infinitely scalable, highly-available, globally distributed, easily-connected-to, and outrageously cheap cloud storage.
The ability to agnostically use these storage blobs to separate data storage from specialized compute engines (like Spark, Snowflake, and Athena) is the dominant architectural trend that nearly every talk mentioned.
So if nothing else, leave this article with that concept clear in your mind.
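To make the idea concrete, here is a minimal sketch of what that separation looks like in practice: the same Parquet data sitting in object storage can be queried by completely different compute engines without being copied or re-ingested. The bucket and paths are hypothetical, pandas needs s3fs/pyarrow installed to read from S3, and the Spark session assumes the cluster is configured with S3A credentials.

```python
# Sketch: one copy of the data in cloud storage, many interchangeable engines.

# Engine 1: a single-node pandas session (relies on s3fs/pyarrow under the hood)
import pandas as pd

df = pd.read_parquet("s3://my-data-lake/events/date=2020-07-01/")  # hypothetical path
print(df.head())

# Engine 2: a Spark cluster pointed at the exact same objects
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-compute-separation").getOrCreate()
events = spark.read.parquet("s3a://my-data-lake/events/")  # same data, different engine
events.groupBy("date").count().show()
```

Because the data lives in an open format on commodity object storage, you can add, swap, or retire engines without moving a single byte of the underlying data.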
Talk #2: Apache Arrow: A New Gold Standard for Dataset Transport
Wes McKinney, Director, Ursa Labs
Wes began by explaining that there’s a problem when you have a bunch of different systems handling data each with potentially its own storage and transport protocols. The problem is a “combinatorial explosion of pairwise data connectors” as he calls it, that manifests as costly-to-implement custom data connectors developers must create if they want to transport data efficiently in their data pipelines or applications.
This is one of a few issues highlighted as the inspiration for the Apache Arrow project. The others being:
Unnecessary CPU time spent serializing & de-serializing data
Expensive writes to disk/blob storage as an intermediary
Decreased performance due to executor node bottlenecks in distributed systems
And so Wes continued by explaining some of the technical concepts behind Arrow’s solution to these problems.
The end result is a mostly behind-the-scenes library that makes operations like converting between Spark and Parquet more efficient, and all of us more productive in the long run.
The image that became clear in my mind is of Arrow as an in-memory intermediary between systems, much the way a lot of folks use S3 (for lack of a better option) for that purpose today.
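Here is a rough sketch of that intermediary role using the pyarrow library. The DataFrame contents and file name are made up for illustration; the point is that once data is in Arrow's columnar in-memory format, any Arrow-aware system (pandas, Spark, Dremio, and so on) can consume it without a bespoke pairwise connector.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.99, 14.50, 3.25]})

# pandas -> Arrow: a columnar, language-agnostic in-memory representation
table = pa.Table.from_pandas(df)

# Arrow -> Parquet on disk (or object storage) for engines that read files
pq.write_table(table, "spend.parquet")

# Parquet -> Arrow -> pandas again, with no custom connector code in between
round_tripped = pq.read_table("spend.parquet").to_pandas()
print(round_tripped)
```

Spark, for instance, can use Arrow under the hood to speed up conversions to and from pandas when its Arrow support is enabled, which is exactly the kind of behind-the-scenes win Wes described.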
Talk #3: Functional Data Engineering: A Set of Best Practices
Maxime Beauchemin, CEO and Founder, Preset
Lastly, Maxime gave an interesting talk on how functional programming principles can be applied to the data engineering discipline to create reliable data pipelines.
The three principles are:
1. Pure Functions — Same input = Same Output
2. Immutability — Never changing the value of variables once assigned
3. Idempotency — The ability to repeat an operation without changing the result
Taken in a data engineering context, Beauchemin recommends writing ETL tasks that are "pure": given the same input data, they output the same data partition, which you can INSERT OVERWRITE into your data lake (an idempotent operation, compared to a mutable UPSERT).
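A hedged sketch of what such a task can look like in PySpark follows. The table names and paths are invented, and this is my own illustration rather than code from the talk, but it captures the idea: the task's only input is an execution date, it always produces the same output partition from the same input, and it overwrites that partition in place, so re-running it never accumulates duplicates.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-spend").getOrCreate()

def build_daily_spend(ds: str) -> None:
    """Recompute the spend partition for a single day (ds = 'YYYY-MM-DD')."""
    # Pure: the output depends only on the raw data for this date
    raw = spark.read.parquet(f"s3a://my-data-lake/raw/orders/date={ds}/")

    daily_spend = (
        raw.groupBy("user_id")
           .agg(F.sum("amount").alias("total_spend"))
           .withColumn("date", F.lit(ds))
    )

    # Idempotent: overwrite exactly this partition, never append or upsert
    daily_spend.write.mode("overwrite").parquet(
        f"s3a://my-data-lake/marts/daily_spend/date={ds}/"
    )

build_daily_spend("2020-07-01")
```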
Without going into too much detail, I'm inspired to leverage these concepts to add structure to the way I design my ETL tasks, rather than letting them become a tangled mess of logic whose output I can barely reason about.
Thank you for reading, and look out for a recap of the equally interesting Future Data Conference next week!