My Experience at Data Council Austin 2023: Day 1
I have to admit I didn’t do enough research before joining the Data Council conference, such as why it’s special, what tracks are available, and who will speak. To some degree, I often do something similar when I go to the cinema. When I roughly know it won’t be too bad, I enjoy those happy surprises (yes, I am talking about John Wick: Chapter 4).
Again, it’s 11pm and I am just back from 2 parties in a row tonight, and I have to reserve some time to give you an update on Day 1 of Data Council, as a follow-up to 👇
The TL;DR is Data Council is amazing!
It’s my first time at Data Council, and I have attended many, many tech conferences in my 20-year tech career. Data Council is just great.
To be honest, I was a little bit disappointed when I entered the demo booth area this morning for “standup” breakfast and saw “750 attendees” in the opening keynote of Pete Soderling:
I’ve been to AWS re:Invent with around 20K attendees, and Current ’22 with about 2K-3K attendees. When I saw this 3-digit number, I said “well”, but later on “wow”. Because there is a relatively small number of attendees, speakers, and sponsors, you connect much more closely with like-minded folks. You can come to the Speaker Office Hours to have good discussions with the presenters. There is a series of workshops that are relevant to your day-to-day work. You can easily meet someone in the hallway who has similar challenges or hopes. And there is no live streaming (although the recordings will be available online, say on YouTube, in 2–3 weeks), so you have to enjoy this conference in person.
Also, Data Council is not organized by a single vendor or cloud provider. Because it’s relatively small, because it’s very focused on data, because it’s in Austin, because the overall tone is very casual, honest and no bullshit, it turns out to be quite an amazing experience to join such a data conference in person.
Again, this series of blogs is all about my personal experience at the Data Council conference. I do have some biases on data stacks, streaming databases, and data engineering in general. However, I guess I am open-minded enough to talk to different folks with different backgrounds and different opinions. In this blog (and the Day 2 and Day 3 updates), I mainly list what I saw/heard/discussed, rather than my personal opinions. I just started https://jove.substack.com/ for more personal opinions, or you can ping me on https://www.linkedin.com/in/jovezhong/ or https://twitter.com/jove
Some of the key takeaways of Day 1 at Data Council are:
- We are living in the same world, but it’s also a multiverse. Not just on LinkedIn/Twitter, but also at this conference, we are hearing very different opinions, such as “Big Data is dead” or “we are at the new chapter of Big Data”, “we’re a cloud-only company” or “we won’t use any commercial software. Just run the open-source systems in our VPC”. Companies/organizations at different stages, or in different sectors, even different regions/countries, may have very different takes on those principles. I don’t think it’s necessarily a bad thing. As a vendor, the takeaway is you cannot please everyone, period. Believe what you believe and execute well. Don’t change your direction every 6 months or less. Have some faith in where you are heading and just do it.
- People talk a lot about Data Quality, Data Contract, Data Catalog, Data Lineage, Data Security/Privacy. It’s relatively easy to build a working system to solve certain types of problems. But when the organization gets bigger and more users with different skillsets or backgrounds/assumptions start using the product, getting teams to agree on how to process/manage the data can be a big problem. It is more than just technical challenges. It can be culture, it can be SOC2/GDPR, it can be organizational boundaries. It’ll be easier if the CTO/CIO can define the blueprint, principles, and roadmap, then spend some effort consolidating the stacks.
- There are a lot of tools available for data engineers, analysts, scientists (maybe too many 😂), and there will be even more (like us). People are a bit upset seeing thousands of logos on the Modern Data Stack or MAD landscape. But don’t worry about this too much. For many data practitioners, the goal is not really picking the best tool in the world for this XYZ job. It’s more about whether you trust the vendor, whether you get good support from them, and whether the vendor won’t send you a big invoice or monthly bill. I know it’s not a good metaphor, but take coffee shops as an example: you may choose Starbucks worldwide as a safe choice, or you may prefer a local roasthouse. As a vendor, you don’t have to compete on whether you make the best latte in the world. It’s more about whether you have good connections with your customers, whether you build a nice and friendly community, whether you keep your reputation.
Sorry, there might be too much of my opinion in this blog. For those who could not make it to the Data Council conference, let me quickly go through what happened today. I am not a good notetaker. Honestly, I trust my brain more than any physical or digital notebook. Here are some quick memos of what I experienced today.
The breakfast started at 8am, in the demo booth area. No tables. No chairs. You have to either walk around the booths or chat with a random person nearby. To me, both are good options.
For example, I visited the booth of Databand, which was actually acquired by IBM. They provide a very nice web UI for alerts, charts, and pipeline monitoring. To be honest, I was a bit confused by Databand as the company/product name. I happen to know the founders of Databend, which is a Rust-based cloud native data warehouse. Databand, Databend 🤔
Some other cool companies I got to know are Sync, Hasura, etc.
Sync is actually a product/solution to save you $$ on Apache Spark. I was lucky enough to talk to the CPO/EngLead from Sync at the party tonight. In today’s economy, saving cost on cloud bills is indeed a good business model. Apache Spark is a well-established open-source framework that is trusted by many organizations. If an organization already spends a lot of money on a certain cloud version of Spark, that means they have already bought into the value of Spark. Then if a company comes and says they can save X% of the bill, why not? In today’s conference, there was also a good talk from https://select.dev/ on how to tune Snowflake to do more with less spending. One of my classmates is working on bluesky, which optimizes the cost of Snowflake too.
Hasura is an interesting company that provides “Federated Search” for various systems with a unified GraphQL/REST API. They mentioned the SQL interface is coming soon (and I look forward to it). I asked how this is different from Presto/Trino/Starburst. Well, they seem to be big fans of GraphQL. Okay, cool. There is not really a good or bad direction. It’s good to have a strong opinion. Someone may like it, someone may hate it. Again, you cannot please everyone.
I also stopped by the booth of Redis and was happy to learn that they are now providing real-time search. Kind of like Elastic but based on Redis. Sure, why not? If many customers are using Redis for all kinds of data, why should they have to spin up another system for search?
The keynote speech started at 9 sharp. Nice! Pete set the overall tone very well. At least he convinced me that Data Council is not yet another data conference but THE data conference.
The next talk was from Shirshanka Das, former Lead Data Architect at LinkedIn and now Co-founder and CEO at Acryl Data, the company behind DataHub, the open-source metadata platform for the Modern Data Stack.
It was a great talk. He talked about the crowded Modern Data Stack, what got easier and what got harder.
Again, it won’t be too hard to get things working initially. The real problem is how to scale, how to maintain/improve the quality, how to react to changes, and how to avoid pissing off your data team or stakeholders.
The next session I attended was from https://select.dev/ on how to optimize the cost of Snowflake. It was very technical, very dry, and very helpful indeed.
I was sitting in the same room for the “Data Engineering & Infra” track. Two more sessions in the morning were “Data Contracts” by Chad Sanderson from Data Quality Camp (a bit marketing-heavy, to be honest), and “Malloy, an experimental language for data” from Lloyd Tabb, who has built a lot of OLAP/BI systems in the past 20 years and recently left Looker to work on this open-source “next-gen” SQL. The session was half presentation, half live demo. Very cool. Malloy is an interesting attempt to let you explore your data, do easier/better JOINs, and visualize your data with a single comment (not even a line of code 👍). It compiles to SQL for each database they support. I think it shares some goals with dbt but goes in a different direction. I personally like SQL more than other SQL-like or next-gen SQL languages. But maybe it’s just me. (Let me know your thoughts in the comments.)
The afternoon session started with a spicy talk on “Big Data is Dead” from MotherDuck CEO, Jordan Tigani. He is definitely a great speaker, with a very open mind.
I recommend checking out the recording when it’s available. Not very technical, but a good summary of his data journey (from a single SQL query scanning X PB of data at a 4-digit cost, to running DuckDB in your browser and outperforming the expensive boxes in the cloud).
At 2pm, I went to a workshop by Materialize.
They built a special repo with a nice getting-started guide and cloud instances to help folks in the room build a dbt project with the new version of Materialize (not the outdated single-binary version). The demo instances are probably not available now. Feel free to contact them to get a trial.
I was a bit surprised to see ~100 folks in the workshop room. Are streaming databases so popular now? Wow! (I am working on a streaming database too.) But later I realized some folks just stayed in the room for the entire afternoon. They just want to learn new things and get their hands dirty. Respect! 🫡
When I returned to my favorite spot in the DataEng/Infra room (yes, I sat in the same spot, even with the same nice guy next to me), I enjoyed the talk from Uber engineers Yupeng Fu and Nan Ding on how they scaled their metrics system by moving from Elasticsearch to Pinot. It was a great technical presentation, with a good level of problem statements, goals, high-level design and mid/low-level optimizations.
The last speaker session I attended today was from Matt Housley, one of the co-authors of the best-selling book, Fundamentals of Data Engineering (O’Reilly Media). His topic was “The End of History? Convergence of Batch and Realtime Data Technologies”.
Another great talk. After the session, I also spent 1 hour with Matt in the Speaker Office Hours, together with ~15 other admirers. We discussed a lot about data quality, security, privacy, then ELT, semi-structured data, and Continuous Query (aka streaming SQL). I was happy to see that someone in the StreamingSQL expert group from incits.org was there too. I had a good conversation with him after the 1-hour-long group discussion, and we walked to the first party of the night together.
Parties are a very important element of Data Council, period. There are many parties/events at the same time or at least on the same day. I joined 2 parties tonight and met many folks from famous and new brands, such as Fivetran, dbt, Airbyte, Rockset, Lightup, etc. To some degree, I may learn more from the parties than from the speaker sessions. That’s something you cannot easily get from YouTube, Zoom, Twitter, or blogs. I really appreciate people being open, honest and helpful to each other.
The second party I joined tonight (I know someone joined 3 parties tonight 👍) was at a place called Devil May Care (I guess it’s a twist on a video game called Devil May Cry. I haven’t played it, but many of my friends like it). I was expecting just another party, but it turned out that 1 hour of it was more like a panel discussion.
Well, to be honest, I had mixed feelings about that, and I observed something common in the audience. Some folks had to watch their phones to handle urgent emails (of course), or had to leave to meet someone important (sure). Still, the topics were interesting, such as whether AI can replace data engineers or analysts, and whether you believe in edge computing. The panel discussion ended around 9:50pm. I chatted a little bit with my neighbours, and had to leave to go back to the hotel to summarize what I experienced today as Day 1.
It is now 12:20am. I may want to join another event tomorrow morning, at 7:15am, with the title “#StreamBrew #1 [coffee] @ DataCouncilAI in Austin”. Why not?
Thank you for reading this lengthy blog. If you enjoyed it or found it a bit useful, stay tuned for the Day 2 update. Peace!
UPDATED at 11:30pm Wed: here is the Day 2 summary: