
Episode 17 - Chatting Velox and Accelerating Data Management with Masha Basmanova

Philip Bell, a Developer Advocate at Meta Open Source, chats with Software Engineer Masha Basmanova about Velox, an open source unified execution engine aimed at accelerating data management systems.

Summary

In this episode of The Diff podcast, host Philip Bell talks to Meta Software Engineer Masha Basmanova about Velox, an open source unified execution engine aimed at accelerating data management systems.


Episode Transcript

Chatting Velox and Accelerating Data Management with Masha Basmanova

[00:00:00] Philip Bell: Hi everyone. I'm Philip. Thank you all for joining us. With me today is Masha Basmanova. In this episode, we discuss a new open source C++ library and how it has the potential to revolutionize how we build data processing and storage platforms. Hello, Masha, can you share a little bit about yourself and how you ended up working in big data and data processing engines?

[00:00:26] Masha Basmanova: Hello, Philip. Thank you for having me today. Very excited to be here and talking to you about big data. I joined the big data world a few years ago, just by chance. I was coming out of a maternity leave and I was looking for something new to start and work on, and I got introduced to David DeWitt, who is a very big name in the query optimizer and database world.

[00:00:56] And at the time I barely understood how [00:01:00] things actually worked. My prior experience with databases and big data was running some queries, writing SQL, just thinking that the database magically figures out how to compute the results for my queries. And so this first introduction, this first encounter with David was very revealing.

[00:01:24] He introduced me to the inner workings of databases, how they are put together using different components like optimizers and execution engines. He taught me some basic algorithms for how to do joins or how aggregations work, and essentially made the whole field a bit more approachable and accessible.

[00:01:50] And so during that kind of introduction, we were thinking that maybe we could look into Spark and see how we [00:02:00] could make it better by introducing a cost-based optimizer. So we started looking into that, and then we started looking into Presto. And at some point a friend of mine made another introduction to some folks who were trying to get geospatial queries to work on Presto.

[00:02:19] And it turned out it was very difficult. There were no geospatial capabilities. It was very difficult to compute distances on the globe between points. It was very hard to figure out if a given location is within this city or that city, or doesn't belong to any city at all. So through all of those kind of random encounters, I ended up learning about spatial joins,

[00:02:46] how to put them together, learning about Presto, learning how to make changes in Presto, and I made my first contribution to the big data [00:03:00] engines, which was geospatial capabilities in Presto. They consisted of a package of geospatial functions and a spatial join. So that was how it all started, and that was a few years back, maybe six or seven years back.

[00:03:16] Philip Bell: Thank you for the introduction. What have you been working on lately?

[00:03:20] Masha Basmanova: So more recently, after this first encounter with big data, after making this first contribution with the geospatial capabilities, I met another person, Orri Erling, and he got me excited about making Presto run a lot faster than it was running at the time.

[00:03:43] At the time, Presto was considered to be super fast already. It was much faster than Hive. You didn't need to wait for your query. You could just type a query, and in two or three seconds you could see the results. And then you could type another query, and you could adjust a query. [00:04:00] You could use the engine interactively.

[00:04:02] As opposed to Hive, where you would craft your query very carefully, submit the query, and then have to wait, often minutes, sometimes hours. Even for simple queries, like just a select star limit five, you still had to wait for a long time. And so in my mind, and in the minds of people around me, Presto was a super fast, lightning fast engine.

[00:04:29] Everybody was super excited. Users loved using it. I personally thought it was awesome. And so here comes this person who says, you know, it's super slow. It can be three, four, five times faster. There are lots of things that are done inefficiently; they can be done better. And that was of course very intriguing and, at first, unsettling.

[00:04:54] But, you know, I started digging in: why was he saying [00:05:00] that things can be faster? How exactly can they be faster? Where are the inefficiencies? And so together we took a deep dive into Presto. We figured out that most of the time queries spent was in table scan. Essentially, about half of the CPU time, half of the wall time, was being spent doing table scan. And so we launched our first efficiency project in

[00:05:29] Presto, called Project Aria, which was about making table scan more efficient: taking the filters which come after the table scan and pushing them down into the table scan itself, making sure that we first read the columns which have filters, apply those filters, and then, when reading subsequent columns, only read the rows which passed the filters on previous columns.

[00:05:54] This was an interesting journey. It was a complicated project: changing [00:06:00] table scan, which is used by every single query, while making sure it still works, produces correct results, and runs much faster was a lot of hard work, but also super exciting. So that's how we ended up speeding up a portion of Presto, which was table scan, through Aria, through the pushdown.

[00:06:25] And in that journey we were thinking, okay, but what's next? We noticed that using Java to write efficient code was pretty difficult. We were hitting roadblocks everywhere we turned, and so we saw that if we wanted to make Presto much faster, if we wanted to eliminate the inefficiencies we were seeing, we probably had to switch to a different language.

[00:06:55] We probably had to switch to a native language, like C++, and just the [00:07:00] thought was terrifying, because a query engine is a very complex piece of software. It takes years to build and requires lots of expertise. Thinking of rewriting it from scratch in a different language was hard. And so we were really

[00:07:22] discussing and trying to figure out some other ways, not making the decision lightly, because we knew that what we were undertaking was absolutely huge. And so when we decided that if we wanted to make Presto fast, we had to go to C++, we had to write it from scratch in this new language, we thought about, okay, how can we make that investment pay off beyond just making Presto faster?

[00:07:53] We also saw that new engines were emerging which were competing with Presto on [00:08:00] efficiency, on speed. And they were all written in C++. They were all native engines. They were taking a subset of the workload, optimizing it, and showing that this workload could be executed many times faster than Presto could.

[00:08:17] And so what we saw was that there was a proliferation of query engines: different engines for ad hoc (Presto), for batch (Spark and Hive), yet a different engine for streaming, and another engine for some other ad hoc use case. So there were a lot of systems which were pretty much implementing the same logic again and again, each chasing down a specific use case, a specific workload, implementing a specific set of optimizations.

[00:08:53] So if you look at all of those things together, you would think, oh, all of those [00:09:00] optimizations combined would make for a great engine. But no team had the capacity and bandwidth to implement all the optimizations. So we had many engines. Each had a killer feature, a killer optimization, which made a certain workload run very fast, but there was no single engine that would have all of those

[00:09:23] optimizations and features combined to allow a wide range of workloads to run seamlessly, uniformly, on a single platform. So we decided not to build a new engine, not to add to that already pretty large family of engines, but rather to build a library. A library which would serve as a core

[00:09:49] on which we could rebuild different engines, which serve different workloads and do need to be slightly different, but don't need to [00:10:00] be different at the core. They do not need a different hash join or hash aggregation. They do not need their own custom expression evaluation. All those things can be shared.

[00:10:14] The core of the execution can be a library, and the specific details of each engine can be put on top. And so that's how we decided to build Velox, and that's what I've been doing for the past two, three years: building out the library and figuring out how to integrate it with Presto, Spark, streaming, and machine learning libraries, about a dozen different applications, to provide a seamless experience, very fast execution, and a single place to put all the optimizations.

[00:10:56] Philip Bell: That sounds amazing. What would you say people can use Velox [00:11:00] for?

[00:11:01] Masha Basmanova: So Velox is good if you want to build a new query engine, or if you want to experiment with a particular optimization. Maybe you're doing some research in databases. You could use Velox to bootstrap your project very quickly, so you get the core algorithms,

[00:11:25] the core operators, a set of SQL relational operators like joins, aggregations, window operators, order by, and then you can start adding on top. You can either create an alternative, optimized version of some operator, or you could maybe package it differently and use a different distribution mechanism for how you distribute your query among different machines.

[00:11:54] You could maybe look into adding some custom functions that [00:12:00] you find useful in your workload. So Velox gives you a base, gives you basic functionality, and you can dig into building the new features that you are passionate about.

[00:12:14] Philip Bell: Cool. For clarity, what would you say Velox is not, or should not be used for?

[00:12:19] Masha Basmanova: What should Velox not be used for? So Velox is not a full-blown database. Mm-hmm. And it's not even a full-blown query engine. Velox is a library, which means that you either should use an existing application that's built on top of Velox, and this way you leverage Velox through that application.

[00:12:45] An application could be Presto, it could be Spark, or you need to build an application on top of it. So you would not be using Velox standalone, as [00:13:00] is. You would not provide it as is to end users. It doesn't speak SQL. It expects that the application would translate a query, either in SQL form or in the form of a data frame or really any other input, and convert it into a query plan that's optimized and ready for execution.

[00:13:26] So it's more of a low-level infrastructure, rather than something that can be exposed directly to end users who are interested in digging into data.

[00:13:37] Philip Bell: So besides being written in C++, how would you say Velox operates differently than the core technologies it's replacing?

[00:13:45] Masha Basmanova: So the main difference between this project and, say, Spark, Presto, or Photon

[00:13:55] is that it's a library. We believe that [00:14:00] we as an industry are moving away from monolithic databases, monolithic engines, and into a world where we are building modules, and then we are taking those separate components and modules and putting them together to create a final solution.

[00:14:20] We are seeing projects like Substrait and Arrow already contributing to that modularization of query engine capabilities and databases. And so Velox is a piece of that puzzle. Velox provides you the execution primitives. And that makes it stand out from fully blown databases and fully blown query engines like Presto.

[00:14:49] So Velox is highly extensible. You can take it and build your own application on top of it. It's very flexible, and so it plays into [00:15:00] this new vision of everything being essentially Lego blocks. Velox is one of the Lego blocks: very powerful, very useful, but a Lego block that you put together with others to build a useful application.

[00:15:17] Philip Bell: And what are you most excited to see Velox have an impact on?

[00:15:20] Masha Basmanova: So we are looking to integrate Velox with some of the most common and widely used engines, like Presto, Spark, and streaming. And with that, we are looking to get much more efficient execution, both in terms of resources used, machines being used, power being consumed, as well as human resources, like

people spending less time waiting for insights, waiting to get the answers to their queries. And in addition to efficiency, we're also looking for [00:16:00] consolidation. Currently, many users have to stitch together multiple different systems to get their data ingested through streaming, and then processed through some batch pipeline,

[00:16:14] and then shipped over to some sort of machine learning system, using something else to train on that data, and then produce models, publish them, and use them. And all those systems are similar, but very different. And so users have a really hard time learning different dialects of SQL and non-SQL to interact with all of those systems, to make sure that the whole end-to-end solution is working smoothly.

[00:16:47] With Velox, we are hoping to make that experience a lot better. We are hoping that the execution will become consistent, and we are hoping that the front end will become [00:17:00] more consistent as well. So I'm excited about how far we can go on that journey. I think the efficiency aspect is a little bit easier to achieve, but I'm much more excited about the consistency aspect, and the usability potential of using a single backend for all of those engines that

[00:17:23] users have to interact with today.

[00:17:26] Philip Bell: And what were some of the early challenges for implementing the first iteration of Velox?

[00:17:31] Masha Basmanova: The early challenges were the pandemic. We started the project two weeks before, so we got two weeks to work normally before the world shut down. So that was definitely a challenge, in addition to just

starting a new project, which I feel is always challenging, because you need to build a lot of things, and you can't [00:18:00] build them overnight. Still, you need to figure out how to build enough things quickly enough for people to start believing that what you are doing is actually possible. Because initially, not only were we totally scared about how we were going to pull it off,

[00:18:21] but also people around us thought, oh, that's probably not possible. That's too big of an undertaking. How are you going to rebuild an engine that took six years to build? How long would it take? How can you just do that? And so we had those challenges. We had the pandemic. And then I personally had the challenge of switching from Java to C++.

[00:18:49] C++ is a very complex language. It has a steep learning curve, and, especially coming from Java, [00:19:00] which is a much nicer language to work with, more human friendly, more readable, for the first few weeks just reading C++ I thought, what is this? I'll never be able to understand what those

[00:19:17] characters mean. So it was this kind of triple challenge: the pandemic, taking on a huge project and figuring out how to make people believe you can do it when you don't quite believe it yourself, and switching to a very new, very challenging programming language. That was what the early experience was.

[00:19:46] Philip Bell: Okay. How do you feel about C++ now that you've been working in it for a while?

[00:19:52] Masha Basmanova: I'm like completely the opposite now. Recently I had to go back to some of the Java code to [00:20:00] help with the integration of Velox into Presto and Spark, and now I can't read Java. So I guess you just get used to one environment, one programming language, and then it becomes natural and easy.

[00:20:18] You can do things very quickly. And then you go back to something that you used to know, that you used to think was easy and fun, and all of a sudden you don't really know how to do things, and you have to remember those things again. I remember a similar transition when, at some point, I had to move from Mercurial to Git, and I thought, oh, Mercurial is so nice and easy to use;

[00:20:46] Git is so hard. Mm-hmm. And after working in Git for a few months, I had to go back to Mercurial, and I'm like, why is this so hard? Git is so easy. Yeah. [00:21:00] So I guess we, or at least me, I'm a creature of habit. Mm-hmm. I get used to something, it becomes easier, I love it. And then other things that I'm not as used to, I'm getting rusty.

[00:21:11] They start looking kind of complicated.

[00:21:15] Philip Bell: Since we're discussing languages, I quickly want to ask: where do you fall in the debate of same-line versus next-line curly braces?

[00:21:22] Masha Basmanova: Oh, same line versus next line on the curly brace. Right. This is something I just never even think about. One of the first things we did when we started Velox was put together style guidelines and automated style checks.

Mm-hmm. So whenever you write some code and you are ready to submit it for review, you just run a command and it puts the curly braces in the right places, puts the new lines where they should be, and removes the extra new lines. And that makes it so that you [00:22:00] really don't need to think about it. So I no longer manually adjust the curly braces.

[00:22:06] And now that you ask me about it, I don't even notice where they are, I think. Yeah. So I kind of don't have a preference. Over the years I worked on multiple projects, and what I learned is that every project has its own coding style, and it's going to be different from the previous project.

Mm-hmm. And the first week it'll look completely weird. Yeah. And you'll think, why do they do it this way? Why can't they do it the other way? But after a week, maybe two weeks, you just stop noticing. I feel like those things are just really not important. What's important is for the code base to have a consistent look and feel, so that those details don't distract you when you are reading the code.

Mm-hmm. I feel like [00:23:00] consistency in general helps a lot, so that when you read the code, you can just scan it and assume that it works this way or that way just by looking at the shape. Mm-hmm. And if the code base is consistent, then usually those assumptions are correct. And it's very helpful if you can make those assumptions quickly and not worry about them being wrong.

[00:23:22] So I'm a big, big fan of consistency, but not of any particular way. As long as it's consistent, I'm fine with it.

[00:23:31] Philip Bell: Absolutely agree. Back on topic, what challenges still lie ahead for Velox?

[00:23:34] Masha Basmanova: Probably some of the biggest challenges are how to grow the open source community around the project, how to scale ourselves, how to

scale the onboarding of people to the project. We've seen a lot of interest. We've seen lots of [00:24:00] contributions, and right now the challenge for us is how to organize ourselves and how to organize the community so that we can support each other in a healthy way. Right now, we are probably getting more contributions than we can handle

[00:24:19] easily. So we need to figure out how to grow more folks who have enough knowledge about the project so that they can help with reviews, help with onboarding new contributors, and help do design reviews for new features or new ideas people want to bring into the project. So I see that as probably the biggest challenge for the next year or so.

[00:24:45] Philip Bell: And what should people know upfront when they're looking at integrating Velox into their architecture?

[00:24:49] Masha Basmanova: It definitely helps to ramp up on C++, because it is a very complex codebase, and integrations would require you [00:25:00] to dig deep, do some debugging, do some code reading, and try adjusting and extending the code.

[00:25:11] The main integrations that we've been doing require extensions: building maybe some custom operators, adding custom functions, aggregate functions, maybe building a connector. And all those integrations require you to be comfortable working with C++ and comfortable reading a codebase.

So ramping up on the language, ramping up on database concepts and query execution concepts, is definitely helpful. And also just joining the community: coming to the project, sharing who you are, what you're trying to do, what you're [00:26:00] looking for, introducing yourself, and seeing if anybody else is also interested in taking the project in the direction you are interested in.

[00:26:09] Or maybe somebody already did an integration similar to what you're thinking about. So finding partners who can help you, at least in your initial steps, would be very helpful.

[00:26:19] Philip Bell: Can you tell me about the decision to collaborate externally through open source?

[00:26:24] Masha Basmanova: Sure. That was actually not even a decision. I wouldn't do it any other way.

I caught the open source bug by working on Presto. I really enjoyed how that project was open source, had a healthy community, had lots of users outside of a single company. It was very exciting and very interesting to work on. And so when we were [00:27:00] considering this new big project, Velox, we didn't even consider building it closed source.

[00:27:09] First, because we thought there was no way we could build it ourselves. We thought that we would definitely need help from folks who may not necessarily be at Facebook or Meta. And we also thought that it's such a complex project, it's going to take so much effort, that we really wanted to see this effort pay off by

contributing to something as big as possible. We didn't want the project to be limited to one company, one use case. And so going open source was just how we built it, from the get-go.

[00:27:56] Philip Bell: Excellent. And what kinds of skills should developers have if they want to contribute to Velox?

[00:27:59] Masha Basmanova: [00:28:00] I would say that the usual coding skills help a lot, but when you are working on an open source project, one of the important skills to have is communication. Most of the communication happens asynchronously, through GitHub issues, through GitHub PRs. So being comfortable with that, and being comfortable explaining your thoughts in writing in a way that another person, who may not be able to get on a quick video or phone call with you, can understand and respond to. Just in general, working

remotely, as I guess we are all now used to, and communicating asynchronously would be very helpful for contributing to an open source project.

[00:28:55] Philip Bell: And where should someone go to learn more about Velox and get involved?

[00:28:59] Masha Basmanova: We [00:29:00] have a GitHub repo, and there is a landing page with a link to the documentation.

So I suggest everybody start by reading the articles we have in the documentation, just to get a feel for what the project is about, what kind of code there is, what kind of functionality exists. And once you've familiarized yourself, either join Slack or open a GitHub issue and introduce yourself.

Tell us a little bit about how you would like to contribute, what you would like to do for the project, or how you want to use the project. It's okay if you don't want to contribute but just have a question about how to use it, or how to integrate it with a system you are interested in. So just introduce yourself, share what you're doing, and we will engage.

[00:29:58] Philip Bell: Well, thank you so much, Masha, for sharing your insights and your work with us. I'm excited to see the performance results as your team finalizes integrating Velox into our production [00:30:00] systems.

[00:30:05] Masha Basmanova: Thank you very much, Philip. It was nice to have this conversation with you, and thank you for having me.

[00:30:13] Absolutely. Looking forward to more. Have a good day. Thank you. Bye.