Launch HN: Data Mechanics (YC S19) – The Simplest Way to Run Apache Spark

flowerlad 2255d ago

Running Spark on a Kubernetes cluster is already pretty easy, so it is unclear what value this is adding. Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day and using it for only 1 hour. If someone comes up with a way to pay for pre-provisioned Kubernetes clusters only for the time you actually use them, that would be interesting.
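To put rough numbers on the 1-hour-versus-24-hours point, here is a minimal sketch of the arithmetic. The hourly rate is a made-up placeholder, not a real cloud price, and `cluster_cost` is a hypothetical helper.

```python
def cluster_cost(hourly_rate, hours_billed_per_day, days=30):
    """Total cost of a cluster billed for a fixed number of hours each day."""
    return hourly_rate * hours_billed_per_day * days


# Hypothetical price for the whole cluster, in dollars per hour (assumption).
HOURLY_RATE = 10.0

always_on = cluster_cost(HOURLY_RATE, hours_billed_per_day=24)  # cluster never torn down
on_demand = cluster_cost(HOURLY_RATE, hours_billed_per_day=1)   # billed only while the job runs
utilization = 1 / 24                                            # share of paid hours doing work

print(f"always-on:   ${always_on:,.2f} per month")
print(f"on-demand:   ${on_demand:,.2f} per month")
print(f"utilization: {utilization:.1%} of the always-on bill is useful work")
```

Whatever the real rate is, the ratio stays the same: the always-on cluster costs 24 times what the nightly job needs.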

apoverton 2255d ago

I've thought about solving this problem with an ML approach like you all are taking but, as you say, never had the bandwidth because I was focusing on my "core missions". I'm no longer a heavy Spark user but am very happy to see you all working on this!

It always seemed so inefficient to me to spend all this time hand-tuning jobs only to have the data change and need to do the same thing again.

Good luck!

blancothewhite 2255d ago

Very interesting topics in good hands!

sg47 2255d ago

I saw that dynamic allocation is enabled by default. I thought dynamic allocation does not work well on k8s if the executors need to be kept around for serving shuffle files. How does it work in your case?
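For context on the question, Spark 3.0 added shuffle tracking, which lets dynamic allocation run without an external shuffle service by keeping executors that still hold shuffle data. The sketch below only shows what such a configuration could look like. Whether Data Mechanics uses these settings is not stated in this thread; the executor bounds and timeout are arbitrary example values, and `to_spark_submit_args` is a hypothetical helper.

```python
# Example settings only (assumption): not confirmed as Data Mechanics' configuration.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # Spark 3.0+: track which executors hold shuffle files instead of
    # relying on an external shuffle service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Release executors holding shuffle data after this long (example value).
    "spark.dynamicAllocation.shuffleTracking.timeout": "30min",
    "spark.dynamicAllocation.minExecutors": "1",   # example bound
    "spark.dynamicAllocation.maxExecutors": "20",  # example bound
}


def to_spark_submit_args(conf):
    """Render a config dict as a list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(["spark-submit"] + to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

The trade-off sg47 describes still applies: executors with live shuffle data are kept around, so scale-down is less aggressive than it would be with a separate shuffle service.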

izyda 2255d ago

What do you see as your key differentiator from Databricks? What's the key pain point they weren't/couldn't solve that you are?

ev0xmusic 2255d ago

Congrats guys, what you are doing is awesome :)

knes 2255d ago

Awesome! Making Spark more approachable is good news for the wave of new data engineers.

Do you have a recorded demo you can share where we can see how a user would set up and integrate with the other tools? That would be neat.

ojnabieoot 2255d ago

Speaking as someone who might be in your target audience: my experience with Databricks (back in 2017/2018, without Kubernetes) is that their product is just as unreliable and frustrating as deploying a Spark cluster manually, but also more expensive and more time-consuming. It was so bad that I was wondering if the entire company was a scam - which isn't true, of course. I suspect a big part of our problem was a shuffle-heavy workload hitting a relatively new product. But it left a really bad taste in my mouth about the entire business model of "Spark as a Service."

My impulse reaction to your sales pitch is "their product probably doesn't work very well and is way too expensive." I know that's unfair, but this entire idea of "our platform automates away the tedium of Spark clusters" just strikes me as a bag of magic beans.

What would help a lot with drawing cynical, bitter people like me: case studies on your website. I know that's a lot to ask for a young startup. But actual details about either money or developer time saved with Data Mechanics - specific pains your customers were having and how Data Mechanics addressed them, or specific analyses your customers were able to do now that they're spending less time managing Spark. Running a big Spark job in the cloud is a huge financial risk, and many Spark users are much more concerned about this than the headaches involved with management - and again, my last experience with Databricks resulted in more cost and more headaches. I do not think I am alone here.

I am wondering if you're considering selling your Spark telemetry/parameter tuning/etc. software, or offering it as a service. Speaking personally, I would be much more open to using Data Mechanics's tools on my own Spark cluster rather than outsourcing the actual management. At my organization, in addition to AWS, we also have a local Hadoop cluster with Spark installed; commercial software that gives better insight into its performance could be very useful.

soumyadeb 2255d ago

> Many of our customers use us for their ETL pipelines, they appreciate the ease of use of the platform and the performance boost from automated tuning.

This is quite interesting. Founder of RudderStack here (we are a CDI or simply an open-source Segment equivalent). I have seen a similar pain point across some of our customers. They use RudderStack to get data into S3 (or equivalent) and then run some kind of post-processing Spark jobs for analytics/machine-learning use cases. Managing two setups (RudderStack on Kubernetes + Spark) is a pain.

A single managed solution with Spark on Kubernetes makes so much sense. Would love to figure out how to integrate with you guys.

ggregoire 2254d ago

Julien is a really smart guy I had the pleasure of working with.

If you are reading this, I'm glad and very excited for you! Good luck!

perlin 2254d ago

Can confirm that running Spark at scale is difficult. Not even necessarily talking about scale of data or scale of performance, but organizational scale. Getting dozens or hundreds of engineers aligned around best practices, tooling and local development for Spark is both challenging and extremely rewarding. When you have everyone buy into Spark as not just an execution environment but a programming paradigm, it really unlocks some cool potential. If anyone cares, this is how I've found it best to get Spark users riding on rails:

* Use a monorepo to "namespace" different projects/teams/whatever. Each namespace has its own build.sbt for Scala jobs and Conda/Pip requirements file for PySpark. This gives you package isolation so that different projects can bump requirements at their own pace. This is crucial in larger organizations where you might have more siloed development or more legacy applications.

* Build each project in the monorepo into a separate Docker image and tag it accordingly with some combination of the branch and namespace.

* Deploy applications onto Kubernetes by invoking the SparkOperator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). This abstracts away a lot of the hassle of driver/executor configuration and gives you nice out-of-the-box functionality for scraping Spark metrics.

* For local development, use some type of CLI or Makefile to build/run the image locally. This is where the implementation diverges somewhat from using SparkOperator (unless you want to tell your employees that everyone needs to run Kubernetes on their local machine, which we thought would create too much friction).

* For orchestration, write a custom operator for Airflow that submits a SparkOperator resource to the Kubernetes cluster of your choosing. The operator should supervise the application state, since the SparkOperator doesn't quite do that well enough for you. This is something I wish we had the opportunity to open source.

* Where it gets tricky is building Spark applications locally and running remotely. Say you built a job locally and tested it on a small subset of your data. Now you want to see what happens when you run across a full dataset, requiring more than 16 GB of memory (or whatever the developer has on their laptop). You need some way to build your image locally but schedule it remotely. This could be done via the same CLI or Makefile, but you end up with a lot of images and it gets pretty costly. I'm sure we would have figured it out eventually if we didn't all get laid off last month :P

* BONUS: Use Iceberg or Delta (https://iceberg.apache.org/) (https://delta.io/). These are storage formats that work with distributed file storage like HDFS or S3 to partition and query data using the Spark DataFrame API. You get time travel, schema evolution and a bunch of other sweet features out of the box. They are an evolution of Hadoop-era partitioned file formats and are an absolute must for organizations dealing with lots of data & ML infrastructure.

This post took up more time than I had wanted, but it actually feels good to write down before I forget. I hope it is useful for someone building Spark infrastructure. I'm sure others have a completely different approach, which I'd be curious to hear! As someone whose full-time job was basically just to orchestrate Spark application development, I can say for certain that products like this are needed in order for the ecosystem to thrive, and I would probably have given you my business had the circumstances been right. Good luck to you and your team.
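The image-tagging and SparkOperator steps above can be sketched roughly as follows. This is not perlin's actual tooling: the registry host, project names, file path, service account and resource sizes are all made-up placeholders, and `image_tag` and `spark_application` are hypothetical helpers. The manifest follows the `SparkApplication` custom resource (`sparkoperator.k8s.io/v1beta2`) that the linked operator defines.

```python
import json
import re


def image_tag(namespace, branch):
    """Combine monorepo namespace and git branch into a valid Docker tag."""
    raw = f"{namespace}-{branch}".lower()
    # Docker tags allow letters, digits, '_', '.' and '-'; replace anything else.
    tag = re.sub(r"[^a-z0-9_.-]", "-", raw)
    # A tag may not start with '.' or '-', and is limited to 128 characters.
    return tag.lstrip(".-")[:128]


def spark_application(name, image, main_file, app_type="Python", executors=2):
    """Build a SparkApplication manifest as a plain dict (example sizes only)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": app_type,
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.0.0",
            "restartPolicy": {"type": "Never"},
            # "spark" is an assumed service account name.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "4g"},
        },
    }


tag = image_tag("team-a/etl", "feature/new-agg")
manifest = spark_application(
    name="nightly-agg",
    image=f"registry.example.com/spark-jobs:{tag}",  # hypothetical registry
    main_file="local:///opt/app/jobs/nightly_agg.py",  # hypothetical path
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed manifest could be piped straight to the cluster, and a CLI, Makefile target or Airflow operator could call the same two helpers.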

apankrat 2254d ago

Only tangentially related: Data Mechanics was one of the contenders for our company name too! It was one of my favourite options, in fact. It sounds nice, can be read in two ways, and works well when shortened - DataMech. But getting datamech.com proved to be impossible, so we settled on something else. Just 2c.

missosoup 2254d ago

Spark is sort of dead though. Dask looks to be the way of the future, in part because it doesn't take a zillion parameters to tune or consume a bucket of resources just for overhead. Good luck.
