GCP Dataflow vs Dataproc

Updated: 2019-01-21

Google Cloud Platform has two data processing/analytics products: Cloud Dataflow and Cloud Dataproc. They sound confusingly similar, so what are the differences, and which one should you use?

Hadoop was developed based on two Google papers: the Google File System paper and the MapReduce paper. Hadoop got its own distributed file system, called HDFS, and adopted MapReduce for distributed computing. Then Hive and Pig were created to translate (and optimize) queries into MapReduce jobs. But MapReduce was still very slow to run. Then Spark was born to replace MapReduce, and also to support stream processing in addition to batch jobs.
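
For readers who have not seen the MapReduce model, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) as a word count. It is an illustration only, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example, and a real cluster would run each phase in parallel across many machines.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # Chain the three phases; hypothetical helper for this sketch only.
    return reduce_phase(shuffle_phase(map_phase(documents)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data pipeline"]))
```

Tools like Hive and Pig let you write a query and have it compiled into one or more jobs of this shape, instead of writing the map and reduce functions by hand.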

Separately, Google created its own internal data pipeline tool on top of MapReduce, called FlumeJava (not the same as Apache Flume), which later moved away from MapReduce. Another project, called MillWheel, was created for stream processing and has since been folded into Flume. Part of Flume was open sourced as Apache Beam.

So both Flume and Spark can be considered the next generation of Hadoop/MapReduce. Cloud Dataflow is the productionisation, or externalization, of Google's internal Flume, and Cloud Dataproc is a hosted service for the popular open source projects in the Hadoop/Spark ecosystem. They share the same origin (Google's papers) but evolved separately.

If you want to migrate an existing Hadoop/Spark cluster to the cloud, or take advantage of the many well-trained Hadoop/Spark engineers on the market, choose Cloud Dataproc. If you trust Google's expertise in large-scale data processing and want to get their latest improvements for free, choose Cloud Dataflow.