Versus
    32-bit vs 64-bit
    Annotations vs Decorators
    BigQuery vs Bigtable
    Block Storage vs File Storage vs Object Storage
    C vs C++
    Canvas vs SVG
    Constructor vs Init() vs Factory
    Containers vs Virtual Machines (VMs)
    DOM vs Virtual DOM vs Shadow DOM
    DQL vs DDL vs DCL vs DML
    Dagger vs Guice
    Data Mining vs Machine Learning vs Artificial Intelligence vs Data Science
    Flux vs Redux
    GCP API Gateway vs Cloud Endpoint
    GCP Cloud Run vs Cloud Functions vs App Engine
    GCP DataFlow vs Dataproc
    Google Analytics 4 vs Universal Analytics
    Google Internal vs Open Source
    HEIC vs HEIF vs HEVC vs JPEG
    Java vs C++
    Jetty vs Netty
    Kotlin vs Java
    LLVM vs JVM
    Linux vs BSD
    Microcontroller vs Microprocessor vs Computer
    Node.js vs Erlang
    POSIX vs SUS vs LSB
    Pass-by-value vs Pass-by-reference
    Proto2 vs Proto3
    PubSub vs Message Queue
    REST vs SOAP
    React vs Flutter vs Angular
    Rust vs C++
    SLI vs SLO vs SLA
    SRAM vs DRAM
    SSD vs HDD
    Software Engineer vs Site Reliability Engineer
    Spanner vs Bigtable
    Stack based VM vs Register based VM
    Stateless vs Stateful
    Static Site Generation vs Server-side Rendering vs Client-side Rendering
    Strong Consistency vs Eventual Consistency
    Subroutines vs Coroutines vs Generators
    Symlinks vs Hard Links
    Tensorflow vs PyTorch
    Terminal vs Shell
    Vi vs Vim vs gVim vs Neovim
    WAL vs rollback journal
    gtag vs Tag Manager
    stubs vs mocks vs fakes

GCP DataFlow vs Dataproc

Updated: 2022-02-12

TL;DR

Google Cloud Platform offers two data processing / analytics products:

  • Cloud DataFlow is the productionization, or externalization, of Google's internal Flume.
  • Cloud Dataproc is a hosted service for the popular open source projects in the Hadoop / Spark ecosystem. The two share the same origin (Google's papers) but evolved separately.

A little bit of history

Hadoop was developed based on Google's The Google File System paper and the MapReduce paper. Hadoop has its own distributed file system, HDFS, and adopted the MapReduce model for distributed computing. Hive and Pig were then created to translate (and optimize) queries into MapReduce jobs. But MapReduce jobs are still slow to run, so Spark was born to replace MapReduce and also to support stream processing in addition to batch jobs.
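To make the model concrete, here is a minimal sketch of the word-count pattern (the canonical example from the MapReduce paper) in plain Python. The three phases run in-process here; in Hadoop the map and reduce phases run on different machines and the shuffle happens over the network. All names below are illustrative, not part of any real API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework would do
    # across the network between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Frameworks like Flume, Spark, and Beam keep this map/shuffle/reduce shape but let you chain many such stages into one pipeline and optimize the whole graph, instead of writing each stage as a separate MapReduce job.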

Separately, Google created its internal data pipeline tool on top of MapReduce, called FlumeJava (not the same as Apache Flume), and later moved away from MapReduce. Another project called MillWheel was created for stream processing, now folded into Flume. Part of Flume was open sourced as Apache Beam.

So both Flume and Spark can be considered the next generation of Hadoop / MapReduce.

Which one to use

  • If you want to migrate your existing Hadoop / Spark cluster to the cloud, or take advantage of the many experienced Hadoop / Spark engineers in the market, choose Cloud Dataproc.
  • If you trust Google's expertise in large scale data processing and want to get their latest improvements for free, choose Cloud DataFlow.
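The operational difference shows up in how you run a job. A rough sketch with gcloud and a Beam pipeline — the cluster name, project, bucket, and file names are all placeholders:

```shell
# Dataproc: bring your existing Spark code; you create and manage
# a (hosted) cluster, then submit jobs to it.
gcloud dataproc clusters create my-cluster --region=us-central1
gcloud dataproc jobs submit pyspark my_spark_job.py \
    --cluster=my-cluster --region=us-central1

# DataFlow: write an Apache Beam pipeline; there is no cluster to
# manage — workers are provisioned per job by the Dataflow runner.
python my_beam_pipeline.py \
    --runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --temp_location=gs://my-bucket/tmp
```

With Dataproc you pay for and tune the cluster; with DataFlow you hand Google the pipeline and it handles provisioning and autoscaling.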