Aggregation

Updated: 2018-12-11

Describe a DataFrame

Assume we already have a DataFrame df whose columns are named col0, col1, col2, ...

val result = df.describe("col0", "col1", "col6")
result.show()
+-------+------------------+-----------------+----+
|summary| col0             | col1            |col6|
+-------+------------------+-----------------+----+
| count | 100              | 100             | 100|
| mean  | 1.625            | 1.5             |null|
| stddev|1.3252656767320465|1.125087900926024|null|
| min   |                  |                 |    |
| max   | ~                | 3.0             |aaaa|
+-------+------------------+-----------------+----+

GroupBy

Count of Categorical Field

df.groupBy("colName").count()

Weighted Count of Categorical Field

df.groupBy("colName").sum("weightColName")
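
To make the difference between a plain count and a weighted count concrete, here is a minimal sketch that uses ordinary Scala collections instead of Spark. The object name, the sample rows, and the helper names are hypothetical and only illustrate the idea: a plain count counts the rows in each category, while a weighted count sums a weight value per category.

```scala
object WeightedCountDemo {
  // Hypothetical rows: (category, weight). These stand in for the
  // colName and weightColName columns of a DataFrame.
  val rows: Seq[(String, Double)] = Seq(("a", 1.0), ("b", 2.5), ("a", 0.5))

  // Plain count per category, analogous to groupBy(...).count().
  def counts(rs: Seq[(String, Double)]): Map[String, Int] =
    rs.groupBy(_._1).map { case (category, group) => category -> group.size }

  // Weighted count per category, analogous to groupBy(...).sum(weight).
  def weighted(rs: Seq[(String, Double)]): Map[String, Double] =
    rs.groupBy(_._1).map { case (category, group) => category -> group.map(_._2).sum }
}
```

With the sample rows, category a has a count of 2 but a weighted count of 1.5.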

Group by multiple columns

df.groupBy("col0", "col1", "col2").count()

agg()

Unique Values/Cardinality

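import org.apache.spark.sql.functions._  // count, min, max, countDistinct, approx_count_distinct
df.agg(approx_count_distinct("col0")).show()  // approxCountDistinct is deprecated since Spark 2.1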

With multiple aggregation functions

val t = Seq(count("col1"), count("col2"))
df.agg(count("col0"), t:_*).show()

Programmatically, with header holding the column names:

val t = header
  .map(name =>
    Seq(
      count(name).as("cnt_" + name),
      min(name).as("min_" + name),
      max(name).as("max_" + name),
      countDistinct(name).as("dist_" + name)))
  .reduce(_ ++ _)

val result = df.agg(count("col0"), t: _*)

println(result.head.getAs[Long]("cnt_col0"))  // count produces a Long, not a Double

Here t: _* tells the compiler to expand the sequence t and pass its elements as varargs.
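
The varargs expansion can be tried in isolation, without Spark. The sketch below uses a hypothetical joinAll method that has the same shape as agg as it is called above (one required argument followed by a variable-length list); the object and method names are assumptions made for illustration only.

```scala
object VarargsDemo {
  // Hypothetical helper: one required argument, then zero or more extras.
  def joinAll(first: String, rest: String*): String =
    (first +: rest).mkString(", ")

  def main(args: Array[String]): Unit = {
    val t = Seq("count(col1)", "count(col2)")
    // Passing t directly would not compile, because a Seq[String] is not a String.
    // The ": _*" ascription expands the sequence into individual arguments.
    println(joinAll("count(col0)", t: _*))
    // prints: count(col0), count(col1), count(col2)
  }
}
```

The same ascription works for any Seq, which is why a list of aggregation columns built at runtime can be handed to agg.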