Aggregation
Last Updated: 2021-11-19
Describe a DataFrame
Assume we already have the DataFrame df
, and column names are col0, col1, col2 ...
val result = df.describe("col0", "col1", "col6")
result.show()
+-------+------------------+-----------------+----+
|summary| col0 | col1 |col6|
+-------+------------------+-----------------+----+
| count | 100 | 100 | 100|
| mean | 1.625 | 1.5 |null|
| stddev|1.3252656767320465|1.125087900926024|null|
| min | | | |
| max | ~ | 3.0 |aaaa|
+-------+------------------+-----------------+----+
GroupBy
Count of Categorical Field
df.groupBy("colName").count()
Weighted Count of Categorical Field
df.groupBy("colName").sum("weightColName")
Group by multiple columns
df.groupBy("col0", "col1", "col2").count()
agg()
Unique Values/Cardinality
df.agg(approxCountDistinct("col0")).show()
With multiple aggregation functions
val t = Seq(count("col1"), count("col2"))
df.agg(count("col0"), t:_*).show()
Pragmatically:
val t = header
.map(name =>
Seq(
count(name).as("cnt_" + name),
min(name).as("min_" + name),
max(name).as("max_" + name),
countDistinct(name).as("dist_" + name)))
.reduce(_ ++ _)
val result = df.agg(count("col0"), t: _*)
println(result.head.getAs[Double]("cnt_col0"))
where t:_*
notes that t
should be used as varargs