pyspark.sql.DataFrame.groupBy
- DataFrame.groupBy(*cols)
Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.
groupby() is an alias for groupBy().
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  cols : list, str, int or Column
    The columns to group by. Each element should be a column name (str), a Column expression, a 1-based column ordinal (int), or a list of them.
- Returns
  GroupedData
    A GroupedData object representing the data grouped by the specified columns.
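For illustration (this sketch is not part of the original reference), the returned GroupedData holds no results by itself; computation happens only when an aggregate method is called on it. Assuming an active SparkSession named spark, as in the examples below:

>>> gd = spark.createDataFrame([("a", 1), ("a", 2)], ["k", "v"]).groupBy("k")
>>> gd.count().show()
+---+-----+
|  k|count|
+---+-----+
|  a|    2|
+---+-----+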
Notes
A column ordinal starts from 1, which is different from the 0-based __getitem__().
Examples
>>> df = spark.createDataFrame([
...     ("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)], schema=["name", "age"])
Example 1: Calling groupBy() with no columns triggers a global aggregation.
>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+
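Since groupby() is an alias for groupBy(), the same global aggregation can be spelled either way (an added illustration, not from the original page):

>>> df.groupby().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+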
Example 2: Group-by ‘name’, and specify a dictionary to calculate the sum of ‘age’.
>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show() +-----+--------+ | name|sum(age)| +-----+--------+ |Alice| 2| | Bob| 9| +-----+--------+
Example 3: Group-by ‘name’, and calculate maximum values.
>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
Example 4: Also group-by ‘name’, but using the column ordinal.
>>> df.groupBy(1).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
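As noted above, ordinals are 1-based while __getitem__() is 0-based, so groupBy(1) selects the same column as df[0]. An added sketch of the equivalent call:

>>> df.groupBy(df[0]).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+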
Example 5: Group-by ‘name’ and ‘age’, and calculate the number of rows in each group.
>>> df.groupBy(["name", df.age]).count().sort("name", "age").show() +-----+---+-----+ | name|age|count| +-----+---+-----+ |Alice| 2| 1| | Bob| 2| 2| | Bob| 5| 1| +-----+---+-----+
Example 6: Also group-by ‘name’ and ‘age’, but using the column ordinal.
>>> df.groupBy([df.name, 2]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+
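Beyond the single-aggregate helpers used above, several aggregates can be combined in one agg() call. An added illustration using pyspark.sql.functions (the aliases min_age, max_age, and n are arbitrary names chosen here):

>>> from pyspark.sql import functions as sf
>>> df.groupBy("name").agg(
...     sf.min("age").alias("min_age"),
...     sf.max("age").alias("max_age"),
...     sf.count("*").alias("n")).sort("name").show()
+-----+-------+-------+---+
| name|min_age|max_age|  n|
+-----+-------+-------+---+
|Alice|      2|      2|  1|
|  Bob|      2|      5|  3|
+-----+-------+-------+---+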