pyspark.sql.functions.count_distinct
Returns a new Column holding the number of distinct values of col, or of distinct combinations of values when cols are also given.
New in version 3.2.0.
Changed in version 3.4.0: Supports Spark Connect.

Parameters
col: first column to compute on.
cols: other columns to compute on.

Returns
Column: the number of distinct combinations of values across the given columns.
Examples
>>> from pyspark.sql import types
>>> from pyspark.sql.functions import count_distinct
>>> df1 = spark.createDataFrame([1, 1, 3], types.IntegerType())
>>> df2 = spark.createDataFrame([1, 2], types.IntegerType())
>>> df1.join(df2).show()
+-----+-----+
|value|value|
+-----+-----+
|    1|    1|
|    1|    2|
|    1|    1|
|    1|    2|
|    3|    1|
|    3|    2|
+-----+-----+
>>> df1.join(df2).select(count_distinct(df1.value, df2.value)).show()
+----------------------------+
|count(DISTINCT value, value)|
+----------------------------+
|                           4|
+----------------------------+
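
The count of 4 in the example can be checked by hand. The sketch below is not Spark code: it is a plain-Python illustration, assuming that a join with no condition pairs every row of df1 with every row of df2 and that the distinct count is taken over whole (value, value) pairs. The names left_values, right_values, joined_rows and distinct_pairs are hypothetical and exist only for this illustration.

```python
from itertools import product

# Plain-Python stand-ins for df1 and df2 (hypothetical names, not Spark objects).
left_values = [1, 1, 3]
right_values = [1, 2]

# A join with no condition pairs every left row with every right row.
joined_rows = list(product(left_values, right_values))

# Counting distinct (value, value) combinations amounts to the size of a set.
distinct_pairs = set(joined_rows)
distinct_pair_count = len(distinct_pairs)

print(len(joined_rows))        # 6 rows, matching the joined table
print(sorted(distinct_pairs))  # [(1, 1), (1, 2), (3, 1), (3, 2)]
print(distinct_pair_count)     # 4
```

The duplicated 1 in df1 produces the pairs (1, 1) and (1, 2) twice, so six joined rows collapse to four distinct combinations.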