pyspark.RDD.subtractByKey

RDD.subtractByKey(other: pyspark.rdd.RDD[Tuple[K, Any]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, V]]

Return each (key, value) pair in self that has no pair with matching key in other.

New in version 0.9.1.

Parameters

other : RDD
    another RDD
numPartitions : int, optional
    the number of partitions in the new RDD

Returns

RDD
    an RDD with the pairs from this RDD whose keys are not in other

See also

RDD.subtract()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> rdd2 = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(rdd1.subtractByKey(rdd2).collect())
[('b', 4), ('b', 5)]
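
The difference from RDD.subtract() can be illustrated with a plain-Python model of the two operations. This is only a sketch of the semantics on ordinary lists, not how Spark implements either method; the helper names `subtract_by_key` and `subtract` are hypothetical and are not part of the PySpark API. It assumes that RDD.subtract() compares whole elements, while subtractByKey compares keys only and ignores the values in other.

```python
def subtract_by_key(pairs, other_pairs):
    """Hypothetical model of subtractByKey: keep each (key, value) pair
    whose key does not appear as a key in other_pairs."""
    other_keys = {key for key, _ in other_pairs}
    return [(key, value) for key, value in pairs if key not in other_keys]


def subtract(items, other_items):
    """Hypothetical model of subtract: keep each element that is not
    equal to any element of other_items (whole-element comparison)."""
    other = set(other_items)
    return [item for item in items if item not in other]


# Same data as the doctest above, held in plain lists instead of RDDs.
pairs1 = [("a", 1), ("b", 4), ("b", 5), ("a", 2)]
pairs2 = [("a", 3), ("c", None)]

by_key = sorted(subtract_by_key(pairs1, pairs2))
whole = sorted(subtract(pairs1, pairs2))

print(by_key)  # [('b', 4), ('b', 5)]
print(whole)   # [('a', 1), ('a', 2), ('b', 4), ('b', 5)]
```

In this model the key "a" removes both ("a", 1) and ("a", 2) under the by-key rule even though their values differ from the one in the second list, whereas the whole-element rule removes nothing because no pair occurs in both lists.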
