pyspark.
RDD
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
Methods
aggregate(zeroValue, seqOp, combOp)
aggregate
Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.”
aggregateByKey(zeroValue, seqFunc, combFunc)
aggregateByKey
Aggregate the values of each key, using given combine functions and a neutral “zero value”.
barrier()
barrier
Marks the current stage as a barrier stage, where Spark must launch all tasks together.
cache()
cache
Persist this RDD with the default storage level (MEMORY_ONLY).
cartesian(other)
cartesian
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
(a, b)
a
b
checkpoint()
checkpoint
Mark this RDD for checkpointing.
cleanShuffleDependencies([blocking])
cleanShuffleDependencies
Removes an RDD’s shuffles and it’s non-persisted ancestors.
coalesce(numPartitions[, shuffle])
coalesce
Return a new RDD that is reduced into numPartitions partitions.
cogroup(other[, numPartitions])
cogroup
For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other.
collect()
collect
Return a list that contains all of the elements in this RDD.
collectAsMap()
collectAsMap
Return the key-value pairs in this RDD to the master as a dictionary.
collectWithJobGroup(groupId, description[, …])
collectWithJobGroup
When collect rdd, use this method to specify job group.
combineByKey(createCombiner, mergeValue, …)
combineByKey
Generic function to combine the elements for each key using a custom set of aggregation functions.
count()
count
Return the number of elements in this RDD.
countApprox(timeout[, confidence])
countApprox
Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApproxDistinct([relativeSD])
countApproxDistinct
Return approximate number of distinct elements in the RDD.
countByKey()
countByKey
Count the number of elements for each key, and return the result to the master as a dictionary.
countByValue()
countByValue
Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
distinct([numPartitions])
distinct
Return a new RDD containing the distinct elements in this RDD.
filter(f)
filter
Return a new RDD containing only the elements that satisfy a predicate.
first()
first
Return the first element in this RDD.
flatMap(f[, preservesPartitioning])
flatMap
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
flatMapValues(f)
flatMapValues
Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD’s partitioning.
fold(zeroValue, op)
fold
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.”
foldByKey(zeroValue, func[, numPartitions, …])
foldByKey
Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
foreach(f)
foreach
Applies a function to all elements of this RDD.
foreachPartition(f)
foreachPartition
Applies a function to each partition of this RDD.
fullOuterJoin(other[, numPartitions])
fullOuterJoin
Perform a right outer join of self and other.
getCheckpointFile()
getCheckpointFile
Gets the name of the file to which this RDD was checkpointed
getNumPartitions()
getNumPartitions
Returns the number of partitions in RDD
getResourceProfile()
getResourceProfile
Get the pyspark.resource.ResourceProfile specified with this RDD or None if it wasn’t specified.
pyspark.resource.ResourceProfile
getStorageLevel()
getStorageLevel
Get the RDD’s current storage level.
glom()
glom
Return an RDD created by coalescing all elements within each partition into a list.
groupBy(f[, numPartitions, partitionFunc])
groupBy
Return an RDD of grouped items.
groupByKey([numPartitions, partitionFunc])
groupByKey
Group the values for each key in the RDD into a single sequence.
groupWith(other, *others)
groupWith
Alias for cogroup but with support for multiple RDDs.
histogram(buckets)
histogram
Compute a histogram using the provided buckets.
id()
id
A unique ID for this RDD (within its SparkContext).
intersection(other)
intersection
Return the intersection of this RDD and another one.
isCheckpointed()
isCheckpointed
Return whether this RDD is checkpointed and materialized, either reliably or locally.
isEmpty()
isEmpty
Returns true if and only if the RDD contains no elements at all.
isLocallyCheckpointed()
isLocallyCheckpointed
Return whether this RDD is marked for local checkpointing.
join(other[, numPartitions])
join
Return an RDD containing all pairs of elements with matching keys in self and other.
keyBy(f)
keyBy
Creates tuples of the elements in this RDD by applying f.
keys()
keys
Return an RDD with the keys of each tuple.
leftOuterJoin(other[, numPartitions])
leftOuterJoin
Perform a left outer join of self and other.
localCheckpoint()
localCheckpoint
Mark this RDD for local checkpointing using Spark’s existing caching layer.
lookup(key)
lookup
Return the list of values in the RDD for key key.
map(f[, preservesPartitioning])
map
Return a new RDD by applying a function to each element of this RDD.
mapPartitions(f[, preservesPartitioning])
mapPartitions
Return a new RDD by applying a function to each partition of this RDD.
mapPartitionsWithIndex(f[, …])
mapPartitionsWithIndex
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
mapPartitionsWithSplit(f[, …])
mapPartitionsWithSplit
mapValues(f)
mapValues
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.
max([key])
max
Find the maximum item in this RDD.
mean()
mean
Compute the mean of this RDD’s elements.
meanApprox(timeout[, confidence])
meanApprox
Approximate operation to return the mean within a timeout or meet the confidence.
min([key])
min
Find the minimum item in this RDD.
name()
name
Return the name of this RDD.
partitionBy(numPartitions[, partitionFunc])
partitionBy
Return a copy of the RDD partitioned using the specified partitioner.
persist([storageLevel])
persist
Set this RDD’s storage level to persist its values across operations after the first time it is computed.
pipe(command[, env, checkCode])
pipe
Return an RDD created by piping elements to a forked external process.
randomSplit(weights[, seed])
randomSplit
Randomly splits this RDD with the provided weights.
reduce(f)
reduce
Reduces the elements of this RDD using the specified commutative and associative binary operator.
reduceByKey(func[, numPartitions, partitionFunc])
reduceByKey
Merge the values for each key using an associative and commutative reduce function.
reduceByKeyLocally(func)
reduceByKeyLocally
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
repartition(numPartitions)
repartition
Return a new RDD that has exactly numPartitions partitions.
repartitionAndSortWithinPartitions([…])
repartitionAndSortWithinPartitions
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
rightOuterJoin(other[, numPartitions])
rightOuterJoin
sample(withReplacement, fraction[, seed])
sample
Return a sampled subset of this RDD.
sampleByKey(withReplacement, fractions[, seed])
sampleByKey
Return a subset of this RDD sampled by key (via stratified sampling).
sampleStdev()
sampleStdev
Compute the sample standard deviation of this RDD’s elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
sampleVariance()
sampleVariance
Compute the sample variance of this RDD’s elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
saveAsHadoopDataset(conf[, keyConverter, …])
saveAsHadoopDataset
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
RDD[(K, V)]
saveAsHadoopFile(path, outputFormatClass[, …])
saveAsHadoopFile
saveAsNewAPIHadoopDataset(conf[, …])
saveAsNewAPIHadoopDataset
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
saveAsNewAPIHadoopFile(path, outputFormatClass)
saveAsNewAPIHadoopFile
saveAsPickleFile(path[, batchSize])
saveAsPickleFile
Save this RDD as a SequenceFile of serialized objects.
saveAsSequenceFile(path[, compressionCodecClass])
saveAsSequenceFile
Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the “org.apache.hadoop.io.Writable” types that we convert from the RDD’s key and value types.
saveAsTextFile(path[, compressionCodecClass])
saveAsTextFile
Save this RDD as a text file, using string representations of elements.
setName(name)
setName
Assign a name to this RDD.
sortBy(keyfunc[, ascending, numPartitions])
sortBy
Sorts this RDD by the given keyfunc
sortByKey([ascending, numPartitions, keyfunc])
sortByKey
Sorts this RDD, which is assumed to consist of (key, value) pairs.
stats()
stats
Return a StatCounter object that captures the mean, variance and count of the RDD’s elements in one operation.
StatCounter
stdev()
stdev
Compute the standard deviation of this RDD’s elements.
subtract(other[, numPartitions])
subtract
Return each value in self that is not contained in other.
subtractByKey(other[, numPartitions])
subtractByKey
Return each (key, value) pair in self that has no pair with matching key in other.
sum()
sum
Add up the elements in this RDD.
sumApprox(timeout[, confidence])
sumApprox
Approximate operation to return the sum within a timeout or meet the confidence.
take(num)
take
Take the first num elements of the RDD.
takeOrdered(num[, key])
takeOrdered
Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
takeSample(withReplacement, num[, seed])
takeSample
Return a fixed-size sampled subset of this RDD.
toDF([schema, sampleRatio])
toDF
toDebugString()
toDebugString
A description of this RDD and its recursive dependencies for debugging.
toLocalIterator([prefetchPartitions])
toLocalIterator
Return an iterator that contains all of the elements in this RDD.
top(num[, key])
top
Get the top N elements from an RDD.
treeAggregate(zeroValue, seqOp, combOp[, depth])
treeAggregate
Aggregates the elements of this RDD in a multi-level tree pattern.
treeReduce(f[, depth])
treeReduce
Reduces the elements of this RDD in a multi-level tree pattern.
union(other)
union
Return the union of this RDD and another one.
unpersist([blocking])
unpersist
Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
values()
values
Return an RDD with the values of each tuple.
variance()
variance
Compute the variance of this RDD’s elements.
withResources(profile)
withResources
Specify a pyspark.resource.ResourceProfile to use when calculating this RDD.
zip(other)
zip
Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc.
zipWithIndex()
zipWithIndex
Zips this RDD with its element indices.
zipWithUniqueId()
zipWithUniqueId
Zips this RDD with generated unique Long ids.
Attributes
context
The SparkContext that this RDD was created on.
SparkContext