pyspark.pandas.read_csv

pyspark.pandas.read_csv(path: str, sep: str = ',', header: Union[str, int, None] = 'infer', names: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, usecols: Union[List[int], List[str], Callable[[str], bool], None] = None, squeeze: bool = False, mangle_dupe_cols: bool = True, dtype: Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype, Dict[str, Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype]], None] = None, nrows: Optional[int] = None, parse_dates: bool = False, quotechar: Optional[str] = None, escapechar: Optional[str] = None, comment: Optional[str] = None, encoding: Optional[str] = None, **options: Any) → Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series]

Read a CSV (comma-separated values) file into a DataFrame or Series.

Parameters
path: str

Path of the CSV file to read.

sep: str, default ‘,’

Delimiter to use. Must be a single character.

header: int, str or None, default ‘infer’

Row to use for the column names, which also marks the start of the data. By default the column names are inferred: if no names are passed, the behavior is identical to header=0 and the column names are taken from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.

names: str or array-like, optional

List of column names to use. If the file contains no header row, you should explicitly pass header=None. Duplicates in this list will cause an error. If a string is given, it should be a DDL-formatted string in Spark SQL, which is preferred because it avoids schema inference and so performs better.

index_col: str or list of str, optional, default: None

Column name(s) to use as the index of the result.

usecols: list-like or callable, optional

Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.

squeeze: bool, default False

If the parsed data only contains one column then return a Series.

mangle_dupe_cols: bool, default True

Duplicate columns will be specified as ‘X0’, ‘X1’, … ‘XN’, rather than ‘X’ … ‘X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns. Currently only True is allowed.

dtype: type name or dict of column -> type, default None

Data type for data or columns, e.g. {‘a’: np.float64, ‘b’: np.int32}. Use str or object together with suitable na_values settings to preserve and not interpret dtype.

nrows: int, default None

Number of rows to read from the CSV file.

parse_dates: boolean or list of ints or names or list of lists or dict, default False

Currently only False is allowed.

quotechar: str (length 1), optional

The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

escapechar: str (length 1), default None

One-character string used to escape the delimiter.

comment: str, optional

Single character indicating that a line starting with it should not be parsed.

encoding: str, optional

Encoding to use when reading the file.

options: dict

All other options passed directly into Spark’s data source.
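
As a rough illustration of the parsing-related parameters, the sketch below reads a semicolon-delimited file with quoted fields and a comment line. The file contents and the helper names write_notes_csv and read_notes are hypothetical. nullValue is an option of Spark's CSV data source that is forwarded through **options. Calling read_notes needs a working Spark installation.

```python
import os
import tempfile

# Hypothetical sample: ';' as delimiter, single quotes around text fields,
# one comment line, and 'NA' standing for a missing score.
SAMPLE = (
    "id;note;score\n"
    "# this line should be skipped\n"
    "1;'uses ; inside quotes';3.5\n"
    "2;'plain';NA\n"
)


def write_notes_csv(directory=None):
    """Write the sample file and return its path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "notes.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    return path


def read_notes(path):
    """Read the sample file with pandas-on-Spark (requires a Spark runtime)."""
    # Imported here so the helper above can be used without Spark installed.
    import pyspark.pandas as ps

    return ps.read_csv(
        path,
        sep=";",            # single-character delimiter
        quotechar="'",      # delimiters inside quoted items are ignored
        comment="#",        # marks lines that should not be parsed
        encoding="UTF-8",
        nullValue="NA",     # extra keyword, passed on to Spark's data source
    )
```

With Spark available, read_notes(write_notes_csv()) is expected to return a DataFrame with the columns id, note and score.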

Returns
DataFrame or Series

See also

DataFrame.to_csv

Write DataFrame to a comma-separated values (csv) file.

Examples

>>> import pyspark.pandas as ps
>>> ps.read_csv('data.csv')
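
The following sketch shows how header, names, usecols, index_col, dtype and nrows might be combined. The file people.csv, its columns, and the helper names write_people_csv and read_examples are assumptions made for illustration. Calling read_examples needs a working Spark installation.

```python
import csv
import os
import tempfile


def write_people_csv(directory=None):
    """Write a small, hypothetical CSV file and return its path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "people.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "age"])  # header row
        writer.writerow([1, "Ann", 34])
        writer.writerow([2, "Bob", 27])
    return path


def read_examples(path):
    """Read the sample file in several ways (requires a Spark runtime)."""
    # Imported here so the helper above can be used without Spark installed.
    import pyspark.pandas as ps

    # Default header='infer': column names come from the first line.
    inferred = ps.read_csv(path)

    # header=0 together with names replaces the names found in the file.
    renamed = ps.read_csv(
        path, header=0, names=["person_id", "full_name", "years"]
    )

    # Keep a subset of the columns and use 'id' as the index.
    subset = ps.read_csv(path, usecols=["id", "name"], index_col="id")

    # Request a dtype per column and read only the first data row.
    typed = ps.read_csv(
        path, dtype={"id": "int64", "age": "float64"}, nrows=1
    )

    return inferred, renamed, subset, typed
```

With Spark available, read_examples(write_people_csv()) returns the four results in the order shown.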