pyspark.pandas.melt#

pyspark.pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value')[source]#

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters
frameDataFrame
id_varstuple, list, or ndarray, optional

Column(s) to use as identifier variables.

value_varstuple, list, or ndarray, optional

Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

var_namescalar, default ‘variable’

Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

value_namescalar, default ‘value’

Name to use for the ‘value’ column.

Returns
DataFrame

Unpivoted DataFrame.

Examples

>>> df = ps.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}},
...                   columns=['A', 'B', 'C'])
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> ps.melt(df)
  variable value
0        A     a
1        B     1
2        C     2
3        A     b
4        B     3
5        C     4
6        A     c
7        B     5
8        C     6
>>> df.melt(id_vars='A')
   A variable  value
0  a        B      1
1  a        C      2
2  b        B      3
3  b        C      4
4  c        B      5
5  c        C      6
>>> df.melt(value_vars='A')
  variable value
0        A     a
1        A     b
2        A     c
>>> ps.melt(df, id_vars=['A', 'B'])
   A  B variable  value
0  a  1        C      2
1  b  3        C      4
2  c  5        C      6
>>> df.melt(id_vars=['A'], value_vars=['C'])
   A variable  value
0  a        C      2
1  b        C      4
2  c        C      6

The names of ‘variable’ and ‘value’ columns can be customized:

>>> ps.melt(df, id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5