pyspark.sql.functions.array

pyspark.sql.functions.array(*cols)

Collection function: Creates a new array column from the input columns or column names.

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
cols : Column or str

Column names or Column objects to combine into an array. The columns must share a data type, or have types that Spark can coerce to a common type (see Example 4). A single list or tuple of column names or Column objects is also accepted (see Example 3).

Returns
Column

A new Column of array type, where each value is an array containing the corresponding values from the input columns.

Examples

Example 1: Basic usage of array function with column names.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array('name', 'occupation').alias("arr")).show()
+---------------+
|            arr|
+---------------+
|[Alice, doctor]|
|[Bob, engineer]|
+---------------+
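
To confirm the type Spark infers for the new column, one can print the schema. This is a small sketch, not one of the official examples, reusing the df defined above:

>>> df.select(sf.array('name', 'occupation').alias("arr")).printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: string (containsNull = true)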

Example 2: Usage of array function with Column objects.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array(df.name, df.occupation).alias("arr")).show()
+---------------+
|            arr|
+---------------+
|[Alice, doctor]|
|[Bob, engineer]|
+---------------+

Example 3: Usage of array function with a single list of column names.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", "doctor"), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array(['name', 'occupation']).alias("arr")).show()
+---------------+
|            arr|
+---------------+
|[Alice, doctor]|
|[Bob, engineer]|
+---------------+
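
An array column built this way can be indexed like any other array column. The following is a brief sketch, not one of the official examples, building on the same df:

>>> arr_df = df.select(sf.array('name', 'occupation').alias("arr"))
>>> arr_df.select(arr_df.arr[0].alias("first")).show()
+-----+
|first|
+-----+
|Alice|
|  Bob|
+-----+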

Example 4: Usage of array function with columns of different types.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [("Alice", 2, 22.2), ("Bob", 5, 36.1)],
...     ("name", "age", "weight"))
>>> df.select(sf.array(['age', 'weight']).alias("arr")).show()
+-----------+
|        arr|
+-----------+
|[2.0, 22.2]|
|[5.0, 36.1]|
+-----------+
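
Here the integer age values are upcast to double, the common type of the two input columns. As a quick check of the inferred type (a small sketch, not one of the official examples, reusing the df above):

>>> df.select(sf.array('age', 'weight').alias("arr")).dtypes
[('arr', 'array<double>')]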

Example 5: Usage of array function with a column containing null values.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("Alice", None), ("Bob", "engineer")],
...     ("name", "occupation"))
>>> df.select(sf.array('name', 'occupation').alias("arr")).show()
+---------------+
|            arr|
+---------------+
|  [Alice, NULL]|
|[Bob, engineer]|
+---------------+
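
Null entries are kept in the resulting array. If they are unwanted, they can be stripped afterwards; a minimal sketch, not one of the official examples, using array_compact (available since Spark 3.4.0) on the same df:

>>> df.select(sf.array_compact(sf.array('name', 'occupation')).alias("arr")).show()
+---------------+
|            arr|
+---------------+
|        [Alice]|
|[Bob, engineer]|
+---------------+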