Advanced Array Manipulation in NumPy

Reshaping, Stacking, Splitting

In the world of data science and numerical computing, NumPy stands as one of the most popular libraries due to its high-speed operations, flexibility, and compatibility with a wide range of other libraries. NumPy’s power is especially pronounced when it comes to handling arrays, which are fundamental structures in most computational tasks.

Among the operations you typically perform with NumPy, array manipulation stands out. Reshaping, stacking, and splitting arrays are advanced actions that offer great control over your data, allowing for more efficient data analysis, transformation, and visualization. Understanding how to use these features effectively will enable you to leverage the full power of NumPy, leading to cleaner code and faster computations.

In this article, we will delve into these three types of array manipulation. We’ll start by exploring reshaping, which allows you to change the dimensions of your arrays while preserving their data. Then we will dive into stacking, where we’ll learn how to combine different arrays along a specified axis. Finally, we’ll turn our attention to splitting, the process of dividing larger arrays into several smaller ones.

Reshaping

Reshaping allows you to change the structure of an array without altering its data, providing a versatile tool for viewing and accessing your data in different ways. NumPy offers several methods for this, and this section covers three of them: reshape, flatten, and ravel. We will walk through concrete examples and discuss their differences and main use cases.

The numpy.reshape(a, newshape) function is the most commonly used function for reshaping an array. It gives an array a new shape without changing its data. Here, a is the array to be reshaped, and newshape is an integer or a tuple of integers describing the target shape. (Recent NumPy releases call this parameter shape, so passing it positionally works across versions.)

Consider a simple 1-dimensional array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
array([1, 2, 3, 4, 5, 6, 7, 8])

We can reshape this 1D array into a 2D-array, say, with 2 rows and 4 columns:

np.reshape(arr, (2, 4))
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

or 4 rows and 2 columns:

np.reshape(arr, (4, 2))
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

Alternatively, we can directly call the reshape method on the array:

arr.reshape(4, 2)
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

Of course, reshaping is not limited to 2 dimensions. We can reshape the same array to a 3D array of shape 2x2x2:

arr.reshape(2, 2, 2)
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])

But be careful! The total number of elements must stay the same. The examples above worked because 8 = 2 x 4 = 4 x 2 = 2 x 2 x 2. We cannot reshape to an incompatible shape. For instance,

arr.reshape(2, 3)

would raise a ValueError, because 8 elements cannot fill a 2 x 3 array.
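
A minimal sketch of what that failure looks like in practice, assuming the 8-element arr from above. The try/except wrapper and the name error_message are illustrative only and not part of NumPy.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# 8 elements cannot fill a 2x3 grid (6 slots), so NumPy refuses.
try:
    arr.reshape(2, 3)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Catching the error like this is mostly useful when the target shape comes from user input or from another array whose size you do not control.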

However, there is one shortcut that helps you avoid shape mismatch: in the reshape function in NumPy, the -1 value is used as a placeholder value that means “infer the correct value from the length of the array.”

When reshaping, you may not always know all the dimensions you want for the new shape, but you do know the total size should remain the same. So, you can specify one of the dimensions as -1, and NumPy will automatically calculate the correct value based on the array’s total size and the other dimensions you provided. For example, for the array defined above, we could have written

arr.reshape(2, -1)
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In this case, the -1 is effectively a placeholder for ‘4’. We’re saying, “reshape this into an array with 2 rows and however many columns are needed to accommodate all the data.” So NumPy automatically calculates that we need 4 columns to accommodate the 8 total elements when arranged in 2 rows.

Analogously, we could have said

arr.reshape(-1, 2)
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

to reshape the array into 4 rows and 2 columns.
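
The same inference works with more than two dimensions, as long as only one dimension is left open. The sketch below assumes the same 8-element array; the names cube and two_placeholders_ok are made up for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# One unknown dimension among three: NumPy solves 8 / (2 * 2) = 2.
cube = arr.reshape(2, -1, 2)
print(cube.shape)  # (2, 2, 2)

# Two unknowns would be ambiguous, so NumPy raises a ValueError.
try:
    arr.reshape(-1, -1)
    two_placeholders_ok = True
except ValueError:
    two_placeholders_ok = False

print(two_placeholders_ok)  # False
```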

This placeholder technique also works if the placeholder is the only dimension you specify. This effectively flattens a multidimensional array to a one-dimensional array:

arr2 = arr.reshape(2, -1)
arr2.reshape(-1)
array([1, 2, 3, 4, 5, 6, 7, 8])

This comes in handy if you want to apply matrix multiplications to multidimensional grids. The same can be achieved by the more explicit method flatten:

arr2.flatten()
array([1, 2, 3, 4, 5, 6, 7, 8])

or by the method ravel:

arr2.ravel()
array([1, 2, 3, 4, 5, 6, 7, 8])

While flatten and ravel perform similar operations, there is a crucial difference between them. The flatten function always returns a copy of the original array, while ravel returns a view of the original array whenever possible. This makes ravel more memory efficient, but changes to the output array can affect the original array.
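
The difference is easiest to see by writing into both results. This is a small sketch; the names grid, flat_copy, and flat_view are chosen here for illustration.

```python
import numpy as np

grid = np.arange(8).reshape(2, 4)

flat_copy = grid.flatten()  # always a copy
flat_view = grid.ravel()    # a view here, because grid is contiguous in memory

flat_copy[0] = 100          # does not touch grid
flat_view[1] = 200          # writes through to grid

print(grid[0, 0], grid[0, 1])             # 0 200
print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True
```

If you intend to modify the flattened data without side effects, flatten is the safer choice; if you only read from it, ravel avoids the extra allocation.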

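Another important thing to know about reshape is that it returns a view of the original data whenever possible, not a copy. The result is a new array object, but it shares its memory with the original:

arr2 = arr.reshape(2, 4)
arr2 is arr
False
np.shares_memory(arr, arr2)
True

So when you modify the reshaped array in place, the original array changes as well. If you need independent data, make an explicit copy with arr.reshape(2, 4).copy().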
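
Whether a reshaped array shares memory with its source can be checked directly with np.shares_memory. The sketch below covers three cases: a plain reshape, an explicit copy, and a reshape of a transposed (non-contiguous) array, where NumPy has to copy. All variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Plain reshape of a contiguous array: a view onto the same buffer.
reshaped = arr.reshape(2, 4)
reshaped[0, 0] = 99
print(arr[0])                           # 99
print(np.shares_memory(arr, reshaped))  # True

# An explicit copy is independent of the original.
independent = arr.reshape(2, 4).copy()
independent[0, 1] = -1
print(arr[1])                           # 2

# A transposed array is not contiguous, so flattening it forces a copy.
transposed = arr.reshape(2, 4).T
flattened_t = transposed.reshape(-1)
print(np.shares_memory(arr, flattened_t))  # False
```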

Stacking

Stacking in NumPy refers to the process of combining different arrays along a specified axis. This can be achieved in various ways, depending on the desired output. We’ll delve into each one in this section.

Let’s start with vertical stacking. The easiest way to see how this works is by example. Consider two simple 1-dimensional arrays:

import numpy as np
arr1 = np.array([0, 1, 2, 3])
arr2 = np.array([4, 5, 6, 7])
array([0, 1, 2, 3])
array([4, 5, 6, 7])
arr_stacked = np.vstack((arr1, arr2))
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

So vstack treated arr1 and arr2 as row vectors and stacked them vertically, row over row. Note that vstack expects a single sequence of arrays (here a tuple) as its argument, so we need the additional parentheses. We are not restricted to stacking only 2 arrays. For example:

np.vstack((arr1, arr2, arr1, arr2))
array([[0, 1, 2, 3],
       [4, 5, 6, 7],
       [0, 1, 2, 3],
       [4, 5, 6, 7]])

What happens if we define our 1D arrays explicitly as column vectors? Well,

np.vstack((arr1.reshape(-1, 1), arr2.reshape(-1, 1)))
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7]])

vstack has again stacked vertically rows over rows, but this time, our original arrays had more than just 1 row each. As you can see, vstack behaves quite intuitively.

We are not restricted to using 1D arrays for stacking. In fact, we can stack arrays of any dimension. For example,

arr3 = np.arange(8).reshape(2, -1)
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
arr4 = np.arange(8, 16).reshape(2, -1)
array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])
np.vstack((arr3, arr4))
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

The only condition is that the sizes of all axes except the first (rows; index 0) must match.
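
A short sketch of that condition, using made-up arrays wide, also_wide, and narrow: a different number of rows is fine for vstack, a different number of columns is not.

```python
import numpy as np

wide = np.zeros((2, 4))
also_wide = np.ones((3, 4))  # different row count: allowed
narrow = np.ones((2, 3))     # different column count: not allowed

stacked = np.vstack((wide, also_wide))
print(stacked.shape)  # (5, 4)

try:
    np.vstack((wide, narrow))
    mismatch_ok = True
except ValueError:
    mismatch_ok = False

print(mismatch_ok)  # False
```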

So this was vertical stacking. As you can imagine, there is also horizontal stacking, and the function for it is aptly called hstack. Its usage is analogous to vstack. In vertical stacking, we stacked rows over rows, that is, we appended our arrays along axis 0. In horizontal stacking, we stack columns next to columns, that is, we append our arrays along axis 1. The one exception is 1D arrays, which have only axis 0, so hstack simply joins them end to end. Again, an example should clarify possible confusion:

print(f'arr1 = {arr1}, arr2 = {arr2}')
np.hstack((arr1, arr2))
arr1 = [0 1 2 3], arr2 = [4 5 6 7]
array([0, 1, 2, 3, 4, 5, 6, 7])

or

print(f'arr3: \n{arr3}\narr4:\n{arr4}')
np.hstack((arr3, arr4))
arr3: 
[[0 1 2 3]
 [4 5 6 7]]
arr4:
[[ 8  9 10 11]
 [12 13 14 15]]

array([[ 0,  1,  2,  3,  8,  9, 10, 11],
       [ 4,  5,  6,  7, 12, 13, 14, 15]])

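By the way, there are also functions row_stack and column_stack in NumPy. row_stack is simply an alias for vstack (and is deprecated as of NumPy 2.0 in favor of vstack). column_stack matches hstack for 2D inputs, but it treats 1D arrays as columns and returns a 2D array instead of joining them end to end.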
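
A quick way to compare hstack and column_stack is to feed both the same inputs and look at the resulting shapes. The sketch below reuses arr1 and arr2 from above; the names h, c, left, and right are illustrative.

```python
import numpy as np

arr1 = np.array([0, 1, 2, 3])
arr2 = np.array([4, 5, 6, 7])

# 1D inputs: hstack joins end to end, column_stack builds a 2D array.
h = np.hstack((arr1, arr2))
c = np.column_stack((arr1, arr2))
print(h.shape)  # (8,)
print(c.shape)  # (4, 2)

# 2D inputs: both functions give the same result.
left = np.arange(8).reshape(2, 4)
right = np.arange(8, 16).reshape(2, 4)
same_for_2d = np.array_equal(np.hstack((left, right)),
                             np.column_stack((left, right)))
print(same_for_2d)  # True
```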

So we had vertical (appending along axis 0) and horizontal (appending along axis 1) stacking. NumPy also has a more general stack function. It joins arrays of identical shape along a new axis, so the result has one more dimension than the inputs. You pass the arrays to be stacked and then the position of the new axis. For example:

np.stack((arr1, arr2), axis=0)
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
np.stack((arr1, arr2, arr2), axis=1)
array([[0, 4, 4],
       [1, 5, 5],
       [2, 6, 6],
       [3, 7, 7]])
arr5 = np.arange(8).reshape(2, 4)
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
np.stack((arr5, 2*arr5, 3*arr5), axis=2).shape
(2, 4, 3)
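
To see how the axis argument of stack positions the new dimension, it can help to print the resulting shape for every possible axis and compare it with np.concatenate, which joins along an existing axis instead. This is only a sketch; stack_shapes and joined are names invented here.

```python
import numpy as np

arr5 = np.arange(8).reshape(2, 4)
arrays = (arr5, 2 * arr5, 3 * arr5)

# stack inserts a new axis of length 3 at the requested position.
stack_shapes = {axis: np.stack(arrays, axis=axis).shape for axis in range(3)}
print(stack_shapes)  # {0: (3, 2, 4), 1: (2, 3, 4), 2: (2, 4, 3)}

# concatenate keeps the number of dimensions and grows an existing axis.
joined = np.concatenate(arrays, axis=0)
print(joined.shape)  # (6, 4)
```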

Splitting

Splitting is another crucial operation in NumPy that essentially does the opposite of stacking. It divides a larger array into several smaller ones. Like stacking, there are different types of splitting operations available in NumPy, depending on your needs.

The split function is the most basic method of splitting in NumPy. The syntax is numpy.split(ary, indices_or_sections, axis=0). Here, ary is the array to be split, and indices_or_sections is either an integer, specifying the number of equal-sized sub-arrays to create, or a sorted 1-D sequence of indices at which to split. The axis parameter specifies the axis along which to split, with the default being 0.

For instance,

arr = np.array([0, 1, 2, 3, 4, 5])
np.split(arr, 2, axis=0)
[array([0, 1, 2]), array([3, 4, 5])]

splits arr into 2 equal-sized sub-arrays.

Note that this only works because the size of our original array along axis 0 is divisible by 2. If we had

arr = np.array([0, 1, 2, 3, 4, 5, 6])
np.split(arr, 2, axis=0)

this would raise a ValueError, because 7 elements cannot be divided into 2 equal parts. But we can use the alternative syntax for the indices_or_sections parameter, defining points at which to split:

arr = np.array([0, 1, 2, 3, 4, 5, 6])
np.split(arr, [3], axis=0)
[array([0, 1, 2]), array([3, 4, 5, 6])]

or

np.split(arr, [3, 5], axis=0)
[array([0, 1, 2]), array([3, 4]), array([5, 6])]
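
One way to read the index list is as slice boundaries: [3, 5] produces arr[:3], arr[3:5], and arr[5:]. The sketch below checks this. It also shows np.array_split, a related NumPy function not covered in this article, which accepts an integer even when the array cannot be divided evenly. The names parts, manual, and uneven are illustrative.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6])

# Split points behave like slice boundaries.
parts = np.split(arr, [3, 5])
manual = [arr[:3], arr[3:5], arr[5:]]
same = all(np.array_equal(p, m) for p, m in zip(parts, manual))
print(same)  # True

# array_split tolerates uneven division; the earlier pieces get the extra elements.
uneven = np.array_split(arr, 2)
print([piece.tolist() for piece in uneven])  # [[0, 1, 2, 3], [4, 5, 6]]
```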

But what does axis=0 actually do? A 1-dimensional array has only one axis (axis 0), so when splitting such an array, you can leave axis at its default of 0 or omit it entirely.

However, for multi-dimensional arrays, you can specify the axis to split along. For example, consider a 2-dimensional array. If axis=0, the array will be split along the rows. This means that each subarray will have fewer rows than the original, but the same number of columns. If axis=1, the array will be split along the columns. Each subarray will have fewer columns than the original, but the same number of rows.

import numpy as np

# Define a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Split array along axis 0 (rows):

print(np.split(arr, 3, axis=0))
[array([[1, 2, 3]]), array([[4, 5, 6]]), array([[7, 8, 9]])]

Split array along axis 1 (columns):

print(np.split(arr, 3, axis=1))
[array([[1],
       [4],
       [7]]), array([[2],
       [5],
       [8]]), array([[3],
       [6],
       [9]])]

In this case, splitting along axis 0 will produce three arrays each containing one of the original array’s rows. Splitting along axis 1 will produce three arrays each containing one of the original array’s columns.

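As with stacking, there are also the convenience functions hsplit and vsplit, which do what you would expect. vsplit is equivalent to split with axis=0 (and requires an array with at least 2 dimensions), and hsplit is equivalent to split with axis=1, except for 1D arrays, which it splits along axis 0.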
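
The sketch below checks these equivalences on the 3x3 array from above and shows that stacking the pieces back together recovers the original. Names such as rows_match and roundtrip_rows are made up for this example.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

rows = np.vsplit(arr, 3)  # three arrays of shape (1, 3)
cols = np.hsplit(arr, 3)  # three arrays of shape (3, 1)

# Compare against split with an explicit axis.
rows_match = all(np.array_equal(a, b)
                 for a, b in zip(rows, np.split(arr, 3, axis=0)))
cols_match = all(np.array_equal(a, b)
                 for a, b in zip(cols, np.split(arr, 3, axis=1)))
print(rows_match, cols_match)  # True True

# Splitting and stacking along the same axis are inverse operations.
roundtrip_rows = np.vstack(rows)
roundtrip_cols = np.hstack(cols)
print(np.array_equal(roundtrip_rows, arr))  # True
print(np.array_equal(roundtrip_cols, arr))  # True
```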

Conclusion

Over the course of this article, we’ve ventured deep into the realm of advanced array manipulation with NumPy, focusing on three key operations: reshaping, stacking, and splitting. We have dissected these operations, demonstrating their potential in simplifying complex data manipulation tasks, and equipping you with powerful tools for your data science journey.