Advanced Array Manipulation in NumPy
In the world of data science and numerical computing, NumPy stands as one of the most popular libraries due to its high-speed operations, flexibility, and compatibility with a wide range of other libraries. NumPy’s power is especially pronounced when it comes to handling arrays, which are fundamental structures in most computational tasks.
Among the operations you typically perform with NumPy, array manipulation stands out. Reshaping, stacking, and splitting arrays are advanced actions that offer great control over your data, allowing for more efficient data analysis, transformation, and visualization. Understanding how to use these features effectively will enable you to leverage the full power of NumPy, leading to cleaner code and faster computations.
In this article, we will delve into these three types of array manipulation. We’ll start by exploring reshaping, which allows you to change the dimensions of your arrays while preserving their data. Then we will dive into stacking, where we’ll learn how to combine different arrays along a specified axis. Finally, we’ll turn our attention to splitting, the process of dividing larger arrays into several smaller ones.
Daily Dose of Scientific Python
Reshaping
Reshaping allows you to change the structure of an array without altering its data, providing a versatile tool for viewing and accessing your data in different ways. In NumPy, this is done with several reshaping methods, which we’ll cover in this section: reshape, flatten, and ravel. We will work through concrete examples and discuss their differences and main use cases.
The numpy.reshape(a, newshape) function is the most commonly used function for reshaping an array. It allows you to provide a new shape to an array without changing its data. Here, a is the array to be reshaped, and newshape is a tuple of integers describing the target shape.
Consider a simple 1-dimensional array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
array([1, 2, 3, 4, 5, 6, 7, 8])
We can reshape this 1D array into a 2D-array, say, with 2 rows and 4 columns:
np.reshape(arr, (2, 4))
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
or 4 rows and 2 columns:
np.reshape(arr, (4, 2))
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]])
Alternatively, we can directly call the reshape method on the array:
arr.reshape(4, 2)
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]])
Of course, reshaping is not limited to 2 dimensions. We can reshape the same array to a 3D array of shape 2x2x2:
arr.reshape(2, 2, 2)
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
But be careful! The total number of elements must stay the same. The examples above worked because 8 = 2 x 4 = 4 x 2 = 2 x 2 x 2. We cannot reshape to an incompatible shape: for instance, arr.reshape(2, 3) would raise a ValueError, because 2 x 3 = 6 is not 8.
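As a small illustration of the size rule, the following sketch catches the error from an incompatible reshape and shows a simple pre-check. The variable names (target, shape_fits, reshape_failed) are made up for this example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# 8 elements cannot fill a 2x3 grid (6 slots), so NumPy refuses.
try:
    arr.reshape(2, 3)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print(f"reshape failed: {exc}")

# A cheap pre-check: the product of the target shape must equal arr.size.
target = (2, 3)
shape_fits = int(np.prod(target)) == arr.size
print(shape_fits)  # False
```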
However, there is one shortcut that helps you avoid shape mismatch: in the reshape function in NumPy, the -1 value is used as a placeholder value that means “infer the correct value from the length of the array.”
When reshaping, you may not always know all the dimensions you want for the new shape, but you do know the total size should remain the same. So, you can specify one of the dimensions as -1, and NumPy will automatically calculate the correct value based on the array’s total size and the other dimensions you provided. For example, for the array defined above, we could have written
arr.reshape(2, -1)
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
In this case, the -1 is effectively a placeholder for ‘4’. We’re saying, “reshape this into an array with 2 rows and however many columns are needed to accommodate all the data.” NumPy then calculates that 4 columns are needed to hold the 8 elements when they are arranged in 2 rows.
Analogously, we could have said
arr.reshape(-1, 2)
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]])
to reshape the array into 4 rows and 2 columns.
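The placeholder has two limits worth knowing, sketched below under the assumption of the same 8-element array: only one dimension may be -1, and the dimensions you do specify must divide the total size evenly. The flag names are illustrative only.

```python
import numpy as np

arr = np.arange(1, 9)  # same values as the array above: 1..8

# -1 asks NumPy to infer that dimension from the total size.
inferred = arr.reshape(2, 2, -1)
print(inferred.shape)  # (2, 2, 2)

# Only one dimension may be inferred; two placeholders are ambiguous.
try:
    arr.reshape(-1, -1)
    two_placeholders_ok = True
except ValueError:
    two_placeholders_ok = False

# The known dimensions must still divide the total size evenly.
try:
    arr.reshape(3, -1)
    uneven_ok = True
except ValueError:
    uneven_ok = False

print(two_placeholders_ok, uneven_ok)  # False False
```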
This placeholder technique also works if you only specify a placeholder. This effectively flattens a multidimensional array to a one-dimensional array:
arr2 = arr.reshape(2, -1)
arr2.reshape(-1)
array([1, 2, 3, 4, 5, 6, 7, 8])
This comes in handy if you want to apply matrix multiplications to multidimensional grids. The same can be achieved by the more explicit method flatten:
arr2.flatten()
array([1, 2, 3, 4, 5, 6, 7, 8])
or by the method ravel:
arr2.ravel()
array([1, 2, 3, 4, 5, 6, 7, 8])
While flatten and ravel perform similar operations, there is a crucial difference between them. The flatten function always returns a copy of the original array, while ravel returns a view of the original array whenever possible. This makes ravel more memory efficient, but changes to the output array can affect the original array.
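The copy-versus-view difference can be made visible with np.shares_memory. This is a minimal sketch; the names flat_copy and flat_view are chosen for the example.

```python
import numpy as np

arr2 = np.arange(1, 9).reshape(2, 4)

flat_copy = arr2.flatten()  # always a new array
flat_view = arr2.ravel()    # a view here, because arr2 is contiguous

print(np.shares_memory(arr2, flat_copy))  # False
print(np.shares_memory(arr2, flat_view))  # True

flat_copy[0] = -1   # does not touch arr2
flat_view[1] = 99   # writes through to arr2
print(arr2[0, :2])  # [ 1 99]
```

If you plan to modify the flattened result and want the original left alone, flatten is the safer choice; if you only read from it, ravel avoids the extra allocation.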
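Another important thing to know about reshape is that, like ravel, it returns a view of the original data whenever possible rather than a copy. Comparing id values only tells you that the two names refer to different Python objects; it says nothing about the underlying data. np.shares_memory checks the data itself:
arr2 = arr.reshape(2, 4)
np.shares_memory(arr, arr2)
True
So changes made to the reshaped array also show up in the original array. If you need an independent array, call copy() on the result.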
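Whether a reshaped array shares memory with its source can be checked directly with np.shares_memory, which is more reliable than comparing object identities. The sketch below assumes the 8-element array from above. It also shows a case where reshape has to fall back to a copy.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
reshaped = arr.reshape(2, 4)

# Different Python objects, but the same underlying buffer.
print(reshaped is arr)                  # False
print(np.shares_memory(arr, reshaped))  # True

reshaped[0, 0] = 100
print(arr[0])  # 100

# When no view is possible (here: flattening a transposed array), reshape copies.
transposed = arr.reshape(2, 4).T
copied = transposed.reshape(-1)
print(np.shares_memory(arr, copied))    # False

# An explicit copy decouples the result from the original.
independent = arr.reshape(2, 4).copy()
independent[0, 0] = -5
print(arr[0])  # still 100
```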
Stacking
Stacking in NumPy refers to the process of combining different arrays along a specified axis. This can be achieved in various ways, depending on the desired output. We’ll delve into each one in this section.
Let’s start with vertical stacking. The easiest way to see how this works is by example. Consider two simple 1-dimensional arrays:
import numpy as np
arr1 = np.array([0, 1, 2, 3])
arr2 = np.array([4, 5, 6, 7])
array([0, 1, 2, 3])
array([4, 5, 6, 7])
arr_stacked = np.vstack((arr1, arr2))
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
So vstack treated arr1 and arr2 as row vectors and stacked them vertically, row over row. Note that vstack expects a sequence of arrays (here a tuple) as its argument, so we need the additional parentheses. We are not restricted to stacking only 2 arrays. For example:
np.vstack((arr1, arr2, arr1, arr2))
array([[0, 1, 2, 3],
[4, 5, 6, 7],
[0, 1, 2, 3],
[4, 5, 6, 7]])
What happens if we define our 1D arrays explicitly as column vectors? Well,
np.vstack((arr1.reshape(-1, 1), arr2.reshape(-1, 1)))
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7]])
vstack has again stacked rows over rows, but this time each of our original arrays had more than 1 row. As you can see, vstack behaves quite intuitively.
We are not restricted to using 1D arrays for stacking. In fact, we can use any dimension. For example,
arr3 = np.arange(8).reshape(2, -1)
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
arr4 = np.arange(8, 16).reshape(2, -1)
array([[ 8, 9, 10, 11],
[12, 13, 14, 15]])
np.vstack((arr3, arr4))
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
The only condition is that the sizes of all axes except the first (rows; index 0) must match.
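A quick way to see this condition in action is to stack arrays whose row counts differ (allowed) and whose column counts differ (not allowed). The arrays a, b, and c are made up for this sketch.

```python
import numpy as np

a = np.zeros((2, 4))
b = np.ones((3, 4))   # different row count, same column count: fine
c = np.ones((2, 3))   # different column count: cannot be stacked vertically

print(np.vstack((a, b)).shape)  # (5, 4)

try:
    np.vstack((a, c))
    mismatch_ok = True
except ValueError:
    mismatch_ok = False

print(mismatch_ok)  # False
```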
So this was vertical stacking. As you can imagine, there is also horizontal stacking, and the function to do this is aptly called hstack. Its usage is analogous to that of vstack. In vertical stacking, we stacked rows over rows, that is, we appended our arrays along axis 0. In horizontal stacking, we stack columns next to columns, that is, we append our arrays along axis 1 (1D arrays, which have only one axis, are simply joined end to end). Again, an example should clarify possible confusion:
print(f'arr1 = {arr1}, arr2 = {arr2}')
np.hstack((arr1, arr2))
arr1 = [0 1 2 3], arr2 = [4 5 6 7]
array([0, 1, 2, 3, 4, 5, 6, 7])
or
print(f'arr3: \n{arr3}\narr4:\n{arr4}')
np.hstack((arr3, arr4))
arr3:
[[0 1 2 3]
[4 5 6 7]]
arr4:
[[ 8 9 10 11]
[12 13 14 15]]
array([[ 0, 1, 2, 3, 8, 9, 10, 11],
[ 4, 5, 6, 7, 12, 13, 14, 15]])
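One way to make the axis bookkeeping concrete is to compare both helpers with np.concatenate, which takes the axis explicitly. This sketch reuses the arrays from above; the same_* names are illustrative.

```python
import numpy as np

arr1 = np.array([0, 1, 2, 3])
arr2 = np.array([4, 5, 6, 7])
arr3 = np.arange(8).reshape(2, -1)
arr4 = np.arange(8, 16).reshape(2, -1)

# 2D inputs: hstack joins along axis 1, vstack along axis 0.
same_h = np.array_equal(np.hstack((arr3, arr4)),
                        np.concatenate((arr3, arr4), axis=1))
same_v = np.array_equal(np.vstack((arr3, arr4)),
                        np.concatenate((arr3, arr4), axis=0))

# 1D inputs have no axis 1, so hstack joins them along axis 0 instead.
same_1d = np.array_equal(np.hstack((arr1, arr2)),
                         np.concatenate((arr1, arr2), axis=0))

print(same_h, same_v, same_1d)  # True True True
```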
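By the way, NumPy also has the functions row_stack and column_stack. row_stack is an alias for vstack (and is deprecated in recent NumPy releases). column_stack behaves like hstack for 2D inputs, but it treats 1D inputs as columns: np.column_stack((arr1, arr2)) returns a 4x2 array, whereas np.hstack((arr1, arr2)) returns a 1D array of length 8.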
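For 1D inputs it is worth checking the shapes these helpers actually produce. The sketch below compares column_stack and hstack on the arrays used above; the result names are chosen for the example.

```python
import numpy as np

arr1 = np.array([0, 1, 2, 3])
arr2 = np.array([4, 5, 6, 7])
arr3 = np.arange(8).reshape(2, -1)
arr4 = np.arange(8, 16).reshape(2, -1)

# 1D inputs: hstack joins end to end, column_stack makes each input a column.
h_shape = np.hstack((arr1, arr2)).shape        # (8,)
c_shape = np.column_stack((arr1, arr2)).shape  # (4, 2)
print(h_shape, c_shape)

# 2D inputs: both give the same result.
same_for_2d = np.array_equal(np.hstack((arr3, arr4)),
                             np.column_stack((arr3, arr4)))
print(same_for_2d)  # True
```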
So we had vertical (appending along axis 0) and horizontal (appending along axis 1) stacking. NumPy also provides the stack function, which differs in one respect: instead of appending along an existing axis, it joins the arrays along a new axis, so all input arrays must have the same shape. You give stack the arrays to be stacked and the position of the new axis in the result. For example:
np.stack((arr1, arr2), axis=0)
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
np.stack((arr1, arr2, arr2), axis=1)
array([[0, 4, 4],
[1, 5, 5],
[2, 6, 6],
[3, 7, 7]])
arr5 = np.arange(8).reshape(2, 4)
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
np.stack((arr5, 2*arr5, 3*arr5), axis=2).shape
(2, 4, 3)
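To see how the axis argument of stack places the new dimension, the following sketch stacks the same three 2x4 arrays at every possible position. The names layers, shapes, and stacked are assumptions of this example.

```python
import numpy as np

arr5 = np.arange(8).reshape(2, 4)
layers = (arr5, 2 * arr5, 3 * arr5)

# The new axis of length 3 lands at the position given by `axis`.
shapes = {axis: np.stack(layers, axis=axis).shape for axis in (0, 1, 2, -1)}
for axis, shape in shapes.items():
    print(axis, shape)
# 0 (3, 2, 4)
# 1 (2, 3, 4)
# 2 (2, 4, 3)
# -1 (2, 4, 3)

# Indexing along the new axis recovers the individual inputs.
stacked = np.stack(layers, axis=2)
second_layer_matches = np.array_equal(stacked[:, :, 1], 2 * arr5)
print(second_layer_matches)  # True
```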
Splitting
Splitting is another crucial operation in NumPy that essentially does the opposite of stacking. It divides a larger array into several smaller ones. Like stacking, there are different types of splitting operations available in NumPy, depending on your needs.
The split function is the most basic method of splitting in NumPy. The syntax is numpy.split(ary, indices_or_sections, axis=0). Here, ary is the array to be split, and indices_or_sections can be either an integer, specifying the number of equal-sized sub-arrays to create, or a sorted 1-D sequence of indices at which to split. The axis parameter specifies the axis along which to split, with the default being 0.
For instance,
arr = np.array([0, 1, 2, 3, 4, 5])
np.split(arr, 2, axis=0)
[array([0, 1, 2]), array([3, 4, 5])]
splits arr into 2 equal-sized sub-arrays.
Note that this only works because the length of our original array along axis 0 is divisible by 2. If we had
arr = np.array([0, 1, 2, 3, 4, 5, 6])
np.split(arr, 2, axis=0)
this would raise a ValueError! But we can use the alternative form of the indices_or_sections parameter, which lists the indices at which to split:
arr = np.array([0, 1, 2, 3, 4, 5, 6])
np.split(arr, [3], axis=0)
[array([0, 1, 2]), array([3, 4, 5, 6])]
or
np.split(arr, [3, 5], axis=0)
[array([0, 1, 2]), array([3, 4]), array([5, 6])]
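If you would rather ask for a number of pieces than work out split indices, NumPy has a separate function, np.array_split, which this article does not otherwise cover. It accepts section counts that do not divide the length evenly. A minimal sketch with the same 7-element array:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5, 6])

# np.split insists on equal-sized pieces and fails for 7 elements in 2 parts.
try:
    np.split(arr, 2)
    equal_split_ok = True
except ValueError:
    equal_split_ok = False

# np.array_split allows uneven pieces; the leading pieces get the extra elements.
two_parts = np.array_split(arr, 2)
three_parts = np.array_split(arr, 3)

print(equal_split_ok)                   # False
print([p.tolist() for p in two_parts])    # [[0, 1, 2, 3], [4, 5, 6]]
print([p.tolist() for p in three_parts])  # [[0, 1, 2], [3, 4], [5, 6]]
```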
But what does axis=0 actually do? A 1-dimensional array has only one axis (axis 0), so when splitting such an array, axis can only be 0 and is usually left at its default.
However, for multi-dimensional arrays, you can specify the axis to split along. For example, consider a 2-dimensional array. If axis=0, the array will be split along the rows. This means that each subarray will have fewer rows than the original, but the same number of columns. If axis=1, the array will be split along the columns. Each subarray will have fewer columns than the original, but the same number of rows.
import numpy as np
# Define a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr)
[[1 2 3]
[4 5 6]
[7 8 9]]
Split array along axis 0 (rows):
print(np.split(arr, 3, axis=0))
[array([[1, 2, 3]]), array([[4, 5, 6]]), array([[7, 8, 9]])]
Split array along axis 1 (columns):
print(np.split(arr, 3, axis=1))
[array([[1],
[4],
[7]]), array([[2],
[5],
[8]]), array([[3],
[6],
[9]])]
In this case, splitting along axis 0 will produce three arrays each containing one of the original array’s rows. Splitting along axis 1 will produce three arrays each containing one of the original array’s columns.
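Since splitting is the counterpart of stacking, a simple sanity check is to split the 3x3 array and stack the pieces back together. The sketch below does this for both axes and also shows that the pieces are views into the original data. The names rows, cols, and the boolean flags are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

rows = np.split(arr, 3, axis=0)  # three arrays of shape (1, 3)
cols = np.split(arr, 3, axis=1)  # three arrays of shape (3, 1)
print(rows[0].shape, cols[0].shape)  # (1, 3) (3, 1)

# Stacking the pieces along the same axis restores the original array.
rows_roundtrip = np.array_equal(np.vstack(rows), arr)
cols_roundtrip = np.array_equal(np.hstack(cols), arr)
print(rows_roundtrip, cols_roundtrip)  # True True

# The sub-arrays are views, so they share memory with arr.
pieces_are_views = np.shares_memory(rows[0], arr)
print(pieces_are_views)  # True
```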
As with stacking, there are also the convenience functions hsplit and vsplit, which do what you would expect. vsplit is equivalent to split with axis=0 and requires an array with at least 2 dimensions. hsplit is equivalent to split with axis=1, except for 1D arrays, which it splits along axis 0.
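The correspondence between the convenience functions and split can be checked directly. This is a small sketch on a made-up 4x3 array, including the 1D edge cases.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

# vsplit(arr, 2) matches split(arr, 2, axis=0): two blocks of 2 rows each.
top, bottom = np.vsplit(arr, 2)
top_ref, bottom_ref = np.split(arr, 2, axis=0)
vsplit_matches = np.array_equal(top, top_ref) and np.array_equal(bottom, bottom_ref)

# hsplit(arr, [1]) matches split(arr, [1], axis=1): first column, then the rest.
left, right = np.hsplit(arr, [1])
left_ref, right_ref = np.split(arr, [1], axis=1)
hsplit_matches = np.array_equal(left, left_ref) and np.array_equal(right, right_ref)

print(top.shape, left.shape, right.shape)  # (2, 3) (4, 1) (4, 2)

# 1D arrays: hsplit falls back to axis 0, while vsplit rejects them.
halves = np.hsplit(np.arange(6), 2)
try:
    np.vsplit(np.arange(6), 2)
    vsplit_1d_ok = True
except ValueError:
    vsplit_1d_ok = False
```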
Conclusion
Over the course of this article, we’ve ventured deep into the realm of advanced array manipulation with NumPy, focusing on three key operations: reshaping, stacking, and splitting. We have dissected these operations, demonstrating their potential in simplifying complex data manipulation tasks, and equipping you with powerful tools for your data science journey.