
Machine Learning with Python

Lecture 4. Numpy and Pandas


Alexander Avdiushenko
October 10, 2023

NumPy

Known as Numeric since 1995 and as NumPy since 2006: "Numerical Python extensions"

Features of the NumPy library

  • Work with multi-dimensional arrays (tables)
  • Quickly compute mathematical functions on multi-dimensional arrays

The core of the NumPy package is the ndarray object

Important differences between NumPy arrays and Python sequences:

  • A NumPy array has a fixed size, which is determined when it is created (unlike Python lists, which can grow dynamically)
  • Elements in a NumPy array must be of the same type
  • You can perform operations directly on NumPy arrays
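The three differences above can be seen in a minimal sketch. The variable names and values are illustrative and not part of the lecture.

```python
import numpy as np

# Mixed input is upcast so that all elements share one type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# np.append returns a new array; the original keeps its size
arr = np.array([1, 2, 3])
longer = np.append(arr, 4)
print(arr.shape, longer.shape)

# The same operator means repetition for a list and arithmetic for an array
doubled_list = [1, 2, 3] * 2
doubled_arr = arr * 2
print(doubled_list, doubled_arr)
```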

Speed of NumPy is due to:

  • Implementation in C
  • Vectorization and Broadcasting (e.g., multiplication of arrays of compatible shapes)
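As a rough illustration of vectorization, the sketch below computes the same expression with a Python loop and with a single array expression. The array size and the formula are arbitrary choices for the example.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Pure-Python loop: the interpreter handles one element at a time
loop_result = [v * v + 1.0 for v in values]

# Vectorized form: the same arithmetic runs in compiled code over the whole array
vectorized_result = values * values + 1.0

print(np.allclose(loop_result, vectorized_result))
```

Both forms give the same numbers; the vectorized one is usually much faster on large arrays.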

Conversion from Python structures

import numpy as np

np.array([1, 2, 3, 4, 5])

During conversion, you can specify the data type using the dtype argument:

np.array([1, 2, 3, 4, 5], dtype=np.float32)

Or similarly

np.float32([1, 2, 3, 4, 5])

Generation of numpy arrays

  • arange — analogue of Python's range, to which a non-integer step can be passed
  • linspace — a way to evenly divide a segment into `n-1` intervals
  • logspace — a way to divide a segment on a logarithmic scale
  • zeros — creates an array filled with zeros of the specified dimension
  • ones — creates an array filled with ones of the specified dimension
  • empty — creates an array of the specified dimension not initialized with any value

np.arange(0, 5, 2)
np.linspace(0, 5, 11)
np.logspace(0, 9, 10, base=2)
np.zeros((2, 2, 3))
np.diag([1,2,3])
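A small sketch to check what the generators above actually produce. The specific arguments are chosen only for illustration.

```python
import numpy as np

# 11 points on [0, 5] give 10 equal intervals of length 0.5
points = np.linspace(0, 5, 11)
steps = np.diff(points)
print(points.size, steps)

# arange accepts a non-integer step; the stop value is excluded
float_range = np.arange(0, 1, 0.25)
print(float_range)

# logspace with base=2 returns 2**0, 2**1, ..., 2**9
powers = np.logspace(0, 9, 10, base=2)
print(powers)
```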

The array sizes are stored in the shape attribute, and the number of dimensions is stored in ndim.

arr = np.ones((2, 3))
print(f"Array shape: {arr.shape}, number of dimensions: {arr.ndim}")

The reshape method allows you to transform the dimensions of an array without changing the data, but possibly with copying.

array = np.arange(0, 6)
array = array.reshape((2, 3))
array

a = np.zeros((10, 2))
# A transpose makes the array non-contiguous
b = a.T
# Taking a view makes it possible to modify the shape
# without modifying the initial object
c = b.view()
# AttributeError: ... Use `.reshape()` to make a copy with the desired shape
c.shape = (20)

a
a.shape = (20)
a
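Whether reshape copies the data can be checked with np.shares_memory. The sketch below reuses the shapes from the example above; the names flat_view and flat_copy are illustrative.

```python
import numpy as np

a = np.zeros((10, 2))
flat_view = a.reshape(20)      # contiguous data: reshape can return a view
print(np.shares_memory(a, flat_view))

b = a.T                        # non-contiguous view of a
flat_copy = b.reshape(20)      # here reshape has to copy the data
print(np.shares_memory(b, flat_copy))
```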

To unroll a multi-dimensional array into a vector, you can use the ravel function (equivalent to `reshape(-1, order=order)`).

array = np.arange(0, 6)
array = array.reshape((2, 3))
array = np.ravel(array)
array
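The order argument controls the direction in which ravel reads the elements. A short sketch with an illustrative 2 x 3 matrix:

```python
import numpy as np

matrix = np.arange(6).reshape((2, 3))

row_major = np.ravel(matrix)                  # default order="C": row by row
column_major = np.ravel(matrix, order="F")    # column by column
print(row_major, column_major)

# ravel with the default order gives the same result as reshape(-1)
print(np.array_equal(row_major, matrix.reshape(-1)))
```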

Indexing

In NumPy, familiar Python indexing works, including the use of negative indices and slices.

import numpy as np

array = np.arange(0, 6)
print(array[0], array[-1])
print(array[1:-1])
print(array[1:-1:2])
print(array[::-1])

Note 1: Indices and slices for the different axes of a multidimensional array go inside one pair of square brackets, separated by commas: write matrix[i, j] instead of matrix[i][j].
Note 2: Slices in NumPy create views, not copies, unlike slices of native Python sequences (string, tuple, and list), which create copies.
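Both notes can be verified with a minimal sketch. The values are illustrative.

```python
import numpy as np

array = np.arange(6)
view = array[1:4]          # a view into the same memory
view[0] = 100
print(array)               # the change is visible in the original array

python_list = list(range(6))
list_slice = python_list[1:4]   # an independent copy
list_slice[0] = 100
print(python_list)         # unchanged

matrix = np.arange(6).reshape(2, 3)
# Same element both ways; matrix[1, 2] avoids building a temporary row
print(matrix[1, 2], matrix[1][2])
```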

You can use lists as indices.

import numpy as np

array = np.arange(0, 6)
array[[0, 2, 4]]
array[[True, False, True, False, True, False]]

x = np.array([[1, 2, 3]])
y = np.array([1, 2, 3])
print(x.shape, y.shape)
print(np.array_equal(x, y))
print(np.array_equal(x, y[np.newaxis, :]))

x = np.arange(10)
x
x % 2
x[(x % 2 == 0) & (x > 5)]

print(x)
x[x > 3] *= 2
print(x)
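Unlike slices, indexing with a list or a boolean mask returns a copy. A minimal sketch with illustrative names:

```python
import numpy as np

x = np.arange(10)
picked = x[[0, 2, 4]]      # indexing with a list returns a copy
picked[0] = -1
print(x[0])                # the original array is unchanged

# Parentheses are required: & binds more tightly than == and >
mask = (x % 2 == 0) & (x > 5)
print(mask.dtype, x[mask])
```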

Reading from disk

(but better use pandas)

from numpy import genfromtxt

data = genfromtxt('./resources/example.csv', delimiter=',', skip_header=1)
for row in data:
    print(row)

help(genfromtxt)
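To try genfromtxt without a file on disk, it can also read from a file-like object. The CSV text below is made up for the example and is not the content of example.csv.

```python
import io
import numpy as np

# Hypothetical CSV content; the second data row has an empty weight field
csv_text = "height,weight\n1.80,75\n1.65,\n1.72,68\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(data.shape)
print(np.isnan(data[1, 1]))   # the missing field is read as nan
```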

Broadcasting and Vectorization

Operations in NumPy can be performed directly on vectors of equal dimension without using loops.

For example, the computation of the element-wise difference between vectors looks like this:

import numpy as np

_l = np.ones(5)
x = np.linspace(1, 5, 5)
print(x - _l)
print(x - 2 * _l)

Similarly for multi-dimensional arrays.

Note: All arithmetic operations on arrays of the same size are performed element-wise.
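The same element-wise behaviour for two-dimensional arrays, as a small sketch with illustrative values:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.full((2, 3), 10)

total = a + b
product = a * b        # element-wise product, not matrix multiplication
print(total)
print(product)
print(a > 2)           # comparisons are element-wise too
```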

Broadcasting

Broadcasting relaxes the "same shape" rule and allows arithmetic operations on arrays of different but compatible shapes. The simplest example is multiplying a vector by a number.


import numpy as np

np.arange(1, 4) * 2

The rule of dimensionality consistency is expressed in one sentence:

For broadcasting, the sizes along each axis of the two arrays, compared from the last axis backwards, must either be equal or one of them must be 1.

a = np.ones((2, 3, 4))
b = np.arange(1, 5)  # b.shape = (4,)
# here a.shape = (2, 3, 4) and b.shape is considered to be (1, 1, 4)
a * b
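The rule can be checked without creating any arrays by using np.broadcast_shapes (available in NumPy 1.20 and later). The helper can_broadcast below is a hypothetical wrapper written for this example, not a NumPy function.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if the two shapes are compatible under the broadcasting rule."""
    try:
        np.broadcast_shapes(shape_a, shape_b)
        return True
    except ValueError:
        return False

print(np.broadcast_shapes((2, 3, 4), (4,)))   # shape of the result of a * b
print(can_broadcast((2, 3, 4), (3, 1)))       # 4 vs 1 and 3 vs 3 are both fine
print(can_broadcast((2, 3, 4), (3,)))         # 4 vs 3 violates the rule
```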

Add the same vector to each row of the matrix.

np.array([[0, 0, 0], [10, 10, 10], [20, 20, 20], [30, 30, 30]]) + np.arange(3)


Now, if we want to do the same trick with columns, we can't simply add a vector of 4 elements, because the shapes (4, 3) and (4,) are not compatible.
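A minimal sketch of what goes wrong, using the same matrix as above. The names column_values and raised are illustrative.

```python
import numpy as np

matrix = np.array([[0, 0, 0], [10, 10, 10], [20, 20, 20], [30, 30, 30]])
column_values = np.arange(4)

raised = False
try:
    # Last axes are compared first: 3 against 4, neither of which is 1
    matrix + column_values
except ValueError as error:
    raised = True
    print(error)
```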


First, the vector needs to be transformed to the form:

np.arange(4)
np.arange(4)[:, np.newaxis]

And then add the matrix to it:

np.arange(4)[:, np.newaxis] + np.array([[0, 0, 0], [10, 10, 10], [20, 20, 20], [30, 30, 30]])

If you need to multiply multidimensional arrays not element by element but according to the rule of matrix multiplication, use np.dot (or the @ operator). Transposition is done with array.T.
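The difference between the element-wise product and the matrix product, as a short sketch with two illustrative 2 x 2 matrices:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

elementwise = a * b
matrix_product = np.dot(a, b)   # for 2-D arrays this matches a @ b
transposed = a.T

print(elementwise)
print(matrix_product)
print(transposed)
```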

Also, NumPy has implemented many useful operations for working with arrays: np.min, np.max, np.sum, np.mean, etc.

Note: Each of the listed functions has the axis parameter, which indicates along which dimension to perform this operation. By default, the operation is performed on all values of the array.

Operations

import numpy as np

x = np.arange(20).reshape(4, 1, 5)
print(x)
print(x.T.shape)

print(x.mean())
print(np.mean(x))

x_mean_0 = x.mean(axis=0)
print(x_mean_0.shape)
print(x_mean_0)

x_mean_1 = x.mean(axis=1)
print(x_mean_1.shape)
print(x_mean_1)

x_mean_02 = x.mean(axis=(0, 2))
print(x_mean_02.shape)
print(x_mean_02)

Concatenation of multidimensional arrays

You can concatenate several arrays using the functions np.concatenate, np.vstack, np.hstack, and np.dstack.

import numpy as np

x = np.arange(10).reshape(5, 2)
y = np.arange(100, 120).reshape(5, 4)
print(f"{x=},\n {y=}")
np.hstack((x, y))

p = np.arange(1).reshape([1, 1, 1, 1])
p
print("vstack: ", np.vstack((p, p)).shape)
print("hstack: ", np.hstack((p, p)).shape)
print("dstack: ", np.dstack((p, p)).shape)
np.concatenate((p, p), axis=3).shape
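For two-dimensional inputs, vstack and hstack correspond to np.concatenate along axis 0 and axis 1. A small sketch with illustrative arrays:

```python
import numpy as np

x = np.arange(6).reshape(3, 2)
y = np.arange(10, 16).reshape(3, 2)

stacked_rows = np.vstack((x, y))   # shape (6, 2)
stacked_cols = np.hstack((x, y))   # shape (3, 4)
stacked_depth = np.dstack((x, y))  # shape (3, 2, 2): stacked along a third axis

print(np.array_equal(stacked_rows, np.concatenate((x, y), axis=0)))
print(np.array_equal(stacked_cols, np.concatenate((x, y), axis=1)))
print(stacked_depth.shape)
```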

Numpy types

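import numpy as np

x = [1, 2, 70000]
np.array(x, dtype=np.float64)
# 70000 exceeds the uint16 maximum of 65535, so this overflows
# (recent NumPy versions raise OverflowError here)
np.array(x, dtype=np.uint16)
# np.str_ replaces the alias np.unicode_, which NumPy 2.0 removed
np.array(x, dtype=np.str_)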
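The limits of an integer type can be inspected with np.iinfo, and an existing array can be converted with astype. A minimal sketch using the same values as above:

```python
import numpy as np

limits = np.iinfo(np.uint16)
print(limits.min, limits.max)   # the range that uint16 can represent

x = np.array([1, 2, 70000])
as_float = x.astype(np.float64)
as_text = x.astype(np.str_)     # each number becomes a string
print(as_float.dtype, as_text)
```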

Function vectorization

def f(value):
    return np.sqrt(value)

# like the Python map function, except it uses the broadcasting rules of NumPy
vf = np.vectorize(f)
vf(np.arange(10))
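np.vectorize is mainly useful for functions that only accept scalars. It is a convenience wrapper that still calls the Python function once per element, so it does not give the speed of a native NumPy function. A sketch with a hypothetical scalar-only function:

```python
import math
import numpy as np

def scalar_sqrt(value):
    # math.sqrt works on a single number, not on an array
    return math.sqrt(value)

vectorized_sqrt = np.vectorize(scalar_sqrt)
values = np.arange(10)

wrapped = vectorized_sqrt(values)
native = np.sqrt(values)        # prefer the built-in function when one exists
print(np.allclose(wrapped, native))
```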

Pandas

Consider the Pandas library (from panel data), designed for reading, preprocessing and fast visualization of structured data, as well as for simple analytics.

Even when there are only two arrays (for example, grouping by one and aggregating the other), Pandas is already the better choice. Column titles should reflect the physical meaning of the data, and the index (row labels) does not have to be numeric.
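The two-array case can be sketched as follows. The city and temperature data are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a group label and a measured value per row
city = np.array(["Paris", "Rome", "Paris", "Rome", "Paris"])
temperature = np.array([21.0, 27.0, 19.0, 29.0, 20.0])

df_toy = pd.DataFrame({"city": city, "temperature": temperature})

# Group by one array and aggregate the other
mean_by_city = df_toy.groupby("city")["temperature"].mean()
print(mean_by_city)
```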

import pandas as pd

df = pd.read_csv("./resources/titanic.csv", sep="\t")
df.head(3)

df[["Sex", "Cabin"]].describe()

Slices in DataFrames

df.sort_values("Age", inplace=True)
print(df.iloc[78])  # integer location: the row at position 78, counting from 0 to length-1
print(df.loc[78])  # the row with index label 78
print(df.loc[[78, 79, 100], ["Age", "Name"]])
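The difference between iloc and loc becomes visible once the rows are no longer in index order. A minimal sketch on a hypothetical three-row table:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [40, 22, 31]})   # default index labels 0, 1, 2
toy_sorted = toy.sort_values("Age")         # row order is now 1, 2, 0

by_position = toy_sorted.iloc[0]["Age"]     # first row in the current order
by_label = toy_sorted.loc[0]["Age"]         # row labelled 0, wherever it is
print(by_position, by_label)
```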

Note: If you want to modify the data of a slice without changing the main table, you need to make a copy.

df_slice_copy = df.loc[[78, 79, 100], ["Age", "Name"]].copy()
df_slice_copy["Age"] = 3
print(df_slice_copy)

Note: If you want to change the main table, use loc/iloc.

some_slice = df["Age"].isin([20, 25, 30])
df.loc[some_slice, "Fare"] = df.loc[some_slice, "Fare"] * 1000

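Don't do it this way: df[some_slice] returns a new object, so the assignment does not update df, and pandas may emit a SettingWithCopyWarning.

slice_df = df[some_slice]
slice_df["Fare"] = slice_df["Fare"] * 10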
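The two behaviours can be compared on a small table. The data below is hypothetical and only stands in for the Titanic columns.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [20, 35, 30], "Fare": [7.0, 50.0, 8.0]})
mask = toy["Age"].isin([20, 25, 30])

# A boolean selection is a separate object; editing a copy leaves toy untouched
selected = toy[mask].copy()
selected["Fare"] = selected["Fare"] * 10
fare_after_copy_edit = toy["Fare"].tolist()
print(fare_after_copy_edit)

# Assigning through .loc writes into the original table
toy.loc[mask, "Fare"] = toy.loc[mask, "Fare"] * 10
fare_after_loc_edit = toy["Fare"].tolist()
print(fare_after_loc_edit)
```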

You can get the values of only the necessary columns by passing in [] the column name (or list of column names).

Note: If we pass the name of one column, we get an object of class pandas.Series, and if we pass a list of column names, we get a pandas.DataFrame. To get a NumPy array, use the values attribute (or the to_numpy() method).
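A quick way to confirm which type each form returns, using a hypothetical two-row table:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22, 38], "Name": ["Ann", "Bob"]})  # made-up data

one_column = toy["Age"]          # a single name gives a Series
as_frame = toy[["Age"]]          # a list of names gives a DataFrame
raw_values = toy["Age"].values   # the underlying NumPy array

print(type(one_column).__name__, type(as_frame).__name__, type(raw_values).__name__)
```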

Series and DataFrame have many common methods.

# pandas.Series
df["Age"].head(5)

# pandas.DataFrame
df[["Age"]].head(5)

# Series to DataFrame
s = pd.Series([1, 2, 3], index=["Red", "Green", "Blue"])
s.to_frame("Values")

Join and merge

df1 = df[["Age", "Parch"]].copy()
df2 = df[["Ticket", "Fare"]].copy()
# by index
df1.join(df2).head(3)

df1 = df[["Age", "Parch", "PassengerId"]].copy()
df2 = df[["Ticket", "Fare", "PassengerId"]].copy()
# by columns
pd.merge(df1, df2, on=["PassengerId"], how="inner").head(3)
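When the two tables do not contain exactly the same keys, the how argument decides which rows survive. A sketch on two hypothetical tables that share only some PassengerId values:

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2, 3], "Age": [22, 38, 26]})
right = pd.DataFrame({"PassengerId": [2, 3, 4], "Fare": [71.3, 7.9, 53.1]})

inner = pd.merge(left, right, on="PassengerId", how="inner")   # ids in both tables
outer = pd.merge(left, right, on="PassengerId", how="outer")   # ids in either table
print(inner)
print(outer)

# join aligns on the index, so the key is moved into the index first;
# by default it keeps every row of the left table
joined = left.set_index("PassengerId").join(right.set_index("PassengerId"))
print(joined)
```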

Grouping

print("Pclass 1: ", df[df["Pclass"] == 1]["Age"].mean())
print("Pclass 2: ", df[df["Pclass"] == 2]["Age"].mean())
print("Pclass 3: ", df[df["Pclass"] == 3]["Age"].mean())

df.groupby(["Pclass"])[["Age"]].mean()

df.groupby(["Survived", "Pclass"])

df.groupby(["Survived", "Pclass"])["PassengerId"].count()

df.groupby(["Survived", "Pclass"])[["PassengerId", "Cabin"]].count()
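Note that count skips missing values, which is why the counts for two columns of the same group can differ. A sketch on a hypothetical table with missing cabins:

```python
import pandas as pd

# Made-up data; None marks a missing cabin
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [38.0, 35.0, 22.0, 26.0, 30.0],
    "Cabin": ["C85", "C123", None, None, "G6"],
})

grouped = toy.groupby("Pclass")
known_cabins = grouped["Cabin"].count()             # count skips missing values
group_sizes = grouped.size()                        # size counts every row
age_summary = grouped["Age"].agg(["mean", "max"])   # several aggregates at once
print(known_cabins, group_sizes, age_summary, sep="\n")
```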