### CDS NYU
### DS-GA 1007 | Programming for Data Science
### Lab 06
### October 12, 2022


# NumPy: Array Manipulation for Scientific Computing

## Section Leaders


Cora Mao -- ym1596@nyu.edu

Devarsh Patel -- dp3324@nyu.edu


## Resources

* Concise textbook introduction to NumPy: "Python Data Science Handbook", pp. 33-96, by Jake VanderPlas
    * Also accessible online at https://jakevdp.github.io/PythonDataScienceHandbook

* NumPy's freely accessible, high-quality, and concise online documentation: https://numpy.org/doc/

* Case Study:  https://swcarpentry.github.io/python-novice-inflammation/02-numpy/index.html


## 1. Creation, Manipulation and Indexing of NumPy Arrays
NumPy supports large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on them.

In [None]:
import numpy as np

### Create NumPy arrays

#### Using `arange` to define numerical entries and (optionally)  `reshape` to define the dimensions

In [None]:
x = np.arange(1, 9, 2) # start, stop (exclusive), step
print(x)

In [None]:
x = np.arange(20).reshape(5, 4)

In [None]:
print(x.ndim)  # Number of dimensions of the array
print(x.shape) # Tuple of integers indicating the size of the array in each dimension
print(x.size)  # Total number of elements of the array
print(x.dtype) # Type of the elements in the array

#### Self-defined NumPy array

In [None]:
# 1 dimension
x = np.array([1, 2, 3, 4])
print(x)
print(x.shape)

In [None]:
# 2 dimensions
x = np.array([[1.5, 2.0, 3.3], [4.7, 5.9, 6.1]])
print(x)
print(x.shape)

#### Using other methods with specific parameters

In [None]:
x = np.linspace(0, 1, 5) # start, stop, num
print(x)
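
The two constructors treat their end point differently, which is easy to miss. A minimal sketch of the contrast, using illustrative variable names that are not part of the lab:

```python
import numpy as np

# arange excludes the stop value; linspace includes it by default
stepped = np.arange(1, 9, 2)     # 1, 3, 5, 7 (9 is not included)
spaced = np.linspace(0, 1, 5)    # 5 evenly spaced points from 0 to 1 (1 is included)

print(stepped)
print(spaced)
```

As a rule of thumb, `arange` is convenient when you know the step, and `linspace` when you know how many points you want.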

In [None]:
print(np.ones((5, 5)))  # All elements are 1

In [None]:
print(np.zeros((2, 2))) # All elements are 0

In [None]:
print(np.identity(3))   # Identity matrix

Sample an array from a uniform distribution: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)

In [None]:
x = np.random.rand(3, 2) # Return a sample (or samples) from a uniform distribution (in the range [0,1))
print(x)

In [None]:
x = np.random.randint(low=0, high=10, size=(3, 2)) # Random integers from low (inclusive) to high (exclusive)
print(x)

Sample an array from a Gaussian (normal) distribution: https://en.wikipedia.org/wiki/Normal_distribution

In [None]:
x = np.random.randn(3, 2) # Return a sample (or samples) from the “standard normal” distribution - N(0,1)
print(x)

#### Loading data stored on file into a NumPy array

In [None]:
# Create two NumPy arrays by loading data from two separate files (used for Exercise 2 below)
a = np.loadtxt(fname='ClassA.csv', delimiter=',')
b = np.loadtxt(fname='ClassB.csv', delimiter=',')

### Select sub-arrays by slicing an array

In [None]:
x = np.random.randn(5, 4)
print(x)

In [None]:
print(x[1, :])

In [None]:
print(x[:, 1])

In [None]:
print(x[:, -1])

In [None]:
print(x[:3, 1:])

In [None]:
print(x[0: 4, 0: 3])

### Select sub-arrays using a Boolean mask

In [None]:
x = np.array([[1, 2, 4], [0, 5, 9]])
print(x)
mask = (x <= 4)
print(mask)

In [None]:
x[mask] = -1
print(x)
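
A mask can also be used to read values rather than overwrite them, and several conditions can be combined. The following is a small sketch under the assumption that the array is the same as in the cell above; the names `values`, `small` and `in_range` are illustrative only:

```python
import numpy as np

values = np.array([[1, 2, 4], [0, 5, 9]])

# Indexing with a mask returns the selected elements as a 1-D array
small = values[values <= 4]

# Combine conditions with & (and) or | (or); the parentheses are required
in_range = values[(values >= 2) & (values <= 5)]

print(small)
print(in_range)
```

Note that the selection loses the 2-D layout: only the elements where the mask is `True` are kept, in row-by-row order.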

### Change the shape of an array 

In [None]:
# The ravel method flattens the array into 1 dimension
print(x.ravel())
print(x.ravel().shape)

In [None]:
# Reshape to a chosen shape (the total number of elements must stay the same: x has 6)
y = x.reshape((3, 2))
print(y)
print(y.shape)
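
Reshaping only rearranges existing elements, so the new shape must account for exactly the same number of entries. A short sketch of that rule, using a hypothetical 12-element array:

```python
import numpy as np

flat = np.arange(12)

grid = flat.reshape((3, 4))    # 3 * 4 == 12, so this works
auto = flat.reshape((2, -1))   # -1 asks NumPy to infer the missing dimension

print(grid.shape)
print(auto.shape)

try:
    flat.reshape((5, 4))       # 5 * 4 == 20, which does not match 12 elements
except ValueError as err:
    print("reshape failed:", err)
```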

### Stack arrays
Pay attention to array dimension!

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]])
print(x)

In [None]:
y = np.array([7,8,9])
print(y)

In [None]:
np.vstack([x, y])

In [None]:
# Raises a ValueError: x is 2-D and y is 1-D, so they cannot be joined side by side
np.hstack([x, y])

In [None]:
z = np.array([[1], [2]])
np.hstack([x, z])
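
The shape requirements can be summarized in one place. This is a sketch with illustrative arrays (`table`, `new_row`, `new_col` are not part of the lab): vertical stacking needs matching column counts, horizontal stacking needs matching row counts.

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
new_row = np.array([7, 8, 9])              # shape (3,)
new_col = np.array([10, 20])               # shape (2,)

# vstack: column counts must match; a 1-D array is treated as a single row
taller = np.vstack([table, new_row])

# hstack: row counts must match, so reshape the 1-D array into a (2, 1) column first
wider = np.hstack([table, new_col.reshape(-1, 1)])

print(taller.shape)
print(wider.shape)
```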

### Exercise 1
1. Create an array $X$ with 5 rows and 6 columns whose elements run from 1 to 30
2. Use slicing to select:  
   a. The subarray containing only the odd rows of $X$  
   b. The subarray containing only the odd rows and even columns of $X$
3. Replace every element of $X$ that is a multiple of 5 with 0

## 2. Statistical Analysis of Data
### Get summary statistics of arrays: min, max, mean, median, std, sum

In [None]:
X = np.arange(1,31).reshape(5, 6)
print(X)

### Example of the median

Definition: The median of a set of numbers is the "middle" value: if the numbers were sorted, there would be as many numbers to its left as to its right. When the count is even, the median is the average of the two middle values. See https://en.wikipedia.org/wiki/Median
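
To make the definition concrete, here is a small worked sketch on two hand-picked arrays (the values are illustrative only), one with an odd and one with an even number of entries:

```python
import numpy as np

odd = np.array([7, 1, 5])       # sorted: 1, 5, 7 -> the middle value is 5
even = np.array([7, 1, 5, 3])   # sorted: 1, 3, 5, 7 -> average of 3 and 5

print(np.median(odd))
print(np.median(even))
```

`np.median` does the sorting internally, so the input does not need to be ordered.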

In [None]:
# Median of each row (axis=1 collapses the columns)
np.median(X, axis=1)

In [None]:
# Median of each column (axis=0 collapses the rows)
np.median(X, axis=0)

In [None]:
# All elements
np.median(X)

### Other statistics such as mean, std, min, max, etc. (seen during the lecture)
Try computing some statistics yourself on different 1-D or 2-D arrays.

In [None]:
# Examples
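m = np.mean(X, axis=0)  # Mean of each column
s = np.std(X, axis=0)   # Standard deviation of each column
l = np.min(X, axis=1)   # Lowest value in each row
h = np.max(X, axis=1)   # Highest value in each row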
m_total = np.mean(X)
print('Overall average:', m_total)
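
The `axis` argument names the dimension that is collapsed, which is why the result shapes differ. A brief sketch on the same 5-by-6 array (the names `col_means` and `row_means` are illustrative):

```python
import numpy as np

X = np.arange(1, 31).reshape(5, 6)

col_means = X.mean(axis=0)   # collapse the 5 rows -> one value per column
row_means = X.mean(axis=1)   # collapse the 6 columns -> one value per row

print(col_means.shape, col_means)
print(row_means.shape, row_means)
```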


### Writing your custom statistical functions
Try writing functions to perform statistical operations

In [None]:
# Function that takes an array as input and returns the vector of row averages
def stat_avg1(a):
    return np.mean(a, axis=1)

In [None]:
# Function that takes an array as input and returns two vectors (the average of each row, the average of each column)
def stat_avg2(a):
    return np.mean(a, axis=1), np.mean(a, axis=0)

In [None]:
# Function to return the maximum across averages for each row
def stat_avgmax(a):
    v = np.mean(a, axis=1)
    return np.max(v)

Apply these functions to an array of shape ``(10,10)`` where each entry is drawn from a normal (Gaussian) distribution with mean 0 and standard deviation 1

In [None]:
a = np.random.randn(10, 10)

In [None]:
stat_avg1(a)

In [None]:
stat_avg2(a)

In [None]:
stat_avgmax(a)

### Exercise 2

In [None]:
from IPython.display import Image
Image(filename="lab06_data.png")

The files `ClassA.csv` and `ClassB.csv` contain results for 10 students each. Each student took 5 quizzes in total. Load the data from these files and:
1. Stack ClassA and ClassB vertically. Now we have 20 students in total.
2. What is the maximum score for each quiz among the 20 students?
3. What is the lowest score that each student got among the 5 quizzes?
4. What is the average score that each student got across the 5 quizzes?
5. In terms of total scores over the 5 quizzes, which class performed better?

## 3. Broadcasting and Mathematical Operations
https://numpy.org/doc/stable/user/basics.broadcasting.html

### Examples of arithmetic operations

In [None]:
a = np.arange(0,9)

In [None]:
a + a

In [None]:
a * a

In [None]:
a**2

In [None]:
np.sqrt(a)

In [None]:
np.exp(a)

### Examples of broadcasting

In [None]:
# Broadcasting examples seen during the lecture
a[0:5] = 100 # Scalar broadcasting (example shown above)
a + 10 # Scalar broadcasting
10 * a # Scalar broadcasting
np.ones((2,9)) + np.arange(9) # Array broadcasting (example shown above)
a + np.random.rand(2, 9) # Array broadcasting (duplicate row and add small random number to each entry)

In [None]:
# Another example of scalar broadcasting
a = np.array([1, 2, 3, 4])
print(a**2 + 4)

In [None]:
a = np.array([[ 0.0,  0.0,  0.0],
              [10.0, 10.0, 10.0],
              [20.0, 20.0, 20.0],
              [30.0, 30.0, 30.0]])
b = np.array([1.0, 2.0, 3.0])
print(a)
print(b)
print(a+b)
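
Broadcasting compares shapes from the last dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. The sketch below illustrates this rule with hypothetical arrays `grid`, `row` and `col`:

```python
import numpy as np

grid = np.zeros((4, 3))
row = np.array([1.0, 2.0, 3.0])                   # shape (3,), treated as (1, 3)
col = np.array([[0.0], [10.0], [20.0], [30.0]])   # shape (4, 1)

print((grid + row).shape)   # (4, 3): the row is repeated for each of the 4 rows
print(col + row)            # (4, 1) and (3,) stretch to a common (4, 3) shape

try:
    grid + np.arange(4)     # last dimensions are 3 and 4: not compatible
except ValueError as err:
    print("broadcast failed:", err)
```

Here `col + row` reproduces the result of the previous cell without writing out the repeated values by hand.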

### Examples of linear algebra operations
NumPy also offers linear algebra operators to manipulate arrays as vectors or matrices.

In [None]:
a = np.array([[1,2], [3,4]], float)
b = np.array([[2,0], [1,3]], float)
print(a)
print(b)

#### Vector dot product

In [None]:
np.dot(a[:,0], a[:,1]) # Vector dot product between first and second columns of matrix a

#### Matrix multiplication

In [None]:
np.matmul(a, b) # Matrix product of a and b (the number of columns of a must equal the number of rows of b)
a @ b           # Shortcut: Same as np.matmul(), but faster to type


#### Matrix determinant, norm, inverse, trace, and many others

In [None]:
np.linalg.det(a+b) # Determinant of (a + b)

In [None]:
np.linalg.norm(a - b) # Norm of the difference between matrices a and b
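
For a matrix, `np.linalg.norm` computes the Frobenius norm by default, that is, the square root of the sum of squared entries. The sketch below checks this by hand on the same two matrices and shows that the norm of a difference is not the same quantity as a difference of norms (the names `diff_norm`, `manual` and `norm_gap` are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], float)
b = np.array([[2, 0], [1, 3]], float)

diff_norm = np.linalg.norm(a - b)                  # norm of the difference
manual = np.sqrt(np.sum((a - b) ** 2))             # same value computed by hand
norm_gap = np.linalg.norm(a) - np.linalg.norm(b)   # difference of the two norms

print(diff_norm, manual, norm_gap)
```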

### Exercise 3

For arrays `A`, `B`, `C` and `D` given below:

1. What is the element-wise product of A and B?
2. What is the matrix product of A and B?
3. Why can you compute the element-wise product of B and C even though they have different shapes?
4. Can you compute the matrix product of B and D? Why?
5. Can you compute the matrix product of D and B? Why?

In [None]:
import numpy as np

A = 2*np.identity(3)
B = np.arange(1, 10).reshape(3,3)
C = np.array([1,2,3])
D = np.arange(0, 12).reshape(3,4)

### [Optional] For more details on the difference between `np.matmul` and `np.dot`:
https://numpy.org/doc/stable/reference/generated/numpy.dot.html#numpy.dot
https://numpy.org/doc/stable/reference/generated/numpy.matmul.html#numpy.matmul

Short version: When applied to two 2-D arrays (matrices), they give the same result :) They differ for scalars and for arrays with more than two dimensions.
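
A quick way to see where the two functions agree and where they part ways is to compare result shapes. This is a sketch with made-up arrays; `stack_a` and `stack_b` are hypothetical 3-D inputs:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], float)
b = np.array([[2, 0], [1, 3]], float)

# For 2-D inputs the two functions return the same matrix product
same_2d = np.allclose(np.dot(a, b), np.matmul(a, b))
print(same_2d)

# For 3-D inputs they differ: matmul multiplies matching pairs of matrices,
# while dot combines every matrix of the first stack with every matrix of the second
stack_a = np.ones((2, 3, 4))
stack_b = np.ones((2, 4, 5))
print(np.matmul(stack_a, stack_b).shape)
print(np.dot(stack_a, stack_b).shape)
```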

## **Thank you everyone!**