## Table of Contents
@[toc]
## Introduction to Deep Learning
- AI is likened to the new electricity because, much like electricity did over 100 years ago, AI is transforming multiple industries such as: automotive, agriculture, and supply chains.
- The recent surge in deep learning can be attributed to: advancements in hardware, particularly GPU computing, which provides greater computational power; its successful application in key domains like advertising, speech recognition, and image recognition; and the digital age, which has generated vast amounts of data.
- The following diagram illustrates the iterative thinking process across different machine learning approaches:

- This thought map enables rapid idea testing, allowing deep learning engineers to iterate on their concepts more swiftly.
- It accelerates the team's ability to refine an idea.
- New progress in deep learning algorithms allows better models to be trained even without upgrades to CPU or GPU hardware.
- Feature engineering is crucial for achieving good performance. While experience helps, building an effective model requires multiple iterations.
- The graph of the ReLU activation function is shown below:

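As a minimal sketch (assuming NumPy; the sample inputs are illustrative), ReLU simply clamps negative values to zero, \( \mathrm{ReLU}(z) = \max(0, z) \):

```python
import numpy as np

def relu(z):
    # ReLU passes positive values through unchanged and clamps negatives to 0
    return np.maximum(0, z)

z = np.linspace(-5, 5, 11)
print(relu(z))  # [0. 0. 0. 0. 0. 0. 1. 2. 3. 4. 5.]
```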
- Cat recognition is an example of “unstructured” data. A demographic dataset that counts population, GDP per capita, and economic growth across cities is an example of “structured” data, as opposed to images, audio, or text.
- RNNs (Recurrent Neural Networks) are used for machine translation because translation is a supervised learning problem that can be trained with sequences. Both the input and output of an RNN are sequences, and translation involves mapping a sequence from one language to another.
- This hand-drawn graph has the x-axis representing the amount of data and the y-axis representing algorithm performance:

- The graph shows that increasing the amount of training data does not hurt the algorithm's performance; in general, more data helps the model.
- Increasing the size of a neural network likewise rarely hurts and typically improves performance; larger networks generally outperform smaller ones.
## Fundamentals of Neural Networks
- A neuron computes a linear function \( z = Wx + b \), followed by an activation function (sigmoid, tanh, ReLU, etc.).
- When computing with matrices directly in NumPy, element-wise operations such as `a * b` use broadcasting rules, whereas `np.dot(a, b)` follows standard matrix multiplication.
- If `img` is a (32, 32, 3) array representing a 32x32 image with 3 color channels (red, green, blue), reshaping it into a column vector is done with `x = img.reshape((32*32*3, 1))`.
- The “Logistic Loss” function is defined as:

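\[
\mathcal{L}(\hat{y}, y) = -\big(\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big)
\]

A minimal NumPy sketch of the corresponding cost averaged over \( m \) examples (the variable names and sample values are my own illustrative assumptions):

```python
import numpy as np

def logistic_cost(y_hat, y):
    # Binary cross-entropy averaged over the m examples stored as columns
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

y     = np.array([[1, 0, 1]])
y_hat = np.array([[0.9, 0.2, 0.6]])
print(logistic_cost(y_hat, y))  # roughly 0.28
```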
- The calculation for the following computation graph is:

\[
\begin{aligned}
J &= u + v - w \\
  &= (a \cdot b) + (a \cdot c) - (b + c) \\
  &= a(b + c) - (b + c) \\
  &= (a - 1)(b + c)
\end{aligned}
\]
## Shallow Neural Networks
- \( a^{[2](12)} \) denotes the activation vector of the 2nd layer for the 12th training example (the bracketed superscript indexes the layer, and the parenthesized superscript indexes the training example).
- Each column of matrix \( X \) is a training sample.
- \( a_4^{[2]} \) denotes the 4th activation output in the 2nd layer.
- \( a^{[2]} \) represents the activation vector of the 2nd layer.
- The tanh activation function generally performs better than sigmoid for hidden units because its output ranges between (-1, 1), with a mean closer to zero. This centralizes data for the next layer, simplifying learning.
- The sigmoid function outputs values between 0 and 1, making it ideal for binary classification (output < 0.5 = class 0, output > 0.5 = class 1). Tanh can also be used, but its values between -1 and 1 are less intuitive for binary tasks.
- In Logistic Regression (no hidden layers), initializing the weights to zero makes the first example's output zero, since \( z = w^{T}x + b = 0 \). However, the derivatives of Logistic Regression depend on the input \( x \) (which is not zero), so at the second iteration the weights follow \( x \)'s distribution and differ from one another, provided \( x \) is not a constant vector.
- When the input to tanh is far from zero, its gradient approaches zero because the slope of tanh is near zero in this region.
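A quick NumPy check (an illustrative sketch of my own) shows how the tanh slope \( 1 - \tanh^2(z) \) collapses toward zero as \( |z| \) grows:

```python
import numpy as np

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, tanh_grad(z))
# Gradient drops from 1.0 at z = 0 to about 0.07, 1.8e-4, and 8.2e-9
```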
- A single-hidden-layer neural network:

- \( b^{[1]} \) should have shape (4, 1)
- \( b^{[2]} \) should have shape (1, 1)
- \( W^{[1]} \) should have shape (4, 2)
- \( W^{[2]} \) should have shape (1, 4)
- For the \( l \)-th layer (where \( 1 \leq l \leq L \)), the correct vectorized forward propagation formula is:

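\[
\begin{aligned}
Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]} \\
A^{[l]} &= g^{[l]}(Z^{[l]})
\end{aligned}
\]

A minimal NumPy sketch of this forward pass for the single-hidden-layer network above, using the shapes (4, 2), (4, 1), (1, 4), (1, 1); the tanh/sigmoid choices and the random example data are my own assumptions:

```python
import numpy as np

np.random.seed(0)
W1, b1 = np.random.randn(4, 2) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

X = np.random.randn(2, 5)        # 2 input features, 5 training examples as columns

Z1 = np.dot(W1, X) + b1          # Z[1] = W[1] A[0] + b[1], where A[0] = X
A1 = np.tanh(Z1)                 # A[1] = g[1](Z[1])
Z2 = np.dot(W2, A1) + b2         # Z[2] = W[2] A[1] + b[2]
A2 = 1 / (1 + np.exp(-Z2))       # A[2] = sigmoid(Z[2])
print(A2.shape)                  # (1, 5): one prediction per example
```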
## Deep Neural Networks
- Using “cache” in forward and backward propagation records values computed during forward passes to enable backward pass calculations via the chain rule.
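A small sketch of this idea for one linear layer; the function names and the use of a plain tuple as the cache are my own assumptions, not the course's exact helper code:

```python
import numpy as np

def linear_forward(A_prev, W, b):
    Z = np.dot(W, A_prev) + b
    cache = (A_prev, W, b)           # store what the backward pass will need
    return Z, cache

def linear_backward(dZ, cache):
    A_prev, W, b = cache             # reuse the cached forward-pass values
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m    # chain rule through Z = W A_prev + b
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
```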
- Hyperparameters include: number of iterations, learning rate, number of layers \( L \), and number of hidden units.
- Deep neural networks can handle more complex input features than shallow networks.
- The following 4-layer network has 3 hidden layers:

- Layer count = number of hidden layers + 1; the input and output layers are not counted as hidden layers.
- In forward propagation, the activation function used in each layer (sigmoid, tanh, ReLU, etc.) must be recorded, because the derivative applied during backward propagation depends on it.
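Concretely, backward propagation computes for each layer

\[
dZ^{[l]} = dA^{[l]} \ast g^{[l]\prime}(Z^{[l]}),
\]

so the derivative \( g^{[l]\prime} \) must match the activation \( g^{[l]} \) chosen in the forward pass.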
- Shallow networks may require exponentially larger circuits (measured by the number of logic gates) to compute some functions, while deep networks can compute the same functions with far smaller circuits.
- A 2-hidden-layer neural network:
[Image not available in original repository]
- (i) \( W^{[1]} \) shape: (4, 4) (calculation: \( W^{[l]} = (n^{[l]}, n^{[l-1]}) \)), \( W^{[2]} = (3, 4) \), \( W^{[3]} = (1, 3) \)
- (ii) \( b^{[1]} \) shape: (4, 1) (calculation: \( b^{[l]} = (n^{[l]}, 1) \)), \( b^{[2]} = (3, 1) \), \( b^{[3]} = (1, 1) \)
- To initialize the model parameters using the `layer_dims` array `[n_x, 4, 3, 2, 1]` (4 hidden units in layer 1, etc.):
```python
import numpy as np

parameter = {}  # parameters for layers 1..L
for i in range(1, len(layer_dims)):
    parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
    parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
```
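With `layer_dims = [n_x, 4, 3, 2, 1]`, this loop produces \( W^{[1]} \) of shape \( (4, n_x) \), \( W^{[2]} \) of shape (3, 4), \( W^{[3]} \) of shape (2, 3), and \( W^{[4]} \) of shape (1, 2), with biases of shapes (4, 1), (3, 1), (2, 1), and (1, 1), matching the \( W^{[l]} = (n^{[l]}, n^{[l-1]}) \) and \( b^{[l]} = (n^{[l]}, 1) \) rules above. The 0.01 factor keeps the initial weights small so that tanh or sigmoid units do not start out in their saturated regions.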
These notes are based on my study of Andrew Ng's courses. I am still a beginner, so if there are any misunderstandings, please feel free to correct me!