Fundamentals of AI — Writeup
| Module ID | Difficulty | Estimated Duration | Number of Sections | Reward |
|---|---|---|---|---|
| 290 | Fundamental · Tier 0 | 6 Hours | 24 (including 1 interactive skills assessment) | 10 Cubes |
Module Link: academy.hackthebox.com/module/details/290
Table of Contents
| # | Section | Type |
|---|---|---|
| 1 | Introduction to Machine Learning | Theory |
| 2 | Mathematics Refresher for AI | Theory |
| 3 | Supervised Learning Algorithms | Theory |
| 4 | Linear Regression | Theory |
| 5 | Logistic Regression | Theory |
| 6 | Decision Trees | Theory |
| 7 | Naive Bayes | Theory |
| 8 | Support Vector Machines (SVMs) | Theory |
| 9 | Unsupervised Learning Algorithms | Theory |
| 10 | K-Means Clustering | Theory |
| 11 | Principal Component Analysis (PCA) | Theory |
| 12 | Anomaly Detection | Theory |
| 13 | Reinforcement Learning Algorithms | Theory |
| 14 | Q-Learning | Theory |
| 15 | SARSA (State-Action-Reward-State-Action) | Theory |
| 16 | Introduction to Deep Learning | Theory |
| 17 | Perceptrons | Theory |
| 18 | Neural Networks | Theory |
| 19 | Convolutional Neural Networks | Theory |
| 20 | Recurrent Neural Networks | Theory |
| 21 | Introduction to Generative AI | Theory |
| 22 | Large Language Models | Theory |
| 23 | Diffusion Models | Theory |
| 24 | Skills Assessment | Interactive |
1. Introduction to Machine Learning
Key Learning Points
- AI: Developing systems that can perform tasks requiring human intelligence, covering NLP, computer vision, robotics, expert systems
- ML: A subfield of AI where systems learn from data rather than being explicitly programmed, divided into three main categories:
- Supervised Learning: learning from labeled data (image classification, spam detection)
- Unsupervised Learning: discovering patterns from unlabeled data (customer segmentation, anomaly detection)
- Reinforcement Learning: learning through trial and error with reward and punishment feedback (games, robotics, autonomous driving)
- DL: a subfield of ML, which uses multi-layer neural networks to automatically extract features; representative architectures include CNN, RNN, Transformer
- Relationship of the three: DL ⊂ ML ⊂ AI
Understanding and Insights
- The inclusion relationship AI ⊃ ML ⊃ DL is the core mental model throughout the entire module—each layer is a specialization of the previous one. AI is the broadest goal (enabling machines to exhibit intelligence), ML is the mainstream methodology for achieving AI (learning from data), and DL is the most automated family of methods within ML (automatic feature extraction)
- The choice among the three learning paradigms depends on the form of the available data: supervised for labeled data, unsupervised for unlabeled data, and reinforcement for interactive environments. No paradigm is inherently superior; the nature of the problem dictates the method.
- A common point of confusion: Deep Learning is not always superior to traditional ML. When data volume is small and features are clear, traditional ML is often more efficient and interpretable.
Practical Takeaways
- Established a hierarchical cognitive framework for AI/ML/DL, enabling quick identification of where new technologies fit within this hierarchy.
- Mastered the basic judgment ability to select learning paradigms based on problem type (labeled/unlabeled, interactive environment or not).
2. Mathematics Refresher for AI
Key Knowledge Points
This section serves as a reference manual, listing mathematical symbols and concepts covered in subsequent chapters:
- Basic Operations: addition, subtraction, multiplication, division, subscript/superscript notation
- Linear Algebra: vector norms, matrix multiplication/transpose/inverse/determinant/trace, eigenvalues and eigenvectors
- Calculus and Statistics: Summation symbol Σ, Logarithms (log2, ln), Exponential functions
- Probability Theory: Conditional probability P(x|y), Expectation E[X], Variance, Standard deviation, Covariance, Correlation
- Set Theory: Cardinality, Union, Intersection, Complement
Understanding and Insights
- The value of this section lies in establishing a mapping from symbols to meanings — enabling quick reference when formulas appear frequently in subsequent chapters.
- Different branches of mathematics correspond to different ML scenarios: linear algebra supports matrix operations in neural networks, probability theory is the cornerstone of Bayesian methods, and calculus drives gradient descent optimization.
- Eigenvalues and eigenvectors may seem abstract in this section, but when we get to PCA, we'll find they are core tools for dimensionality reduction.
Practical Takeaways
- Gained a quick reference guide for mathematical symbols that can be consulted repeatedly, lowering the barrier to reading subsequent chapters.
- Established an awareness of the correspondence between mathematical tools and ML application scenarios.
3. Supervised Learning Algorithms
Key Points
- Supervised learning learns mapping functions from labeled data, divided into classification (predicting categories) and regression (predicting continuous values).
- Core Concepts:
- Training Data / Features / Labels: Basic components of input and output
- Model / Training / Prediction / Inference: The entire process from construction to application
- Evaluation Metrics: Accuracy, Precision, Recall, F1-score
- Generalization / Overfitting / Underfitting: Key issues for model performance on new data
- Cross-validation: Splitting data into multiple folds to more reliably evaluate models
- Regularization: L1 / L2 penalty terms, preventing overfitting
Understanding and Insights
- The concepts of overfitting/underfitting/generalization are more important than any single algorithm—they are the universal criteria for judging the quality of all models. No matter how well a model performs on the training set, if its generalization is poor, it is worthless.
- Cross-validation is a practical tool for catching overfitting: A single train/test split might lead to misjudgment due to data distribution randomness; K-fold cross-validation averages results from multiple splits, providing a more robust performance estimate.
- There is a natural tension (trade-off) between Precision and Recall: Increasing Precision usually lowers Recall, and vice versa. F1-score is the harmonic mean of the two, suitable for scenarios requiring balanced consideration.
- The essence of regularization is to make a trade-off between model complexity and training error—L1 tends to produce sparse solutions (automatic feature selection), while L2 tends to make weights generally smaller.
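The Precision/Recall/F1 relationship described above can be made concrete with a few lines of arithmetic. This is a minimal sketch in plain Python; the function name `precision_recall_f1` and the label lists are illustrative assumptions, not taken from the module.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 4 real positives, the model flags 3 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here the model is right about two of its three positive predictions (precision 2/3) but finds only two of the four real positives (recall 1/2), so the F1-score lands between the two at 4/7.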
Practical Takeaways
- Established a "generalization-first" model evaluation mindset: Training set performance is not the goal; test set/cross-validation results are.
- Mastered cross-validation as a standard operating procedure for diagnosing overfitting
4. Linear Regression
Key Concepts
- Linear Regression: Modeling the relationship between predictor and target variables using a linear equation
- Simple Linear Regression: `y = mx + c` (one predictor variable)
- Multiple Linear Regression: `y = b0 + b1x1 + b2x2 + ... + bnxn`
- Ordinary Least Squares (OLS): Finding the best-fit line by minimizing the Residual Sum of Squares (RSS)
- Four Major Assumptions: Linear Relationship, Independence of Observations, Homoscedasticity, Normality of Errors
Understanding and Insights
- Linear Regression is the starting point and baseline for all regression models—even if the actual problem is nonlinear, linear regression is often used first to establish a reference.
- The essence of OLS is an optimization problem with an analytical solution: For linear regression, iterative optimization is not required; the optimal weights can be directly calculated using mathematical formulas. This contrasts with subsequent neural networks that require gradient descent iterative optimization.
- The four major assumptions are often violated in real-world data, in which case data transformation (e.g., taking logarithms) or switching to a more flexible model is necessary.
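To illustrate the point that OLS has an analytical solution, the sketch below fits `y = mx + c` for a single predictor using the standard closed-form formulas (slope = covariance of x and y divided by variance of x). The function name `fit_simple_ols` and the data points are assumptions made for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS for y = m*x + c with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope that minimizes RSS
    c = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, c


# Hypothetical, roughly linear data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
m, c = fit_simple_ols(xs, ys)
rss = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
print(m, c, rss)
```

No iteration is involved: the weights come out of one pass over the data, which is the contrast with gradient-descent training drawn above.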
Practical Takeaways
- Understood the status of linear regression as the simplest predictive model, and why it is often used as a baseline model.
- Realized that model assumptions are not negligible details—when assumptions are not met, model conclusions may be invalid.
5. Logistic Regression
Key Knowledge Points
- Although named "regression", it is actually used for binary classification, outputting probability values between 0 and 1.
- The core is the Sigmoid function: `P(x) = 1 / (1 + e^-z)`, which maps linear combinations to probabilities.
- Decision Boundary: determined by model parameters and threshold probability, which is a hyperplane in high-dimensional space.
- The threshold is usually set to 0.5, and can be adjusted according to business needs.
- Assumptions: Binary output, Linearity of log-odds, Low multicollinearity among features, Large sample size
Understanding and Insights
- It has "regression" in its name but is actually a classification algorithm—this is the most common naming trap. The reason it's called "regression" is because it models log-odds using a linear equation at its core, regressing probabilities rather than continuous values.
- Core Value of the Sigmoid Function: It maps any real value to the (0, 1) interval, giving the output a probabilistic meaning. This function later also became one of the earliest activation functions used in neural networks.
- Thresholds are adjustable, which is very practical: in a spam filter, for example, raising the threshold from 0.5 to 0.6 requires a higher predicted probability before an email is classified as spam. In scenarios like medical diagnosis, the threshold can be lowered to reduce missed diagnoses (leaning towards Recall), while in scenarios where false positives are costly, the threshold can be raised to reduce false alarms (leaning towards Precision).
- Logistic Regression outputs probabilities rather than hard classifications, which gives decision-makers more flexibility—allowing them to take actions of varying intensity based on the probability.
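The interaction between the Sigmoid output and the decision threshold can be sketched as follows. The score `z` stands in for the linear combination of features; its value and the helper name `classify` are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 (positive class) when the probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0


z = 0.3           # hypothetical linear score for one email
p = sigmoid(z)    # roughly 0.574
print(p, classify(z, threshold=0.5), classify(z, threshold=0.6))
```

The same probability is labeled positive at a threshold of 0.5 and negative at 0.6, which is the sense in which the threshold, not the model, encodes the business trade-off.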
Practical Takeaways
- Clarified that "regression" and "classification" can be bridged by the Sigmoid function: Linear Regression + Sigmoid = Logistic Regression (Classifier).
- Understood that threshold adjustment is a key link between model output and business requirements.
6. Decision Trees
Key Concepts
- Tree-structured model, composed of root node → internal nodes → leaf nodes.
- Splitting criteria:
- Gini impurity: `Gini(S) = 1 - Σ(pi)²`, the lower, the purer
- Entropy: `Entropy(S) = -Σ pi * log2(pi)`, the lower, the more ordered
- Information Gain: The reduction in entropy before and after splitting; the larger, the better.
- Stopping conditions: reaching maximum depth, node data volume below threshold, node is pure.
- Advantages: no linear/normal assumptions, robust to outliers, can handle non-linear relationships.
Understanding and Insights
- The biggest advantage of decision trees is their interpretability—the decision path can be directly shown to non-technical personnel, e.g., "If X > 5 and Y < 3, then it belongs to class A". This is very valuable in security auditing and compliance scenarios.
- Gini impurity and entropy do not differ much in practice, but the conceptual difference is worth understanding: Gini measures "the probability that two randomly selected samples have different classes", and entropy measures "the uncertainty of information"
- The biggest drawback of decision trees is that they are prone to overfitting—unrestricted trees will perfectly fit the training set but generalize very poorly. This is precisely the motivation behind the invention of random forests (ensemble of trees) and pruning techniques.
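The three splitting criteria listed in this section can be computed directly from class counts. The sketch below is a minimal plain-Python version; the helper names and the toy labels are assumptions for illustration, and information gain is computed as parent entropy minus the size-weighted entropy of the child nodes.

```python
import math
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["A", "A", "A", "B", "B", "B"]
perfect_split = [["A", "A", "A"], ["B", "B", "B"]]
print(gini(parent), entropy(parent), information_gain(parent, perfect_split))
```

A 50/50 node has Gini 0.5 and entropy 1 bit; a split that produces two pure children removes all of that uncertainty, so its information gain is the full 1 bit.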
Practical Takeaways
- Understood that decision trees, as naturally interpretable models, are irreplaceable in scenarios requiring transparency.
- Understood the intuition of information gain as a feature selection criterion: choose the feature that results in the greatest reduction in "disorder" after splitting.
7. Naive Bayes
Key Concepts
- Probabilistic classification algorithm based on Bayes' Theorem: `P(A|B) = [P(B|A) * P(A)] / P(B)`
- "Naive" assumption: features are conditionally independent given the class
- Workflow: Calculate prior probabilities → Calculate likelihoods → Apply Bayes' Theorem → Select the class with the highest posterior probability
- Three variants:
- Gaussian Naive Bayes: continuous features, assumes Gaussian distribution
- Multinomial Naive Bayes: discrete features, commonly used for text classification
- Bernoulli Naive Bayes: binary features (presence/absence)
Understanding and Insights
- The "naive" independence assumption is almost always wrong—yet the algorithm still performs excellently in practice, which is a counter-intuitive but important realization. The reason is that classification tasks only require relative probability ranking to be correct (which class has the highest posterior probability), not precise absolute probability values. Even if probability values are biased due to the independence assumption, the ranking relationship often remains correct.
- Naive Bayes has long held a dominant position in text classification (spam filtering, sentiment analysis) because the high-dimensional sparse features of text are well-suited for this type of model—high dimensionality but each document involves only a few words.
- The choice of the three variants depends on the data type of the features, not the problem itself: continuous values use Gaussian, word frequency uses Multinomial, and binary existence uses Bernoulli.
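The workflow (priors, likelihoods, Bayes' Theorem, pick the highest posterior) can be sketched with a toy spam example. All probabilities below are made-up numbers chosen for illustration, and `posterior_scores` is a hypothetical helper name.

```python
def posterior_scores(priors, likelihoods, features):
    """Unnormalized posterior: P(class) * product of P(feature | class).

    Multiplying the per-feature likelihoods is exactly the "naive"
    conditional-independence assumption.
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[cls][feature]
        scores[cls] = score
    return scores


# Hypothetical probabilities, not estimated from real data
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.3},
}

scores = posterior_scores(priors, likelihoods, ["free"])
total = sum(scores.values())  # plays the role of P(B) in Bayes' Theorem
posteriors = {cls: score / total for cls, score in scores.items()}
predicted = max(scores, key=scores.get)
print(posteriors, predicted)
```

The prediction uses only the ranking of the unnormalized scores; dividing by `total` changes the values but never the winner, which is why biased probability estimates can still yield correct classifications.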
Practical Takeaways
- Understood the phenomenon of 'assumptions being wrong but the model still being effective'—the practical value of the model does not depend on the strict validity of the assumptions, but on whether the assumption bias affects the final decision.
- In the field of security, Naive Bayes is often used for initial screening in spam filtering and intrusion detection.
8. Support Vector Machines (SVMs)
Key Points
- Finding the maximum margin hyperplane to separate different classes; the larger the margin, the better the generalization.
- Support Vectors: data points closest to the hyperplane, which determine the position of the hyperplane.
- Linear SVM: used when data is linearly separable, hyperplane equation `w · x + b = 0`
- Non-linear SVM: uses the kernel trick to map data to a higher-dimensional space to make it linearly separable.
- Common kernel functions: Polynomial kernel, RBF kernel, Sigmoid kernel.
- Advantages: no distribution assumptions, good with high-dimensional data, robust to outliers.
Understanding and Insights
- The intuition behind maximum margin is very elegant: the larger the margin, the greater the model's tolerance for new data, and the stronger its generalization ability. This directly echoes the discussion on overfitting/generalization.
- The elegance of the kernel trick: it never computes coordinates in the high-dimensional space explicitly. The kernel function returns the inner product that the mapped points would have there, directly from the original inputs, which avoids the cost of constructing the high-dimensional representation.
- SVM only cares about data points near the decision boundary (support vectors); points far from the boundary do not affect the result. This makes it particularly effective in high-dimensional, small-sample scenarios.
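The kernel trick can be checked numerically for a small case. The sketch below assumes a degree-2 homogeneous polynomial kernel `K(x, y) = (x · y)²` on 2-D inputs, whose explicit feature map is `(x1², √2·x1·x2, x2²)`; this specific kernel is chosen for illustration and is not taken from the module.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return dot(x, y) ** 2


def explicit_map(x):
    """Explicit 3-D feature map that corresponds to poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
y = (3.0, 4.0)
via_kernel = poly_kernel(x, y)                       # stays in 2-D
via_map = dot(explicit_map(x), explicit_map(y))      # goes through 3-D
print(via_kernel, via_map)
```

Both routes give the same inner product (up to floating-point rounding), but the kernel never builds the 3-D vectors. For kernels such as RBF the implicit space is infinite-dimensional, so the shortcut is the only practical route.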
Practical Takeaways
- Understood the idea of using "maximum margin" as a proxy metric for generalization ability, an idea that repeatedly appears in ML.
- The kernel trick provides a concrete example for understanding the general strategy of "transforming problems into an easier-to-solve space".
9. Unsupervised Learning Algorithms
Key Takeaways
- Discovering hidden patterns from unlabeled data, three main task types:
- Clustering: Grouping similar data
- Dimensionality Reduction: Reducing the number of features while retaining key information
- Anomaly Detection: Identifying data points that significantly deviate from normal patterns
- Core Concepts:
- Similarity Measures: Euclidean distance, cosine similarity, Manhattan distance
- Clustering Tendency / Clustering Validity: Assessing whether data is suitable for clustering, and the quality of clustering results
- Dimension / Intrinsic Dimension: The number of features in the data vs. the smaller number of variables actually needed to describe its underlying structure
- Feature Scaling: Min-Max scaling, Z-score standardization, ensuring features contribute fairly to calculations
Understanding and Insights
- Key difference from supervised learning: No labels mean it's impossible to simply judge "right or wrong"—evaluation becomes difficult and subjective. Whether clustering results are "good" often requires judgment from domain experts, or reliance on proxy metrics (e.g., silhouette coefficient)
- Unsupervised learning is more like "exploratory analysis" than "predictive modeling": Its value lies in discovering structures in data that humans had not foreseen
- Feature scaling is more critical in unsupervised learning than in supervised learning—distance metrics are directly affected by feature scales, and an unscaled feature might dominate the entire clustering result
- Anomaly detection is particularly important in the security domain: intrusion detection and fraud detection are typical scenarios, and in these scenarios, anomalous samples are extremely rare and difficult to label.
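The effect of feature scaling on distance-based methods, noted above, can be shown with a nearest-neighbor lookup before and after Min-Max scaling. The customer data (age, income) and the helper names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(column):
    """Rescale a list of values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: euclidean(points[i], points[j]))


# Hypothetical customers: income dwarfs age in raw units
ages = [25, 27, 60, 40]
incomes = [50000, 56000, 51000, 90000]

raw = list(zip(ages, incomes))
scaled = list(zip(min_max_scale(ages), min_max_scale(incomes)))

print(nearest_neighbor(raw, 0), nearest_neighbor(scaled, 0))
```

On raw values the 25-year-old is matched with the 60-year-old because their incomes differ by only 1,000 and age barely registers. After scaling, both features contribute comparably and the 27-year-old becomes the nearest neighbor.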
Practical Takeaways
- Developed an understanding of the inherent characteristic of unsupervised learning: 'difficulty in evaluation' – do not expect to get clear accuracy metrics like in supervised learning.
- Understood that feature scaling is not an optional step, but a necessary prerequisite for distance-based algorithms.
10. K-Means Clustering
Key Points
- Divides data into K non-overlapping clusters through an iterative process: 1. Randomly select K centroids → 2. Assign each point to the nearest centroid → 3. Recalculate centroids → 4. Repeat until convergence
- Choosing the Optimal K Value:
- Elbow Method: Plot WCSS vs K curve, find the inflection point where the rate of decrease slows down.
- Silhouette Analysis: Silhouette coefficient range [-1, 1], closer to 1 indicates better clustering.
- Assumptions and Limitations: Assumes clusters are spherical and similar in size, sensitive to feature scales and outliers.
Insights
- The core contradiction of K-Means: K value needs to be specified in advance, but the number of clusters in real data is often unknown. Elbow method and silhouette analysis are heuristic methods and do not guarantee a "correct" answer.
- Algorithm convergence does not mean finding the global optimum: K-Means only guarantees convergence to a local optimum that depends on the initial centroids. In practice, the algorithm is often run with multiple random initializations, or with an improved initialization strategy such as K-Means++, and the best result is kept.
- The spherical cluster assumption is a strong limitation: when data clusters are ring-shaped, elongated, or irregularly shaped, K-Means will produce misleading results.
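The assign/recalculate loop of K-Means can be sketched in plain Python. One deliberate simplification: the initial centroids are passed in explicitly rather than chosen at random, so that the run is reproducible. The function name `kmeans` and the sample points are assumptions for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iters=100):
    """Plain K-Means: alternate assignment and centroid update until stable."""
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[k]   # keep an empty cluster's centroid
            for k, cluster in enumerate(clusters)
        ]
        # Step 4: stop when centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
# Deliberately poor start: both centroids sit inside the same group
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (2.0, 1.0)])
print(centroids)
```

Even from this poor start the loop recovers the two visible groups here, but that is a property of this easy data set; with less separated clusters a bad initialization can leave the algorithm in a worse local optimum.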
Practical Takeaways
- Mastered the usage process of K-Means as the most basic clustering algorithm, and practical methods for selecting K values using the elbow method and silhouette coefficient.
- Recognized the limitations of K-Means, establishing motivation for subsequent learning of more flexible clustering algorithms (e.g., DBSCAN).
11. Principal Component Analysis (PCA)
Key Knowledge Points
- Dimensionality Reduction technique: Projecting high-dimensional data into a lower-dimensional space, preserving maximum variance.
- Steps: Standardization → Calculate covariance matrix → Find eigenvalues and eigenvectors → Sort by eigenvalues in descending order → Select the top k principal components → Transform data.
- Eigenvalues represent the amount of variance explained by each principal component, and eigenvectors represent the direction of the principal components.
- Choosing the number of components to retain: Typically, choose the number of components whose cumulative explained variance reaches 95%.
- Assumptions: Linear relationships and significant correlations exist between features, sensitive to feature scales.
Understanding and Insights
- The core idea of PCA is to find the direction of greatest data variance—the direction with the largest variance carries the most information, while the direction with the smallest variance is often noise and can be discarded.
- Here, the seemingly abstract eigenvalues/eigenvectors from section 2 finally have a practical use: eigenvectors define the direction of the principal components, and eigenvalues measure the amount of information in that direction.
- The cost of PCA is reduced interpretability: Original features have clear meanings (e.g., "age", "income"), but principal components are linear combinations of original features and no longer have intuitive meanings.
- The 95% variance threshold is a rule of thumb, not a hard rule, and may need to be adjusted for specific scenarios.
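For two features, the PCA steps reduce to a 2×2 covariance matrix whose eigenvalues have a closed form, which makes the "explained variance" idea easy to verify by hand. The sketch below centers the data but skips the standardization step for brevity; that simplification, the helper names, and the sample points are assumptions made for illustration.

```python
import math


def covariance_matrix_2d(points):
    """Return (var_x, cov_xy, var_y) using the sample (n - 1) denominator."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return var_x, cov_xy, var_y


def eigenvalues_symmetric_2x2(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean + radius, mean - radius


# Hypothetical, strongly correlated 2-D data
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
var_x, cov_xy, var_y = covariance_matrix_2d(points)
lam1, lam2 = eigenvalues_symmetric_2x2(var_x, cov_xy, var_y)
explained = lam1 / (lam1 + lam2)   # share of variance on the first component
print(lam1, lam2, explained)
```

Because the two features move almost in lockstep, the first principal component captures well over 95% of the total variance, so keeping one component instead of two loses very little information.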
Practical Takeaways
- Understood the core motivations for dimensionality reduction: reducing computational cost, eliminating noise, and enabling visualization of high-dimensional data.
- Connected eigenvalues/eigenvectors from linear algebra with practical data analysis techniques.
12. Anomaly Detection
Key Knowledge Points
- Identify data points that significantly deviate from normal behavior, three types of anomalies:
- Point Anomalies: single anomalous data points
- Contextual Anomalies: anomalous in a specific context (e.g., 30°C in winter)
- Collective Anomalies: a group of data points that are anomalous as a whole
- Detection methods are divided into three main categories:
- Statistical Methods: assume normal data follows a specific distribution (e.g., Gaussian), use z-score etc. to identify outliers
- Clustering Methods: points that do not belong to any cluster or belong to sparse clusters are considered anomalies (e.g., K-Means)
- Machine Learning Methods:
- One-Class SVM: learns a boundary that encloses normal data
- Isolation Forest: isolates anomalies through random partitioning; the shorter the path, the more likely it is an anomaly
- Local Outlier Factor (LOF): compares the local density of a data point with its neighbors
Understanding and Insights
- The concept of contextual anomalies is profound: the same data value can be normal or anomalous in different contexts—30°C is normal in summer, but anomalous in winter. This means anomaly detection models need to understand the contextual definition of "normal".
- The unique value of anomaly detection in the security domain: attack methods are constantly evolving, making it impossible to enumerate all attack patterns (a limitation of supervised learning), but one can define "what is normal" and then detect deviations.
- The idea behind Isolation Forest is very intuitive: anomalies, being different, are easily "isolated" during random partitioning—much like a person behaving strangely in a crowd is easily noticed.
- Different methods are suitable for different scenarios: statistical methods are simple but have strong assumptions, Isolation Forest is suitable for high-dimensional data, and LOF is suitable for data with uneven density.
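The statistical approach can be sketched with a z-score filter. The data (daily login counts) is hypothetical, and the cutoff of 2.0 is an assumption chosen because, with only ten samples, a single extreme value inflates the standard deviation enough that it can never exceed a z-score of 3.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        return []                     # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical daily login counts with one obvious spike
logins = [50, 52, 49, 51, 48, 50, 53, 47, 51, 250]
outliers = zscore_outliers(logins)
print(outliers)
```

The spike is flagged while the ordinary days are not. The need to pick a lower cutoff for a small sample also shows the weakness mentioned above: the method leans heavily on distributional assumptions, and the anomaly itself distorts the mean and standard deviation it is measured against.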
Practical Takeaways
- Mastered a classification framework for three types of anomalies, and can choose appropriate detection strategies based on specific scenarios.
- Developed an awareness of anomaly detection applications in cybersecurity—the core idea of Intrusion Detection Systems (IDS) is precisely "deviation from normal is suspicious".
13. Reinforcement Learning Algorithms
Key Knowledge Points
- The agent learns the optimal strategy by interacting with the environment, based on reward and punishment feedback.
- Categorized into Model-based RL and Model-free RL.
- Core Concepts:
- Agent / Environment / State / Action / Reward: The five basic elements of RL
- Policy: A mapping strategy from states to actions
- Value Function: Estimates the long-term value of a state/action (State-value / Action-value)
- Discount Factor (γ): Controls the degree of importance given to future rewards, between 0 and 1
- Episodic vs Continuous Tasks: Tasks with a terminal state vs tasks without a terminal state
Understanding and Insights
- The fundamental difference between RL and supervised/unsupervised learning: No pre-existing dataset—the agent must "generate" its own training data by interacting with the environment, which means data quality depends on the policy itself, creating a chicken-and-egg dilemma.
- The discount factor γ is a philosophical parameter: When γ is close to 1, the agent has "foresight" and values long-term rewards; when γ is close to 0, the agent is "short-sighted" and only considers immediate rewards. Different tasks require different time horizons.
- Model-based vs Model-free Trade-offs: Model-based can "imagine" results before acting (efficient but the model might be inaccurate), while Model-free must interact with the real environment (reliable but sample inefficient).
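The effect of the discount factor γ described above can be seen by computing the discounted return `r0 + γ·r1 + γ²·r2 + ...` for the same reward sequence under two values of γ. The reward sequence and the function name are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: small rewards first, a large one at the end
rewards = [1, 1, 1, 10]
short_sighted = discounted_return(rewards, gamma=0.1)
far_sighted = discounted_return(rewards, gamma=0.9)
print(short_sighted, far_sighted)
```

With γ = 0.1 the final reward of 10 contributes only 0.01 to the return, so the agent effectively ignores it; with γ = 0.9 it contributes 7.29 and dominates the total.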
Practical Takeaways
- Understood the five-element framework of RL, and can map any interactive decision-making problem into the Agent/Environment/State/Action/Reward structure.
- Established connections between RL and games, cybersecurity offense and defense (e.g., automated penetration testing).
14. Q-Learning
Key Concepts
- A model-free RL algorithm that learns the optimal policy by estimating Q-values (the expected cumulative reward of each state-action pair).
- Q-table: stores Q-values for all state-action pairs.
- Update formula (Bellman Equation): `Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',a') - Q(s,a)]`
- Belongs to off-policy algorithms: updates using the maximum Q-value of the next state, independent of the current policy.
- Exploration-Exploitation Trade-off: Epsilon-Greedy strategy—explores randomly with probability ε, and selects the optimal action with probability 1-ε.
- Assumptions: The environment satisfies the Markov property, and environment dynamics are stationary.
Understanding and Insights
- Meaning of Off-policy: Q-Learning updates using `max Q(s',a')`, meaning it assumes the optimal action will be chosen next, but in actual execution a random action might be chosen due to ε-greedy. The "learned policy" and the "used policy" are separate; this is "off-policy." In contrast, SARSA updates using the actually executed action, "learning what it uses," which is "on-policy."
- The exploration-exploitation dilemma is a fundamental challenge in RL: Over-exploitation leads to local optima, never discovering better policies; over-exploration wastes significant resources on inefficient actions. ε-greedy is the simplest balancing strategy, but far from optimal.
- Q-table methods are only suitable for problems with finite state-action spaces. When the state space is huge or continuous (e.g., images as states), function approximation (e.g., Deep Q-Networks DQN) is needed to replace the Q-table.
- The essence of the Bellman Equation is to update the estimate of the current value using the estimate of future values—this is a "bootstrapping" idea.
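A single Q-Learning step (select an action, observe a reward, update the Q-table) can be sketched as below. The state and action names, the learning rate α = 0.5, and the discount γ = 0.9 are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next       # bootstrapped estimate
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


actions = ["left", "right"]
q = defaultdict(float)            # Q-table; unseen pairs default to 0.0
q[("s1", "left")] = 1.0
q[("s1", "right")] = 2.0

new_value = q_learning_update(q, "s0", "right", reward=1.0,
                              next_state="s1", actions=actions)
print(new_value)
```

The target is `1.0 + 0.9 * 2.0 = 2.8`, and with α = 0.5 the estimate for `("s0", "right")` moves halfway from 0 to that target, giving 1.4. The update looks only at the best next action, regardless of what `epsilon_greedy` would actually pick next.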
Practical Takeaways
- Mastered the complete process of Q-Learning as the most classic RL algorithm: Initialize Q-table → Select action → Get reward → Update Q-value → Repeat
- Understood the difference between off-policy and on-policy, laying the foundation for comparing with SARSA.
15. SARSA (State-Action-Reward-State-Action)
Key Knowledge Points
- Model-free on-policy RL algorithm, differing from Q-Learning by using the Q-value of the actually executed next action for updates.
- Update Formula: `Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') - Q(s,a)]`
- On-policy characteristic makes SARSA more conservative and safer, suitable for scenarios with high safety requirements.
- Exploration Strategies: Epsilon-Greedy and Softmax
- Key Parameter Tuning: Learning rate α (update step size) and discount factor γ (trade-off between immediate and future rewards)
Understanding and Insights
- The formulas for SARSA and Q-Learning are almost identical. The only difference is that Q-Learning uses `max Q(s',a')` (value of the optimal action) while SARSA uses `Q(s',a')` (value of the actually executed action). This subtle difference leads to significant behavioral distinctions.
- Source of SARSA's Conservatism: Because SARSA incorporates the 'pits' encountered during exploration into its updates, the policy it learns actively avoids dangerous areas. Q-Learning, on the other hand, assumes optimal action selection in the future, potentially learning a more aggressive but theoretically superior policy.
- Classic Case: "Cliff Walking": SARSA learns to take a longer route to avoid the cliff edge, while Q-Learning learns to walk along the cliff edge for the shortest path—the latter is theoretically optimal but frequently falls during exploration.
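The behavioral gap between the two update rules shows up when the next action is an exploratory one. The sketch below applies both targets to the same transition in a cliff-like situation; the Q-values, state and action names, and parameters are all hypothetical.

```python
def q_learning_target(q, reward, next_state, actions, gamma):
    """Off-policy target: assume the best next action will be taken."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: use the next action that was actually chosen."""
    return reward + gamma * q[(next_state, next_action)]


def td_update(old_value, target, alpha):
    return old_value + alpha * (target - old_value)


actions = ["forward", "toward_cliff"]
q = {
    ("s0", "forward"): 0.0,
    ("s1", "forward"): 5.0,          # safe continuation
    ("s1", "toward_cliff"): -100.0,  # falling is heavily penalized
}
alpha, gamma, reward = 0.5, 0.9, -1.0
next_action = "toward_cliff"   # exploratory move picked by epsilon-greedy

old = q[("s0", "forward")]
q_learning_value = td_update(
    old, q_learning_target(q, reward, "s1", actions, gamma), alpha)
sarsa_value = td_update(
    old, sarsa_target(q, reward, "s1", next_action, gamma), alpha)
print(q_learning_value, sarsa_value)
```

Q-Learning ignores the exploratory slip and rates the move toward `s1` positively (1.75), while SARSA charges the slip to that move and rates it strongly negatively (-45.5). Repeated over many episodes, this is what pushes SARSA toward the longer, safer route.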
Practical Takeaways
- Through the comparison of Q-Learning and SARSA, deepened the understanding of the on-policy/off-policy distinction.
- Recognized that in safety-critical scenarios (e.g., robot control, autonomous driving), conservative strategies may have more practical value than theoretically optimal strategies.
16. Introduction to Deep Learning
Key Concepts
- Deep learning uses multi-layered neural networks to automatically learn features from raw data, eliminating the need for manual feature engineering.
- Core Components:
- Artificial Neural Networks (ANN): Composed of neurons and weighted connections
- Layer Structure: Input Layer → Hidden Layers (multiple) → Output Layer
- Activation Functions: Introduce non-linearity – Sigmoid, ReLU, Tanh
- Backpropagation: Calculates the gradient of the loss function with respect to weights, propagating errors layer by layer
- Loss Functions: Measure the difference between predictions and true values (MSE / Cross-Entropy)
- Optimizers: SGD, Adam, RMSprop, used to update weights
- Hyperparameters: Parameters set before training, such as learning rate, number of layers, number of neurons per layer
Understanding and Insights
- The core motivation of deep learning: automated feature engineering. Traditional ML requires manual feature design (e.g., extracting edge histograms from images), while DL directly learns features from raw data (pixel values). This frees up domain experts but also turns the model into a "black box".
- The necessity of activation functions introducing non-linearity: Without activation functions, no matter how many layers are stacked, the entire network would still be equivalent to a linear transformation – making multiple layers meaningless. Non-linearity is the source of deep networks' expressive power.
- Backpropagation + Gradient Descent is the core training loop of DL: Forward propagation calculates predictions → Loss function calculates error → Backpropagation calculates gradients → Optimizer updates weights. Understanding this loop means understanding the full picture of DL training.
- Hyperparameter tuning still resembles "art" more than "science" – there's no universal optimal configuration, largely relying on experience and experimentation.
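The training loop (forward pass, loss, gradient, weight update) can be shown at the smallest possible scale: a single weight `w` in the model `y = w * x`, trained with mean squared error and plain gradient descent. With one parameter, "backpropagation" reduces to a single hand-written derivative. The data, learning rate, and epoch count are assumptions for illustration.

```python
# Hypothetical data generated by y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # the single trainable weight
learning_rate = 0.05  # hyperparameter: set before training, not learned
losses = []

for epoch in range(100):
    # Forward propagation: compute predictions
    preds = [w * x for x in xs]
    # Loss: mean squared error between predictions and targets
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Gradient of the loss with respect to w: mean of 2 * (pred - y) * x
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Optimizer step: move against the gradient
    w -= learning_rate * grad
    losses.append(loss)

print(w, losses[0], losses[-1])
```

The weight converges toward 2 and the loss shrinks toward 0. A real network repeats exactly this cycle, with the chain rule supplying the gradient for every weight in every layer.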
Practical Takeaways
- Understood that the watershed between traditional ML and deep learning lies in "whether manual feature engineering is required"—a crucial criterion for choosing a method.
- Established a complete mental model of the DL training loop: Forward Propagation → Calculate Loss → Backpropagation → Weight Update.
17. Perceptrons
Key Knowledge Points
- Basic building blocks of neural networks: Input × Weights → Sum + Bias → Activation Function → Output.
- Using a step function as the activation function, the output is a binary value (0 or 1).
- Limitations: A single-layer perceptron can only learn linearly separable decision boundaries and cannot solve the XOR problem.
Understanding and Insights
- The perceptron is the smallest unit for understanding neural networks—all complex deep networks are built by stacking this basic structure.
- The significance of the XOR problem goes far beyond a mathematical toy: Minsky and Papert's 1969 book Perceptrons highlighted this limitation, which contributed to a sharp decline in neural network research (often associated with the first AI winter). Interest revived once multi-layer networks could be trained effectively with backpropagation.
- The evolution from perceptrons to neurons: replacing non-differentiable step functions with continuously differentiable activation functions (Sigmoid, ReLU), making backpropagation possible.
Practical Takeaways
- Understood the lowest-level computational logic of neural networks: weighted sum + non-linear transformation.
- Through the XOR problem, recognized the fundamental limitations of single-layer networks, and understood why "depth" is needed.
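To make the need for depth concrete, here is a small sketch showing that one hidden layer is enough for XOR. The weights are hand-picked for illustration (an assumption; a trained network would find its own): one hidden unit computes OR, the other computes AND, and the output unit fires when OR is true but AND is not.

```python
def step(z):
    """Step activation: binary output."""
    return 1 if z >= 0 else 0

def unit(weights, bias, inputs):
    """A single perceptron unit."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor(a, b):
    h_or = unit([1, 1], -1, [a, b])            # fires if a + b >= 1
    h_and = unit([1, 1], -2, [a, b])           # fires if a + b >= 2
    return unit([1, -1], -1, [h_or, h_and])    # fires if OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```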
18. Neural Networks
Key Knowledge Points
- Multi-Layer Perceptron (MLP) overcomes the limitations of single-layer perceptrons by introducing hidden layers.
- Activation Functions: Sigmoid, ReLU, Tanh, Softmax
- Training Process:
- Forward Propagation: Data is computed layer by layer from the input layer to the output layer
- Backward Propagation: Errors are propagated back layer by layer from the output layer, calculating the gradients of weights for each layer
- Gradient Descent: Weights are updated along the negative gradient direction, with the learning rate controlling the step size
Understanding and Insights
- MLP is a bridge from perceptrons to deep learning: Theoretically, an MLP with a sufficiently wide hidden layer can approximate any continuous function (Universal Approximation Theorem), but in practice, "deep and narrow" is often more efficient than "shallow and wide"
- Reasons why ReLU replaced Sigmoid as the mainstream activation function: Sigmoid's gradient approaches 0 at both ends (saturation region), leading to vanishing gradients in deep networks; ReLU's gradient is consistently 1 in the positive region, alleviating this problem
- Design of Softmax as a multi-class classification output layer: It transforms an arbitrary real-valued vector into a probability distribution (where all outputs sum to 1), and is a generalization of the logistic regression Sigmoid function for multi-class scenarios
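The claims about saturation and Softmax can be checked with the standard definitions. This is a minimal sketch; the sample inputs are arbitrary, and the 10-layer bound ignores the weights, so it is only a rough illustration of the vanishing gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 in the positive region, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def softmax(logits):
    """Turn arbitrary real values into a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Sigmoid saturates for large |z|; ReLU keeps gradient 1 for z > 0.
print(sigmoid_grad(0.0), sigmoid_grad(10.0), relu_grad(10.0))

# Upper bound on the gradient factor through 10 sigmoid layers (weights ignored).
deep_sigmoid_factor = 0.25 ** 10

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

With two classes and logits `[z, 0]`, the first Softmax output equals `sigmoid(z)`, which is the sense in which Softmax generalizes the Sigmoid.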
Practical Takeaways
- Fully understood the neural network training loop: Forward Propagation → Calculate Loss → Backward Propagation → Gradient Descent to update weights
- Mastered the principles for choosing different activation functions: ReLU preferred for hidden layers, Sigmoid for binary classification output, Softmax for multi-class classification output
19. Convolutional Neural Networks
Key Knowledge Points
- Neural networks designed specifically for grid-like data (images), with three core layers:
- Convolutional Layer: Uses learnable filters to extract local features (edges, textures, shapes)
- Pooling Layer: Downsamples to reduce dimensionality (Max Pooling / Average Pooling)
- Fully Connected Layer: Performs final classification or regression based on the extracted features
- Hierarchical Feature Learning: shallow layers detect edges → middle layers recognize shapes → deep layers identify objects
- Assumptions: grid-like data structure, spatial hierarchy of features, feature locality and stationarity
Understanding and Insights
- Hierarchical feature learning (edges → shapes → objects) is the core insight of CNNs—the network doesn't recognize "cats" in one go, but rather builds increasingly abstract representations layer by layer. This is strikingly similar to the hierarchical processing mechanism of the human visual system.
- CNNs' advantage over MLPs in image processing comes from two key inductive biases: Locality (features are local, so convolutional kernels only look at local regions) and Translation Invariance (the same filter shares weights across the entire image, so a feature is detected regardless of where it appears in the picture). Strictly speaking, weight sharing makes the convolution translation-equivariant (the feature map shifts with the input), and pooling then adds approximate invariance.
- Weight sharing significantly reduces the number of parameters: a 3x3 convolutional kernel has only 9 weights (per input channel) but scans the entire image. By contrast, in a fully connected layer every single neuron connected to a 256x256 grayscale image needs 65,536 weights, so the parameter count explodes.
- Pooling layers not only reduce dimensionality, but also provide a degree of positional invariance—objects moving a few pixels won't affect detection results.
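The convolution and pooling operations described above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical 6x6 image (dark left half, bright right half) and a hand-written vertical-edge kernel. In a real CNN the kernel values are learned, and the operation usually runs over many channels.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 convolution as used in CNNs (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# Hypothetical image: left three columns dark (0), right three bright (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written vertical edge detector (9 shared weights).
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = conv2d(image, kernel)   # 4x4 map, strong response at the edge
pooled = max_pool(feature_map)        # 2x2 map after downsampling

conv_weights = 3 * 3                      # shared across all positions
dense_weights_per_neuron = 256 * 256      # one fully connected neuron, grayscale
print(feature_map, pooled, conv_weights, dense_weights_per_neuron)
```

The feature map is non-zero only where the window straddles the dark-to-bright boundary, and the pooled map keeps that response at a quarter of the size.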
Practical Takeaways
- Understood why CNNs achieved revolutionary success in computer vision: the combination of hierarchical feature learning, weight sharing, and local receptive fields.
- Mastered the typical architectural pattern of CNNs: alternately stacking convolutional and pooling layers, finally connecting to fully connected layers for output.
20. Recurrent Neural Networks
Key Knowledge Points
- Designed specifically for sequential data, maintaining "memory" through recurrent connections, receiving current input and previous hidden state at each time step.
- Vanishing Gradient Problem: gradients exponentially decay in long sequences, making it difficult to learn long-term dependencies.
- Solutions:
- LSTM: introduces a memory cell and three gates (input gate, forget gate, output gate), selectively remembering and forgetting.
- GRU: a simplified version of LSTM, with only two gates (update gate, reset gate), and higher efficiency.
- Bidirectional RNN: Processes sequences from both forward and backward directions simultaneously, capturing the full context
Understanding and Insights
- The vanishing gradient problem is key to understanding the motivation behind LSTM/GRU: In standard RNNs, backpropagation through time multiplies the gradient by the recurrent weight matrix (and the activation derivative) at every time step. If the magnitude of this repeated factor is below 1, gradients decay exponentially (vanish); if it is above 1, they grow exponentially (explode). Vanishing gradients mean the network cannot learn long-range dependencies, for example the influence of a subject at the beginning of a sentence on the tense of a verb at the end
- LSTM's gating mechanism is a form of "selective memory": The forget gate decides what old information to discard, the input gate decides what new information to write, and the output gate decides what to expose to the next step. Information in the memory cell can pass largely unimpeded along a "highway," which greatly mitigates (though does not fully eliminate) the vanishing gradient problem
- However, the sequential processing nature of RNN/LSTM (which must proceed step by step) limits their parallelization capabilities, setting the stage for the emergence of the Transformer
- The intuition behind Bidirectional RNNs: Understanding the meaning of a word often requires looking at both its preceding and following context. For example, in "I went to the bank to ______", the meaning of "bank" can only be determined after seeing what follows, such as "deposit money" or "go fishing"
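The recurrence and the vanishing gradient can be illustrated with a scalar RNN. This is a simplified sketch: real RNNs use weight matrices, and the values of `w_x`, `w_h`, and the inputs here are assumptions. At each step the hidden state is `h_t = tanh(w_x * x_t + w_h * h_{t-1})`, and the gradient of `h_t` with respect to `h_0` picks up one more factor `w_h * (1 - h_t^2)`.

```python
import math

def run_rnn(inputs, w_x, w_h, h0=0.0):
    """Run a scalar tanh RNN; return the final state and d h_T / d h_0."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)   # current input + previous hidden state
        grad *= w_h * (1.0 - h * h)        # chain rule through one time step
    return h, grad

# With |w_h| < 1 the gradient shrinks with every additional time step.
_, short_grad = run_rnn([1.0] * 5, w_x=1.0, w_h=0.5)
_, long_grad = run_rnn([1.0] * 50, w_x=1.0, w_h=0.5)
print(short_grad, long_grad)

# Without a squashing activation, a factor above 1 grows exponentially instead.
linear_growth = 1.5 ** 50
print(linear_growth)
```

After 50 steps the gradient is far too small to carry a useful learning signal back to the start of the sequence, which is the problem LSTM and GRU gates are designed to ease.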
Practical Takeaways
- Understood the technical implementation of "memory" in sequence modeling, and why vanilla RNNs cannot handle long sequences
- Established the technical evolution roadmap from RNN → LSTM/GRU → Transformer: each step aimed at solving the bottlenecks of the previous one
21. Introduction to Generative AI
Key Knowledge Points
- Generative AI focuses on creating new content (text, images, music, code), rather than merely analyzing or classifying
- Main Model Types:
- GAN: Generator and discriminator trained adversarially
- VAE: Learns compressed representations of data for generation
- Autoregressive Models: Generate elements sequentially, one by one
- Diffusion Models: Generate by progressively denoising from noise
- Key Concepts: Latent Space, Sampling, Mode Collapse, Overfitting
- Evaluation Metrics: IS, FID, BLEU
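Of the model types listed above, the autoregressive idea is the easiest to sketch. The toy below is an illustration only: the corpus is made up, and a bigram count table stands in for a neural network. It generates text one token at a time, each time conditioning on what has been generated so far (here, just the previous token) and picking the most frequent continuation. Real models sample from a learned probability distribution instead of always taking the top choice.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already split into tokens.
corpus = "the cat sat on the mat . the cat sat on a mat .".split()

# "Training": count which token follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start, max_tokens=6):
    """Greedy autoregressive generation: append one token at a time."""
    tokens = [start]
    while len(tokens) < max_tokens and tokens[-1] in counts:
        next_token = counts[tokens[-1]].most_common(1)[0][0]
        tokens.append(next_token)
        if next_token == ".":      # stop at end of sentence
            break
    return tokens

print(generate("the", max_tokens=4))
```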
Understanding and Insights
- Essential Difference between Generative AI and Discriminative AI: Discriminative models learn P(y|x) (predicting categories given input), while generative models learn P(x) (learning the distribution of the data itself, thereby enabling the generation of new samples). The latter is significantly more difficult than the former
- GAN's adversarial training idea is extremely elegant: the generator and discriminator are adversaries—the generator attempts to create fakes that are indistinguishable from real ones, and the discriminator attempts to expose them. Both progress together in this game. However, this also leads to training instability (e.g., mode collapse issues)
- Latent space is a core concept for understanding all generative models: It is a compressed, continuous representation space where similar data points are closer together. The generation process involves sampling from the latent space and then "decoding"
- Evaluating generative models is much more difficult than evaluating classification models—there is no single standard for the "quality" of generated content, and IS/FID/BLEU are merely proxy metrics
Practical Takeaways
- Gained an overview of the four major types of generative AI models and understood their respective generation paradigms
- Recognized the double-edged sword effect of generative AI in the security domain: It can be used to generate adversarial samples for attacks, and also for data augmentation and synthesizing training data
22. Large Language Models
Key Knowledge Points
- Large-scale text generation models based on Transformer architecture, with parameters reaching billions or even trillions
- Three Major Characteristics: Large-scale parameters, Few-shot learning, Contextual understanding
- Core Technology Stack:
- Tokenization: splitting text into tokens
- Embeddings: mapping tokens to high-dimensional vectors that capture semantics
- Encoder / Decoder: Encoder understands input, decoder generates output (many modern LLMs, such as the GPT family, use only the decoder stack)
- Self-Attention: Calculates attention scores between words, capturing long-range dependencies
- Training Method: Self-supervised learning on massive unlabeled text data (typically predicting the next token), optimizing parameters using gradient descent
Understanding and Insights
- The self-attention mechanism is the key breakthrough that allowed the Transformer to overtake RNNs: an RNN must process sequences sequentially (1st word → 2nd → ... → nth word), whereas self-attention allows each word to directly interact with any other word in the sequence, regardless of distance (within the context window). This not only captures long-range dependencies but also allows for large-scale parallel computation
- Intuition behind Self-Attention: When processing "The cat sat on the mat because it was tired", "it" needs to attend to "cat" (not "mat") to determine the anaphoric relationship. Attention scores are precisely the weights that measure the relevance between words
- Scaling laws are another profound discovery in LLMs: Model capabilities improve predictably with increases in parameter count, data volume, and computational resources. This explains why "bigger is better" has become the main theme in LLM development
- Few-shot learning means that LLMs, through pre-training, have already "seen" a vast range of patterns, and can adapt to new tasks with just a few examples in the prompt at inference time, without any weight updates. Traditional ML models typically need retraining on labeled data to do the same
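The attention intuition above can be sketched with scaled dot-product self-attention. This is a minimal illustration under strong assumptions: the three 2-dimensional token vectors are hand-made, and they are used directly as queries, keys, and values, whereas a real Transformer derives each from learned projection matrices.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)                  # relevance of every token
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]  # weighted mix of values
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: "it" is deliberately placed close to "cat".
tokens = ["cat", "mat", "it"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(embeddings, embeddings, embeddings)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)
```

Every token's scores are computed independently of the others, which is why the whole sequence can be processed in parallel rather than step by step.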
Practical Takeaways
- Understood the technical reasons why the Transformer architecture replaced RNNs: the dual advantages of parallelization + long-range dependency modeling
- Mastered the full technical stack of LLMs: Tokenization → Embedding → Self-Attention → Encoder/Decoder, laying the foundation for understanding and using LLM tools
23. Diffusion Models
Key Knowledge Points
- Learns data distribution through a "noise addition → denoising" process, generating high-quality images
- Forward Process: Gradually adds noise to data until it becomes pure noise
- Reverse Process: Trains a denoising network to predict and remove noise, gradually recovering data from pure noise
- Text-Guided Generation: Uses Transformer / CLIP to encode text into latent representations, conditioning the denoising process
- Key Components: Noise Scheduling (controlling the amount of noise added at each step), Denoising Network (CNN or Transformer)
- Assumptions: Markov Property, Static Data Distribution, Smoothness of Data Distribution
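The forward process and noise schedule can be sketched as follows. This is an illustration under assumptions, not the module's code: it uses a DDPM-style linear schedule (`beta` from 1e-4 to 0.02 over 1000 steps) and the closed form `x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise`, where `abar_t` is the running product of `1 - beta`. The reverse process would train a network to predict the added noise, which is not shown here.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Noise schedule: how much noise is added at each step (assumed values)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]

def cumulative_alphas(betas):
    """abar_t: fraction of the original signal variance left after step t."""
    result, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        result.append(running)
    return result

def add_noise(x0, t, abar, rng):
    """Jump directly to step t of the forward process."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
abar = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                       # hypothetical "data" vector
slightly_noisy = add_noise(x0, 10, abar, rng)
almost_pure_noise = add_noise(x0, 999, abar, rng)
print(abar[0], abar[-1])
```

Early steps keep almost all of the signal, while by the last step the signal coefficient is close to zero, matching the description of data being turned into pure noise.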
Understanding and Insights
- The intuition behind diffusion models is quite elegant: Easy to destroy, hard to repair—forward noise addition is simple (adding a small amount of Gaussian noise at each step), but learning to denoise in reverse requires the network to truly understand the data's structure. This bears a striking resemblance to the second law of thermodynamics in physics (entropy increase is easy, entropy decrease is difficult).
- Compared to GANs, diffusion models train more stably (without the instability of adversarial training) and often generate higher-quality, more diverse results, but at the cost of slower inference, since generation requires many iterative denoising steps.
- Text-guided generation connects the visual and language modalities: The CLIP model established a shared semantic space for text and images, making it possible to "describe desired images with text".
- Diffusion models are currently the mainstream method for image generation (the foundation of Stable Diffusion and DALL-E 2, and reportedly Midjourney), largely replacing the previous dominance of GANs.
Practical Takeaways
- Understood the core "add noise, then denoise" idea of diffusion models and grasped the basic principles of current image generation technology.
- Recognized the technical foundation of multimodal AI: connecting different types of data through a shared latent space.
24. Skills Assessment
Exercise Solutions
Q1: Which probabilistic algorithm, based on Bayes' theorem, is commonly used for classification tasks such as spam filtering and sentiment analysis, and is known for its simplicity, efficiency, and good performance in real-world scenarios?
Solution Approach:
The problem describes a probabilistic classification algorithm based on Bayes' theorem, known for its simplicity and efficiency, and widely used in spam filtering and sentiment analysis. This is precisely the Naive Bayes introduced in Section 7.
Answer: Naive Bayes
Q2: What dimensionality reduction technique transforms high-dimensional data into a lower-dimensional representation while preserving as much original information as possible, and is widely used for feature extraction, data visualization, and noise reduction?
Solution Approach:
The problem describes a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation, widely used for feature extraction and data visualization. This is precisely the PCA introduced in Section 11.
Answer: Principal Component Analysis
Q3: What model-free reinforcement learning algorithm learns an optimal policy by estimating the Q-value, which represents the expected cumulative reward an agent can obtain by taking a specific action in a given state and following the optimal policy afterward? This algorithm learns directly through trial and error, interacting with the environment and observing the outcomes.
Solution Approach:
The keywords of the problem are "model-free", "Q-value", and "expected cumulative reward". This is exactly the Q-Learning introduced in Section 14, which learns the optimal policy by estimating the expected cumulative reward for each state-action pair using a Q-table.
Answer: Q-Learning
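For reference, the Q-table update behind this answer can be sketched as below. This is a minimal illustration of the standard Q-learning rule `Q(s,a) ← Q(s,a) + α [r + γ max Q(s',a') − Q(s,a)]`. The state names, actions, and the values of `alpha` and `gamma` are hypothetical.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(q[(next_state, a)] for a in actions)   # greedy lookahead
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)            # Q-table, all entries start at 0
actions = ["left", "right"]
q[("s1", "right")] = 10.0         # pretend s1 is already known to be valuable

new_value = q_update(q, "s0", "right", reward=1.0,
                     next_state="s1", actions=actions)
print(new_value)
```

The `max` over next actions is what makes the method learn about the optimal policy regardless of how the agent actually explores, in contrast to SARSA, which uses the action actually taken.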
Q4: What is the fundamental computational unit in neural networks that receives inputs, processes them using weights and a bias, and applies an activation function to produce an output? Unlike the perceptron, which uses a step function for binary classification, this unit can use various activation functions such as the sigmoid, ReLU, and tanh.
Solution Approach:
The question asks about the basic computational unit in a neural network, which receives input, processes it using weights and biases, and applies an activation function to produce output. Unlike a perceptron, it can use various activation functions (Sigmoid, ReLU, Tanh). This is the Neuron defined in Section 18.
Answer: Neuron
Q5: What deep learning architecture, known for its ability to process sequential data like text by capturing long-range dependencies between words through self-attention, forms the basis of large language models (LLMs) that can perform tasks such as translation, summarization, question answering, and creative writing?
Solution Approach:
The problem describes a deep learning architecture that captures long-range dependencies and processes sequential data through a self-attention mechanism, and is the foundation of LLMs. This is exactly the Transformer architecture introduced in Section 22.
Answer: Transformer
Answer Key
| Section | Question Number | Answer |
|---|---|---|
| 24 - Skills Assessment | Q1 | Naive Bayes |
| 24 - Skills Assessment | Q2 | Principal Component Analysis |
| 24 - Skills Assessment | Q3 | Q-Learning |
| 24 - Skills Assessment | Q4 | Neuron |
| 24 - Skills Assessment | Q5 | Transformer |