Applications of AI in InfoSec — Writeup
| Module ID | Difficulty | Estimated Duration | Number of Sections | Reward |
|---|---|---|---|---|
| 292 | Easy · Tier II | 8 hours | 25 (including 4 interactive assessments + 1 skill assessment) | 20 Cubes |
Module Link: academy.hackthebox.com/module/details/292
Contents
| # | Section | Type | Exercise |
|---|---|---|---|
| 1 | Introduction | Theory | — |
| 2 | Environment Setup | Interactive | Q1 ✅ |
| 3 | JupyterLab | Interactive | — |
| 4 | Python Libraries for AI | Theory | — |
| 5 | Datasets | Theory | — |
| 6 | Data Preprocessing | Theory | — |
| 7 | Data Transformation | Theory | — |
| 8 | Metrics for Evaluating a Model | Theory | — |
| 9 | Spam Classification | Theory | — |
| 10 | The Spam Dataset | Interactive | — |
| 11 | Preprocessing the Spam Dataset | Interactive | — |
| 12 | Feature Extraction | Interactive | — |
| 13 | Training and Evaluation (Spam Detection) | Interactive | — |
| 14 | Model Evaluation (Spam Detection) | Interactive | Q1 ✅ |
| 15 | Network Anomaly Detection | Interactive | — |
| 16 | Preprocessing and Splitting the Dataset | Interactive | — |
| 17 | Training and Evaluation (Network Anomaly Detection) | Interactive | — |
| 18 | Model Evaluation (Network Anomaly Detection) | Interactive | Q1 ✅ |
| 19 | Malware Classification | Theory | — |
| 20 | The Malware Dataset | Interactive | — |
| 21 | Preprocessing the Malware Dataset | Interactive | — |
| 22 | The Model | Interactive | — |
| 23 | Training and Evaluation (Malware Image Classification) | Interactive | — |
| 24 | Model Evaluation (Malware Image Classification) | Interactive | Q1 ✅ |
| 25 | Skills Assessment | Interactive | Q1 ✅ |
1. Introduction
Key Learning Points
- This module builds three complete AI projects: Spam SMS Classifier (NLP + Naive Bayes), Network Anomaly Detection (Tabular Data + Random Forest), and Malware Image Classification (CNN + ResNet50 Transfer Learning)
- All code is provided in Python code blocks, to be executed sequentially in Jupyter Notebook.
- Runtime Environment: Playground VM (`http://<TARGET_IP>:8888`) or Local Environment (recommended 4GB+ RAM, 4-core CPU)
- Module Evaluation Method: Trained models are uploaded to the Playground VM's evaluation port, and a flag is returned after passing the performance threshold.
Understanding and Insights
The value of this module is not in individual algorithms, but in completing the full closed loop of an ML project—from raw data to a deliverable model. The three projects cover both scikit-learn and PyTorch toolchains, as well as three data types: text, tabular, and image.
A key realization: This module does not test theoretical derivations; it tests whether you can run through the end-to-end process. Code can be directly copied and pasted, but if the preprocessing pipeline is inconsistent with training, the model will fail after upload—this is precisely the most common source of bugs in real ML engineering.
Practical Takeaways
Established a complete mental model of "Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment", so that when encountering new ML tasks later, one knows what to do at each step.
2. Environment Setup
Key Concepts
- Miniconda: A lightweight version of Anaconda that provides the `conda` package manager and can create isolated Python virtual environments
- Installation Method: Windows: Use Scoop, macOS: Use Homebrew, Linux: Download installation script
- Creating a virtual environment: `conda create -n ai python=3.11` → `conda activate ai`
- Core dependency installation: `conda install numpy scipy pandas scikit-learn matplotlib seaborn nltk` + `conda install pytorch torchvision`
- Channel configuration: `conda config --add channels conda-forge` etc.; `channel_priority strict` ensures package version consistency
- `conda config --set auto_activate_base false` can prevent the base environment from automatically activating every time a terminal is opened
Understanding and Insights
The fundamental reason for using Miniconda instead of system Python or pip: ML projects have extremely complex dependency trees (PyTorch requires matching CUDA versions, scikit-learn and numpy have ABI coupling), and pip can run into version conflicts when installing deep learning libraries. Conda resolves binary dependencies, including non-Python libraries such as the CUDA runtime, inside each environment, which is why it is so widely used in the ML domain.
Common pitfalls
- After `conda init`, you must restart the terminal for it to take effect; otherwise `conda activate` will report an error.
- When installing PyTorch, be sure to specify the CUDA version (`pytorch-cuda=12.4`). With a mismatched build, no error is reported, but training silently falls back to the CPU and can be more than 10 times slower.
- Playground VM can be used but has limited performance; running training locally is much more efficient.
Exercise Solutions
Q1: If you choose to use the Playground VM, you can start it here and familiarize yourself with the environment. We recommend keeping the VM running as you work through the module and follow along with the code snippets. Type DONE to continue.
Answer: DONE
3. JupyterLab
Key Takeaways
- JupyterLab: A web-based interactive development environment, a standard tool for data science and ML.
- Three cell types: Code cells, Markdown cells, Raw cells
- Stateful Environment: Variables, functions, and imports defined in one cell remain available to any cell executed afterwards until the kernel is restarted.
- Execute code: `Shift + Enter` (execute and jump to the next), `Ctrl + Enter` (execute without jumping)
- Restart Kernel: Kernel → Restart Kernel or Restart & Clear All Outputs
- Installation: `conda install -y jupyter jupyterlab notebook ipykernel`
Understanding and Insights
Jupyter's stateful nature is a double-edged sword. The advantage is that you can see the results as you write and gradually build data pipelines; the risk is that if cells are executed out of order, variable states will not match the code order. The first reaction when debugging should be to Restart & Run All, running everything from scratch to confirm consistent states.
Practical Takeaways
Mastered the working mode of Jupyter as an ML experimental environment—using notebooks for rapid iteration and visual exploration, and after confirming the feasibility of the solution, exporting it as a .py script for automation and version control.
4. Python Libraries for AI
Key Knowledge Points
- Scikit-learn: Traditional ML library based on NumPy/SciPy/Matplotlib
  - Data Preprocessing: `StandardScaler` (Standardization), `MinMaxScaler` (Normalization), `OneHotEncoder` (Categorical Encoding), `SimpleImputer` (Missing Value Imputation)
  - Unified API: `model.fit(X_train, y_train)` Training → `model.predict(X_test)` Prediction
  - Model Evaluation: `train_test_split` (Data Splitting), `cross_val_score` (Cross-validation), `accuracy_score`/`f1_score` (Metric Calculation)
- PyTorch: Deep learning framework originally developed by Facebook (now Meta)
  - Tensor: Similar to NumPy ndarray, but supports gradient tracking and GPU acceleration
  - Dynamic Computation Graph: Computation graph built on-the-fly during forward pass, debugging is more intuitive than static graphs (TensorFlow 1.x)
  - Model Building: `nn.Sequential` (simple stacking) or inheriting `nn.Module` (custom forward pass)
  - Five Steps of Training Loop: `zero_grad()` → Forward Pass → Calculate Loss → `backward()` → `step()`
  - Data Loading: `Dataset` + `DataLoader` for batch iteration, shuffling, and multi-process parallelism
  - Model Persistence: `torch.save(model.state_dict(), path)` to save parameters / `torch.jit.script(model)` followed by `.save(path)` to save the complete model
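The five-step training loop listed above can be sketched on a toy problem. This is a minimal illustration, not the module's code: the data (`y = 2x + 1`), the single linear layer, the learning rate, and the epoch count are all assumptions chosen so the loop converges quickly on a CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Toy regression data (hypothetical): y = 2x + 1
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * X + 1

model = nn.Sequential(nn.Linear(1, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    optimizer.zero_grad()        # 1. clear gradients left from the previous step
    pred = model(X)              # 2. forward pass
    loss = loss_fn(pred, y)      # 3. compute the loss
    loss.backward()              # 4. backpropagate
    optimizer.step()             # 5. update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

In a real project the full-batch tensors would normally be replaced by a `DataLoader` yielding mini-batches, with the same five steps inside the inner loop.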
Understanding and Insights
The fundamental difference between scikit-learn and PyTorch is not "one is simple and one is complex", but rather different levels of abstraction. scikit-learn encapsulates the training loop (fit() in one line), suitable for structured data and classical algorithms; PyTorch exposes every step of the training loop, suitable for deep learning tasks that require custom network structures, loss functions, or training strategies. The criterion for choosing which one: If scikit-learn has a ready-made algorithm that can solve your problem, use scikit-learn; only use PyTorch when CNN/RNN/Transformer is needed.
Commonly Confused Concepts
PyTorch Tensors and NumPy ndarrays look very similar, but Tensors can track gradients (requires_grad=True) and can operate on GPUs. Converting between the two requires .numpy() / torch.from_numpy(); a Tensor on the GPU must first be moved with .cpu(), and one that tracks gradients must first be detached with .detach(), before converting to NumPy.
5. Datasets
Key Takeaways
- Four main types of datasets: Tabular Data (CSV/databases), Image Data (pixel arrays), Text Data (natural language), Time Series (sequences with timestamps)
- Seven attributes of high-quality datasets: Relevance, Completeness, Consistency, Accuracy, Representativeness, Balance, Scale
- Example dataset `demo_dataset.csv` contains network logs: `source_ip`, `destination_port`, `protocol`, `bytes_transferred`, `threat_level`
- Three essential tools for data exploration: `df.head()` (viewing samples), `df.info()` (checking types and missing values), `df.isnull().sum()` (counting missing values)
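As a rough illustration of those three calls, the sketch below builds a three-row frame in memory instead of reading `demo_dataset.csv`. The rows are invented for the example; only the column names come from the notes above.

```python
import io
import pandas as pd

# Hypothetical rows imitating the demo_dataset.csv columns
raw = io.StringIO(
    "source_ip,destination_port,protocol,bytes_transferred,threat_level\n"
    "192.168.1.10,443,TCP,1500,0\n"
    "10.0.0.5,STRING_PORT,UDP,300,1\n"
    "MISSING_IP,80,TCP,,2\n"
)
df = pd.read_csv(raw)

print(df.head())             # eyeball a few rows
df.info()                    # dtypes and non-null counts per column
missing = df.isnull().sum()  # NaN count per column
print(missing)
```

Here `destination_port` is reported as a non-numeric column because of the `STRING_PORT` placeholder, while `isnull()` only sees the genuinely empty `bytes_transferred` cell. Placeholder strings are not counted as missing until they are converted to NaN.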
Understanding and Insights
The upper limit of a model is determined by the data, not the algorithm. A complex model trained on dirty data is inferior to a simple model trained on clean data. The first thing to do after getting data is always to examine the data, not write the model. If a numeric column shows an object dtype in df.info(), non-numeric strings are almost certainly mixed in and need cleaning.
Practical Takeaways
Developed data quality awareness: A 'good' dataset isn't just about being larger; it needs to be balanced (positive and negative sample ratios are close) and representative (covering real-world scenarios). If 99% of the traffic is normal, a model predicting 'normal' for everything would achieve 99% accuracy, but it wouldn't have learned anything—this is the data root cause of why the Accuracy metric in Section 8 can be misleading.
6. Data Preprocessing
Key Takeaways
- Four main tasks of data preprocessing: Cleaning (handling missing values/outliers), Transformation (encoding/scaling), Integration (merging multi-source data), Formatting (type conversion/reshaping)
- Invalid value detection methods: Regex validation for IP format, Range validation for ports (0-65535) / byte count (≥0) / threat level (0-2)
- Two strategies for handling invalid data:
  - Discarding: `data.drop(invalid.index, errors='ignore')`, suitable for large datasets with few bad data points
  - Imputation: `SimpleImputer(strategy='median')` fills numeric columns with the median, `strategy='most_frequent'` fills categorical columns; `KNNImputer` infers based on neighbor relationships
- Unify invalid value representation: First, use `df.replace()` to replace various placeholders (`MISSING_IP`, `STRING_PORT`, `?`) with `NaN`, then convert with `pd.to_numeric(errors='coerce')`.
Understanding and Insights
This section teaches not just API calls, but a troubleshooting approach: first use a validation function to locate the bad data, then decide how to handle it based on the data volume. In the demo dataset of 100 entries, 23 are bad, leaving only 77 after dropping them. Discarding them would throw away almost a quarter of the data, so here the bad values should be imputed rather than dropped.
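A validation pass of that kind might look like the sketch below. It is an assumption-laden illustration: the helper names and sample rows are hypothetical, and the standard-library `ipaddress` module stands in for the regex check mentioned above.

```python
import ipaddress

def is_valid_ip(value):
    """True when value parses as an IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(str(value))
        return True
    except ValueError:
        return False

def is_valid_int(value, low, high):
    """True when value is an integer within [low, high]."""
    try:
        return low <= int(value) <= high
    except (TypeError, ValueError):
        return False

# Hypothetical records in the shape of the demo dataset
rows = [
    {"source_ip": "192.168.1.10", "destination_port": "443", "threat_level": "0"},
    {"source_ip": "MISSING_IP", "destination_port": "80", "threat_level": "1"},
    {"source_ip": "10.0.0.5", "destination_port": "STRING_PORT", "threat_level": "2"},
    {"source_ip": "10.0.0.6", "destination_port": "70000", "threat_level": "?"},
]

def find_invalid(rows):
    """Return the indexes of rows that fail any check."""
    bad = []
    for i, row in enumerate(rows):
        ok = (
            is_valid_ip(row["source_ip"])
            and is_valid_int(row["destination_port"], 0, 65535)
            and is_valid_int(row["threat_level"], 0, 2)
        )
        if not ok:
            bad.append(i)
    return bad

bad_rows = find_invalid(rows)
print(bad_rows)
```

Counting `len(bad_rows)` against `len(rows)` is what informs the drop-or-impute decision.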
Key technique: First, unify all kinds of invalid values by replacing them with NaN, then process them all at once with SimpleImputer, which is much more efficient than fixing them one by one.
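The unify-then-impute idea can be sketched as follows. The frame is a hypothetical four-row stand-in for the demo data, and the cast to `object` before the categorical imputer is a defensive assumption rather than something the module prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical rows containing the placeholder strings named above
df = pd.DataFrame({
    "destination_port": ["443", "STRING_PORT", "80", "22"],
    "bytes_transferred": ["1500", "300", "?", "700"],
    "protocol": ["TCP", "UDP", "?", "TCP"],
})

# 1. Unify every placeholder into NaN
df = df.replace(["MISSING_IP", "STRING_PORT", "?"], np.nan)

# 2. Coerce numeric columns; anything unparsable also becomes NaN
num_cols = ["destination_port", "bytes_transferred"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

# 3. Impute numeric columns with the median in one pass
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# 4. Impute the categorical column with its most frequent value
cat = df[["protocol"]].astype(object)
df["protocol"] = SimpleImputer(strategy="most_frequent").fit_transform(cat).ravel()

print(df)
```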
Practical Takeaways
Mastered the data cleaning process of "Locate bad data → Determine strategy → Unify representation → Batch process", which is the first step in any ML project.
7. Data Transformation
Key Knowledge Points
- Encoding Categorical Features:
  - `OneHotEncoder`: generates a binary column for each category (`protocol_TCP`=1/0), does not introduce ordinal relationships
  - `LabelEncoder`: integer encoding (TCP=0, UDP=1), introduces spurious ordering, only suitable for ordinal categories (scikit-learn intends it for target labels; `OrdinalEncoder` is the feature-side equivalent)
- Handling Skewed Data: Apply the `np.log1p()` logarithmic transformation to numerical features with heavily skewed distributions to compress extreme values and make the distribution less skewed; `log1p` instead of `log` because `log(0)` is undefined
- Data Splitting (Train/Validation/Test = 60%/20%/20%):
  - First `train_test_split(test_size=0.2)` → 80% Train + 20% Test
  - Second `train_test_split(test_size=0.25)` on the 80% → 0.8×0.25=0.2 → 60% Train + 20% Validation
Understanding and Insights
The One-Hot vs LabelEncoder pitfall is the most important concept in this section: LabelEncoder maps categories to integers in sorted order (HTTP=0, TCP=1, UDP=2), and the model might mistakenly assume UDP > TCP > HTTP. The rule is simple: always use One-Hot for unordered categorical variables.
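The difference is easy to see on a four-value column. This is a small sketch with made-up protocol values. Note that scikit-learn's `LabelEncoder` assigns codes in sorted order of the class names, so the integers follow the alphabet rather than the order of appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

protocols = pd.Series(["TCP", "UDP", "HTTP", "TCP"], name="protocol")

# Integer encoding: classes are sorted, so the codes imply HTTP < TCP < UDP
label_codes = LabelEncoder().fit_transform(protocols)
print(label_codes)

# One-hot encoding: one independent 0/1 column per category, no ordering
one_hot = pd.get_dummies(protocols, prefix="protocol", dtype=int)
print(one_hot)
```

Each one-hot row contains exactly one 1, so no category is numerically "larger" than another.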
The second test_size=0.25 in the three-segment split is often confusing—it's a proportion of the remaining 80%, not of the whole. Remembering 0.8 × 0.25 = 0.2 will prevent confusion.
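A quick way to convince yourself of the arithmetic is to run the two splits on 100 dummy samples. The data and `random_state=42` below are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 hypothetical samples
y = np.arange(100) % 2

# First split: hold out 20% of everything as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: 25% of the remaining 80% = 20% of the whole
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```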
Practical Takeaways
Mastered the intermediate steps of the ML data pipeline: Cleaned data → Encoding categorical variables → Handling skewed distributions → Three-segment split. Each step has a clear 'why': encoding is because algorithms only recognize numbers, log transformation is because extreme values can dominate the model, and splitting is because independent validation/test sets are needed to prevent overfitting.
8. Metrics for Evaluating a Model
Key Concepts
- Accuracy = `(TP + TN) / (TP + TN + FP + FN)`, overall correctness rate; misleading when classes are imbalanced
- Precision = `TP / (TP + FP)`, proportion of true positives among predicted positives; High Precision = fewer false positives
- Recall = `TP / (TP + FN)`, proportion of actual positives correctly identified; High Recall = fewer false negatives
- F1-Score = `2 × Precision × Recall / (Precision + Recall)`, harmonic mean of the two
- Other metrics: Specificity, AUC-ROC, Confusion Matrix
Understanding and Insights
That Accuracy can be deceptive is the most important lesson in this section. When 99% of the dataset is normal traffic, a model predicting everything as 'normal' can achieve 99% Accuracy, but fail to detect any attacks. This is why the security domain rarely relies solely on Accuracy.
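The effect can be reproduced with plain arithmetic. The confusion-matrix counts below are invented for illustration (1,000 connections, 10 of them attacks), and undefined ratios are reported as 0.0 by convention in this sketch.

```python
def metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# A model that always answers "normal": no true or false positives at all
always_normal = metrics(tp=0, fp=0, tn=990, fn=10)
print(always_normal)

# A model that catches 8 of the 10 attacks at the cost of 20 false alarms
detector = metrics(tp=8, fp=20, tn=970, fn=2)
print(detector)
```

The useful detector ends up with lower accuracy than the useless always-normal model, while recall and F1 tell the two apart.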
Precision and Recall are essentially a seesaw—raising the threshold reduces false positives (Precision increases), but more true attacks will also be missed (Recall decreases). F1-Score provides a balance point between the two.
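The seesaw can be sketched by sweeping a decision threshold over a handful of scores. The scores and labels are hypothetical. In this toy set precision rises steadily with the threshold, although on real data it can fluctuate while recall can only stay flat or fall.

```python
# Hypothetical model scores (probability of "attack") with true labels
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall(threshold):
    """Precision and recall when scores >= threshold are flagged as attacks."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l == 1 for p, l in zip(preds, labels))
    fp = sum(p and l == 0 for p, l in zip(preds, labels))
    fn = sum((not p) and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.85):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```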
Practical Takeaways
Developed the ability to select metrics based on business scenarios: Intrusion detection favors Recall; spam filtering favors Precision. There is no one-size-fits-all 'best metric', it depends on the cost of errors.
9. Spam Classification
Key Concepts
- Bayes' Theorem: `P(A|B) = P(B|A) × P(A) / P(B)`, i.e. after observing evidence B, update the belief in event A
- Applied to spam detection: `P(Spam|Features) = P(Features|Spam) × P(Spam) / P(Features)`
- Naive Bayes' "Naive" Assumption: features are conditionally independent, i.e., `P(F1,F2|Spam) = P(F1|Spam) × P(F2|Spam)`
- Classification Decision: Calculate P(Spam|Features) and P(Not Spam|Features) separately, and choose the class with the larger posterior probability.
Understanding and Insights
The "naive" assumption is almost always false in reality: "free" and "prize" are highly correlated in spam SMS, not independent. But Naive Bayes still performs well because classification only requires comparing the relative magnitudes of two probabilities, not precise estimation of their absolute values. The conditional independence assumption makes the probability estimates inaccurate, but it often leaves their ranking, and therefore the predicted class, unchanged.
This section's calculation example nicely demonstrates the power of Bayesian updating: prior P(Spam)=0.3 (30%), posterior P(Spam|F1,F2)≈0.588 (59%) after observing features—observational evidence pulled the belief from 30% to nearly 60%.
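The update itself is a few multiplications. The sketch below uses the same prior of 0.3 but hypothetical likelihoods, so it does not reproduce the 0.588 figure from the module's worked example. It only shows the mechanics.

```python
def posterior_spam(prior_spam, lik_spam, lik_ham):
    """P(Spam | F1..Fn) under the naive conditional-independence assumption.

    lik_spam / lik_ham hold P(Fi | Spam) and P(Fi | Not Spam) per feature.
    """
    joint_spam = prior_spam
    for p in lik_spam:
        joint_spam *= p
    joint_ham = 1.0 - prior_spam
    for p in lik_ham:
        joint_ham *= p
    # P(Features) is the sum of both joint terms, so it normalises the result
    return joint_spam / (joint_spam + joint_ham)

# Hypothetical likelihoods, NOT the numbers from the module's example
post = posterior_spam(prior_spam=0.3, lik_spam=[0.6, 0.5], lik_ham=[0.1, 0.2])
print(round(post, 3))
```

With no features at all the function returns the prior, which is a handy sanity check.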
Practical Takeaways
Understood the complete inference chain of Naive Bayes in text classification—from prior probability to likelihood calculation to posterior comparison, and why a "wrong" assumption can still produce an effective classifier.
10. The Spam Dataset
Key Points
- SMS Spam Collection: 5574 SMS messages, labeled as ham or spam, from the UCI Machine Learning Repository
- Data Format: TSV, must specify `sep="\t"`, `header=None`, `names=["label", "message"]` when loading
- Three steps for data checking: `df.head()` (check if parsing is correct), `df.isnull().sum()` (check for missing values), `df.duplicated().sum()` (check for duplicates)
- Deduplication: `df.drop_duplicates()` removed 403 duplicates, leaving 5169 entries (which implies 5572 rows were loaded, two fewer than the 5574 the collection is documented with)
Understanding and Insights
This dataset is TSV, not CSV. If sep="\t" is omitted, lines are not split on the tab, so the label and the message are never separated; all subsequent steps will then be wrong without necessarily throwing an error, and the results will just be a mess.
Duplicate entries must be removed: If the same SMS message appears multiple times, it might appear in both the training and test sets, leading to inflated test scores (the model has "seen" this data rather than truly learned to classify it).
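Both points can be checked on a tiny in-memory sample. The four messages below are invented. The real file is loaded the same way with a path in place of the `StringIO` object.

```python
import io
import pandas as pd

# Four hypothetical tab-separated rows; the last one repeats the second
raw = (
    "ham\tSee you at lunch?\n"
    "spam\tWIN a free prize now!\n"
    "ham\tCall me later\n"
    "spam\tWIN a free prize now!\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                 names=["label", "message"])
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates()
print(n_dupes, len(df))

# Parsing the same text with the default comma separator finds nothing to
# split on here, so each whole line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw), header=None)
print(wrong.shape)
```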
Practical Takeaways
Developed the habit of running head() + info() + duplicated() immediately after loading data, confirming data completeness and correctness before entering any modeling steps.
11. Preprocessing the Spam Dataset
Key Knowledge Points
Standard NLP Preprocessing Pipeline (executed sequentially):
- Lowercase Conversion: `str.lower()` merges "Free" and "free" into the same feature.
- Punctuation and Number Removal: Regex `[^a-z\s$!]` retains `$` (implies amount) and `!` (implies urgency), as these two symbols have strong discriminative power for spam messages.
- Tokenization: `word_tokenize()` is more accurate than `split(" ")` and correctly handles contractions (e.g., don't → do, n't).
- Stop Word Removal: Removes high-frequency, low-information words like the, is, and, using `nltk.corpus.stopwords`.
- Stemming: `PorterStemmer` reduces inflected forms to a common stem (running/runs → run), significantly reducing vocabulary size. Irregular forms such as ran are left unchanged, because stemming strips suffixes rather than looking words up.
- Re-joining: `" ".join(tokens)` restores a string for consumption by CountVectorizer.
Understanding and Insights
The order of preprocessing cannot be arbitrary—tokenization must precede stop word removal, which must precede stemming. If stemming is done before stop word removal, the stemmed forms of certain stop words might not be in the stop word list, leading to missed deletions.
Retaining $ and ! is a noteworthy design decision: most NLP tutorials blindly remove all punctuation, but in the specific context of spam message classification, these two symbols carry crucial discriminative information. Preprocessing is not a mechanical process; it requires judgment combined with domain knowledge.
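The effect of the character class is easiest to see on one message. The sample text and the helper names are made up for this sketch. The first pattern is the one quoted in the pipeline above, and the second is the "remove all punctuation" variant for contrast.

```python
import re

def clean(text):
    """Lowercase, then drop everything except a-z, whitespace, '$' and '!'."""
    return re.sub(r"[^a-z\s$!]", "", text.lower())

def clean_strict(text):
    """Contrast case: also drops '$' and '!'."""
    return re.sub(r"[^a-z\s]", "", text.lower())

sample = "WIN $1000 cash NOW!!! Reply 2 claim."
cleaned = clean(sample)
strict = clean_strict(sample)
print(repr(cleaned))
print(repr(strict))
```

Removed characters leave their surrounding spaces behind. That is harmless here because tokenization comes next.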
Common Pitfalls
The exact same preprocessing function must be used during training and prediction. The Pipeline only encapsulates CountVectorizer and the classifier; the preceding text cleaning (lowercasing/regex/tokenization/stop word removal/stemming) sits outside it, so you have to apply it yourself, identically, at prediction time. Any discrepancy will lead to a mismatch in vocabulary and meaningless prediction results.
12. Feature Extraction
Key Knowledge Points
- Bag of Words: Build a vocabulary; each message becomes a vector whose elements count how many times each word appears in that message
- CountVectorizer implements the Bag of Words model, key parameters:
  - `min_df=1`: words must appear in at least 1 document to be kept (in practice, can be increased to 5 to remove extremely rare words)
  - `max_df=0.9`: words appearing in more than 90% of documents are excluded (too common, equivalent to another layer of stop word filtering)
  - `ngram_range=(1, 2)`: simultaneously extracts unigrams and bigrams, capturing local word order
- Output: Sparse matrix X, most elements are 0
- Label transformation: `y = df["label"].apply(lambda x: 1 if x == "spam" else 0)` converts ham/spam to 0/1
Understanding and Insights
ngram_range=(1, 2) is a key parameter for improving performance. When using only unigrams, "free" appearing alone is not necessarily spam ("feel free to ask"), but the bigram "free prize" almost certainly is. Bigrams recover the local word order information lost by the Bag of Words model.
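A two-message sketch shows what the bigrams add. The messages are hypothetical and assumed to be already preprocessed. All other `CountVectorizer` settings are left at their defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical, already-preprocessed messages
docs = ["feel free ask", "claim free prize"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))

X = uni_bi.transform(docs)   # scipy sparse matrix of n-gram counts
print(X.shape, X.nnz)
```

With unigrams only, both messages share the feature `free`. With bigrams, `feel free` and `free prize` become separate columns that a classifier can weight differently.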
The output matrix is very sparse—5000+ messages × tens of thousands of words, but each message contains only a dozen words on average, and 99%+ of elements are 0. scikit-learn stores it in scipy.sparse format, which is far more memory-efficient than dense matrices.
Practical Takeaways
Mastered the complete text → numerical feature conversion chain: Raw text → Preprocessing (Section 11) → CountVectorizer → Feature matrix consumable by ML models. This pattern applies to all Bag of Words-based text classification tasks.
13. Training and Evaluation (Spam Detection)
Key Knowledge Points
- Pipeline: chains `CountVectorizer` + `MultinomialNB` into a unified estimator
  - `pipeline.fit(X, y)` automatically vectorizes first, then trains
  - `pipeline.predict(new_text)` automatically vectorizes first, then predicts
  - `joblib.dump(pipeline, path)` saves the complete pipeline, directly usable after loading.
- GridSearchCV: Iterates through hyperparameter combinations and selects the optimal configuration using cross-validation.
  - Search for the optimal `alpha` (Laplace smoothing factor) in [0.01, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1.0].
  - Evaluation metric: `scoring="f1"`, 5-fold cross-validation.
- Model Evaluation: For new messages, you must first manually perform the exact same preprocessing as during training, then call `pipeline.predict()`.
- Model Persistence: `joblib.dump()` save / `joblib.load()` restore, what is saved is the complete Pipeline.
Understanding and Insights
Intuition for the alpha parameter: MultinomialNB's alpha is the Laplace smoothing factor. If alpha=0, a word that was never seen in one class during training gets a likelihood of exactly zero for that class, which wipes out the probability of that class for any message containing the word (zero multiplied by any number is zero). The larger alpha is, the more "conservative" it is (more uniform probability distribution); the smaller alpha is, the more "aggressive" it is (more trust in training data).
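The smoothing formula behind that intuition is `(count + alpha) / (total + alpha × vocabulary size)`. The sketch below evaluates it for a word with zero occurrences in one class. The token total and vocabulary size are hypothetical round numbers.

```python
def smoothed_likelihood(word_count, total_words, vocab_size, alpha):
    """P(word | class) with additive (Laplace/Lidstone) smoothing."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Hypothetical counts: a word that never appeared in ham training messages,
# out of 1,000 ham tokens and a 500-word vocabulary
for alpha in (0.0, 0.01, 1.0):
    p = smoothed_likelihood(0, 1000, 500, alpha)
    print(f"alpha={alpha}: P(word|ham)={p:.6f}")
```

Even a small alpha keeps the likelihood above zero. Larger values pull every word's estimate toward the uniform value `1 / vocab_size`.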
Limitations of Pipeline: The Pipeline only encapsulates CountVectorizer and the classifier. The preceding text cleaning (lowercase/regex/tokenization/stopwords removal/stemming) is not within the Pipeline. When predicting new messages, the same preprocess_message() function must be called manually.
Practical Takeaways
Mastered two core engineering patterns of scikit-learn—Pipeline and GridSearchCV. These two tools will be repeatedly used in any scikit-learn project.
14. Model Evaluation (Spam Detection)
Exercise Solutions
Q1: What is the flag you get from submitting a good model for evaluation?
Solution Approach:
Overall process: Download SMS Spam Collection dataset → Text preprocessing (lowercase/punctuation removal/tokenization/stopwords removal/stemming) → CountVectorizer feature extraction → GridSearchCV training MultinomialNB → Save and upload.
Complete Training Code:
import requests, zipfile, io, os, re, json
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import joblib
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
# 1. Download dataset
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
response = requests.get(url, verify=False)
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("sms_spam_collection")
# 2. Load Data
df = pd.read_csv(
    "sms_spam_collection/SMSSpamCollection",
    sep="\t", header=None, names=["label", "message"],
)
df = df.drop_duplicates()
# 3. Text Preprocessing
df["message"] = df["message"].str.lower()
df["message"] = df["message"].apply(lambda x: re.sub(r"[^a-z\s$!]", "", x))
df["message"] = df["message"].apply(word_tokenize)
stop_words = set(stopwords.words("english"))
df["message"] = df["message"].apply(lambda x: [w for w in x if w not in stop_words])
stemmer = PorterStemmer()
df["message"] = df["message"].apply(lambda x: [stemmer.stem(w) for w in x])
df["message"] = df["message"].apply(lambda x: " ".join(x))
# 4. Feature Extraction + Training
vectorizer = CountVectorizer(min_df=1, max_df=0.9, ngram_range=(1, 2))
y = df["label"].apply(lambda x: 1 if x == "spam" else 0)
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("classifier", MultinomialNB())
])
param_grid = {"classifier__alpha": [0.01, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1.0]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
grid_search.fit(df["message"], y)
best_model = grid_search.best_estimator_
# 5. Save Model
joblib.dump(best_model, "spam_detection_model.joblib")
Upload Model:
curl -F "model=@spam_detection_model.joblib" http://<TARGET_IP>:8000/api/upload
Answer: HTB{sp4m_cla55if13r_3v4lu4t0r}
15. Network Anomaly Detection
Key Knowledge Points
- Random Forest: An ensemble learning algorithm that builds multiple decision trees and aggregates their results (classification = majority voting, regression = taking the average)
- Three core mechanisms:
- Bootstrap Sampling: Sampling with replacement to create multiple training subsets, so each tree sees different data
- Random Feature Selection: Only considers a subset of features for each split, reducing correlation between trees
- Voting Aggregation: A single tree might not be accurate, but accuracy significantly improves after multiple trees vote
- NSL-KDD Dataset: A standard benchmark for network intrusion detection, an improvement over KDD Cup 1999 (removing redundant records and the bias they cause; rare attack classes remain rare)
- The data contains 41 features (statistical properties of network connections) and attack type labels
Understanding and Insights
Random Forest is an ideal choice for this task: network traffic data has 40+ features and is high-dimensional. Random Forest is naturally good at handling high-dimensional data, does not require feature scaling (based on split thresholds rather than distance calculations), is robust to outliers, trains much faster than deep learning, and can yield good results with default parameters.
NSL-KDD is to network intrusion detection what MNIST is to image classification: a standard benchmark in academia. It addresses the most damaging problem of the original KDD dataset, the huge number of redundant records that biased models towards frequent patterns. Class imbalance is reduced but not removed, which is why the Privilege class still has only a few dozen samples.
Practical Takeaways
Understood why "ensemble" is stronger than "individual": Each tree only sees a subset of data and features, and might be inaccurate individually, but after 100 trees vote, the noise is averaged out, and accuracy significantly improves. This idea is not limited to Random Forest; it's the foundation of the entire field of ensemble learning.
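The "noise is averaged out" argument can be made concrete with a binomial calculation. This is an idealised sketch: it assumes a binary decision and fully independent voters that are each right 60% of the time. Real trees in a forest are correlated, so the actual gain is smaller.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right."""
    need = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct)**(n_trees - k)
        for k in range(need, n_trees + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Bootstrap sampling and random feature selection exist precisely to push the trees closer to this independence assumption.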
16. Preprocessing and Splitting the Dataset
Key Points
- Binary Classification Target: `attack_flag`, normal → 0, any attack → 1
- Multi-class Classification Target: `attack_map`, mapping dozens of specific attack names to 5 classes:
  - 0 = Normal, 1 = DoS, 2 = Probe
  - 3 = Privilege Escalation, 4 = Access
- Categorical Variable Encoding: `pd.get_dummies(df[['protocol_type', 'service']])` One-Hot Encoding
- Numerical Features: 34 statistical metrics (duration, src_bytes, dst_bytes, serror_rate, etc.) used directly
- Data Splitting: 80/20 split for test set → then 70/30 split of the training portion for the validation set (56% train / 24% validation / 20% test overall)
Understanding and Insights
Trade-offs between Binary vs. Multi-class Classification: Binary classification (normal/attack) is simple but provides less information—only knowing "there's an attack" but not its type. Multi-class classification (5 classes) can distinguish attack types, which is more valuable for actual security responses (DoS requires rate limiting, Probe requires monitoring, Privilege Escalation requires immediate isolation). The evaluation port for this module requires a multi-class model.
Random Forest does not require feature scaling (as it's based on splitting thresholds rather than distance calculations), so the 34 numerical features can be used directly, without needing StandardScaler like SVM/KNN.
Common Pitfalls
- `random_state=1337` must be consistent with the tutorial; otherwise the data split differs and the final model performance may differ from expectations.
- Some spellings in the attack name lists (e.g., dos_attacks, probe_attacks) are unintuitive (e.g., `loadmdoule` instead of `loadmodule`). Copy them directly from the tutorial, do not type manually.
17. Training and Evaluation (Network Anomaly Detection)
Key Takeaways
- Training: `RandomForestClassifier(random_state=1337)`, default parameters are sufficient.
- Evaluation Metrics: `accuracy_score`, `precision_score`, `recall_score`, `f1_score` (use `average='weighted'` for multi-class classification)
- Visualization: `confusion_matrix` + `seaborn.heatmap` to plot the confusion matrix; `classification_report` to output details for each class.
- Two-round evaluation: First, tune parameters/confirm direction on the validation set, then report performance on the test set.
- Model Saving: `joblib.dump(rf_model, 'network_anomaly_detection_model.joblib')`
Understanding and Insights
Random Forest with default parameters + no feature engineering achieved 99.5% F1, while Naive Bayes for Spam Detection only reached 93% even with careful parameter tuning. The two scores come from different tasks and are not directly comparable, but they suggest that the match between algorithm and data matters more than parameter tuning: Random Forest is naturally suited for high-dimensional tabular data, whereas text classification has more noise to contend with.
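Meaning of average='weighted': For multi-class classification, F1 can be averaged in several ways, including macro (equal weighting for each class) and weighted (weighted by sample count). The Privilege class only has dozens of samples; with macro, its low F1 drags the overall score down sharply, while weighted averaging tracks performance on the bulk of the traffic. The flip side is that a weighted score can hide weak performance on exactly those rare classes, so it is best read alongside the per-class report.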
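The gap between the two averages shows up clearly on a deliberately imbalanced toy example. The labels below are hypothetical: 90 samples of a common class that are all predicted correctly, and 10 samples of a rare class of which only 2 are caught.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is common, class 3 is rare and mostly missed
y_true = [0] * 90 + [3] * 10
y_pred = [0] * 90 + [0] * 8 + [3] * 2

macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(f"macro={macro:.3f}  weighted={weighted:.3f}")
```

The per-class F1 scores here are about 0.957 and 0.333. Macro averages them equally, while weighted lets the common class dominate.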
How to read a confusion matrix: Diagonal = number of correctly classified instances, off-diagonal = misclassified instances. If there's a number in the Access column of the Probe row, it means some Probe attacks were misclassified as Access, and targeted improvements can be made.
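Reading a specific cell can be sketched with ten hypothetical predictions using the five class codes from this project (0 = Normal, 1 = DoS, 2 = Probe, 3 = Privilege, 4 = Access). Rows are true classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

names = ["Normal", "DoS", "Probe", "Privilege", "Access"]
# Hypothetical predictions: two Probe samples are mistaken for Access
y_true = [0, 0, 1, 1, 2, 2, 2, 3, 4, 4]
y_pred = [0, 0, 1, 1, 2, 4, 4, 3, 4, 4]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4])
print(cm)

probe, access = names.index("Probe"), names.index("Access")
print("Probe predicted as Access:", cm[probe, access])
print("Correct predictions:", cm.trace())
```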
Practical Takeaways
Experienced the complete closed loop from training to evaluation to visualization, and mastered the method of using confusion matrices and classification reports to pinpoint model weaknesses—this is much more valuable than just looking at a single F1 score.
18. Model Evaluation (Network Anomaly Detection)
Exercise Solutions
Q1: What is the flag you get from submitting a good model for evaluation?
Approach:
Using the NSL-KDD dataset, traffic is divided into 5 categories (Normal/DoS/Probe/Privilege/Access), and Random Forest is employed.
Complete Training Code:
import requests, zipfile, io
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import joblib
# 1. Download Dataset
url = "https://academy.hackthebox.com/storage/modules/292/KDD_dataset.zip"
response = requests.get(url, verify=False)
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall('.')
# 2. Load Data
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
    'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations',
    'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login',
    'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack', 'level'
]
df = pd.read_csv('KDD+.txt', names=columns)
# 3. Create Multi-class Target
dos = ['apache2','back','land','neptune','mailbomb','pod','processtable','smurf','teardrop','udpstorm','worm']
probe = ['ipsweep','mscan','nmap','portsweep','saint','satan']
priv = ['buffer_overflow','loadmodule','perl','ps','rootkit','sqlattack','xterm']
access = ['ftp_write','guess_passwd','http_tunnel','imap','multihop','named','phf',
'sendmail','snmpgetattack','snmpguess','spy','warezclient','warezmaster','xclock','xsnoop']
def map_attack(a):
    # 0 = Normal, 1 = DoS, 2 = Probe, 3 = Privilege, 4 = Access
    if a in dos: return 1
    elif a in probe: return 2
    elif a in priv: return 3
    elif a in access: return 4
    else: return 0
df['attack_map'] = df['attack'].apply(map_attack)
# 4. Encode Categorical Variables + Select Numerical Features
encoded = pd.get_dummies(df[['protocol_type', 'service']])
numeric_features = [
'duration','src_bytes','dst_bytes','wrong_fragment','urgent','hot',
'num_failed_logins','num_compromised','root_shell','su_attempted',
'num_root','num_file_creations','num_shells','num_access_files',
'num_outbound_cmds','count','srv_count','serror_rate','srv_serror_rate',
'rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate',
'srv_diff_host_rate','dst_host_count','dst_host_srv_count',
'dst_host_same_srv_rate','dst_host_diff_srv_rate',
'dst_host_same_src_port_rate','dst_host_srv_diff_host_rate',
'dst_host_serror_rate','dst_host_srv_serror_rate',
'dst_host_rerror_rate','dst_host_srv_rerror_rate'
]
train_set = encoded.join(df[numeric_features])
multi_y = df['attack_map']
# 5. Data Split
train_X, test_X, train_y, test_y = train_test_split(train_set, multi_y, test_size=0.2, random_state=1337)
multi_train_X, _, multi_train_y, _ = train_test_split(train_X, train_y, test_size=0.3, random_state=1337)
# 6. Train + Save
rf_model = RandomForestClassifier(random_state=1337)
rf_model.fit(multi_train_X, multi_train_y)
joblib.dump(rf_model, 'network_anomaly_detection_model.joblib')
Upload Model:
curl -F "model=@network_anomaly_detection_model.joblib" http://<TARGET_IP>:8001/api/upload
Answer: HTB{n3tw0rk_tr4ff1c_4n0m4ly_d3t3ct0r}
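One thing worth checking in the training code above: `map_attack` returns 0 for any label it does not recognise, so a misspelled or missing attack name would silently be labeled Normal. The sketch below is one possible guard. The category sets are shortened, the `observed` list stands in for `df['attack'].unique()`, and it assumes benign traffic carries the label `normal`; all of these are illustrative assumptions.

```python
# Shortened, hypothetical category sets standing in for the full lists
dos = {"neptune", "smurf"}
probe = {"nmap", "satan"}
priv = {"rootkit", "perl"}
access = {"guess_passwd", "spy"}

def unmapped_labels(labels, normal_label="normal"):
    """Return labels that are neither normal nor in any attack category."""
    known = dos | probe | priv | access | {normal_label}
    return sorted(set(labels) - known)

# In the notebook this would be df['attack'].unique()
observed = ["normal", "neptune", "nmap", "loadmodule", "spy"]
print(unmapped_labels(observed))
```

An empty result means every label in the data is accounted for; anything else is a label that would have fallen through to class 0.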
19. Malware Classification
Key Points
- Malware Families: Categories of malware classified by behavior, propagation methods, and technical characteristics (e.g., Emotet, WannaCry). Detailed information can be found on Malpedia.
- Traditional classification methods: Static analysis (disassembly/decompilation) + Dynamic analysis (sandbox execution to observe behavior) + Reverse engineering, which are time-consuming and require specialized skills.
- Malware Image Classification: Maps each byte (0-255) of a binary file to a grayscale pixel value to generate a visual image; malware from the same family exhibits similar image textures due to shared code structures.
- Using CNNs to classify these images transforms the malware family identification problem into an image classification problem.
Understanding and Insights
"Drawing binaries as images" might seem counter-intuitive at first glance, but it becomes quite natural once you think it through—a binary file is essentially a sequence of bytes from 0-255, and mapping each byte to a grayscale pixel forms an image. The key insight is: malware from the same family, due to shared code segments, packing methods, and data structures, visually exhibits similar texture patterns in the generated images—this is the basis for CNNs to classify them.
Two Practical Advantages of Image Classification: (1) CNNs are very mature in image classification (ResNet/VGG/EfficientNet) and can be directly transferred and used; (2) Operating on images will not infect your machine, which is much safer than directly analyzing malicious binary files.
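The byte-to-pixel mapping described above can be sketched in a few lines. This is only an illustration: the `bytes_to_rows` helper, the row width of 4, and the 10-byte sample are assumptions chosen to keep the output readable, not how the malimg images were produced.

```python
def bytes_to_rows(data: bytes, width: int):
    """Arrange bytes into rows of `width` grayscale pixels, dropping a trailing partial row."""
    height = len(data) // width
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]

# Hypothetical 10-byte "binary"; a real sample would come from open(path, "rb").read()
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0xB8, 0x00])
image = bytes_to_rows(sample, width=4)

for row in image:
    print(row)  # each value is a pixel brightness between 0 (black) and 255 (white)
```

Because the width is fixed, files of different lengths yield images of different heights, which is why a resize step is needed before feeding them to a CNN.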
Practical Takeaways
Understood how to transform a difficult problem (malware classification) into a problem with existing mature solutions (image classification) through data representation transformation (binary → image). This "mapping problems to known domains" approach is very common in ML engineering.
20. The Malware Dataset
Key Points
- malimg dataset: 9339 grayscale PNG images of malware, covering 25 malware families
- Directory structure: one folder per family, folder name = family name (e.g., `Adialer.C`, `Allaple.A`, `Rbot!gen`, etc.)
- Image source: each byte value (0-255) of a PE file (Windows executable) is directly mapped to grayscale pixel brightness (0=black, 255=white)
- Image sizes are inconsistent (because different binary files have different lengths), requiring uniform Resize later.
- Download method: `wget kaggle.com/.../malimg-original -O malimg.zip && unzip malimg.zip`
Understanding and Insights
The dataset's directory organization (one folder per family) perfectly matches the expected format of PyTorch ImageFolder — no need to manually write annotation files or CSV mapping tables, ImageFolder automatically uses the folder name as the label. This is the most common data organization method in PyTorch image classification projects.
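To make the folder-name-as-label idea concrete, here is a small stand-in written with the standard library only. It reflects my understanding that ImageFolder sorts the subfolder names and uses their positions as integer labels; `find_classes` below is a hypothetical re-implementation for illustration, not torchvision's code.

```python
import os
import tempfile

def find_classes(root):
    """Sorted subfolder names become class names; their positions become integer labels."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return classes, {name: idx for idx, name in enumerate(classes)}

# Build a throwaway directory tree with three family folders
with tempfile.TemporaryDirectory() as root:
    for family in ["Rbot!gen", "Adialer.C", "Allaple.A"]:
        os.makedirs(os.path.join(root, family))
    classes, class_to_idx = find_classes(root)

print(class_to_idx)
```

The practical consequence is that label indices depend on the set of folder names, so the training and test directories should contain the same families.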
Imbalanced data distribution: an average of 374 images per class, but some classes have fewer than 100 images, while others have over 1000. This imbalance may lead to poor recognition of minority classes by the model, which can be mitigated in practice using oversampling / class weights.
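One common way to derive class weights is inverse frequency, sketched below. The per-family counts are invented for the example and are not the real malimg numbers, and the formula `N / (K * n_c)` is one convention among several.

```python
def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c): rarer classes get proportionally larger weights."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}

# Hypothetical per-family image counts
counts = {"FamilyA": 1000, "FamilyB": 400, "FamilyC": 100}
weights = inverse_frequency_weights(counts)

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

In PyTorch, weights like these could be passed in class-index order as the `weight` tensor of `nn.CrossEntropyLoss`, so that mistakes on rare families cost more during training.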
Practical Takeaways
Mastered the concept of malware byteplot — where each pixel is the value of a byte in the binary file. Malware from different families, due to varying code structures and packing methods, exhibits distinct byteplot textures, which provides a visual basis for CNN classification.
21. Preprocessing the Malware Dataset
Key Points
- Data Splitting: using the `split-folders` library to split the training and test sets in an 80/20 ratio (`splitfolders.ratio(ratio=(0.8, 0, 0.2))`)
- Image Preprocessing (`transforms.Compose`):
  - `Resize((75, 75))`: Unify all image sizes (original sizes vary due to different binary lengths)
  - `ToTensor()`: Convert PIL Image to PyTorch Tensor, pixel values scaled from [0,255] to [0,1]
  - `Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])`: Normalization parameters required by ImageNet pre-trained models
- ImageFolder: Automatically load dataset from directory structure (folder name = class name), automatically generate labels
- DataLoader: Encapsulates batch iteration; key parameters are `batch_size` (number of samples per batch), `shuffle` (whether to shuffle), and `num_workers` (number of parallel loading processes)
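The ToTensor and Normalize steps listed above amount to simple per-pixel arithmetic, worked through here in plain Python. The `normalize_pixel` helper is a hypothetical name for illustration. As far as I know the default ImageFolder loader converts grayscale files to RGB, so all three channels start from the same value and differ only after normalization.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(value, channel):
    """Apply the two steps in order: scale to [0, 1], then standardize per channel."""
    scaled = value / 255.0                         # what ToTensor does to each pixel
    return (scaled - MEAN[channel]) / STD[channel]  # what Normalize does per channel

for v in (0, 128, 255):
    print(v, [round(normalize_pixel(v, c), 3) for c in range(3)])
```

The result is no longer in [0, 1]: black pixels map to roughly -2 and white pixels to a little above 2, which is the input range the pre-trained weights were trained on.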
Understanding and Insights
Why are the mean/std values for Normalize not the statistics of our own dataset? Because the ResNet50 we use is pre-trained on ImageNet, and its convolutional kernel weights are learned based on ImageNet's distribution. Input data must use the same normalization parameters, otherwise the response of the pre-trained weights will shift, and the feature extraction effect will be greatly reduced.
Trade-offs of Resizing to 75×75: Larger sizes (e.g., 224×224, ResNet's original input size) preserve more texture details, but training is slower and memory consumption is higher; 75×75 is a compromise made by the tutorial for training speed, sacrificing some accuracy.
Logic for choosing batch_size: Too small (32): high gradient noise per batch, slow training; Too large (2048): may exceed GPU memory. 512 is a good choice that runs efficiently on most GPUs.
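The effect of batch size on the number of parameter updates per epoch is easy to work out. The training-set size below is an approximation (80% of 9339); the exact figure depends on how the split rounds within each class.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of optimizer steps in one pass over the data (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

n_train = int(9339 * 0.8)  # approximate training-set size
for bs in (32, 512, 2048):
    print(f"batch_size={bs}: {batches_per_epoch(n_train, bs)} steps per epoch")
```

With only a handful of steps per epoch at very large batch sizes, ten epochs give the optimizer few chances to update the weights, which is another reason to stay in the middle of the range.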
Practical Takeaways
Mastered the standard trio for PyTorch image data pipelines: transforms (preprocessing) → ImageFolder (dataset) → DataLoader (iterator). This pattern applies to all PyTorch image classification projects.
22. The Model
Key Knowledge Points
- Transfer learning based on ResNet50: Load ImageNet pre-trained weights (`weights='DEFAULT'`), 50 layers deep, approximately 25.6 million parameters in total (about 23.5 million once the original 1000-class `fc` layer is excluded)
- Freezing Strategy: `requires_grad = False` freezes all pre-trained layers, so only the replaced last layer is trained
- Custom fully connected layer: `Linear(2048, 1000) → ReLU → Linear(1000, n_classes)`
  - 2048 = output dimension of ResNet50's second to last layer
  - 1000 = adjustable hidden layer size
  - n_classes = 25 (dynamically obtained from `len(train_dataset.classes)`)
- Model definition inherits `nn.Module`, requiring implementation of `__init__()` and `forward()` methods
Understanding and Insights
Why transfer learning works: The low-level features (edges, textures, shapes) learned by ResNet50 on ImageNet are general and equally applicable to malware byte maps. We only need to replace the last layer to adapt to 25 classes, without training 23 million parameters from scratch.
Trade-off: Frozen vs. Unfrozen: After freezing, only the new fully connected head is trained (about 2.07 million parameters, versus roughly 23.5 million frozen in the backbone). Each training step is considerably cheaper because no gradients are computed or stored for the backbone, although the forward pass still runs through all 50 layers. The trade-off is that the low-level feature extractor cannot be fine-tuned for the specific textures of malware images. In practice, freezing still achieves ~89% test accuracy, which is sufficient for a PoC; if higher accuracy is desired, more layers can be gradually unfrozen.
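The roughly 2 million trainable parameters in the replaced head can be checked by hand from the layer sizes. The `linear_params` helper is a hypothetical name used only for this calculation.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

n_classes = 25
first = linear_params(2048, 1000)        # Linear(2048, 1000)
second = linear_params(1000, n_classes)  # Linear(1000, n_classes); ReLU has no parameters
head = first + second

print(first, second, head)
```

Almost all of the head's parameters sit in the first layer, so shrinking the hidden size from 1000 is the quickest way to make the head smaller.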
Benefits of dynamically obtaining n_classes: Using len(train_dataset.classes) instead of hardcoding 25 means the code does not need to be modified after adding or removing malware families.
Practical Takeaways
Mastered the standard pattern for transfer learning in PyTorch: Load pre-trained model → Freeze layers → Replace last layer → Train. This pattern applies to most image classification tasks, only requiring modification of the last layer's output dimension.
23. Training and Evaluation (Malware Image Classification)
Key Knowledge Points
- Five-step training loop template (repeated for each batch):
  1. `optimizer.zero_grad()`: Clear previous gradients (PyTorch accumulates gradients by default)
  2. `outputs = model(inputs)`: Forward pass
  3. `loss = criterion(outputs, labels)`: Calculate CrossEntropyLoss
  4. `loss.backward()`: Backward pass, calculate gradients for each parameter
  5. `optimizer.step()`: Adam optimizer updates parameters using gradients
- Evaluation Mode: `model.eval()` + `torch.no_grad()` disables gradient computation and training behaviors of BatchNorm/Dropout
- Model Saving: `torch.jit.script(model)` serializes to TorchScript (.pth), including model structure + parameters, which can be loaded independently
- Training Parameters: 10 epochs, batch_size=512, Adam optimizer (default learning rate), CrossEntropyLoss
- Actual Performance: Training accuracy ~96%, Test accuracy ~89%
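The five-step loop can be tried in isolation on a toy problem before running it on the real images. The sketch below uses random synthetic data and a single linear layer, both assumptions made purely to keep it fast; it says nothing about malimg accuracy.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data: 64 samples, 10 features, 3 classes
inputs = torch.randn(64, 10)
labels = torch.randint(0, 3, (64,))

model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

losses = []
model.train()
for step in range(50):
    optimizer.zero_grad()              # 1. clear accumulated gradients
    outputs = model(inputs)            # 2. forward pass
    loss = criterion(outputs, labels)  # 3. compute the loss
    loss.backward()                    # 4. backward pass
    optimizer.step()                   # 5. update parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

A falling loss on data this small only confirms that the loop is wired correctly, which is a useful check before committing to a long training run.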
Understanding and Insights
model.eval() vs model.train() is not just a semantic tag: in eval mode, BatchNorm uses its running (global) statistics instead of batch statistics, and Dropout stops random zeroing. Forgetting to switch to eval mode leads to inconsistent inference results.
Why use jit.script instead of state_dict: torch.save(model.state_dict()) only saves parameter weights, so loading requires instantiating a model class with the same structure. jit.script serializes structure + parameters together, allowing the evaluation endpoint to load the model without your MalwareClassifier class definition, which is crucial for model delivery.
Move the model with .to("cpu") before saving: a model saved from a GPU keeps references to that device and can fail to load in a CPU-only environment unless the loader remaps it (for example with map_location).
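The save-and-load round trip can be rehearsed on a tiny model. In this sketch `TinyNet` is a hypothetical stand-in for MalwareClassifier, and an in-memory buffer replaces the `.pth` file so nothing is written to disk.

```python
import io
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical stand-in for MalwareClassifier
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.Dropout(0.5), nn.Linear(8, 2))

    def forward(self, x):
        return self.net(x)

model = TinyNet().to("cpu")
model.eval()  # Dropout becomes a no-op, so outputs are deterministic

buffer = io.BytesIO()  # in-memory file instead of a .pth on disk
torch.jit.save(torch.jit.script(model), buffer)
buffer.seek(0)

# Loading needs only torch, not the TinyNet class definition
loaded = torch.jit.load(buffer, map_location="cpu")
loaded.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same = torch.allclose(model(x), loaded(x))
print(same)
```

Comparing the outputs of the original and the reloaded model on the same input is a cheap way to confirm the exported file behaves as expected before uploading it.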
Practical Takeaways
- Mastered the complete PyTorch training → evaluation → saving closed loop
- Experienced the actual effect of GPU acceleration: CPU ~210 seconds per epoch vs MPS ~19 seconds (11x acceleration), CUDA GPU might be even faster
- Understood the differences between scikit-learn and PyTorch paradigms and their respective applicable scenarios
24. Model Evaluation (Malware Image Classification)
Exercise Solutions
Q1: What is the flag you get from submitting a good model for evaluation?
Approach:
Using the malimg dataset (byte images of 25 malware families), performing transfer learning based on a pre-trained ResNet50.
Prerequisites:
pip3 install torch torchvision split-folders
wget https://www.kaggle.com/api/v1/datasets/download/ikrambenabd/malimg-original -O malimg.zip
unzip malimg.zip
Complete Training Code:
import os, time
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
import splitfolders
# Automatically detect GPU: CUDA > MPS (Apple Silicon) > CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
# 1. Split dataset (80% train / 20% test)
splitfolders.ratio(
    input="./malimg_paper_dataset_imgs/",
    output="./newdata/",
    ratio=(0.8, 0, 0.2)
)
# 2. Data loading and preprocessing
transform = transforms.Compose([
    transforms.Resize((75, 75)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
train_dataset = ImageFolder(root="./newdata/train", transform=transform)
test_dataset = ImageFolder(root="./newdata/test", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False, num_workers=0)
n_classes = len(train_dataset.classes)
# 3. Model definition (ResNet50 transfer learning, freeze all weights except the last layer)
class MalwareClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.resnet = models.resnet50(weights='DEFAULT')
        # Freeze every pre-trained parameter
        for param in self.resnet.parameters():
            param.requires_grad = False
        # Replace the final layer; the new layers are trainable by default
        num_features = self.resnet.fc.in_features
        self.resnet.fc = nn.Sequential(
            nn.Linear(num_features, 1000),
            nn.ReLU(),
            nn.Linear(1000, n_classes)
        )

    def forward(self, x):
        return self.resnet(x)
model = MalwareClassifier(n_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
# 4. Training
for epoch in range(10):
    model.train()
    running_loss, n_total, n_correct = 0, 0, 0
    t0 = time.time()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, predicted = outputs.max(1)
        n_total += labels.size(0)
        n_correct += predicted.eq(labels).sum().item()
        running_loss += loss.item()
    acc = 100 * n_correct / n_total
    print(f"Epoch {epoch+1}/10: Acc={acc:.2f}% Loss={running_loss/len(train_loader):.4f} ({time.time()-t0:.1f}s)")
# 5. Evaluation
model.eval()
n_correct, n_total = 0, 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        _, predicted = torch.max(output, 1)
        n_total += target.size(0)
        n_correct += (predicted == target).sum().item()
print(f"Test accuracy: {100*n_correct/n_total:.2f}%")
# 6. Save (requires moving back to CPU then jit.script)
model_cpu = model.to("cpu")
model_scripted = torch.jit.script(model_cpu)
model_scripted.save("malware_classifier.pth")
Upload Model:
curl -F "model=@malware_classifier.pth" http://<TARGET_IP>:8002/api/upload
Answer: HTB{9569648083a8106ba057bbbe2d00d8ec}
25. Skills Assessment
Exercise Solutions
Q1: What is the flag you get from submitting a good model for evaluation?
Problem-solving Approach:
IMDB Movie Review Sentiment Analysis: Determine if a movie review is positive (1) or negative (0). The dataset is in JSON format, with 25,000 movie reviews. Using TF-IDF + LinearSVC performs better than Naive Bayes.
Complete Training Code:
import os, re, json
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import joblib
import requests, zipfile, io
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
# 1. Download dataset
url = "https://academy.hackthebox.com/storage/modules/292/skills_assessment_data.zip"
response = requests.get(url, verify=False)
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall(".")
# 2. Load Data
with open("train.json") as f:
    train_data = json.load(f)
df = pd.DataFrame(train_data)
df = df.drop_duplicates(subset=['text'])
# 3. Preprocessing
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
def preprocess(text):
    text = str(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)  # Remove HTML Tags
    text = re.sub(r"[^a-z\s$!]", "", text)  # Keep letters, spaces, $ and !
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [stemmer.stem(w) for w in tokens]
    return " ".join(tokens)
df['processed'] = df['text'].apply(preprocess)
y = df['label'].astype(int)
# 4. Train TF-IDF + LinearSVC
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(
        min_df=2, max_df=0.9,
        ngram_range=(1, 2),
        max_features=80000,
        sublinear_tf=True
    )),
    ("classifier", LinearSVC(C=1.0, max_iter=10000))
])
pipeline.fit(df['processed'], y)
# 5. Save
joblib.dump(pipeline, "skills_assessment.joblib")
Upload Model:
curl -F "model=@skills_assessment.joblib" http://<TARGET_IP>:5000/api/upload
Answer: HTB{s3nt1m3nt_4n4lys1s_d4t4}
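To see what the two regular expressions in `preprocess` above actually do to a review, the sketch below runs only the lowercase and regex steps on a made-up sentence. The `clean` helper and the sample text are illustrative assumptions; tokenizing, stop-word removal, and stemming are left out so the example needs no NLTK downloads.

```python
import re

def clean(text):
    """The lowercase and regex steps of preprocess, without NLTK tokenizing or stemming."""
    text = str(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)    # HTML tags become a space
    text = re.sub(r"[^a-z\s$!]", "", text)  # drop digits and other punctuation
    return text

# Hypothetical review text, not taken from the dataset
sample = "Great movie!<br />Loved it, 10/10 $$$"
tokens = clean(sample).split()
print(tokens)
```

Replacing tags with a space rather than an empty string matters here: without it, the words on either side of `<br />` would be glued into a single token.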
Answer Quick Check
| Chapter | Question Number | Answer |
|---|---|---|
| 2 - Environment Setup | Q1 | DONE |
| 14 - Model Evaluation (Spam Detection) | Q1 | HTB{sp4m_cla55if13r_3v4lu4t0r} |
| 18 - Model Evaluation (Network Anomaly Detection) | Q1 | HTB{n3tw0rk_tr4ff1c_4n0m4ly_d3t3ct0r} |
| 24 - Model Evaluation (Malware Image Classification) | Q1 | HTB{9569648083a8106ba057bbbe2d00d8ec} |
| 25 - Skills Assessment | Q1 | HTB{s3nt1m3nt_4n4lys1s_d4t4} |