1. Linear Algebra

🔢 Vectors & Matrices
Dot Product
$$\mathbf{a}\cdot\mathbf{b}=\sum_{i=1}^n a_i b_i$$
Matrix Multiplication
$$(AB)_{ij}=\sum_{k=1}^n A_{ik}B_{kj}$$
Transpose
$$(AB)^T=B^T A^T$$
Identity Matrix
$$AI = IA = A$$
Inverse
$$AA^{-1}=A^{-1}A=I$$
📏 Norms
L1 (Manhattan): $$\|x\|_1=\sum_i |x_i|$$
L2 (Euclidean): $$\|x\|_2=\sqrt{\sum_i x_i^2}$$
Frobenius: $$\|A\|_F=\sqrt{\sum_i\sum_j a_{ij}^2}$$
🎯 Eigenvalues & Eigenvectors
Eigen equation: \(Av=\lambda v\)
Characteristic: \(\det(A-\lambda I)=0\)
Trace: \(\mathrm{tr}(A)=\sum_i \lambda_i = \sum_i a_{ii}\)
Determinant: \(\det(A)=\prod_i \lambda_i\)
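As a quick sanity check of the identities above, here is a small NumPy sketch (illustrative only; the random matrices and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
a, b = rng.standard_normal(3), rng.standard_normal(3)

# Dot product and the transpose identity (AB)^T = B^T A^T
assert np.isclose(a @ b, np.sum(a * b))
assert np.allclose((A @ B).T, B.T @ A.T)

# Eigenvalues: trace = sum of eigenvalues, determinant = product of eigenvalues
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.trace(A), eigvals.sum())
assert np.isclose(np.linalg.det(A), eigvals.prod())
```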
2. Calculus

📐 Basic Derivatives
Power Rule: $$\frac{d}{dx}(x^n) = nx^{n-1}$$
Chain Rule: $$\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)$$
Product Rule: $$\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$$
Quotient Rule: $$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}$$
∇ Partial Derivatives
Gradient: $$\nabla f=\left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right)$$
Hessian: $$H_{ij}=\frac{\partial^2 f}{\partial x_i\partial x_j}$$
Jacobian: $$J_{ij}=\frac{\partial f_i}{\partial x_j}$$
🧮 Activation Derivatives
Sigmoid: $$\sigma'(x)=\sigma(x)(1-\sigma(x))$$
Tanh: $$\tanh'(x)=1-\tanh^2(x)$$
ReLU: $$\text{ReLU}'(x)=\begin{cases}1&x>0\\0&\text{otherwise}\end{cases}$$
Exponential: $$\frac{d}{dx}(e^x) = e^x$$
Logarithm: $$\frac{d}{dx}(\ln x) = \frac{1}{x}$$
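A minimal NumPy sketch that verifies these derivatives against central finite differences (the test points and step size are arbitrary choices, not part of the original sheet):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
h = 1e-6  # step for the central finite-difference check

# Analytic derivatives from the table above
d_sigmoid = sigmoid(x) * (1 - sigmoid(x))
d_tanh = 1 - np.tanh(x) ** 2
d_relu = (x > 0).astype(float)

# Numerical checks for the smooth functions
assert np.allclose(d_sigmoid, (sigmoid(x + h) - sigmoid(x - h)) / (2 * h), atol=1e-5)
assert np.allclose(d_tanh, (np.tanh(x + h) - np.tanh(x - h)) / (2 * h), atol=1e-5)

# ReLU is checked away from x = 0, where it is not differentiable
nz = x != 0
num_relu = (np.maximum(x[nz] + h, 0) - np.maximum(x[nz] - h, 0)) / (2 * h)
assert np.allclose(d_relu[nz], num_relu, atol=1e-5)
```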
3. Probability

🎲 Basic Probability Rules
Probability
$$P(A)=\frac{\text{favorable outcomes}}{\text{total outcomes}}$$
Complement Rule
$$P(A^c)=1-P(A)$$
Addition Rule
$$P(A\cup B)=P(A)+P(B)-P(A\cap B)$$
Multiplication Rule
$$P(A\cap B)=P(A|B)P(B)=P(B|A)P(A)$$
Conditional Probability
$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$
🔮 Bayes' Theorem
Bayes' Rule
$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$
Extended Form
$$P(A|B)=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A^c)P(A^c)}$$
💡 Key Application: Fundamental in ML for classification, spam detection, and probabilistic reasoning
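A tiny worked example of Bayes' rule in Python for a hypothetical spam filter (all probabilities below are made-up illustrative numbers):

```python
# Hypothetical numbers: P(spam) = 0.2, P("free" | spam) = 0.6, P("free" | not spam) = 0.05
p_spam = 0.2
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Extended form of Bayes' rule: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.75
```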
📊 Expected Value & Variance
Expected Value: $$E[X]=\sum x_i P(x_i)\text{ or }\int x f(x)dx$$
Variance: $$\text{Var}(X)=E[(X-\mu)^2]=E[X^2]-(E[X])^2$$
Standard Deviation: $$\sigma = \sqrt{\text{Var}(X)}$$
Covariance: $$\text{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]$$
Correlation: $$\rho(X,Y)=\frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
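A short NumPy sketch of these definitions, using an arbitrary discrete distribution and a small synthetic sample (the values and seed are illustrative):

```python
import numpy as np

# Discrete random variable: values and probabilities (illustrative numbers)
x = np.array([1.0, 2.0, 3.0])
p = np.array([0.2, 0.5, 0.3])

mean = np.sum(x * p)                                 # E[X]
var = np.sum((x - mean) ** 2 * p)                    # Var(X) = E[(X - mu)^2]
assert np.isclose(var, np.sum(x**2 * p) - mean**2)   # equals E[X^2] - (E[X])^2

# Covariance and correlation estimated from samples
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = 0.5 * a + rng.standard_normal(1000)
cov = np.mean((a - a.mean()) * (b - b.mean()))
rho = cov / (a.std() * b.std())
assert np.isclose(rho, np.corrcoef(a, b)[0, 1])
```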
4. Statistics

📈 Descriptive Statistics
Mean
$$\mu=\frac{1}{n}\sum_{i=1}^n x_i$$
Sample Variance
$$s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2$$
Population Variance
$$\sigma^2=\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$$
Standard Error
$$SE = \frac{\sigma}{\sqrt{n}}$$
🎯 Hypothesis Testing
Z-score
$$z=\frac{x-\mu}{\sigma}$$
T-statistic
$$t=\frac{\bar{x}-\mu}{s/\sqrt{n}}$$
Confidence Interval
$$CI=\bar{x}\pm z_{\alpha/2}\cdot SE$$
Key Terms:
Type I Error (α): Rejecting a true null hypothesis
Type II Error (β): Failing to reject a false null hypothesis
Power: 1 - β
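A minimal sketch of a z-score and a 95% confidence interval in NumPy, assuming an approximately normal sample; the sample parameters and the observation at 12.0 are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative sample; 1.96 is the z-value for a 95% confidence interval
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))     # standard error
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
z = (12.0 - mean) / sample.std(ddof=1)             # z-score of an observation at 12.0
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f}), z(12.0) = {z:.2f}")
```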
5. Linear & Logistic Regression

📉 Linear Regression
Simple Model
$$y=\beta_0+\beta_1 x+\varepsilon$$
Matrix Form
$$\mathbf{y}=\mathbf{X}\beta+\varepsilon$$
Normal Equation
$$\beta=(X^TX)^{-1}X^T y$$
Predicted Values
$$\hat{y} = X\beta$$
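A small NumPy sketch that fits a simple linear model via the normal equation on synthetic data (the true coefficients 2 and 3 and the noise level are illustrative; `np.linalg.solve` is used instead of forming the inverse explicitly):

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)     # normal equation: (X^T X) beta = X^T y
y_hat = X @ beta                             # predicted values
print(beta)  # approximately [2, 3]
```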
📊 Loss Functions
Mean Squared Error: $$MSE=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2$$
Root MSE: $$RMSE=\sqrt{MSE}$$
Mean Absolute Error: $$MAE=\frac{1}{n}\sum_{i=1}^n |y_i-\hat{y}_i|$$
R² Score: $$R^2=1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$
Adjusted R²: $$R^2_{adj}=1-\frac{(1-R^2)(n-1)}{n-p-1}$$
🔄 Logistic Regression
Sigmoid Function
$$\sigma(z)=\frac{1}{1+e^{-z}}$$
Logit
$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
Probability
$$P(y=1|x) = \sigma(w^T x + b)$$
Odds & Log-Odds
$$\text{Odds} = \frac{P(y=1)}{P(y=0)} = e^z \quad ; \quad \log(\text{Odds}) = z$$
Log Loss (Binary Cross-Entropy)
$$L=-\frac{1}{n}\sum_{i=1}^n[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)]$$
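A compact sketch of a logistic-regression prediction and its binary cross-entropy in NumPy; the weights, bias, and the two samples are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and a tiny batch of two samples
w = np.array([0.8, -0.4])
b = 0.1
X = np.array([[1.0, 2.0], [3.0, -1.0]])
y = np.array([0.0, 1.0])

z = X @ w + b                      # logit
y_hat = sigmoid(z)                 # P(y = 1 | x)
eps = 1e-12                        # avoid log(0)
bce = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
print(y_hat, bce)
```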
🛡️ Regularization
Ridge (L2): $$J(\beta)=\sum(y_i-\hat{y}_i)^2+\lambda\sum\beta_j^2$$
Lasso (L1): $$J(\beta)=\sum(y_i-\hat{y}_i)^2+\lambda\sum|\beta_j|$$
Elastic Net: $$J(\beta)=\sum(y_i-\hat{y}_i)^2+\lambda_1\sum|\beta_j|+\lambda_2\sum\beta_j^2$$
6. Neural Networks

⚡ Activation Functions
Sigmoid: $$\sigma(x)=\frac{1}{1+e^{-x}}$$
Tanh: $$\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$
ReLU: $$\text{ReLU}(x)=\max(0,x)$$
Leaky ReLU: $$f(x)=\begin{cases}x&x>0\\\alpha x&\text{otherwise}\end{cases}$$
Softmax: $$\text{softmax}(x_i)=\frac{e^{x_i}}{\sum_j e^{x_j}}$$
🔄 Forward Propagation
Linear Combination
$$z^{(l)}=W^{(l)}a^{(l-1)}+b^{(l)}$$
Activation
$$a^{(l)}=g(z^{(l)})$$
⬅️ Backpropagation
Output Layer Error
$$\delta^{(L)}=(a^{(L)}-y)\odot g'(z^{(L)})$$
Hidden Layer Error
$$\delta^{(l)}=[(W^{(l+1)})^T\delta^{(l+1)}]\odot g'(z^{(l)})$$
Weight Gradient
$$\frac{\partial L}{\partial W^{(l)}}=\delta^{(l)}(a^{(l-1)})^T$$
Bias Gradient
$$\frac{\partial L}{\partial b^{(l)}}=\delta^{(l)}$$
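A minimal forward and backward pass for one hidden layer, following the formulas above with sigmoid activations and a squared-error loss (layer sizes, seed, and learning rate are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))          # single input with 3 features (column vector)
y = np.array([[1.0]])                    # target

W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))   # 3 -> 4 hidden units
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))   # 4 -> 1 output

# Forward pass: z = W a + b, a = g(z)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass, following the deltas above
d2 = (a2 - y) * a2 * (1 - a2)            # output error: (a - y) ⊙ g'(z)
d1 = (W2.T @ d2) * a1 * (1 - a1)         # hidden error: (W^T delta) ⊙ g'(z)
dW2, db2 = d2 @ a1.T, d2                 # weight and bias gradients
dW1, db1 = d1 @ x.T, d1

# One gradient-descent step
lr = 0.1
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```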
7. Optimization Algorithms

⬇️ Gradient Descent Variants
Batch Gradient Descent
$$\theta:=\theta-\alpha\nabla J(\theta)$$
Stochastic Gradient Descent
$$\theta:=\theta-\alpha\nabla J(\theta;x^{(i)},y^{(i)})$$
Mini-batch Gradient Descent
$$\theta:=\theta-\alpha\frac{1}{m}\sum_{i=1}^m\nabla J(\theta;x^{(i)},y^{(i)})$$
🚀 Advanced Optimizers
Momentum
$$v:=\beta v+(1-\beta)\nabla J(\theta)\\ \theta:=\theta-\alpha v$$
RMSprop
$$s:=\beta s+(1-\beta)(\nabla J)^2\\ \theta:=\theta-\alpha\frac{\nabla J}{\sqrt{s}+\epsilon}$$
Adam (Adaptive Moment Estimation)
$$m:=\beta_1 m+(1-\beta_1)\nabla J\\ v:=\beta_2 v+(1-\beta_2)(\nabla J)^2\\ \hat m=\frac{m}{1-\beta_1^t},\quad \hat v=\frac{v}{1-\beta_2^t}\\ \theta:=\theta-\alpha\frac{\hat m}{\sqrt{\hat v}+\epsilon}$$
Pro tip: Adam is widely used in modern deep learning.
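A small sketch of the Adam update in NumPy, applied to the toy objective f(θ) = θ² (gradient 2θ); the hyperparameters are the commonly cited defaults and the loop length is arbitrary:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 as a toy example
theta = np.array([5.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to 0
```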
8. Evaluation Metrics

✅ Classification Metrics
Accuracy: $$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
Precision: $$\text{Precision}=\frac{TP}{TP+FP}$$
Recall (Sensitivity): $$\text{Recall}=\frac{TP}{TP+FN}$$
Specificity: $$\text{Specificity}=\frac{TN}{TN+FP}$$
F1-Score: $$F_1=\frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$$
F-beta Score: $$F_\beta=\frac{(1+\beta^2)\cdot \text{Precision}\cdot \text{Recall}}{\beta^2\cdot\text{Precision}+\text{Recall}}$$
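A short NumPy sketch computing these metrics from raw predictions (the two label vectors are made-up examples):

```python
import numpy as np

# Illustrative predictions vs. ground truth
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.75, 0.75, 0.75, 0.75
```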
📉 Confusion Matrix
Legend:
TP = True Positive
TN = True Negative
FP = False Positive (Type I Error)
FN = False Negative (Type II Error)
📈 ROC & AUC
True Positive Rate (TPR)
$$TPR=\frac{TP}{TP+FN}$$
False Positive Rate (FPR)
$$FPR=\frac{FP}{FP+TN}$$
AUC (Area Under Curve)
$$AUC = \int_0^1 TPR(FPR^{-1}(x))dx$$
Note: ROC curves plot TPR vs FPR. AUC measures the area under the ROC curve.
9. Clustering

🎯 K-Means Algorithm
Objective Function (minimized over cluster assignments and centroids)
$$J=\sum_{j=1}^k\sum_{x\in C_j}\|x-\mu_j\|^2$$
Centroid Update
$$\mu_j=\frac{1}{|C_j|}\sum_{x\in C_j}x$$
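A minimal K-means sketch in NumPy that alternates the assignment step and the centroid update above; the initialization, cluster count, and synthetic blobs are illustrative, and empty-cluster handling is omitted for brevity:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign points to the nearest centroid, then recompute centroids (no empty-cluster handling)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                       # assignment step
        centroids = np.array([X[labels == j].mean(axis=0)   # update: mu_j = mean of cluster j
                              for j in range(k)])
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)  # one centroid near (0, 0) and one near (5, 5)
```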
📏 Distance Metrics
Euclidean: $$d(x,y)=\sqrt{\sum_i (x_i-y_i)^2}$$
Manhattan: $$d(x,y)=\sum_i|x_i-y_i|$$
Cosine Similarity: $$\cos(\theta)=\frac{x\cdot y}{\|x\|\|y\|}$$
📊 Silhouette Score
Silhouette Coefficient
$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$$
Where: \(a(i)\) = mean distance from point \(i\) to the other points in its own cluster, \(b(i)\) = mean distance from point \(i\) to the points in the nearest other cluster
10. Deep Learning

🔄 Batch Normalization
Normalize
$$\hat x=\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}$$
Scale & Shift
$$y=\gamma\hat x+\beta$$
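A sketch of the training-time batch-norm forward pass in NumPy (running statistics for inference and the backward pass are omitted; the batch size and feature count are arbitrary):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))    # batch of 32, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))     # ~0 mean, ~1 std per feature
```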
🧠 Convolutional Neural Networks (CNN)
Convolution Operation
$$(f * g)(t) = \sum_x f(x)g(t-x)$$
Output Size
$$O=\left\lfloor\frac{W-K+2P}{S}\right\rfloor+1$$
Parameters: W = input size, K = kernel size, P = padding, S = stride
🔁 Recurrent Neural Networks (RNN)
Hidden State
$$h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h)$$
Output
$$y_t = W_{hy}h_t + b_y$$
🧬 Long Short-Term Memory (LSTM)
Forget Gate
$$f_t=\sigma(W_f[h_{t-1},x_t]+b_f)$$
Input Gate
$$i_t=\sigma(W_i[h_{t-1},x_t]+b_i)$$
Output Gate
$$o_t=\sigma(W_o[h_{t-1},x_t]+b_o)$$
Candidate Cell State
$$\tilde{C}_t=\tanh(W_C[h_{t-1},x_t]+b_C)$$
Cell State
$$C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t$$
Hidden State
$$h_t=o_t\odot\tanh(C_t)$$
💧 Dropout
Training (inverted dropout): output = mask ⊙ activation / (1 - p), where p is the dropout probability
Testing: use all neurons (no dropout, no rescaling)
11. Dimensionality Reduction

📊 Principal Component Analysis (PCA)
Covariance Matrix
$$\Sigma=\frac{1}{n}X^T X \quad \text{(with } X \text{ mean-centered)}$$
Principal Components
$$\text{Eigenvectors of } \Sigma$$
Explained Variance Ratio
$$\text{EVR}=\frac{\lambda_i}{\sum_j \lambda_j}$$
Projection
$$Z=XW \quad \text{(where W = eigenvectors)}$$
🔢 Singular Value Decomposition (SVD)
Decomposition
$$X = U\Sigma V^T$$
Reduced Form
$$X \approx U_k\Sigma_k V_k^T$$
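A compact PCA-via-SVD sketch in NumPy; the synthetic data with one dominant direction is illustrative, and the 1/n covariance convention above is used for the explained variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.array([[3.0, 0, 0], [0, 1.0, 0], [0, 0, 0.1]])

Xc = X - X.mean(axis=0)                  # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S**2 / len(Xc)      # eigenvalues of the covariance matrix (1/n convention)
evr = explained_variance / explained_variance.sum()
Z = Xc @ Vt.T[:, :2]                     # projection onto the top-2 principal components
print(evr.round(3))                      # most variance in the first component
```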
12. Information Theory

📡 Entropy and Information
Entropy: $$H(X)=-\sum P(x_i)\log_2 P(x_i)$$
Cross-Entropy: $$H(p,q)=-\sum p(x)\log q(x)$$
KL Divergence: $$D_{KL}(P\|Q)=\sum P(x)\log\frac{P(x)}{Q(x)}$$
Mutual Information: $$I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$$
Conditional Entropy: $$H(Y|X)=-\sum_x\sum_y P(x,y)\log P(y|x)$$
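A small NumPy sketch of these measures for two illustrative discrete distributions, including the identity H(P, Q) = H(P) + D_KL(P||Q):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])          # illustrative distributions
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log2(p))                 # H(P), in bits
cross_entropy = -np.sum(p * np.log2(q))           # H(P, Q)
kl = np.sum(p * np.log2(p / q))                   # D_KL(P || Q)
assert np.isclose(cross_entropy, entropy + kl)    # H(P,Q) = H(P) + D_KL(P||Q)
print(entropy, cross_entropy, kl)
```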
13. Support Vector Machines

🎯 Linear SVM
Decision Function
$$f(x)=w^Tx+b$$
Margin
$$\text{margin}=\frac{2}{\|w\|}$$
Optimization
$$\min \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w^Tx_i+b)\geq 1$$
🛡️ Soft Margin SVM
Objective
$$\min \frac{1}{2}\|w\|^2 + C\sum_i\xi_i$$
Constraint
$$y_i(w^Tx_i+b)\geq 1-\xi_i, \quad \xi_i\geq 0$$
🔧 Kernel Functions
Linear: $$K(x,x')=x^Tx'$$
Polynomial: $$K(x,x')=(x^Tx'+c)^d$$
RBF (Gaussian): $$K(x,x')=\exp(-\gamma\|x-x'\|^2)$$
Sigmoid: $$K(x,x')=\tanh(\alpha x^Tx'+c)$$
14. Decision Trees & Ensembles

🌳 Impurity Measures
Gini Impurity: $$\text{Gini}=1-\sum_i p_i^2$$
Entropy: $$H=-\sum_i p_i\log_2(p_i)$$
Classification Error: $$E=1-\max(p_i)$$
Information Gain
$$IG(D,A)=H(D)-\sum_v\frac{|D_v|}{|D|}H(D_v)$$
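A short sketch computing Gini impurity, entropy, and information gain for a hypothetical split of a binary-labeled node (the label arrays are made-up):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p**2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Illustrative split of a parent node into two children
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

info_gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                            - (len(right) / len(parent)) * entropy(right)
print(gini(parent), entropy(parent), round(info_gain, 3))  # 0.5, 1.0, 0.189
```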
🌲 Ensemble Methods
Bagging (Random Forest)
$$\hat{y}=\frac{1}{B}\sum_{b=1}^B f_b(x)$$
AdaBoost - Sample Weight (weights are renormalized to sum to 1 after each update)
$$w_i^{(t+1)}=w_i^{(t)}\cdot\exp[\alpha_t\cdot\mathbb{1}(y_i\neq h_t(x_i))]$$
AdaBoost - Model Weight
$$\alpha_t=\frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$$
Gradient Boosting - Update
$$F_m(x)=F_{m-1}(x)+\gamma_m h_m(x)$$
Gradient Boosting - Residual
$$r_{im}=-\left[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$$
15. Bias-Variance Tradeoff

⚖️ Error Decomposition
Total Error
$$E[(y-\hat{y})^2]=\text{Bias}^2+\text{Variance}+\text{Irreducible Error}$$
Bias
$$\text{Bias}=E[\hat{y}]-y$$
Variance
$$\text{Variance}=E[(\hat{y}-E[\hat{y}])^2]$$
💡 Key Insights:
High Bias → Underfitting (model too simple)
High Variance → Overfitting (model too complex)
Goal: Find the optimal balance between bias and variance
16. Quick Reference: Loss Functions

📉 Regression Losses
MSE: $$L=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$$ (standard regression)
MAE: $$L=\frac{1}{n}\sum_{i=1}^n|y_i-\hat{y}_i|$$ (robust to outliers)
Huber: $$L=\begin{cases}\frac{1}{2}(y-\hat{y})^2&|y-\hat{y}|\leq\delta\\\delta|y-\hat{y}|-\frac{1}{2}\delta^2&\text{otherwise}\end{cases}$$ (combines MSE & MAE)
🎯 Classification Losses
Binary Cross-Entropy: $$L=-[y\log(\hat{y})+(1-y)\log(1-\hat{y})]$$ (binary classification)
Categorical Cross-Entropy: $$L=-\sum_i y_i\log(\hat{y}_i)$$ (multi-class classification)
Hinge Loss: $$L=\max(0,1-y\cdot\hat{y})$$ (SVM classification)
17. Feature Engineering

📊 Normalization Techniques
Min-Max Scaling: $$x'=\frac{x-\min}{\max-\min}$$ (range [0, 1])
Z-Score Normalization: $$x'=\frac{x-\mu}{\sigma}$$ (roughly [-3, 3])
Max Abs Scaling: $$x'=\frac{x}{\max_i|x_i|}$$ (range [-1, 1])
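A brief NumPy sketch of the three scalings applied to an arbitrary illustrative feature column:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])              # illustrative feature column

min_max = (x - x.min()) / (x.max() - x.min())    # -> [0, 1]
z_score = (x - x.mean()) / x.std()               # -> mean 0, std 1
max_abs = x / np.abs(x).max()                    # -> [-1, 1]
print(min_max, z_score.round(3), max_abs)
```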
🔢 Polynomial Features
Degree 2
$$[x_1, x_2] \rightarrow [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$
Degree 3
$$[x_1, x_2] \rightarrow [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3]$$
💡 Tip: Polynomial features help capture non-linear relationships in data, but be careful of overfitting with high degrees.
🏷️ Encoding Categorical Variables
One-Hot Encoding
$$\text{Category } c \rightarrow [0, 0, \ldots, 1, \ldots, 0] \text{ (1 at position } c\text{)}$$
Label Encoding
$$\text{Categories } \{A, B, C\} \rightarrow \{0, 1, 2\}$$
👤 Contributor

Created and maintained by:

@iNSRawat

⭐ If this project helped you, consider giving it a star on GitHub!