Submitted: 29 April 2025
Posted: 29 April 2025
Abstract

Keywords:
1. Introduction
2. Motivation and Background
3. Core Principles
- Precise Definitions of Foundational Terms: All critical terminology (e.g., weights, gradients) is clearly and systematically defined to establish a consistent conceptual framework.
- Graphical and Diagrammatic Representation: Visual aids, including graphs and diagrams, are utilized extensively to illustrate data flow, transformations, and algorithmic processes, thereby enhancing conceptual clarity.
- Formulaic Pseudocode as a Bridging Mechanism: Pseudocode formulations are employed to systematically bridge abstract mathematical expressions and practical programming logic, facilitating a more intuitive transition from theory to implementation.
- Prioritization of Intuitive Explanations: Initial emphasis is placed on cultivating intuitive understanding before introducing formal mathematical structures, with the goal of reducing cognitive barriers and promoting more effective knowledge transfer.
4. Target Audience
- High school students embarking on the study of Machine Learning, particularly those seeking accessible and intuitive learning resources.
- Coding bootcamp graduates aiming to deepen their conceptual and practical grasp of Machine Learning beyond standard curricula.
- Software developers and practitioners who possess limited exposure to calculus or higher-level mathematics, yet aspire to implement and innovate within Machine Learning systems.
- Individuals who have encountered challenges with traditional Machine Learning literature, often characterized by excessive formalism and limited pedagogical accessibility.
- Current and future educators interested in re-envisioning Machine Learning instruction models, with an emphasis on enhancing accessibility, transparency, and engagement for emerging generations of developers and researchers.
5. Vision for the Future of Machine Learning Education
6. Contribution Summary
7. Foundations of Machine Learning
- Gradient: Refers to a tensor or scalar value that represents the rate of change of a function with respect to its inputs, typically used in optimization algorithms to adjust parameters during training.
- Scalar: A scalar is a tensor with a single parameter or value, often representing a simple quantity such as a real number or a single element within a dataset.
- Vector: A vector is a one-dimensional tensor, typically represented as an array or list of numbers, each corresponding to a distinct quantity within a multi-dimensional space.
- Tensor: A tensor is a multi-dimensional array that generalizes the concept of scalars (zero-dimensional tensors) and vectors (one-dimensional tensors) to higher dimensions, used to represent and store multi-dimensional data structures.
- Target: The target refers to the desired output value that the model aims to approximate or predict, typically specified for the output layer of a neural network during the training process.
- Output: The output refers to the value produced by the network’s final layer, commonly referred to as the output layer. This value represents the model’s prediction after processing the input through all preceding layers.
- Weights and Biases: Weights and biases are the parameters of each layer within a neural network. Weights represent the strength of the connections between neurons, while biases provide additional offsets to the activation functions. Together, these parameters define the transformation applied to input data as it propagates through the network. Analogous to synapses in the human brain, they control the flow of information within the model.
- Underfitting and Overfitting: Underfitting occurs when a model fails to capture the underlying patterns within the training data, often due to insufficient complexity or inadequate training. Conversely, overfitting arises when a model excessively memorizes the training data, capturing noise and irrelevant details instead of generalizable patterns. This often results in poor performance on unseen data.
- Further sections of this paper provide a more in-depth exploration of these concepts and their implications in the context of neural network training and optimization.

7.1. What Can a Tensor Store?
7.2. Layer in a Machine Learning Context
7.3. Neural Network

7.3.1. Understanding the Neural Network
- Input: Accepts raw data, which can be normalized to prevent gradient issues.
- Hidden: Applies matrix multiplication to propagate signal strengths throughout the network.
- Output: Produces the final prediction, which can be compared to the target to compute the network error.
- Data Collection and Preprocessing: A fundamental prerequisite for training a neural network is the acquisition of a clean, well-prepared dataset. This involves identifying and removing significant outliers and ensuring that the input features are correctly paired with their corresponding target values. The integrity of the data must be preserved to avoid introducing noise that could impair model performance.
- Model Architecture Definition: The next step is to design the architecture of the neural network. This includes the selection of layer types (such as Dense, Convolutional, or Recurrent layers), determining the number of layers, and specifying the number of units within each layer. Additionally, appropriate activation functions must be chosen for each layer to enable the model to learn complex relationships within the data.
- Loss Function Selection: The choice of a loss function is crucial for quantifying the error between the predicted outputs and the actual target values. Common loss functions include Mean Squared Error (MSE) for regression tasks and Mean Absolute Error (MAE), among others. The loss function serves as the objective that the network aims to minimize during training.
- Optimizer Selection: To optimize the learning process, an optimizer must be chosen to adjust the network’s weights. Algorithms such as Adam are commonly used due to their adaptive learning rates and effective convergence properties. The optimizer adjusts the learning rate dynamically to ensure efficient convergence based on the gradients computed during backpropagation.
- Forward Pass: During the forward pass, input data is propagated through the network. Each layer applies its weights, biases, and activation function, transforming the data at each stage. The final layer produces the predicted output, which is compared to the true labels or target values.
- Loss Computation: Once the forward pass is complete, the predicted output is evaluated against the true target values using the selected loss function. This step quantifies the difference between the predicted and actual values, providing a measure of the model’s performance.
- Backward Pass (Backpropagation): The backward pass involves calculating the gradients of the loss function with respect to the network’s weights. This is achieved by applying the Chain Rule of calculus to propagate the error backward through the network, layer by layer, to compute the necessary gradients.
- Weight Update: The optimizer is then used to update the network’s weights and biases based on the gradients computed during the backward pass. This iterative process continues until the model converges to a solution that minimizes the loss function.
8. Tensors
- Constituent Components of Tensors: Tensors are composed of several interrelated structural elements that define their behavior and application within computational frameworks.
- Dimension: In most Machine Learning contexts, tensors are utilized in their two-dimensional (2D) form. This paper primarily focuses on two-dimensional tensors; however, the conceptual principles discussed are extensible to tensors of higher dimensions.
- Shape: The shape of a two-dimensional tensor (matrix) is characterized by two parameters: the number of rows (X-axis) and the number of columns (Y-axis). This structure determines the organizational layout of the data within the tensor.
- Parameters: The total number of parameters within a tensor is defined as the product of its dimensions, calculated as rows × columns. This quantity corresponds to the total number of individual elements or connections represented.
- Weights: In neural network architectures, the primary tensor associated with each layer is referred to as the weights tensor. These tensors encode the learnable parameters that are updated during the training process.
- Bias: Complementing the weights tensor, the bias tensor is introduced at each layer to enable additional model complexity and flexibility, allowing the network to better capture patterns in the data.
- Scalar vs. Gradient: A tensor consisting of a single element (rank 0) is classified as a scalar. In contrast, gradient tensors representing derivatives with respect to model parameters generally possess higher-dimensional shapes and necessitate iteration mechanisms (e.g., loops) for computation across their elements.
- Logits: Logits refer to the raw, unnormalized output values produced by a neural network layer, typically represented within tensors prior to the application of activation functions such as softmax or sigmoid.
- Element: An element denotes a single numerical value contained within a tensor, identified by its unique position according to the tensor’s dimensional structure.

8.1. Key Gradient Pathologies
- Exploding Gradients: This phenomenon occurs when the magnitudes of tensor elements increase exponentially during backpropagation, potentially exceeding the representational limits of the specified data type and causing numerical overflow or instability.
- Vanishing Gradients: Conversely, vanishing gradients arise when the magnitudes of tensor elements diminish exponentially, approaching zero and resulting in numerical underflow. This leads to stagnation in learning, as weight updates become negligibly small.
8.2. Tensor Rank Versus Tensor Shape
- Rank (Order): The rank of a tensor refers to the number of dimensions it possesses. For example, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.
- Shape: The shape of a tensor specifies the explicit size of each dimension. For instance, a tensor with a shape of (3, 5) has three rows and five columns in its two-dimensional representation.
8.3. Axis Operations
8.4. Dot Product
| Listing 1. Formulaic Pythonic Pseudocode for Dot Product Computation. |
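A minimal plain-Python sketch of such a dot-product computation (the function name and the use of plain lists are illustrative choices, not the original listing):

```python
def dot_product(a, b):
    # Multiply paired elements, then sum the products.
    assert len(a) == len(b), "vectors must have equal length"
    total = 0.0
    for a_i, b_i in zip(a, b):
        total += a_i * b_i
    return total

print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```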
- Aggregating features within neural network architectures, particularly in fully connected (dense) layers.
- Computing the angle between two vectors, which provides insights into their directional alignment.
- Serving as the basis for cosine similarity, a common metric for comparing vector representations in tasks such as information retrieval and recommendation systems.
- A large positive result indicates strong agreement between the vectors (i.e., vectors oriented in similar directions).
- A negative result reflects opposing directions, suggesting that the vectors are substantially misaligned.
- A result close to zero indicates orthogonality (perpendicularity) between the vectors, implying the absence of linear correlation.

8.5. Tensor Addition
| Listing 2. Formulaic Pythonic Pseudocode for Tensor Addition. |
8.6. Tensor Subtraction
| Listing 3. Formulaic Pythonic Code for Tensor Subtraction. |
8.7. Scalar Addition
| Listing 4. Formulaic Pythonic Code for Scalar Addition. |
8.8. Scalar Subtraction
| Listing 5. Formulaic Pythonic Code for Scalar Subtraction. |
8.9. Scalar Multiplication
| Listing 6. Formulaic Pythonic Pseudocode for Scalar Multiplication. |
8.10. Scalar Division
| Listing 7. Formulaic Pythonic Pseudocode for Scalar Division. |
8.11. Use Cases of Tensor-Scalar Operations
- Loss Functions: Quantifying prediction errors by aggregating differences between predicted and actual outputs into a single optimization target.
- Evaluation Metrics: Assessing model performance through scalar scores such as accuracy, precision, recall, or F1 score.
- Regularization Terms: Penalizing model complexity by incorporating scalar-valued norms (regularization) into the loss function to encourage simpler models.
- Gradient Aggregation: Summarizing parameter updates across batches or layers to stabilize and guide the optimization process.
- State Monitoring: Tracking statistical properties of tensors, such as mean, variance, or norm, to facilitate model diagnostics and performance tuning.

8.12. Matrix Multiplication
| Listing 8. Formulaic Pythonic Pseudocode for Matrix Multiplication. |
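An illustrative naive triple-loop sketch of matrix multiplication over nested Python lists (names and structure are assumptions, not the paper's listing):

```python
def matmul(A, B):
    # (n x m) @ (m x p) -> (n x p): every output element is the dot product
    # of one row of A with one column of B.
    n, m, p = len(A), len(B), len(B[0])
    assert len(A[0]) == m, "inner dimensions must match"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```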
8.12.1. Misconception: Matrix Division?
8.13. Broadcasting
- Broadcasting Rules for Two-Dimensional Tensors:
- Let Tensor A and Tensor B be the two tensors involved in the operation.
- Broadcasting is permissible along the row dimension if the number of rows in A equals the number of rows in B, or if one of the tensors has exactly one row.
- Broadcasting is permissible along the column dimension if the number of columns in A equals the number of columns in B, or if one of the tensors has exactly one column.
- If either the row or column dimension is 1 in one tensor, it is virtually expanded (broadcast) to match the corresponding dimension of the other tensor.
| Listing 9. Formulaic Pythonic Code for Tensor Subtraction with Broadcasting. |

8.14. Tensor Raised to a Scalar Power
| Listing 10. Formulaic Pythonic Pseudocode for Element-wise Tensor Exponentiation. |
8.14.1. Use Cases
- Activation Functions: Certain activation functions involve raising tensor elements to a scalar power, particularly in custom or experimentally-derived non-linearities.
- Regularization: Element-wise exponentiation can be employed as part of regularization strategies to control model complexity and mitigate overfitting.
- Normalization Techniques: Several normalization methods involve squaring tensor elements, such as in variance normalization or RMS (Root Mean Square) computations.
- Probability Distributions: In probabilistic modeling, tensor elements are often raised to powers when computing likelihoods, especially in formulations involving exponential families or in certain Bayesian inference procedures.
8.14.2. Important Notes
- Element-wise Application: Exponentiation by a scalar applies strictly in an element-wise manner. It does not constitute matrix multiplication, even if the tensor is two-dimensional.
- Scalar Flexibility: The scalar exponent can be any real number, including negative values and fractions, provided that the base tensor elements support the operation (e.g., non-negative bases for real-valued outputs when using fractional exponents).
- Differentiability: Element-wise exponentiation is differentiable almost everywhere and is thus suitable for use in gradient-based optimization algorithms.
- Numerical Stability: The use of non-integer scalars in combination with zero or negative tensor elements may lead to undefined or unstable results, depending on the specific computational context.
- Nonlinear Transformation: Raising tensor elements to a scalar power constitutes a nonlinear transformation, significantly altering the data distribution and affecting downstream computations.

9. Tensor Shaping and Broadcasting Operations
9.1. Transpose

| Listing 11. Formulaic Pythonic Pseudocode for Tensor Transposition. |
9.2. Padding (Zero Padding)
| Listing 12. Formulaic Pythonic Pseudocode for Tensor Padding. |
9.3. Flatten
| Listing 13. Formulaic Pythonic Pseudocode for Tensor Flattening. |
9.4. Dimensional Products
| Listing 14. Concurrent Formulaic Pythonic Code for Dimensional Product Calculation. |
9.5. Reshape and Resize Operations
| Listing 15. Concurrent Formulaic Pythonic Code for Tensor Reshape/Resize. |
10. Fundamental Mathematical Concepts for Machine Learning
10.1. Calculus: A Descriptive Framework, Not a Barrier
10.2. Non-Linearity: The Essential Break from Linearity
10.3. Euler’s Number: A Fundamental Limit in Continuous Growth
| Listing 16. Formulaic Pythonic Code for Euler’s Number Approximation. |
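One way to approximate Euler's number in plain Python is via the compound-growth limit (1 + 1/n)^n; this sketch is illustrative, and the original listing may use a different formulation (e.g., the factorial series):

```python
def approximate_e(n=1_000_000):
    # e emerges as the limit of compound growth: (1 + 1/n) ** n as n grows.
    return (1.0 + 1.0 / n) ** n

print(approximate_e())  # ~2.71828, approaching e as n increases
```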
10.4. Mean of a Tensor: A Measure of Central Tendency
| Listing 17. Formulaic Pythonic Code for Tensor Mean. |
10.5. Median of a Tensor: A Robust Measure of Central Tendency
| Listing 18. Formulaic Pythonic Code for Tensor Median. |
10.6. Element-wise Absolute Value of a Tensor
| Listing 19. Formulaic Pythonic Code for Element-wise Tensor Absolute Value. |
10.7. Tensor Standard Deviation (std): A Measure of Dispersion
| Listing 20. Formulaic Pythonic Code for Tensor Standard Deviation. |
10.8. Hyperparameters: Configurable Variables in Model Design and Optimization
(1) Model Architecture Hyperparameters define the structural configuration of the neural network:
- Number of layers: The depth of the network, affecting its representational capacity.
- Neurons per layer: The dimensionality of each transformation, controlling expressive power.
- Activation functions: Non-linear transformations (e.g., ReLU, Sigmoid, GELU) applied to neurons, introducing model non-linearity.
(2) Optimization Hyperparameters govern how the model is trained using gradient descent:
- Learning rate: The step size at each iteration of optimization; critical for convergence behavior.
- Batch size: Number of training samples processed per update; affects memory usage and gradient stability.
- Epochs: The number of complete passes over the training dataset.
- Optimizer: Algorithm used to update weights (e.g., SGD, Adam, RMSProp), each with its own internal dynamics.
(3) Regularization Hyperparameters mitigate overfitting and improve generalization:
- Dropout rate: Fraction of neurons randomly deactivated during training to prevent co-adaptation.
- L1/L2 regularization coefficients: Penalize large weights to encourage sparsity or smoothness.
- Non-linearity inclusion: The use of activation functions contributes indirectly to regularization by introducing model flexibility.
(4) Hyperparameter Tuning Strategies determine how hyperparameters are selected:
- Grid search: Exhaustive evaluation across a Cartesian product of predefined hyperparameter values.
- Random search: Random sampling from defined distributions over hyperparameter values.
- Bayesian optimization: Model-based approach that builds a probabilistic surrogate function to efficiently explore the search space.
- Advanced methods: Includes algorithms like Hyperband, Optuna, and Population-Based Training (PBT) for scalable and adaptive tuning.
10.9. Function Hyperparameters: Tunable Modifiers of Algorithmic Behavior
- Example: Exponential Moving Averages in the Adam Optimizer
- Beta1 controls the decay rate of the moving average of the first moment (gradient). A typical default is 0.9, meaning recent gradients are weighted more heavily than earlier ones.
- Beta2 governs the second moment (squared gradients), often set to 0.999 to ensure long-term memory of variance.
- Dropout rate in regularization functions
- Alpha in leaky ReLU or ELU activations
- Temperature in softmax-based sampling
- Kernel width in Gaussian functions or RBF kernels
10.10. Derivatives: Quantifying Change in Continuous Functions
- Example: Power Function Derivative
10.11. Epsilon (ε): Thresholds for Numerical Stability and Convergence
- Illustrative Example: Iterative Stopping Criterion
- Important Clarification
10.12. Geometric Symbols and Mathematical Constants in Machine Learning
- (1) The Constant π
- (2) The Constant e
- (3) The Symbol θ
- The parameter vector in linear models (e.g., θ in logistic regression encapsulates weights and biases).
- The orientation or slope of a function with respect to its gradient during optimization.
10.13. The Error Function (erf): A Fundamental Link to Gaussian Distributions
- Applications in Machine Learning
- Activation functions: The Gaussian Error Linear Unit (GELU) activation function, used prominently in Transformer architectures, incorporates erf for smooth non-linear transformations.
- Optimization algorithms: Some adaptive optimizers and smoothing functions leverage erf for stabilizing updates or defining soft thresholds.
- Probabilistic modeling: In Gaussian processes, diffusion models, and statistical physics-inspired ML methods, erf naturally emerges due to its relation to cumulative Gaussian distributions.
- Numerical Approximation
| Listing 21. Formulaic Pythonic Code for Numerical Approximation of erf. |
11. Activation Functions: Modeling Complex Neuronal Dynamics
11.1. Meaning of Activation
11.2. Purpose and Biological Motivation
- Mathematical Role of Activation Functions
- Non-linearity introduction: Without non-linear activations, a network composed solely of affine transformations (matrix multiplications and bias additions) would collapse into a single equivalent linear transformation, severely limiting representational capacity. Non-linear activations (e.g., ReLU, sigmoid, GELU) enable networks to approximate arbitrary continuous functions, a property formally guaranteed by the Universal Approximation Theorem.
- Information shaping: Activations modulate the flow of information through the network layers, allowing selective amplification, suppression, or transformation of learned features. This shaping mechanism is crucial for the emergence of hierarchical feature representations in deep architectures.

11.3. Activation Function Progression and Comparative Analysis
11.3.1. Comparative Overview of Activation Functions
| Activation | Range | Non-linearity | Key Notes |
|---|---|---|---|
| ReLU | [0, ∞) | Sharp | Simple, sparse activation; efficient for deep convolutional networks. |
| Leaky ReLU | (-∞, ∞) | Sharp | Introduces a small negative slope to address neuron inactivity. |
| Swish | ≈ [-0.28, ∞) | Smooth | Self-gating; improves performance, especially in very deep networks. |
| GELU (erf) | ≈ [-0.17, ∞) | Smooth | Probabilistic activation; superior performance in Transformer architectures. |
| tanh | (-1, 1) | Smooth | Zero-centered; common in recurrent neural networks. |
| GELU (tanh) | ≈ [-0.17, ∞) | Smooth | Computationally efficient approximation of GELU. |
11.3.2. Expanded Insights on Activation Functions
| Activation | Additional Insights |
|---|---|
| ReLU | Prone to the “dying ReLU” phenomenon, where neurons output zero for all inputs. Dominant in CNNs due to computational simplicity. |
| Leaky ReLU | Commonly uses a negative slope coefficient α (often 0.01); mitigates dying neuron problems without introducing significant complexity. |
| Swish | Developed by Google researchers; exhibits non-monotonicity, enabling richer feature representations. |
| GELU (erf) | Softly gates inputs based on magnitude; combines ReLU’s sparsity with smoothness near the origin. |
| tanh | Saturates at the extremes, leading to vanishing gradients; more stable than sigmoid for zero-centered outputs. |
| GELU (tanh) | Provides a near-equivalent functional behavior to GELU with reduced computational overhead. |
11.3.3. Practical Tips and Cautions
| Activation | Usage Tips and Warnings |
|---|---|
| ReLU | Recommended as a default; monitor for neuron death, especially under aggressive learning rates. |
| Leaky ReLU | Useful when ReLU leads to stagnant training; helps maintain gradient flow even for negative inputs. |
| Swish | Yields marginally better performance in deep architectures at the cost of slightly increased computational burden. |
| GELU (erf) | Preferred for Transformer-based architectures (e.g., BERT, GPT) due to its smooth probabilistic gating behavior. |
| tanh | Best suited for recurrent architectures where centered activations improve gradient dynamics. |
| GELU (tanh) | Ideal for hardware-constrained environments seeking GELU-like performance with faster evaluation. |
11.4. Rectified Linear Unit (ReLU): Definition, Properties, and Practical Considerations
11.4.1. Intuitive Overview
11.4.2. Mathematical Definition
- Formulaic Pseudocode Implementation:
| Listing 22. Formulaic Pythonic Code for ReLU Activation. |
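A minimal, illustrative element-wise ReLU over a plain Python list (not necessarily the paper's exact formulation):

```python
def relu(x):
    # Pass positive inputs through unchanged; clamp negatives to zero.
    return [max(0.0, v) for v in x]

print(relu([-2.0, -0.5, 0.0, 1.5]))  # [0.0, 0.0, 0.0, 1.5]
```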
11.4.3. Advantages and Use Cases
- Introduction of Non-linearity: ReLU introduces non-linear transformations while retaining simplicity.
- Computational Efficiency: It requires only a simple thresholding operation, making it extremely fast to compute.
- Alleviation of Vanishing Gradient Problem: Unlike saturating activations (e.g., sigmoid, tanh), ReLU maintains stronger gradient magnitudes, improving optimization in deep networks.
- Sparsity Induction: ReLU naturally produces sparse activations by zeroing out negative values, reducing the effective computational burden and promoting representational efficiency.
- Positive Constraint: By filtering out negative values, ReLU enforces non-negative activations, which can be advantageous in certain architectures and regularization schemes.
- Faster Convergence: Empirically, networks employing ReLU converge faster during training compared to those using traditional saturating activations.
11.4.4. Limitations and Cautions
- Limited Representational Scope: By entirely discarding negative activations, ReLU can hinder the model’s ability to learn functions where negative outputs are semantically meaningful.
- Dying ReLU Phenomenon: Neurons can become permanently inactive if their inputs consistently yield negative values during training, resulting in zero gradients and loss of learning capacity for those neurons.
- Suitability for Complex Relationships: For highly intricate or symmetric data patterns involving negative domains, alternative activations such as Leaky ReLU, ELU, or GELU may offer superior performance.
11.4.5. Why is it faster to train?
11.4.6. An Alternative to ReLU: Leaky ReLU
- Addresses negative inputs, keeping neurons from falling into a permanently inactive state.
- Able to capture slightly more relationships by letting a small negative signal through.

11.4.7. ReLU graph
- Slope in the region x < 0: remains at 0.
- Slope in the region x > 0: regular linear increase (slope of 1).
11.5. Leaky Rectified Linear Unit (Leaky ReLU, LReLU): Definition, Properties, and Applications
11.5.1. Intuitive Overview
11.5.2. Mathematical Definition
11.5.3. Formulaic Pseudocode Implementation
| Listing 23. Formulaic Pythonic Code for Leaky ReLU Activation. |
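An illustrative sketch using the commonly assumed default slope α = 0.01 (the default value and names are assumptions):

```python
def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negatives are scaled by the small slope alpha.
    return [v if v > 0 else alpha * v for v in x]

print(leaky_relu([-2.0, 3.0]))  # [-0.02, 3.0]
```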
11.5.4. Advantages and Use Cases
- Introduction of Non-linearity: Similar to ReLU, LReLU introduces essential non-linear transformations into the network.
- Mitigation of Vanishing Gradients: By maintaining a non-zero gradient for negative inputs, LReLU reduces the risk of gradient vanishing compared to saturating functions like sigmoid or tanh.
- Neuron Survival: Allows neurons to maintain small gradients even when weights initially result in negative pre-activations, preventing complete neuron inactivity ("dying ReLU" problem).
- Efficient Computation: Simple to implement and computationally inexpensive, making it suitable for real-time or resource-constrained applications.
- Improved Convergence Speed: Networks employing LReLU often converge faster than those using pure ReLU in scenarios where data distributions produce many negative activations.
- Alternative to ReLU in Hidden Layers: Particularly beneficial in architectures where ReLU leads to excessive neuron death or sparse representations become detrimental.
11.5.5. Limitations and Cautions
- Limited Expressivity for Complex Patterns: While suitable for basic feature extraction, LReLU may not sufficiently capture highly intricate or symmetric relationships that require more sophisticated activations.
- Retention of Negative Values: Although small, the leakage of negative signals may not always align with tasks that benefit from strict non-negative representations.
11.5.6. Training Efficiency Considerations
11.5.7. Graphical Interpretation of Leaky ReLU
- Negative Region: For inputs x < 0, the function exhibits a slight downward tilt determined by the negative slope α, allowing a small gradient (leak) to persist.
- Positive Region: For inputs x ≥ 0, the function behaves as a standard identity mapping (f(x) = x), identical to ReLU behavior.
11.6. Swish Activation Function: Definition, Properties, and Practical Considerations
11.6.1. Intuitive Overview
11.6.2. Mathematical Definition
11.6.3. Formulaic Pseudocode Implementation
| Listing 24. Formulaic Pythonic Code for Swish Activation. |
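An illustrative sketch of Swish, x · sigmoid(βx), with β = 1 assumed as the default:

```python
import math

def swish(x, beta=1.0):
    # swish(v) = v * sigmoid(beta * v): a smooth, self-gated activation.
    return [v / (1.0 + math.exp(-beta * v)) for v in x]

print(swish([-2.0, 0.0, 2.0]))  # ~[-0.238, 0.0, 1.762]
```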
11.6.4. Advantages and Use Cases
- Non-linearity Introduction: Provides a smooth, non-monotonic mapping that enhances feature expressivity.
- Preservation of Negative Information: Unlike ReLU, Swish retains small negative values, facilitating richer learning dynamics.
- Improved Gradient Flow: The continuous derivative of Swish helps mitigate both vanishing and exploding gradient problems during backpropagation.
- Superior Performance: Empirical studies show that Swish often outperforms ReLU on deep convolutional networks (CNNs), Transformer architectures, and large-scale classification tasks.
- Suitability for Deep Architectures: Particularly effective in very deep models where smooth transitions improve optimization dynamics.
- Alternative to Sigmoid: Provides smoothness without suffering from the severe saturation issues of pure sigmoid activations.
11.6.5. Limitations and Practical Considerations
- Increased Computational Cost: Evaluation of the sigmoid function adds overhead compared to simpler activations like ReLU.
- Training Latency: Networks utilizing Swish may exhibit slower per-epoch training times relative to ReLU.
- Compatibility Limitations: Some machine learning frameworks may not natively support Swish, requiring custom implementation.
- Tradeoff with Simplicity: In some scenarios, the benefits of Swish may be marginal relative to its additional complexity and computational burden.
11.6.6. Graphical Interpretation of Swish
- Negative Region (x < 0): For negative inputs, Swish exhibits a small positive slope, allowing modest signal flow. Unlike ReLU, which zeroes negative inputs, Swish softly attenuates them, preserving some gradient and information flow during training.
- Positive Region (x > 0): For positive inputs, Swish behaves increasingly like the identity function, with the slope approaching 1 for large positive values. This ensures that strong activations pass through with minimal distortion, similar to ReLU but with smoother transitions.
11.7. Sigmoid Activation Function: Definition, Properties, and Practical Considerations
11.7.1. Intuitive Overview
11.7.2. Mathematical Definition
11.7.3. Formulaic Pseudocode Implementation
| Listing 25. Formulaic Pythonic Code for Sigmoid Activation. |
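A minimal element-wise sigmoid sketch in plain Python (illustrative, not the original listing):

```python
import math

def sigmoid(x):
    # Squash each value into (0, 1); large positives -> ~1, large negatives -> ~0.
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

print(sigmoid([-4.0, 0.0, 4.0]))  # ~[0.018, 0.5, 0.982]
```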
11.7.4. Advantages and Use Cases
- Smooth Non-linearity: Introduces a continuous and differentiable mapping, essential for gradient-based optimization methods.
- Probabilistic Interpretation: Outputs can be directly interpreted as probabilities, facilitating binary classification tasks.
- Output Layer in Binary Classification: Commonly used in the final layer of binary classifiers to squash raw model outputs into interpretable confidence scores.
- Computer Vision and Signal Processing: Used when outputs need to represent bounded, normalized measurements (e.g., pixel brightness).
11.7.5. Limitations and Practical Considerations
- Vanishing Gradient Problem: Gradients diminish near the asymptotic regions (outputs approaching 0 or 1), hindering learning in deep networks.
- Computational Overhead: Evaluation of the exponential function incurs higher computational cost relative to piecewise-linear activations like ReLU.
- Non-zero-centered Output: Since outputs are strictly positive, sigmoid activations can introduce bias in gradient updates, slowing convergence.
- Poor Suitability for Hidden Layers: Using sigmoid in hidden layers can severely impair training efficiency; modern practice favors ReLU, GELU, or tanh for intermediate layers.
11.7.6. Graphical Interpretation and Application in Binary Classification
- Compresses input values into the (0, 1) interval.
- Exhibits near-linear behavior around x = 0, enhancing sensitivity to small input changes.
- Flattens for large positive or negative inputs, where gradient magnitudes shrink dramatically.
- An output close to 1 indicates high confidence in the positive class.
- An output close to 0 indicates high confidence in the negative class.
11.8. Gaussian Error Linear Unit (GELU, erf-based): Definition, Properties, and Practical Applications
11.8.1. Intuitive Overview
11.8.2. Mathematical Definition
11.8.3. Formulaic Pseudocode Implementation
| Listing 26. Formulaic Pythonic Code for GELU (erf) Activation. |
11.8.4. Advantages and Use Cases
- Smooth Non-linearity: Balances the sparsity introduced by ReLU with the smoothness of sigmoid-like activations.
- Superior Performance in Deep Networks: Empirically shown to improve training stability and accuracy, particularly in Transformer-based architectures.
- Stable Gradient Flow: The smooth, differentiable structure ensures robust gradient propagation even across very deep layers.
- Model Expressiveness: Enables the capture of subtle relationships in high-dimensional data spaces.
- High Performance-to-Complexity Ratio: Especially effective in scaling models to billions of parameters (e.g., BERT, GPT architectures).
11.8.5. Limitations and Practical Considerations
- Increased Computational Complexity: Evaluation of the error function is significantly more expensive than basic ReLU or even Swish activations.
- Hardware Efficiency: Less suited for deployment on edge devices or real-time applications due to the higher evaluation cost.
- Potential Overfitting Risk: Due to its ability to closely model complex data patterns, GELU may overfit small or noisy datasets.
- Training Latency: Slower per-epoch training times compared to simpler activations.
- Alternative Approximations: Faster approximations, such as GELU(tanh), are often preferred in production environments to reduce inference time.
11.8.6. Graphical Interpretation of GELU (erf)
- Negative Region (x < 0): For negative inputs, the output decays smoothly toward zero without abrupt cutoffs. This soft suppression enables minor negative information to contribute to feature learning, while preserving stable gradient flow.
- Positive Region (x > 0): For positive inputs, the activation increases approximately linearly, asymptotically approaching the identity line. This behavior maintains strong gradient magnitudes for positive signals, supporting efficient learning.
11.9. Hyperbolic Tangent (tanh): Definition, Properties, and Practical Applications
11.9.1. Intuitive Overview
11.9.2. Mathematical Definition
11.9.3. Formulaic Pseudocode Implementation
| Listing 27. Formulaic Pythonic Code for tanh Activation. |
11.9.4. Advantages and Use Cases
- Zero-centered Output: Unlike sigmoid, tanh outputs are centered around zero, facilitating more balanced gradient updates during optimization.
- Smooth Non-linearity: Provides smooth, continuous gradients conducive to stable learning.
- Use in Recurrent Architectures: Commonly employed in Recurrent Neural Networks (RNNs) where controlling positive and negative signal propagation is beneficial.
- Legacy Neural Architectures: Historically favored in early neural networks for its ability to handle both positive and negative data distributions.
11.9.5. Limitations and Practical Considerations
- Vanishing Gradient Problem: Saturation at extreme input values leads to diminished gradients, impairing learning in deep networks.
- Slower Convergence: Networks using tanh may train slower compared to ReLU-based architectures.
- Limited Suitability for Deep CNNs and Transformers: In modern deep convolutional and Transformer models, activations like ReLU, GELU, or Swish are preferred due to better scaling properties.
- Higher Computational Cost Compared to Piecewise Activations: Requires exponential operations per evaluation.
11.9.6. Graphical Interpretation of tanh
- Negative Region (x < 0): For negative inputs, tanh(x) smoothly approaches -1. As inputs become more negative, the function flattens, resulting in smaller gradients and weaker sensitivity to input variations.
- Positive Region (x > 0): For positive inputs, tanh(x) increases toward 1 with a similar flattening effect. Near the origin (x = 0), tanh behaves approximately linearly, providing strong sensitivity to small changes.
11.10. Gaussian Error Linear Unit Approximation (GELU, tanh-based): Definition, Properties, and Practical Applications
11.10.1. Intuitive Overview
11.10.2. Mathematical Definition
11.10.3. Formulaic Pseudocode Implementation
| Listing 28. Formulaic Pythonic Code for GELU (tanh) Approximation. |
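An illustrative sketch of the widely used tanh-based approximation, 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))); the names are assumptions:

```python
import math

def gelu_tanh(x):
    # Tanh-based GELU approximation applied element-wise.
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]

print(gelu_tanh([-1.0, 0.0, 1.0]))  # ~[-0.159, 0.0, 0.841]
```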
11.10.4. Advantages and Use Cases
- Smooth Non-linearity: Maintains smooth gradient flow critical for training deep networks.
- Efficient Computation: Significantly faster to evaluate than the erf-based GELU, making it suitable for large-scale deployments.
- High-Fidelity Approximation: Provides a close match to the probabilistic behavior of the original GELU formulation.
- Transformer Architectures: Frequently employed in Transformer-based models such as BERT and GPT to balance computational cost with model expressivity.
11.10.5. Limitations and Practical Considerations
- Approximation Error: Although close, GELU(tanh) is still an approximation and may introduce minor deviations from true probabilistic behavior.
- Higher Cost Compared to ReLU: Still computationally heavier than simple piecewise activations like ReLU or Leaky ReLU.
- Potential Overfitting Risk: As with the exact GELU, its fine-grained sensitivity to input variations can lead to overfitting on small or noisy datasets.
11.10.6. Graphical Interpretation of GELU (tanh)
- Negative Region (x < 0): For negative inputs, the output softly decays toward zero without abrupt cutoffs, preserving minor negative activations. This enables richer feature learning compared to ReLU-like activations.
- Positive Region (x > 0): For positive inputs, the function behaves almost linearly for large magnitudes, closely tracking the identity function while maintaining smooth differentiability.
11.11. Softmax Activation: Definition, Properties, and Practical Applications
11.11.1. Intuitive Overview
11.11.2. Step-by-Step Breakdown
- Stabilization (Optional): To avoid numerical overflow during exponentiation, subtract the maximum value from each element in the vector.
- Exponentiation: Apply the exponential function to each stabilized value, emphasizing larger inputs.
- Summation: Compute the sum of all exponentiated values within the vector.
- Normalization: Divide each exponentiated value by the local sum to produce a normalized probability distribution.
11.11.3. Mathematical Definition
11.11.4. Formulaic Pseudocode Implementation
| Listing 29. Formulaic Pythonic Code for Softmax Activation. |
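An illustrative sketch following the stabilize, exponentiate, sum, and normalize steps described above:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, then normalize.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # ~[0.659, 0.242, 0.099], sums to 1
```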
11.11.5. Advantages and Use Cases
- Probabilistic Interpretation: Outputs can be interpreted as class confidence scores.
- Compatibility with Cross-Entropy Loss: When combined with cross-entropy loss, Softmax enables efficient optimization for multi-class classification.
- Human-Readable Outputs: Converts complex logits into easily interpretable probability distributions.
- Multi-Class Decision Making: Facilitates decisions between mutually exclusive outcomes.
11.11.6. Limitations and Practical Considerations
- Binary Classification: For binary problems, the sigmoid activation is typically preferred over Softmax.
- Regression Tasks: In regression problems predicting continuous values, Softmax is inappropriate.
- Hidden Layers: Softmax is rarely used in hidden layers, as it constrains activations unnecessarily early in the network.
11.11.7. Application in Multi-Class Output Layers
- Each class receives a probability score between 0 and 1.
- All class scores collectively sum to 1.
- The class with the highest probability is typically selected as the predicted label.
11.11.8. Graphical Interpretation of Softmax
- Each curve corresponds to a different class output.
- As one input becomes dominant, its associated probability sharply rises toward 1, while the probabilities of competing classes decline toward 0.
- The resulting probability vectors remain normalized, ensuring interpretability and consistency.
12. Derivatives of Activation Functions for Backpropagation
- Note: In all cases presented, it is assumed that the variable x refers to the input to the activation function during the forward pass, as required by the backpropagation procedure.
12.1. Derivative of the Rectified Linear Unit (ReLU)
- If the pre-activation input x > 0, the gradient flows unchanged (multiplied by 1).
- If x ≤ 0, the gradient is zeroed out (multiplied by 0).
- Formulaic Pseudocode for ReLU Derivative:
| Listing 30. Formulaic Pythonic Code for ReLU Derivative. |
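A minimal illustrative sketch of the ReLU derivative, using the common convention of returning 0 at x = 0:

```python
def relu_derivative(x):
    # Gradient passes through (1) where the forward input was positive, else 0.
    return [1.0 if v > 0 else 0.0 for v in x]

print(relu_derivative([-1.0, 0.0, 2.0]))  # [0.0, 0.0, 1.0]
```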
12.2. Derivative of the Leaky ReLU
- If x > 0, the derivative is 1.
- If x ≤ 0, the derivative is α (the negative-slope coefficient).
- Formulaic Pseudocode for Leaky ReLU Derivative:
| Listing 31. Formulaic Pythonic Code for Leaky ReLU Derivative. |
12.3. Derivatives of Advanced Activation Functions
- Note: All derivatives assume x refers to the input to the activation function unless otherwise specified.
12.3.1. Derivative of Swish Activation
- Formulaic Pseudocode for Swish Derivative:
| Listing 32. Formulaic Pythonic Code for Swish Derivative. |
| Listing 33. Sigmoid Helper Function for Swish Derivative. |
12.3.2. Derivative of Sigmoid Activation
- Formulaic Pseudocode for Sigmoid Derivative:
| Listing 34. Formulaic Pythonic Code for Sigmoid Derivative. |
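An illustrative sketch using the identity sigmoid'(x) = sigmoid(x) · (1 - sigmoid(x)):

```python
import math

def sigmoid_derivative(x):
    # Reuse the forward sigmoid value to form its derivative element-wise.
    s = [1.0 / (1.0 + math.exp(-v)) for v in x]
    return [s_i * (1.0 - s_i) for s_i in s]

print(sigmoid_derivative([0.0]))  # [0.25]
```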
12.3.3. Derivative of GELU (erf-based)
- Formulaic Pseudocode for GELU (erf) Derivative:
| Listing 35. Formulaic Pythonic Code for GELU (erf) Derivative. |
12.3.4. Derivative of tanh Activation
- Formulaic Pseudocode for tanh Derivative:
| Listing 36. Formulaic Pythonic Code for tanh Derivative. |
12.4. Derivative of GELU (tanh-based Approximation)
- Formulaic Pseudocode for GELU(tanh) Derivative:
| Listing 37. Formulaic Pythonic Code for GELU (tanh) Derivative. |
12.5. Derivative of Softmax (and Its Practical Simplification)
- Practical Note: In modern machine learning pipelines, Softmax is almost exclusively paired with cross-entropy loss. The gradient of their composition simplifies considerably: for a one-hot target vector y and predicted probabilities ŷ, the gradient with respect to the logits is simply ŷ - y.
13. Loss Functions: Formal Definitions and Computational Structure
- Note: Unless otherwise specified, all functions assume that input tensors are of equal shape and appropriately batched.
13.1. Mean Absolute Error (MAE)
- Formulaic Pseudocode for MAE:
| Listing 38. Formulaic Pythonic Code for Mean Absolute Error (MAE). |
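A minimal illustrative sketch over plain Python lists (function and argument names are assumptions):

```python
def mean_absolute_error(targets, predictions):
    # Average of the absolute differences between targets and predictions.
    assert len(targets) == len(predictions)
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```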
13.2. Mean Squared Error (MSE)
- Formulaic Pseudocode for MSE:
| Listing 39. Formulaic Pythonic Code for Mean Squared Error (MSE). |
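A corresponding illustrative sketch for MSE, differing from MAE only in squaring the differences:

```python
def mean_squared_error(targets, predictions):
    # Average of squared differences; larger errors are penalized quadratically.
    assert len(targets) == len(predictions)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

print(mean_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # ~0.4167
```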
13.3. Hinge Loss
- Formulaic Pseudocode for Hinge Loss:
| Listing 40. Formulaic Pythonic Code for Hinge Loss. |
13.4. Huber Loss
- Formulaic Pseudocode for Huber Loss:
| Listing 41. Formulaic Pythonic Code for Huber Loss. |
13.5. Binary Cross-Entropy (BCE)
- Formulaic Pseudocode for BCE:
| Listing 42. Formulaic Pythonic Code for Binary Cross-Entropy (BCE). |
13.6. Categorical Cross-Entropy (CCE)
- Formulaic Pseudocode for CCE:
| Listing 43. Formulaic Pythonic Code for Categorical Cross-Entropy (CCE). |
| Loss Function | Key Behavior |
|---|---|
| Mean Absolute Error (MAE) | Treats all errors equally by measuring the absolute difference between actual and predicted values. |
| Mean Squared Error (MSE) | Penalizes larger errors more severely by squaring them, making the model more sensitive to outliers. |
| Hinge Loss | Builds a margin of safety around classes, encouraging confident classification boundaries (commonly used in Support Vector Machines). |
| Huber Loss | Smoothly combines MAE and MSE behavior: acts like MSE for small errors and MAE for large errors, making it robust to outliers. |
| Binary Cross Entropy (BCE) | Measures the difference between two probability distributions in binary classification tasks (two classes: 0 or 1). |
| Categorical Cross Entropy (CCE) | Extends Cross Entropy to multi-class classification problems, comparing predicted probability distributions over multiple categories. |
14. Optimizers: Theory and Role in Training Dynamics
- The speed of convergence toward a solution.
- The stability of training across varying data distributions.
- The model’s ability to escape local minima or saddle points.
- Generalization performance on unseen data.
| Aspect | Adam Optimizer | SGD Optimizer |
|---|---|---|
| Learning Rate Adjustment | Adaptive learning rate for each parameter | Fixed or manually decayed learning rate |
| Speed of Convergence | Faster convergence, especially early on | Slower convergence without momentum |
| Memory Usage | Higher (stores moment estimates) | Lower (only stores gradients) |
| Sensitivity to Learning Rate | Less sensitive to initial learning rate choice | Highly sensitive to learning rate choice |
| Use Case | Good for complex networks, sparse gradients | Good for simple or very large datasets |
| Mathematical Complexity | More complex (uses moving averages of gradients) | Simpler (straightforward gradient update) |
14.1. Adam Optimizer (Adaptive Moment Estimation)
- m_t – the exponentially weighted moving average of past gradients (first moment estimate)
- v_t – the exponentially weighted moving average of squared gradients (second moment estimate)
- θ – current parameter value
- α – learning rate (default: 0.001)
- ε – small constant for numerical stability (e.g., 1e-8)
- β1, β2 – decay rates for the first and second moment estimates (typically β1 = 0.9, β2 = 0.999)
- Formulaic Pythonic Pseudocode for Adam Optimizer:
| Listing 44. Formulaic Pythonic Code for Adam Optimizer. |
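An illustrative single-parameter Adam update step using the standard defaults noted above (the structure and names are assumptions, not the paper's listing):

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single scalar parameter.
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage: start with m = v = 0 at t = 1, then increment t every step.
p, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)  # ~0.999
```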
Advantages of Adam
- Adaptive learning rate per parameter
- Requires minimal tuning and is robust to sparse gradients
- Combines fast convergence with stability in non-stationary settings
- Well-suited for large models and noisy objective functions
Limitations and Practical Notes
- Adam may sometimes lead to overfitting or poor generalization compared to SGD with momentum in certain vision tasks.
- Learning rate warm-up, weight decay (AdamW), or switching to SGD mid-training are often employed for better results.
- Commentary: While basic stochastic gradient descent (SGD) is inherently integrated into the backpropagation update rule, optimizers like Adam offer substantial performance gains in training time and convergence speed—particularly in deep and high-dimensional architectures.
14.2. Stochastic Gradient Descent (SGD)
- η – learning rate
- (x⁽ⁱ⁾, y⁽ⁱ⁾) – the training example or mini-batch used at the current step
- ∇θL – gradient of the loss with respect to model parameters
- Formulaic Pseudocode for SGD:
| Listing 45. Formulaic Pythonic Code for SGD Optimizer. |
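An illustrative element-wise SGD update sketch (each parameter moves a small step against its gradient):

```python
def sgd_step(params, grads, lr=0.01):
    # theta <- theta - learning_rate * gradient, applied element-wise.
    return [p - lr * g for p, g in zip(params, grads)]

print(sgd_step([0.5, -0.2], [0.1, -0.3]))  # [0.499, -0.197]
```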
Advantages of SGD
- Simple to implement and computationally efficient.
- Well-suited for very large datasets (via mini-batch training).
- Can generalize well when combined with regularization and proper learning rate scheduling.
Limitations and Practical Notes
- Convergence may be slow and noisy without momentum or learning rate decay.
- Requires careful tuning of the learning rate to avoid divergence or stagnation.
- Prone to getting stuck in sharp local minima or saddle points in complex loss surfaces.
- Note: SGD is essentially a direct implementation of the backpropagation update rule and is thus already integrated into the core mechanics of neural network training. Its simplicity makes it a standard benchmark and starting point for optimizer comparisons.
15. Normalization: Mitigating Gradient Instabilities in Deep Learning
- Min-Max Normalization rescales input features to a predefined range, usually [0, 1], preserving relative proportions.
- Z-Score Standardization (Standard Scaling) centers the data around zero and scales it to unit variance, promoting symmetry in gradient flow.
- Comparison of Normalization Methods:
| Feature | Min-Max Normalization | Z-Score Standardization |
|---|---|---|
| Range after scaling | Typically [0, 1]; customizable bounds | Centered around 0 with unit variance |
| Sensitive to outliers? | Highly sensitive | More robust due to variance scaling |
| Use case examples | Pixel intensities, bounded sensors | ML models assuming Gaussian features |
| Effect on distribution | Linear rescaling; preserves original shape | Normalizes spread and location |
| When to use? | When known min/max bounds exist | When data is approximately normal |
- Practical Note: Normalization is not strictly required for all models, but it plays a pivotal role in stabilizing gradient-based training processes. For networks prone to exploding/vanishing gradients, appropriate normalization can dramatically improve convergence rates and model robustness.
15.1. Z-Score Normalization (Standardization)
- Formulaic Pythonic Pseudocode for Standardization:
| Listing 46. Formulaic Pythonic Code for Z-Score Normalization. |
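An illustrative sketch using the population standard deviation plus a small epsilon to guard against division by zero (both choices are assumptions):

```python
import math

def z_score_normalize(values, eps=1e-8):
    # Center on the mean and scale to (approximately) unit variance.
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]

print(z_score_normalize([1.0, 2.0, 3.0, 4.0]))  # ~[-1.34, -0.45, 0.45, 1.34]
```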
Use Cases
- Effective when model assumptions include normally distributed features.
- Common in linear regression, logistic regression, SVMs, and neural networks.
15.2. Min-Max Normalization
- Formulaic Pythonic Pseudocode for Min-Max Normalization:
| Listing 47. Formulaic Pythonic Code for Min-Max Normalization. |
Use Cases
- Image pixel scaling (e.g., from [0, 255] to [0, 1]).
- Features with known and bounded value ranges.
- Practical Note: Min-max normalization assumes a stable range. In the presence of outliers or shifting distributions (e.g., in online learning), Z-score standardization or robust scaling may be preferred.
16. Tensor/Weights and Biases Initialization
- General Principles of Initialization
- Variance Preservation: Maintaining activation variance across layers prevents gradient shrinkage or amplification.
- Activation Function Dependency: Initialization should match the non-linearity used (e.g., ReLU vs. tanh).
- Symmetry Breaking: All weights must be initialized independently to allow neurons to learn distinct features. Biases may be initialized to zero.
| Initialization Method | Key Characteristics and Use Cases |
|---|---|
| Xavier (Glorot) | Scales weight variance by both fan-in and fan-out; well suited to sigmoid and tanh activations. |
| He (Kaiming) | Scales weight variance by fan-in; well suited to ReLU-family activations and deep networks. |
| Zero Initialization | Fails to break symmetry between neurons; suitable only for biases. |
| Random Uniform | Samples weights from a small symmetric interval; the range must be adjusted to layer size and depth. |
- Practical Note
16.1. Zero Initialization
- Fails to break symmetry between neurons.
- Leads to vanishing gradients and stalled learning.
- Only suitable for initializing biases.
16.2. Pseudo-Random Initialization (Uniform)
| Listing 48. Random Uniform Tensor Initialization. |
- Note: For uniform initialization, a common choice is a small symmetric range centered at zero. This should be adjusted depending on layer size and depth.
16.3. Xavier (Glorot) Initialization
| Listing 49. Formulaic Pythonic Code for Xavier Initialization. |
16.4. He Initialization
| Listing 50. Formulaic Pythonic Code for He Initialization. |
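An illustrative sketch drawing weights from a normal distribution with standard deviation √(2 / fan_in); the Gaussian variant of He initialization is assumed here, and the names are illustrative:

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    # Weight matrix of shape (fan_in, fan_out) sampled from N(0, sqrt(2 / fan_in)).
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = he_init(fan_in=4, fan_out=3)
print(len(weights), len(weights[0]))  # 4 3
```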
- Helps avoid vanishing gradients.
- Promotes stable signal propagation in deep networks.
- Generally converges faster in deep architectures with ReLU-based activations.
17. Backpropagation: The Core Learning Algorithm in Neural Networks
- Forward Pass – Computes activations layer by layer using current weights and biases.
- Loss Function Evaluation – Measures the discrepancy between the model’s prediction and the true target.
- Backward Pass – Propagates the loss gradient backward through the network, computing parameter-wise gradients.
- Gradient Descent Step – Updates weights and biases using gradients and a selected optimization algorithm.
17.1. Key Computational Tools for Backpropagation
- Matrix Multiplication – Enables dot products between weights and activations.
- Transposition – Required for aligning gradients across layer interfaces.
- Element-wise Operations – Applied to activation outputs and error signals.
- Activation Functions and Their Derivatives – Nonlinearities such as ReLU, tanh, or sigmoid must support differentiation.
17.2. Forward Pass
For each layer l, the forward computation can be written as z⁽ˡ⁾ = W⁽ˡ⁾ · a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ and a⁽ˡ⁾ = f(z⁽ˡ⁾), where:
- z⁽ˡ⁾ is the pre-activation output of layer l,
- W⁽ˡ⁾ and b⁽ˡ⁾ are the weight matrix and bias vector,
- f is the activation function (e.g., ReLU, tanh),
- a⁽ˡ⁾ is the post-activation output, passed to the next layer.
17.3. Loss Function
17.4. Backward Pass (Backpropagation)
Steps:
- Compute error gradient at the output layer.
- Propagate this error backwards through hidden layers.
- Multiply by activation function derivative.
- Use the gradients to update parameters.
- Formulaic Pythonic Pseudocode:
| Listing 51. Formulaic Pythonic Code for Backpropagation. |
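An illustrative end-to-end sketch of a single training step for a tiny one-hidden-layer network (ReLU hidden activation, linear output, squared-error loss); the layer sizes, data, and names are assumptions rather than the paper's listing:

```python
import random

rng = random.Random(0)
x = [0.5, -1.2, 3.0]     # input vector
target = 2.0             # desired scalar output
lr = 0.01                # learning rate

# Parameters: hidden layer (3 -> 2) and output layer (2 -> 1).
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
b1 = [0.0, 0.0]
W2 = [rng.uniform(-0.5, 0.5) for _ in range(2)]
b2 = 0.0

# Forward pass.
z1 = [sum(x[i] * W1[i][j] for i in range(3)) + b1[j] for j in range(2)]
a1 = [max(0.0, z) for z in z1]                   # ReLU
y = sum(a1[j] * W2[j] for j in range(2)) + b2    # linear output
loss = (y - target) ** 2

# Backward pass (chain rule), from the loss back to each parameter.
dy = 2.0 * (y - target)
dW2 = [dy * a1[j] for j in range(2)]
db2 = dy
da1 = [dy * W2[j] for j in range(2)]
dz1 = [da1[j] * (1.0 if z1[j] > 0 else 0.0) for j in range(2)]  # ReLU derivative
dW1 = [[dz1[j] * x[i] for j in range(2)] for i in range(3)]
db1 = dz1

# Gradient descent step.
W2 = [W2[j] - lr * dW2[j] for j in range(2)]
b2 -= lr * db2
W1 = [[W1[i][j] - lr * dW1[i][j] for j in range(2)] for i in range(3)]
b1 = [b1[j] - lr * db1[j] for j in range(2)]

print(round(loss, 4))
```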

17.5. Gradient Descent Step
17.6. Batch Training and Epochs
17.7. Summary: The Backpropagation Pipeline
Recent Advances
18. Conclusion
18.1. Guidelines for Increasing Accessibility in Machine Learning Education
- Avoid exclusive reliance on calculus-based notation. While mathematically rigorous exposition remains vital for theoretical development, educational materials should accompany such expressions with structured pseudocode or executable logic wherever possible.
- Recognize programming fluency as foundational. ML is ultimately implemented through code. Prioritizing algorithmic clarity ensures that practitioners from software engineering and systems programming domains can meaningfully engage with and apply ML principles.
- Promote hybrid explanations. Ideal instructional content should bridge formal mathematics and operational intuition—using diagrams, modular code, and relatable metaphors.
18.2. Machine Learning Without Walls: A Vision for Global Innovation
- Mathematicians focus exclusively on developing faster and more biologically realistic activation functions, such as novel generalizations of the Hodgkin–Huxley model.
- Systems engineers embed neural inference directly into operating system kernels to achieve ultra-low-latency pattern recognition.
- Educators adopt accessible frameworks for teaching ML logic, allowing students to implement and experiment before ever seeing formal integrals.
To the mathematicians: You do not have to carry the future of Machine Learning alone. To the programmers: Your intuition, structure, and systems thinking are just as vital. Together, we can dismantle the barriers that no one discipline can overcome in isolation.
18.3. A Complete Machine Learning Network Example
| Listing 52. Formulaic Pythonic Code for Example Complete machine learning network flow. |
19. Final Reflection: The Power of Formulaic Thinking








Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



















































