2. KAN Architecture
KAN and Multilayer Perceptron
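If f is a multivariate continuous function on a bounded domain, then f can be written as a finite composition of continuous functions of a single variable and the binary operation of addition. More specifically, for a smooth $f: [0,1]^n \to \mathbb{R}$,
$$f(x) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),$$
where $\varphi_{q,p}: [0,1] \to \mathbb{R}$ and $\Phi_q: \mathbb{R} \to \mathbb{R}$.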
The function f is thus represented as a finite sum of outer univariate functions applied to sums of inner univariate functions φ_{q,p}. These univariate functions can be highly complex, non-smooth, and even fractal-like, which poses significant challenges for the practical application of the theorem.
The only genuinely multivariate function is addition, as every other function can be expressed using univariate functions and summation. At first glance, this might seem like great news for machine learning: learning a high-dimensional function could be reduced to learning a polynomial number of 1D functions. However, these 1D functions can be non-smooth and even fractal, making them potentially unlearnable in practice (Fakhoury et al. 2022). Due to this pathological behavior, the ML community remained pessimistic about the practical applications of the Kolmogorov-Arnold representation theorem.
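As a toy illustration of the claim that addition is the only genuinely multivariate operation, the product of two positive numbers can be written with univariate functions and a single addition, since xy = exp(log x + log y). The sketch below shows only this idea; it is not the (2n+1)-term construction of the theorem, and the function names are illustrative.

```python
import math


def inner(x: float) -> float:
    """Univariate inner function, applied to each input separately."""
    return math.log(x)


def outer(s: float) -> float:
    """Univariate outer function, applied to the sum of inner values."""
    return math.exp(s)


def product_via_addition(x: float, y: float) -> float:
    """Compute x * y for x, y > 0 using only univariate maps and '+'."""
    return outer(inner(x) + inner(y))


value = product_via_addition(3.0, 4.0)
print(value)  # close to 12.0, up to floating-point rounding
```

Here the inner and outer functions happen to be smooth; the difficulty discussed above is that for a general f the theorem gives no such guarantee.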
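At the same time, the MLP is based on the Universal Approximation Theorem, under which an arbitrary function f(x) is represented as
$$f(x) \approx \sum_{i=1}^{N(\epsilon)} a_i \, \sigma(w_i \cdot x + b_i),$$
where $\sigma$ is the activation function, $a_i$, $w_i$, and $b_i$ are weights and biases, and $N(\epsilon)$ is the number of hidden neurons needed to reach accuracy $\epsilon$.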
It states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n, given appropriate weights and activation functions. In other words, neural networks can represent a wide variety of functions to any desired degree of accuracy, provided the hidden layer has a sufficient number of neurons. The choice of activation function matters; common functions that satisfy the theorem include the sigmoid function, hyperbolic tangent (tanh), and rectified linear unit (ReLU). The accuracy of the approximation depends on the number of neurons in the hidden layer: generally, more neurons lead to a better approximation, but the exact number required depends on the complexity of the target function.
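A small numerical experiment can make the statement more concrete. The sketch below is an illustration under simplifying assumptions, not a proof and not how MLPs are normally trained: the hidden weights and biases are drawn at random and kept fixed, and only the output weights are fitted by linear least squares. The target function, the number of neurons, and the weight scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def hidden_features(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hidden-layer outputs tanh(w_i * x + b_i), one column per neuron."""
    return np.tanh(np.outer(x, w) + b)


# Target function on the compact set [0, 1].
x = np.linspace(0.0, 1.0, 200)
target = np.sin(2.0 * np.pi * x)

# Hidden neurons with random, fixed inner weights and biases.
n_hidden = 50
w = rng.normal(scale=5.0, size=n_hidden)
b = rng.normal(scale=5.0, size=n_hidden)

# Fit only the output weights by linear least squares.
H = hidden_features(x, w, b)
a, *_ = np.linalg.lstsq(H, target, rcond=None)

max_error = float(np.max(np.abs(H @ a - target)))
print(f"max fit error with {n_hidden} neurons: {max_error:.2e}")
```

Rerunning with a smaller `n_hidden` is a simple way to observe how the achievable accuracy depends on the width of the hidden layer.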
While MLPs place fixed activation functions on nodes (“neurons”), KANs place learnable activation functions on edges (“weights”). As a result, KANs have no linear weight matrices at all: instead, each weight parameter is replaced by a learnable 1D function parametrized as a spline. KANs’ nodes simply sum incoming signals without applying any non-linearities.
In spite of their elegant mathematical foundation, KANs (Kolmogorov-Arnold Networks) are essentially combinations of splines and MLPs (Multilayer Perceptrons), capitalizing on their respective strengths while mitigating their weaknesses. Splines excel at handling low-dimensional functions, are easy to adjust locally, and can switch between different resolutions. However, they suffer from the curse of dimensionality (COD) due to their inability to exploit compositional structures. On the other hand, MLPs are less affected by COD thanks to their feature learning capabilities but are less accurate than splines in low-dimensional settings because they cannot optimize univariate functions as effectively. For a model to accurately learn a function, it must capture the compositional structure (external degrees of freedom) and approximate univariate functions well (internal degrees of freedom). KANs achieve this balance by integrating MLPs on the outside and splines on the inside.
KAN Architecture
The architecture of Kolmogorov-Arnold Networks (KANs) revolves around a novel concept where traditional weight parameters are replaced by univariate function parameters on the edges of the network. Each node in a KAN sums up these function outputs without applying any nonlinear transformations, in contrast with MLPs that include linear transformations followed by nonlinear activation functions.
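A general KAN network is a composition of L layers (Figure 2): given an input vector $x_0 \in \mathbb{R}^{n_0}$, the output of KAN is
$$\mathrm{KAN}(x_0) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)\, x_0,$$
where $\Phi_l$ denotes the l-th KAN layer.
Notice that an MLP can be written as an interleaving of affine transformations and nonlinearities,
$$\mathrm{MLP}(x) = (W_{L-1} \circ \sigma \circ W_{L-2} \circ \sigma \circ \cdots \circ W_1 \circ \sigma \circ W_0)\, x,$$
where $W_n$ are linear weight parameters and $\sigma$ is a nonlinear activation function.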
Liu et al. (2024) extend the network (*) to accommodate arbitrary widths and depths. Most functions in science and engineering are smooth and exhibit sparse compositional structures, which can enable smooth Kolmogorov-Arnold representations. This approach aligns with the mindset of physicists, who usually focus on typical cases rather than worst-case scenarios. Ultimately, both our physical world and machine learning tasks must possess inherent structures to make physics and machine learning useful and generalizable.
An example of an activation function is shown in Figure 3.
A KAN layer with $n_{in}$-dimensional inputs and $n_{out}$-dimensional outputs can be defined as a matrix of 1D functions
$$\Phi = \{\varphi_{q,p}\}, \qquad p = 1, \ldots, n_{in}, \quad q = 1, \ldots, n_{out},$$
where the functions $\varphi_{q,p}$ have trainable parameters. In the Kolmogorov-Arnold theorem, the inner functions form a KAN layer with $n_{in} = n$ and $n_{out} = 2n+1$, and the outer functions form a KAN layer with $n_{in} = 2n+1$ and $n_{out} = 1$.
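The view of a layer as a matrix of 1D functions can be sketched in a few lines. In the example below each edge function is a plain Python callable chosen by hand instead of a learned spline, and the names `kan_layer`, `layer0`, and `layer1` are illustrative. Two such layers compose into a [2, 1, 1] network that reproduces exp(sin(πx) + y^2), the example function used later in this section.

```python
import math
from typing import Callable, List, Sequence

Univariate = Callable[[float], float]


def kan_layer(phi: List[List[Univariate]], x: Sequence[float]) -> List[float]:
    """Apply a KAN layer: output[q] = sum over p of phi[q][p](x[p])."""
    return [sum(f(xp) for f, xp in zip(row, x)) for row in phi]


# Layer 0: n_in = 2, n_out = 1, so one row with two edge functions.
layer0 = [[lambda x: math.sin(math.pi * x), lambda y: y ** 2]]
# Layer 1: n_in = 1, n_out = 1, so a single edge function.
layer1 = [[math.exp]]


def kan(x: float, y: float) -> float:
    """[2, 1, 1] KAN with hand-picked (not learned) edge functions."""
    hidden = kan_layer(layer0, [x, y])
    return kan_layer(layer1, hidden)[0]


print(kan(0.5, 1.0))  # exp(sin(pi / 2) + 1) = exp(2)
```

Note that the nodes only add their incoming values; all nonlinearity sits on the edges.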
Regularization and Activation Functions in KANs
In KANs, linear weights are replaced by learnable activation functions, so an L1 norm has to be defined for these activation functions. For MLPs, L1 regularization of the linear weights is used to favor sparsity.
L1 regularization is a technique used in ML, particularly in linear models like linear regression or linear classifiers such as logistic regression. It is used to prevent overfitting and to encourage simpler models by penalizing the absolute magnitude of the coefficients (weights) in the model. In the context of linear models, the L1 regularization adds a penalty term to the loss function proportional to the absolute value of the coefficients. This penalty is often represented by the L1 norm of the weight vector. When the optimization algorithm minimizes this penalized loss function, it tends to drive many of the coefficients towards zero.
The effect of this regularization is to encourage sparsity in the solution, meaning that many of the coefficients become exactly zero. In other words, L1 regularization tends to drive irrelevant or less important features to have zero coefficients, effectively removing them from the model. This is particularly useful when dealing with high-dimensional data where many features may not contribute significantly to the predictive power of the model.
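For a KAN, the same idea has to be applied to functions, not scalar weights. One possible definition, the one used by Liu et al. (2024), takes the L1 norm of an activation function to be its mean absolute output over the training inputs, and the L1 norm of a layer to be the sum over its edges. The sketch below assumes this definition; the function names and the tiny data set are made up for illustration.

```python
import numpy as np


def activation_l1(phi, samples: np.ndarray) -> float:
    """Mean absolute output of a 1D activation over the input samples."""
    return float(np.mean(np.abs(phi(samples))))


def layer_l1(phis, inputs: np.ndarray) -> float:
    """Sum of activation L1 norms over all edges of one layer.

    phis[q][p] is the edge function from input p to output q;
    inputs has shape (n_samples, n_in).
    """
    return sum(
        activation_l1(phi, inputs[:, p])
        for row in phis
        for p, phi in enumerate(row)
    )


# Hypothetical layer with two inputs and one output.
inputs = np.array([[-1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
phis = [[np.sin, lambda y: 0.0 * y]]  # the second edge is inactive

print(layer_l1(phis, inputs))
```

An edge whose function is identically zero contributes nothing to the penalty, which is the functional analogue of a zero weight in a linear model.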
Activation functions introduce non-linearity into neural networks, allowing them to model complex relationships between inputs and outputs. In NLP tasks, activation functions are typically applied at each neuron in the neural network layers, including input, hidden, and output layers. Popular activation functions for MLP include ReLU (Rectified Linear Unit), sigmoid, and tanh.
In NLP networks, the activation function of a neural network does not directly encode the meaning of a sentence. Instead, it serves to introduce non-linearity into the network, enabling it to learn complex relationships between inputs and outputs. However, the activations across the network layers, combined with the parameters (weights and biases), collectively contribute to the model’s ability to understand and represent the meaning of sentences.
The selection of activation functions in a neural network has a significant impact on the training process. There is, however, no obvious way to choose them because the “optimal choice” may depend on the specific task or problem to be solved. Nowadays, ReLU activation functions (and variations) are the default choice in the broad spectrum of activation functions for many types of neural networks.
In NLP, the input to a neural network representing a sentence is typically a sequence of word embeddings. Each word embedding captures some information about the meaning of the word it represents. As the input passes through the network, activations in different layers encode increasingly abstract representations of the input sentence. At each layer, the activation function (e.g., ReLU, sigmoid) transforms the weighted sum of inputs into an output signal.
Neural networks, particularly those used in NLP tasks like sentence classification or sentiment analysis, are capable of learning compositional representations. This means that they can understand the meaning of a sentence by combining the meanings of its constituent words and phrases. The activation patterns in the network layers capture these compositional relationships. During the training process, the MLP network adjusts its parameters (weights and biases) based on the input-output pairs provided in the training data. The activation function plays a crucial role in determining how the network learns to map input representations to output predictions. Through backpropagation and gradient descent, the network learns to adjust its parameters to minimize the loss function, thereby improving its ability to capture the meaning of sentences.
While the activation function itself does not encode the meaning of a text, it contributes to the overall capability of the neural network to learn and represent semantic information. The meaning of a sentence emerges from the collective behavior of the network’s activations and parameters, shaped by the training data and the task-specific objective.
Activation functions of KAN are designed as follows:
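A residual activation function includes a basis function b(x) (similar to residual connections), such that the activation function ϕ(x) is the sum of the basis function b(x) and the spline function:
$$\phi(x) = w \, \bigl( b(x) + \mathrm{spline}(x) \bigr), \qquad b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}.$$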
The SiLU activation function can be seen as a smooth version of the ReLU (Rectified Linear Unit) activation function. It keeps much of the behaviour of ReLU, such as non-saturating gradients for large positive inputs, while also being smooth and differentiable everywhere. This smoothness can help with gradient-based optimization methods during training.
Compared to ReLU, which can sometimes suffer from a “dying ReLU” problem where neurons can become inactive during training and never recover, SiLU tends to have a more consistent gradient flow throughout the network, potentially leading to better convergence and performance.
spline(x) is parametrized as a linear combination of B-splines,
$$\mathrm{spline}(x) = \sum_i c_i B_i(x),$$
where the coefficients $c_i$ are trainable. The factor w is not essential, since it can be absorbed into b(x) and spline(x). Each activation function is initialized to have spline(x) ≈ 0.
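The residual activation can be sketched as follows, assuming the form ϕ(x) = w (b(x) + spline(x)) with b(x) = SiLU(x). To keep the example short it uses degree-1 B-splines (hat functions), for which the linear combination of basis functions coincides with linear interpolation of the coefficients inside the grid; this is a simplification chosen here for brevity. The grid range and function names are assumptions.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU basis function b(x) = x / (1 + exp(-x))."""
    return x / (1.0 + np.exp(-x))


def residual_activation(x, grid, coeffs, w=1.0):
    """phi(x) = w * (b(x) + spline(x)) with a degree-1 spline.

    For degree-1 B-splines (hat functions) centred on `grid`, the sum
    of c_i * B_i(x) equals linear interpolation of the coefficients
    for x inside the grid.
    """
    x = np.asarray(x, dtype=float)
    spline = np.interp(x, grid, coeffs)
    return w * (silu(x) + spline)


grid = np.linspace(-2.0, 2.0, 9)  # knot positions (assumed range)
coeffs = np.zeros_like(grid)      # initialization with spline(x) = 0

x = np.linspace(-2.0, 2.0, 5)
print(residual_activation(x, grid, coeffs))  # equals silu(x) at initialization
```

With all coefficients at zero the activation reduces to the basis function, so training starts from a plain SiLU and the spline learns a correction on top of it.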
B-Splines
Univariate spline functions are a valuable tool in approximation theory (Lyche et al. 2018). These functions are piecewise polynomials with a specific degree and global smoothness. Notably, maximally smooth splines exhibit exceptional approximation behavior relative to the degree of freedom (Sande et al. 2019). ReLU functions can be considered linear spline functions. More flexible (learnable) spline activation functions, as studied by Bohra et al. (2020), have shown that their adaptability can reduce the overall network size required to achieve a given accuracy. Therefore, there is a trade-off between the complexity of the network architecture and the complexity of the activation functions.
Univariate spline functions can be expressed as linear combinations of B-splines, a set of locally supported basis functions that form a nonnegative partition of unity. B-spline representations are ideal for approximating univariate smooth functions because they can be compactly described by a small number of parameters, with each parameter having a local effect. Efficient algorithms for their computation are available (Lyche et al. 2018). Multivariate extensions can be readily achieved by taking tensor products of B-splines (Figure 4).
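The definition of B-splines requires the concept of a knot sequence. A knot sequence is a nondecreasing sequence of real numbers, $\xi := \{\xi_1 \le \xi_2 \le \cdots \le \xi_r\}$. The elements of $\xi$ are called knots. Assuming integers $r \ge p + 2 \ge 2$, on such a sequence it is possible to define $N := r - p - 1$ B-splines of degree p.
Given a knot sequence $\xi$, the n-th B-spline of degree $p \ge 0$ is identically zero if $\xi_{n+p+1} = \xi_n$ and is otherwise defined recursively by
$$B_{n,p}(x) = \frac{x - \xi_n}{\xi_{n+p} - \xi_n} \, B_{n,p-1}(x) + \frac{\xi_{n+p+1} - x}{\xi_{n+p+1} - \xi_{n+1}} \, B_{n+1,p-1}(x),$$
starting from $B_{n,0}(x) = 1$ if $x \in [\xi_n, \xi_{n+1})$ and $B_{n,0}(x) = 0$ otherwise, with the convention that fractions with a zero denominator are zero.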
Figure 4 depicts B-splines for certain values of N, p, and ξ.
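The recursive definition of B-splines (the Cox-de Boor recursion) translates directly into code. The sketch below is a straightforward, unoptimized implementation using 0-based indices for the knots and basis functions; the knot sequence and degree are arbitrary choices made for the example. It also checks numerically that the basis functions sum to one.

```python
import numpy as np


def bspline_basis(n: int, p: int, knots: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Evaluate the n-th B-spline of degree p (0-based n) at the points x.

    Follows the recursive definition; fractions with a zero
    denominator are taken to be zero.
    """
    if p == 0:
        return np.where((knots[n] <= x) & (x < knots[n + 1]), 1.0, 0.0)
    out = np.zeros_like(x, dtype=float)
    left_den = knots[n + p] - knots[n]
    right_den = knots[n + p + 1] - knots[n + 1]
    if left_den > 0:
        out += (x - knots[n]) / left_den * bspline_basis(n, p - 1, knots, x)
    if right_den > 0:
        out += (knots[n + p + 1] - x) / right_den * bspline_basis(n + 1, p - 1, knots, x)
    return out


p = 2
# Open knot sequence on [0, 1]: the end knots are repeated p + 1 times.
knots = np.array([0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0])
N = len(knots) - p - 1  # number of B-splines, N = r - p - 1

x = np.linspace(0.0, 0.99, 100)
B = np.stack([bspline_basis(n, p, knots, x) for n in range(N)])
print(B.shape, B.sum(axis=0).min(), B.sum(axis=0).max())
```

For production use, a library routine would normally be preferred over this direct recursion, which recomputes lower-degree basis functions many times.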
The multivariate B-spline vector $\mathbf{B}^t(x)$ can be interpreted as a fuzzy hierarchical partition of the domain that induces a tree structure with L levels for every t = 1, …, T. This can be explained as follows. Let us fix t. For each l = 1, …, L, by construction, $\mathbf{B}_{N_l,p_l}(x^t_l)$ is a vector of $N_l$ B-splines of degree $p_l$. Its components are nonnegative real values that sum up to one, and thus can be regarded as a distribution over a discrete set of hidden classes $\{C^t_{l,1}, \ldots, C^t_{l,N_l}\}$ at level l, where for $n_l = 1, \ldots, N_l$ we have
$$P(C^t_{l,n_l} \mid x) = B_{n_l,p_l}(x^t_l).$$
The B-spline $B_{n_l,p_l}$ plays the role of a decision or gating function at level l based on the feature $x^t_l$. Then, under the assumption that the events are mutually independent, the joint probability on the hierarchy of hidden classes at all levels is given by
$$P(C^t_{1,n_1}, \ldots, C^t_{L,n_L} \mid x) = \prod_{l=1}^{L} B_{n_l,p_l}(x^t_l)$$
for $n_l = 1, \ldots, N_l$ and l = 1, …, L. All together they form the multivariate B-spline vector $\mathbf{B}^t(x)$. A graphical representation of the induced tree structure is in Figure 5.
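The probabilistic reading can be checked numerically on a small case. The sketch below assumes two levels with three and four hidden classes, uses degree-1 B-splines on a uniform grid over [0, 1] as gating functions, and picks the feature values arbitrarily. Under the independence assumption, the joint probabilities are the entries of the tensor (outer) product of the two gating vectors.

```python
import numpy as np


def hat_basis(x: float, n_basis: int) -> np.ndarray:
    """Degree-1 B-splines on a uniform grid over [0, 1], evaluated at x."""
    centers = np.linspace(0.0, 1.0, n_basis)
    width = 1.0 / (n_basis - 1)
    return np.clip(1.0 - np.abs(x - centers) / width, 0.0, None)


# Two levels with 3 and 4 hidden classes; the feature values are made up.
gate_level1 = hat_basis(0.30, 3)  # distribution over the classes at level 1
gate_level2 = hat_basis(0.65, 4)  # distribution over the classes at level 2

# Independence assumption: the joint probability is the product of the gates.
joint = np.outer(gate_level1, gate_level2)

print(gate_level1, gate_level2)
print(joint.sum())
```

Because each gating vector is nonnegative and sums to one, so does the joint table, and only a few of its entries are nonzero thanks to the local support of the B-splines.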
Symbolic Regression in KAN
What is a good way to select KAN shape that best reflects the structure of a dataset?
For instance, if we know the dataset is generated via the symbolic formula f(x, y) = exp(sin(πx) + y^2), then it is obvious that a [2, 1, 1] KAN can “implement” this function. In real-world scenarios, however, an ML designer usually doesn’t have this information in advance, so it’s beneficial to determine the KAN shape automatically. The strategy involves starting with a sufficiently large KAN and training it with sparsity regularization followed by pruning. These pruned KANs turn out to be much more interpretable than their non-pruned counterparts. To further enhance interpretability, Liu et al. (2024) propose several simplification techniques and provide examples of how users can interact with KANs to make them more interpretable.
In cases where the activation functions turn out to be symbolic (e.g., tanh or sin), there is an interface to set them to a specified symbolic form: fix_symbolic(l,i,j,f) sets the (l, i, j) activation to be f. However, one cannot simply replace the activation function with the exact symbolic formula, since its inputs and outputs may have shifts and scalings. Therefore, pre-activations x and post-activations y need to be computed from the training data, and affine parameters (a, b, c, d) need to be fitted such that y ≈ c f(ax+b)+d. The fitting is done by an iterative grid search over a, b and linear regression for c, d (Figure 7).
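The fitting step can be sketched as follows. This is a minimal version under stated assumptions, not the library's implementation: it performs a single coarse grid pass over (a, b), whereas an iterative variant would repeat the search on a finer grid around the best pair. The function name `fit_affine`, the grid range, and the synthetic data are illustrative.

```python
import numpy as np


def fit_affine(f, x, y, a_grid, b_grid):
    """Fit y ~ c * f(a * x + b) + d.

    (a, b) come from an exhaustive grid search; for each pair, (c, d)
    are obtained by linear least squares. Returns (a, b, c, d, mse).
    """
    best = None
    for a in a_grid:
        for b in b_grid:
            fx = f(a * x + b)
            design = np.column_stack([fx, np.ones_like(fx)])
            (c, d), *_ = np.linalg.lstsq(design, y, rcond=None)
            mse = float(np.mean((c * fx + d - y) ** 2))
            if best is None or mse < best[-1]:
                best = (float(a), float(b), float(c), float(d), mse)
    return best


# Synthetic pre-/post-activations generated from an affine-wrapped sine.
x = np.linspace(-1.0, 1.0, 200)
y = 3.0 * np.sin(2.0 * x + 0.4) - 1.0

grid = np.linspace(-4.0, 4.0, 41)  # step 0.2, one coarse pass only
a, b, c, d, mse = fit_affine(np.sin, x, y, grid, grid)
print(a, b, c, d, mse)
```

The recovered parameters need not equal the generating ones: for an odd function such as sin, the triple (-a, -b, -c) describes the same curve, so the quality of the fit is best judged by the residual and not by the parameter values.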
The symbolic space is densely filled, making it challenging to derive the correct symbolic formula. In some cases, such a formula might not exist at all. KAN can maintain the sensitivity of symbolic regression, especially in the presence of noise, which has its pros and cons. Fortunately, it is relatively easy to identify symbolic formulas that closely align with the data, often within an acceptable margin of error (epsilon). While these approximate symbolic formulas may not precisely capture the underlying relationship, they still offer valuable insights, possess predictive capabilities, and are computationally efficient.
The limitation of KAN is that determining the exact formula proves to be a formidable challenge. This becomes crucial in scenarios where precision is paramount, either for generalizability in tasks in physics or for fitting experimental data or solving partial differential equations with machine-like accuracy. For the former scenario, the quest for generalizability is ongoing and requires meticulous examination on a case-by-case basis. In the latter case, we can gauge the fidelity of a symbolic formula by observing a reduction in loss approaching machine precision.
User Interacting with Explainable KAN
Neural networks, with their intricate structures and multitude of parameters, often pose challenges in understanding the inner workings of their resulting functions. This complexity has earned them the moniker of “black-box models,” as their behavior can be highly unpredictable, particularly in contexts where comprehending the rationale behind a decision is crucial. To address this, specific assumptions tailored to the task at hand can be imposed on the neural network’s architecture, such as utilizing convolutional neural networks for image-related tasks. Additionally, post-hoc gradient-based methods, as suggested by Ribeiro et al. (2016), can be employed to unravel the contributions of features to the final decision.
Alternatively, simpler models like linear and additive models, as proposed by Hastie and Tibshirani (1990), offer an avenue to circumvent the challenges posed by neural networks. Additive models, inspired by KAN tree models, offer high interpretability by basing predictions on a hierarchical partition of the input space, represented as a list of rules. In contrast, classical decision trees are susceptible to overfitting due to their piecewise constant nature and the inherent instability introduced by the greedy algorithms used in their learning process. Probabilistic or fuzzy trees, however, exhibit greater resilience to noisy data and are adept at handling uncertainty in imprecise contexts and domains. Performance can be further enhanced through ensemble techniques such as random forests. Various neural network architectures have been devised to mimic the interpretability of additive models, as demonstrated by Potts (1999), and to emulate classical or fuzzy tree structures, as explored by Kontschieder et al. (2015).
Explainability in KAN is implemented as a result of network simplification.
Simplification choices can be viewed as hypothetical scissors with which a user cuts network nodes. A user interacting with the network can select which node is most promising to cut next to make KANs more interpretable.
Let us consider the regression task f(x, y) = exp(sin(πx) + y^2) as an example. Given data points (x_i, y_i, f_i), i = 1, 2, …, N_p, a user attempts to derive a symbolic formula. The steps of user interaction with the KANs are as follows:
(1) Training with sparsification. Starting from a fully-connected [2, 5, 1] KAN, training with sparsity regularization can make the network significantly more sparse. Four out of five neurons in the hidden layer turn out not to perform any function and can be removed.
(2) Network reduction. Automatic pruning is seen to discard all hidden neurons except the last one, leaving a [2, 1, 1] KAN. The activation functions appear to be known symbolic functions.
(3) Setting symbolic functions. Assuming that the user can correctly guess these symbolic formulas from staring at the KAN plot, she can set them with fix_symbolic(0,0,0,'sin'), fix_symbolic(0,1,0,'x^2'), and fix_symbolic(1,0,0,'exp'). In case the user has no domain knowledge or no idea which symbolic functions these activation functions might be, there is a function suggest_symbolic to suggest symbolic candidates.
(4) Perform more training. After turning all the activation functions in KAN into symbolic form, the only remaining parameters are the affine parameters. These affine parameters require additional training. Once the loss is minimized, the corresponding symbolic expression is expected to be correct.
(5) Produce a symbolic formula of the output node. The user obtains $1.0\,e^{1.0 y^2 + 1.0 \sin(3.14 x)}$, which is the true answer. For compactness, only two decimals for π are shown.
Sparsification refers to the process of reducing the number of effective parameters (weights and biases) in a neural network while attempting to maintain its performance. This is done by setting a significant number of the network’s weights to zero, effectively making the network “sparse.” Sparsification can lead to several benefits, including reduced computational complexity, lower memory usage, faster inference times, and potentially improved generalization. There are several techniques for making KAN more sparse (Figure 8):
(1) Pruning: This involves removing weights that contribute the least to the overall performance of the network. Pruning can be done based on magnitude (removing weights with the smallest absolute values), sensitivity (removing weights that cause the least change in the loss function), or structured pruning (removing entire neurons, filters, or channels).
(2) Regularization: Techniques like L1 regularization can encourage sparsity by adding a penalty to the loss function that is proportional to the sum of the absolute values of the weights. This encourages many weights to become exactly zero during training.
(3) Quantization: This process reduces the precision of the weights, which can lead to many weights effectively becoming zero, contributing to sparsity.
(4) Low-rank factorization: This technique approximates the weight matrices in the network by lower-rank matrices, which can lead to a sparser representation.
(5) Sparse training: Training algorithms can be modified to enforce sparsity directly during the training process. Techniques such as DropConnect randomly drop connections during training, which can lead to a sparser network (Wan et al. 2013).
The main goal of sparsification is to create a more efficient model that retains as much of the original performance as possible while reducing resource requirements. It is particularly valuable for deploying neural networks on devices with limited computational power and memory, such as mobile phones or embedded systems.
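Of the techniques listed above, magnitude-based pruning is the simplest to illustrate. The sketch below operates on a generic weight matrix and zeroes the requested fraction of entries with the smallest absolute value; it is a generic example with made-up numbers, not the node-level pruning procedure used for KANs, and the function name is an assumption.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the given fraction of entries with the smallest magnitude."""
    n_prune = int(round(sparsity * weights.size))
    if n_prune == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    # Indices of the n_prune smallest-magnitude entries.
    prune_idx = np.argpartition(flat, n_prune - 1)[:n_prune]
    pruned = weights.copy().ravel()
    pruned[prune_idx] = 0.0
    return pruned.reshape(weights.shape)


weights = np.array([[0.8, -0.05, 0.3],
                    [0.01, -1.2, 0.02]])
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In practice pruning is usually followed by further training, so that the remaining weights can compensate for the removed ones.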