In Data Science, both Machine Learning (ML) and Neural Networks (NNs) are widely used across a plethora of applications, spanning from engineering to biology, from finance to medicine. At the same time, the analytical treatment of the underlying models often lacks a detailed description. To help close this gap, the theory of Optimal Transport (OT) has in recent years become increasingly popular within the Data Science arena, since it allows for efficient and scalable solutions to what can be broadly referred to as Artificial Intelligence tasks, particularly via approximate solvers based on linear programming. More precisely, recall that in both ML and statistics, minimizing an objective function over the space of probability measures is a fundamental problem. Along this line, training a Neural Network via Stochastic Gradient Descent has been shown to be equivalent to a gradient flow in Wasserstein space. Moreover, in many applications it is crucial to account for the causal structure of the data-generating process. With the aim of providing a unifying as well as efficient approach to the latter questions, we consider Adapted Optimal Transport (AOT), which allows causality constraints to be incorporated into the optimization of the Wasserstein distance. AOT is particularly useful when dealing with laws of stochastic processes, as it encodes an adapted (non-anticipative) constraint into the allocation of mass of the classical OT problem. Accordingly, we provide an in-depth exploration of the connections between supervised learning and AOT, offering theoretical insights into the benefits of causality constraints for developing more robust, scalable and accurate algorithms, while keeping their computational cost to a minimum.
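For concreteness, one standard way to state the two problems referenced above (the notation here is ours and kept minimal) is the following: the classical Kantorovich problem and its adapted variant read
\[
W_p(\mu,\nu)^p \;=\; \inf_{\pi \in \Pi(\mu,\nu)} \int d(x,y)^p \, \pi(\mathrm{d}x,\mathrm{d}y),
\qquad
\mathcal{AW}_p(\mu,\nu)^p \;=\; \inf_{\pi \in \Pi_{\mathrm{bc}}(\mu,\nu)} \int d(x,y)^p \, \pi(\mathrm{d}x,\mathrm{d}y),
\]
where \(\Pi(\mu,\nu)\) denotes the set of couplings of \(\mu\) and \(\nu\), while \(\Pi_{\mathrm{bc}}(\mu,\nu) \subseteq \Pi(\mu,\nu)\) retains only the bicausal (adapted) couplings: when \(\mu\) and \(\nu\) are laws of discrete-time processes \(x=(x_1,\dots,x_T)\) and \(y=(y_1,\dots,y_T)\), mass sent to \(y_{1:t}\) may depend on \(x\) only through \(x_{1:t}\), and symmetrically in the other direction.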
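To illustrate the linear-programming route mentioned above, the following is a minimal sketch (our own illustration, not a reference implementation) that computes the discrete OT value between two empirical measures via scipy; all variable names are illustrative and the marginals are assumed uniform.

```python
# Classical OT between two discrete measures, solved as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 5
x = rng.normal(size=n)               # support points of the source measure
y = rng.normal(size=m)               # support points of the target measure
p = np.full(n, 1.0 / n)              # uniform source weights
q = np.full(m, 1.0 / m)              # uniform target weights
C = (x[:, None] - y[None, :]) ** 2   # squared-distance cost matrix

# Marginal constraints on the coupling pi (flattened row-major):
# each row of pi sums to p[i], each column sums to q[j].
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # i-th row sum
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # j-th column sum
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
print("squared 2-Wasserstein distance:", res.fun)
```

For finitely supported process laws, the adapted variant adds further conditional-independence constraints on \(\pi\) that are again linear in its entries, so the problem remains a linear program; spelling these constraints out requires the filtration structure and is omitted in this sketch.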