The Inner Life of Machine Learning in Search of a Delicate Balance: Insights from SAS Forum 2017, Milan, Italy

Goran S. Milovanović, PhD
We are becoming an evidence-driven culture: it’s all about data these days, but big and small data are nevertheless accompanied by big dilemmas about their use and misuse in guiding our choices. Given the plenitude of statistical models and machine learning algorithms at our disposal nowadays, how do we decide which one should decide in our place in a digital environment of ever-growing complexity? And when a recommendation meant to guide our further information search or decision making is made automatically, will we follow it blindly, without reminding ourselves of the ultimate question of “why” – a question that has proved so central to our human existence? The quote famously (mis)attributed to Albert Einstein – “Every fool can know, the point is to understand” – could prove to be the best piece of advice for anyone connected to the Internet, whenever and for whatever reason.
Or to anything online, perhaps? Our evidence-driven culture will be driven mostly by autonomous algorithms doing the number crunching to provide recommendations, risk estimates, classifications, and inferences. We as a species could never compute all these data products with our natural cognitive systems – simply because we were never evolutionarily designed to integrate data on the scale that is characteristic of our contemporary digital environments. That is why our machines have to learn to learn, and faster than we do: their adaptation is indeed our adaptation. But then, what is left for us once they master recognizing the optimal structures found in the omnipresent data and inferring the best course of action for us?
Two answers come to mind. First, it is us who define what is useful. Second – and maybe more important, because we can differ widely in our positions on what is ‘useful’ – we need to understand, while the machines, in principle, do not. A sudden opportunity to attend the SAS Forum 2017 in Milan, Italy, together with the SAS Adriatic team, and as a representative of Data Science Serbia, inspired another round of my questioning of the present situation in Data Science and Analytics along these lines.
* * *

Cellular automaton: Rule 193 with random conditions. Wikimedia Commons, 21 September 2013, 09:17:51, Author: Sofeykov.
* * *
In his “The difference between Statistical Modeling and Machine Learning, as I see it” (2016), Mr Oliver Schabenberger, EVP and Chief Technology Officer at SAS, attempts a concise delineation between statistical modeling and machine learning, relying on the following proposal that differentiates between (a) statistical modeling, (b) classical machine learning, and (c) modern machine learning:
(a) The basic goal of Statistical Modeling is to answer the question, “Which probabilistic model could have generated the data I observed?”
(b) Classical machine learning is a data-driven effort, focused on algorithms for regression and classification, and motivated by pattern recognition. The underlying stochastic mechanism is often secondary and not of immediate interest […] the primary concern is to identify the algorithm or technique (or ensemble thereof) that performs the specific task.
(c) A machine learning system is truly a learning system if it is not programmed to perform a task, but is programmed to learn to perform the task. […] Like the classical variant, it is a data-driven exercise. Unlike the classical variant, modern machine learning does not rely on a rich set of algorithmic techniques. Almost all applications of this form of machine learning are based on deep neural networks.
I was granted an opportunity to hear some elaborations of this line of thinking from Mr Schabenberger directly during the SAS Forum 2017. In my interpretation – and this is necessary to stress, given the immense complexity of the topic under discussion – his words reassured me that the following trade-offs hold between (a) our understanding of what we do with data analytics, and (b) simply being able to develop more and more complex methods to accomplish progressively more complicated tasks:
  • In mathematical statistics (i.e. statistical modeling) as we know it, our understanding of the data generating process is guaranteed; making use of binary or multiple logistic regression, cumulative logit models, or even ordinary least-squares regression methods or various ANOVA experimental designs can, as we all know, bring about some problems of interpretation, but those problems are minuscule compared with our ability to understand the data generating process in general – simply because we know the assumptions under which such techniques work and have strict mathematical proofs that support our understanding. The drawback is evidently related to the question of whether the assumed data generating process captures the true complexity of the empirical reality that we need to model and predict. At some point, the realistic underlying stochastic processes are too complex to be even approximated by our simplifying assumptions, which are more often than not introduced only to make possible the mathematical proofs that some generating process we can conceptualize can be estimated by a model whose parameters we can understand.
  • In what Schabenberger recognizes as “classical machine learning”, we can still establish some sort of interpretation of the results; given a typical back-propagation network, a multi-layer perceptron, for example, one can still, at least in principle, build an understanding of its inner workings by tracking the changes in the connection weights of the hidden layers and then performing analyses (e.g. multivariate techniques like PCA) that reveal the patterns present in the model’s evolution towards an optimal state (i.e. one where the model predicts or classifies correctly according to some criteria). Such methods were already used to provide an interpretation of the dynamical evolution of recurrent neural networks: for example, Rogers and McClelland use multidimensional scaling in their book “Semantic Cognition: A Parallel Distributed Processing Approach” (2004, Chapter 3, p. 89) to trace the evolution of a conceptual system modeled by a recurrent back-prop network.
  • It seems that the problem of model interpretation – and consequently, the problem of our understanding of the analytical machinery on whose results we have to rely – emerges very seriously in relation to what Schabenberger recognizes as “modern machine learning”. The gap between (a) our ability to solve very complex problems (i.e. the ability of our machines to dig out patterns from very complex datasets), and (b) our understanding of how the solution was reached – which provides the explanatory foundations for the decisions that we are about to make – could prove to be a true abyss in this case. Even a brief peek into the results of Google’s recent revolution in machine translation uncovers the heroic struggle that the research team faced to analytically understand the inner workings of a complex learning system that has achieved a previously unimaginable performance in an extremely difficult task.
To say the least, the gap will not (and it should not) slow down the skyrocketing of modern machine learning, and it will probably spawn a “machine learning of machine learning” paradigm around our efforts to understand the machines that we have designed to serve our ends. However, this fascinating characteristic of the Fourth Industrial Revolution should serve as a friendly reminder that we are beginning to rely widely on automated systems whose inner workings we need to study scientifically and can only eventually hope to understand fully, in spite of the fact that they were designed by humans from beginning to end. And any use of learning systems whose final outputs can be described as highly complex emergent properties – such as complex neural networks, evolutionary computation, and similar – will pose a similar problem to us.
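The idea of tracking a network’s weights during training and then looking for structure in their evolution can be illustrated on a toy scale. What follows is only a minimal sketch under my own assumptions, not the analysis from the book cited above (which uses multidimensional scaling): a tiny multi-layer perceptron is trained by back-propagation on the XOR problem, its hidden-layer weights are recorded at regular intervals, and PCA (computed here via SVD) is applied to the recorded trajectory. The task, the network size, the learning rate, and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a toy task chosen only for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

n_hidden = 4                      # assumed size of the hidden layer
W1 = rng.normal(size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.2
n_epochs = 2000
snapshot_every = 50
snapshots, losses = [], []

for epoch in range(n_epochs):
    # Forward pass: tanh hidden layer, sigmoid output.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    out_safe = np.clip(out, 1e-12, 1.0 - 1e-12)
    loss = -np.mean(y * np.log(out_safe) + (1 - y) * np.log(1 - out_safe))

    # Record the input-to-hidden weights before they are updated.
    if epoch % snapshot_every == 0:
        snapshots.append(W1.ravel().copy())
        losses.append(loss)

    # Backward pass: gradients of the mean cross-entropy loss.
    d_out = (out - y) / len(X)
    grad_W2 = h.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)
    grad_W1 = X.T @ d_h
    grad_b1 = d_h.sum(axis=0)

    # Plain full-batch gradient descent.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

# PCA of the weight trajectory: rows are snapshots, columns are weights.
trajectory = np.array(snapshots)
centered = trajectory - trajectory.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)   # share of variance per component
scores = centered @ Vt[:2].T          # trajectory in the first two components

print("variance explained by first two components:", explained[:2])
print("first and last recorded loss:", losses[0], losses[-1])
```

Plotting the two columns of `scores` against each other would show the path that the hidden-layer weights take through a two-dimensional summary of weight space as training proceeds. For a network this small the picture is easy to read; the point of the third item above is that the same exercise becomes far harder to interpret as the models grow.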
* * *
The Data Analytics world will have to search for a delicate balance with respect to this dilemma. A typical Data Analyst (and maybe more important, a typical user of his or her recommendations too) is not at ease with buying an algorithm simply because it works, no matter how well motivated its development was. When I perform a logistic regression, assuming that the model assumptions hold, I can safely conclude that the exponential of a regression coefficient is the odds ratio associated with a one-unit change in the corresponding predictor, and I can rely confidently on my model because I can trace back exactly to an explanation of why that is so. I know how the model “reached the conclusion” that I have read out from its parameters, and thus I understand why some models work better than others. Also, it is sometimes possible to demonstrate how classic statistical modeling can solve even very “modern” – very complex – problems when applied over an elaborate description of the dataset under consideration. As an analyst and a scientist, and no matter how complex the future that awaits, I don’t think that the interpretation game is over, or that we should ever give up the effort to apply machine learning thoughtfully until we are able to fully understand its “inner life”. If that calls for the opening of a whole new scientific arena in Data Science, even if it is going to be so constrained by the complexity of the processes under study that it forever remains a field of empirical, experimental study of artificial learning systems – so be it. The challenge will only get harder as more and more advanced learning machinery becomes available, but I would avoid at any cost the attitude of just letting it go.
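The logistic regression example mentioned above can be made concrete. The following is a minimal sketch under my own assumptions (simulated data, one predictor, illustrative coefficient values, and a hand-written Newton-Raphson fit rather than any particular statistical package): it fits the model and then checks that the exponential of the slope equals the ratio of the odds at x + 1 to the odds at x, whatever x is.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def odds(x_value, beta):
    """Odds p / (1 - p) that the fitted model assigns at a given predictor value."""
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x_value)))
    return p / (1.0 - p)

# Simulated data; the "true" coefficients are arbitrary illustrative values.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
true_beta = np.array([-0.5, 0.8])
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)

beta_hat = fit_logistic(X, y)
odds_ratio = np.exp(beta_hat[1])

# The ratio of odds for a one-unit increase in x is the same at any starting point.
ratio_from_0 = odds(1.0, beta_hat) / odds(0.0, beta_hat)
ratio_from_2 = odds(3.0, beta_hat) / odds(2.0, beta_hat)

print("estimated slope:", beta_hat[1])
print("exp(slope):", odds_ratio)
print("odds ratio from x=0 to x=1:", ratio_from_0)
print("odds ratio from x=2 to x=3:", ratio_from_2)
```

The equality of the three printed ratios follows directly from the form of the model, which is exactly the kind of traceable explanation that is so hard to obtain for a deep network.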