## A Few Useful Things to Know about Machine Learning

Lately I have been lamenting that my knowledge of machine learning is not very useful.  Luckily Pedro Domingos has recently written a paper, “A Few Useful Things to Know about Machine Learning” (publication link, author’s hosted version), to rectify my issue.  Since he likely did not write the paper just for my benefit, I would encourage anyone interested in machine learning to give it a quick look.  It would be an especially good read for a student taking a class in machine learning, even if they can’t follow every point.

The paper discusses 12 main points, each given several paragraphs of discussion.  I could list every point and give a brief summary, but that would mostly reproduce the paper (eliminating your need to read the well-written original).  Instead I will mention a couple of points I found especially interesting, just enough to whet your appetite.

Data is never enough.  When an hour of video is being uploaded to YouTube every second, limited training data seems like a spectre of the past.  Domingos explains that if the function you are learning is complex, you will never see even a fraction of the total possible space in your training data.  Consider his example of a function with 100 binary variables, which gives $2^{100}$ possible inputs.  Even a billion training examples (approximately $2^{30}$) is just a speck in that space.  However, Domingos encourages the researcher not to despair.  While blind application of machine learning algorithms may never provide a solution, that just means there is room for the introduction of knowledge.  Perhaps the data actually lies on a low-dimensional manifold in that large feature space.  Maybe knowledge of the high-level structure of the data is needed.  This is related to another of his points: feature engineering is of paramount importance.
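To put rough numbers on the counting argument above, here is a small Python sketch.  The variable names are my own, and the billion-example figure is simply the one used in this post.  It assumes the best case where every training example is distinct.

```python
import math

n_variables = 100   # binary inputs, as in the example above
n_examples = 10**9  # "a billion training examples"

input_space = 2**n_variables              # number of distinct possible inputs
log2_examples = math.log2(n_examples)     # how many bits a billion examples amounts to
fraction_seen = n_examples / input_space  # best case: no duplicate examples

print(f"log2(examples) = {log2_examples:.1f}")
print(f"fraction of input space covered = {fraction_seen:.1e}")
```

The fraction that comes out is so small that adding a few more orders of magnitude of data would not change the picture, which is why prior knowledge has to do the rest of the work.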

The issue of overfitting is brought up several times throughout the paper.  He does not just discuss the issue in the typical way, but points out several less obvious cases.  Cross-validation is commonly used to avoid overfitting and to maintain a separation of training and testing data, but it too can lead to overfitting if it is used to make too many choices.  There is also a danger in turning a few knobs after the machine learning phase based on testing results.  Even simple classifiers are not immune to overfitting.  He gives the example of $\mathrm{sign}(\sin(\alpha x))$ where, by a proper setting of $\alpha$, you can obtain a classifier that separates arbitrarily large, arbitrarily labeled sets of points on a line; even a classifier with a single parameter can overfit.
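To convince myself of the single-parameter claim, here is a minimal sketch of a construction I know from textbook treatments of VC dimension.  I am not claiming it is the one Domingos has in mind.  It assumes the points are $x_i = 10^{-i}$ and the labels are $\pm 1$; the function names are hypothetical.  Setting $\alpha = \pi\,(1 + \sum_{i : y_i = -1} 10^{i})$ makes $\mathrm{sign}(\sin(\alpha x_i))$ reproduce every label.

```python
import math
from itertools import product


def shattering_alpha(labels):
    """Pick alpha so that sign(sin(alpha * 10**-i)) equals labels[i-1].

    Assumes labels are +1 or -1 and the i-th point is x_i = 10**-i.
    """
    return math.pi * (1 + sum(10**i for i, y in enumerate(labels, start=1) if y == -1))


def predict(alpha, x):
    """The one-parameter classifier sign(sin(alpha * x))."""
    return 1 if math.sin(alpha * x) > 0 else -1


def fits_every_labeling(n_points):
    """Check that all 2**n_points labelings of x_1..x_n can be reproduced."""
    points = [10.0**-i for i in range(1, n_points + 1)]
    for labels in product([1, -1], repeat=n_points):
        alpha = shattering_alpha(labels)
        if [predict(alpha, x) for x in points] != list(labels):
            return False
    return True


# Six points keeps alpha small enough for ordinary floating point to be safe.
print(fits_every_labeling(6))
```

The catch is that $\alpha$ grows exponentially with the number of points, so this "simple" classifier is memorizing the labels in the digits of its one parameter.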

Many other interesting points are scattered throughout the paper.  As I was making my way through it, I connected many of the points to my misgivings about the black box approach some take to machine learning.  One recent technique that has seen great success is the deep neural network.  When Domingos touches on deep, layered representations, he describes learning them as one of the major frontiers of machine learning.  He makes the point that a deep structure can model the same function with far fewer parameters than a single-layer structure.  I believe he envisions these deep architectures eventually modeling the underlying structure of the data (perhaps a structure that can actually be understood by a human), as opposed to the black box approach that is mostly used now.

The one drawback to the paper is the references.  Often he makes a point that I want to follow up on and read more about.  Unfortunately, many of the references are to books.  While Vapnik’s The Nature of Statistical Learning Theory is a great text, I am not going to hunt through the book in order to find the few relevant lines.  I acknowledge that this is a fairly minor gripe, but it was on my mind as I read the paper.