The NFL Combine in Two Dimensions

Exploring data with Principal Component Analysis

Andrew Doss
The Inner Join

--

Principal component analysis enables us to visualize eight dimensions of NFL combine athleticism in a two-dimensional figure. This visualization packs a lot of information into a small space, and this article will teach you how to interpret this graphic and apply the same technique in your own data analysis.

Cool, but…

We recently published an analysis of pandemic unemployment and drinking where we used a data analysis technique called Principal Component Analysis (PCA) to visualize several variables in one plot. We received feedback from readers who were interested in learning more about PCA but had found it hard to interpret and difficult to apply to real datasets.

PCA is known for being cool to look at, but not always useful due to a lack of interpretability. We'll walk through applying PCA to NFL Combine data, explaining how to use it, the factors we considered, and how to build intuition for a useful analysis. We will also provide links to the data and the Python implementations from the article so you can immediately apply PCA in your own work.

A particular perspective

There are multiple applications of principal components, including feature engineering, compressing data with minimal information loss, and visualizing high-dimensional data. Here, we focus on the third application and use principal components as a means of better understanding structure in a dataset, particularly how variables tend to change together.

PCA can be studied from different perspectives, including linear algebra, mathematical statistics, and computational methods. Here, we take the less formal perspective of exploratory data analysis. This application and perspective put us in the setting of unsupervised learning, with the general objective of summarizing a dataset in a way that reveals new insights.

Bivariate beginnings

We’ll start by applying PCA to a small, bivariate subset of a dataset from the 2021 NFL Scouting Combine and work our way up to the eight variables shown in our introduction.

The NFL Combine is an annual event where elite college athletes complete a series of tests in preparation for the draft. The measurements taken at the combine yield a dataset describing the variability of physical attributes and performance within this group. We’ll start building our intuition for PCA using just two variables, height and weight.

PCA assumes that variance is what is interesting about a dataset. When variables in our dataset are correlated, some of their variance is redundant. We can summarize this covariance with a smaller number of new variables, called principal components. Principal components are weighted averages of the original variables that possess particular properties.

In the plot below, it appears that athlete height and weight are correlated — when height increases, so does weight. If you had to choose one, and only one, variable to summarize the variability in these players, what would it be?

You could pick one of the existing variables (height or weight). However, the correlation between height and weight indicates that we can construct a more general variable related to both. Let’s call this more general variable player “size”, and define it as an average of height and weight. You might not know how to compute that average, but if you had these athletes in a lineup, you would likely still have some intuition for sorting them by “size.”

In essence, PCA enables us to measure players by the single variable “size.” Specifically, PCA finds the optimal weighted average of height and weight for summarizing the variability of both height and weight with a single value. That single dimension is the first principal component of this dataset.

Note: from here on, we will display all variables in standardized form, a nuance that we explain further in the appendix for practitioners.
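As a minimal sketch of this first step, the snippet below standardizes the two variables and fits a two-variable PCA with scikit-learn. The file name and column names are placeholders and may differ from the linked dataset.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder file and column names; adjust to match the linked dataset.
combine = pd.read_csv("nfl_combine_2021.csv")

X = combine[["height", "weight"]]
X_std = StandardScaler().fit_transform(X)   # center and scale each variable

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # athletes expressed in principal component space

# The first row of components_ holds the weights ("loadings") defining the
# first principal component, our candidate "size" variable.
print(pca.components_[0])
print(pca.explained_variance_ratio_)
```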

In what sense is the first principal component optimal? The first principal component aligns with the direction of greatest variance in the dataset. It is chosen to minimize the sum of squared distances from all points to the component, which is equivalent to maximizing the variance of the points when projected onto it. Projection simply means replacing each data point with the closest position along the principal component line.

By forcing our correlated dataset into one dimension, we have uncovered a more general, latent variable “size” that was not in our original dataset. “Size” does a better job describing how athletes varied in both height and weight than either height or weight alone. Put another way, if you had to estimate each athlete’s height and weight but only had access to one measurement, you would do better with “size” than either height or weight alone because size effectively approximates both height and weight.
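Continuing the sketch above, we can check this claim directly: keeping only each athlete's "size" score and mapping it back through the first component's loadings recovers a close approximation of both standardized variables.

```python
import numpy as np

# Keep only the first principal component score ("size") for each athlete,
# then map it back into standardized height/weight space.
size_scores = scores[:, [0]]
approx = size_scores @ pca.components_[[0], :]

# The reconstruction error is small because height and weight are correlated.
rmse = np.sqrt(np.mean((X_std - approx) ** 2))
print(f"RMSE of the one-component reconstruction: {rmse:.3f}")
```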

Our discovery of the latent variable “size” is an example of finding structure that summarizes a dataset. We will return to this idea later with the full dataset.

Reframing the problem

We can also add a second principal component. The second principal component is orthogonal to the first (a constraint imposed on principal components) and captures the remaining variation in the dataset (because two independent axes fully span a plane). The second principal component explains less variance than the first. We can add as many principal components as we have variables in our dataset, and each subsequent principal component explains less variance than every principal component before it.
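These constraints are easy to verify on the fit from the earlier sketch: the two loading vectors are orthogonal unit vectors, and the share of explained variance decreases from the first component to the second.

```python
import numpy as np

pc1, pc2 = pca.components_                        # loading vectors for PC1 and PC2
print(np.dot(pc1, pc2))                           # approximately 0: orthogonal
print(np.linalg.norm(pc1), np.linalg.norm(pc2))   # both 1: unit length
print(pca.explained_variance_ratio_)              # decreasing share of variance
```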

As with the first principal component, there is an intelligible explanation for the second component; it appears to represent how “stocky” a player is, relative to their overall “size.”

So far, we have been visualizing our data points using the axes of height and weight. We could also represent our data points using the first two principal components as our axes. We can transform the dataset from our original axes to “principal component” space with a clockwise rotation.
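In code, this change of axes is just a matrix multiplication by the transposed loading matrix, which matches scikit-learn's transform method on the centered data (continuing the sketch above).

```python
import numpy as np

# Rotate the standardized data onto the principal component axes.
rotated = X_std @ pca.components_.T
print(np.allclose(rotated, pca.transform(X_std)))   # True, up to floating point error
```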

This perspective of two-dimensional “PCA space” becomes more useful as we add further variables to our dataset.

The next dimension

We now add a third dimension from the combine dataset, an athlete’s time in the 40-yard dash. We can no longer plot all of our variables in two dimensions and instead use a matrix of pairwise scatterplots.
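A pairwise view like this can be sketched with pandas; the 40-yard dash column name below is an assumption and may differ in the linked dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

cols = ["height", "weight", "forty_yard"]   # hypothetical column names
pd.plotting.scatter_matrix(combine[cols], figsize=(6, 6), diagonal="hist")
plt.show()
```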

There are positive correlations between all pairs of the three variables, which again indicates interesting structure for PCA. We can proceed as before, and see how PCA summarizes these three variables with only two dimensions. We can compute a third principal component now, but will stop at the first two because we are interested in using PCA to create two-dimensional visualizations of datasets.

To further our intuition, we’ll briefly visualize the three-dimensional dataset and first two principal components.

The principal components are still getting as close to the now-three-dimensional dataset as possible, and they would maximize the spread of the dataset if it were projected onto the plane spanned by the principal components. The multicollinearity is also evident in the shape of the dataset.

We now return to the two-dimensional PCA space with our three-dimensional dataset. We can no longer capture all the variability in two dimensions, but we get a very good approximation: the first two principal components summarize over 90% of the dataset's variance.

In the right panel of the plot, we summarize the loadings of the original variables in the principal components. The squares of these loadings sum to one for each component, and the loadings can be both positive and negative. These loadings quantify how the original variables are averaged to construct the principal components and also provide the direction and relative magnitude of the vectors in the plot.
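Under the same assumptions as the earlier sketches, the loadings and the variance captured by the first two components can be read directly from a fitted PCA object.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X3_std = StandardScaler().fit_transform(combine[cols])
pca3 = PCA(n_components=2).fit(X3_std)

print(pca3.explained_variance_ratio_.sum())   # share of variance kept by two components
print(pca3.components_)                       # loadings: one row per component
print((pca3.components_ ** 2).sum(axis=1))    # each row's squared loadings sum to 1
```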

Both the vectors in the left panel and the loadings in the right panel indicate that the first principal component still roughly represents “size” while the second still roughly represents “stockiness.”

So, what’s the point? When exploring a three-dimensional dataset with correlated variables, we can summarize much of the dataset’s variance with a two-dimensional plot in “PCA space.” We can also project the original variables into the PCA space and examine principal component loadings to examine structure in the dataset’s correlated variables and identify new latent variables.

In this case, PCA enables us to reduce most of the information from a hard-to-visualize three-dimensional space into a two-dimensional representation of height, weight, and speed.

Seeing in eight dimensions

We are now ready to revisit the eight-dimensional dataset from the beginning of the article. We can again plot all pairwise scatterplots, but already have 28 pairs to compare and would reach 45 pairs if we added just two further variables.

We are running into a scalability issue with pairwise scatterplots. The number of pairwise plots scales quadratically as we add more variables, and each individual panel captures less and less of the total variance in the dataset.

Fortunately, we can return to the two-dimensional PCA space view with the eight-dimensional dataset. Incredibly, we can represent 80% of the variance across all eight dimensions with only the first two principal components.

The similarly directed vectors indicate clusters of correlated variables. We have the familiar “size” direction which has now picked up bench press repetitions as well. At the left, we have three agility drills, and towards the upper left we have two jumping tests.

Note: In contrast to our three-dimensional example, we have negated the timed measurements so that the arrows and positive loadings indicate the direction of superior performance.
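A sketch of this preprocessing step is below. The eight column names, and the particular drills treated as timed events, are assumptions; adjust them to the linked dataset.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical column names for the eight combine measurements.
eight = ["height", "weight", "forty_yard", "bench", "vertical",
         "broad_jump", "three_cone", "shuttle"]
timed = ["forty_yard", "three_cone", "shuttle"]   # lower is better for these drills

X8 = combine[eight].copy()
X8[timed] = -X8[timed]                            # flip so larger always means better
X8_std = StandardScaler().fit_transform(X8)

pca8 = PCA(n_components=2).fit(X8_std)
print(pca8.explained_variance_ratio_.sum())       # roughly 0.8 on the article's data
```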

Coloring by position, we can see many expected patterns, including the speed and explosiveness of receivers and defensive backs, the dominant size of offensive linemen, the compactness of offensive backs, and the all-around athleticism of tight ends. Further, we have highlighted some first-round draft picks (solid circles), who tend to be on the athletic extremes of their position cohorts.

We can also review the loadings of the first two principal components to further clarify the interpretations of the horizontal and vertical axes. In contrast to earlier, the first principal component now largely focuses on the tradeoff between speed and weight while the second focuses on height and jumping ability.

We’ll stop at eight dimensions for this dataset, but PCA could keep going into much higher dimensions. We lose some information by squeezing our dataset down to two dimensions with PCA, but we often gain new clarity and understanding in return.

The upshot

We’ve shared an intuitive explanation of PCA and shown that it is a useful technique for visualizing datasets with several dimensions in a single two-dimensional space. In summary:

  • PCA provides a means to summarize the variance of a dataset with correlated features.
  • This summary can be used to represent a higher-dimensional dataset in two dimensions for exploratory data analysis. Pairwise scatterplots begin to scale poorly with several variables and can become hard to interpret holistically due to the splitting of information into many panels.
  • Forcing a dataset with correlation into a lower dimensional PCA space can reveal interesting structure and latent variables that organize the original dataset variables.

Interested in future Inner Join publications and related bit.io data content? Please consider subscribing to our weekly newsletter.

Your turn

The NFL Scouting Combine dataset is available online, as is a Deepnote notebook with example Python implementations of PCA and the related visualizations from this article.

If you are interested in learning more about PCA, we highly recommend Introduction to Statistical Learning (beginner) or Elements of Statistical Learning (intermediate). This article is heavily influenced by the former.

Are you exploring interesting questions with data? Tell us about it at innerjoin@bit.io.

Appendix

There are nuances to be aware of when applying PCA for exploratory data analysis. Here are just a few tips for practitioners.

Centering and scaling your data

In all of our examples, we worked with centered and scaled variables. Centering (subtracting the mean from each variable) is important because we aim to study variance; without centering, we would also be studying position relative to an often arbitrary origin. Scaling (dividing each variable by its standard deviation) is important when our variables are measured on different scales. Without scaling, PCA will optimize for the variables with the largest scales. Consider that we could have measured player height in inches or in feet. If we had measured in feet and not scaled our data, PCA would consider height far less important than if we had measured in inches.
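The sketch below illustrates the units point with synthetic data (the numbers are made up): without scaling, the same heights expressed in feet instead of inches receive a much smaller weight in the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height_in = rng.normal(74, 3, 300)                             # synthetic heights, inches
weight_lb = 240 + 6 * (height_in - 74) + rng.normal(0, 15, 300)

for height in (height_in, height_in / 12):                     # inches, then feet
    X = np.column_stack([height, weight_lb])
    pca = PCA(n_components=1).fit(X)                           # PCA centers, but does not scale
    print(pca.components_[0])                                  # height's loading shrinks in feet
```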

However, there are exceptions to this preprocessing step. If all variables are measurements on the same, meaningful scale, it may be preferable to not scale the variables. One example is applying PCA to a distance matrix.

In addition, PCA captures variance, so it is influenced by outliers. It may be worth clipping or omitting outliers in some cases, depending on the intent of the analysis.

Variable selection matters

PCA describes a particular dataset, and is dependent on the choice of features included. For example, if we added further variables correlated with size — perhaps bicep and waist circumference — our top principal components would increasingly focus on features related to size and appear to explain “more” variance. The inverse applies to removing variables. PCA can produce insights about a dataset, but PCA cannot tell you if you’ve selected the right dataset to analyze.

How many principal components?

Sometimes, we are interested in knowing how effectively PCA is summarizing a dataset with a particular number of principal components. We can investigate this by reviewing the amount of variance explained per principal component. This is often reviewed in an “elbow” plot showing cumulative variance explained after each principal component, with the aim of finding a kink where the variance explanation begins rapidly diminishing.

Unfortunately, elbow plots are often smooth in practice. It can sometimes be helpful to plot the variance explanation curve against another curve for the same dataset but with the columns independently shuffled to remove all correlations in the dataset. Then, the distance between the two curves gives a sense of how surprising the result is, a familiar principle for anyone who has studied hypothesis testing.
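A sketch of this comparison, continuing from the eight-variable fit above (X8_std), might look like the following.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Destroy the correlations by shuffling each column independently.
rng = np.random.default_rng(0)
X_shuffled = np.column_stack([rng.permutation(col) for col in X8_std.T])

real = np.cumsum(PCA().fit(X8_std).explained_variance_ratio_)
null = np.cumsum(PCA().fit(X_shuffled).explained_variance_ratio_)

ks = np.arange(1, len(real) + 1)
plt.plot(ks, real, marker="o", label="original")
plt.plot(ks, null, marker="o", label="columns shuffled")
plt.xlabel("number of principal components")
plt.ylabel("cumulative variance explained")
plt.legend()
plt.show()
```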

When considering heuristics for evaluating structure found through unsupervised learning, it is often a good idea to check what the heuristic looks like in extreme cases where that structure is certainly present or certainly absent as a means to normalize your interpretations. An analogous example is the gap statistic for clustering.

Using principal components as model features

As a separate application from exploratory data analysis, principal components can also be used as model features. This is attractive because we can reduce the dimensionality of our feature set while retaining most of the information that they contain.
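A minimal sketch of this pattern is below; the ridge regression model and the target y are placeholders for illustration, not part of the article's analysis.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Reduce the feature set to two principal components before fitting a model.
model = make_pipeline(StandardScaler(), PCA(n_components=2), Ridge())
# model.fit(X_features, y)    # y is whatever downstream quantity you want to predict
```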

However, this can also be problematic because PCA is blind to the end target variable for the downstream model. PCA simply maximizes variance in the lower dimensional space — PCA does not necessarily favor the portion of the variance that is related to the target variable.

There are some problems where PCA is useful in this application, but you will often do much better avoiding overfitting through thoughtful feature selection, managing model complexity with model selection and regularization, and applying other techniques such as bagging, randomization, and ensembling.
