Principal Component Analysis
Some tools punch above their weight and, with some beautiful math, produce powerful insights into incredibly large and complex datasets. One of these elegant tools that we use here at Simbex is Principal Component Analysis (PCA). It is one of many algorithms that whittle down a large number of variables (we call them features), some of which are correlated and therefore redundant, into a smaller set of features in which the unhelpful, redundant, and noisy bits of the data are removed and only the important, explanatory parts remain. For the more technically inclined, it does this by using the underlying inter-relationships between the input features (the correlation) to find new axes, the principal components, that capture as much of the variance in the data as possible. While many of these “dimensionality reduction” algorithms exist, PCA is exceptionally well known for its ease of use and impactful results. PCA outputs are often used to identify which input features matter most and least in producing a desired result.
To explain further, let’s take an easy example: a company is designing a product to increase people’s happiness and well-being. The researchers in this company are trying to identify which specific inputs are correlated with happiness so they can prioritize their energy and maximize the predictive power of their new happiness product. They are meticulous, so the researchers collect data from 10 people for one month in the following fields (features):
- How happy they were each day
- Number of hours of sleep they obtained the night before
- The amount of time spent with friends
- Number of outfit changes they did that day
- Number of hours of meetings they attended
In this hypothetical example, most of us would guess that a high happiness score is strongly determined by 8-9 hours of sleep a night, as little time spent in meetings as possible, and as much time spent with loved ones and friends as possible, but is unlikely to be determined by the number of outfit changes that day.
PCA does the same thing that we have just done, but in a more quantified way. It takes a dataset with our features (hours of meetings, number of outfit changes, etc.) and assigns importance values to these features. Strictly speaking, PCA works only from how the features vary together across the dataset; relating the resulting values to the strength of a person’s happiness, as we do here, is an interpretation layered on top.
Using PCA to assign importance values to the fields in the above example, we can obtain the following first principal component (fabricated for instructive purposes):
Feature | Hours of Sleep | Time in Meetings | Time with Friends | Number of Outfit Changes |
---|---|---|---|---|
Importance (for high happiness scores) | 1.3 | -0.5 | 0.6 | 0.003 |
A positive number means the feature is directly correlated with happiness, a negative number means the feature is anti-correlated, and a value close to 0 means the feature is uncorrelated with happiness. The magnitude of each value indicates the strength of its importance: the larger the magnitude, the stronger the effect in pushing happiness scores one way or the other. We can see that prioritizing sleep plays a larger role in happiness than spending time with friends, although the story is often more complicated than what the first principal component shows, as we shall see next.
In reality, PCA doesn’t produce just one principal component. It produces as many principal components as there are features in the original dataset. This means that our example above will produce four principal components, since there are four features. This can make the analysis of PCA output convoluted and the results easy to misinterpret.
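To make the "one component per feature" point concrete, here is a minimal sketch using NumPy. Everything in it is an assumption made for illustration: the data are synthetic, the column names and the `pca` helper are hypothetical, and computing PCA through a singular value decomposition of standardized data is one common approach rather than the specific method used in the example above.

```python
import numpy as np

# Synthetic stand-in for the example: 10 people x 30 days = 300 rows.
# Columns: sleep hours, meeting hours, hours with friends, outfit changes.
# All numbers are made up purely for illustration.
rng = np.random.default_rng(0)
n_rows = 300
sleep = rng.normal(7.5, 1.0, n_rows)
meetings = rng.normal(3.0, 1.5, n_rows)
friends = 0.5 * sleep + rng.normal(0.0, 1.0, n_rows)  # invented correlation
outfits = rng.poisson(1.0, n_rows).astype(float)
X = np.column_stack([sleep, meetings, friends, outfits])


def pca(data):
    """Return (components, explained_variance_ratio) for a data matrix."""
    # Standardize each column so no feature dominates because of its units.
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    # Rows of vt are the principal components, ordered by variance captured.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained_variance = s**2 / (len(z) - 1)
    return vt, explained_variance / explained_variance.sum()


components, ratio = pca(X)
print(components.shape)    # (4, 4): four features give four components
print(np.round(ratio, 3))  # share of total variance captured by each one
```

Each row of `components` holds one weight per feature, which is the kind of row shown in the tables in this article. The `ratio` values sum to 1 and indicate how much of the overall variation each component accounts for.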
Our minds can easily manage this small list of features, but it becomes more difficult as the number of inputs increases to 20, 50, or 100. The analysis of human biometric signals is one example where many factors can contribute to the final picture. Imagine trying to identify what features of walking gait contribute to a healthy stride pattern. Is it step length, asymmetry, or peak swing velocity? Or imagine trying to identify what physiological measurements can be used to predict the onset of a stroke. Heart rate is an obvious inclusion, but what about sweat, levels of O2, or number of hours of sleep the night before? This is where PCA can make a huge impact, because it takes a quantified dataset and determines what combination of features best describes the outcome. This is most helpful when no single feature is solely responsible for the outcome and a combination of features is needed to describe it. In our original example above, if you got a solid night of sleep, but you spent 8 hours in meetings and spoke to no one you loved that day, you would probably not be as happy as if you were a little sleep deprived but had only a few hours of meetings and had a great dinner party with your closest family. When PCA is set up correctly, these complicated relationships become much easier to identify.
The power of PCA isn’t merely what the algorithm generates. The real magic occurs when a human can take those quantifiable results and apply them to the physical world. PCA will print out a series of principal components. Each of those principal components is a linear combination of the features put into PCA, with a different weight (importance) assigned to each feature; taken together, the components add up to the original signal. In the unsimplified version of the original example, there would be four rows with different importance levels identified for each feature in that row (for simplicity, we are showing only two of the four possible principal components, again fabricated for instructive purposes):
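The idea that the components "add up to the original signal" can be checked numerically. The sketch below is an illustration under stated assumptions: the data are random numbers with one invented correlation, and the variable names are hypothetical. It projects the data onto the components and then rebuilds it, first from all four components and then from only the first two.

```python
import numpy as np

# Random illustrative data: 200 rows, 4 features, one invented correlation.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 2] += 0.8 * X[:, 0]

# Center the data and find the principal components (rows of vt).
mean = X.mean(axis=0)
centered = X - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Scores: how much of each component is present in each row of data.
scores = centered @ vt.T

# Rebuild the data as a weighted sum of components.
X_full = scores @ vt + mean              # all four components
X_top2 = scores[:, :2] @ vt[:2] + mean   # only the first two

err_full = np.abs(X - X_full).max()
err_top2 = np.abs(X - X_top2).max()
print(err_full)  # essentially zero: all components recover the data
print(err_top2)  # larger: dropping components discards some detail
```

Using all components reproduces the data up to floating-point error, while keeping only some of them gives an approximation. That trade-off is what makes PCA usable for dimensionality reduction.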
Feature | Hours of Sleep | Time in Meetings | Time with Friends | Number of Outfit Changes |
---|---|---|---|---|
Principal Component 1 | 1.3 | -0.5 | 0.6 | 0.003 |
Principal Component 2 | 0.5 | -0.4 | 1.2 | 0.2 |
The first principal component (the first row) is the same as above. The second principal component, however, adds an interesting twist. In mathematical terms, it is saying that we can also obtain happiness with at least some decent sleep (+0.5, but notice the smaller importance than the 1.3 for sleep in Principal Component 1), few meetings (-0.4), and a substantial amount of time with friends (+1.2), but, crucially, some outfit changes (+0.2). What a PCA connoisseur would then do is identify the difference between the data points in principal component 1 and principal component 2 and determine what kinds of physical situations can describe this result. In effect, this means looking for the causation behind the correlation. A potential conclusion in this case is that the second principal component describes someone social and outgoing who enjoys going to dinner parties (the outfit changes) and is young (hence why sleep is less important). This can be determined by looking into the features themselves. How many outfit changes? How much sleep at night? How different are the data points described using principal component 1 compared to 2? We can then use these results to identify populations that would be best suited for our product. Maybe the kind of person identified in principal component 2 is part of an untapped audience that would benefit if we can tailor our happiness product to suit their needs.
Obtaining results with PCA isn’t hard. Anyone with a computer, a spreadsheet, and a few Python packages can get a PCA result within a few minutes. However, that is one of PCA’s downsides. It is difficult to tell whether the results PCA provides are meaningful, and correct use of PCA requires a certain level of expertise. As with most tools, using the tool and using the tool to create something impactful are very different things. While anyone can technically run a sewing machine, not just anyone can make functional and stylish clothing with one. Running PCA correctly and setting up the correct assumptions (through scaling, pruning, and identifying bad data, among others) is what makes PCA powerful in identifying relationships.
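As one illustration of why setup choices such as scaling matter, the sketch below runs PCA twice on the same synthetic data: once on the raw values and once after standardizing each feature. The data, the column names, and the `first_component` helper are assumptions made for this example only. Meeting time is deliberately recorded in minutes so that its numeric spread dwarfs the other two features.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(2)
n = 300
sleep_hours = rng.normal(7.5, 1.0, n)
friend_hours = 0.6 * sleep_hours + rng.normal(0.0, 0.8, n)  # linked to sleep
meeting_minutes = rng.normal(180.0, 90.0, n)  # minutes, so a far larger spread
X = np.column_stack([sleep_hours, friend_hours, meeting_minutes])


def first_component(data):
    """Return the first principal component of a data matrix."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Without scaling, the feature with the biggest numbers dominates.
raw_pc1 = first_component(X)

# After standardizing, each feature has unit variance, so the real
# relationship between sleep and time with friends can surface instead.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc1 = first_component(X_scaled)

print(np.round(np.abs(raw_pc1), 3))     # weight sits almost entirely on meetings
print(np.round(np.abs(scaled_pc1), 3))  # weight shifts to sleep and friends
```

On the raw data the first component mostly reflects the choice of units, not any relationship between features. Standardizing first is a common remedy, though whether it is appropriate depends on the dataset and the question being asked.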
At Simbex, we have 20 years of experience with PCA. Our algorithm development team provides expert consulting services utilizing an agile approach to ensure our clients’ product requirements are met. If you have a problem that could benefit from PCA or other advanced analytical techniques, reach out to our friendly business development team and we’ll explore together how to transform your innovation into the powerful medical device you have envisioned.
About the Author:
Aroob Adbelhamid used PCA for 6 years during her PhD studies and uses these skills to inform the analyses she performs for Simbex clients in her role as an Algorithm and Data Engineer. She is well-versed in using big, messy, confusing data to identify and answer fundamental questions and solve problems. Aroob has over 8 years of experience in machine learning, big data, and communicating tough topics in an easy-to-understand way. Aroob has a PhD in Chemistry from the University of Colorado, Boulder, and a BA in Natural Sciences: Chemistry from Fresno State.