Directed Random Projection
PCA finds the linear projections which maximize the variance of the data. It requires first finding the covariance matrix, and then finding the eigenvectors of this. For n rows and d columns, finding the covariance matrix is O(n*d^2), and finding the eigenvectors is then O(d^3). This is going to be slow for any large data set (either many rows or many columns).
Random projection on the other hand simply calculates each column in the projection by multiplying the data by a random vector of length d. It can be shown that the number of such columns required is proportional to log(n); anything beyond that does not add any value to the data. If the number of columns required is p, then the complexity of random projection is O(p*n*d). The downside of this approach is that it doesn’t take any advantage of the structure of the data – the first projections do not account for any more of the variance than later projections. PCA, on the other hand, finds the projections which account for the greatest amount of variance.
In this article I introduce a new method, Directed Random Projection, which provides the speed of random projection, whilst also taking advantage of the structure of the data to find the maximum variability with the fewest columns. I will refer to the older style of random projection as ‘naive’ random projection, in order to distinguish it from the approach outlined here.
The easiest way to explain directed random projection is with a small example. Let’s start with a small data set of 10 points, in 2 dimensions. Here’s the data set, with some annotations that I’ll explain in a minute:
As you can see in this sample data set, the variables are quite highly correlated, so we should be able to find a single linear combination which contains most of the variability of two variables. Proceed as follows:
- Select a random point (marked ‘A’ above)
- Find the point furthest from A (marked ‘B’ – x=0.86, y=3.49)
- Find the point furthest from B (marked ‘C’- x=0.24, y=1.15)
- Construct a normal vector v from in the direction from point B to point C (v1 = (0.86-0.24), (3.49-1.15) = 0.62, 2.35; v = v1/sqrt(0.62^2 + 2.35^2) = 0.26,0.97)
v is a vector which connects 2 extreme points in the data. As long as basic pre-processing has been done (removal of outliers, univariate box-cox normalizing transforms, and (0,1) standardization), this vector should represent the strongest ‘direction’ in the data. Therefore, we take this as our first component.
Our next step is to remove this component from the data. To do so, we simply take the dot-product of the data with v in order to find the distance of each point along v, and then subtract this from each data point. For instance, point C has a dot-product with v of 0.24*0.26+1.15*0.97=1.17, so C’ (i.e. the point after removing the 1st component) is (0.24-1.17*0.26,1.15-1.17*0.97). After applying this to each point we get the following:
The dotted line here is the line of best fit. Note the following:
- It is a straight line, since we have removed one dimension from the data
- It goes through the origin, since we subtracted a projection with a normal vector from the data
- The slope of the projection vector was 0.97/0.26=3.78. After removing this component, the remaining data is on a line perpendicular to this, with a slope of 1/3.78=-0.26.
Although this example has been shown in just 2 dimensions, it is clear to see that it is equally applicable in any number of dimensions, since the concepts of Euclidian distance (to find the extreme data points) and inner-product (to do each projection) are just as valid in higher dimensional spaces.
The process can be repeated as many times as necessary, until enough of the variability of the data has been accounted for, for the purpose of the underlying analysis. The variability can simply be calculated by finding the variance of the data after each projection is taken, and comparing that to the total variance of the initial data set.
Finding each of the 2 extreme data points requires O(nd) time to calculate the distance of each point. Then taking the inner product of the data with v takes O(nd) (for the first projection – for following projections, decrement d by one for each additional projection). Finally, subtracting out the projection vector is O(n).
So overall, the complexity of directed random projection is O(p*n*d), where p is the number of projections selected. The constant term can be improved by simple heuristics – in particular, note that v can be found by taking the extreme points of a random sample of the data; for a large enough sample, the extremes found will closely match the true population extremes. Assuming that the size of the sample is insignificant compared to the full data set, then the time required to find each random projection is simply the time required to calculated a single inner product – i.e. the time becomes the same as that for ‘naive’ random projection.