A workflow for factor-based equity trading, including factor analysis and factor modeling. For well-established factor models, I implement the APT model, Barra's risk model, and a dynamic multi-factor model in this project.
Author: Jerry Xia
Date: 2018/07/27
Note: Advanced Markdown features such as math expressions may not render on GitHub; please see README.pdf instead if you want more details
This is a research survey on alpha trading. In this project, I built a pipeline for alpha trading, including:
The models involved are APT models, Barra's risk models, and a dynamic factor model using the Kalman filter.
rqdata_utils.py: Utils for handling RiceQuant platform data
Step1_FactorPretest.ipynb: Factor return profile visualization
Step2_FactorsScreening.ipynb: Factor return turnover visualization and correlation coefficients
Step3_FactorCombination_AdaBoost_Quantopian.ipynb: A Quantopian notebook combining alpha factors using AdaBoost
Step3_FactorCombination_BarraKalmanFilter.ipynb: Barra's risk model with three calibration schemes:
KalmanFilterIntro.ipynb: An introduction to the dynamic multi-factor model
APT_FammaBeth.ipynb: Using Fama-MacBeth regression to calibrate the APT model.
The dataset is not available on GitHub because it is too large. Step3_FactorCombination_AdaBoost_Quantopian.ipynb uses US stock data on Quantopian; the other files use Chinese A-share data downloaded from RiceQuant instead (free US equity data is hard to come by).
The data frame is multi-indexed, similar to Quantopian's format (see both the Alphalens GitHub code and rqdata_utils.py). That said, feel free to cast and apply your own dataset.
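As a minimal sketch of the expected layout, the snippet below builds a (date, asset) multi-indexed DataFrame in the Alphalens style; the asset codes and factor column names are illustrative placeholders, not the project's actual schema.

```python
import numpy as np
import pandas as pd

# Illustrative (date, asset) MultiIndex with one column per factor.
dates = pd.date_range("2017-01-03", periods=3)
assets = ["000001.XSHE", "600000.XSHG"]  # placeholder A-share codes
index = pd.MultiIndex.from_product([dates, assets], names=["date", "asset"])

rng = np.random.default_rng(0)
factors = pd.DataFrame(
    rng.standard_normal((len(index), 2)),
    index=index,
    columns=["pe_ratio", "ps_ratio"],  # placeholder factor names
)

# Alphalens-style access: the cross-section of all assets on one date.
day_slice = factors.xs(dates[0], level="date")
print(day_slice.shape)  # (2, 2)
```

Any dataset cast into this (date, asset) shape should slot into the rest of the pipeline.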
$\checkmark$ stands for finished and $\vartriangle$ stands for TODO
Universe definition
Factors collection and preprocessing
Factors screening and combination
Portfolio allocation
Here, I use the correlation matrix as the measure. The difference from the second result is that this correlation matrix is calculated on the rank data rather than the raw data.
Pearson's IC: measures the linear relationship between components
Spearman's IC: measures the monotonic relationship between components. Since we only care about monotonic relationships, Spearman's IC wins.
From the correlation coefficients below, we can again conclude that Spearman's rank IC is far more robust. Take ps_ratio and sales_yield as an example: $$ps\_ratio = \frac{\mbox{adjusted close price}}{\mbox{sales per share}}$$ whereas $$sales\_yield = \frac{\mbox{sales per share}}{\mbox{price}}$$ Although the price in the sales_yield formula is vague in our data source, roughly speaking these two variables should be inverses of each other. The Spearman's rank correlation coefficient is -0.98, which verifies this statement, and we should avoid using both of these factors, which would exaggerate the impact of this particular factor. However, no such identity shows up in Pearson's regular correlation coefficients. It is actually quite misleading, and that is why we choose Spearman's rank IC.
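The contrast can be reproduced with a toy example: two strictly inverse series (mimicking ps_ratio versus sales_yield) get a rank correlation of exactly -1, while Pearson's coefficient is pulled away from -1 by the nonlinearity of 1/x. The variable names here are illustrative, not the project's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two factors that are exact inverses of each other, like
# ps_ratio ~ price / sales_per_share vs. sales_yield ~ sales_per_share / price.
price_to_sales = np.exp(rng.standard_normal(500))  # positive values
sales_yield = 1.0 / price_to_sales

pearson = stats.pearsonr(price_to_sales, sales_yield)[0]
spearman = stats.spearmanr(price_to_sales, sales_yield)[0]

# Spearman works on ranks, so an exact monotonic inverse gives -1.0,
# while Pearson is distorted by the nonlinear 1/x relationship.
print(round(spearman, 2))  # -1.0
print(round(pearson, 2))
```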
Here, I use principal component analysis because it brings two benefits to our data: orthogonality and dimensionality reduction. Orthogonality makes the data more separable, and lower dimensionality makes the information more concentrated. Both are valuable for machine learning algorithms.
In the next part, I used this preprocessed data as the input to obtain a "mega alpha".
construct an aggregate alpha factor whose return distribution is profitable. The term "profitable" here means concentrated, low-turnover, and significantly positive returns.
Here we only introduce the AdaBoost algorithm in this documentation. For more details about the linear models, please see the appendix and Step3_FactorCombination_BarraKalmanFilter.ipynb.
The algorithm sequentially applies a weak classifier to modified versions of the data. By increasing the weights of the misclassified observations, each weak learner focuses on the errors of the previous one. The predictions are then aggregated through a weighted majority vote.
The AdaBoost classifier was applied to our fundamental dataset. The objective is to train a classifier that gives a score for the bundle of factors, in other words, the mega alpha. Pink marks observations with positive forward returns and blue marks observations with negative forward returns. A good scoring system makes the two classes more separated. As we can see, on the training set the AdaBoost classifier did very well! The next plot shows the precision in each quantile of scores: in the top and bottom quantiles, the predicted precision is nearly 100%!
[Figures: alpha values histogram; quantile precision bar plot] On the test set, however, the precision in the top and bottom quantiles is only slightly higher than 50%, far from good once we consider transaction costs.
So, I added some technical analysis factors to see if we can tackle this problem. Surprisingly, the average accuracy on the test set rises to about 67%, and if we only trade the extreme quantiles, accuracy reaches around 80%! This suggests that technical factors really matter in the US stock market and can be used to find arbitrage opportunities.
where $f_k(t)$ is the realization (value) of risk factor $k$ at time $t$
Exposure of each security on each factor
Risk premium on each factor $$\mathrm{Mean}[r_i(t)] = P_0 + \sum_{k=1}^K \beta_{i,k} \cdot P_k$$ or, setting $\bar{\beta}_{i,0} = 1$ for each $i$, $$\mathrm{Mean}[r_i(t)] = \sum_{k=0}^K \bar{\beta}_{i,k} \cdot P_k$$ where $P_0$ is the risk-free return
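The risk-premium equation above is a cross-sectional regression: mean excess returns regressed on factor exposures, with the intercept playing the role of $P_0$. A minimal sketch on simulated exposures (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_assets, n_factors = 50, 3

# Assumed inputs: per-asset exposures (betas) and mean excess returns.
betas = rng.standard_normal((n_assets, n_factors))
true_premia = np.array([0.02, -0.01, 0.005])
mean_returns = (
    0.001 + betas @ true_premia + 0.0005 * rng.standard_normal(n_assets)
)

# Cross-sectional OLS: regress mean returns on exposures, with an
# intercept column so the first coefficient estimates P0.
design = np.column_stack([np.ones(n_assets), betas])
premia, *_ = np.linalg.lstsq(design, mean_returns, rcond=None)
print(premia.round(4))  # [P0, P1, P2, P3]
```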
Portfolio exposure to each factor $$Portfolio_{it} = \beta_0 + \beta_k \cdot f_{kit}$$
statistical techniques such as factor analysis and principal component analysis
portfolios: K different well-diversified portfolios as substitutions
economic theory (highly developed art)
The simplicity of APT framework is a great virtue. It is helpful to understand the true sources of stock returns. The basic APT model can be enhanced in many ways.
Using historical returns to extract the factors
$$r_{it} = \alpha_i + \sum_k \beta_{ik}\cdot f_{kt} + \epsilon_{it}$$ where $$E[\epsilon_{it} \epsilon_{jt}]=0$$ $$E[\epsilon_{it} f_{kt}]=0$$
$f_{kt}$: the return on index $k$ in period $t$
$\beta$: sensitivities
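When the factor returns $f_{kt}$ are taken as given, the sensitivities $\beta_{ik}$ and intercept $\alpha_i$ can be identified empirically by a time-series OLS per asset. A sketch on simulated data (dimensions and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_periods, n_factors = 250, 2

# Assumed inputs: factor returns f_kt and one asset's excess returns r_it.
factor_returns = 0.01 * rng.standard_normal((n_periods, n_factors))
true_beta = np.array([1.2, -0.4])
asset_returns = (
    0.0002
    + factor_returns @ true_beta
    + 0.002 * rng.standard_normal(n_periods)
)

# Time-series OLS of r_it on f_kt recovers alpha_i and beta_ik.
design = np.column_stack([np.ones(n_periods), factor_returns])
coef, *_ = np.linalg.lstsq(design, asset_returns, rcond=None)
alpha_i, beta_i = coef[0], coef[1:]
print(beta_i.round(2))
```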
Either exposure or factor return can be asserted on a priori grounds with the other identified empirically, or both can be identified empirically.
Let the data design the model
Identify the Indexes set
Determine the number of factors: PCA / Factor Analysis
Canonical Correlation Analysis (CCA):
take two sets of variables and see what is common between them (the two sets need not correspond in either index or dimension) $$X_{N \times K}, \quad Y_{N \times K^{\prime}}$$ with weight matrices $$\mbox{x\_weights}_{K \times n}, \quad \mbox{y\_weights}_{K^{\prime} \times n}$$ Using CCA / PLS: $$\mbox{X\_score}_{N\times n} = \mbox{Normalized}[X]_{N \times K} \cdot \mbox{x\_weights}_{K \times n}$$
$$\mbox{Y\_score}_{N\times n} = \mbox{Normalized}[Y]_{N \times K^{\prime}} \cdot \mbox{y\_weights}_{K^{\prime} \times n}$$
Determine the number:
Generate Factors
Calibrate sensitivities:
Explanatory power of the model for each stock: $R^2 > 0.7$ is excellent
$$r_{i,t} = \sum_k X_{i,k,t} \cdot f_{k,t} + a_{i,t}$$ where
$X_{i,k,t}$: the exposure of asset $i$ to factor $k$, known at time $t$
$f_{k,t}$: the return to factor $k$ during the period from time $t$ to time $t+1$
$a_{i,t}$: stock $i$'s specific return during the period from time $t$ to time $t+1$
$r_{i,t}$: the excess return (return above the risk-free return) on stock $i$ during the period from time $t$ to time $t+1$
The risk structure $$V_{i,j} = \sum_{k_1,k_2} X_{i,k_1} F_{k_1,k_2} X_{j,k_2} + \Delta_{i,j}$$ $$V = X F X^T + \Delta$$ where
$F_{k1,k2}$ is the K by K covariance matrix for factor returns
$\Delta_{i,j}$ is the N by N diagonal matrix of specific variance
A portfolio described by an N-element vector $h_i$
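Putting the pieces together, the asset covariance $V = X F X^T + \Delta$ and the portfolio risk $\sigma_p = \sqrt{h^T V h}$ can be computed directly; the sketch below uses random exposures and an equal-weight holdings vector purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 20, 3  # assets, factors

X = rng.standard_normal((N, K))            # exposures, N x K
A = rng.standard_normal((K, K))
F = A @ A.T / K                            # factor covariance, K x K (PSD)
specific_var = rng.uniform(0.01, 0.05, size=N)
Delta = np.diag(specific_var)              # specific risk, N x N diagonal

# Asset covariance implied by the factor structure.
V = X @ F @ X.T + Delta

# Portfolio risk for an equal-weight holdings vector h.
h = np.full(N, 1.0 / N)
sigma_p = np.sqrt(h @ V @ h)
print(V.shape, round(float(sigma_p), 4))
```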
Model Setting:
Measures:
Goal:
You can think of this as slicing through the other direction from the APT analysis, as now the factor returns are unknowns to be solved for, whereas originally the coefficients b were the unknowns. Another way to think about it is that you're determining how predictive of returns the factor was on that day, and therefore how much return you could have squeezed out of that factor.
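That "sliced the other direction" regression can be sketched as one cross-sectional OLS per period: with exposures held fixed, each day's slopes are that day's realized factor returns. The dimensions and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N, K = 60, 40, 2  # periods, assets, factors

# Assumed inputs: exposures fixed over time, plus per-period asset returns.
X = rng.standard_normal((N, K))
true_f = 0.01 * rng.standard_normal((T, K))       # factor returns per period
R = true_f @ X.T + 0.002 * rng.standard_normal((T, N))

# For each period, regress that day's cross-section of returns on the
# exposures: the fitted slopes are that day's factor returns f_kt.
fitted_f = np.vstack(
    [np.linalg.lstsq(X, R[t], rcond=None)[0] for t in range(T)]
)
print(fitted_f.shape)  # (60, 2)
```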