A Fund Price Prediction Framework (LSTM-based, web scraping included): Tiantian Fund (天天基金网) web scraper + fund prediction
This project aims to build a stable price predictor for open-end mutual funds available in China's fund market.
It consists of:
This project is written in Python, with data stored in a MySQL database.
Key libraries include Scrapy, PyMySQL, NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, and TensorFlow (tf.keras).
The following raw data are collected for each fund by the spider em.py:
Note: the scraped website updates its data daily, except for asset size, asset allocation and industry allocation, which are updated once per reporting period (quarterly).
Complete source code of this framework can be found here.
from ZFundETL import FundETL
ETL = FundETL()
funds, categorical = ETL.quick_prepossessing()
For more details, please check the demo.
This process mainly generates two datasets.
funds
: contains the historical daily prices of available funds during the selected period, plus the daily returns of the benchmarks (stock or bond index).
categorical
: contains all the short-term invariant features, including fund types, fund styles, asset size, ranking scores (manager performance), asset allocation, and industry allocation.
from ZFundPredictor import FundPredictor
predictor = FundPredictor(funds)
single_prediction = predictor.get_prediction(ticker, **params)
ensemble_prediction = predictor.ensemble_prediction(ticker, **tune_params)
For more details, please see the following section and the demo.
The current prediction algorithm is based on an LSTM with sliding windows.
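To make the sliding-window idea concrete, here is a minimal pure-Python sketch of how a single price series could be cut into (lookback, lookahead) training pairs. The function name `make_windows` and the sample prices are hypothetical; the project's own windowing code may differ (for example, it likely works on NumPy arrays with several features per day).

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], lookback: int, lookahead: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split one series into (input window, target window) pairs.

    Each input holds `lookback` consecutive values and each target holds
    the `lookahead` values that immediately follow it.
    """
    if lookback < 1 or lookahead < 1:
        raise ValueError("lookback and lookahead must be positive")
    inputs, targets = [], []
    # Last index at which a full input + target still fits in the series.
    last_start = len(series) - lookback - lookahead
    for start in range(last_start + 1):
        split = start + lookback
        inputs.append(list(series[start:split]))
        targets.append(list(series[split:split + lookahead]))
    return inputs, targets


# Made-up prices, for illustration only.
prices = [1.00, 1.01, 1.03, 1.02, 1.05, 1.07, 1.06]
X, y = make_windows(prices, lookback=3, lookahead=2)
print(len(X))      # 3 samples fit into 7 prices
print(X[0], y[0])  # [1.0, 1.01, 1.03] [1.02, 1.05]
```

With a 120-day lookback and a 5-day lookahead, each sample would pair 120 days of history with the 5 days that follow.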
dist_EMA
: the distance between the fund price and the x-day exponential moving average (EMA) of the fund on each day.
signal_EMA
: one-time EMA signal. -1 if the shorter EMA crosses above the longer EMA on that day, 1 vice versa, and 0 if there is no cross.
status_BB
: short-term status of the Bollinger Bands indicator. -1 if the value of the x-day BB indicator is below the lower band, 1 if it is above the upper band, and 0 otherwise.
Currently in use:
lookback, lookahead
: [50, 1], [120, 2] and [120, 5]
'ema' - adopt only the EMA indicators dist_ema (e.g. dist_ema5 refers to the 5-day EMA, and dist_ema50 refers to the 50-day EMA).
'all' - adopt all available indicators.
A model with a 120-day lookback period and a 5-day lookahead period would look like this:
Layer (type) | Output Shape | Param # |
---|---|---|
lstm_0 (LSTM) | (None, 120, 256) | 268288 |
dropout_0 (Dropout) | (None, 120, 256) | 0 |
lstm_1 (LSTM) | (None, 256) | 525312 |
dropout_1 (Dropout) | (None, 256) | 0 |
dense_0 (Dense) | (None, 32) | 8224 |
dense_1 (Dense) | (None, 5) | 165 |
Total params: 801,989 | | |
Trainable params: 801,989 | | |
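The parameter counts in the table can be checked by hand with the standard formulas for LSTM and Dense layers (biases included). The sketch below does that arithmetic. Note that 5 input features per time step is an inference from the 268,288 parameters of lstm_0, not something the text states.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of a standard LSTM layer with biases.

    Four gates, each with an input kernel (input_dim x units),
    a recurrent kernel (units x units) and a bias vector (units).
    """
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus biases."""
    return input_dim * units + units


# Assumption: inferred from the lstm_0 parameter count, not stated in the text.
n_features = 5

counts = {
    "lstm_0": lstm_params(n_features, 256),  # 268288
    "lstm_1": lstm_params(256, 256),         # 525312
    "dense_0": dense_params(256, 32),        # 8224
    "dense_1": dense_params(32, 5),          # 165
}
total = sum(counts.values())                 # 801989

for name, count in counts.items():
    print(f"{name}: {count}")
print(f"total: {total}")
```

The Dropout layers contribute no parameters, and the 5 output units of dense_1 correspond to the 5-day lookahead.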
When get_prediction() is called, the training set goes through the layers above and a single model prediction is generated with the provided parameters.
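The indicator features defined earlier in this section (dist_EMA, signal_EMA, status_BB) are presumably derived from the price series before it reaches the model. Below is a minimal pure-Python sketch of the three definitions, not the project's actual code. The function names, the EMA seeded with the first price, and the default band width of 2 standard deviations are all assumptions; the sign conventions follow the definitions above.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ema(prices: Sequence[float], span: int) -> List[float]:
    """Exponential moving average, seeded with the first price (assumption)."""
    alpha = 2 / (span + 1)
    out = [float(prices[0])]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out


def dist_ema(prices: Sequence[float], span: int) -> List[float]:
    """Distance between the price and its span-day EMA on each day."""
    return [p - e for p, e in zip(prices, ema(prices, span))]


def signal_ema(prices: Sequence[float], short: int, long: int) -> List[int]:
    """-1 on the day the shorter EMA crosses above the longer EMA,
    1 on the day it crosses below, and 0 when there is no cross."""
    diff = [s - l for s, l in zip(ema(prices, short), ema(prices, long))]
    signals = [0]  # no previous day to compare against
    for prev, cur in zip(diff, diff[1:]):
        if prev < 0 <= cur:
            signals.append(-1)
        elif prev > 0 >= cur:
            signals.append(1)
        else:
            signals.append(0)
    return signals


def status_bb(prices: Sequence[float], window: int, width: float = 2.0) -> List[int]:
    """-1 below the lower Bollinger Band, 1 above the upper band, else 0.

    Days without a full `window` of history are reported as 0.
    """
    status = []
    for i, price in enumerate(prices):
        if i + 1 < window:
            status.append(0)
            continue
        chunk = prices[i + 1 - window:i + 1]
        middle, spread = mean(chunk), pstdev(chunk)
        if price < middle - width * spread:
            status.append(-1)
        elif price > middle + width * spread:
            status.append(1)
        else:
            status.append(0)
    return status


# Made-up prices: a dip followed by a recovery.
prices = [10, 9, 8, 7, 8, 9, 10, 11]
print(signal_ema(prices, short=2, long=4))
print(dist_ema(prices, span=2))
```

In practice the same features are more naturally computed with Pandas (for example with rolling and exponentially weighted windows); plain Python is used here only to keep the definitions explicit.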
ensemble_prediction() accepts a basket of MA options and a basket of dropout options. Each combination forms a new model, and the final prediction is the weighted average of some of these models, with the selection and weights based on their R-squared scores.
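The text does not say how models are selected or weighted, so the following is only one plausible reading: keep the top-scoring models with a positive R-squared and weight each prediction in proportion to its score. The function name `weighted_ensemble`, the `top_k` parameter, and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# One entry per (MA option, dropout) combination: (R-squared, prediction).
Result = Tuple[float, Sequence[float]]


def weighted_ensemble(results: Dict[Tuple[str, float], Result], top_k: int = 3) -> List[float]:
    """Average the predictions of the best models, weighted by R-squared.

    Assumption: only the `top_k` models with a positive score contribute.
    """
    ranked = sorted(results.values(), key=lambda item: item[0], reverse=True)
    kept = [(r2, pred) for r2, pred in ranked[:top_k] if r2 > 0]
    if not kept:
        raise ValueError("no model has a positive R-squared score")
    total = sum(r2 for r2, _ in kept)
    horizon = len(kept[0][1])
    return [sum(r2 * pred[day] for r2, pred in kept) / total for day in range(horizon)]


# Made-up scores and 2-day predictions for four combinations.
results = {
    ("ema", 0.2): (0.8, [1.10, 1.12]),
    ("ema", 0.4): (0.6, [1.00, 1.02]),
    ("all", 0.2): (0.6, [1.20, 1.22]),
    ("all", 0.4): (-0.1, [2.00, 2.00]),  # dropped: negative score
}
ensemble = weighted_ensemble(results, top_k=3)
print([round(value, 4) for value in ensemble])  # [1.1, 1.12]
```

Weighting by score lets better-fitting models dominate without discarding the others entirely; other schemes (equal weights, softmax over scores) would fit the same description.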
An illustration of ensemble_prediction():
Training uses early stopping with monitor='val_loss', min_delta=1e-5, patience=5, restore_best_weights=True.
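These four settings match the argument names of tf.keras.callbacks.EarlyStopping, which is presumably what the project uses. To show what they mean, here is a simplified pure-Python sketch of the stopping rule. It is not Keras's implementation, and the function name `early_stop_epoch` and the loss values are made up.

```python
from typing import Sequence, Tuple


def early_stop_epoch(
    val_losses: Sequence[float], min_delta: float = 1e-5, patience: int = 5
) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` epochs in a row fail to lower the best
    validation loss by more than `min_delta`. `best_epoch` is the epoch
    whose weights restore_best_weights=True would bring back.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for every epoch provided.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is reached at epoch 2.
losses = [0.50, 0.40, 0.35, 0.36, 0.35, 0.37, 0.36, 0.35, 0.38, 0.30]
stop_epoch, best_epoch = early_stop_epoch(losses)
print(stop_epoch, best_epoch)  # 7 2
```

In this example the later loss of 0.30 is never seen, because five epochs without improvement end training at epoch 7 and the weights from epoch 2 are kept.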
On average, training stops between epochs 10 and 20.
Ideas and contributions welcome.
Contact: [email protected]
Distributed under the MIT license.