Commit
Merge branch 'main' of github.com:chenditc/investment_data into main
chenditc committed Jun 13, 2023
2 parents db40eb3 + e98f28f commit 3c3a7ea
Showing 5 changed files with 216 additions and 27 deletions.
55 changes: 28 additions & 27 deletions README.md
@@ -1,3 +1,6 @@

Chinese README: [![ch](https://img.shields.io/badge/lang-ch-yellow.svg)](https://github.com/chenditc/investment_data/blob/master/docs/README-ch.md)

Chinese blog about this project: [量化系列2 - 众包数据集](https://mp.weixin.qq.com/s/Athd5hsiN_hIKKgxIiO_ow)

- [How to use it](#how-to-use-it)
@@ -17,6 +20,7 @@ Chinese blog about this project: [量化系列2 - 众包数据集](https://mp.we
* [Validation logic](#validation-logic)
- [Contribution Guide](#contribution-guide)
* [Add more stock index](#add-more-stock-index)
  * [Add more data source or fields](#add-more-data-source-or-fields)

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>

@@ -79,35 +83,12 @@ The database table on dolthub is named with prefix of data source, for example `
- ts: Tushare data source
- ak: Akshare data source
- yahoo: Use Qlib's yahoo collector https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo

- baostock: [Baostock](http://baostock.com/)
- final: Merged final data with validation and correction

## Initial import

- w(wind): Use one_time_db_scripts to import the w_a_stock_eod_price table, used as the initial price standard
- c(caihui): SQL import into the c_a_stock_eod_price table
- ts(tushare):
  1. Use tushare/update_stock_list.sh to load the stock list
  2. Use tushare/update_stock_price.sh to load stock prices
- yahoo
  1. Use the yahoo collector to load stock prices

## Daily Update
Currently the daily update only uses the tushare data source and is triggered by a GitHub action.
1. I maintain an offline job which runs [daily_update.sh](daily_update.sh) every 30 minutes to collect data and push it to dolthub.
2. A GitHub action [.github/workflows/upload_release.yml](.github/workflows/upload_release.yml) is triggered daily, which then calls bash dump_qlib_bin.sh to generate a daily tar file and upload it to the release page.

## Merge logic
1. Use the w data source as the baseline, and validate the other data sources against it.
2. Since w data's adjclose differs from ts data's adjclose, we use a **"link date"** to calculate a ratio that maps ts adjclose onto w adjclose. The link date can be the latest of the per-source first valid dates. We don't use a fixed link date because some stocks may not trade on a given date, and listing and delisting dates all differ. We store the link date and adj_ratio in link_table: adj_ratio = link_adj_close / w_adj_close.
3. Append the ts data to the final dataset; its adjclose becomes ts_adj_close / ts_adj_ratio.
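The link-date mapping above can be sketched in a few lines. This is a minimal illustration with made-up numbers and a hypothetical helper name, not the repo's actual merge code:

```python
def link_and_rescale(w_adj, ts_adj):
    """w_adj, ts_adj: dicts mapping trading date -> adjusted close.

    Uses the first date both sources cover (i.e. the later of the two
    series' first valid dates) as the link date, computes
    adj_ratio = link_adj_close / w_adj_close, and maps the ts series
    onto the w scale via ts_adj_close / adj_ratio.
    """
    common = sorted(set(w_adj) & set(ts_adj))
    if not common:
        raise ValueError("no overlapping trading dates to link on")
    link_date = common[0]
    adj_ratio = ts_adj[link_date] / w_adj[link_date]
    final = {d: v / adj_ratio for d, v in ts_adj.items()}
    return link_date, adj_ratio, final

# The same stock, quoted by ts on a scale 2x the w scale:
w = {"2019-01-02": 10.0, "2019-01-03": 10.5}
ts = {"2019-01-03": 21.0, "2019-01-04": 22.0}
link_date, adj_ratio, final = link_and_rescale(w, ts)
# link_date == "2019-01-03", adj_ratio == 2.0, final["2019-01-04"] == 11.0
```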

## Validation logic
1. Generate the final data by concatenating the w data and ts data.
2. Run validation by pairing two data sources:
   - Compare the absolute values of high, low, open, close, and volume.
   - Calculate the adjclose conversion ratio using a link date for each stock.
   - Recompute the w data's adjclose using the link date's ratio and compare it with the final data.
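A hedged sketch of these pairwise checks for a single stock (function and field names are illustrative; the real validation runs against the dolt tables):

```python
def validate_pair(final_rows, src_rows, adj_ratio, rel_tol=1e-4):
    """Pairwise check of one stock between the final table and a source.

    final_rows / src_rows: date -> dict with 'high', 'low', 'open',
    'close', 'volume', 'adjclose'. adj_ratio is the stock's link-date
    ratio from link_table. Returns the dates failing any check.
    """
    bad = []
    for date, s in src_rows.items():
        f = final_rows.get(date)
        if f is None:
            continue
        # Check 1: raw OHLCV absolute values should agree.
        raw_ok = all(
            abs(f[k] - s[k]) <= rel_tol * max(abs(f[k]), 1.0)
            for k in ("high", "low", "open", "close", "volume"))
        # Checks 2-3: rescale the source adjclose with the link-date
        # ratio (final adjclose = src_adj_close / adj_ratio) and compare.
        adj_ok = (abs(s["adjclose"] / adj_ratio - f["adjclose"])
                  <= rel_tol * max(abs(f["adjclose"]), 1.0))
        if not (raw_ok and adj_ok):
            bad.append(date)
    return bad

row = {"high": 10.6, "low": 10.2, "open": 10.3,
       "close": 10.5, "volume": 1000.0}
final = {"2019-01-03": dict(row, adjclose=10.5)}
src = {"2019-01-03": dict(row, adjclose=21.0)}   # 2x scale, ratio 2.0
# validate_pair(final, src, 2.0) reports no bad dates
```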
## Initial loading and validation logic for each table
- [final_a_stock_eod_price](docs/final_a_stock_eod_price.md)
- [final_a_stock_limit](docs/final_a_stock_limit.md)

# Contribution Guide
## Add more stock index
@@ -116,3 +97,23 @@ To add a new stock index, we need to change:
2. Add price download script. Change [tushare/dump_index_eod_price.py](https://github.com/chenditc/investment_data/blob/main/tushare/dump_index_eod_price.py) to add the index price. Eg. [Example Commit](https://github.com/chenditc/investment_data/commit/ae7e0066336fc57dd60d13b20ac456b5358ef91f)
3. Modify the export script. Change the qlib dump script [qlib/dump_index_weight.py#L13](https://github.com/chenditc/investment_data/blob/main/qlib/dump_index_weight.py#L13) so that the index will be dumped and renamed to a txt file for use. [Example commit](https://github.com/chenditc/investment_data/commit/f41a11c263234587bc40491511ae1822cc509afb)

## Add more data source or fields
Please raise an issue to discuss the plan, example issue: https://github.com/chenditc/investment_data/issues/11

It should include:
1. Why do we want this data?
2. How will we do regular updates?
   - Which data source will we use?
   - When should we trigger the update?
   - How do we validate that the regular update completed correctly?
3. From which data source should we get historical data?
4. How do we plan to validate the historical data?
   - Is the data source complete? How did we verify this?
   - Is the data source accurate? How did we verify this?
   - If we see errors in validation, how will we deal with them?
5. Are we changing an existing table or adding a new table?



If the data is not clean, we may work hard to dig insights out of it and end up with incorrect insights. So we want **high-quality** data instead of **just data**.

115 changes: 115 additions & 0 deletions docs/README-ch.md
@@ -0,0 +1,115 @@
------------------------------------------

Chinese blog about this project: [量化系列2 - 众包数据集](https://mp.weixin.qq.com/s/Athd5hsiN_hIKKgxIiO_ow)

- [How to use it](#how-to-use-it)
- [Development setup](#development-setup)
  * [Install dolt](#install-dolt)
  * [Clone the data](#clone-the-data)
  * [Export to qlib format](#export-to-qlib-format)
  * [Run daily update](#run-daily-update)
  * [Daily update and dump](#daily-update-and-dump)
  * [Extract the tar file into the qlib directory](#extract-the-tar-file-into-the-qlib-directory)
- [Motivation](#motivation)
- [Project details](#project-details)
  * [Data sources](#data-sources)
  * [Initial import](#initial-import)
  * [Daily update](#daily-update)
  * [Merge logic](#merge-logic)
  * [Validation logic](#validation-logic)
- [Contribution guide](#contribution-guide)
  * [Add more stock indexes](#add-more-stock-indexes)
  * [Add more data sources or fields](#add-more-data-sources-or-fields)

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>

# How to use it
1. Download the tar archive from the latest release page on GitHub
2. Extract the tar file into the default qlib directory
```
wget https://github.com/chenditc/investment_data/releases/download/2023-04-20/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
```

# Development setup
How to set up a development environment if you want to contribute to these scripts or the data.

## Install dolt
Follow the instructions at https://github.com/dolthub/dolt

## Clone the data
The raw data is hosted on dolt: https://www.dolthub.com/repositories/chenditc/investment_data

Download it as a dolt database:

`dolt clone chenditc/investment_data`

## Export to qlib format
```
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/
```

## Run daily update
You will need a tushare token to use the tushare API. Get one from https://tushare.pro/

```
export TUSHARE=<Token>
bash daily_update.sh
```

## Daily update and dump
```
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/
```

## Extract the tar file into the qlib directory
```
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
```

# Motivation
1. Try to fill in missing data, such as data for delisted companies, by combining multiple data sources.
2. Try to correct the data by cross-validating it against multiple data sources.

# Project details
## Data sources

The database tables on dolthub are named with a data-source prefix, for example `ts_a_stock_eod_price`. Meaning of the prefixes:

- w(wind): High-quality static data source. Only available up to 2019.
- c(caihui): High-quality static data source. Only available up to 2019.
- ts: Tushare data source
- ak: Akshare data source
- yahoo: Uses Qlib's yahoo collector https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
- final: Merged final data with validation and correction

## Initial loading and validation logic for each table
- [final_a_stock_eod_price](final_a_stock_eod_price.ch.md)

# Contribution guide
## Add more stock indexes
To add a new stock index, we need to change:
1. Add an index weight download script. Change the [tushare/dump_index_weight.py](https://github.com/chenditc/investment_data/blob/main/tushare/dump_index_weight.py#L15) script to export the index info. If the index is not available in tushare, write a new script and add it to the [daily_update.sh](https://github.com/chenditc/investment_data/blob/main/daily_update.sh#L12) script. [Example commit](https://github.com/chenditc/investment_data/commit/a906e4cb1b34d6a63a1b1eda80a4c734a3cd262f)
2. Add a price download script. Change [tushare/dump_index_eod_price.py](https://github.com/chenditc/investment_data/blob/main/tushare/dump_index_eod_price.py) to add the index price. [Example commit](https://github.com/chenditc/investment_data/commit/ae7e0066336fc57dd60d13b20ac456b5358ef91f)
3. Modify the export script. Change the qlib dump script [qlib/dump_index_weight.py#L13](https://github.com/chenditc/investment_data/blob/main/qlib/dump_index_weight.py#L13) so that the index will be dumped and renamed to a txt file for use. [Example commit](https://github.com/chenditc/investment_data/commit/f41a11c263234587bc40491511ae1822cc509afb)

## Add more data sources or fields
Please raise a GitHub issue to discuss the plan, including:
1. Why do we want this data?
2. How will we do regular updates?
   - Which data source will we use?
   - When should we trigger the update?
   - How do we validate that the regular update completed correctly?
3. From which data source should we get historical data?
4. How do we plan to validate the historical data?
   - Is the data source complete? How did we verify this?
   - Is the data source accurate? How did we verify this?
   - If we see errors in validation, how will we deal with them?
5. Are we changing an existing table or adding a new table?

Example GitHub issue: https://github.com/chenditc/investment_data/issues/11

If the data is not clean, any work we build on top of it has no credibility. So we want **high-quality** data, not **just data**.
25 changes: 25 additions & 0 deletions docs/final_a_stock_eod_price.ch.md
@@ -0,0 +1,25 @@
## Initial import
- w(wind): Use one_time_db_scripts to import the w_a_stock_eod_price table, used as the initial price standard
- c(caihui): SQL import into the c_a_stock_eod_price table
- ts(tushare):
  1. Use tushare/update_stock_list.sh to load the stock list
  2. Use tushare/update_stock_price.sh to load stock prices
- yahoo
  1. Use the yahoo collector to load stock prices

## Daily update
Currently the daily update only uses the tushare data source and is triggered by a GitHub action.
1. I maintain an offline job which runs [daily_update.sh](daily_update.sh) every 30 minutes to collect data and push it to dolthub.
2. A GitHub action [.github/workflows/upload_release.yml](.github/workflows/upload_release.yml) is triggered daily, which then calls bash dump_qlib_bin.sh to generate a daily tar file and upload it to the release page.

## Merge logic
1. Use the w data source as the baseline, and validate the other data sources against it.
2. Since w data's adjclose differs from ts data's adjclose, we use a **link date** to calculate a ratio that maps ts adjclose onto w adjclose. The link date can be the latest of the per-source first valid dates. We don't use a fixed link date because some stocks may not trade on a given date, and listing and delisting dates all differ. We store the link date and adj_ratio in link_table: adj_ratio = link_adj_close / w_adj_close.
3. Append the ts data to the final dataset; its adjclose becomes ts_adj_close / ts_adj_ratio.

## Validation logic
1. Generate the final data by concatenating the w data and ts data.
2. Run validation by pairing two data sources:
   - Compare the absolute values of high, low, open, close, and volume.
   - Calculate the adjclose conversion ratio using a link date for each stock.
   - Recompute the w data's adjclose using the link date's ratio and compare it with the final data.
26 changes: 26 additions & 0 deletions docs/final_a_stock_eod_price.md
@@ -0,0 +1,26 @@
## Initial import

- w(wind): Use one_time_db_scripts to import the w_a_stock_eod_price table, used as the initial price standard
- c(caihui): SQL import into the c_a_stock_eod_price table
- ts(tushare):
  1. Use tushare/update_stock_list.sh to load the stock list
  2. Use tushare/update_stock_price.sh to load stock prices
- yahoo
  1. Use the yahoo collector to load stock prices

## Daily Update
Currently the daily update only uses the tushare data source and is triggered by a GitHub action.
1. I maintain an offline job which runs [daily_update.sh](daily_update.sh) every 30 minutes to collect data and push it to dolthub.
2. A GitHub action [.github/workflows/upload_release.yml](.github/workflows/upload_release.yml) is triggered daily, which then calls bash dump_qlib_bin.sh to generate a daily tar file and upload it to the release page.

## Merge logic
1. Use the w data source as the baseline, and validate the other data sources against it.
2. Since w data's adjclose differs from ts data's adjclose, we use a **"link date"** to calculate a ratio that maps ts adjclose onto w adjclose. The link date can be the latest of the per-source first valid dates. We don't use a fixed link date because some stocks may not trade on a given date, and listing and delisting dates all differ. We store the link date and adj_ratio in link_table: adj_ratio = link_adj_close / w_adj_close.
3. Append the ts data to the final dataset; its adjclose becomes ts_adj_close / ts_adj_ratio.

## Validation logic
1. Generate the final data by concatenating the w data and ts data.
2. Run validation by pairing two data sources:
   - Compare the absolute values of high, low, open, close, and volume.
   - Calculate the adjclose conversion ratio using a link date for each stock.
   - Recompute the w data's adjclose using the link date's ratio and compare it with the final data.
22 changes: 22 additions & 0 deletions docs/final_a_stock_limit.md
@@ -0,0 +1,22 @@
## Why we need this
1. A stock price hitting its up-limit or down-limit price has special meaning in trading:
   - In backtests, we cannot trade after the price hits this number.
   - In feature design, this might indicate higher momentum.

## Initial import
Related SQL stored in Procedure: https://www.dolthub.com/repositories/chenditc/investment_data/compare/master/l5e2000o8fd479n5dbqfufpkqegutq0k?tableName=dolt_procedures

1. Take all (tradedate, symbol, lag(close) as pre_close) from the final_a_stock_eod_price table into final_a_stock_limit.
2. Import tushare's daily limit data and override rows that already exist in final_a_stock_limit.
3. Drop data earlier than "1996-12-16", as the stop price was introduced after that date.
4. Join the final_a_stock_limit data with bao_a_stock_eod_info, and fill the up/down limit based on whether the stock is ST.
5. Correct precision problems by cross-checking the high price against the up-limit price. If the difference is less than 1%, set the up-limit price to the high price. If the difference is more than 1%, remove the row to indicate there was no limit that day.
6. Delete all rows with no pre_close / up_limit / down_limit.
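Step 5 can be transcribed literally as follows. This is a sketch with hypothetical field names; the authoritative logic lives in the dolt SQL procedure linked above:

```python
def correct_up_limit(rows, tol=0.01):
    """Cross-check each row's high price against its up-limit.

    rows: list of dicts with 'high' and 'up_limit'.
    If the relative difference is under 1%, snap up_limit to high;
    otherwise drop the row, representing "no limit applied that day".
    """
    fixed = []
    for r in rows:
        diff = abs(r["high"] - r["up_limit"]) / r["up_limit"]
        if diff < tol:
            fixed.append({**r, "up_limit": r["high"]})
        # rows with diff >= tol are dropped
    return fixed

rows = [
    {"high": 11.00, "up_limit": 11.00},  # exact match, kept
    {"high": 10.95, "up_limit": 11.00},  # <1% off, snapped to 10.95
    {"high": 9.00, "up_limit": 11.00},   # >1% off, dropped
]
fixed = correct_up_limit(rows)
```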

## Daily Update
1. Each day, update the data from tushare directly into final_a_stock_limit_data.

## Validation logic
1. final_a_stock_eod_price.high <= final_a_stock_limit.up_limit
2. final_a_stock_eod_price.low >= final_a_stock_limit.down_limit
3. The daily row count of final_a_stock_limit is > 1000
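A sketch of these sanity checks over in-memory rows (field and function names are hypothetical; in practice the checks run as SQL over the joined price and limit tables):

```python
def check_limits(rows_by_date, min_daily_count=1000):
    """rows_by_date: date -> list of dicts with 'high', 'low',
    'up_limit', 'down_limit' joined from the price and limit tables.
    Returns a list of human-readable violations (empty = all good).
    """
    problems = []
    for date, rows in rows_by_date.items():
        for r in rows:
            # Price bars must stay inside the day's limit band.
            if r["high"] > r["up_limit"]:
                problems.append(f"{date}: high above up_limit")
            if r["low"] < r["down_limit"]:
                problems.append(f"{date}: low below down_limit")
        # The limit table should cover the whole market each day.
        if len(rows) <= min_daily_count:
            problems.append(f"{date}: only {len(rows)} limit rows")
    return problems

good = {"2023-06-13": [
    {"high": 11.0, "low": 10.0, "up_limit": 11.0, "down_limit": 9.0},
]}
# With the count threshold relaxed for this tiny sample, no violations:
assert check_limits(good, min_daily_count=0) == []
```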
