Original article was published by /u/emadboctor on Deep Learning
Hello, since I got into deep learning, I've been working with small datasets to create models with the Keras functional API. What I usually do is preprocess and clean / manipulate / visualize the data using pandas, then create a TFRecord that I use for training. Recently I started working with stock data (1-min frequency), so I currently have somewhere between 10,000 and 30,000 stock symbols covering 10+ years of data, stored in the following fashion: I pulled the data from the Polygon API, created a separate .parquet file for each symbol, and keep the files in a GCP bucket. Now, if I'm going to create a dataset that includes most of the symbols, I have the following concerns:
- Files vary in length and frequency, which means, for example, that AAPL has 2,566,598 rows while AMZN has 1,928,479 rows for the same period, and some symbols have fewer than 10,000 rows. What is a proper way of dealing with these variable-length series?
- For calculating technical indicators, lagged returns, and many other computations efficiently, I was thinking maybe I could store the data in Google BigQuery and perform the necessary computations with SQL queries. Is there a better way? Should I store all symbols in one table with multiple indices, or use one table per symbol?
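On the variable-length question, one common approach is to cut each symbol's series into fixed-length windows so every training example has the same shape, and symbols with too few rows simply contribute no examples. A minimal NumPy sketch (the 60-minute window length is an arbitrary choice, not a recommendation):

```python
import numpy as np

def make_windows(prices: np.ndarray, window: int = 60, step: int = 1) -> np.ndarray:
    """Slice a 1-D price series into overlapping fixed-length windows.

    Symbols with fewer than `window` rows yield zero examples, which
    sidesteps the variable-length problem entirely.
    """
    n = len(prices) - window + 1
    if n <= 0:
        return np.empty((0, window), dtype=prices.dtype)
    # Index matrix: row i selects prices[i : i + window].
    idx = np.arange(window)[None, :] + np.arange(0, n, step)[:, None]
    return prices[idx]

# Two "symbols" of very different lengths produce same-width windows.
aapl = np.arange(100, dtype=np.float32)   # stand-in for millions of AAPL rows
tiny = np.arange(10, dtype=np.float32)    # a thinly traded symbol
print(make_windows(aapl).shape)  # (41, 60)
print(make_windows(tiny).shape)  # (0, 60)
```

The same idea is available natively in the tf.data pipeline you already feed TFRecords into, via `tf.data.Dataset.window`, so the windowing can happen at training time instead of on disk.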
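On the storage question, I can't say whether BigQuery is worth it at this scale, but the one-table-with-a-symbol-index layout maps naturally onto pandas if the data (or a batch of it) fits in memory: `groupby("symbol")` keeps lags and returns from leaking across tickers. A sketch with made-up rows:

```python
import pandas as pd

# Toy 1-min bars for two symbols in one "long" table, mirroring the
# single-table-with-multiple-indices layout (symbol + timestamp).
df = pd.DataFrame({
    "symbol": ["AAPL"] * 4 + ["AMZN"] * 4,
    "ts": pd.date_range("2020-01-02 09:30", periods=4, freq="min").tolist() * 2,
    "close": [300.0, 301.0, 300.5, 302.0, 1898.0, 1899.5, 1897.0, 1900.0],
})

g = df.groupby("symbol")["close"]
df["ret_1m"] = g.pct_change()   # 1-min return; NaN at the start of EACH symbol
df["close_lag1"] = g.shift(1)   # lagged close, never crosses a symbol boundary
```

The per-symbol NaNs at each group start are the point: with one flat table and no groupby, AMZN's first return would be computed against AAPL's last close.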
And for those who have worked with intraday stock data before:
- What frequency do you recommend to avoid overfitting and get good results?
- What technical indicators work best for data at this frequency (1 min)? I'm asking because MACD and many other indicators are usually calculated over periods of days, and I'm not sure whether they can be applied at this frequency as well.
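On the frequency question, one practical point: since the source bars are 1-min, downsampling to any lower frequency is cheap, so the choice can be tested empirically rather than decided up front. A sketch of aggregating to 5-min OHLCV with pandas (column names are assumptions about the Polygon schema):

```python
import numpy as np
import pandas as pd

# Fifteen synthetic 1-min bars indexed by timestamp.
rng = pd.date_range("2020-01-02 09:30", periods=15, freq="min")
bars = pd.DataFrame({
    "open": np.arange(15.0),
    "high": np.arange(15.0) + 1,
    "low": np.arange(15.0) - 1,
    "close": np.arange(15.0) + 0.5,
    "volume": np.full(15, 100),
}, index=rng)

# Aggregate 1-min bars into 5-min bars with the usual OHLCV rules.
bars_5m = bars.resample("5min").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)
print(len(bars_5m))  # 3
```

Building the same features at several frequencies (1, 5, 15 min, ...) from one source and comparing validation performance is a more direct answer to the overfitting question than any fixed rule.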
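On the MACD question specifically: the standard 12/26/9 parameters are bar counts, not days, so applied to 1-min bars they simply mean 12/26/9 minutes. The formula works unchanged; whether such short horizons carry signal is the empirical part. A minimal sketch on synthetic closes:

```python
import numpy as np
import pandas as pd

# Synthetic 1-min closing prices; replace with a real close series.
close = pd.Series(np.sin(np.linspace(0, 6, 200)) + 10.0)

# Classic MACD: the 12/26/9 periods are now measured in minutes (bars).
ema_fast = close.ewm(span=12, adjust=False).mean()
ema_slow = close.ewm(span=26, adjust=False).mean()
macd = ema_fast - ema_slow
signal = macd.ewm(span=9, adjust=False).mean()
histogram = macd - signal
```

To mimic the daily-chart behavior on intraday data, people instead scale the spans (e.g. 12 trading days of 390 one-minute bars each), which is a modeling choice rather than a property of the indicator.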