Best practices for handling large datasets

Hello, since I got into deep learning I've been working with small datasets, building models with the TensorFlow / Keras functional API. My usual workflow is to preprocess, clean, and visualize the data with pandas, and then write a TFRecord that I use for training. Recently I started working with stock data at 1-minute frequency, so I currently have somewhere between 10,000 and 30,000 ticker symbols covering 10+ years of data, stored as follows: I pull the data from the Polygon API, write a separate .parquet file per symbol, and keep the files in a GCP bucket.
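Roughly, the per-symbol flow looks like the sketch below. The bucket path, file name, and column names are placeholders rather than my real ones, and the cleaning step is reduced to a sort:

```python
import pandas as pd
import tensorflow as tf

# Placeholder bucket / object / column names.
GCS_PATH = "gs://my-stock-bucket/AAPL.parquet"

# Read one symbol's 1-min bars straight from the GCP bucket
# (needs gcsfs and pyarrow installed).
df = pd.read_parquet(GCS_PATH)
df = df.sort_index()  # real cleaning / feature steps go here

def to_example(row):
    """Serialize one bar as a tf.train.Example."""
    feats = {
        col: tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(row[col])])
        )
        for col in ("open", "high", "low", "close", "volume")
    }
    return tf.train.Example(features=tf.train.Features(feature=feats))

# Write the cleaned frame out as a TFRecord for training.
with tf.io.TFRecordWriter("AAPL.tfrecord") as writer:
    for _, row in df.iterrows():
        writer.write(to_example(row).SerializeToString())
```

Now, if I'm going to create a dataset that will include most of these symbols, I have the following concerns: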

  • Files of varying length and coverage, which means, for example, that AAPL has 2,566,598 rows while AMZN has 1,928,479 rows over the same period, and some symbols have fewer than 10,000 rows. What is a proper way of dealing with the NaN values that show up when aligning them? (One approach I've been considering is sketched right after this list.)
  • For calculating technical indicators, lagged returns, and many other features efficiently, I was thinking I could load the data into Google BigQuery and do the computations with SQL queries. Is there a better way? Should I store all symbols in one table with multiple indices, or use one table per symbol? (The second sketch after this list shows the kind of query I mean.)
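For the first point, one approach I've been considering is to reindex every symbol onto the same regular 1-minute grid and then fill the gaps. The column names and the fill policy (forward-fill prices, zero-fill volume) are just assumptions on my part:

```python
import pandas as pd

def align_to_grid(df: pd.DataFrame, start, end) -> pd.DataFrame:
    """Reindex one symbol's bars onto a regular 1-min grid and fill gaps.

    df is assumed to have a DatetimeIndex and open/high/low/close/volume
    columns; filtering to trading hours is left out for brevity.
    """
    grid = pd.date_range(start, end, freq="1min")
    out = df.reindex(grid)
    # Minutes where nothing traded: carry the last known prices forward
    # and record zero volume instead of leaving NaNs.
    price_cols = ["open", "high", "low", "close"]
    out[price_cols] = out[price_cols].ffill()
    out["volume"] = out["volume"].fillna(0)
    return out

# Align every symbol to the same window, so AAPL and AMZN end up with the
# same number of rows despite their different raw lengths.
# aligned = {sym: align_to_grid(frames[sym], "2012-01-01", "2022-01-01")
#            for sym in frames}
```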
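For the second point, this is the kind of query I have in mind, run through the google-cloud-bigquery client against a single table holding all symbols (the project, dataset, table, and column names are made up). My understanding is that one table partitioned by date and clustered by symbol is generally easier to work with than one table per symbol, but I'd like to hear from people who have done it:

```python
from google.cloud import bigquery

# Hypothetical table: all symbols in one table, partitioned by date and
# clustered by symbol, instead of one table per symbol.
SQL = """
SELECT
  symbol,
  ts,
  close,
  SAFE_DIVIDE(close, LAG(close) OVER (PARTITION BY symbol ORDER BY ts)) - 1
    AS ret_1m,
  AVG(close) OVER (
    PARTITION BY symbol ORDER BY ts
    ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
  ) AS sma_20
FROM `my-project.market.bars_1min`
"""

client = bigquery.Client()
features = client.query(SQL).to_dataframe()  # lagged returns + 20-bar SMA
```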

And for those who have worked with intraday stock data before:

  • What frequency do you recommend to avoid overfitting and get good results?
  • Which technical indicators work best at this frequency (1 min)? I'm asking because moving averages, MACD, and many other indicators are usually calculated over periods of days, and I'm not sure whether they can be applied at this frequency as well. (A sketch of what I mean by translating the periods into bars follows this list.)
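On the last point, one thing I've been wondering is whether it's enough to express the indicator periods in bars instead of days, possibly after resampling the 1-minute data to coarser bars. A sketch of what I mean; the bar size, column names, and 20 / 12 / 26 / 9 parameters are just the usual defaults reused as bar counts:

```python
import pandas as pd

def add_indicators(df: pd.DataFrame, bar: str = "5min") -> pd.DataFrame:
    """Resample 1-min bars and compute indicators with periods in bars."""
    bars = df.resample(bar).agg(
        {"open": "first", "high": "max", "low": "min",
         "close": "last", "volume": "sum"}
    ).dropna(subset=["close"])  # drop empty intervals (nights, weekends)

    close = bars["close"]
    bars["sma_20"] = close.rolling(20).mean()  # 20 bars, not 20 days
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    bars["macd"] = ema_fast - ema_slow
    bars["macd_signal"] = bars["macd"].ewm(span=9, adjust=False).mean()
    return bars
```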
