Full Stack Deep Learning — Data Management/Lukas Biewald

Source: Deep Learning on Medium

A lot of progress comes from building the dataset → a lot of time goes into processing the data → this is a common pattern in data science projects in general.

The name of the game in deep learning → is data collection, and more.

Another example → is the data flywheel → a more realistic approach is to let people label their own data → something like CAPTCHA → this is a good approach.

Using some kind of synthetic data → is another important approach → generating fake data.

We need a good method → for labeling and managing data → where do we store it, and how do we use it?

Versioning data is also important → data is collected as time goes by → so we need to version it (as well as build a data pipeline).

There are data labeling companies → they can do both 2D and 3D keypoints and more. (few-shot learning can be used → there is an AI system that assists the annotators)

QA is key → it is how we make sure the annotation is good. (we can hire people → or outsource this → Amazon's services are one option → the best annotators are promoted to managers → but we need to remember → if we use Amazon → privacy might be a problem).
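One common way to do QA on annotations → have several annotators label the same item and check how often they agree. The sketch below is only an illustration under that assumption, not the method from the lecture → `majority_label`, `flag_for_review`, the item ids, and the `min_agreement` threshold are all hypothetical names and values.

```python
from collections import Counter


def majority_label(labels):
    """Return (winning_label, agreement) for one item's annotations."""
    if not labels:
        raise ValueError("need at least one annotation")
    winner, votes = Counter(labels).most_common(1)[0]
    return winner, votes / len(labels)


def flag_for_review(annotations, min_agreement=0.6):
    """Return ids of items whose annotators agree less than min_agreement."""
    return sorted(
        item_id
        for item_id, labels in annotations.items()
        if majority_label(labels)[1] < min_agreement
    )


# Hypothetical annotations: three annotators per image.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "bird"],
    "img_003": ["dog", "dog", "cat"],
}

print(flag_for_review(annotations))
```

Items with low agreement can be sent to a more trusted reviewer → the rest keep their majority label.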

Crowdsourcing → is another choice → it is cheaper, but the annotation quality is usually pretty bad. (hiring a company that does labeling → if you have the money → is the best choice).

But this all depends on the industry → for medical data → we need professional annotators. (there are a couple of companies for this).

Figure Eight → is one company that provides these services → Labelbox and others as well. (but again they are pricey → we can just use their software → so rather than hiring the company → only use the software).

Data system → where we store the data and how we manage the stored data → is very important. (storage can be distributed or parallel).

Object storage → SQL works for structured data → but what if our data is not text? (Amazon S3 is one service for storing such data → Amazon is king when it comes to these services).

SQL → is good for general data.

Data Lake → wow.

So there are multiple steps → we have some data → and depending on the task → we need to transform the data and get it ready for that task.

But at training time → it is a good idea → to have the data stored in one place → since the IO time is long. (there are different levels when it comes to data management).

Data versioning level 2 → might be a good level → a combination of web services and a way of accessing them. (this is a good approach → the version of the dataset is stored → hence we are able to train the model again).
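A rough sketch of what storing the version of the dataset could look like → hash every file, keep the hashes in a small manifest, and derive a version id from it. This is an assumption about one way to do it, not the tool from the lecture → `build_manifest`, `dataset_version`, and the in-memory files standing in for object storage are hypothetical.

```python
import hashlib
import json


def build_manifest(files):
    """Map each file name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def dataset_version(manifest):
    """Derive a short, order-independent version id from a manifest."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


# Hypothetical in-memory stand-ins for files kept in object storage.
files_v1 = {"img_001.jpg": b"pixels-1", "labels.json": b'{"img_001": "cat"}'}
files_v2 = {"img_001.jpg": b"pixels-1", "labels.json": b'{"img_001": "dog"}'}

v1 = dataset_version(build_manifest(files_v1))
v2 = dataset_version(build_manifest(files_v2))
print(v1, v2, v1 != v2)
```

The manifest is small enough to live next to the training code → so a given commit points at exactly one version of the data, and the model can be retrained on it later.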

There is an open-source → file manager for ML projects → if needed we can use its CLI. (but we have to decide for each project).

Each piece of data is stored in a different place → so we use graphs to model the relationships between the data → this is complicated → a pretty stupid idea.

Wow → there is a Python library that is able to manage these workflows. (ML → really is Python's game → there is also RabbitMQ).
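To make the workflow idea concrete → tasks and their dependencies form a graph, and a valid run order is a topological sort of it. The sketch below uses `graphlib` from the Python standard library (3.9+), not the workflow library mentioned in the talk → the task names and `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can run.
pipeline = {
    "download_images": set(),
    "download_labels": set(),
    "join_images_and_labels": {"download_images", "download_labels"},
    "train_model": {"join_images_and_labels"},
}


def execution_order(graph):
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipeline)
print(order)
```

A real workflow manager adds scheduling, retries, and distribution across workers on top of this → the dependency graph is only the core idea.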

The workload distribution is also important.

Weights & Biases → the CEO is talking → excitement around deep learning → has grown a lot since the old days. (AI salaries are now big → but → this is not going to last long).

However, there is some fake news too → IBM failed on the Watson cancer project → but we are now able to run some networks → and some things are working.

Brands → want to recognize their brands on social media → how and where they are showing up. (skin classification is another application).

3D pose estimation → and the TSA using it in airports → we are able to kill harmful plants → from tractors. (or use robots to check on shelves).

Starting from cancer → there is a lot we are able to achieve. (but actually building a working system → is very hard → it is not the same as software engineering → developing software is not the same as deploying ML services).

What will the future look like? → No one knows → which problems are easy and which are hard? (he ran a Kaggle competition → in the first week a lot of progress was made → but as time went by the approaches did not reach state-of-the-art performance).

Okay → we are making progress, but we really don’t know when self-driving cars are going to arrive. (ML is also unpredictable → state-of-the-art ML may not generalize well).

We are in a period where → some systems are working while others are not. (explainability → is another important research area → real-world damage can be done).

When we have more data → we are able to get better performance.

As time goes on → if we just keep adding data → the performance of the model increases → that’s about it → collect the data. (active learning is another powerful design pattern).
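One simple form of active learning is uncertainty sampling → instead of labeling everything, send the examples the model is least sure about to the annotators first. A minimal sketch, assuming the model gives class probabilities for each unlabeled item → `least_confident` and the example predictions are hypothetical.

```python
def least_confident(probabilities, k):
    """Return the k item ids whose top class probability is lowest."""
    ranked = sorted(probabilities, key=lambda item_id: max(probabilities[item_id]))
    return ranked[:k]


# Hypothetical model outputs for four unlabeled items (two classes).
predictions = {
    "a": [0.98, 0.02],
    "b": [0.55, 0.45],
    "c": [0.70, 0.30],
    "d": [0.51, 0.49],
}

to_label = least_confident(predictions, 2)
print(to_label)
```

The selected items get labeled and added to the training set → then the model is retrained and the loop repeats.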

In medicine → there is a good product → that combines a hearing aid with deep learning. (self-driving luggage and trash cans → are other applications).

At the end of the day, → we have to solve a business problem.