The Concept and Vision of Bottos
The market researcher IDC predicts that by 2025, the size of our "digital universe" (all data created and copied each year) will reach 175 zettabytes, with one zettabyte equalling one trillion gigabytes. For perspective: if you were to use broadband to transmit the entirety of this universe, you would still be transmitting 450 years later.
From toilet seats to toasters, subways to wind turbines, more and more devices are becoming sources of data. The photos and videos uploaded to our social networks, the vast amounts of information generated by people simply commuting to work: all of this can be summarized as a torrent of data. And everyone is racing to collect even more.
With the rise of Artificial Intelligence (AI) technology, the quality and scale of data have become a bottleneck in its development.
Back in 2001, Michele Banko and Eric Brill, two researchers at Microsoft, published an eye-opening paper titled "Scaling to Very Very Large Corpora for Natural Language Disambiguation".
In their research, they observed that most systems in the natural language processing field were trained on fewer than a million words: a very small data set.
On such small corpora, older algorithms, such as Naïve Bayes or the perceptron, had error rates as high as 25%, while newer, memory-based algorithms reached approximately 19%.
These are the four data points on the far left of the image below.
However, there were more surprises. Banko and Brill showed a remarkable result in their paper: as you add more data, several orders of magnitude more, while keeping the algorithm the same, the error rate drops dramatically.
By the time three more orders of magnitude of data have been added, the error rate drops below 5%. In many areas, the difference between an error rate of 18% and one of 5% can make or break a system's practical use. The open question is whether error rates will continue to decrease significantly as the data scale reaches even higher orders of magnitude.
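Banko and Brill's trend can be reproduced in miniature: hold a simple learning algorithm fixed and grow the training set by orders of magnitude. The sketch below uses a synthetic task and plain logistic regression; all sizes, dimensions, and the 5% label-noise floor are illustrative assumptions, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50                                    # feature dimension: hard to learn from little data
w_true = rng.normal(size=D)               # hidden "ground truth" direction

def make_data(n):
    """Synthetic classification task with 5% label noise (the error floor)."""
    X = rng.normal(size=(n, D))
    y = (X @ w_true > 0).astype(float)
    flip = rng.random(n) < 0.05
    y[flip] = 1 - y[flip]
    return X, y

def train_logreg(X, y, lr=0.1, epochs=50):
    """Plain batch gradient descent on the logistic loss."""
    w = np.zeros(D)
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_test, y_test = make_data(20000)
errors = {}
for n in [100, 1000, 10000, 100000]:      # orders of magnitude, same algorithm
    X, y = make_data(n)
    w = train_logreg(X, y)
    errors[n] = float(np.mean((X_test @ w > 0) != y_test))
    print(f"n={n:>6}  test error = {errors[n]:.3f}")
```

With the algorithm held constant, the test error falls toward the noise floor as the training set grows: the same qualitative effect Banko and Brill reported.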
However, it's not all about the amount of data; the quality of that data also plays a decisive role. Train a model on junk data and you end up with a junk model.
"Junk" data may originate from malicious faults, such as data that has been tampered with, or from non-malicious failures: IoT sensor faults, data sources going out of order, or environmental radiation causing bit flips.
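Tampering and silent bit flips of this kind can be caught with content hashing, the same integrity primitive that blockchains build on. A minimal sketch, with a made-up sensor record:

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Deterministic SHA-256 digest of a data record (sorted keys for stability)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical IoT reading, fingerprinted at collection time.
reading = {"sensor_id": "temp-07", "value": 21.5, "ts": 1690000000}
stored_digest = fingerprint(reading)

# Later: even a tiny corruption of the value changes the digest.
corrupted = dict(reading, value=21.6)
assert fingerprint(reading) == stored_digest      # intact copy verifies
assert fingerprint(corrupted) != stored_digest    # tampering/corruption is detected
```

Storing such digests on an immutable ledger is what lets a consumer of the data prove, after the fact, that a training set was not altered between collection and use.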
Consider AlphaGo, the Go-playing AI program. It is not hard to imagine that having the model analyze a massive archive of game records alone was not enough to teach it to play well. Over two hundred experts assisted in the creation of a model that would later defeat a champion: once its machine-learning capability was introduced, the model was able to beat Ke Jie, who at the time (2017) was ranked number one in the world, winning their three-game Go match 3–0.
In the earlier stages of the internet, once a certain threshold of data had been collected, the marginal value of new data decreased.
But this is not the case in the era of artificial intelligence. The self-learning capabilities of these algorithms are constantly improving: the more data, and the more recent the data fed back to these models, the better their results will be. In short, large amounts of data of varying quality may improve an AI model, but only high-quality, recent data will allow a model to keep evolving.
Artificial intelligence has moved from its first stage, driven by algorithms and computing power, to its second stage, driven by large amounts of structured and reliable data.
We are now experiencing the transformation from the second stage towards the third stage of AI.
This can be seen from multiple perspectives. On the one hand, deep learning models are being upgraded, for example with transfer learning and multi-task learning. On the other hand, data efficiency is improving: while millions of rough examples were initially required to train a model, the same effectiveness can now be achieved with thousands to tens of thousands of refined examples.
While AI has been very influential in the fields of speech and image processing, its applications in health care, education, and daily life are still few. The first reason is that the data we encounter in those domains is often small data: personal data on mobile phones, test data in education and medical care, customer-service Q&A data, and so on.
The second is that deep learning models use nonlinear computation to transform raw features layer by layer, from low-level to high-level representations. This process is complex and fragile: deviate even slightly from the scenarios seen in training, and performance degrades.
The third is the application problem, especially the personalization of machine learning models. When recommending information and services on a mobile phone, for example, results should be tailored to the individual. Since any individual's data is small data, the personalization problem is how to adapt the cloud's general model to the small data on the device and make it work. This requires upgrading deep learning models, such as transfer learning models, in the direction of training on small data. A model whose parameters have been optimized for one kind of task can retain strong performance on related tasks, which helps machine learning migrate from the cloud to mobile devices.
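This cloud-to-device adaptation is commonly sketched as freezing a pretrained feature extractor and retraining only a small task head on the user's data. Everything below (shapes, the random "pretrained" weights, the 40-example personal data set) is an illustrative assumption, not an actual production pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Cloud" model: a feature extractor pretrained on large generic data.
# Here it is just a random projection; on-device, it is never updated.
W_frozen = rng.normal(size=(64, 16))          # 64-dim input -> 16-dim features

def features(X):
    return np.tanh(X @ W_frozen)              # frozen: no gradient flows here

# Small personal data set on the device (a few dozen labeled examples).
X_personal = rng.normal(size=(40, 64))
y_personal = (X_personal[:, 0] > 0).astype(float)

# Fine-tune only a lightweight linear head on the frozen features:
# logistic regression via gradient descent, cheap enough for a phone.
head = np.zeros(16)
F = features(X_personal)
for _ in range(300):
    p = 1 / (1 + np.exp(-F @ head))
    head -= 0.5 * F.T @ (p - y_personal) / len(y_personal)

train_acc = float(np.mean((F @ head > 0) == y_personal))
print(f"on-device head accuracy: {train_acc:.2f}")
```

Only the 16 head parameters are personal; the heavy extractor stays shared, which is what makes per-user models on small data tractable.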
Once migrated to the device, each individual will have a fully personalized, decentralized learning model: a break with the centralized model of the artificial intelligence 2.0 era.
This is the democratization of AI that ImageNet founder and Stanford University professor Fei-Fei Li has been advocating. The decentralized AI model built by integrating blockchain with artificial intelligence and data will be a milestone of this technological era!
Generally speaking, the more accurately and consistently the data is labeled, the better the resulting model will be.
AI companies must find ways to accumulate more detailed and accurate data that fits their application directions. Different application directions require different data content and even different labeling methods.
This is a market of on-demand customization, and most artificial intelligence companies and crowdsourcing platforms cannot meet its requirements for scale and labeling quality at the same time. To a large extent, high-quality labeled data, along with data collection and data cleaning, determines the competitiveness of an AI company.
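One standard way to quantify labeling quality before training is inter-annotator agreement. Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance; the labels below are made up for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected chance agreement from each annotator's label frequencies.
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

ann1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
ann2 = ["cat", "cat", "dog", "cat", "cat", "dog", "cat", "dog"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")  # prints: kappa = 0.75
```

A kappa near 1 means the labeling guidelines are tight enough to produce consistent data; a low kappa is a warning that the "ground truth" itself is noisy, whatever the raw agreement percentage says.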
In the third stage of AI development, systems can not only provide personalized services for different users but also make different decisions in different scenarios. At this stage, the requirements for the dimensions and quality of collected data are higher. Decision-making schemes can be formulated in real time according to each scenario, steering events toward a good outcome and helping decision makers gain better insight into the root of an event and produce more accurate decisions.
In the age of artificial intelligence, the value of data will gradually emerge, and this data is giving birth to new economies. Continuously fed with data, artificial intelligence releases new productivity, while blockchain reconstructs the relations of production.
As the scale and quality of data continue to gather, and artificial intelligence models continue to iterate and accumulate, we will enter a new era of distributed artificial intelligence.
This is why Bottos was born: to create the first decentralized AI infrastructure, customized for the AI industry at every layer, from the underlying blockchain, to the service layer, to the application layer, and to build a distributed intelligent robotic system together with those working in AI around the globe.