Big data platform construction: Hadoop-based data analysis platform

The growth of the Internet has brought explosive growth in data of every kind: nearly every interaction with the Internet is captured and recorded as data. A hallmark of the big data era is full-sample analysis, and faced with data at the TB/PB scale and beyond, Hadoop has become the mainstream choice.

For enterprises that need to conduct large-scale data analysis, building a big data platform on open-source Hadoop and its ecosystem is undoubtedly a low-cost, high-efficiency choice.

Hadoop big data platform

After many years of development, Hadoop's core architecture remains a stable foundation of the big data technology ecosystem. The scalability, robustness, computing performance, and low cost of the Hadoop stack have made it the de facto mainstream big data analysis platform for Internet companies.

On top of Hadoop, a data system can be planned and designed around the actual business needs of the enterprise, choosing different analysis architectures and framework components to solve each concrete problem.

Big data analysis platform demand planning

Based on timeliness requirements, big data analysis can be divided into real-time analysis and offline analysis.

Real-time analysis is common in finance, mobile, and Internet B2C products. It typically requires that queries over hundreds of millions of rows return within a few seconds, so that the user experience is not affected.

The Hadoop ecosystem can accommodate these requirements through sensible planning. For the majority of applications with looser feedback-time requirements, such as offline statistical analysis, machine learning, inverted-index computation for search engines, and recommendation-engine computation, offline analysis is appropriate: log data is imported into a dedicated analysis platform through data collection tools.

Mainstream mass data collection tools, such as Facebook's open-source Scribe, LinkedIn's open-source Kafka, Taobao's open-source TimeTunnel, and Hadoop's Chukwa, can meet log collection and transmission requirements of hundreds of MB per second and upload the data to a central Hadoop system.
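To make the collection step concrete, here is a minimal sketch of shipping log lines through Kafka, one of the tools above. The broker address, topic name, and log format are hypothetical placeholders; on the receiving side, a consumer or connector would batch the records into HDFS.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogShipper {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is a placeholder for illustration.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each log line becomes one record on a hypothetical "weblogs" topic;
            // a downstream consumer or connector writes batches into HDFS.
            producer.send(new ProducerRecord<>("weblogs",
                    "host01", "GET /index.html 200 512"));
        } // try-with-resources flushes and closes the producer
    }
}
```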

In addition, by data volume, big data workloads fall into three tiers: memory level, BI level, and massive level, each of which needs to be considered separately with an appropriate solution.

Memory level here means that the data volume does not exceed the total memory of the cluster. In this case, an in-memory database can hold the hot data in memory to obtain very fast analysis, which suits real-time analysis services well; MongoDB is a very common choice for this.
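As a sketch of this hot-data pattern, the following uses the MongoDB Java driver to scan only the most recent records; the connection string, database, collection, and field names are assumptions for illustration.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Date;

import static com.mongodb.client.model.Filters.gte;

public class HotDataQuery {
    public static void main(String[] args) {
        // Connection string, database, and collection names are placeholders.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client
                    .getDatabase("analytics")
                    .getCollection("events");

            // Hot data: only the last hour of events. Because the working set
            // is memory-resident, this scan avoids disk-bound historical data.
            Date oneHourAgo = Date.from(Instant.now().minus(1, ChronoUnit.HOURS));
            for (Document doc : events.find(gte("timestamp", oneHourAgo))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```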

BI level refers to data volumes too large for memory; mainstream BI products offer analysis solutions that support terabytes and more. There are many such products, so they are not listed in detail here.

Massive level refers to data volumes for which conventional databases and BI products fail entirely or become too expensive. In such scenarios, Hadoop is undoubtedly the low-cost, efficient solution.
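At this tier, the workhorse is a distributed batch job. The canonical example is Hadoop's MapReduce WordCount, shown below in slightly condensed form: mappers emit (word, 1) pairs, a combiner pre-aggregates locally to reduce shuffle traffic, and reducers sum the counts. The input and output paths are HDFS directories passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word across all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```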

That concludes today's sharing on big data platform construction with a Hadoop-based data analysis platform. After many years of development, Hadoop still holds an important position in the big data market, and mastery of its related technologies remains an important skill requirement for practitioners in the industry.