7 Challenges Faced By Data Scientists In Data Processing In 2020
Each day we generate 2.5 quintillion bytes of data. The data we generate while using the internet is raw, and an organization cannot use it as-is to make data-driven decisions. We wanted to understand how data scientists have evolved in 2020 and which tools they now use to tackle the challenges posed by these new and different forms of data. We therefore asked AIM Expert Network (AEN) members to share their insights on the challenges they face while processing different types of raw data, and how they convert that data into valuable assets for their organization.

While building the anomaly detection system for our app, we faced an issue with the huge amount of data coming from different sources and databases. The biggest challenge was to take all forms of data generated by the app and bring them into one single format to centralize observation. Querying the production database in real time was also not possible. So the action required was to bring this unstructured data together in one database. For this, we used Google's BigQuery, a cloud data warehouse, to store the data. Twice daily, a refresh cycle pulled data from our database and files and loaded it into BigQuery. After this, we worked on processing the data further.

2. Unlocking value out of Unstructured Text Data

A major chunk of the data stored by enterprises around the world is unstructured text data. Traditionally, analysts around the world have spent an enormous amount of time, effort and resources on data processing, transforming unstructured text data into a standardized format in order to find insights in it. Overall, results have varied due to the lack of the right technology. Unfortunately, mostly low-intelligence insights were derived from the data, and the benefits were outweighed by the cost.
Recently, enterprises have realized the impact of using ontologies, whose increased adoption has reduced the data processing burden on data engineers. Ontologies help define a common vocabulary and support smooth knowledge management. Also, with the increased maturity and awareness of graph databases (such as Neo4j, Amazon Neptune, etc.), which are used for knowledge management and for finding connections in text data, organizations are able to unlock value from unstructured data.

3. Setting up the infrastructure and velocity of data

The primary challenge in handling modern data requirements (especially streaming) is setting up the infrastructure, owing to the high volume and velocity of data. This can be handled very efficiently by using the data streaming services of a cloud platform like Microsoft Azure. Two PaaS services stand out here: Azure Stream Analytics and Azure Databricks. The former is a first-party streaming service that works well with messaging services like Azure IoT Hub or Event Hub. The article 'An Introduction to Azure IoT with Machine Learning' elaborates on this. The latter, Azure Databricks, is a unified analytics platform that can be used to implement a Lambda Architecture. Details on this can be found in the article 'Lambda Architecture with Azure Databricks'.

4. Adapting to different tools to collect unstructured data

The biggest challenge in data processing, now and going forward, is the change in the type of data that is coming in. Previously all the data was structured, but now a lot of data arrives in an unstructured format from numerous sources like social media platforms, emails or shared cloud storage platforms. Analyzing, processing and storing this data is a challenge that organizations are grappling with even today.
In my opinion, the first thing any organization looking to become more data-driven needs to do is revisit its data strategy, including its data collection mechanisms, data entry points, and the tools used for data processing and integration.
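To make the idea of data entry points and integration a little more concrete, here is a minimal Python sketch of bringing records from two different sources into one common format. Everything in it is an illustrative assumption rather than something the experts described: the `Record` schema, the field names of the social media post (`user`, `ts`, `body`), and the sample data are all hypothetical. It only uses the Python standard library.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


@dataclass
class Record:
    """Hypothetical common schema shared by every data entry point."""
    source: str
    author: str
    created_at: datetime  # always stored in UTC
    text: str


def from_social_post(raw_json: str) -> Record:
    # Assumed post layout: {"user": ..., "ts": <Unix seconds>, "body": ...}
    post = json.loads(raw_json)
    return Record(
        source="social",
        author=post["user"],
        created_at=datetime.fromtimestamp(post["ts"], tz=timezone.utc),
        text=post["body"].strip(),
    )


def from_email(raw_email: str) -> Record:
    # Parse a plain-text (non-multipart) email with From and Date headers.
    msg = message_from_string(raw_email)
    return Record(
        source="email",
        author=msg["From"],
        created_at=parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc),
        text=msg.get_payload().strip(),
    )


# Hypothetical sample inputs from two entry points.
sample_post = '{"user": "alice", "ts": 1578304800, "body": " App keeps crashing "}'
sample_email = (
    "From: bob@example.com\n"
    "Date: Mon, 06 Jan 2020 15:30:00 +0530\n"
    "\n"
    "Login fails after the update.\n"
)

records = [from_social_post(sample_post), from_email(sample_email)]
for record in records:
    print(record.source, record.author, record.created_at.isoformat(), record.text)
```

Once every source is mapped onto the same schema, with timestamps normalized to UTC, the records can be loaded into a single store and analyzed together. Each new source then only needs its own small adapter function.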
Posted on 7wData.be