Big Data - An Introduction

Big Data refers to storing, analyzing, and managing data that is too large and complex for traditional RDBMSs to handle. Big data technologies such as Hadoop make this possible. The field did not exist a few decades ago; it emerged as digital technologies grew and it became feasible to store and process the data they produce.



Industries across sectors realized the benefits of analyzing their data and developed many applications for it. One profitable application is personalized marketing. Companies like Amazon and Netflix built personal recommendation engines for their consumers. Amazon also uses sentiment analysis (NLP) on its customer reviews to provide a better customer experience based on the positive or negative reviews of a product. Every click a user makes in a browser can be recorded, and users are then shown personalized ads that may suit their interests.

MNCs also analyze the collective behavior of their customers in order to make important business decisions. Big data technologies have also made a large impact in the medical field. With better symptom detection techniques, many deadly medical conditions can be caught and treated early. Big data also plays a large part in genome sequencing at scale, which helps in identifying genetic disorders and mutations.

Big data comes in three forms:

  • Structured (like RDBMS tables)

  • Semi-structured (like log files)

  • Unstructured (like images, social media data)
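To make the difference between these forms concrete, the sketch below turns one semi-structured log line into a structured record whose fields could be stored as a row in an RDBMS table. The log line and its layout are hypothetical, chosen only for illustration; real log formats vary and need their own patterns.

```python
import re

# Hypothetical web-server log line; the layout is an assumption for illustration.
LOG_LINE = (
    '192.168.1.10 - - [12/Mar/2021:10:15:32 +0000] '
    '"GET /products/42 HTTP/1.1" 200 5123'
)

# Named groups give each piece of the line a column-like name.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so they behave like typed table columns.
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


record = parse_log_line(LOG_LINE)
print(record)
```

Unstructured data such as images has no comparable field layout to extract, which is why it usually needs heavier processing before it can be analyzed.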

Though big data is beneficial, not every organization needs to adopt big data technologies or change its traditional processing methods. The need for big data arises only when the problem meets one or more of these criteria:

  • Volume (high volumes of data in TB or PB or even more)

  • Variety (Structured, Unstructured and Semi-structured)

  • Velocity (the speed of data generation, loading, and analysis)

  • Veracity (the quality and validity of the data)

  • Valence (how data items are connected to each other)

If a firm faces one or more of the above challenges, it can use big data technologies to address them and gain useful insights. Data science is the practice of getting value, or useful insights, out of big data. The professionals who make this process possible are called data scientists; they combine skills such as data engineering, the scientific method, math, statistics, advanced computing, domain expertise, a hacker mindset, and visualization.

Big Data + Analysis Question → Insight

Now let’s learn more about how a company builds its big data strategy.


 

Building a Big Data strategy

Normally a strategy involves an aim, a policy, a plan, and action. A big data strategy starts with defining business objectives, which can be long-term or short-term goals. Business objectives are the questions that we need to ask in order to turn big data into insights.


Next, the leadership team should take the initiative and support the process. The reason for involving data science must be understood and supported at all levels for the implementation to succeed. Leadership should build data science teams with diverse expertise that deliver as a team. Training existing employees is often more effective than recruiting new ones, since existing employees already have better domain knowledge. The organization can also open R&D labs that research and communicate findings to be implemented at a larger scale.


Sharing data within the organization should be encouraged by removing barriers to data access and eliminating data silos. Data silos are compartmentalized sets of data within an organization that have no connection with each other. They lead to outdated, unsynchronized, or even invisible data, and they hinder the business from finding new opportunities.


The organization as a whole should define big data policies that cover the privacy and lifetime of the data, curation and quality, interoperability, and regulation. An analytics-driven culture helps teams work together and produce better outcomes.


Once a strategy is built, the data scientists need to ask the right questions in order to arrive at the best insights.


 


Steps in the Data Science Process

Acquiring Data

After establishing its objectives, the organization's first goal is to identify what kind of data it needs to answer its questions and how to acquire that data. Data comes from many places and in many formats, and data scientists should choose appropriate methods to access it.

The common data sources are:

  • Traditional DB ⇒ Queried with SQL from relational databases such as MySQL or Oracle

  • Text files ⇒ Read with programs written in scripting languages (JS, Python, R, PHP, Ruby)

  • Remote data ⇒ Acquired from the internet through web services such as SOAP (XML) and REST (typically JSON over HTTP), over WebSocket connections, or from feeds such as RSS

  • NoSQL ⇒ Accessed through APIs or web services (HBase, Cassandra, MongoDB)
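As a rough illustration of the first three sources, the sketch below loads the same kind of record from a SQL database, a text file, and a JSON payload of the sort a REST service might return. Everything here is an assumption made for the example: the `orders` table, its columns, and the sample values are hypothetical, an in-memory SQLite database stands in for a server such as MySQL or Oracle, and in-memory strings stand in for a real file and a real HTTP response. NoSQL stores are left out because each one needs its own client library.

```python
import csv
import io
import json
import sqlite3


def rows_from_sql():
    """Query a relational table (in-memory SQLite as a stand-in for MySQL/Oracle)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?, ?)",
            [(1, "alice", 20.0), (2, "bob", 35.5)],
        )
        cursor = conn.execute("SELECT id, customer, amount FROM orders ORDER BY id")
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, row)) for row in cursor]
    finally:
        conn.close()


def rows_from_text(text):
    """Parse CSV text; note that every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(payload):
    """Decode a JSON document shaped like a hypothetical REST response body."""
    return json.loads(payload)["orders"]


sql_rows = rows_from_sql()
text_rows = rows_from_text("id,customer,amount\n3,carol,12.0\n")
json_rows = rows_from_json('{"orders": [{"id": 4, "customer": "dave", "amount": 7.25}]}')

for source, rows in [("sql", sql_rows), ("text", text_rows), ("json", json_rows)]:
    print(source, rows)
```

The three sources return differently typed values for the same fields (the CSV reader yields strings, the others yield numbers), which is one reason acquired data usually has to be explored and preprocessed before analysis.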

Exploring and Preprocessing Data

Analyzing the Data

Reporting Insights

Turning Insights into Action