Big Data refers to storing, analyzing, and working with data that is too large and complex for traditional RDBMS systems to handle. Several big data technologies, such as Hadoop, make this possible. Big Data did not exist a few decades ago. Growing digital technologies, along with the ability to store and process the data they produce, led to the start of the big data era.
Industries in different sectors realized the benefits of analyzing this data and arrived at several applications. One of the most profitable is personalized marketing. Companies like Amazon and Netflix started creating and using personal recommendation engines for their consumers. Amazon also uses sentiment analysis (NLP) on its customer reviews to provide a better customer experience based on the positive or negative reviews of a product. Clicks a user makes in a browser can be recorded, and users are shown personalized ads that might suit their interests.
MNCs also analyze the collective behavior of their customers in order to make important business decisions. Big data technologies have also made a large impact in the medical field. With better symptom detection techniques, many deadly medical conditions can be treated early. Big data also plays a huge part in sequencing genomes, which helps in identifying genetic disorders and mutations.
Looking more closely at the forms of big data, it can be:
Structured (like RDBMS tables)
Semi-structured (like log files)
Unstructured (like images, social media data)
Though Big Data is beneficial, not every organization needs to implement big data technologies and process its data differently from its traditional processing methods. The need for Big Data arises only when the problem meets at least one of these criteria:
Volume (high volumes of data in TB or PB or even more)
Variety (Structured, Unstructured and Semi-structured)
Velocity (the speed of data generation, loading, and analysis)
Veracity (the quality and validity of the data)
Valence (how data items are connected to each other)
If a firm faces one of the above complications, it can use big data to address its problems and get useful insights. Getting value or useful insights out of Big Data is Data Science. The professionals who make the data science process possible are called data scientists. They possess skills such as data engineering, the scientific method, math, statistics, advanced computing, domain expertise, a hacker mindset, and visualization.
Big Data + Analysis Question → Insight
Now let’s learn more about how a company builds its big data strategy.
Building a Big Data strategy
Normally a strategy involves Aim, Policy, Plan, and Action. Here the strategy starts with defining Business objectives or goals. They can be long-term or short-term goals. Business Objectives are the questions that we need to ask in order to turn big data into insights.
Then, the leadership team should take the initiative and support the process. The reason for involving data science must be understood and supported at all levels for the implementation to be successful. The leadership team should build data science teams with diverse expertise, and those teams should deliver together. The shift can often be made more effective by training existing employees rather than recruiting new ones, as existing employees have better domain knowledge. The organization can also open R&D labs that research and communicate findings to be implemented at a larger scale.
Sharing data within the organization should be encouraged by removing barriers to data access and eradicating data silos. Data silos are compartmented data within an organization that have no connection with each other. They lead to outdated, unsynchronized, even invisible data, and hinder opportunity generation for the business.
The organization as a whole should define big data policies that involve privacy and lifetime of the data involved, curation and quality, interoperability, and regulation. An analytics-driven culture helps the teams to work together and provide better outcomes.
Once a strategy is built, the data scientists need to ask the right questions in order to arrive at the best insights.
Steps in Data Science Process
After establishing its objectives, the organization's first goal is to identify what kind of data it needs to find answers and how to acquire that data. Data comes from many places and in many formats, and data scientists should choose suitable methods to access it.
The common data formats are,
Traditional DB ⇒ They can be queried from SQL DBs such as MySQL, Oracle
Text files ⇒ They can be obtained by programs created with scripting languages (JS, Python, R, PHP, Ruby)
Remote data ⇒ Data from the internet can be acquired through web services such as SOAP (XML) and REST (HTML, JSON), through WebSockets, or through feeds such as RSS
NoSQL ⇒ Stores such as HBase, Cassandra, and MongoDB can be accessed via their APIs or web services
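As a rough illustration of the acquisition methods listed above, the following Python sketch reads from three kinds of sources. The table, column names, and sample values are made up for the example. An in-memory SQLite database stands in for a SQL database such as MySQL or Oracle, and plain strings stand in for a text file on disk and a REST response.

```python
import csv
import io
import json
import sqlite3

# Traditional DB: query rows with SQL.
# Assumption: in-memory SQLite stands in for MySQL/Oracle; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)
db_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# Text file: parse CSV content (a string stands in for a file on disk).
csv_text = "customer,amount\ncarol,20.0\ndave,5.0\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Remote data: decode a JSON payload such as a REST service might return.
json_payload = '{"customer": "erin", "amount": 42.0}'
remote_record = json.loads(json_payload)

print(db_rows)
print(csv_rows)
print(remote_record)
```

In a real project, each source would have its own connection details and client library. The point here is only that every source ends up as ordinary rows or records that the later steps can work with.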
Exploring and Preprocessing Data
The first step in exploring the data is to understand what is present in it and to recognize trends.
To understand the data:
We have to understand how the different variables depend on each other (correlations between the variables)
We have to figure out whether the values move in a consistent direction, that is, identify the general trends
Then we check for errors, that is, identify the outliers that might affect the outcome of the analysis
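The three checks above can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: it assumes Pearson correlation for dependencies, the slope of a least-squares line for the trend, and the common 1.5 × IQR rule for outliers. The function names and sample data are hypothetical.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def trend_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    xs = list(range(len(values)))
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(values)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample data.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
daily_visits = [10, 12, 11, 13, 12, 11, 95]

correlation = pearson(ad_spend, sales)   # close to 1: strong positive dependency
slope = trend_slope(sales)               # positive slope: upward trend
outliers = iqr_outliers(daily_visits)    # values far from the bulk of the data

print(correlation, slope, outliers)
```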
Once we understand the data, we can describe it by applying statistical methods. We can find:
where the data is centered, through the mean and median
how far and wide the data is spread, through the range and standard deviation
which value occurs most frequently, through the mode
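These summary statistics can be computed with Python's standard statistics module. The sample values below are made up for illustration.

```python
import statistics

# Hypothetical sample, e.g. number of items per order.
values = [4, 8, 8, 5, 3, 8, 6]

summary = {
    "mean": statistics.mean(values),      # location: arithmetic average
    "median": statistics.median(values),  # location: middle value when sorted
    "range": max(values) - min(values),   # spread: largest minus smallest
    "stdev": statistics.stdev(values),    # spread: sample standard deviation
    "mode": statistics.mode(values),      # most frequent value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```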
Then we can visualize the data to get an even better idea of these details, using heat maps (where the hotspots are), histograms (distribution of data, skewness or unusual dispersion), box plots (data distribution), line graphs (how values change over time), or scatter plots (correlation between two variables).
After exploring the data, we have to preprocess it into a suitable format, as real-world data is messy. There can be inconsistent values, duplicate values, missing values, invalid data, and outliers. To improve the quality of the data, we can remove records with missing values, merge duplicate records, generate a best estimate for invalid values where suitable, and remove outliers if necessary.
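A minimal Python sketch of these cleaning steps follows. The record layout, the valid age range of 0 to 120, and the choice of the median as the "best estimate" are assumptions made for the example. Real data would need rules chosen with domain knowledge.

```python
import statistics

# Hypothetical raw records with typical quality problems.
records = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},   # missing value
    {"id": 1, "age": 34, "city": "Pune"},      # duplicate of id 1
    {"id": 3, "age": -5, "city": "Mumbai"},    # invalid value
    {"id": 4, "age": 41, "city": "Chennai"},
]

def is_valid_age(age):
    # Assumption: ages outside 0..120 are treated as invalid.
    return 0 <= age <= 120

def clean(rows):
    # 1. Remove records with missing values.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # 2. Merge duplicate records, keeping the first record seen for each id.
    unique = {}
    for r in complete:
        unique.setdefault(r["id"], r)
    deduped = list(unique.values())
    # 3. Replace invalid ages with a best estimate (here: median of valid ages).
    estimate = statistics.median(r["age"] for r in deduped if is_valid_age(r["age"]))
    return [r if is_valid_age(r["age"]) else {**r, "age": estimate} for r in deduped]

cleaned = clean(records)
for row in cleaned:
    print(row)
```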
The data we acquire directly from the data sources might not suit the format needed for analysis. We can apply certain techniques to manipulate the data to suit our needs:
Scaling changes the values so that they fall within a specified range. This prevents a few features with large values from dominating the results.
Data can be transformed to remove noise and reduce variability.
When data has too many dimensions, we can reduce them. This involves finding a smaller set of dimensions that captures most of the variation in the data. Principal Component Analysis (PCA) is commonly used for this.
Through feature selection, we can remove redundant or irrelevant features, combine features, or add new ones.
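Two of these techniques are simple enough to sketch in a few lines of Python. The example assumes min-max scaling as the scaling method and a moving average as one possible noise-reducing transformation. The function names and sample values are illustrative. PCA and feature selection are left out, since they are usually done with a numerical library.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical feature with large values that could dominate other features.
incomes = [20000, 50000, 80000]
scaled = min_max_scale(incomes)

# Hypothetical series to be smoothed.
readings = [1, 2, 3, 4, 5]
smoothed = moving_average(readings)

print(scaled)
print(smoothed)
```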
Analyzing the Data
INPUT DATA → ANALYSIS TECHNIQUE → MODEL → MODEL OUTPUT
Categories of analysis techniques:
Classification: to predict the category of the input data
Regression: to predict a numerical value
Clustering: to organize similar items into groups
Association analysis: to find rules that capture associations between items (e.g., market basket analysis)
Graph analysis: to use graph structures to find connections between entities
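To make one of these categories concrete, here is a small Python sketch of association analysis on market baskets. It computes the two standard measures for a rule such as "bread → milk": support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The baskets are invented for the example.

```python
# Hypothetical shopping baskets, each a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

bread_support = support({"bread"}, baskets)
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)

print(bread_support, rule_support, rule_confidence)
```

Rules with high support and high confidence are the usual candidates for further investigation. The thresholds depend on the business question.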
Select Technique → Build model → Validate model
To validate the model, we divide the input data into two parts, one to build the model and one to evaluate it. Different techniques are evaluated using different methods.
Evaluating the model
In classification and regression models, we compare the predicted values with the correct values to evaluate the model.
In clustering, the outcomes are compared with the business goals
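The split and the comparison of predicted against correct values can be sketched in Python as follows. The 75/25 split, accuracy as the classification measure, and mean squared error as the regression measure are assumptions for the example. The article does not name specific metrics, and all function names and values here are hypothetical.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Classification: fraction of predictions that match the correct label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    """Regression: average squared difference between prediction and truth."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Divide hypothetical input data into two parts.
train, test = train_test_split(list(range(8)))

# Compare hypothetical model outputs with the correct values.
acc = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])
mse = mean_squared_error([2.0, 4.0], [1.0, 6.0])

print(len(train), len(test), acc, mse)
```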
Once the model is evaluated, we can determine what to do next with the analysis. If the results are bad, we might have to tune our model and repeat the process. If the results are good, we can use the outcome.
Once we are satisfied with the outcome of the analysis, we must decide what to present. Generally, all the findings must be presented. We can add more value to the insight by focusing on the main outcome, the added value, and how the results developed from the beginning to the end of the data science process.
Presenting the insights in an understandable way is just as important. One has to choose the best visualization method for each kind of outcome data. Many visualization tools can help data scientists present their outcomes, such as Tableau, D3, Google Charts, Leaflet, and Timeline JS.
Turning Insights into Action
If the leadership team and the stakeholders are satisfied with the outcome, they can use it for decision making. If they don't find it favorable, the analysis process might have to be repeated. The process is iterated until useful insights are acquired.