Introduction
Data drives the world today, and it is analyzed every second, whether through Google Maps on your phone, your Netflix viewing habits, or the contents of your online shopping cart. In many ways, data cannot be avoided, and its effects are highly disruptive.
Analytics, or data analysis, refers to the examination of data sets and drawing conclusions about the information they contain, most commonly through the use of programs, software, and methods. Across commercial business industries, data analytics technologies are used at an industrial scale for the purpose of making calculated, informed business decisions.
The term 'Big Data' refers to the use of specialized techniques and technologies to process data sets so large and complex that they are difficult to handle with traditional database management tools. Examples include weblogs, call logs, medical records, military surveillance footage, photos and videos, and the transaction data of large-scale e-commerce sites.
We consider a very large dataset to be one that requires at least one terabyte of storage, and often hundreds of petabytes (1 petabyte equals 1,024 terabytes). Facebook alone is estimated to store at least 100 petabytes of photos and videos.
TECHNIQUES FOR ANALYZING BIG DATA
There are several methods for analyzing datasets, drawn from disciplines such as statistics and computer science (particularly machine learning). The techniques covered in this section span a range of industries, and the list is by no means exhaustive: researchers are constantly developing new strategies and improving existing ones, largely driven by the need to analyze new combinations of data. Not all of these techniques strictly require big data; some, such as A/B testing and regression analysis, work well on smaller datasets, although all of them can be applied at big-data scale. In general, larger datasets with more diverse characteristics will generate more relevant results than smaller ones with fewer characteristics.
A/B testing
A method in which a control group is compared with one or more test groups to determine which changes improve a given objective variable, e.g., a marketing response rate. A/B testing is sometimes called split testing or bucket testing. It can be used, for example, to determine which copy, layout, image, or color will increase conversion rates on an e-commerce website. Big data enables a large number of tests to be initiated and analyzed, ensuring that groups are sizeable enough to detect meaningful (i.e., statistically significant) differences between the control and treatment groups. When more than one variable is manipulated simultaneously in the treatment, the multivariate generalization of this method, which applies statistical modeling, is called multivariate testing (comparing several variants of a single variable is known as A/B/n testing).
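Big-data platforms automate this comparison at scale, but the underlying significance check is small enough to sketch. The conversion counts below are invented for illustration; a minimal two-proportion z-test might look like:

```python
# Minimal A/B test sketch: two-proportion z-test on illustrative
# (invented) conversion counts for a control and a treatment group.
from math import sqrt, erf

def z_test_proportions(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the
    difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control converts 200/5000 visitors, treatment 260/5000
z, p = z_test_proportions(200, 5000, 260, 5000)
```

With these made-up numbers the difference is statistically significant at the usual 5% level; the point of "sizeable enough" groups is precisely that small lifts only become detectable once the sample sizes are large.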
Association rule learning
A set of methods for finding interesting relationships, i.e., "association rules," among variables in large databases. Various algorithms are used to generate and test candidate rules. One application is market basket analysis, in which a retailer determines which products are frequently purchased together and uses that information for marketing (a commonly cited example is the discovery that supermarket shoppers who buy diapers also tend to buy beer). Used for data mining.
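The two core measures behind most association-rule algorithms are support and confidence. A toy sketch on an invented basket of transactions, scoring the diapers-and-beer rule:

```python
# Toy market-basket sketch: support and confidence for the rule
# {diapers} -> {beer}. The transactions are invented for illustration.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

sup = support({"diapers", "beer"}, transactions)        # 2 of 5 baskets
conf = confidence({"diapers"}, {"beer"}, transactions)  # 2 of 3 diaper baskets
```

Production algorithms such as Apriori or FP-Growth differ mainly in how they prune the search over candidate itemsets, not in these definitions.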
Classification
These techniques identify the categories to which new data points belong, based on a training set of data points whose categories are already known. They can be used, for instance, to predict segment-specific customer behavior (e.g., buying decisions, churn rate, consumption rate) where there is a clear hypothesis or objective outcome. Used for data mining.
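A k-nearest-neighbour classifier is one of the simplest instances of this idea: a new point is assigned the majority label of the most similar training points. The 2-D points and "stay"/"churn" labels below are invented for illustration:

```python
# Minimal k-nearest-neighbour classifier on an invented 2-D training set.
from math import dist
from collections import Counter

def knn_predict(train, point, k=3):
    """train: list of ((x, y), label) pairs. Return the majority label
    among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "stay"), ((0, 1), "stay"), ((1, 0), "stay"),
         ((5, 5), "churn"), ((5, 6), "churn"), ((6, 5), "churn")]

print(knn_predict(train, (0.5, 0.5)))   # lands inside the "stay" cluster
```

The training set carries all the model's knowledge here; at big-data scale the same logic is typically backed by approximate nearest-neighbour indexes rather than a full sort.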
Cluster analysis
A statistical method for classifying objects into groups based on similarity, without knowing in advance what characteristics make the members of a group similar. Segmenting consumers into similar groups for targeted marketing is an example of cluster analysis. Used for data mining.
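k-means is the classic clustering algorithm: alternate between assigning points to the nearest centroid and recomputing each centroid as its cluster's mean. A bare-bones 1-D sketch on invented customer-spend figures:

```python
# Bare-bones 1-D k-means sketch: split invented customer spending
# figures into two segments with no pre-assigned labels.
def kmeans_1d(values, iters=20):
    k = 2
    centroids = [min(values), max(values)]        # simple initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # assignment step: each value joins its nearest centroid
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # update step: each centroid moves to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

spend = [10, 12, 11, 95, 102, 98]
centroids, clusters = kmeans_1d(spend)
```

Note that nothing told the algorithm what "low spenders" or "high spenders" look like; the two segments emerge from the data alone, which is exactly what distinguishes clustering from classification.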
Crowdsourcing
Crowdsourcing is a method of collecting data submitted by a large group of people, or crowd, through an open call, usually via a networked medium such as the Internet. It is an example of mass collaboration enabled by Web 2.0 technologies.
Data fusion and data integration
The process of integrating and analyzing data from multiple sources in order to develop insights more efficiently and accurately than would be possible using any single source alone.
Data mining
Statistical and machine learning approaches are combined with database management to extract patterns from large datasets. These techniques include cluster analysis, association rule learning, regression, and classification. Customer data can be analyzed to identify segments that respond most quickly to an offer, employee data can be analyzed to identify attributes of the most successful employees, or market basket analysis can be used to predict what customers will purchase.
Ensemble learning
Multiple predictive models (each constructed using statistics and/or machine learning) are used to achieve better performance than any of the constituent models.
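Majority voting is the simplest way to combine constituent models. The three weak "models" below are hand-written rules on invented features, standing in for independently trained classifiers:

```python
# Ensemble-by-majority-vote sketch. The three weak "models" are
# hand-written rules (invented for illustration) standing in for
# independently trained classifiers.
from collections import Counter

def model_a(text): return "spam" if "free" in text else "ham"
def model_b(text): return "spam" if "!!!" in text else "ham"
def model_c(text): return "spam" if len(text) < 20 else "ham"

def ensemble_predict(models, text):
    """Return the label chosen by the most constituent models."""
    votes = Counter(m(text) for m in models)
    return votes.most_common(1)[0][0]

models = [model_a, model_b, model_c]
print(ensemble_predict(models, "free offer!!!"))
```

The reason ensembles beat their parts is that individual models make partly uncorrelated errors, so the vote cancels many of them out; bagging and boosting are more principled ways of manufacturing that diversity.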
Genetic algorithms
This optimization technique encodes potential solutions as "chromosomes" that can combine and mutate, much as in natural evolution. Each chromosome's fitness, or performance, is evaluated within a modeled "environment" and determines whether it is selected for survival. Genetic algorithms are a type of "evolutionary algorithm" and are well suited to solving nonlinear problems. Examples include optimizing the performance of an investment portfolio and improving job scheduling in manufacturing.
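The selection/crossover/mutation loop fits in a few lines. The sketch below evolves 10-bit chromosomes on the classic OneMax toy problem (fitness = number of 1-bits), chosen purely because its optimum is obvious; the population size, rates, and seed are arbitrary:

```python
# Toy genetic algorithm on the OneMax problem: evolve 10-bit
# chromosomes so that fitness (the count of 1-bits) is maximised.
import random

random.seed(0)
LENGTH, POP, GENS = 10, 20, 40

def fitness(chrom):
    return sum(chrom)

def crossover(a, b):
    """Single-point crossover: splice a prefix of one parent onto
    a suffix of the other."""
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ (random.random() < rate) for bit in chrom]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                  # selection: keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
```

A real portfolio or scheduling application would swap in a domain-specific chromosome encoding and fitness function, but the evolutionary loop itself is unchanged.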
Machine learning
Data-driven machine learning focuses on automatically recognizing complex patterns and making intelligent decisions from them.
Neural networks
Finding patterns in data using computational models inspired by the structure and workings of biological neural networks (such as the cells and connections found in the brain). Finding nonlinear patterns is a good application for neural networks.
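The textbook illustration of that nonlinear capability is XOR, which no single-layer network can represent. The two-layer network below uses hand-set weights and step activations (rather than trained weights) so the mechanics are visible:

```python
# A two-layer neural network with hand-set weights that computes XOR,
# a nonlinear pattern no single-layer network can represent.
def step(x):
    """Threshold activation: fire (1) if the weighted input exceeds 0."""
    return 1 if x > 0 else 0

def xor_net(a, b):
    # hidden layer: h1 fires on "a OR b", h2 fires on "a AND b"
    h1 = step(a + b - 0.5)
    h2 = step(a + b - 1.5)
    # output neuron: OR but not AND, i.e. exclusive-or
    return step(h1 - h2 - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
```

In practice the weights are of course learned from data (e.g., by backpropagation with differentiable activations) rather than set by hand; the hidden layer is what lets the network carve out a nonlinear decision boundary.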
Network analysis
A technique for describing relationships among discrete nodes in a graph or network. Social network analysis investigates the connections among individuals in a group or organization, for example, how information moves or who has the most influence. It can be used, for instance, to identify key opinion leaders to target in marketing and to identify bottlenecks in information flows within enterprises.
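Degree centrality, the number of direct connections a node has, is the simplest influence measure. A sketch on an invented who-talks-to-whom graph (the names and edges are made up):

```python
# Network-analysis sketch: degree centrality on an invented
# who-talks-to-whom graph, to spot the most connected person.
from collections import defaultdict

edges = [("ana", "bo"), ("ana", "cy"), ("ana", "di"),
         ("bo", "cy"), ("di", "ed")]

# build an undirected adjacency map
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# degree centrality: how many direct connections each node has
centrality = {node: len(neigh) for node, neigh in adj.items()}
most_influential = max(centrality, key=centrality.get)
```

Richer measures (betweenness, eigenvector centrality, PageRank) weight *which* nodes you are connected to, not just how many, and are what large-scale network analyses typically use to find true bottlenecks and opinion leaders.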
Optimization
Numerical methods for redesigning complex systems and processes to improve their performance based on one or more objective metrics (e.g., cost, speed, or reliability). Among the applications for optimization are improving operational processes such as scheduling, routing, and floor planning, as well as formulating strategies such as product range strategy, linked investment analysis, and R&D portfolio strategy.
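For a routing problem small enough, the objective can even be optimized by exhaustive search. The four sites and coordinates below are invented; the sketch enumerates every visit order from a depot and keeps the shortest round trip:

```python
# Brute-force route optimisation over four invented sites: enumerate
# all visit orders from the depot and keep the shortest round trip.
from itertools import permutations
from math import dist

sites = {"depot": (0, 0), "a": (2, 0), "b": (2, 2), "c": (0, 2)}

def route_length(order):
    """Total length of depot -> stops in `order` -> depot."""
    stops = ["depot", *order, "depot"]
    return sum(dist(sites[u], sites[v]) for u, v in zip(stops, stops[1:]))

best = min(permutations(["a", "b", "c"]), key=route_length)
```

Enumeration scales factorially, so real scheduling and routing optimizers replace it with linear/integer programming or heuristics (simulated annealing, genetic algorithms), but the objective-function framing is identical.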
Sentiment analysis
The process of extracting and identifying subjective information from a text source using natural language processing and analytic techniques. This type of analysis includes identifying the product or feature about which a sentiment is expressed, determining the type of sentiment (e.g., positive, negative, neutral), and determining the degree and strength of the sentiment. The application of sentiment analysis in social media (e.g., blogs, microblogs, and social networks) may allow companies to measure how different customer segments and stakeholders are responding to their products and actions.
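The oldest approach is lexicon-based scoring: count words from positive and negative word lists. The tiny lexicons below are invented for illustration; real systems use large curated lexicons or trained language models:

```python
# Lexicon-based sentiment sketch: score text by counting words from
# tiny hand-made positive/negative word lists (invented for illustration).
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate"}

def sentiment(text):
    """Net count of positive minus negative words, mapped to a label."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great phone but slow and broken charger"))
```

This sketch also shows why the aspect-identification step in the definition matters: the example review is positive about the phone but negative about the charger, and a single document-level score blurs that distinction.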
Spatial analysis
An analysis of topological, geometric, or geographical properties encoded in a data set using various techniques drawn from statistics. Spatial analysis often uses geographical information systems (GIS) that capture locational information, e.g., address and longitude/latitude coordinates. Spatial data can be incorporated into spatial regressions (for instance, how is consumer willingness correlated with location?) or simulations (for example, how would a manufacturing supply chain network work with sites located in different places?).
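A basic building block of such analyses is computing distances from longitude/latitude coordinates, which requires spherical rather than planar geometry. A sketch of the standard haversine (great-circle) formula, using two illustrative city coordinates:

```python
# Spatial-analysis sketch: great-circle (haversine) distance between
# two latitude/longitude points on a spherical Earth (R = 6371 km).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometres along the Earth's surface."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Paris (48.8566, 2.3522) to Berlin (52.52, 13.405): roughly 880 km
d = haversine_km(48.8566, 2.3522, 52.52, 13.405)
```

GIS software layers indexing, projections, and map rendering on top, but pairwise distances like this one are what feed spatial regressions and supply-chain simulations.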
Statistics
This includes the study of survey methods and experiments, as well as how data is collected, organized, and interpreted. It is common for statistical techniques to be used to determine what relationships between variables could have been the result of chance (the "null hypothesis") and which relationships likely result from underlying causal relationships (i.e., those that are statistically significant).
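The chance-versus-real-relationship question can be made concrete with a permutation test: shuffle the group labels many times and ask how often random labeling produces a difference as large as the observed one. The two small groups below are invented:

```python
# Permutation-test sketch on two invented groups: how often does a
# random relabelling produce a group-mean difference at least as
# large as the observed one?
import random

random.seed(1)
group_a = [5, 6, 7, 6, 5]
group_b = [8, 9, 10, 9, 8]
observed = abs(sum(group_b) / len(group_b) - sum(group_a) / len(group_a))

pooled = group_a + group_b
TRIALS = 2000
count = 0
for _ in range(TRIALS):
    random.shuffle(pooled)                      # break any real relationship
    a, b = pooled[:5], pooled[5:]
    if abs(sum(b) / 5 - sum(a) / 5) >= observed:
        count += 1

p_value = count / TRIALS   # small p-value: unlikely to be chance alone
```

A small p-value says the observed difference would rarely arise under the "null hypothesis" of no relationship; as the definition notes, statistical significance still does not by itself establish the underlying causal mechanism.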
Time series analysis
Combining statistical and signal processing techniques to analyze sequences of data points, each representing a value at a particular time, in order to extract meaningful properties. For example, a time series analysis might examine the daily closing price of a stock or the number of patients diagnosed with a given condition each day. Time series forecasting uses a mathematical model to predict future values of the series based on its past values.
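The simplest smoothing-and-forecasting tools can be sketched directly. The daily figures below are invented; the moving average damps short-term noise, and the naive forecast predicts the next value as the mean of the most recent window:

```python
# Time-series sketch on invented daily figures: smooth with a moving
# average, then forecast the next value as the mean of the last window.
def moving_average(series, window):
    """One smoothed value per position once `window` points are available."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def naive_forecast(series, window):
    """Predict the next value as the mean of the last `window` values."""
    return sum(series[-window:]) / window

prices = [10, 11, 13, 12, 14, 15, 14]
smoothed = moving_average(prices, 3)
next_value = naive_forecast(prices, 3)   # mean of the last three values
```

Serious forecasting models (ARIMA, exponential smoothing, and their successors) add trend and seasonality terms, but they generalize exactly this idea of predicting the future from weighted past values.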
Visualization
Techniques used to synthesize the results of data analysis and communicate them through images, diagrams, or animations.