Data mining is the process of examining vast quantities of data in order to make a statistically likely prediction. Data mining could be used, for instance, to identify when high-spending customers interact with your business, to determine which promotions succeed, or to explore the impact of the weather on your business.
Data mining principles have been around for many years in conjunction with data warehouses, and have become more prominent with the advent of Big Data.
Data analytics and the growth in both structured and unstructured data have also prompted data mining techniques to change, since companies are now dealing with larger data sets with more varied content. Additionally, artificial intelligence and machine learning are automating the process of data mining.
Regardless of the technique, data mining typically proceeds in three steps:
- Exploration: First you must prepare the data: decide what you need and what you don’t, eliminate duplicates and useless data, and narrow your data collection to just what you can use.
- Modeling: Build your statistical models with the goal of evaluating which will give the best and most accurate predictions. This can be time-consuming as you apply different models to the same data set over and over again (which can be processor-intensive) and then compare the results.
- Deployment: In this final stage you test your model, against both old data and new data, to generate predictions or estimates of the expected outcome.
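The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(day, sales)` records, the two toy candidate models, and all function names are hypothetical and not taken from any particular tool.

```python
from statistics import mean

def explore(records):
    """Exploration: drop incomplete rows and exact duplicates, keeping order."""
    seen, cleaned = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

def mean_model(train):
    """Candidate model 1: always predict the average of the training values."""
    average = mean(value for _, value in train)
    return lambda day: average

def last_value_model(train):
    """Candidate model 2: always predict the most recent training value."""
    last = train[-1][1]
    return lambda day: last

def mean_abs_error(model, data):
    """Average absolute difference between predictions and actual values."""
    return mean(abs(model(day) - value) for day, value in data)

def pick_best(candidates, train, holdout):
    """Modeling: fit every candidate, keep the one with the lowest holdout error."""
    fitted = {name: build(train) for name, build in candidates.items()}
    best = min(fitted, key=lambda name: mean_abs_error(fitted[name], holdout))
    return best, fitted[best]

# Hypothetical (day, sales) records with one duplicate and one missing value.
raw = [(1, 10), (2, 12), (2, 12), (3, None), (4, 11), (5, 30), (6, 31)]
cleaned = explore(raw)
train, holdout = cleaned[:-1], cleaned[-1:]

best_name, best_model = pick_best(
    {"mean": mean_model, "last_value": last_value_model}, train, holdout
)
# Deployment: use the chosen model to estimate a day it has not seen.
print(best_name, best_model(7))
```

In practice the candidate models would be far richer, but the shape is the same: clean the data, compare models on data they were not fitted to, then put the winner to work.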
Leading Data Mining Techniques
Data mining is a highly effective process, given the right technique. The challenge is choosing the best technique for your situation, because there are many to choose from and some are better suited to certain kinds of data than others. So what are the major techniques?
Classification Analysis
This form of analysis assigns data to different classes. Classification is similar to clustering in that it also segments data records into groups, here called classes. In classification, however, the structure or identity of the classes is known in advance. A popular example is e-mail filtering, which labels each message as legitimate or as spam based on known patterns.
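As a rough illustration of the spam example, the sketch below trains a tiny naive Bayes classifier on labelled messages. Naive Bayes is only one of many classification methods and is chosen here as an assumption; the training messages and function names are invented for the example.

```python
import math
from collections import Counter

def train(labelled_messages):
    """Count words per class and messages per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in labelled_messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the class with the highest log-probability (add-one smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: the classes are known in advance.
messages = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, label_counts = train(messages)
print(classify("claim your free prize", word_counts, label_counts))
print(classify("agenda for the team meeting", word_counts, label_counts))
```

The key point is that the labels ("spam", "ham") exist before any analysis starts; the model only learns which patterns go with which label.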
Clustering
In contrast to classification, clustering is a form of analysis in which the structure of the data is discovered as it is processed, by comparing each record with similar records. Unlike classification, it deals with groups that are not known in advance.
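To make the idea concrete, here is a minimal one-dimensional k-means sketch. K-means is just one clustering method, picked here as an assumption; the purchase amounts and starting centroids are hypothetical.

```python
from statistics import mean

def k_means_1d(values, centroids, iterations=10):
    """Group values around the nearest centroid, then move each centroid
    to the mean of its group, repeating a fixed number of times."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(group) if group else centroids[i]
                     for i, group in enumerate(clusters)]
    return centroids, clusters

# Hypothetical purchase amounts; no labels are supplied.
amounts = [12, 15, 14, 95, 102, 98]
centroids, clusters = k_means_1d(amounts, centroids=[12, 102])
print(clusters)
```

No one told the code which purchases were "small" or "large"; the two groups emerge from comparing values with one another.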
Anomaly or Outlier Detection
This is the process of examining data for errors or unusual values that may require further evaluation and human intervention, either to use the data or to discard it.
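One simple way to flag candidates for review is a z-score test, sketched below. The threshold of 2 standard deviations and the order counts are assumptions for illustration; real data often calls for more robust measures.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:  # all values identical, so nothing stands out
        return []
    return [v for v in values if abs(v - center) / spread > threshold]

# Hypothetical daily order counts with one suspicious entry.
daily_orders = [20, 21, 19, 22, 20, 21, 95]
print(flag_outliers(daily_orders))
```

The flagged value still needs a human decision: it may be a data-entry error to discard or a genuine spike worth investigating.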
Regression Analysis
Regression analysis is a statistical process for estimating the relationships between variables. It helps you understand how the typical value of the dependent variable changes when any one of the independent variables is varied. It is generally used for prediction: if you change one variable, you can estimate how a separate variable is affected.
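The simplest case, one independent variable and a straight-line fit by ordinary least squares, can be sketched as follows. The advertising and sales figures are hypothetical, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (independent) vs. units sold (dependent).
ad_spend = [1, 2, 3, 4]
units_sold = [3, 5, 7, 9]
slope, intercept = fit_line(ad_spend, units_sold)

# Prediction: estimate the dependent variable for a new value of the independent one.
print(slope * 5 + intercept)
```

The slope is the estimated change in the dependent variable for each one-unit change in the independent variable.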
Prediction
This technique is what data mining is all about. It uses past data to predict future actions or behaviors. The simplest example is examining a person’s credit history to make a loan decision. Induction is similar: it reasons that if a given action occurs, then another, and then another again, we can expect a particular result.
Summarization
Exactly as it sounds, summarization presents a more compact representation of the data set, thoroughly processed and modeled to give a clear overview of the results.
Sequential Patterns
One of the many forms of data mining, sequential pattern mining is specifically designed to discover sequential series of events. It is one of the more common forms of mining because data is, by default, recorded sequentially, such as sales patterns over the course of a day.
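A very small version of this idea counts how often one event is immediately followed by another across many recorded sequences. The session data, event names, and the minimum support of 2 are assumptions made for the sketch; full sequential pattern miners handle longer and non-adjacent patterns.

```python
from collections import Counter

def frequent_pairs(sequences, min_support):
    """Count, per sequence, each ordered pair of adjacent events and keep
    the pairs that appear in at least `min_support` sequences."""
    counts = Counter()
    for sequence in sequences:
        # A set ensures a pair is counted once per sequence.
        counts.update(set(zip(sequence, sequence[1:])))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical customer sessions, each recorded in order.
sessions = [
    ["browse", "cart", "checkout"],
    ["browse", "cart", "leave"],
    ["search", "browse", "cart", "checkout"],
]
print(frequent_pairs(sessions, min_support=2))
```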
Decision Tree Learning
Decision tree learning is part of a predictive model in which decisions are made based on a series of steps or observations. It predicts the value of a variable based on several inputs. It’s basically a supercharged “If-Then” statement, making decisions based on the answers it gets to the questions it asks.
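The smallest possible decision tree asks a single question. The sketch below learns that one question (a threshold) from labelled examples by picking the split with the fewest mistakes. This is a deliberate simplification: real decision tree learners split recursively and usually score splits with an impurity measure such as Gini or entropy. The credit scores and outcomes are hypothetical.

```python
def learn_stump(rows):
    """rows: (feature_value, label) pairs with boolean labels.
    Returns the threshold t for which the rule 'value >= t' makes the fewest errors."""
    best_threshold, best_errors = None, len(rows) + 1
    for candidate in sorted({value for value, _ in rows}):
        errors = sum((value >= candidate) != label for value, label in rows)
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def predict(threshold, value):
    """The learned If-Then rule: if value >= threshold then True else False."""
    return value >= threshold

# Hypothetical history: (credit score, loan repaid?)
history = [(520, False), (580, False), (610, False),
           (660, True), (700, True), (750, True)]
threshold = learn_stump(history)
print(threshold, predict(threshold, 640), predict(threshold, 700))
```

A full tree would repeat this search inside each branch, asking a new question of the records that fall on each side of the split.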
Tracking Patterns
This is one of the most basic techniques in data mining. You simply learn to recognize patterns in your data sets, such as regular increases and decreases in foot traffic during the day or week, or the times when certain products tend to sell more often, such as beer on a football weekend.
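A pattern like "weekends are busier" can be surfaced with nothing more than grouping and averaging, as in this sketch. The visitor counts and names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_by_key(observations):
    """Group (key, value) observations by key and average each group."""
    grouped = defaultdict(list)
    for key, value in observations:
        grouped[key].append(value)
    return {key: mean(values) for key, values in grouped.items()}

# Hypothetical foot-traffic counts recorded over two weeks.
visits = [("Sat", 120), ("Sun", 100), ("Mon", 40),
          ("Sat", 140), ("Sun", 90), ("Mon", 50)]
averages = average_by_key(visits)
busiest_day = max(averages, key=averages.get)
print(averages, busiest_day)
```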
Statistics
While most data mining techniques focus on prediction based on past data, statistics focuses on probabilistic models, specifically inference: drawing conclusions about a whole population from a sample, with a stated degree of uncertainty. In short, it’s much more of an educated guess. Statistics is chiefly about quantifying data, whereas data mining builds models to detect patterns in data.
Visualization
Data visualization is the process of conveying processed information in an easy-to-understand visual form, such as charts, graphs, digital images, and animation. There are a number of visualization tools, starting with Microsoft Excel but also including RapidMiner, WEKA, the R programming language, and Orange.
Neural Networks
Neural network data mining is the process of gathering and extracting data by recognizing existing patterns in a database using an artificial neural network. An artificial neural network is loosely modeled on the network of neurons in humans, where neurons are the conduits for the five senses. An artificial neural network also acts as a conduit for input, but it is a complex mathematical function that processes data rather than feeling sensory input.
Data Warehousing
You can’t have data mining without data warehousing. Data warehouses are the databases where structured data resides and is processed and prepared for mining. The warehouse handles the tasks of sorting data, classifying it, discarding unusable data, and setting up metadata.
Association Rule Learning
This is a method to identify interesting relations and interdependencies between different variables in large databases. This technique can help you find hidden patterns in the data that might not otherwise be clear or obvious. It’s often used in machine learning.
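Two standard measures for judging a candidate rule such as "customers who buy beer also buy chips" are support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both on a handful of hypothetical shopping baskets.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"milk"},
]
print(support(baskets, {"beer", "chips"}))
print(confidence(baskets, {"beer"}, {"chips"}))
```

Rule-mining algorithms search the space of item combinations for rules whose support and confidence exceed chosen minimums; this sketch only shows how a single rule is scored.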
Long-Term Memory Processing
Data processing tends to be immediate and the results are often used, stored, or discarded, with new results generated at a later date. In some cases, though, things like decision trees are not built with a single pass of the data but over time, as new data comes in, and the tree is populated and expanded. So long-term processing is done as data is added to existing models and the model expands.
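The general idea of a model that grows as data arrives, rather than being rebuilt from scratch, can be illustrated with something much simpler than a decision tree: a running average that is updated one observation at a time. The class below is a stand-in chosen for brevity, not a tree learner, and its name and data are hypothetical.

```python
class RunningMean:
    """Keeps an average that is updated in place as each new value arrives,
    without storing or reprocessing earlier values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Shift the mean toward the new value by 1/count of the difference.
        self.mean += (value - self.mean) / self.count
        return self.mean

model = RunningMean()
for value in [10, 20]:   # first batch of data
    model.update(value)
print(model.mean)

model.update(30)         # new data arriving at a later date
print(model.mean)
```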
Data Mining Best Practices
Regardless of which specific technique you use, here are key data mining best practices to help you maximize the value of your process. They can be applied to any of the 15 aforementioned techniques.
- Preserve the data. This should be obvious. Data must be maintained rigorously, and it must not be archived, deleted, or overwritten once processed. You went to a lot of trouble to prepare that data for generating insight; now the same vigilance must be applied to maintaining it.
- Have a clear idea of what you want out of the data. This shapes your sampling and modeling efforts, never mind your searches. The first question is what you want out of this strategy, such as knowledge of customer behaviors.
- Have a clear modeling technique. Be prepared to go through many modeling prototypes as you narrow down your data ranges and the questions you are asking. If you aren’t getting the answers you want, ask the questions a different way.
- Clearly identify the business problems. Be specific; don’t just say “sell more stuff.” Identify fine-grained issues, determine where they occur in the sale, pre- or post-, and what the problem actually is.
- Look at post-sale as well. Many mining efforts focus on getting the sale, but what happens after the sale (returns, cancellations, refunds, exchanges, rebates, write-offs) is equally important because it is a portent of future sales. These events help identify customers who will be more or less likely to make future purchases.
- Deploy on the front lines. It’s too easy to leave the data mining inside the corporate firewall, since that’s where the warehouse is located and where all data comes in. But preparatory work on the data can be done at remote sites before it is sent in, as can the application of sales, marketing, and customer relations models.