
Big Data Mining: 9 User Tips

Posted February 19, 2018 by Cynthia Harvey

    Following best practices can help enterprises ensure that their data mining efforts are leading to valid — and valuable — insights.

    1. Follow the CRISP-DM methodology.

    CRISP-DM stands for "cross-industry standard process for data mining." Developed by a consortium that included SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, CRISP-DM is a process model for data mining that has been around since the late 1990s. While it sometimes comes under criticism because it hasn't been updated in a long time, CRISP-DM remains the most popular process for conducting data mining. In fact, it has become so ingrained as common practice that many analysts follow its steps without even realizing it. The methodology includes six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. It's an iterative cycle, and organizations can move back and forth between the phases as necessary. Experts say that intentionally following the methodology can help data mining teams make sure that they haven't skipped steps, which could lead them to faulty conclusions.

    2. Develop your background knowledge.

    Fancy data mining tools and analytics are only as smart as the people using them. If you want your data mining efforts to be useful to your organization, you need to understand as much about your business as possible. That allows you to make sure you are answering the questions that are most important to the business and using the right data to answer those questions.

    For example, TripAdvisor analytics director Michael Berry tells a story about a time he was researching daily sales. He noticed that the data included a lot of $1 transactions. Because he didn't know what those transactions were from, he did a little more research and learned that they were credit card validations and not real transactions at all. He then rightly excluded those $1 transactions from his calculations and was able to give the business more accurate information about its sales. In this way, being skeptical, using common sense and being willing to ask questions and learn more about business operations can help you do a much better job of analysis.
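
    The exclusion Berry describes can be sketched in a few lines. This is only an illustration: the transaction rows are made up, and the rule that every charge of exactly $1.00 is a card validation is an assumption that a real analysis would need to confirm with the business.

```python
from decimal import Decimal

# Hypothetical daily transactions: (transaction_id, amount in dollars).
transactions = [
    ("t1", Decimal("42.50")),
    ("t2", Decimal("1.00")),   # card validation, not a real sale
    ("t3", Decimal("19.99")),
    ("t4", Decimal("1.00")),   # card validation, not a real sale
]

# Assumption: every charge of exactly $1.00 is a validation, not a sale.
VALIDATION_AMOUNT = Decimal("1.00")

def real_sales(rows):
    """Return only the rows that are not card validation charges."""
    return [(tid, amt) for tid, amt in rows if amt != VALIDATION_AMOUNT]

sales = real_sales(transactions)
total = sum(amt for _, amt in sales)
print(f"{len(sales)} real sales out of {len(transactions)} rows, total ${total}")
```

    Counting the validation rows as sales would inflate the transaction count and drag down the average sale value, which is why the filter has to run before any summary statistics.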

    3. Combine external data with your internal data.

    In the early days of data mining, organizations stuck primarily to the data that was in their own databases and data warehouses. These days, however, enterprises can get access to a wealth of external data, including Web and social media data, that can give them a more accurate picture of their customers and market circumstances. For example, some organizations find that customers involved in certain online activities or posting particular things to social media are more likely to accept cross-sell or up-sell offers. Including these external factors in customer segmentation models can impact overall revenues.
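
    As a rough sketch of what combining the two sources can look like, the snippet below attaches an external signal to internal customer records by customer ID. The field names (`segment`, `annual_spend`, `follows_brand`) and the data are hypothetical, and real projects usually face a harder matching problem than a shared ID.

```python
# Hypothetical internal CRM rows keyed by customer id.
internal = {
    101: {"segment": "retail", "annual_spend": 1200},
    102: {"segment": "retail", "annual_spend": 300},
}

# Hypothetical external signal, for example from a social media data provider.
external = {
    101: {"follows_brand": True},
}

def enrich(internal_rows, external_rows):
    """Left join: keep every internal customer, add the external field if known."""
    merged = {}
    for cid, row in internal_rows.items():
        extras = external_rows.get(cid, {})
        # None marks "no external data", which is different from False.
        merged[cid] = {**row, "follows_brand": extras.get("follows_brand")}
    return merged

enriched = enrich(internal, external)
for cid, row in enriched.items():
    print(cid, row)
```

    Keeping customers with no external match, rather than dropping them, avoids quietly shrinking the data set to only the people who are visible online.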

    4. Clean your data first.

    It's always a mistake to skip over the data preparation step in the CRISP-DM model. Even well-tended data warehouses are likely to have fields with missing data, duplicate records or other errors. And these days, many data miners are accessing raw and unstructured data from data lakes or other repositories. Cleaning the data and getting it into a usable state is an absolute must. In this step, it's also vitally important to think through what the data is saying and apply common sense rather than just accepting the data as is. For example, if your data includes records for pregnant men or people who are listed as parents but have zero children, you need to go back and figure out where things went wrong.
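
    A minimal sketch of these checks, assuming hypothetical customer records with made-up field names: it skips repeated IDs and flags rows that are either missing a value or contradict common sense, so that someone can trace where they came from.

```python
# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "is_parent": True,  "children": 2},
    {"id": 2, "is_parent": True,  "children": 0},     # contradictory
    {"id": 3, "is_parent": False, "children": None},  # missing value
    {"id": 1, "is_parent": True,  "children": 2},     # repeat of id 1
]

def clean(rows):
    """Split rows into (usable, flagged); flagged rows need a human look."""
    seen, usable, flagged = set(), [], []
    for row in rows:
        if row["id"] in seen:
            continue  # skip a record whose id was already processed
        seen.add(row["id"])
        if row["children"] is None:
            flagged.append((row["id"], "missing children count"))
        elif row["is_parent"] and row["children"] == 0:
            flagged.append((row["id"], "parent with zero children"))
        else:
            usable.append(row)
    return usable, flagged

usable, flagged = clean(records)
print("usable ids:", [row["id"] for row in usable])
print("flagged:", flagged)
```

    Flagging suspicious rows instead of silently deleting them keeps the question of where things went wrong open for investigation.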

    5. Watch out for sample bias.

    The conclusions that you reach will only be as valid and accurate as the data that you feed into your data mining models. TripAdvisor's Berry told another story that illustrates this point well. He was using data from online booking sites to research which countries had the highest average hotel rates. To his surprise, he found that some African countries like Botswana and Lesotho had higher prices than notoriously expensive countries like Switzerland. Before running off and advising management that they needed to target more customers in Botswana and Lesotho, he decided to double-check whether sample bias was playing a role. It turned out that in those African countries, only the really expensive hotels offered online booking, so the averages that he was calculating from his online booking data didn't include all the budget-friendly accommodations in those areas.

    Whenever you get a result that seems out of line with your expectations, investigate thoroughly. You may have discovered something new and valuable — or you might have been led to a bad conclusion by incomplete or invalid data.
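
    The effect in the hotel story is easy to reproduce with made-up numbers. In this sketch the rates and the online-booking flags are invented for illustration; the point is only that averaging over the rows you happen to have can differ sharply from averaging over the whole population.

```python
from statistics import mean

# Hypothetical nightly rates for one country: (rate_usd, bookable_online).
hotels = [
    (450, True),
    (520, True),
    (60, False),
    (45, False),
    (80, False),
    (55, False),
]

# What an analyst sees when the data comes only from online booking sites.
online_only = mean(rate for rate, online in hotels if online)

# What the average would be if every hotel were in the sample.
all_hotels = mean(rate for rate, _ in hotels)

print(f"online-only average: {online_only:.0f}")
print(f"all-hotel average:   {all_hotels:.0f}")
```

    In practice the full population is usually not available, so the useful habit is asking how a row gets into the data set in the first place.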

    6. Use a lot of different models.

    The first model that you build to answer your data mining question will almost certainly need substantial refinement before it becomes useful. Many experts suggest trying a lot of different models in rapid succession in order to find the variables that are truly the most important for answering the question you are investigating, and then optimizing from there.

    Sometimes this process is called "throwaway modeling" because it is similar to "throwaway prototyping" in software development, where developers first quickly build a version of an application that they don't intend to use in order to get the right feature set. Once they have a better understanding of the project, they "throw away" that first prototype and follow coding best practices to create a version they intend to release. A similar approach can help data miners refine their models.
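
    One cheap way to make a first throwaway pass, offered here as a sketch rather than a prescribed method, is to fit a quick single-variable linear model for each candidate variable and rank the variables by how much of the target each one explains. For a one-variable linear fit, that share (R squared) equals the squared Pearson correlation. The data and the variable names `ad_spend` and `shelf_position` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical target (weekly sales) and candidate explanatory variables.
target = [12, 19, 31, 42, 49, 61]
candidates = {
    "ad_spend": [1, 2, 3, 4, 5, 6],
    "shelf_position": [5, 1, 4, 2, 6, 3],
}

# R squared of a one-variable linear fit is the squared correlation.
ranking = sorted(
    ((name, pearson(xs, target) ** 2) for name, xs in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, r2 in ranking:
    print(f"{name}: R^2 = {r2:.3f}")
```

    A screen like this is meant to be discarded: it ignores interactions and nonlinear effects, but it quickly shows which variables deserve a more careful model.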

    7. Remember that correlation does not imply causation.

    Everyone who works in data science has heard that correlation does not imply causation, but it's still really easy to make that mistake when you get wrapped up in a project. As human beings, we're hardwired to believe that if one thing happens after another, the first thing caused the second — but it isn't necessarily so. If you find an interesting correlation between two variables, you need to do further testing to see if one really predicts the other or if it is just a coincidence. Again, common sense and business knowledge will be invaluable in this regard. (And if you've never seen them, the spurious correlation graphs that show how per capita cheese consumption correlates to the number of people who die by becoming tangled in their bedsheets or how the divorce rate in Maine corresponds to the per capita consumption of margarine can be a good reminder about this problem.)
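
    A small sketch shows how easily a strong correlation appears between unrelated quantities. The two series below are invented and have nothing to do with each other except that both rise over time. One common follow-up check, used here as an example rather than a complete test, is to correlate the period-to-period changes instead of the raw levels, which removes the shared trend.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def diffs(values):
    """Change from each period to the next."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Hypothetical yearly figures for two unrelated quantities that both trend up.
series_a = [10, 12, 13, 15, 18, 20]
series_b = [100, 130, 125, 160, 170, 210]

r_levels = pearson(series_a, series_b)                  # raw values
r_changes = pearson(diffs(series_a), diffs(series_b))   # trend removed

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

    With these numbers the raw series correlate strongly while their year-to-year changes correlate only weakly, a hint that the shared upward trend, not any link between the two quantities, is doing the work.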

    8. Test, test, test.

    As mentioned in the previous slide, testing is critical to creating accurate models and valid forecasts. It's quite possible — and common — to create a model that very accurately predicts your sample data, but falls apart in the real world.

    One good way to test your models and hypotheses is to keep a holdout sample — data that you don't use when creating your model. After your model is created, go back and test it against the holdout sample. If it fails to predict your holdout data accurately, you know that your model needs more work.
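
    The holdout idea can be sketched as follows. Everything here is an assumption for illustration: the data is synthetic, the 75/25 split is arbitrary, and a straight-line fit stands in for whatever model is actually being built.

```python
import random

# Hypothetical data: y is roughly 3*x + 2 with a small deterministic wobble.
data = [(x, 3 * x + 2 + ((2 * x) % 5 - 2)) for x in range(20)]

# Hold out 25% of the rows; the fixed seed only makes the split repeatable.
rng = random.Random(42)
shuffled = data[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, holdout = shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(rows, slope, intercept):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)

# The model is fitted on the training rows only; it never sees the holdout.
slope, intercept = fit_line(train)
train_mae = mean_abs_error(train, slope, intercept)
holdout_mae = mean_abs_error(holdout, slope, intercept)
print(f"train MAE = {train_mae:.2f}, holdout MAE = {holdout_mae:.2f}")
```

    A holdout error far above the training error is the warning sign: the model has memorized the sample rather than learned a pattern that carries over to new data.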

    9. Create a visualization of your results.

    Everyone knows that a picture is worth a thousand words, but analysts who are used to dealing with numbers all day can sometimes forget the power of a good graph. Your data mining project isn't truly complete until you've presented your results, and a good presentation requires good visualizations. If you want management to take action as a result of the information you have provided, you need to present your conclusions in a way that is visually appealing and easy to grasp. Vendors offer a number of different software products that can help with this step. Investing the necessary time, money and effort into creating visualizations can help ensure that all your preceding data mining effort isn't wasted.

Data mining has been around for decades. Although the term is no longer as popular as it once was, in today's era of big data and machine learning, the activities traditionally associated with data mining have become more important than ever as a source for critical business insights.

In the 1990s, people used the term "data mining" specifically with regard to finding hidden insights in databases and data warehouses. As enterprise data stores grew and the term "big data" came into vogue, data mining expanded to encompass much more than reporting based on databases.

Today, the term "data mining" is often used interchangeably with "analytics" (although some experts argue that data mining is a subset of analytics). It's also closely related to business intelligence and data science. Essentially, data mining involves using algorithms and models to find patterns in internal and external stores of data. Those patterns — trends, correlations, anomalies, clusters, etc. — provide businesses with invaluable information about what has happened in the past and what might be likely to happen in the future.

Enterprises recognize that data mining and analytics can be a valuable resource for improving customer service, increasing sales and gaining a competitive advantage. As a result, many are investing in data mining solutions. According to IDC, analytics is now a "mainstream" enterprise activity, and spending on big data and analytics, which includes data mining tools, is increasing at a compound annual growth rate of 11.9 percent. By 2020, total revenue in the category will likely top $210 billion.

For most organizations, mining their big data has become a regular part of their business practice, and many are looking for ways to refine, improve and optimize their techniques. The following slideshow offers nine tips for making data mining more efficient, effective and useful.

Images from Pixabay.


