This chapter introduces the specific tasks involved in the knowledge discovery process. The process is iterative: as you discover new patterns and improve your understanding of the data, you commonly return to earlier stages, as shown in Figure 2-1.
This chapter describes a process that follows these steps:
Identify the source of the data—expanded in “Identifying the Data”.
Prepare the data—expanded in “Preparing the Data”.
Build a model—expanded in “Building a Model”.
Evaluate the model—expanded in “Evaluating a Model”.
Deploy the model—expanded in “Deploying a Model”.
The task of identifying the data begins by deciding what data is needed to solve a problem. For example, a business goal such as predicting customer behavior must first be recast as a specific problem. In defining the problem, the investigator must identify the data needed to solve it and explore other possible sources of data.
Data may be in a difficult location or in an obscure form. Sometimes there are several initial databases that are incompatible with each other. Further, if data is sparse or incomplete, more data may be needed. The form in which new data is collected depends on the form of the existing data. Finally, the data may exist but need to be extracted from a central data warehouse. MineSet accepts both flat files and binary files, as well as data from databases supplied by several commercial vendors, such as Informix, Oracle, and Sybase. Tools such as DBMS/COPY from Conceptual Software, Inc. allow you to convert data from over 100 formats to a format that MineSet accepts.
Data may have to be loaded from legacy systems or external sources, stored, and cleaned. Specifically, the following problems are common:
Data may be in a format incompatible with its end use (for example, EBCDIC format).
Data may have many missing, incomplete, or erroneous values.
Field descriptions may be unclear or confusing, or may mean different things depending on the source. For example, order date may mean the date that the order was sent, postmarked, received, or keyed in.
Data may be stale. Customers may have moved, changed households, or changed spending patterns.
MineSet can help you discern data quality problems in the initial stages of building a data warehouse.
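The problems listed above can also be checked for before the data is loaded. The following is a minimal Python sketch, not part of MineSet. The sample records, the field names, the age range of 0 to 120, and the helper functions count_missing and suspicious_ages are all hypothetical, and cp037 is only one of several EBCDIC code pages.

```python
# Illustrative only: sample data and helper names are assumptions,
# not part of MineSet.

# Decode bytes from a legacy export. cp037 is one common EBCDIC code page;
# the code page used by a real source system must be confirmed.
ebcdic_bytes = "ORDER 1001".encode("cp037")
text = ebcdic_bytes.decode("cp037")

# Hypothetical customer records with missing and erroneous values.
records = [
    {"customer": "A", "age": "34", "order_date": "1997-03-01"},
    {"customer": "B", "age": "", "order_date": "1997-03-04"},
    {"customer": "C", "age": "-5", "order_date": None},
]


def count_missing(rows):
    """Count empty or None values for each field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] = counts.get(field, 0) + 1
    return counts


def suspicious_ages(rows, low=0, high=120):
    """Return customers whose age is present but outside an assumed range."""
    flagged = []
    for row in rows:
        value = row["age"]
        if value not in (None, "") and not (low <= int(value) <= high):
            flagged.append(row["customer"])
    return flagged


print(count_missing(records))
print(suspicious_ages(records))
```

A check like this does not resolve unclear field descriptions or stale data. Those problems require knowledge of how each source system records its values.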
Even when it is clean, data may need to be transformed before it is suitable for mining and visualization. Specifically, the input to the algorithms and visualizations must be a single table. Although SQL commands can be given to MineSet, it is recommended that database administrators create the appropriate views to simplify operations for end users. MineSet can perform powerful data operations, but some transformations must be done before the data is brought into MineSet.
Considerable planning and knowledge of your data should go into data transformation decisions. Data transformations are at the heart of developing a sound model, and you may need to go back and transform the data differently. Data can be transformed in the following ways:
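As an illustration of presenting several tables as a single table through a view, here is a minimal Python sketch that uses the sqlite3 module as a stand-in database. SQLite is not one of the databases named in this chapter, and the table names, column names, and the customer_usage view are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for a production database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE calls (customer_id INTEGER, minutes REAL);

    INSERT INTO customers VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO calls VALUES (1, 10.0), (1, 25.0), (2, 5.0);

    -- The view joins and summarizes the two tables, so that an end user
    -- sees one table with one row per customer.
    CREATE VIEW customer_usage AS
    SELECT c.customer_id,
           c.state,
           SUM(k.minutes) AS total_minutes,
           COUNT(*) AS call_count
    FROM customers AS c
    JOIN calls AS k ON k.customer_id = c.customer_id
    GROUP BY c.customer_id, c.state;
    """
)

usage_rows = conn.execute(
    "SELECT * FROM customer_usage ORDER BY customer_id"
).fetchall()
for row in usage_rows:
    print(row)
```

An end user can then query customer_usage as if it were one table, without writing the join.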
By adding columns, usually applying a mathematical formula to existing data.
By removing columns that are not pertinent, are redundant, or contain obvious, uninteresting predictors.
By filtering visualizations. For example, you may want to see only the strongest rules or the most profitable customer segments.
By changing a column's name.
By binning data—breaking up a continuous range of data into discrete segments.
By aggregating data—grouping records together, and finding the sum, maximum, minimum, or average values.
By sampling the data to get a random subset of the data (by percentage or count).
By applying a classifier that you have previously created, to label new records with a class label, or to estimate the probability of a given label value.
In MineSet, most of these transformations take place using the Data Transformation pane in Tool Manager.
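To make several of the transformations listed above concrete, the following Python sketch adds a column, bins a continuous column, aggregates by group, and draws a random sample. It illustrates the ideas only and does not show how Tool Manager performs them. The rows, the column names, the bin thresholds, and the helper functions are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical usage records.
rows = [
    {"state": "CA", "minutes": 120.0, "charge": 24.0},
    {"state": "CA", "minutes": 300.0, "charge": 45.0},
    {"state": "NY", "minutes": 50.0, "charge": 12.5},
    {"state": "NY", "minutes": 410.0, "charge": 61.5},
]

# Add a column by applying a formula to existing columns.
for row in rows:
    row["charge_per_minute"] = row["charge"] / row["minutes"]


def bin_minutes(minutes, thresholds=(100.0, 250.0)):
    """Bin a continuous value into discrete segments (assumed thresholds)."""
    low, high = thresholds
    if minutes < low:
        return "low"
    if minutes < high:
        return "medium"
    return "high"


for row in rows:
    row["usage_bin"] = bin_minutes(row["minutes"])


def aggregate_by(records, key, value):
    """Group records by key and compute sum, min, max, and average of value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        group: {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for group, values in groups.items()
    }


def sample_rows(records, fraction, seed=0):
    """Return a random subset containing roughly the given fraction of records."""
    rng = random.Random(seed)
    count = max(1, round(len(records) * fraction))
    return rng.sample(records, count)


print(aggregate_by(rows, "state", "charge"))
print(len(sample_rows(rows, 0.5)))
```

Fixing the seed makes the sample repeatable, which helps when you loop back and rerun an earlier stage on the same subset.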
Model building, which analytical data mining algorithms perform automatically, is at the core of the knowledge discovery process. It is described in Chapter 3.
Evaluating the accuracy of a model refines your understanding of that model and its usefulness. Some models, notably the Decision Tree and Option Tree classifiers, evaluate different parts of the model and display the results directly in their visualizations.
MineSet implements four model assessment methods: error estimation, confusion matrix, lift curve, and ROI (return-on-investment) curve.
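The first three of these assessment ideas can be sketched in a few lines of Python. This is a generic illustration under assumed data, not MineSet's implementation. The labels, the scores, and the function names are hypothetical, and the last function computes the cumulative fraction of positives captured, which is the quantity usually plotted on a lift chart.

```python
from collections import Counter

# Hypothetical true labels, predicted labels, and predicted scores
# for the "leave" class on eight held-out records.
actual = ["leave", "stay", "stay", "leave", "stay", "stay", "leave", "stay"]
predicted = ["leave", "stay", "leave", "leave", "stay", "stay", "stay", "stay"]
scores = [0.9, 0.2, 0.6, 0.8, 0.1, 0.3, 0.4, 0.05]


def confusion_matrix(actual_labels, predicted_labels):
    """Count records for each (actual, predicted) pair of labels."""
    return Counter(zip(actual_labels, predicted_labels))


def error_rate(actual_labels, predicted_labels):
    """Fraction of records whose predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual_labels, predicted_labels) if a != p)
    return wrong / len(actual_labels)


def cumulative_gains(actual_labels, label_scores, positive="leave"):
    """Rank records by descending score and return, for each rank,
    the fraction of all positive records captured so far."""
    ranked = sorted(zip(label_scores, actual_labels),
                    key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for label in actual_labels if label == positive)
    captured = 0
    curve = []
    for _, label in ranked:
        if label == positive:
            captured += 1
        curve.append(captured / total_positive)
    return curve


matrix = confusion_matrix(actual, predicted)
print(matrix[("leave", "leave")], matrix[("leave", "stay")])
print(error_rate(actual, predicted))
print(cumulative_gains(actual, scores))
```

A curve that rises quickly means that the highest-scored records contain most of the customers who actually leave, which is what makes ranking by score useful.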
A model can be deployed by applying it to new data. New data can give rise to further questions, which may require further refinements.
In the telecommunications example in Chapter 3, a model can be created to determine which customers are likely to leave their phone carrier. Customer records can then be evaluated through the model to identify the specific customers most likely to leave. These customers can be given incentives to stay.
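Deployment of this kind amounts to scoring new records and ranking them. The Python sketch below assumes a hand-written scoring rule in place of a trained model. The function churn_score, its thresholds and weights, and the record fields are hypothetical and are not taken from the churn dataset.

```python
def churn_score(record):
    """Hypothetical stand-in for a trained classifier: returns a score
    where a higher value means the customer is more likely to leave."""
    score = 0.1
    if record["service_calls"] >= 4:
        score += 0.5
    if record["intl_plan"]:
        score += 0.3
    return score


# Hypothetical new customer records that the model has not seen.
new_customers = [
    {"id": 101, "service_calls": 5, "intl_plan": True},
    {"id": 102, "service_calls": 1, "intl_plan": False},
    {"id": 103, "service_calls": 4, "intl_plan": False},
]


def most_likely_to_leave(records, model, top_n):
    """Score each record with the model and return the IDs of the
    top_n highest-scoring customers."""
    ranked = sorted(records, key=model, reverse=True)
    return [record["id"] for record in ranked[:top_n]]


print(most_likely_to_leave(new_customers, churn_score, 2))
```

The returned IDs identify the customers who would be offered incentives to stay. Any model that maps a record to a score can be passed in place of churn_score.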
The next two chapters step you through the knowledge discovery process on the churn dataset—a prepared dataset of telecommunication customers. As you work through the examples, think of the process presented here and how your operations progress forward and loop back as shown in Figure 2-1.