Chapter 4. Further Explorations

This chapter continues to explore the MineSet tools. It assumes that you have worked through Chapter 3, “Churn Tutorial,” and prepares you to use other aspects of MineSet.

Exploring Data Clusters

When confronted with an unfamiliar dataset, you can discover evocative attributes or characteristics using the clustering algorithm. This algorithm segments records into clusters whose members are similar in several ways. For this example, return to the Tool Manager window, and begin a new history by reopening /usr/lib/mineset/data/churn.schema.

  1. In the Data Destination pane of the Tool Manager window, click the Mining Tools tab.

  2. Click the Cluster tab, and make these selections:

    Method: Single k-Means

    Number of Clusters: 3

  3. In the Data Transformations pane of Tool Manager, select and remove these columns from the Current Columns pane: state, account length, area code, phone number, international plan (because it correlates to total intl charge), and voice mail plan (because it correlates to number vmail messages). See Figure 4-1. To select multiple columns, hold down the Control key while clicking; then remove the selected columns using the Remove column button.

    Figure 4-1. Removing Columns to Prepare for Clustering


    The columns that are removed are those that are not likely to influence clustering productively. The column churned is retained, to help explain results. You can experiment with which columns to remove as you explore the dataset further.

  4. Click Further Options to set the weight of the attributes.

    By default, the weight of each column is set to one (1), which means each column is given equal importance. Set the churned column to 0 for this example, to see if this attribute is generated spontaneously as the dataset is clustered. Click Set then OK.

  5. Click Go! on the right side of the Tool Manager window.

    The Status window on the bottom of Tool Manager shows the progress of the clustering operation, as the algorithm selects significant characteristics by which to group the records. The model is saved as churn.cluster. This particular clustering takes the algorithm about 10 iterations.

    Figure 4-2. Box Plots Produced by Cluster Visualizer


    Figure 4-2 shows a partial view of the result of Cluster Visualizer. The columns are sorted by their power in discriminating how one cluster differs from another. It is clear that the number of voice mail messages, total day minutes and total day charge are the most important columns. Color and means are quite different between clusters at the top of columns, yet as you scroll down the display, the differences are minimal.

  6. Click the button near the cluster number to change the attribute ordering, so that attributes that are important in discriminating this cluster from the others show at the top of the display.

  7. Choose File > Exit in the Cluster Visualizer window to close the window and return to the Tool Manager window.
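
The clustering steps above can be pictured with a minimal k-means sketch in Python. This is not MineSet's implementation: the function names, the toy records, and the choice of the first k records as starting centroids are all assumptions made for illustration. The weights argument mirrors the Further Options weights, so a weight of 0 (as set for churned in step 4) removes a column from the distance calculation.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance in which each column is scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def kmeans(records, k, weights, max_iterations=20):
    """Plain k-means; returns (centroids, cluster index for each record)."""
    # Assumed initialisation: use the first k records as starting centroids.
    centroids = [list(r) for r in records[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assign every record to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: weighted_distance(r, centroids[c], weights))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Move each centroid to the mean of the records assigned to it.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Toy records: (number vmail messages, total day minutes, churned as 0/1).
records = [
    (0, 120, 0), (30, 180, 0), (0, 310, 1),
    (2, 130, 0), (28, 170, 0), (1, 300, 1),
]
# churned has weight 0, so it does not influence the clustering.
centroids, assignment = kmeans(records, k=3, weights=[1, 1, 0])
print(assignment)
```

In this toy data both churned records land in the same cluster even though churned carried no weight, which is the kind of spontaneous grouping the tutorial asks you to look for.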

Relating the Columns and Axes in the Model

With Cluster Visualizer you can look at independent attributes in a dataset, examine the most prominent, and see how each differs. However, to see how attributes relate to each other between clusters, Scatter Visualizer provides a clearer view. To apply the clustered model to Scatter Visualizer, you need to determine which columns should be mapped to the various axes.

  1. In the Data Transformations pane of Tool Manager, click on Apply Model and select churn.cluster from the list of available models. Click OK.

    Although Cluster Visualizer indicated three columns as the most important, each cluster's order of importance was independent, with no indication of interactions between attributes. At this point, the Column Importance tool is useful.

  2. In the Data Destination pane of Tool Manager, select the Data File tab, and click the ...Server, and named: button. In the text field, type the filename churn-crop, and click Create File. This saves the abbreviated version of the churn dataset that is used later in this tutorial.

    Figure 4-3. Saving Data File to Server


Finding Important Columns in the Clustered Model

  1. In the Data Destination pane of Tool Manager, with Mining Tools selected, click on the tab labeled Col. Imp. (abbreviation for Column Importance). By default the tool selects the top three columns in terms of importance. The discrete label is Cluster.

    Figure 4-4. Column Importance Selections for Cluster


  2. Click Go!

    The displayed panel shows:

    1. number vmail messages
    2. total day charge
    3. total eve charge
    

    It seems that time spent on the phone during the day is a factor, with all other charges showing a correlation. The next step is to map these columns to axes in Scatter Visualizer.
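
One common way to rank columns against a discrete label such as Cluster is mutual information. The sketch below assumes that approach purely for illustration; this chapter does not say which measure the Column Importance tool uses, and the column values (already binned into categories) are made up.

```python
import math
from collections import Counter

def mutual_information(values, labels):
    """Mutual information, in bits, between a discrete column and a label."""
    n = len(values)
    value_counts = Counter(values)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(values, labels))
    mi = 0.0
    for (value, label), count in joint_counts.items():
        p_joint = count / n
        p_value = value_counts[value] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_value * p_label))
    return mi

def rank_columns(table, labels):
    """Return column names sorted from most to least informative."""
    scores = {name: mutual_information(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Made-up, pre-binned data for six records in three clusters.
cluster = [1, 1, 2, 2, 3, 3]
table = {
    "area code": ["415", "408", "415", "408", "415", "408"],
    "total day charge": ["low", "low", "low", "low", "high", "high"],
    "number vmail messages": ["none", "none", "many", "many", "few", "few"],
}
ranking, scores = rank_columns(table, cluster)
print(ranking)
```

A column that takes every value equally often in every cluster, like area code here, scores close to zero and falls to the bottom of the ranking.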

Mapping to Scatter Visualizer

  1. In the Data Destination pane of Tool Manager, select the Viz Tools tab; then choose Scatter Visualizer from the Tools popup menu.

    Figure 4-5. Mapping Columns to Axes for Scatter Visualizer


  2. In the Data Transformations pane, map items in Current columns to items in Visual Elements by clicking first on the left pane, then the right. For this tutorial, map number vmail messages to Axis 1, total day charge to Axis 2, total eve charge to Axis 3, and Cluster to Entity-color. See Figure 4-5. When you applied the model, the column named Cluster was created.

  3. Click Invoke Tool.

    The Scatter Visualizer window in Figure 4-6 shows the clusters clearly differentiated in color. The blue cloud of scatterpoints represents cluster 2, and the flat pancake shape is split evenly between red and green—clusters 1 and 3. This pancake indicates very low numbers of voice mail messages. Clearly, total day charge and total evening charge are interdependent. If you click on an interesting visual point, the supporting data is displayed as an independent record. Dismiss the display with File > Exit and return to Tool Manager for the next step.

    Figure 4-6. Scatter Visualization Plotted from Clustering


Invoking a Decision Table

For this example, instead of using Scatter Visualizer to see your clustered data, you can visualize the same data as a Decision Table.

  1. In the Tool Manager window, choose File > Open New Data File.

  2. Click the Server File button, and select churn-crop.schema. This is the file saved earlier. If you exited MineSet between sessions you are automatically returned to where you left off.

  3. Click OK.

  4. In the Data Destination pane of the Tool Manager window, click the Mining Tools tab.

  5. Click the Classify tab, and make selections from these popup menus:

    Mode: Classifier & Error

    Inducer: Decision Table

    Discrete Label: churned

    Make sure you have the correct discrete label. You are about to induce a decision table and allow the algorithm to suggest which columns are most important to map to the X and Y axes.

  6. Verify the Suggest checkbox is checked on, then click Go!

    You can see columns being mapped to axes as the tool suggests appropriate mappings. The Status window on the bottom of Tool Manager shows progress and summary information about the induction process, including the classification error rate. When the induction step is done, the Decision Table Visualizer is automatically invoked, showing the model visually.

    Figure 4-7. Decision Table Showing Clustered Churn Results


Figure 4-7 shows a round pie in the right pane, like Evidence Visualizer, indicating the overall percentage of churn. In the left pane, the data is shown as cake charts or bars that tell you how much churn exists within each subset of the data. Clearly, customers with high total day charge always churn.

Center the display using the middle mouse button, and use the slider in the upper left of the window to control the differences between segments. Use the middle and left mouse buttons together to enlarge or reduce the display size.

The Decision Table shows you data at different levels of detail, taking only a few columns into consideration at first, adding more detail as you examine further. Change the cursor mode from grasp to pick, and pass the cursor across the scene to display data above the window. Notice the bar that falls out of the expected pattern—total day charge less than 29.75, and customer service calls over 3.5. Click with right mouse button on that bar to drill down for details, click with middle mouse button to drill up. Dismiss the display using File > Exit when you are finished examining the Decision Table, and return to Tool Manager.

Targeting Customers Using a Classifier

Previously, you created classifiers to predict which customers are likely to churn. Now that you have such a model, you may want to target customers who are likely to churn before they churn. The lift curve helps accomplish this goal.

A lift curve is a plot in which the X axis shows the portion of records examined, from 0 to 100%, and the Y axis shows the number of those records corresponding to customers who have a given label value (Churn=yes in our case). Two curves are shown on the graph in Figure 4-10. The lower curve (red) shows the number of customers expected to churn given a random ordering of the records. The upper curve (white) shows the number of customers who churn when the records are placed in order according to the classifier's score (probability estimate) for each record. Records representing customers that the classifier identifies as most likely to churn appear first; those less likely to churn appear last. The advantage that the classifier ordering provides can be seen by the difference between the classifier curve and the random curve.
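
The two curves can be sketched in a few lines of Python. This is only an illustration of the definition above, not MineSet's code; the function name lift_curve and the scores and labels are invented for the example.

```python
def lift_curve(scores, churned):
    """Return (model, random_baseline): cumulative churner counts per record examined."""
    # Examine records from the highest classifier score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_churners = sum(churned)
    n = len(scores)
    model, random_baseline = [], []
    found = 0
    for rank, i in enumerate(order, start=1):
        found += churned[i]
        model.append(found)
        # With a random ordering, churners are expected to turn up at a steady rate.
        random_baseline.append(total_churners * rank / n)
    return model, random_baseline

# Invented scores (probability estimates) and actual labels (1 = churned).
scores = [0.9, 0.1, 0.8, 0.3, 0.05, 0.7, 0.2, 0.02, 0.6, 0.01]
churned = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
model, random_baseline = lift_curve(scores, churned)
for rank, (m, r) in enumerate(zip(model, random_baseline), start=1):
    print(f"{100 * rank // len(scores):3d}% of records: model {m}, random {r:.1f}")
```

Both curves meet at the right edge, because examining every record finds every churner regardless of the ordering.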

In building this lift curve, a selected classifier is applied to the test set. In the example below, a specified segment of the dataset is used for training. Then the induced classifier is run on the remainder of the dataset. Although lift curves can be generated easily by selecting Lift Curve from the Tool Options for classifiers, in this tutorial a more complex scenario is shown, one that involves sampling and application of a classifier to a dataset.

Creating a Training Sample

For this example, return to the Tool Manager base window, and begin a new history by using File > Open New Data File and returning to the Local File /usr/lib/mineset/churn.schema.

  1. In the Data Transformations pane, click Sample. In the Sampling dialog box type 40 for the percentage of sampling, and click OK.

    This choice simply samples a random 40% of the total dataset, from which the classifier is induced.

    Figure 4-8. Selecting a Sampling for Testing


  2. In the Data Destination pane of Tool Manager, select the Mining Tools tab; then choose the Classify tab and make selections from these popup menus:

    Mode: Classifier only

    Inducer: Decision Tree

    Discrete Label: churned

    You are inducing a decision tree classifier based on the random 40% sampling, and choosing classifier only because this is the training set. The test set will be the remainder of the dataset (excluding the 40% sampled records).

  3. Click Go!

    The resulting decision tree demonstrates the classifier, which is required in the next stage. Note that the root weight is substantially diminished, because the size of the sample is less than the complete dataset. Also note that no color appears at the base of each node, indicating that no error estimation is available.

You can see in the status field that the classifier is automatically saved under the name churn-dt.class. The next step is to use this classifier on the remainder of the churn dataset.
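
The idea of a random 40% training sample with the remaining records held back for testing can be sketched as follows. The function split_sample and its fixed seed are assumptions of this sketch; how MineSet reproduces the same split internally is not described here.

```python
import random

def split_sample(records, percentage, seed=0):
    """Return (sample, complement): a random percentage of records and the rest."""
    rng = random.Random(seed)  # fixed seed so the same split can be reproduced
    indices = list(range(len(records)))
    rng.shuffle(indices)
    cutoff = round(len(records) * percentage / 100)
    chosen = set(indices[:cutoff])
    sample = [r for i, r in enumerate(records) if i in chosen]
    complement = [r for i, r in enumerate(records) if i not in chosen]
    return sample, complement

# Stand-in records: the numbers 0..99 take the place of customer rows.
records = list(range(100))
training, test = split_sample(records, 40)
print(len(training), len(test))
```

The two parts never overlap, so a classifier induced from the sample is tested only on records it has not seen.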

Applying a Model

Dismiss the Decision Tree window and return to the Tool Manager window. Because you have used a random 40% of the dataset to build the model, you have the remaining 60% to use as a test set.

  1. Click Edit Prev. Op. in the Data Transformations pane and you should be presented with the Sampling Dialog box again.

  2. In the Sampling Dialog box, enter 40 in the Percentage text field again, but this time click the Complementary Sample button to indicate you want the other part of the sample.

  3. Click OK.

  4. Click on the Apply Model button in the Data Transformations pane.

  5. From the Test and Apply Model window choose churn-dt.class.

  6. Click the Test Model tab, turn off Show viz, turn on Show lift curve, and set the ROI/Lift label popup menu to yes.

    Having built a classifier based on the random sample, you are now planning to apply it to the remainder of the churn dataset.

    Figure 4-9. Preparing to Test Classifier on Full Dataset


  7. Click Run Test. The process takes some time. The resulting lift curve is shown in Figure 4-10, with the details of any selected point shown in the upper banner.

    Figure 4-10. Lift Curve


    Move the pointer along the white (model) line, clicking at various points to see the lift and percentage of customers with churn=yes. Look for the knee of the curve, in this example where the estimated probability of the classifier is 0.056.

    This is the point at which the return on investment in sending incentives to customers that may churn diminishes rapidly. The next step is to apply the classifier to the full dataset.

  8. Return to the Test and Apply Model dialog box; click the Apply Model tab and make these selections:

    Estimated probability values for label yes

    New column name: p_churned (You must type this in.)

    When you click Estimated probability values for label, yes is chosen to match the corresponding selection in the Test Model step. This process adds a new column (p_churned) representing the likelihood that certain people will churn. Click OK.

  9. On the Data Transformations pane of Tool Manager click Filter; in the Filter by Expression dialog box text field create the expression p_churned > 0.056. Check expression before clicking OK.

    This is the estimated probability figure found in Step 7 on the lift curve shown in Figure 4-10. The intention is to select only those customers with the greatest likelihood of churning. In real life, this step would be executed against unlabeled data in order to predict which of the existing customers are likely to churn.

    Figure 4-11. Filtering for the Probability of Churn


    The final step is to see the results in Record Viewer, eliminating unnecessary columns for easier reference, as follows:

  10. In the Data Destination pane of Tool Manager, click the Viz Tools tab, and select Record Viewer. In the Data Transformations pane, select all columns except area code, phone number and p_churned, then click the Remove Column button. Select multiple columns by pressing the Shift key for a range, or the Control key for specific selections.

  11. Click on Invoke Tool.

    The result is a useful phone list of those customers shown in Figure 4-12, who have the greatest likelihood of churning based on the model.

    Figure 4-12. Record Viewer Results


In Record Viewer, every record has a number estimating the probability that the customer will churn. Filtering has retained the customers with the highest numbers. This provides a list of only those potential churn customers to whom you should give incentives (for example, solicit by phone, send mail, and so forth).
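
The filter and column selection amount to the following sketch. The function call_list and the three records are made up for illustration (the phone numbers are placeholders, not values from the churn dataset); only the threshold 0.056 and the column names come from the tutorial.

```python
THRESHOLD = 0.056  # estimated probability read off the knee of the lift curve

def call_list(records, threshold=THRESHOLD):
    """Keep records whose p_churned exceeds the threshold, with three columns only."""
    keep = ("area code", "phone number", "p_churned")
    return [
        {column: record[column] for column in keep}
        for record in records
        if record["p_churned"] > threshold
    ]

# Made-up records standing in for the dataset after Apply Model.
records = [
    {"state": "CA", "area code": 415, "phone number": "555-0101", "p_churned": 0.910},
    {"state": "NJ", "area code": 408, "phone number": "555-0102", "p_churned": 0.056},
    {"state": "OH", "area code": 510, "phone number": "555-0103", "p_churned": 0.300},
]
targets = call_list(records)
for row in targets:
    print(row)
```

Because the expression uses a strict greater-than comparison, a record sitting exactly at 0.056 is left out.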

Reducing Misclassification Costs

You can reduce the cost of making mistakes in building the model using three important tools in MineSet: confusion matrix to give a detailed picture of errors and incorrect predictions, loss matrix to take into account that some mistakes are worse than others, and return-on-investment curve to show when investing more time or money is fruitless.

Displaying a Confusion Matrix

Return to the Tool Manager window, and reopen /usr/lib/mineset/churn.schema.

  1. In the Data Destination pane, select the Mining Tools tab; then choose the Classify tab and make selections from these popup menus:

    Mode: Classifier & Error

    Inducer: Decision Tree

    Discrete Label: churned

  2. Click Further inducer options and the Classifier options pane shown in Figure 4-13 appears.

    Figure 4-13. Classifier Options Pane Showing Confusion Matrix Checked On


  3. Click on Display Confusion Matrix in the lower right, then click OK.

  4. In the Tool Manager Data Destination pane click Go!

    The Confusion Matrix displays where the classifier makes mistakes in classifying. Dismiss the Tree Visualizer and examine the Confusion Matrix. From this you can construct a Loss Matrix based on what you now know about the data, to make certain kinds of errors less tolerable than others.

    Figure 4-14. Confusion Matrix Showing Correct and Incorrect Classifications


    In the window shown in Figure 4-14, the tall blue bar and the short white bar represent correct classifications; but substantial misclassification occurs in the category represented by the red bar—Predicted Class: no, Actual Class: yes. These customers were predicted not to churn, but actually did so, a costly mistake even at 4.6%. You can try to reduce that error by constructing a Loss Matrix based on what you now know about the data, and weight the errors represented by the red bar more heavily.
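
A confusion matrix is simply a count of records for each pairing of actual and predicted class. The sketch below uses ten invented records, so its figures do not match Figure 4-14; the function name is likewise an assumption.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count records for every (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))

# Invented labels for ten customers.
actual    = ["no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes"]
predicted = ["no", "no", "no", "no", "no", "no", "yes", "yes", "no", "no"]
matrix = confusion_matrix(actual, predicted)

# Predicted no, actual yes: the costly error highlighted by the red bar.
missed_churners = matrix[("yes", "no")]
missed_share = missed_churners / len(actual)

for a in ("no", "yes"):
    for p in ("no", "yes"):
        print(f"actual {a:3s} predicted {p:3s}: {matrix[(a, p)]}")
print(f"missed churners: {missed_share:.0%} of all records")
```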

Defining a Loss Matrix

The purpose of this process is to control which errors the classifier will favor and which it will avoid.

  1. Dismiss the Confusion Matrix display with File > Exit and return to the Tool Manager window.

  2. Click Further inducer options to return to the Classifier options pane.

  3. In the second section of the upper left of the pane, click Use Loss Matrix, and wait briefly.

  4. Click Edit Matrix to weight the cost of making errors. The Loss Matrix pane similar to that shown in Figure 4-15 appears.

    Figure 4-15. Loss Matrix Showing Weighting


  5. Set these values across the rows of the Loss Matrix, reading from left to right:

    Predicted Values:      ?     no    yes

    Actual Values: no:    10      0      3

    Actual Values: yes:   10     10    -10

    The value in the column under the question mark should be at least as high as any other number in the matrix, to prevent the classifier from predicting “unknown.” If you predict a customer won't churn, and you are correct, you neither win nor lose (represented by zero). If you predict a customer will not churn, and so fail to mail to them, and they do churn, you incur a loss of 10 (represented by positive 10, because the numbers represent loss). If you incorrectly predict a customer will churn, and they do not, you lose 3, representing the cost of sending a mailing unnecessarily. If your mailing program works and you save yourself from losing a customer, you gain 10 (represented by negative 10).
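
One standard way to use such a matrix is to choose, for each customer, the prediction with the lowest expected loss given the estimated probability of churn. The sketch below assumes that rule for illustration; the chapter does not spell out how MineSet applies the matrix during induction. The matrix values are the ones entered in step 5.

```python
# LOSS[actual][predicted], using the values from step 5.
LOSS = {
    "no":  {"?": 10, "no": 0,  "yes": 3},
    "yes": {"?": 10, "no": 10, "yes": -10},
}

def expected_loss(prediction, p_churn):
    """Average loss of a prediction when the customer churns with probability p_churn."""
    return (1 - p_churn) * LOSS["no"][prediction] + p_churn * LOSS["yes"][prediction]

def best_prediction(p_churn):
    """Pick the prediction with the smallest expected loss."""
    return min(("no", "yes", "?"), key=lambda pred: expected_loss(pred, p_churn))

# Predicting yes costs 3 - 13p and predicting no costs 10p,
# so yes becomes the cheaper prediction once p exceeds 3/23.
break_even = 3 / 23

for p in (0.05, 0.10, 0.20, 0.50):
    print(f"p_churn={p:.2f}: predict {best_prediction(p)}")
print(f"break-even probability: {break_even:.3f}")
```

With these weights the prediction flips to yes at a probability of roughly 0.13 rather than 0.5, which is why the classifier makes fewer churn=no predictions once the loss matrix is in use. The "?" column never wins, because its loss of 10 is never lower than the better of the other two.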

Viewing a Return on Investment Curve

The Return on Investment curve shows the cost of making certain kinds of errors, and indicates the point at which it is no longer fruitful to continue taking action.

  1. Make sure Backfit test set, Display Confusion Matrix, and Use Loss Matrix are checked on in the Classifier options pane.

  2. Ensure Display ROI Curve also is checked on.

  3. Ensure ROI/Lift label is set to yes and click OK.

  4. Click Go! in the Tool Manager window.

Three display windows appear: the Decision Tree, the Confusion Matrix, and the ROI Curve. Notice that the Confusion Matrix shows the classifier is more conservative in making churn=no predictions, thus reducing false negatives. The Confusion Matrix now displays a different weighting, one of 4%, which takes loss into account. The errors on one side have been increased, but those on the other have been decreased. Dismiss both the Confusion Matrix and the Decision Tree display using File > Exit, and examine the ROI curve window.

Figure 4-16. Return on Investment Curve


The ROI curve shown in Figure 4-16 bears a marked resemblance to a lift curve. The horizontal line across the middle represents zero profit and loss. The red line represents the expected performance if you were to take a random sample of the population and send them mail. You expect a loss if you mail to everyone, because of the cost of mailing. However, there is a point of optimum return on investment, represented by the knee of the curve, at 1448, or 15.2 percent of the population.
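
The shape of such a curve can be reproduced with a simplified sketch: mail to customers in order of classifier score and keep a running total of gains and mailing costs. The gain of 10 and cost of 3 are taken from the loss matrix; the function name, the scores, and the labels are invented, and MineSet's own ROI calculation may differ in detail.

```python
def roi_curve(scores, churned, gain=10, mailing_cost=3):
    """Cumulative profit when mailing customers from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, profit = [], 0
    for i in order:
        # A retained churner earns the gain; any other mailing is a wasted cost.
        profit += gain if churned[i] else -mailing_cost
        curve.append(profit)
    return curve

# Invented scores and actual labels (1 = churned) for fifteen customers.
scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.25, 0.20,
          0.15, 0.10, 0.08, 0.05, 0.04, 0.02, 0.01]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

curve = roi_curve(scores, churned)
best = max(range(len(curve)), key=lambda k: curve[k])
print(f"best return {curve[best]} after mailing {best + 1} customers")
print(f"return after mailing everyone: {curve[-1]}")
```

In this toy data the return peaks after the first four mailings and turns negative if everyone is mailed, mirroring the knee and the overall loss described above.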

Further Exploration of MineSet

See the MineSet User's Guide for descriptions of these tools and what the analytical data mining algorithms can show. The manual is online and can be launched by selecting Help > MineSet User's Guide.

This tutorial has only been a brief introduction to the MineSet tool suite. Other aspects covered in the MineSet User's Guide include:

  • Scatter Visualizer.

  • Tree Visualizer for visualizing hierarchies.

  • Option Tree Inducer and Classifier.

  • Association Rules Generator and Visualizer.

  • Regression, to allow you to predict a continuous value instead of a discrete one.

  • Transformations, including binning, distribution, and indexing of arrays.

  • Record weighting, which allows assigning different weights to different records, because some records are more important than others (for example, highly profitable customers).

  • Learning Curve, which can help you determine whether sampling can be done on your dataset to speed up the knowledge discovery process, without losing much of the accuracy of the induced classifiers.

  • Many tool options, including color manipulation and message boxes.

  • Animation sliders for visual tools.

  • Batch processing. The program mineset_batch can be used to execute operations non-interactively. This is useful if a job needs to run regularly (for example, once a night).

  • Error estimation using advanced techniques such as cross-validation.

Also described in the MineSet User's Guide are the technical details of file and data manipulation.


Note: Data mining algorithms find correlations that may not be causal. A well-known discovery is the strong correlation between shoe size and reading ability: the larger one's shoe size, the better the reading ability. This correlation, while true, is not causal; both shoe size and reading ability improve with age (as children get older, their shoe size and ability to read both increase). You are cautioned against attributing causality to discovered correlations. Wearing larger shoes is unlikely to increase your reading ability.
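
The shoe-size example can be checked numerically with a small constructed dataset. All numbers below are made up so that shoe size and reading score each depend on age and on nothing else; the sketch compares the correlation across all children with the correlation among children of a single age.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Constructed data: (age, shoe size, reading score) for four children per age.
# Within an age group, shoe size and reading score vary independently.
children = []
for age in range(5, 13):
    for shoe_offset, reading_offset in [(-0.5, -5), (-0.5, 5), (0.5, -5), (0.5, 5)]:
        children.append((age, age + shoe_offset, 10 * age + reading_offset))

overall = pearson([c[1] for c in children], [c[2] for c in children])
eight_year_olds = [c for c in children if c[0] == 8]
within_age = pearson([c[1] for c in eight_year_olds], [c[2] for c in eight_year_olds])

print(f"all children:     {overall:.2f}")
print(f"eight-year-olds:  {within_age:.2f}")
```

Across all ages the correlation is strong, but it vanishes once age is held fixed, which is what you would expect if age drives both measurements.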