This chapter walks you through one possible knowledge discovery process using the churn dataset provided with MineSet. It assumes that MineSet is installed on the system you use, together with all the sample files. Each step is explained in detail and, unless otherwise noted, builds on the step before.
The churn dataset deals with telecommunications customers, that is, people who use the phone regularly. Customers have a choice of carriers, the companies that provide them with telephone service. When these customers change carriers they are said to “churn,” which results in a loss of revenue for the previous carrier. A telecommunications company is likely to have a database of call records containing call information (source, destination, date, duration), a billing database, a customer database, and a customer service database. Relevant information about the customer appears in all these databases. This information, when combined, yields a set of customer signatures. The churn dataset provided with MineSet is such a set; the step of identifying the data and assembling customer signatures into records has already been done. This dataset, which is used in the rest of the chapter, contains one record per customer.
Start MineSet by typing in a shell window:
mineset
or double-click the MineSet icon on your desktop. If you have previously executed MineSet, you may be presented with a restored session. If this happens, cancel the initial dialog box shown in Figure 3-1 and continue as described below.
Choose File > Connect to Server. At Server Name: enter localhost or the name of the server on which MineSet is installed. At Login name: enter your user name and password. Click OK.
In the Tool Manager window, choose File > Open New Data File, change the directory pathname to /usr/lib/mineset/data/, and select churn.schema. A series of entries appears in the right-hand Preview Columns pane as shown in Figure 3-2. Click OK.
This gives you access to a dataset of telecommunications customers. The next time you run MineSet, you are automatically returned to this position, and any option selections you make are saved.
You can see the records in spreadsheet form, after bringing up MineSet Tool Manager, by following these steps:
In the Data Destination pane of the Tool Manager window, click the Viz Tools tab; from the Tool popup menu, choose Record Viewer.
Click Invoke Tool at the lower right.
This churn dataset is used in the rest of the chapter and contains one record per customer. The Data Transformations pane on the left side of Tool Manager lists columns with their type: state (string), account length (double), and so forth. Columns are defined as double or float if they are numbers, or string if they are made up of characters.
The data appears as a spreadsheet. Some columns and their meanings are shown in Table 3-1.
Table 3-1. Details of Columns Shown in MineSet Record Viewer
| Column name | Description |
|---|---|
| state | Two-letter abbreviation for the customer's U.S. state of residence |
| account length | Numerical value indicating the number of months the customer has been with the long-distance carrier |
| area code | Typical three-digit telephone company designations |
| phone number | Typical three+four-digit telephone company designations |
| international plan | Special pricing package for international calls, expressed as a yes/no value |
| voice mail plan | Special pricing package for customers with voice mail provided by the carrier, expressed as a yes/no value |
| number of voice mail messages | Average number of voice mail messages per day |
| total day minutes | Number of minutes charged at the carrier's day rate |
| number customer service calls | Number of calls this customer made for assistance to carrier customer support in the last six months |
| churned | Whether this customer changed long-distance carriers in the last six months, expressed as a yes/no value |
Choose File > Exit to close the Record Viewer.
You should see the Tool Manager window again, still using the churn data source.
In the Data Destination pane of the Tool Manager window, the Viz Tools tab is still displayed; from the Tool popup menu, choose Statistics Visualizer.
Click the Invoke Tool button.
A display appears consisting of a number of box plots and histograms. The box plots show summary statistics for continuous variables; the histograms show the distribution of values for discrete variables.
Each box plot (on the left in Figure 3-3) shows statistics about data from a single column, including the minimum, maximum, mean, median, and two quartiles (the 25th and 75th percentiles). These values are marked as lines, and the standard deviation is shown after the +/- sign.
The mean is the number found by adding the data in a column, then dividing by the number of records. The median is the middle number when the numbers in a given column are arranged in order of size. The standard deviation is a measure of the dispersion of the data in a column.
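The statistics defined above can be reproduced outside MineSet. The following Python sketch is only an illustration: the sample values are hypothetical (not taken from the churn dataset), and the use of the population standard deviation and of Python's default quartile interpolation are assumptions, since this chapter does not state which formulas the Statistics Visualizer uses.

```python
import statistics


def summarize(values):
    """Return the summary statistics that a box plot displays for one column."""
    # quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentiles.
    q25, _, q75 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),      # sum of values / number of records
        "median": statistics.median(values),   # middle value once sorted
        "q25": q25,
        "q75": q75,
        "stdev": statistics.pstdev(values),    # assumption: population std. deviation
    }


# Hypothetical column values, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mean is 5.0, the median is 4.5, and the standard deviation is 2.0, so the box plot would be labeled 5.0 +/- 2.0.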
The histograms show counts for specific discrete values: state names or yes/no values. Scroll down to find the churned histogram in the lower right of the display (see Figure 3-3, right). Notice the total values and their distribution, which shows that 707 customers out of 5,000 have left the carrier. The churned column is of greatest interest throughout this tutorial.
Choose File > Exit in the Statistics Visualizer window to close the window and return to the Tool Manager window.
You are now ready to perform analytical data mining. Verify that MineSet Tool Manager is connected to the appropriate server, and that the data source is /usr/lib/mineset/data/churn.schema. If you exited MineSet between sessions, the history file automatically returns you to where you left off.
In the Data Destination pane of the Tool Manager window, click the Mining Tools tab.
Click the Classify tab, and make selections from these popup menus:
Mode: Classifier & Error
Inducer: Evidence
Discrete Label: churned
You are about to induce an evidence classifier to help characterize the customers who are likely to churn. The default mode, Classifier & Error, employs a holdout method on the data, inducing the classifier from two-thirds of the data and leaving the remainder as a test set to estimate the error rate.
Click Go!
The Status window on the bottom of Tool Manager shows progress and summary information about the induction process, including the estimated error rate of 12%. When the induction step is done, the Evidence Visualizer is automatically invoked, showing the model visually.
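The holdout method behind the Classifier & Error mode can be sketched in a few lines of Python. This is not MineSet's implementation: the function names, the shuffling, the fixed seed, and the toy records are all assumptions made for illustration. Only the two-thirds/one-third split comes from the description above.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records, then split them into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def error_rate(predict, test_records):
    """Fraction of held-out (features, label) pairs that predict() gets wrong."""
    wrong = sum(1 for features, label in test_records if predict(features) != label)
    return wrong / len(test_records)


# Hypothetical records: (features, label) pairs, not the real churn data.
records = [(i, "yes" if i % 7 == 0 else "no") for i in range(5000)]
train, test = holdout_split(records)
print(len(train), len(test))


# A trivial stand-in classifier that always predicts "no".
def always_no(features):
    return "no"


print(f"estimated error rate: {error_rate(always_no, test):.2%}")
```

The classifier is built from the training set only; the error rate is measured on records it has never seen, which is why it serves as an estimate of performance on new customers.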
In the Evidence pane (on the left of Figure 3-4), rows of charts represent columns in the dataset, each showing the proportion of customers who churn. Switch cursor mode from grasp to pick, and click on the box titled Evidence. The pie charts show probable outcome, just as the cake charts showed evidence. For this tutorial these square cakes and round pies are termed “charts.”
Attributes are sorted by discriminating power for the label “churned,” starting from the top.
In the Label Probability pane on the right of the screen in Figure 3-4, the round pie chart reflects the probability of seeing this label in the data for a randomly chosen record, ignoring all attribute values. Mathematically, this is the number of records with the class label, divided by the total number of records.
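As a worked example of this ratio, the Python sketch below uses the counts reported by the churned histogram earlier in this chapter (707 churned customers out of 5,000). The function name is hypothetical, and the pie chart may be computed from the training portion of the data rather than from all records, so treat the figure as illustrative.

```python
from collections import Counter


def label_probabilities(labels):
    """Probability of each label: records with that label / total records."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Counts taken from the churned histogram: 707 "yes" out of 5,000 records.
labels = ["yes"] * 707 + ["no"] * 4293
probs = label_probabilities(labels)
print(f"churned = yes: {probs['yes']:.2%}")  # churned = yes: 14.14%
print(f"churned = no:  {probs['no']:.2%}")   # churned = no:  85.86%
```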
Pass the pointer across a chart in the left pane, and the evidence or probability shows above the display window. Click on a chart to update the right pane to show the expected probability according to the model.
Use the thumbwheels at the screen's border to dolly back and forth, and tilt either the X or Y axis to get a closer view of any chart. In grasp mode, the middle mouse button moves your view in the display. In pick mode, select multiple charts by holding down the Shift key and pressing the middle mouse button.
The factors that affect churning are visible here: the slice representing churn grows from left to right in the first and second rows, which reveals a serious problem. Customers who use the company's service the most also churn at a higher rate. The company is not just losing customers; the lost customers are its most valuable.
To find out about a class label (for example, churned = yes), select a value in the Label Probability pane on the right. Click on the button near the label yes in the right pane, and the evidence is shown as bars. Pointing to a bar shows you the estimated probabilities.
The discriminating attributes shown here can be used to choose axes for a scatterplot visualization. The state attribute, shown relatively high on the list of attributes, hints at a possible geographical relationship.
Evidence models show attributes independently; however, in many datasets a combination of attributes determines the label. The error estimate shown in the status window indicates that the classifier is expected to have about a 12% error rate. Later we generate a decision tree that is significantly more accurate. Because total day minutes and total day charge are related, only total day charge is used for the next step. Close the Evidence Visualizer (File > Exit) before moving on to work with the Splat Visualizer.
The Splat Visualizer requires that the column mapped to color must have a numerical value. The churned column is a string that must be converted to a number (p_churned, indicating the probability of churning), before mapping it in Splat Visualizer. The next procedure shows how this is accomplished.
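Conceptually, the conversion assigns 100 to every record whose churned value is yes and 0 to every other record, so that an average of the new column reads directly as a churn percentage. The Python sketch below illustrates that mapping only; it is not MineSet's expression language, and the function name and sample records are hypothetical.

```python
def p_churned(churned):
    """Map the string label to a number: 100 for "yes", 0 for anything else."""
    return 100.0 if churned == "yes" else 0.0


# Hypothetical records, for illustration only.
records = [
    {"churned": "yes"},
    {"churned": "no"},
    {"churned": "no"},
    {"churned": "yes"},
]

# Add the new numeric column to every record.
for record in records:
    record["p_churned"] = p_churned(record["churned"])

average = sum(record["p_churned"] for record in records) / len(records)
print(average)  # 50.0, i.e. half of these sample records churned
```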
In the Data Destination pane of Tool Manager, select the Viz Tools tab, then choose Splat Visualizer from the Tools popup menu.
In the Data Transformations pane of Tool Manager, click Add Column.
From the Add Column dialog box, in the New Column Name text field, enter the new name, p_churned (see Figure 3-5). The intention is to make a column of numbers, based on the churned column.
From the Add Column dialog box, in the Defined By Expression text field, create the expression (`churned`=="yes") ? 100 : 0. You can create this expression from the building blocks in the two scrolled lists “Add column name to expression” and “Add op to expression”, or you can type it in directly. This expression translates to “for all the values in the churned column that are yes, give them the value of 100, otherwise give them the value of 0.” The purpose of this is to translate a string (yes or no) into a numerical value. Verify that the New type button is set to double.
Click Check Expression to ensure there are no syntax errors. Click OK to add the column.
In the Data Transformations pane, map items in Current columns to items in Visual Elements by clicking first the left pane, then the right. For this tutorial, map total day charge to Axis 1, number customer service calls to Axis 2, international plan to Axis 3, and p_churned to color, shown in Figure 3-6. Only those entities without an asterisk require mapping.
Click Invoke Tool.
The data is plotted in the Splat Visualizer window, shown in Figure 3-7. The slider bar in the upper left varies the color density. Use the question mark in the right toolbar for help on window manipulation. Rotate the splat plot with the grasping hand until any trends stand out. Splat Visualizer allows you to analyze complex data by showing how behavior varies across several dimensions.
You can save the current state of the Tool Manager including special options by choosing Save Current Session As from the File menu, and specifying churn1.mineset. To save a snapshot of the current visualization, choose Save As from the File menu, and specify churn1-out.rgb.
In the visualization shown in Figure 3-7, the highest probability of churn occurs in two places: in the yellow to red areas when total day charge is high, shown in the bottom of this figure; and when the total day charge is low and customer service calls are high (near the upper left of this figure). Low-paying customers who make many customer service calls leave. These are customers you may not want to stay, because they cost you money and bring little reward. The high-paying customers at the bottom of the figure are a better target.
Close Splat Visualizer and return to the Tool Manager Window.
As shown in Figure 3-4, the Evidence model indicated that state was a powerful discriminating attribute. This section builds on previous computations to display data geographically to illustrate how churn varies by state. You have already added the column (p_churned) from existing columns in the dataset.
You can now transform the data into a smaller dataset that contains the average churn per state. Such a transformation is called aggregation.
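The effect of this aggregation can be sketched in Python as a group-by on state that keeps the average and the count of p_churned. The output names avg_p_churned and count_p_churned match the columns that MineSet produces later in this section; the function name and the sample records are hypothetical, and this is an illustration of the idea rather than of MineSet's internals.

```python
from collections import defaultdict


def aggregate_by_state(records):
    """Collapse one-record-per-customer data into one record per state."""
    groups = defaultdict(list)
    for record in records:
        groups[record["state"]].append(record["p_churned"])
    return {
        state: {
            "avg_p_churned": sum(values) / len(values),  # average churn (percent)
            "count_p_churned": len(values),              # customers in the state
        }
        for state, values in groups.items()
    }


# Hypothetical customer records, for illustration only.
customers = [
    {"state": "ME", "p_churned": 100.0},
    {"state": "ME", "p_churned": 0.0},
    {"state": "ME", "p_churned": 0.0},
    {"state": "ME", "p_churned": 0.0},
    {"state": "TX", "p_churned": 100.0},
    {"state": "TX", "p_churned": 0.0},
]

by_state = aggregate_by_state(customers)
for state, row in sorted(by_state.items()):
    print(state, row)
```

Keeping the count alongside the average matters: an average computed from a handful of customers is far less reliable than one computed from hundreds.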
In the Data Transformations pane of the Tool Manager window click Aggregate.
In the Aggregate dialog box move p_churned into the left column, click Average and Count on, and turn off Sum. Leave state in the central column and move all the rest to the right column. (Hold down the Shift key to gather multiple columns.) Make sure your screen looks like Figure 3-8. Click OK to apply your choices.
Click the Viz Tools tab in the Data Destination pane of the Tool Manager window; choose Record Viewer from the Tool popup menu; click Invoke Tool. You should see a record for each state with the average churn and number of customers for that state.
Close the Record Viewer window and return your focus to the Tool Manager window. You will now link this data onto a map of the United States.
Choose Map Visualizer from the Tool popup menu, and click the Tool Options button on the Viz Tools tab of the Tool Manager window.
Click the Find File button to the right of the Entities File text field and select usa.state.hierarchy. The status of the dialog box is shown in Figure 3-9. The pathname is /usr/lib/mineset/mapviz/gfx_files. Click OK to retrieve that file, and OK to dismiss the Map Viz Options dialog box.
The next step is to link the visual elements to the columns.
From the Data Transformations pane of Tool Manager, map the columns to entities in the Data Destination pane. Map items in Current Columns to items in Visual Elements by clicking on first the left pane, then the right.
Map state to Entity-Bars and avg_p_churned to Color-Bars, and count_p_churned to Height-Bars.
Click Invoke Tool to view the map distribution of churned customers according to state (Figure 3-11).
The tool shows the distribution of churned customers across the United States. For each state, the color indicates the probability of churn and the height indicates the number of customers in that state. For example, in Figure 3-11, Maine is selected, showing an average churn rate of 18.4466%; the count of 103 means that this average is based on only 103 customers. West Virginia shows the greatest height, with a probability of churn based on 158 customers. The states with the brightest colors (Texas, Montana, Washington, California, and New Jersey) have an average churn rate over 21%. This visualization indicates that there is no obvious relationship between churn and geography, although different states do have different churn rates.
Close Map Visualizer using File > Exit. The next example explores the Decision Tree classifier, using the same dataset to produce a different visualization.
Unlike the Evidence classifier, the Decision Tree classifier can show attribute interactions, that is, combinations of attribute values that affect the label. For this section, start with a fresh file by selecting File > Open New Data File, and entering /usr/lib/mineset/data/churn.schema. You will now build a decision tree classifier and visualize it.
Click the Mining Tools tab from the Data Destination pane.
Click the Classify tab, and make selections from these popup menus:
Mode: Classifier & Error
Inducer: Decision Tree
Discrete Label: churned
Click Go!
MineSet induces the classifier and displays the Decision Tree model as shown in Figure 3-12. Notice that the estimated error rate (5.40%) is significantly lower than the roughly 12% estimated for the Evidence classifier in Figure 3-4, supporting the earlier hypothesis that interactions between attributes are significant. In Figure 3-12, every node in the decision tree has two bars on it, one for each label value. Pointing to a bar shows the record count and percentage for that label value. Every node has a base, indicating the number of records that reach it, and a color, indicating the estimated error rate for the subtree (see the legend at the bottom of the visualization).
In this example the root of the decision tree is marked with total day minutes, indicating that this is the single most important factor—how long these customers talked, with a dividing threshold of 264.45 minutes.
You can virtually fly through the Tree Visualizer landscape using the middle mouse button. By pointing to the red bar (churned yes) at the root, you can see that 14.14% of the customers churn. Follow the right line (total day minutes > 264.45) to the child node, which contains customers that talk frequently on the phone, with 59.31% of the customers churning. Again follow the right line from the child node, which shows that of those customers who talk frequently on the phone, those with voice mail churn at a much lower rate of 9.33%. Perhaps offering voice mail to customers can help reduce churning.
It is important to understand that this tree was automatically induced from the data. The attributes chosen for nodes and the thresholds are determined by the process of induction.
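The path traced above can be written out as plain conditionals to show how a decision tree routes a record from the root to a node. The sketch below is hand-coded purely for illustration, whereas MineSet induces the real tree from the data. The threshold and the 9.33% figure come from the walkthrough above; the assumption that the second split tests the voice mail plan column for yes, the function name, and the placeholder text for branches that the walkthrough does not describe are all hypothetical.

```python
THRESHOLD = 264.45  # root split on total day minutes, from the walkthrough


def node_reached(total_day_minutes, voice_mail_plan):
    """Return a description of the node a record reaches along the traced path."""
    if total_day_minutes <= THRESHOLD:
        # The walkthrough follows only the right line from the root.
        return "left subtree (not described in the walkthrough)"
    # Right child of the root: customers above the threshold (59.31% churn).
    if voice_mail_plan == "yes":  # assumption: second split is on voice mail plan
        return "heavy users with voice mail: 9.33% churn"
    return "heavy users without voice mail (rate not quoted in the walkthrough)"


print(node_reached(300.0, "yes"))
print(node_reached(300.0, "no"))
print(node_reached(120.0, "no"))
```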
To drill through and see the original data, select a node base or a bar and choose Selections > Show Original Data. A Record Viewer shows the records matching the node you selected.
If you would like to explore MineSet further, and discover more about applying a classifier, continue to the next chapter, Chapter 4, “Further Explorations.”