This chapter discusses the components and capabilities of the Association Rules generation and visualization tool. After a brief overview, the first sections cover the kind of rules that are generated during the rule generation step and define the vocabulary. The next sections explain how to construct a visualization of the rules using the Scatter Visualizer. The final sections list and describe the sample files provided for these tools, and show how you can convert an old-style .ruleviz file to the .scatterviz format. This chapter replaces Chapter 9, “Using the Rules Visualizer,” in the MineSet User's Guide.
The Association Rules tool lets you mine data by constructing, verifying, and graphically representing models of patterns in large databases. These patterns are expressed by rules of association, which indicate the frequency of items occurring together in a database.
Discovering and graphically displaying association rules can be relevant to many enterprises. Some examples of where the Association Rules tool may generate useful associations are supermarket inventory planning, shelf planning, and attached mailing in direct marketing.
There are two steps involved in working with association rules:
Rules generation. The data file is processed by the Association Rules Generator, which creates a file usable by the visualizer.
Rules visualization. This operation displays the generated association rules.
The execution sequence of association rules generation and visualization is shown schematically in Figure 2-1.
The Association Rules Generator can generate both simple (one-to-one) and multiway association rules. This section describes simple association rules. For a description of multiway rules, see “Multiway Association Rules”.
A simple association rule states that, given that X is true, there is a certain probability that Y is also true. MineSet refers to X as the left-hand side (LHS) of the rule and to Y as the right-hand side (RHS) of the rule.
One example of applying association rules is to obtain “market basket” data for customer buying patterns. Here, a market basket is the set of items bought by a customer on a single visit to a store. An example rule in this context might be: “80% of the people who buy diapers also buy baby powder.” This percentage is known as the confidence of the rule.
In this example, “diapers” is the item on the left-hand side (LHS) of the rule, and “baby powder” is the item on the right-hand side (RHS) of the rule.
Some applications of these rules are:
If item A appears on the RHS, the LHS can help us determine what the store should do to boost sales of this item.
If item B appears on the LHS, the RHS can help us determine what products might be affected if the store were to discontinue item B.
The Association Rules Generator processes an input file, then generates an output file consisting of the rules. If X and Y are items in a record, then a rule such as:
X=>Y
indicates that whenever X occurs in a record, expect Y to occur with some frequency.
The strength of the association is quantified by four numbers:
The first number, the confidence of the rule, quantifies how often X and Y occur together as a fraction of the number of records in which X occurs. For example, if the confidence is 50%, X and Y occur together in 50% of the records in which X occurs. Thus, knowing that X occurs in a record, the probability that Y also occurs in that record is 50%.
The second number, the support of the rule, quantifies how often X and Y occur together in the file as a fraction of the total number of records. For example, if the support is 1%, X and Y occur together in 1% of the total number of records.
You can specify a minimum support threshold for the generated rules. The default minimum support threshold is 1%. Lowering the minimum support generates more rules and can slow the tool down. You can also specify a minimum confidence threshold for the generated rules. The default minimum confidence threshold is 50%.
Rules that meet a minimum support threshold are important for two reasons:
A rule might have business value only if a reasonably significant fraction of records support the rule. For example, if everyone who buys caviar also buys vodka, the rule Caviar=>Vodka has 100% confidence. However, if only a handful of people buy caviar, the rule might be of limited value to the retailer.
A rule might not be statistically significant if a very small number of records support the rule. The rule might be due to chance, and it would not be prudent to make decisions based on such a rule.
The third number, expected confidence, is the frequency of occurrence of the RHS item in the dataset. So the difference between expected confidence and confidence is a measure of the change in predictive power due to the presence of the LHS item. Expected confidence gives an indication of what the confidence would be if there were no relationship between the items.
The fourth number is lift. The lift is the ratio of confidence to expected confidence. The greater the number, the more unexpected the rule.
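The four measures can be computed directly from a set of records. The following is a minimal Python sketch, not MineSet code: the function name rule_measures and the sample baskets are hypothetical, and each record is assumed to be a set of items.

```python
def rule_measures(records, lhs, rhs):
    """Return the four strength measures for the simple rule lhs => rhs.

    records is a list of item sets (an assumed representation, not a
    MineSet file format). All measures are returned as fractions.
    """
    total = len(records)
    n_lhs = sum(1 for r in records if lhs in r)
    n_rhs = sum(1 for r in records if rhs in r)
    n_both = sum(1 for r in records if lhs in r and rhs in r)
    if total == 0 or n_lhs == 0 or n_rhs == 0:
        raise ValueError("both items must occur in at least one record")

    support = n_both / total      # how often X and Y occur together
    confidence = n_both / n_lhs   # share of X records that also contain Y
    expected = n_rhs / total      # how often Y occurs overall
    lift = confidence / expected  # ratio of confidence to expected confidence
    return {
        "support": support,
        "confidence": confidence,
        "expected confidence": expected,
        "lift": lift,
    }


# Hypothetical market-basket sample: ten store visits.
BASKETS = [
    {"diapers", "baby powder"},
    {"diapers", "baby powder", "milk"},
    {"diapers", "baby powder"},
    {"diapers", "baby powder", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"baby powder"},
    {"milk"},
    {"bread", "milk"},
]

MEASURES = rule_measures(BASKETS, "diapers", "baby powder")
```

For this sample, the rule diapers=>baby powder has 40% support, 80% confidence, 50% expected confidence, and a lift of 1.6.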
The Association Rules Generator does not report rules in which the confidence is less than the expected confidence. In other words, a rule such as A=>B is not reported if B occurs less frequently in the records that contain A than it does in the dataset as a whole.
Note: Given just Y and a rule of the form X=>Y, nothing is known about X. Rules specify implications only from the LHS to the RHS.
Table 2-1 summarizes the four numbers that quantify the strength of each association rule.
Table 2-1. Association Rules Components
| Measure | Description | Statistical Description |
|---|---|---|
| Support | Frequency of LHS and RHS occurring together. | P(LHS ∩ RHS) |
| Confidence | Of all occurrences of LHS, the fraction where RHS is also seen; equivalently, the support divided by the frequency of occurrence of LHS items. | P(RHS \| LHS) |
| Expected confidence | Frequency of occurrence of RHS items. | P(RHS) |
| Lift | Ratio of confidence to expected confidence. | P(RHS \| LHS) / P(RHS) |
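As a rough illustration of how the thresholds interact with the four measures, the sketch below enumerates simple rules by brute force. It is not the algorithm used by the Association Rules Generator; generate_simple_rules and the sample records are hypothetical. The defaults mirror the tool's 1% minimum support and 50% minimum confidence, expressed as fractions.

```python
from itertools import permutations


def generate_simple_rules(records, min_support=0.01, min_confidence=0.5):
    """Brute-force sketch: list every simple rule X => Y that passes the thresholds."""
    records = [set(r) for r in records]
    total = len(records)
    if total == 0:
        return []
    items = sorted(set().union(*records))
    count = {item: sum(1 for r in records if item in r) for item in items}

    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(1 for r in records if lhs in r and rhs in r)
        support = both / total
        confidence = both / count[lhs]
        expected = count[rhs] / total
        if support < min_support or confidence < min_confidence:
            continue  # below a user-specified threshold
        if confidence < expected:
            continue  # not reported: confidence is less than expected confidence
        rules.append({
            "LHS": lhs,
            "RHS": rhs,
            "support": support,
            "confidence": confidence,
            "expected confidence": expected,
            "lift": confidence / expected,
        })
    return rules


# Hypothetical sample records.
SAMPLE_RECORDS = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"caviar", "vodka"},
]

DEFAULT_RULES = generate_simple_rules(SAMPLE_RECORDS)
STRICT_RULES = generate_simple_rules(SAMPLE_RECORDS, min_support=0.3)
```

With the default thresholds, the sample yields caviar=>vodka at 100% confidence even though only one record supports it. Raising the minimum support to 30% removes that rule and leaves only the bread and milk rules, which is the effect described for the caviar example.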
Association rules are displayed graphically to permit you to explore and compare the generated rules. The rules are presented on a grid landscape in the Scatter Visualizer. The left-hand side (LHS) items are on one axis, and right-hand side (RHS) items are on the other. As shown in Figure 2-2, attributes of a rule are displayed at the junction of its LHS and RHS items. The display can include bars, disks, and labels.
If the displayed view is too small, item labels do not appear on the sides of the axes. You can zoom in on the view until the item labels appear (see the Dolly description in “Thumbwheels” in the MineSet User's Guide). You can also view the labels for a particular rule by placing the mouse pointer over an individual bar when the mouse is in select mode (see Figure 2-6). All of the details for that particular rule will be displayed in the upper left-hand corner of the view area.
A legend indicating the mapping between displayed attributes (such as bar heights and colors) and the values associated with the underlying rules (such as confidence and support) can be displayed at the bottom of the main window.
The Tool Manager creates the two files that are required to generate the rules visualization:
A rules file that results from running the Association Rules Generator, named the .rules.data file
A .rules.scatterviz file
The .rules midfix is not required, but will be used whenever these files are generated by the Tool Manager.
This section describes how the components of the Association Rules tool can be configured using the Tool Manager. The Tool Manager greatly simplifies the task of configuring the Association Rules tool. However, if you prefer, you can construct a configuration file for this tool using an editor (see Appendix A, “Using the Association Rules Generator With Transaction-Style Data,” in this Addendum) or by invoking MIndUtil directly to produce the rules (see Appendix I, “Command-Line Interface to MIndUtil: Analytical Data Mining Algorithms,” in the MineSet User's Guide).
The steps required to connect to a data source are described in Chapter 3, “The Tool Manager,” of the MineSet User's Guide.
To show how to set up simple associations, the following example uses the cars database table. Let's say that you want to find out if there is an association between miles per gallon, horsepower, and the year the car was built. For example, did mileage improve over time? Did engines become less powerful? The following steps (and Figure 2-3) show you how to set up the associations and map table columns to the data you want to study.
Connect to a MineSet server. Refer to Chapter 2, “Setting Up MineSet,” in the MineSet User's Guide if you need help.
Open a data source.
(Optional step) Number-valued columns are binned automatically, using uniform weight. If you prefer different bins, from the Data Transformations pane, choose specific numeric columns to bin before using the associations engine. Alternatively, to have each value considered individually, use the “Change Types” transformation to convert the column to string type. This prevents automatic binning altogether. (The binning operation and the options available for it are described in detail in Chapter 3, “The Tool Manager,” in the MineSet User's Guide.) Use conversion of type to string carefully, as it may lead to less “meaningful” rules from the association engine. For example, instead of using discrete values for the weightlbs attribute in the cars table such as 3504, 3693, 3436, 3433, and so on, it may be more meaningful to give weightlbs_bin value ranges such as 1600-2500, 2501-3500, and so on.
Choose the Mining Tools tab from the Data Destination tab.
Choose the Assoc. tab (abbreviation for Associations) from the Mining Tools tab.
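The effect of binning in the optional step above can be pictured with a short Python sketch. It is an illustration only and does not reproduce MineSet's automatic binning. The first two ranges come from the weightlbs example; the third range, the "other" label, and the function name bin_value are assumptions added for the example.

```python
# Inclusive (low, high) ranges for weightlbs. The first two come from the
# example in the text; the third is an assumed continuation.
WEIGHT_RANGES = [(1600, 2500), (2501, 3500), (3501, 4500)]


def bin_value(value, ranges):
    """Return a range label such as '2501-3500' for a numeric value."""
    for low, high in ranges:
        if low <= value <= high:
            return f"{low}-{high}"
    return "other"  # assumed catch-all for values outside every range


# The individual weights 3504, 3693, 3436, and 3433 collapse into two items.
BINNED = [bin_value(w, WEIGHT_RANGES) for w in (3504, 3693, 3436, 3433)]
```

Four distinct values become two items, so rules about weight are supported by many more records than rules about any single weight would be.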
After selecting a data source, you can run the Association Rules Generator immediately. Or you can choose settings from the following selections:
Confidence—lets you specify the minimum confidence threshold for rules. Rules with a confidence below this value are not generated. The default is 50%. The possible values are 0–100.
Support—lets you specify the minimum support threshold as a percentage of the total number of records. Rules with a support below this value are not generated. The default is 1%. The possible values are 0–100.
(Optional) Once you have made your association rule options selections, click the RuleViz Mappings button to map columns to visual elements.
The Association Rules tool allows for record weighting for those cases in which you want to specify that certain records are more important than others or when you want to compensate for uneven sampling. If Weight by Column is not checked, then each record has a weight of one.
To enable record weighting, click the Weight by Column checkbox. When the box is checked, a popup menu appears that allows you to choose the column that contains the weight for each record. The Weight is attribute? box, if checked, includes the weight column in the rules found by the Association Rules Generator. If the box is unchecked, the weight column is excluded from any rules found by the Generator.
See “Record Weighting: Not All Records Were Sampled Equally” in the MineSet User's Guide for a further explanation of record weighting.
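One way to picture record weighting is to replace record counts with sums of weights when computing the rule measures. The Python sketch below assumes that interpretation; it is not taken from the MineSet implementation, and the function name weighted_rule_measures and the sample data are hypothetical.

```python
def weighted_rule_measures(records, weights, lhs, rhs):
    """Compute rule measures with each record counted by its weight.

    Assumption: a record of weight w counts as w records. With every
    weight equal to one, this reduces to the unweighted measures.
    """
    total = sum(weights)
    w_lhs = sum(w for r, w in zip(records, weights) if lhs in r)
    w_rhs = sum(w for r, w in zip(records, weights) if rhs in r)
    w_both = sum(w for r, w in zip(records, weights) if lhs in r and rhs in r)
    if total == 0 or w_lhs == 0 or w_rhs == 0:
        raise ValueError("both items need a positive total weight")

    confidence = w_both / w_lhs
    expected = w_rhs / total
    return {
        "support": w_both / total,
        "confidence": confidence,
        "expected confidence": expected,
        "lift": confidence / expected,
    }


# Hypothetical sample: the first record counts twice as much as the others.
RECORDS = [{"a", "b"}, {"a"}, {"b"}]
WEIGHTED = weighted_rule_measures(RECORDS, [2.0, 1.0, 1.0], "a", "b")
UNWEIGHTED = weighted_rule_measures(RECORDS, [1.0, 1.0, 1.0], "a", "b")
```

In this sample, doubling the weight of the first record raises the support of a=>b from one third to one half.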
The Association Rules tool lets you map attributes of the rules to visual elements of the display. Clicking on the RuleViz Mappings button brings up the Association Rules Mappings panel shown in Figure 2-4.
The visual elements that can be mapped are listed below; the items preceded by “*” are optional:
Height - Bars—lets you specify what the bar heights represent.
*Height - Disks—lets you specify what the disk heights represent.
*Color - Bars—lets you specify what the bar colors represent.
*Color - Disks—lets you specify what the disk colors represent.
*Label - Bars—lets you specify what the bar labels represent.
The default mappings are as follows:
Support to bar height
Lift to bar color
There are five ways to start the rules visualizer:
Use the Tool Manager to configure and start the Association Rules tool (see “Configuring the Association Rules Tool Using the Tool Manager”). The Association Rules tool automatically launches the Scatter Visualizer tool.
Double-click the Scatter Visualizer icon, which is in the MineSet page of the icon catalog. The icon is labeled .rules.scatterviz. Because no configuration file is specified, the start-up screen requires you to use File > Open to select one.
If you know which configuration file you want to use, double-click the icon for that file. This starts the Scatter Visualizer and automatically loads the configuration file you specified. This works only if the configuration filename ends in .scatterviz (which is always the case for configuration files created for the Scatter Visualizer via the Tool Manager).
Drag the configuration file icon onto the Scatter Visualizer icon. This starts the Scatter Visualizer and automatically loads the configuration file you specified. This works even if the configuration filename does not end in .scatterviz.
Enter this command at the UNIX shell command-line prompt:
scatterviz [ filename.scatterviz ]
When starting the Scatter Visualizer, you must specify the configuration file, not the data file.
Note: To suppress the dialogs that pop up to indicate progress, use the -quiet option. You can enable this option permanently by adding the following line to your .Xdefaults file:
*minesetQuiet:TRUE
The Association Rules tool displays the data from a rules file in the Scatter Visualizer using the specifications of a valid configuration file. For example, specifying group.rules.scatterviz results in the image shown in Figure 2-5.
The rules are presented on a grid. Initially, left-hand side (LHS) items are displayed on the left side of the window and right-hand side (RHS) items on the right. A rule is displayed at the junction of its LHS and RHS items. The display can include bars, disks, and labels. For example, in Figure 2-5, bar heights correspond to support and bar colors correspond to lift.
When the scene is zoomed in enough, the LHS and RHS axes are labeled with the item names, unless this has been turned off in the configuration file. (To view the grid and rules at closer range, use the Dolly thumbwheel, described in “Thumbwheels” in the MineSet User's Guide.)
You can change the labels as well as what the heights and colors of the bars and disks represent by modifying the configuration file via the Tool Manager (see Chapter 3, “The Tool Manager,” in the MineSet User's Guide) or by using an editor to change the configuration file. Color maps are automatically produced when a variable is mapped to disk or bar color. If you wish to change these default color maps, you can edit the configuration file.
Placing the mouse cursor over an Association Rules object as shown in Figure 2-6 causes that object's information to be displayed. The information is displayed as long as the cursor remains over the object. If you position the cursor over an object and click the left mouse button, that same information appears in the Selection Window, which is above the main window, under the selection label. In addition, the bar gets selected and appears in a separate window containing all selected rules. Multiple rules may be selected by holding down the Shift key while clicking.
This information remains visible until another object is selected, or until no object is selected (if you click the black background). Using the mouse, you can cut and paste text from the selection window into other applications, such as reports or databases.
The drill through expression is determined by “anding” together the selected rules. Because the columns in the original table do not match the columns in the .rules.data file, the Association Rules Generator produces a special column to help construct the filter expression when a drill through is performed. As a result, changing the drill through preferences panel has no effect, because a special string-valued column has already been mapped to drill through in the .rules.scatterviz file.
When you drill through on a rule, MineSet shows all the records that satisfy the rule.
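The effect of “anding” selected rules can be sketched in Python as a filter over the original records. This is a conceptual illustration, not the filter expression MineSet constructs. It assumes that a record satisfies a rule when it contains both the LHS and the RHS item; the function names and sample records are hypothetical.

```python
def satisfies(record, rule):
    """Assumed meaning of 'satisfies': the record contains both sides of the rule."""
    lhs, rhs = rule
    return lhs in record and rhs in record


def drill_through(records, selected_rules):
    """Return the records that satisfy every selected rule (rules are ANDed)."""
    return [r for r in records
            if all(satisfies(r, rule) for rule in selected_rules)]


# Hypothetical sample records.
RECORDS = [
    {"diapers", "baby powder", "milk"},
    {"diapers", "baby powder"},
    {"diapers", "milk"},
    {"milk"},
]

ONE_RULE = drill_through(RECORDS, [("diapers", "baby powder")])
TWO_RULES = drill_through(RECORDS, [("diapers", "baby powder"),
                                    ("diapers", "milk")])
```

Selecting a second rule can only narrow the result: here one rule matches two records, and both rules together match one.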
See Chapter 18, “Selection and Drill Through,” in the MineSet User's Guide for more information about drill through.
Several external controls surround the main window, including buttons and thumbwheels. (These are the same as those in other MineSet visualization tools and are described in “Buttons” and “Thumbwheels” in the MineSet User's Guide.)
Since association rules are displayed using the Scatter Visualizer, the pulldown menus are documented in “Pulldown Menus,” in Chapter 7 of the MineSet User's Guide.
In some cases, it is useful to have more complex rules that have multiple items on the LHS and/or the RHS. These are multiway association rules. Figure 2-7 illustrates the Tool Manager Association panel configured for multiway rules generation.
If you check the Multiway Rules button, the Association Rules Generator generates all rules that satisfy the minimum support and confidence thresholds, including those that have more than one item in the LHS or RHS. An example of such a rule might be “beer and linguini implies potato chips and salsa and wine.”
Multiway rules are displayed using the Record Viewer rather than the Scatter Visualizer. They are displayed with one rule per row. The first two columns of the table contain the number of items in the LHS and RHS. The next four columns contain the support, confidence, expected confidence, and lift values. The last two columns contain the LHS and RHS items. In the LHS and RHS columns, the items are separated by the word “and.” In the example rule above, the LHS contains two items and is represented as “beer and linguini.” The RHS contains three items and is represented as “potato chips and salsa and wine.”
You can limit the size of the rules generated by entering a number in the “Max total items per rule” field. This number indicates the maximum number of items that are allowed in any rule. The number of items in a rule is the sum of the number of items in the LHS and RHS. The example rule above has five items; simple rules have two items.
Note: Generating multiway rules can take a long time. Watch the status window for an indication of the number of rules generated at each iteration. If too many rules are being generated, cancel the operation and increase the minimum support or confidence thresholds, or decrease the maximum allowable number of items per rule.
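To see why the item limit matters, consider the following brute-force Python sketch of multiway rule generation. It is not the algorithm used by the Association Rules Generator, and generate_multiway_rules and the sample records are hypothetical. Each result mirrors the Record Viewer columns described above, with items joined by the word “and.” Applying the expected-confidence filter to multiway rules is an assumption.

```python
from itertools import combinations


def generate_multiway_rules(records, min_support=0.01, min_confidence=0.5,
                            max_total_items=3):
    """Brute-force sketch: try every split of every itemset up to the limit."""
    records = [frozenset(r) for r in records]
    total = len(records)
    if total == 0:
        return []
    items = sorted(set().union(*records))

    def frequency(itemset):
        # Fraction of records that contain every item in itemset.
        return sum(1 for r in records if itemset <= r) / total

    rules = []
    for size in range(2, max_total_items + 1):
        for combo in combinations(items, size):
            full = frozenset(combo)
            support = frequency(full)
            if support == 0 or support < min_support:
                continue
            # Split the itemset into every non-empty LHS and RHS pair.
            for k in range(1, size):
                for lhs_items in combinations(combo, k):
                    lhs = frozenset(lhs_items)
                    rhs = full - lhs
                    confidence = support / frequency(lhs)
                    expected = frequency(rhs)
                    if confidence < min_confidence or confidence < expected:
                        continue
                    rules.append({
                        "nlhs": len(lhs),
                        "nrhs": len(rhs),
                        "support": support,
                        "confidence": confidence,
                        "expected confidence": expected,
                        "lift": confidence / expected,
                        "LHS": " and ".join(sorted(lhs)),
                        "RHS": " and ".join(sorted(rhs)),
                    })
    return rules


# Hypothetical sample records.
SAMPLE = [{"beer", "linguini", "potato chips"}] * 3 + [{"beer"}, {"potato chips"}]

TWO_ITEM_RULES = generate_multiway_rules(SAMPLE, max_total_items=2)
THREE_ITEM_RULES = generate_multiway_rules(SAMPLE, max_total_items=3)
```

The number of candidate itemsets grows combinatorially with the item limit, which is why lowering “Max total items per rule” or raising the thresholds shortens the run.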
The provided sample data and configuration files demonstrate the features and capabilities of the Association Rules tool.
The following sample rules and configuration files are provided for visualization. Some of these files correspond to hierarchical datasets. Rules files contain the generated rules obtained by running the Association Rules Generator. The files containing the rules should, by convention, have a .rules.data extension. Each configuration file specifies how the corresponding rules file is displayed. Configuration files must have a .scatterviz extension. The files mentioned in this subsection are in the /usr/lib/MineSet/scatterviz/examples directory:
group.rules.data and group.rules.scatterviz
These files provide the generated rules and configuration specifications for product groups, such as bread and baked goods, dairy milk, and carbonated beverages.
category.rules.data and category.rules.scatterviz
These files provide the generated rules and configuration specifications for product categories within product groups, such as refrigerated or non-refrigerated milk.
people94.rules.data and people94.rules.scatterviz
These files provide the generated rules and configuration specifications for a census database, showing associations between marital status, education level, age, income, and other variables.
germanCredit.rules.data and germanCredit.rules.scatterviz
These files provide the generated rules and configuration specifications for a credit database from Germany, showing associations between credit history, employment, savings, and other variables.
If you have existing .ruleviz files that you wish to convert to .scatterviz format, there are a few simple modifications you need to make. This can be done by editing the existing .ruleviz file and saving it as a .scatterviz file. Example 2-1 and Example 2-2 show the differences between the .ruleviz and .scatterviz formats. Example 2-2 has embedded comments to help you with the changes. Both configuration files use the same data file.
Note: In the old ruleviz file format, size was called height, confidence was called predictability, and support was called prevalence.
MineSet 2.5
input
{
    file "group.rules";
}
expressions
{
    double `pred/expected` = predictability/expected;
}
view
{
    height predictability;
    height max 10;
    height legend on;
    disk height expected;
    disk height legend label "disk height: expected predictability";
    color prevalence;
    color colors "white" "purple";
    color scale 0 10;
    color legend "0%" "10%";
    message "%s implies %s\npredictability: %.2f predictability/expected: %.2f prevalence: %.2f",
        LHS, RHS, predictability, `pred/expected`, prevalence;
    options grid size 3;
    options hide disk distance 600;
    options hide item distance 600;
}
MineSet 2.6
input
{
    # Rename group.rules to group.rules.data:
    file "group.rules.data";
    # The schema for the rules.data file is always
    # the following. Add these lines:
    int nlhs;
    int nrhs;
    float support;
    float confidence;
    float `expected confidence`;
    string LHS;
    string RHS;
}
expressions
{
    float lift = confidence / `expected confidence`;
}
view
{
    # This replaces height predictability:
    size confidence, scale 1.;
    size legend label "Bar Height: confidence";
    # This replaces disk height expected:
    disk height `expected confidence`, scale 1.;
    disk height legend label "Disk Height: expected confidence";
    # This replaces color prevalence:
    color support;
    color colors "white" "purple", legend label "Color: support";
    color scale 0 9;
    color legend "0%" "9%";
    # Add these two axis mappings (not present in old file):
    axis RHS, max 100, orderby alpha;
    axis LHS, max 100, orderby alpha;
    # Make sure the shape type is bar:
    options entity shape bar;
    options axis label size 20;
    message "%s implies %s\n support=%2.2f%%, confidence=%2.2f%%, expected confidence %2.2f%%, lift=%2.2f",
        LHS, RHS, support, confidence, `expected confidence`, lift;
    options grid color "#202020";
    options hide disk distance 600;
    options hide entity label distance 600;
}
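The schema declared in the input section and the lift expression can be mirrored in a few lines of Python, which may help when checking a converted file against its data. This sketch assumes that each line of the .rules.data file holds the seven fields in schema order, separated by tabs; that layout, the function name read_rules, and the sample row are assumptions, not taken from the file format documentation.

```python
import csv
import io

# Field names and types, in the order declared in the input section.
SCHEMA = [
    ("nlhs", int),
    ("nrhs", int),
    ("support", float),
    ("confidence", float),
    ("expected confidence", float),
    ("LHS", str),
    ("RHS", str),
]


def read_rules(stream):
    """Parse tab-separated rule rows (assumed layout) and add the lift column."""
    rules = []
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != len(SCHEMA):
            continue  # skip blank or malformed lines
        rule = {name: convert(value)
                for (name, convert), value in zip(SCHEMA, row)}
        # Same expression as in the configuration file.
        rule["lift"] = rule["confidence"] / rule["expected confidence"]
        rules.append(rule)
    return rules


# Made-up sample row, not taken from group.rules.data.
SAMPLE = "1\t1\t4.0\t80.0\t50.0\tdiapers\tbaby powder\n"
RULES = read_rules(io.StringIO(SAMPLE))
```

Because lift is a ratio, the result is the same whether confidence and expected confidence are stored as percentages or as fractions.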