This appendix replaces Appendix F, “Creating Data and Configuration Files for the Rules Visualizer,” in the MineSet User's Guide, and describes how to use two MineSet commands, assoccvt and assocgen, to find the association rules in a file presented in transaction-style format. To generate association rules from data stored in a table, use the Tool Manager interface, described in Chapter 2, “Using the Association Rules Tool.”
Note: The programs used in this appendix are installed on the MineSet server. The command-line Association Rules Generator produces only rules with lift greater than or equal to 1.
The examples used in this appendix can be found in the /usr/lib/MineSet/assoccvt/examples/ and /usr/lib/MineSet/assocgen/examples/ directories. Descriptions and instructions for use can be found in the README file in these directories.
MineSet provides tools for analyzing and visualizing data that is stored in a tabular format, either in a database or in a flat file. The MineSet Tool Manager allows you to generate and visualize the association rules in a table. If your data is in transaction format, you can use the utilities described in this appendix to find the association rules in your data. Transaction-style data is flat-file data in which each transaction is split across several rows. Each row has two columns: one holds a transaction identifier (transaction ID), and the other holds an item in that transaction.
For example, in a point-of-sale file representing supermarket transactions, a file might look like this:
10012 wheat bread
10012 beer
10012 tortilla chips
10012 eggs
10013 cereal
10013 chicken
10014 all-purpose flour
Rows are grouped by transaction ID. With this style of data, MineSet finds associations between items present in the same transaction. For example, the data may contain the association “eggs implies beer.”
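As an illustration of how transaction-style rows map onto baskets of items, the following Python sketch groups the sample rows by transaction ID. It is not part of MineSet; the function name `group_transactions` is hypothetical, and the sketch assumes that the transaction ID is the first whitespace-delimited token on each row.

```python
SAMPLE = """\
10012 wheat bread
10012 beer
10012 tortilla chips
10012 eggs
10013 cereal
10013 chicken
10014 all-purpose flour
"""


def group_transactions(text):
    """Collect the items of each transaction, keeping first-seen order."""
    baskets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumption: the ID is the first token; the rest of the row is the item.
        tid, item = line.split(None, 1)
        baskets.setdefault(tid, []).append(item.strip())
    return baskets


baskets = group_transactions(SAMPLE)
for tid, items in baskets.items():
    print(tid, items)
```

Associations are then sought between items that share a basket, such as eggs and beer in transaction 10012.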
There are three steps to finding the association rules present in transaction-style data:
Converting the data to MineSet's format (the assoccvt step)
Running the Association Rules Generator (the assocgen step)
Loading the results into the MineSet Tool Manager for further manipulation (for example, visualization using the Scatter Visualizer or the Record Viewer)
The association data converter requires:
A raw data file (consisting of your own data for running associations)
A format file, which describes the raw data file's format
The raw data file must be in the format shown in “About Transaction-Style Data”, in which there is one transaction ID and one item per row. Neither the transaction ID nor the item need be contiguous within the row of data, however. In addition, the records in the file must be of exactly the same length, and they must be grouped by transaction ID (though they need not be sorted).
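The two structural requirements above (equal record lengths and grouping by transaction ID) can be checked before running the converter. The following Python sketch is one possible pre-check, not a MineSet utility. The function name `check_raw_records` is hypothetical, and for simplicity it assumes a transaction ID made of a single field and ASCII data, so that characters and bytes coincide.

```python
def check_raw_records(lines, id_offset, id_length):
    """Return a list of problems found in fixed-width transaction records."""
    problems = []
    seen_ids = set()      # IDs whose group has already started
    previous_id = None
    expected_len = None
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\n")  # length excludes the end-of-line character
        if expected_len is None:
            expected_len = len(record)
        elif len(record) != expected_len:
            problems.append(
                f"line {number}: length {len(record)}, expected {expected_len}"
            )
        tid = record[id_offset:id_offset + id_length]
        if tid != previous_id:
            # A new group starts here; the ID must not have appeared earlier.
            if tid in seen_ids:
                problems.append(
                    f"line {number}: transaction {tid!r} is not grouped"
                )
            seen_ids.add(tid)
            previous_id = tid
    return problems


good = ["10 Jam  \n", "10 Eggs \n", "20 Soda \n"]
bad = ["10 Jam  ", "20 Soda ", "10 Eggs"]
print(check_raw_records(good, 0, 2))
print(check_raw_records(bad, 0, 2))
```

Note that the check accepts IDs in any order, as long as all rows for one ID are adjacent.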
The format file specifies the format of the raw data file to the association data converter. This format file must contain the following items in the order shown:
The letter “S” to indicate the file type
The number of bytes in each row, excluding the end-of-line character. Each row in the data file must have this many bytes.
The number of fields that make up the transaction ID
The total number of bytes in the transaction ID
The offset and number of bytes for each field that makes up the transaction ID
The number of fields that make up the item
The total number of bytes in the item
The offset and length in bytes for each field that makes up the item
A flag indicating whether or not there is a field describing the item. If so, this field is output in the generated rules along with the name. The flag must be either 0 (no) or 1 (yes).
If the description flag is 1, the following are also required:
Number of fields that make up the description
Total number of bytes in the description
Offset and length in bytes for each field that makes up the description
Note: Each column number is zero-based.
Most data files use only one field each for the item and the transaction identifier. For the example data listed above, assuming that each line is 80 characters wide (plus one for the end-of-line character), the format file would be:
S 80
1 5
0 5
1 74
6 74
0
This format file allows for a great deal of flexibility. For instance, there need be no separator between the transaction identifier and the item. The fields of the two may even be interleaved, as in:
bread  10012wheat  25
apple  10012fuji   25
banana 10012bunch  25
In this case, both the transaction ID and the item contain two fields of different lengths. If the total line width is 21 bytes, then the format file would be:
S 21 (21 bytes per line, not including end-of-line character)
2 7 (transaction ID is two fields, 7 bytes total)
7 5 (first transaction ID field is 5 bytes starting at column 7)
19 2 (second transaction ID field is 2 bytes starting at column 19)
2 13 (item is two fields, 13 bytes total)
0 6 (first item field is 6 bytes starting at column 0)
12 7 (second item field is 7 bytes starting at column 12)
0 (no description fields)
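To make the offsets and lengths concrete, the following Python sketch applies the field definitions from this format file to the three sample records. It only illustrates zero-based offset and length slicing. Joining the fields by simple concatenation is an assumption, since this appendix does not state how assoccvt combines multiple fields internally, and the helper name `extract` is hypothetical.

```python
RECORDS = [
    "bread  10012wheat  25",
    "apple  10012fuji   25",
    "banana 10012bunch  25",
]
LINE_WIDTH = 21
# (offset, length) pairs, zero-based, taken from the format file above.
ID_FIELDS = [(7, 5), (19, 2)]
ITEM_FIELDS = [(0, 6), (12, 7)]


def extract(record, fields):
    """Concatenate the byte ranges named by (offset, length) pairs."""
    return "".join(record[offset:offset + length] for offset, length in fields)


for record in RECORDS:
    if len(record) != LINE_WIDTH:
        raise ValueError(f"record is not {LINE_WIDTH} bytes: {record!r}")
    print(repr(extract(record, ID_FIELDS)), repr(extract(record, ITEM_FIELDS)))
```

For the first record, the transaction ID comes out as 7 bytes and the item as 13 bytes, matching the totals declared in the format file.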
To find the association rules in your data, you must first convert the data into an intermediate binary format, which the Association Rules Generator uses to find the rules. This conversion step uses the raw data file and format file, and produces two new files:
The output data file, containing the converted data
The output names file, containing auxiliary descriptor information used by the Association Rules Generator
The assoccvt program converts the data. Its usage is:
assoccvt [-ifile raw] [-ofile binary] format names
where raw is the name of the raw data file, binary is the name of the produced binary data file, format is the name of the format file, and names is the name of the output names file. If the -ifile parameter is omitted, standard input is used instead. Similarly, if the -ofile parameter is omitted, the standard output is used instead.
The following command illustrates the use of the association data converter on the example file in /usr/lib/MineSet/assoccvt/examples. The file sing.data is an example of data in transaction format and has some simple grocery store transactions. Each line has a transaction number and the name of an item bought in that transaction. The format of this file is described by sing.format.
assoccvt -ifile sing.data -ofile sing.bin sing.format sing.names
To test whether the files for data conversion are correctly installed, run the preceding command from the shell command line. Then, using the UNIX diff command, compare the files created to those with the same name in /usr/lib/MineSet/assoccvt/examples. Compare sing.bin with /usr/lib/MineSet/assoccvt/examples/sing.bin, and compare sing.names with /usr/lib/MineSet/assoccvt/examples/sing.names.
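Where the diff command is inconvenient, the same comparison can be scripted. The following Python sketch uses the standard `filecmp` module to compare the newly created files byte for byte against the reference copies. The function name `compare_with_reference` is hypothetical, and the sketch assumes it is run from the directory in which assoccvt wrote its output.

```python
import filecmp
from pathlib import Path

REFERENCE_DIR = Path("/usr/lib/MineSet/assoccvt/examples")


def compare_with_reference(names, work_dir=".", reference_dir=REFERENCE_DIR):
    """Map each file name to True (identical), False (differs), or None (missing)."""
    results = {}
    for name in names:
        mine = Path(work_dir) / name
        reference = Path(reference_dir) / name
        if not mine.exists() or not reference.exists():
            results[name] = None
        else:
            # shallow=False forces a comparison of the file contents.
            results[name] = filecmp.cmp(mine, reference, shallow=False)
    return results


if __name__ == "__main__":
    for name, same in compare_with_reference(["sing.bin", "sing.names"]).items():
        print(name, {True: "identical", False: "DIFFERS", None: "missing"}[same])
```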
The Association Rules Generator takes items in a set of data and generates association rules from them. The required input files are described in the following subsections. The output of the Association Rules Generator is a specially formatted rules file, which can be loaded into MineSet as a flat file for examination.
Rules are generated by invoking the assocgen command, along with one or more parameters. Options fall into one of the following categories:
Rule Generation Options—control the process of rule generation.
Rule Restriction Options—place restrictions on the set of generated rules.
The -ropts string separates the two sets of options. This string is required if any options from the second set are used.
An example rule generation command line might be:
assocgen -prev 20 -tran sing.bin -ropts -names sing.names -rout sing.rules
See “Rule Generation Options” and “Rule Restriction Options” for explanations of the parameters.
Note: In the assocgen program, support is called prevalence, and confidence is called predictability. Therefore, the parameters for the support and confidence thresholds are -prev and -pred.
Table A-1 lists the set of options for controlling the rule-generation process. A description of each option follows the table. In the following description, %s represents a string-valued parameter, %d an integer-valued parameter, and %f a floating point-valued parameter.
Table A-1. Options for Controlling Rule Generation
Option Format | Default Value | Comments |
|---|---|---|
-tran %s | (stdin) | Data file path |
-prev %f | (1.0) | Support threshold (as a percentage) |
-uniq %d | | Number of items in dataset |
-dir %s | (/usr/tmp) | Directory for temporary files |
-tprefix %s | (A_) | Prefix for temporary files |
-msg %s | (assocgen.msg) | Message file |
Table A-2 lists the set of options for restricting generated rules. Options in this set are used after those listed in Table A-1 and separated on the command line from the former options by -ropts. A description of each option follows the table.
Table A-2. Options for Restricting Generated Rules
Option Format | Default Value | Comments |
|---|---|---|
-pred %f | (50.0) | Minimum confidence (as a percentage) |
-names %s | | Name of file containing item descriptions |
-rout %s | (stdout) | Name of file in which to output rules |
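The ordering rule for the two option groups can be captured in a small helper when assocgen is driven from a script. The following Python sketch assembles an argument list in which the rule generation options precede -ropts and the rule restriction options follow it. The function `build_assocgen_command` is hypothetical, it uses only options listed in Tables A-1 and A-2, and it only builds the command; it does not run it.

```python
import shlex


def build_assocgen_command(tran, names, rout, prev=None, pred=None):
    """Build an assocgen argument list with -ropts between the option groups."""
    generation = []
    if prev is not None:
        generation += ["-prev", str(prev)]   # support threshold, in percent
    generation += ["-tran", tran]            # converted binary data file

    restriction = []
    if pred is not None:
        restriction += ["-pred", str(pred)]  # confidence threshold, in percent
    restriction += ["-names", names, "-rout", rout]

    # -ropts is required here because restriction options are present.
    return ["assocgen", *generation, "-ropts", *restriction]


cmd = build_assocgen_command("sing.bin", "sing.names", "sing.rules", prev=20)
print(shlex.join(cmd))
```

On a system where MineSet is installed, the resulting list could be passed to `subprocess.run`.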
The data listed in Table A-3 is an example of market basket data. This data can be found in the file /usr/lib/MineSet/assoccvt/examples/sing.data.
Transaction ID | Item |
|---|---|
10 | Jam |
10 | Eggs |
10 | Chips |
10 | Bread |
10 | Butter |
10 | Milk |
20 | Soda |
20 | Eggs |
20 | Butter |
20 | Bread |
30 | Soda |
30 | Eggs |
30 | Milk |
30 | Bread |
30 | Butter |
40 | Eggs |
40 | Chips |
40 | Juice |
40 | Bread |
50 | Milk |
50 | Chips |
50 | Bread |
50 | Beer |
60 | Soda |
60 | Juice |
60 | Beer |
70 | Beer |
70 | Chips |
70 | Wine |
80 | Juice |
80 | Cookies |
80 | Chips |
90 | Chips |
90 | Cookies |
90 | Milk |
95 | Bread |
95 | Cookies |
95 | Milk |
You can generate rules from the data in Table A-3 by first running assoccvt (see “Association Rules Generator Command-Line Operation”) and then running the assocgen command:
assocgen -prev 20 -tran sing.bin -ropts -names sing.names -rout sing.rules
The rules file that is output has the following format:
1 1 30.0000 100.00 40.00 Butter Eggs
The fields in each line correspond to:
The number of items on the LHS of the rule (always 1)
The number of items on the RHS of the rule (always 1)
The support
The confidence
The expected confidence
The name (or code) of the item on the LHS
The name (or code) of the item on the RHS
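The field list above can be read as a record layout. The following Python sketch parses one line of a rules file into named fields and derives the lift as confidence divided by expected confidence. The names `Rule` and `parse_rule` are hypothetical, and the sketch assumes whitespace-separated fields and item names without embedded spaces, as in the example line.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    lhs_size: int
    rhs_size: int
    support: float
    confidence: float
    expected_confidence: float
    lhs: str
    rhs: str

    @property
    def lift(self):
        """Ratio of observed confidence to expected confidence."""
        return self.confidence / self.expected_confidence


def parse_rule(line):
    """Parse one rules-file line; assumes item names contain no spaces."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    return Rule(
        int(parts[0]), int(parts[1]),
        float(parts[2]), float(parts[3]), float(parts[4]),
        parts[5], parts[6],
    )


rule = parse_rule("1 1 30.0000 100.00 40.00 Butter Eggs")
print(rule.lhs, "->", rule.rhs, "lift:", rule.lift)
```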
The expected confidence is the frequency of occurrence of the RHS items. The difference between expected confidence and observed confidence is a measure of the increase in predictive power due to the presence of the LHS. Expected confidence gives an indication of what the confidence would be if there were no relationship between the items.
For a further description of the relationships between support, confidence, expected confidence, and lift, see Chapter 2, “Using the Association Rules Tool.”
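As a worked example, the following Python sketch recomputes the measures for the rule “Butter implies Eggs” from counts taken from the data in Table A-3: 10 transactions in total, 3 containing Butter, 4 containing Eggs, and 3 containing both. The variable names are illustrative only.

```python
total = 10    # distinct transaction IDs in Table A-3
n_lhs = 3     # transactions containing Butter (10, 20, 30)
n_rhs = 4     # transactions containing Eggs (10, 20, 30, 40)
n_both = 3    # transactions containing both Butter and Eggs

support = 100.0 * n_both / total             # share of all transactions
confidence = 100.0 * n_both / n_lhs          # share of Butter transactions with Eggs
expected_confidence = 100.0 * n_rhs / total  # share of all transactions with Eggs
lift = confidence / expected_confidence

print(support, confidence, expected_confidence, lift)
```

The first three values correspond to the support, confidence, and expected confidence fields of the example rules line, and the lift of 2.5 indicates that Eggs appear two and a half times as often in transactions containing Butter as in transactions overall.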
If the minimum support threshold is 20% (2 of the 10 transactions in Table A-3), and the minimum confidence threshold is left at its default of 50%, the assocgen program generates the set of rules shown in Table A-4.
Table A-4. Rule Generation Example 1
LHS size | RHS size | Support | Confidence | Expected confidence | LHS | RHS |
|---|---|---|---|---|---|---|
1 | 1 | 30.0000 | 100.00 | 40.00 | Butter | Eggs |
1 | 1 | 40.0000 | 66.67 | 40.00 | Bread | Eggs |
1 | 1 | 20.0000 | 66.67 | 40.00 | Soda | Eggs |
1 | 1 | 20.0000 | 66.67 | 60.00 | Juice | Chips |
1 | 1 | 20.0000 | 66.67 | 60.00 | Beer | Chips |
1 | 1 | 20.0000 | 66.67 | 60.00 | Cookies | Chips |
1 | 1 | 30.0000 | 60.00 | 60.00 | Milk | Chips |
1 | 1 | 40.0000 | 100.00 | 60.00 | Eggs | Bread |
1 | 1 | 30.0000 | 100.00 | 60.00 | Butter | Bread |
1 | 1 | 40.0000 | 80.00 | 60.00 | Milk | Bread |
1 | 1 | 20.0000 | 66.67 | 60.00 | Soda | Bread |
1 | 1 | 30.0000 | 75.00 | 30.00 | Eggs | Butter |
1 | 1 | 20.0000 | 66.67 | 30.00 | Soda | Butter |
1 | 1 | 30.0000 | 50.00 | 30.00 | Bread | Butter |
1 | 1 | 40.0000 | 66.67 | 50.00 | Bread | Milk |
1 | 1 | 20.0000 | 66.67 | 50.00 | Butter | Milk |
1 | 1 | 20.0000 | 66.67 | 50.00 | Cookies | Milk |
1 | 1 | 30.0000 | 50.00 | 50.00 | Chips | Milk |
1 | 1 | 20.0000 | 50.00 | 50.00 | Eggs | Milk |
1 | 1 | 20.0000 | 66.67 | 30.00 | Butter | Soda |
1 | 1 | 20.0000 | 50.00 | 30.00 | Eggs | Soda |
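The contents of Table A-4 can be cross-checked with a brute-force enumeration over all ordered pairs of items. The following Python sketch is not the algorithm used by assocgen, which this appendix does not describe. It simply applies the 20% support threshold, the 50% confidence threshold, and the lift-of-at-least-1 restriction to the data in Table A-3. The function name `single_item_rules` is hypothetical.

```python
from itertools import permutations

TRANSACTIONS = {
    10: {"Jam", "Eggs", "Chips", "Bread", "Butter", "Milk"},
    20: {"Soda", "Eggs", "Butter", "Bread"},
    30: {"Soda", "Eggs", "Milk", "Bread", "Butter"},
    40: {"Eggs", "Chips", "Juice", "Bread"},
    50: {"Milk", "Chips", "Bread", "Beer"},
    60: {"Soda", "Juice", "Beer"},
    70: {"Beer", "Chips", "Wine"},
    80: {"Juice", "Cookies", "Chips"},
    90: {"Chips", "Cookies", "Milk"},
    95: {"Bread", "Cookies", "Milk"},
}


def single_item_rules(transactions, min_support=20.0, min_confidence=50.0):
    """Enumerate one-item-to-one-item rules as (lhs, rhs, supp, conf, expected)."""
    baskets = list(transactions.values())
    total = len(baskets)
    items = sorted(set().union(*baskets))
    count = {item: sum(item in b for b in baskets) for item in items}
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(lhs in b and rhs in b for b in baskets)
        support = 100.0 * both / total
        confidence = 100.0 * both / count[lhs]
        expected = 100.0 * count[rhs] / total
        if support < min_support or confidence < min_confidence:
            continue
        # Lift >= 1, tested with integers to avoid rounding issues.
        if both * total < count[lhs] * count[rhs]:
            continue
        rules.append((lhs, rhs, support, confidence, expected))
    return rules


rules = single_item_rules(TRANSACTIONS)
for lhs, rhs, support, confidence, expected in rules:
    print(f"1 1 {support:.4f} {confidence:.2f} {expected:.2f} {lhs} {rhs}")
print(len(rules), "rules")
```

The enumeration yields the same 21 rules as Table A-4, although not necessarily in the same order.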
The rules visualizer graphically displays, using the Scatter Visualizer, the rules resulting from the Association Rules Generator. The rules visualization requires:
A configuration file, which specifies various display parameters (the .schema file)
A rules file in the internally required format (the .rules file)
The rules generated by assocgen can be loaded into MineSet as a flat file for further analysis and visualization. You can use the Scatter Visualizer, for example, to visualize the rules in the same way that the Tool Manager is used to visualize rules generated from tabular data.
To visualize rules, you need a MineSet .schema file corresponding to the .rules file you just created. An example .schema file can be found in /usr/lib/MineSet/assocgen/examples/rules.schema. Edit this file to specify the name of the file containing the rules. Once you've done this, you can load the .schema file into MineSet as a flat file. The .schema file describes the columns in the rules file:
MineSet 2.6
# Example schema for loading rules into MineSet
input {
    options backslash on;
    file "assoc.rules.data";
    int `lhs size`;
    int `rhs size`;
    float support;
    float confidence;
    float `expected confidence`;
    string LHS;
    string RHS;
}
Change the name of the file in the file clause to your rules file. Once you load the data into MineSet, it may be useful to add a column `lift`, of type double, with the expression `confidence`/`expected confidence`.
The rules file is generated by the Association Rules Generator (see “Association Rules Generator”).