Appendix A. Using the Association Rules Generator With Transaction-Style Data

This appendix replaces Appendix F, “Creating Data and Configuration Files for the Rules Visualizer,” in the MineSet User's Guide. It describes how to use two MineSet commands, assoccvt and assocgen, to find the association rules in a file presented in transaction-style format. To generate association rules from data stored in a table, use the Tool Manager interface instead, as described in Chapter 2, “Using the Association Rules Tool.”



Note: The programs used in this appendix are installed on the MineSet server. Only rules with lift greater than or equal to 1 are produced by the command-line Association Rules Generator.

The examples used in this appendix can be found in the /usr/lib/MineSet/assoccvt/examples/ and /usr/lib/MineSet/assocgen/examples/ directories. Descriptions and instructions for use can be found in the README file in these directories.


Note: Read Chapter 2, “Using the Association Rules Tool,” before using this appendix.


About Transaction-Style Data

MineSet provides tools for analyzing and visualizing data stored in a tabular format, either in a database or in a flat file. The MineSet Tool Manager lets you generate and visualize the association rules in a table. If your data is in transaction format instead, you can use the utilities described in this appendix to find its association rules. Transaction-style data is stored in a flat file in which each transaction is split across several rows. Each row has two columns: one holds a transaction identifier (transaction ID), and the other holds one item in that transaction.

For example, a point-of-sale file representing supermarket transactions might look like this:

10012 wheat bread
10012 beer
10012 tortilla chips
10012 eggs
10013 cereal
10013 chicken
10014 all-purpose flour

Rows are grouped by transaction ID. With this style of data, MineSet finds associations between items present in the same transaction. For example, the data may contain the association “eggs implies beer.”
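The grouping of rows into transactions can be sketched in Python. This is only an illustration and is not part of MineSet: the function name group_transactions is hypothetical, and the sketch assumes that the transaction ID is the first whitespace-delimited token on each row and that the rest of the row is the item.

```python
from itertools import groupby

def group_transactions(rows):
    """Return a list of (transaction_id, [items]) pairs.

    Assumes each row reads "<id> <item>" and that rows are already
    grouped by transaction ID, as transaction-style data requires.
    """
    parsed = (row.split(None, 1) for row in rows if row.strip())
    return [(tid, [item.strip() for _, item in group])
            for tid, group in groupby(parsed, key=lambda pair: pair[0])]

rows = [
    "10012 wheat bread",
    "10012 beer",
    "10012 tortilla chips",
    "10012 eggs",
    "10013 cereal",
    "10013 chicken",
    "10014 all-purpose flour",
]
baskets = group_transactions(rows)
for tid, items in baskets:
    print(tid, items)
```

Because groupby only merges adjacent rows, the sketch relies on the same grouping requirement that the converter places on the raw data file; the rows do not need to be sorted.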

There are three steps to finding the association rules present in transaction-style data:

  • Converting the data to MineSet's format (the assoccvt step)

  • Running the Association Rules Generator (the assocgen step)

  • Loading the results into the MineSet Tool Manager for further manipulation (for example, visualization using the Scatter Visualizer or the Record Viewer)

Association Data Converter Requirements

The association data converter requires:

  • A raw data file (consisting of your own data for running associations)

  • A format file, which describes the raw data file's format

Raw Data File

The raw data file must be in the format shown in “About Transaction-Style Data,” with one transaction ID and one item per row. Neither the transaction ID nor the item needs to be contiguous within the row, however. In addition, all records in the file must be exactly the same length, and they must be grouped by transaction ID (though they need not be sorted).

Format File

The format file specifies the format of the raw data file to the association data converter. This format file must contain the following items in the order shown:

  • The letter “S” to indicate the file type

  • The number of bytes in each row, excluding the end-of-line character. Each row in the data file must have this many bytes.

  • The number of fields that make up the transaction ID

  • The total number of bytes in the transaction ID

  • The offset and number of bytes for each field that makes up the transaction ID

  • The number of fields that make up the item

  • The total number of bytes in the item

  • The offset and length in bytes for each field that makes up the item

  • A flag indicating whether there is a field describing the item. If so, this field is output in the generated rules along with the name. The flag must be either 0 (meaning No) or 1 (meaning Yes).

  • If the description flag is 1, the following are also required:

    • Number of fields that make up the description

    • Total number of bytes in the description

    • Offset and length in bytes for each field that makes up the description


Note: Column offsets are zero-based.

Most data files use only one field each for the item and the transaction identifier. For the example data listed above, assuming that each line is 80 characters wide (plus one for the end-of-line character), the format file would be:

S
80
1 5
0 5
1 74
6 74
0
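The order of the entries in a format file can be made concrete with a small Python reader. This is a sketch under stated assumptions, not the parser that assoccvt uses: the name parse_format is hypothetical, the file is assumed to hold only whitespace-separated tokens with no comments, and the check that field lengths add up to the stated total is an extra sanity test added here.

```python
def parse_format(text):
    """Parse the contents of a format file into a dictionary.

    Assumes only whitespace-separated tokens in the documented order.
    """
    tokens = iter(text.split())

    file_type = next(tokens)
    if file_type != "S":
        raise ValueError("expected file type 'S', got %r" % file_type)
    record_length = int(next(tokens))

    def read_group():
        # A group is "<field count> <total bytes>" followed by one
        # "<offset> <length>" pair per field.
        count, total = int(next(tokens)), int(next(tokens))
        fields = [(int(next(tokens)), int(next(tokens))) for _ in range(count)]
        if sum(length for _, length in fields) != total:
            raise ValueError("field lengths do not add up to %d" % total)
        return fields

    spec = {
        "record_length": record_length,
        "transaction_id": read_group(),
        "item": read_group(),
        "description": [],
    }
    # The description group is present only when the flag is 1.
    if int(next(tokens)) == 1:
        spec["description"] = read_group()
    return spec

FORMAT_TEXT = "S\n80\n1 5\n0 5\n1 74\n6 74\n0\n"
spec = parse_format(FORMAT_TEXT)
print(spec)
```

For the 80-byte example, the transaction ID occupies bytes 0 through 4 and the item occupies bytes 6 through 79, so the item field ends exactly at the end of the record.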

This format file allows for a great deal of flexibility. For instance, there need be no separator between the transaction identifier and the item. The fields of the two may even be interleaved, as in:

bread  10012wheat  25
apple  10012fuji   25
banana 10012bunch  25

In this case, both the transaction ID and the item contain two fields of different lengths. If the total line width is 21 bytes, then the format file would be:

S
21     (21 bytes per line, not including end-of-line character)
2 7    (transaction ID is two fields, 7 bytes total)
7 5    (first transaction ID field is 5 bytes starting at column 7)
19 2   (second transaction ID field is 2 bytes starting at column 19)
2 13   (item is two fields, 13 bytes total)
0 6    (first item field is 6 bytes starting at column 0)
12 7   (second item field is 7 bytes starting at column 12)
0      (no description fields)
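To see what these offsets select, the following Python sketch slices the three sample records by (offset, length) pairs. The helper name extract is hypothetical, and joining the fields in the order listed is an assumption made for illustration; the converter's internal handling is not documented here.

```python
def extract(record, fields):
    """Concatenate the (offset, length) slices of a fixed-width record."""
    return "".join(record[offset:offset + length] for offset, length in fields)

# (offset, length) pairs copied from the format file above.
TRANSACTION_ID_FIELDS = [(7, 5), (19, 2)]
ITEM_FIELDS = [(0, 6), (12, 7)]

records = [
    "bread  10012wheat  25",
    "apple  10012fuji   25",
    "banana 10012bunch  25",
]

for record in records:
    # Every record must be exactly 21 bytes, excluding the end-of-line.
    assert len(record) == 21
    print(repr(extract(record, TRANSACTION_ID_FIELDS)),
          repr(extract(record, ITEM_FIELDS)))
```

All three records yield the same 7-byte transaction ID, 1001225, and a 13-byte item that keeps its padding spaces.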

Association Data Converter Command-Line Operation

To find the association rules in your data, you must first convert the data into an intermediate binary format, which the Association Rules Generator uses to find the rules. This conversion step uses the raw data file and format file, and produces two new files:

  • The output data file, containing the converted data

  • The output names file, containing auxiliary descriptor information used by the Association Rules Generator

The assoccvt program converts the data. Its usage is:

assoccvt [-ifile raw] [-ofile binary] format names 

where raw is the name of the raw data file, binary is the name of the produced binary data file, format is the name of the format file, and names is the name of the output names file. If the -ifile parameter is omitted, standard input is used instead. Similarly, if the -ofile parameter is omitted, the standard output is used instead.

Association Data Converter Examples

The following command illustrates the use of the association data converter on the example file in /usr/lib/MineSet/assoccvt/examples. The file sing.data is an example of data in transaction format and has some simple grocery store transactions. Each line has a transaction number and the name of an item bought in that transaction. The format of this file is described by sing.format.

assoccvt -ifile sing.data -ofile sing.bin sing.format sing.names

To test whether the files for data conversion are correctly installed, run the preceding command from the shell command line. Then, using the UNIX diff command, compare the files created to those with the same name in /usr/lib/MineSet/assoccvt/examples. Compare sing.bin with /usr/lib/MineSet/assoccvt/examples/sing.bin, and compare sing.names with /usr/lib/MineSet/assoccvt/examples/sing.names.

Association Rules Generator

The Association Rules Generator takes items in a set of data and generates association rules from them. The required input files are described in the following subsections. The output of the Association Rules Generator is a specially formatted rules file, which can be loaded into MineSet as a flat file for examination.

Association Rules Generator File Requirements

The Association Rules Generator program, assocgen, requires:

  • A data file in the internally required format

  • A names file in the internally required format

Association Rules Generator Command-Line Operation

Rules are generated by invoking the assocgen command, along with one or more parameters. Options fall into one of the following categories:

  • Rule Generation Options—control the process of rule generation.

  • Rule Restriction Options—place restrictions on the set of generated rules.

The -ropts string separates the two sets of options. This string is required if any options from the second set are used.

An example rule generation command line might be:

assocgen -prev 20 -tran sing.bin -ropts -names sing.names -rout sing.rules

See “Rule Generation Options” and “Rule Restriction Options” for explanations of the parameters.


Note: In the assocgen program, support is called prevalence, and confidence is called predictability. Therefore, the parameters for the support and confidence thresholds are -prev and -pred.


Rule Generation Options

Table A-1 lists the set of options for controlling the rule-generation process. A description of each option follows the table. In these descriptions, %s represents a string-valued parameter, %d an integer-valued parameter, and %f a floating-point-valued parameter.

Table A-1. Options for Controlling Rule Generation

Option Format    Default Value     Comments

-tran %s         (stdin)           Data file path
-prev %f         (1.0)             Support threshold (as a percentage)
-uniq %d                           Number of items in dataset
-dir %s          (/usr/tmp)        Directory for temporary files
-tprefix %s      (A_)              Prefix for temporary files
-msg %s          (assocgen.msg)    Message file

-tran %s  

Specifies the path of the data file. By default, the data is read from stdin.

-prev %f  

Specifies the minimum support threshold as a percentage of the total number of records. The default is 1.0%. If the support threshold results in a minimum count less than 3, an error message is displayed, and no rules are generated.

-uniq %d  

Specifies the number of unique or distinct items across all records (if known). Specifying this (or an upper bound) speeds processing.

-dir %s  

Specifies the directory in which to store temporary files, including the message file (see -msg, below). The default is the current directory.

-tprefix %s  

Specifies the prefix to be used for temporary files, except the message file (see -msg, below). The default prefix is A_.

-msg %s  

Specifies the message file, which contains diagnostic output. The default is assocgen.msg.

Rule Restriction Options

Table A-2 lists the set of options for restricting generated rules. On the command line, options in this set must follow those listed in Table A-1 and be separated from them by -ropts. A description of each option follows the table.

Table A-2. Options for Restricting Generated Rules

Option Format    Default Value     Comments

-pred %f         (50.0)            Minimum confidence (as a percentage)
-names %s                          Name of file containing item descriptions
-rout %s         (stdout)          Name of file in which to output rules


-pred %f  

Specifies the minimum confidence threshold for rules. Rules with a confidence below this value are not generated. The default is 50%.

-names %s  

Specifies the name of the file that contains the descriptions of the items. This is typically the names file created during the assoccvt step.

-rout %s  

Specifies the name of the file to which rules are to be written. If this is not specified, rules are written to stdout.
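The command-line layout (generation options first, then -ropts, then restriction options) can be sketched as a small Python helper that assembles an argument list. The helper name build_assocgen_argv is hypothetical; it uses only the flags listed in Table A-1 and Table A-2 and does not run assocgen itself.

```python
import shlex

# Flags taken from Table A-1 and Table A-2 of this appendix.
GENERATION_FLAGS = {"-tran", "-prev", "-uniq", "-dir", "-tprefix", "-msg"}
RESTRICTION_FLAGS = {"-pred", "-names", "-rout"}

def build_assocgen_argv(generation=None, restriction=None):
    """Return an argument list for assocgen (hypothetical helper)."""
    argv = ["assocgen"]
    for flag, value in (generation or {}).items():
        if flag not in GENERATION_FLAGS:
            raise ValueError("not a rule generation option: " + flag)
        argv += [flag, str(value)]
    if restriction:
        # -ropts is needed only when restriction options are present.
        argv.append("-ropts")
        for flag, value in restriction.items():
            if flag not in RESTRICTION_FLAGS:
                raise ValueError("not a rule restriction option: " + flag)
            argv += [flag, str(value)]
    return argv

argv = build_assocgen_argv(
    {"-prev": 20, "-tran": "sing.bin"},
    {"-names": "sing.names", "-rout": "sing.rules"},
)
print(shlex.join(argv))
```

The resulting list could be passed to subprocess.run on a machine where assocgen is installed; the validation step simply catches an option placed on the wrong side of -ropts.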

Association Rule Example

The data listed in Table A-3 is an example of market basket data. This data can be found in the file /usr/lib/MineSet/assoccvt/examples/sing.data.

Table A-3. Data Example

Transaction ID    Item

10                Jam
10                Eggs
10                Chips
10                Bread
10                Butter
10                Milk
20                Soda
20                Eggs
20                Butter
20                Bread
30                Soda
30                Eggs
30                Milk
30                Bread
30                Butter
40                Eggs
40                Chips
40                Juice
40                Bread
50                Milk
50                Chips
50                Bread
50                Beer
60                Soda
60                Juice
60                Beer
70                Beer
70                Chips
70                Wine
80                Juice
80                Cookies
80                Chips
90                Chips
90                Cookies
90                Milk
95                Bread
95                Cookies
95                Milk

You can generate rules from the data in Table A-3 by first running assoccvt (see “Association Data Converter Command-Line Operation”), and then running the assocgen command:

assocgen -prev 20 -tran sing.bin -ropts -names sing.names -rout sing.rules

The rules file that is output has the following format:

1    1    30.0000 100.00 40.00 Butter Eggs

The fields in each line correspond to:

  • The number of items on the LHS of the rule (always 1)

  • The number of items on the RHS of the rule (always 1)

  • The support

  • The confidence

  • The expected confidence

  • The name (or code) of the item on the LHS

  • The name (or code) of the item on the RHS

The expected confidence is the frequency of occurrence of the RHS items. The difference between expected confidence and observed confidence is a measure of the increase in predictive power due to the presence of the LHS. Expected confidence gives an indication of what the confidence would be if there were no relationship between the items.

For a further description of the relationships between support, confidence, expected confidence, and lift, see Chapter 2, “Using the Association Rules Tool.”
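As an illustration of the field layout, the following Python sketch parses one line of a rules file and derives lift as confidence divided by expected confidence. The names Rule, parse_rule, and lift are hypothetical, and the sketch assumes single-item rules whose item names contain no spaces, so that splitting on whitespace yields exactly seven fields.

```python
from collections import namedtuple

Rule = namedtuple(
    "Rule",
    "lhs_size rhs_size support confidence expected_confidence lhs rhs",
)

def parse_rule(line):
    """Parse one rules-file line (assumes item names without spaces)."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError("expected 7 fields, got %d" % len(parts))
    return Rule(int(parts[0]), int(parts[1]),
                float(parts[2]), float(parts[3]), float(parts[4]),
                parts[5], parts[6])

def lift(rule):
    """Lift is the ratio of observed to expected confidence."""
    return rule.confidence / rule.expected_confidence

rule = parse_rule("1    1    30.0000 100.00 40.00 Butter Eggs")
print(rule.lhs, "->", rule.rhs, "lift =", lift(rule))
```

For the sample line, the confidence of 100.00 against an expected confidence of 40.00 gives a lift of 2.5, meaning Eggs appears two and a half times as often in transactions containing Butter as in the data overall.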

If the minimum support threshold is 20% (8 records out of the 38 in Table A-3), and the minimum confidence threshold is left at its default of 50%, the assocgen program generates the set of rules shown in Table A-4.

Table A-4. Rule Generation Example 1

1    1    30.0000    100.00    40.00    Butter     Eggs
1    1    40.0000     66.67    40.00    Bread      Eggs
1    1    20.0000     66.67    40.00    Soda       Eggs
1    1    20.0000     66.67    60.00    Juice      Chips
1    1    20.0000     66.67    60.00    Beer       Chips
1    1    20.0000     66.67    60.00    Cookies    Chips
1    1    30.0000     60.00    60.00    Milk       Chips
1    1    40.0000    100.00    60.00    Eggs       Bread
1    1    30.0000    100.00    60.00    Butter     Bread
1    1    40.0000     80.00    60.00    Milk       Bread
1    1    20.0000     66.67    60.00    Soda       Bread
1    1    30.0000     75.00    30.00    Eggs       Butter
1    1    20.0000     66.67    30.00    Soda       Butter
1    1    30.0000     50.00    30.00    Bread      Butter
1    1    40.0000     66.67    50.00    Bread      Milk
1    1    20.0000     66.67    50.00    Butter     Milk
1    1    20.0000     66.67    50.00    Cookies    Milk
1    1    30.0000     50.00    50.00    Chips      Milk
1    1    20.0000     50.00    50.00    Eggs       Milk
1    1    20.0000     66.67    30.00    Butter     Soda
1    1    20.0000     50.00    30.00    Eggs       Soda
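The figures in Table A-4 can be checked by hand with a short Python sketch that counts single-item rules directly from the baskets in Table A-3. This is not the algorithm assocgen uses. It is a brute-force illustration in which support is measured as a fraction of the 10 transactions (which is what the values in Table A-4 reflect), the thresholds are applied to those fractions (an assumption), and rules with lift below 1 are dropped. The function name single_item_rules is hypothetical.

```python
from itertools import permutations

# Baskets transcribed from Table A-3, keyed by transaction ID.
BASKETS = {
    10: {"Jam", "Eggs", "Chips", "Bread", "Butter", "Milk"},
    20: {"Soda", "Eggs", "Butter", "Bread"},
    30: {"Soda", "Eggs", "Milk", "Bread", "Butter"},
    40: {"Eggs", "Chips", "Juice", "Bread"},
    50: {"Milk", "Chips", "Bread", "Beer"},
    60: {"Soda", "Juice", "Beer"},
    70: {"Beer", "Chips", "Wine"},
    80: {"Juice", "Cookies", "Chips"},
    90: {"Chips", "Cookies", "Milk"},
    95: {"Bread", "Cookies", "Milk"},
}

def single_item_rules(baskets, min_support_pct=20, min_confidence_pct=50):
    """Return {(lhs, rhs): (support, confidence, expected_confidence)}."""
    n = len(baskets)
    items = sorted(set().union(*baskets.values()))
    count = {i: sum(i in b for b in baskets.values()) for i in items}
    rules = {}
    for lhs, rhs in permutations(items, 2):
        both = sum(lhs in b and rhs in b for b in baskets.values())
        # Integer comparisons avoid rounding trouble at the thresholds.
        if both * 100 < min_support_pct * n:
            continue                      # support too low
        if both * 100 < min_confidence_pct * count[lhs]:
            continue                      # confidence too low
        if both * n < count[lhs] * count[rhs]:
            continue                      # lift below 1
        rules[(lhs, rhs)] = (100.0 * both / n,
                             100.0 * both / count[lhs],
                             100.0 * count[rhs] / n)
    return rules

rules = single_item_rules(BASKETS)
for (lhs, rhs), (sup, conf, exp) in rules.items():
    print("%-8s %-8s %7.4f %6.2f %5.2f" % (lhs, rhs, sup, conf, exp))
```

Under these assumptions the sketch yields 21 rules with the same support, confidence, and expected confidence values as Table A-4, though not in the same order. Pairs such as Chips and Bread are absent because their confidence (50%) falls below the expected confidence (60%), giving a lift below 1.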


Rules Visualization

Rules visualization uses the Scatter Visualizer to display graphically the rules produced by the Association Rules Generator. Rules visualization requires:

  • A configuration file, which specifies various display parameters (the .schema file)

  • A rules file in the internally required format (the .rules file)

Rules Visualization File Requirements

The rules generated by assocgen can be loaded into MineSet as a flat file for further analysis and visualization. You can use the Scatter Visualizer, for example, to visualize the rules in the same way that the Tool Manager is used to visualize rules generated from tabular data.

The Schema File

To visualize rules, you need a MineSet .schema file corresponding to the .rules file you just created. An example .schema file can be found in /usr/lib/MineSet/assocgen/examples/rules.schema. Edit this file to specify the name of the file containing the rules. Once you've done this, you can load the .schema file into MineSet as a flat file. The .schema file describes the columns in the rules file:


MineSet 2.6
# Example schema for loading rules into MineSet

input {
        options backslash on;
        file "assoc.rules.data";
        int `lhs size`;
        int `rhs size`;
        float support;
        float confidence;
        float `expected confidence`;
        string LHS;
        string RHS;
}

Change the name of the file in the file clause to your rules file. Once you load the data into MineSet, it may be useful to add a column `lift`, of type double, with the expression `confidence`/`expected confidence`.

The Rules File

The rules file is generated by the Association Rules Generator (see “Association Rules Generator”).