This chapter describes the new features in MineSet 2.6. Table 1-1 provides an overview of these features, with references to where the subjects are described in greater detail, if applicable.
Table 1-1. New Feature Overview
New Feature | Short Description |
|---|---|
New Licensing Plans | MineSet has several new licensing plans that are easily tailored to your needs. See “New Licensing Plans”. |
Internationalization | MineSet 2.6 provides support for international datasets. Text labels in the graphical interface still appear in English, but you may now view multibyte column names and data values in the language corresponding to the data encoding. See “Internationalization”. |
64-Bit Support | Large memory (64-bit) is supported on IRIX 6.4 and later releases. See “64-Bit Support”. |
Year 2000 Compliance | MineSet now supports Y2K-compliant dates. See “Year 2000 Compliance”. |
New Mining Tool Plugin API | MineSet 2.6 provides an API that third-party vendors can use to extend the functionality of MineSet through the use of plugins. See “New Mining Tool Plugin API”. |
New Data-Importing Utility | The new MineSet data-importing utility, dataschema, automatically creates MineSet data and schema files from flat file formats. See “New Data-importing Utility (dataschema)”. |
New Bin Names | Bin names have a new format that better defines the range of values within each bin. See “New Bin Names”. |
New Histogram Visualizer | The new Histogram Visualizer automatically bins all of the continuous-type columns in the data and sends the result to the Statistics Visualizer for display. See “New Histogram Visualizer”. |
New Record Viewer | The new Java-based Record Viewer provides extra functionality such as record numbering, sorting by various criteria, filtering, searching, and writing to HTML tables. The functionality is described in Chapter 3, “Using the Record Viewer.” |
| The Tool Manager has two new features: • The new Column Sort operation sorts the list of columns by name. • The Add Column and Filter panels now support the if (A) then (B) else (C) expression. This expression means that if A is true, the value of B is used; otherwise, the value of C is used. |
New Format for Rules Visualizer | Previous versions of MineSet used a separate tool for association rules visualization. Now, association rules are shown using the Scatter Visualizer, and there are some changes in the generation of the rules. A full description of the new format is covered in Chapter 2, “Using the Association Rules Tool.” Configuration file information for the new format is found in Appendix A, “Using the Association Rules Generator With Transaction-Style Data.” |
| The Evidence Visualizer now allows filtering on the set of displayed attributes. |
Scatter Visualizer Enhancements | The Scatter Visualizer now allows you to show a trail of motion to demonstrate the changing animation path of an entity. See “Scatter Visualizer Enhancements”. |
Scatter Visualizer Configuration File Enhancements | Several statements have been added to the Scatter Visualizer configuration file to support the new association rules visualization capability of the Scatter Visualizer. See “Scatter Visualizer Configuration File Enhancements”. |
Decision Table Visualization Enhancements | The Decision Table Visualizer has two new features. It now allows filtering on attribute values in the same way that the Scatter Visualizer does, and it has an Evidence mode that is the same as the Evidence mode of the Evidence Visualizer. See “Decision Table Visualization Enhancements”. |
New Decision Tree Inducer Options | The Decision Tree Inducer now has an extended set of splitting criteria and pruning methods. See “New Decision Tree Inducer Options”. |
New Web Publishing Option | All visualization tools now provide the option of creating a file based on a visualization that may be published on the Web. See “New Web Publishing Option”. |
Changes in File Exchange Procedures Between MineSet and SAS | There have been a few changes to the file exchange procedures between MineSet and SAS. See “Changes in File Exchange Procedures Between MineSet and SAS”. |
MineSet implements a client-server architecture. The client and the server need separate licenses. Typically, a client is used by a single user on a desktop system, while the server runs on a larger system that may be shared by multiple clients simultaneously. The client side runs the Tool Manager and the visualization tools, and the server side runs the DataMover and analytics. You may run the client and the server on the same system, but you need both a client license and a server license.
Client licenses are simple. There is only one type of client license, and it is tied to a system ID. Once you have a MineSet client license on your system, you may run an unlimited number of visualizers and tool-manager processes on that system as long as they read local files and do not connect to a server.
Server licenses are more complex. In essence, each server license allows one simultaneous connection to the MineSet server. For example, if you have five users working simultaneously, each using one client, you need five server licenses on the server system. Unlike a client license, which allows unlimited use on one system, a server license supports only one active session.
There are two types of MineSet server licenses:
Normal server licenses are priced in such a way that they become cheaper the more licenses you buy. As of MineSet 2.6, there are three tiers of Normal server licenses:
Basic, which provides the first session on the server.
2 to 4, which provides a less expensive license for each additional session up to four sessions. You must have at least one Basic license in order to obtain a “2 to 4” license.
5 and up, which provides an even less expensive license for each session above four. You must have at least one Basic license and three “2 to 4” licenses to obtain a “5 and up” license.
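The tier rules above can be sketched as a small calculation. This is only an illustration of the arithmetic, assuming one license per simultaneous session; the function name `server_license_mix` is hypothetical and is not part of MineSet.

```python
def server_license_mix(sessions):
    """Split simultaneous sessions across the three Normal license tiers.

    Hypothetical helper: one Basic license covers the first session,
    "2 to 4" licenses cover sessions two through four, and "5 and up"
    licenses cover every session above four.
    """
    if sessions < 0:
        raise ValueError("sessions must be non-negative")
    basic = min(sessions, 1)                    # first session
    two_to_four = min(max(sessions - 1, 0), 3)  # sessions 2, 3, and 4
    five_and_up = max(sessions - 4, 0)          # every session above four
    return {"Basic": basic, "2 to 4": two_to_four, "5 and up": five_and_up}
```

For five simultaneous users, the sketch returns one Basic, three “2 to 4”, and one “5 and up” license.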
Lastly, there is a special kind of inexpensive server evaluation license called “MineSet Light.” If your dataset has fewer than 5,000 records, MineSet uses all of them. If your dataset has more than 5,000 records, MineSet auto-samples to 5,000 records and uses only those.
Mixing Varsity with any other license type is not allowed. Mixing Normal with Light licenses is not recommended. In MineSet 2.6, exceeding the number of licenses on the server generates a warning. The warning alerts you to the fact that you are exceeding your server licensing terms. Stricter control of license usage may be introduced in the future.
Beginning with version 2.6, MineSet supports international datasets. Text labels in the graphical interface still appear in English, but you can now view multibyte column names and data values in the language corresponding to the data encoding. MineSet automatically supports EUC encoding for Japanese, Chinese, and Korean, provided the corresponding WorldView product is installed. For other languages and encodings, see “Extending to Other Languages and Encodings”.
The locale and fonts for the language you are using must be present on both the client and the server system, as well as any system used for remote display. To see a list of locales installed on your system, enter the following command at a UNIX shell prompt:
locale -a
To set the locale, set the environment variable LANG to the appropriate locale from the list generated by the command above. For example, to set the locale to Japanese, EUC encoding, using csh, enter the following command:
setenv LANG ja_JP.EUC
Then simply invoke MineSet from the same shell. To permanently set the locale for all applications, consult your IRIX documentation.
For MineSet to run in a locale other than those included in the installation, copy the resource files to the appropriate directory and modify them. MineSet visualization tools use Open Inventor with both 2D and 3D fonts. For text to appear properly, you must have Type III (often called CID outline) fonts installed.
Resource files are included in the installation for the following locales:
ja_JP.EUC
ko_KR.euc
zh_CN.ugb
zh_TW.ucns
To run MineSet in locale locale_name (see “Setting the Locale” for how to list your installed locales):
Install MineSet as usual.
Log in as root.
Copy the following resource files from /usr/lib/X11/app-defaults to /usr/lib/X11/locale_name/app-defaults:
Clusterviz
Dtableviz
Eviviz
Mapviz
Mineset
Scatterviz
Splatviz
Statviz
Treeviz
Edit the resource files in /usr/lib/X11/locale_name/app-defaults. You will need to know the resource names and the specifications for the fonts you want to use (see Table 1-2 for an example).
Set the locale to locale_name and invoke MineSet.
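The copy step in the procedure above could be scripted. The following is a minimal sketch, not a MineSet utility: the function name `copy_resource_files` and the `x11_root` parameter are assumptions introduced for illustration, and the script must run with sufficient privileges to write under /usr/lib/X11.

```python
import shutil
from pathlib import Path

# Resource files named in the procedure above.
RESOURCE_FILES = ["Clusterviz", "Dtableviz", "Eviviz", "Mapviz", "Mineset",
                  "Scatterviz", "Splatviz", "Statviz", "Treeviz"]


def copy_resource_files(locale_name, x11_root="/usr/lib/X11"):
    """Copy the MineSet resource files into the locale's app-defaults directory.

    x11_root is a hypothetical parameter so the sketch can be tried on a
    scratch directory before touching the real one.
    """
    src = Path(x11_root) / "app-defaults"
    dst = Path(x11_root) / locale_name / "app-defaults"
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in RESOURCE_FILES:
        shutil.copy2(src / name, dst / name)  # preserves timestamps and modes
        copied.append(dst / name)
    return copied
```

The copied files still need to be edited by hand, as described in the next step of the procedure.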
The changes needed for Korean are given in Table 1-2. The fonts listed came from the lists in the following files:
/usr/lib/X11/fonts/ps2xlfd_map.korean
/usr/lib/X11/fonts/ps2xlfd_map.korean.outline
Table 1-2. Resource File Changes for Korean (ko_KR.euc)
Files | English Resources (some lines are wrapped) | Korean Resources (some lines are wrapped) |
|---|---|---|
Clusterviz, | titleFont: screen12 | titleFont: screen12,-ksg-mj-medium-r-normal--14-130-75-75-c-140-ksc5601.1987-0 |
Clusterviz, | gradationsFont: screen11 | gradationsFont: screen11,-ksg-mj-medium-r-normal--12-110-75-75-c-120-ksc5601.1987-0 |
Clusterviz, | balloonFont: screen11 | balloonFont: screen11,-ksg-mj-medium-r-normal--12-110-75-75-c-120-ksc5601.1987-0 |
Clusterviz, | xFontEncoding: ISO8859-1 | xFontEncoding: ksc5601.1987-0 |
Dtableviz, | myDefaultFont: Helvetica-Narrow | myDefaultFont: Helvetica-Narrow;Gungso-Regular--KSC-H |
Mineset | zoom2*fontList: -*-*-medium-r-*-*-6-*-*-*-*-*-*-* | zoom2*fontList: -*-*-medium-r-*-*-6-*-*-*-*-*-*-*;-ksg-*-medium-*--12-*: |
| zoom3*fontList: -*-*-medium-r-*-*-8-*-*-*-*-*-*-* | zoom3*fontList: -*-*-medium-r-*-*-8-*-*-*-*-*-*-*;-ksg-*-medium-*--12-*: |
| zoom4*fontList: -*-*-medium-r-*-*-10-*-*-*-*-*-*-* | zoom4*fontList: -*-*-medium-r-*-*-10-*-*-*-*-*-*-*;-ksg-*-medium-*--14-*: |
| zoom5*fontList: -*-*-medium-r-*-*-12-*-*-*-*-*-*-* | zoom5*fontList: -*-*-medium-r-*-*-12-*-*-*-*-*-*-*;-ksg-*-medium-*--14-*: |
| zoom6*fontList: -*-*-medium-r-*-*-14-*-*-*-*-*-*-* | zoom6*fontList: -*-*-medium-r-*-*-14-*-*-*-*-*-*-*;-ksg-*-medium-*--18-*: |
| zoom7*fontList: -*-*-medium-r-*-*-16-*-*-*-*-*-*-* | zoom7*fontList: -*-*-medium-r-*-*-16-*-*-*-*-*-*-*;-ksg-*-medium-*--24-*: |
| zoom8*fontList: -*-*-medium-r-*-*-24-*-*-*-*-*-*-* | zoom8*fontList: -*-*-medium-r-*-*-24-*-*-*-*-*-*-*;-ksg-*-medium-*--24-*: |
Large memory (64-bit) is supported on IRIX 6.4 and later releases. If you have IRIX 6.2, you can still use the 32-bit data mining utility, but you must upgrade to IRIX 6.5 in order to obtain 64-bit support and pthreads. To get the full advantage of 64-bit addressing you may also need to change the systune resource parameters, depending on your system configuration.
The systune parameters determine the default limits on the available system resources. Table 1-3 lists the systune parameter values that Silicon Graphics recommends (for more details see the systune(1M) man page):
Parameter | Definition | Recommended Value |
|---|---|---|
| Current limit on the number of threads | 1024 |
| Current limit on memory usage | The amount of physical memory on your machine |
| Current limit on virtual memory usage | The size of the logical swap space on your machine or about twice the physical memory |
| Current limit on the number of open files | 1024 or the limit on the number of threads |
Note: You must reboot your machine after installing the new parameters.
MineSet now supports Y2K-compliant dates. In the U.S. locale, dates may be entered in the form MM/DD/YY or MM/DD/YYYY. MineSet follows the X/Open standard for two-digit years: numbers greater than 68 are assumed to be the years 1969 to 1999, and numbers less than or equal to 68 are assumed to be the years 2000 to 2068.
In European locales, dates may be entered in the form DD/MM/YY or DD/MM/YYYY, with the same handling of two-digit years as above.
In either locale, if you enter a two-digit year, it is automatically expanded to a four-digit year in the display.
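The two-digit year rule can be sketched as follows. This is an illustration of the convention described above, not MineSet's own parser; the names `expand_two_digit_year` and `parse_date` are hypothetical.

```python
from datetime import date


def expand_two_digit_year(yy):
    """Apply the X/Open rule: 69-99 map to 1969-1999, 00-68 map to 2000-2068."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 1900 + yy if yy > 68 else 2000 + yy


def parse_date(text, european=False):
    """Parse MM/DD/YY[YY] (U.S.) or DD/MM/YY[YY] (European) into a date."""
    first, second, year_text = text.split("/")
    if european:
        day, month = int(first), int(second)
    else:
        month, day = int(first), int(second)
    year = int(year_text)
    if len(year_text) <= 2:
        year = expand_two_digit_year(year)  # expand to four digits
    return date(year, month, day)
```

For example, `parse_date("12/31/68")` yields a date in 2068, while `parse_date("31/12/69", european=True)` yields a date in 1969.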
Now, with the use of an API, third-party vendors can extend the functionality of MineSet 2.6.
The MineSet Mining Tool Plugin API provides a means for third parties to plug in a GUI to the Mining Panel in the Tool Manager, save options to a .mineset file, and send these options to the DataMover. The DataMover then runs the third-party mining tool program and manages its output of models and model visualization files. The model visualization files are then sent back to the Tool Manager, which runs the requested visualization. Models created by third-party plugin mining tools can later be applied to a dataset using the Apply Model transformation in the Tool Manager.
At startup, MineSet looks for third-party dynamic shared object (DSO) libraries in /usr/lib/mineset/plugins.
In MineSet 2.6, a clustering algorithm named AutoClassPro (ACPro) from Ultimode Systems is available as an add-on. If ACPro is installed, documentation for it can be found in /usr/acpro/doc.
dataschema is a MineSet data-importing utility that automatically creates MineSet data and schema files from flat file formats.
It handles arbitrary text (flat) input files as long as:
There is only one record per line
Fields within each record are separated by some special character (which can be any character)
It imposes no limits on input data sizes such as the number of columns or rows.
It automatically identifies the field separator character.
It automatically identifies column types.
It supports column names on the first line of input (if they are given).
It supports either UNIX-style (LF) or DOS-style (CR/LF) line endings.
It supports leading and trailing space stripping from constant-length fields.
It supports MineSet dates and missing (null) values.
dataschema requires either perl4 or perl5 to run.
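This chapter does not describe how dataschema identifies the field separator. The sketch below shows one plausible heuristic, purely as an assumption and not as dataschema's actual algorithm: pick the non-alphanumeric character that occurs the same nonzero number of times on every line. The function name `guess_separator` is hypothetical.

```python
def guess_separator(lines):
    """Guess a field separator from sample lines (one record per line).

    Assumed heuristic: a separator appears the same number of times on
    every line. Among such characters, the most frequent one wins.
    Returns None if no character qualifies.
    """
    # Strip UNIX (LF) and DOS (CR/LF) line endings and skip blank lines.
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if not rows:
        return None
    best = None
    for ch in sorted(set(rows[0])):
        if ch.isalnum():
            continue
        counts = {row.count(ch) for row in rows}
        if len(counts) == 1:          # same count on every line
            n = counts.pop()
            if n > 0 and (best is None or n > best[1]):
                best = (ch, n)
    return best[0] if best else None
```

A heuristic like this can be fooled by punctuation inside data values, which is one reason to review the generated schema file before loading it.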
You can call dataschema from any command shell with the input files you want it to process. For example, if you type:
dataschema /tmp/mydata.csv
dataschema reads /tmp/mydata.csv, analyzes it, and creates two output files in the current directory, mydata.schema and mydata.data, that can be read by MineSet. Editing mydata.schema for further customization is encouraged (especially for changing column names). If your input contains column names on the first line (separated by the same separator as the actual data fields), these column names are used in the schema file.
dataschema is flexible and supports several options. Invoking dataschema without any arguments prints out a usage message including all the supported options.
For more information on dataschema, visit the following Web page:
In MineSet 2.6, bin names have a new format. MineSet 2.5 used the convention - 10, 10-20, 20-30, 30+ for bin names, which led to some confusion. The - 10 was often thought to be negative ten, and it was not clear which bin contained the boundary points. MineSet 2.6 uses a modified interval notation for bin names:
(lower-bound ... upper-bound]
The “(” indicates that the lower bound is not included in the range. The “]” indicates that the upper bound is included in the range. For example, (10.5 ... 12.6] indicates the range of values over 10.5 up to and including 12.6, or more formally, { X : 10.5 < X <= 12.6 }.
If the lower bound is omitted, the range includes all values up to and including the upper bound. For example, (... 10.5] indicates the range of values less than or equal to 10.5, or more formally, { X : X <= 10.5 }.
If the upper bound is omitted, the range includes all values greater than the lower bound. For example, (12.6 ...] indicates the range of values greater than 12.6, or more formally, { X : X > 12.6 }.
The example of the MineSet 2.5 bin names - 10, 10-20, 20-30, 30+ can be expressed in MineSet 2.6 as (... 10], (10 ... 20], (20 ... 30], (30 ...]. Other examples make the naming scheme clear:
(... -1] | Values under and including -1 |
(-1 ... 10] | Values over -1 up to and including 10 |
(10 ... 20] | Values over 10 up to and including 20 |
(20 ...] | Values over 20 |
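The naming scheme can be reproduced with a few lines of code. This sketch only illustrates the notation described above; the function names `bin_names` and `bin_for` are hypothetical and are not MineSet functions.

```python
import bisect


def bin_names(thresholds):
    """Build MineSet 2.6 style bin names from a list of bin thresholds."""
    t = sorted(thresholds)
    if not t:
        raise ValueError("at least one threshold is required")
    fmt = lambda v: format(v, "g")           # 10 -> "10", 10.5 -> "10.5"
    names = [f"(... {fmt(t[0])}]"]           # open-ended lowest bin
    names += [f"({fmt(lo)} ... {fmt(hi)}]" for lo, hi in zip(t, t[1:])]
    names.append(f"({fmt(t[-1])} ...]")      # open-ended highest bin
    return names


def bin_for(value, thresholds):
    """Return the name of the bin that contains value."""
    t = sorted(thresholds)
    # bisect_left places a value equal to a threshold in the lower bin,
    # matching the "upper bound is included" rule.
    return bin_names(t)[bisect.bisect_left(t, value)]
```

With thresholds 10, 20, and 30, the value 10 falls in (... 10], and any value just above 10 falls in (10 ... 20].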
The Histogram Visualizer automatically bins all of the continuous-type columns in the data and sends the result to the Statistics Visualizer. Figure 1-1 shows the following Histogram Visualizer options:
You can pick the number of bins or allow MineSet to do it for you.
You can set the trimming factor. The trimming factor indicates the fraction of extreme values to be excluded from the value range prior to generating bins. The default trimming fraction is 0.05. This excludes the 5% of the instances with the most extreme values (2.5% with the lowest values in the range and 2.5% with the highest values in the range). Trimming tends to reduce the influence of outliers on the generation of thresholds.
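The effect of the trimming factor can be sketched as follows. The trimming step follows the description above; the equal-width thresholds are an assumption made only to complete the example, since this chapter does not state how MineSet places bin boundaries. Both function names are hypothetical.

```python
def trimmed_range(values, trim=0.05):
    """Return (low, high) after dropping the most extreme instances.

    Half of the trimmed fraction is removed from each end, so the default
    of 0.05 drops 2.5% of the lowest and 2.5% of the highest values.
    """
    if not 0 <= trim < 1:
        raise ValueError("trim must be in [0, 1)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    k = int(len(ordered) * trim / 2)        # instances dropped per tail
    kept = ordered[k:len(ordered) - k]
    return kept[0], kept[-1]


def equal_width_thresholds(low, high, bins):
    """Assumed binning rule: split [low, high] into equal-width bins."""
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]
```

With 40 ordinary values and one extreme outlier, the trimmed range ignores the outlier, so the bins cover the bulk of the data instead of being stretched toward a single point.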
The Scatter Visualizer now allows you to show a trail of motion to demonstrate the changing animation path of an entity. When you create an animation, the trail shows behind each selected entity, in the form you have selected. The motion option menu, located at the bottom right of the ScatterViz control panel, allows you to select from:
No trails—the default
Line trails—a thin colored line
Fade-out trails—a similar colored line, most opaque at its most recent position
Tube trails—trails in 3D tubular form, showing changes of size as the entity moves through the animation path. Too many tube trails may slow animation noticeably.
All trails are color-coded according to the originating entity. If an entity changes from red to blue as the summary slider changes from one position to another, the corresponding trail will also be shown changing color gradually between the two positions. Trails are made between points whose unmapped attributes stay the same over the course of the path.
Aggregated data grouped by a small number of columns tends to be an excellent candidate for the display of motion trails. Initially, motion trails are displayed for all points in the scatterplot affected by the path. Selecting any entity by clicking on it with the mouse causes only the selected point to display a trail. This can be used to reduce visual clutter. Entities with null positions appear as breaks in the trails.
Figure 1-2 shows an example of the Scatter Visualizer with tube motion trails.
To support the new association rules visualization capability of the Scatter Visualizer, the following statements have been added to the Scatter Visualizer configuration file. Refer first to Appendix D, “Creating Data and Configuration Files for the Scatter Visualizer,” in the MineSet User's Guide to understand the basic format of the Scatter Visualizer's configuration file.
The optional disk height statement describes how a field is to be mapped to a disk height. The available clauses are the same as for the size statement. This statement must be present for disks to appear. The syntax of the disk height statement is:
disk height clause1, clause2,...
For a full description of the size statement, refer to the “View Section” in the MineSet User's Guide.
The optional disk color statement describes how a field is to be mapped to the disk color. The available clauses are the same as for the color statement. If a disk height statement exists, but no disk color statement, the disks are the same colors as the entities. The syntax of disk color is:
disk color clause1, clause2,...
For a full description of the color statement, refer to the “View Section” in the MineSet User's Guide.
The drillthrough statement specifies a string-valued attribute that provides the filter expression used when drilling through on selected entities. The syntax has the form:
drillthrough var
This mapping option is useful when the dataset loaded into the Scatter Visualizer does not match the original dataset from which the data was derived. This most commonly occurs if some intermediate mining algorithm transformed the original dataset into a new dataset with different columns. In MineSet, this happens when the Association Rules Generator produces rules and outputs them as ScatterViz configuration and data files to be visualized. If a drill through column is specified using this statement, then the Scatter Visualizer bypasses normal drill through based on column preferences, and uses this column to build the filtering expression by “anding” together the expressions in the drill through column corresponding to the entities that were selected.
The orderby clause of the axis statement allows you to specify that the labels along an axis are alphabetically ordered (for string values mapped to an axis). The only orderby option available is “alpha,” for alphabetical ordering. The statement:
axis LHS, orderby alpha;
forces string values to appear alphabetically on the LHS axis. If no orderby clause is present, string values are ordered by the attribute mapped to color.
For a full description of the axis statement, refer to “The View Section” in the MineSet User's Guide.
The Decision Table Visualizer has two new features: filtering and Evidence mode.
The Decision Table Visualizer now allows filtering on attribute values in the same way that the Scatter Visualizer does. See the “View Menu” section in the MineSet User's Guide for a discussion of the use of filtering.
The Decision Table Visualizer now has an Evidence mode that is the same as the Evidence mode of the Evidence Visualizer. Go to the View menu of the Decision Table, and choose Evidence Mode to activate it.
The distribution in each cake now shows conditional probabilities, rather than distributions based on the record weights, which are shown initially. This is useful if one of the classes is small.
In Figure 1-3, the label is “race” and the distribution of the data is based on weight. Because most records in this dataset are labeled race=white, it is difficult to discern what values give evidence for other races. Switching to Evidence mode (Figure 1-4) makes it clear which regions give evidence for races less prevalent in the data. See “Selecting Items in the Main Window,” in the MineSet User's Guide for the technical description of how evidence is computed.
Normalized conditional probabilities (evidence) are shown at each cake when using the Evidence mode. From the visualizer window, pull down the View menu and choose Show as Evidence.
In Figure 1-4, if the racial breakdown at a cake matches the prior probability (shown in the Label Probability window at the right of the view area), then the slices are of equal size. Bigger or smaller slices indicate correspondingly more or less evidence for a given race. If the slices for a cake are all of equal size, then the racial breakdown for that combination of values is the same as the prior distribution (the distribution shown originally in the pie in the right window).
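The idea of normalized conditional probabilities can be illustrated with a short sketch. It assumes that the evidence for a label at a cake is the fraction of that label's records falling in the cake, rescaled so the slices sum to one; the exact computation MineSet uses is described in the MineSet User's Guide and may differ. The function name `evidence` is hypothetical.

```python
def evidence(cell_counts, label_totals):
    """Normalized conditional probabilities for one cake.

    cell_counts:  records at this cake, per label value
    label_totals: records in the whole dataset, per label value
    Assumed formula: P(cake | label), normalized across labels.
    """
    cond = {label: cell_counts.get(label, 0) / total
            for label, total in label_totals.items()}
    norm = sum(cond.values())
    if norm == 0:
        return {label: 0.0 for label in cond}
    return {label: p / norm for label, p in cond.items()}
```

If a cake holds 90 of 900 records with one label and 10 of 100 with another, its breakdown matches the prior and both slices come out equal. If it holds 30 of the 100 instead, the smaller class receives the larger slice even though it has fewer records at that cake.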
The Decision Tree Inducer now has an extended set of splitting criteria and pruning methods.
The set of splitting criteria in MineSet 2.6 has been extended to include chi-square and Gini.
Chi-square applies the chi-square statistical independence test to all candidate splits. It then selects the split that leads to the least independent breakdown of the label values.
Gini is the splitting criterion used in CART (Classification And Regression Trees). Like Mutual Info, Gini measures the change in purity between the parent node and the weighted average of the purities of the child nodes. Unlike Mutual Info, Gini calculates the node purity as one minus the sum of the squared label probabilities at that node.
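The Gini calculation described above can be written out directly. This is a minimal sketch of the formula, not MineSet's implementation; the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """One minus the sum of squared label probabilities at a node.

    The result is 0 when every record at the node has the same label.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_change(parent, children):
    """Parent value minus the weighted average of the child values."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted
```

A split that separates two equally frequent labels perfectly changes the value from 0.5 at the parent to 0 at each child, a change of 0.5. A candidate split with a larger change is preferred.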
MineSet 2.6 has three pruning options for decision trees: Confidence, Cost Complexity, and None.
Confidence is the default pruning method used in MineSet, and is based on the heuristic pruning techniques developed in C4.5. It compares the resubstitution error of a subtree with the error if that subtree were replaced with a single node. If the error rate of the node is within the confidence interval of the subtree error, then the subtree is replaced by the single node.
The confidence pruning parameter allows you to change the amount of pruning that MineSet performs. Higher values indicate more pruning; lower values indicate less pruning. The parameter is used to scale the size of a confidence interval in which pruning occurs. The lowest possible value is 0. With a pruning parameter of 0, a subtree is pruned only when the error rate of the single node is at least as low as that of the subtree. There is no upper limit on the confidence pruning parameter. The default factor, 0.7, has been determined empirically to be a reasonable setting in many domains.
Cost complexity is the pruning technique developed in CART (Classification And Regression Trees). Cost complexity pruning attempts to generate optimally sized trees by trading off the error rate of the tree (its cost) and the number of leaves in the tree (its complexity). During cost complexity pruning, the training set is partitioned into a learning set and a pruning set. The learning set is used to grow a pruning tree. This tree is pruned to generate a sequence of trees with decreasing complexity. The pruning set is then used to identify the minimum cost tree in this sequence. The size of the minimum cost tree is noted. The learning and pruning sets are recombined and used to grow a tree. This tree is then pruned to the size of the minimum cost tree.
The cost complexity pruning parameter allows you to select trees smaller than the minimum cost tree. The parameter indicates the number of standard errors more costly than the minimum cost tree that you are willing to accept. Setting the parameter to zero selects the minimum cost tree; setting the parameter to 0.5 selects the minimum size tree that has an error rate no more than 0.5 standard errors worse than the minimum cost tree. The default setting, 0, selects the minimum size tree that has the minimum cost. Higher numbers indicate more pruning. If your data might contain noise (errors and anomalies), increase the number to create smaller trees. If the tree is pruned back to a single node, decrease the number to decrease the amount of pruning and show more of the tree's structure.
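The selection rule for the cost complexity pruning parameter can be sketched as follows. The sketch assumes that the sequence of pruned trees has already been scored on the pruning set and that a standard error for the minimum cost is available; how MineSet estimates that standard error is not stated here. The function name `select_tree` is hypothetical.

```python
def select_tree(candidates, std_error, parameter=0.0):
    """Pick the smallest tree within `parameter` standard errors of the best.

    candidates: list of (number_of_leaves, pruning_set_error_rate) pairs,
                one per tree in the pruned sequence
    std_error:  standard error of the minimum error rate (supplied by caller)
    """
    min_error = min(error for _, error in candidates)
    limit = min_error + parameter * std_error
    eligible = [c for c in candidates if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])   # fewest leaves wins
```

With a parameter of 0, only trees that reach the minimum cost are eligible. Raising the parameter widens the acceptable error band, so a smaller tree is chosen.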
Pruning is slower than limiting the tree height or increasing the split lower bound because a full tree is built and then pruned. Pruning, however, is done selectively, resulting in lower error rates.
All visualization tools now provide the option of creating a file based on a visualization that may be published on the Web. From the File menu, select the “Publish on the Web” option. This runs a script that produces a .mtr file which is placed in a selectable directory. The default directory to appear in the file selection dialog box is one of the following:
A directory defined by the MINESET_WEB_DIR environment variable, if present
Or the directory $HOME/public_html if present
Otherwise, the current working directory is used.
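The lookup order for the default directory can be sketched as follows. This only mirrors the order listed above; the function name `default_publish_dir` and its parameters are hypothetical.

```python
import os


def default_publish_dir(environ=None, cwd=None):
    """Resolve the default directory offered by "Publish on the Web".

    Order: MINESET_WEB_DIR if set, then $HOME/public_html if it exists,
    then the current working directory.
    """
    env = os.environ if environ is None else environ
    web_dir = env.get("MINESET_WEB_DIR")
    if web_dir:
        return web_dir
    home = env.get("HOME")
    if home:
        candidate = os.path.join(home, "public_html")
        if os.path.isdir(candidate):
            return candidate
    return os.getcwd() if cwd is None else cwd
```

Passing `environ` and `cwd` explicitly makes it easy to check the fallback order without changing the real environment.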
There have been a few changes to the file exchange procedures between MineSet and SAS. The following sections describe the changes or replace equivalent sections in Chapter 9, “File Exchange Between MineSet and SAS,” in the MineSet User's Guide:
“SAS Installation Location” describes a new procedure for when SAS is not installed in the default location.
“Converting MineSet Data Files to SAS Data Sets” replaces the first few paragraphs of the equivalent section in the MineSet User's Guide.
“Converting SAS Data Sets to MineSet Data Files” replaces the first few paragraphs of the equivalent section in the MineSet User's Guide.
If SAS is not installed in the default location (/usr/sbin/sas on IRIX, c:\sas\sas.exe on Windows NT), the environment variable SAS_CMD must be set to the installed location of the SAS executable. For example:
setenv SAS_CMD /usr/people/joe/sas/bin/sas (IRIX)
set "SAS_CMD=c:\Program Files\sas\sas.exe" (Windows NT)
Use mineset2sas to convert MineSet data files into SAS data sets. The syntax for this is:
mineset2sas <MineSet file> <SAS datafile> [options]
The options are:
-svsc to save the script sent to SAS. The script normally is deleted after use.
-names <namefile> to save trimmed column names in <namefile>.
For example:
mineset2sas cars cars.ssd01 -svsc -names cars.names
Use sas2mineset to convert SAS data sets into MineSet data files. The syntax for this is:
sas2mineset <SAS datafile> <MineSet file> [options]
The options are:
-nodata creates only a .schema file, no .data file.
-svsc saves the scripts sent to SAS.
-nolabel indicates that you do not want labels used for column names.
-names <namefile> restores long column names from <namefile>, created by mineset2sas.
For example:
sas2mineset houses.ssd01 houses -svsc -names houses.names