MineSet includes several sample configuration and data files that demonstrate the capabilities of its tools. This section describes the sample files for each tool. The entries are arranged in alphabetical order by tool.
Windows users find the files in the directory in which MineSet was installed, under \examples.
IRIX users find the files in /usr/lib/MineSet/examples.
The file descriptions are:
The following sample data and configuration files are provided to visualize Association Rules based on prepared datasets. Some of these files correspond to hierarchical datasets. Rules files contain the generated rules obtained by running the Association Rules Generator. The files containing the rules should, by convention, have a .rules.data extension. Each configuration file specifies how the corresponding rules file is displayed. Configuration files must have a .scatterviz extension. The files mentioned in this subsection are in the Windows directory in which MineSet was installed, under \examples, or the IRIX directory /usr/lib/MineSet/scatterviz/examples.
group.rules.data and group.rules.scatterviz
These files provide the generated rules and configuration specifications for product groups, such as bread and baked goods, dairy milk, and carbonated beverages.
category.rules.data and category.rules.scatterviz
These files provide the generated rules and configuration specifications for product categories within product groups, such as refrigerated or non-refrigerated milk.
adult94.rules.data and adult94.rules.scatterviz
These files provide the generated rules and configuration specifications for a census dataset, showing associations between marital status, education level, age, income, and other variables.
germanCredit.rules.data and germanCredit.rules.scatterviz
These files provide the generated rules and configuration specifications for a credit dataset from Germany, showing associations between credit history, employment, savings, and other variables.
cars.rules.data and cars.rules.scatterviz
These files provide the generated rules and configuration specifications for the cars dataset, showing associations among the various attributes.
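The sample pairs listed above share a base name: each rules file ends in .rules.data and its configuration file ends in .rules.scatterviz. The following is a minimal sketch of that pairing, assuming a hypothetical helper named `config_for_rules`; it is not part of MineSet and only manipulates file names.

```python
RULES_SUFFIX = ".rules.data"          # convention for generated rules files
CONFIG_SUFFIX = ".rules.scatterviz"   # suffix used by the sample configuration files


def config_for_rules(rules_name: str) -> str:
    """Return the configuration file name that pairs with a rules file name.

    Hypothetical helper for illustration only.
    """
    if not rules_name.endswith(RULES_SUFFIX):
        raise ValueError(f"{rules_name!r} does not end in {RULES_SUFFIX}")
    stem = rules_name[: -len(RULES_SUFFIX)]
    return stem + CONFIG_SUFFIX


for name in ["group.rules.data", "germanCredit.rules.data"]:
    print(name, "->", config_for_rules(name))
```

A name that does not follow the .rules.data convention raises a `ValueError` rather than being silently paired.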
The following example shows a case in which clustering might be useful. This example is associated with a sample dataset provided with MineSet. It shows how to work with the Clustering mining tool, and explains the different outcomes and options.
The cars dataset is relatively simple, dealing with familiar concepts of horsepower, vehicle weight, and time required to reach 60 mph.
When the Cluster Visualizer first appears, the attributes are arranged from top to bottom according to how useful they are for differentiating among all clusters.
If you select cluster 1, that cluster then controls the priority ordering of attributes represented by the bar charts and histograms. The order of attributes in the other clusters is then also based on cluster 1. For example, if you click on cluster 1, the attribute sequence is cylinders, weight, then miles per gallon. The change in ordering appears only in the visualization, not in the underlying dataset. You can compare the same row across the other clusters to see how that attribute differs from cluster to cluster. When you select cluster 2, you see a different order of attributes at a lower level. In this case, origin is most important, then cylinders, then horsepower, then miles per gallon.
The following example shows a case in which Column Importance might be useful. This example is associated with a sample dataset provided with MineSet. It shows how to work with the Column Importance mining tool, and explains the different outcomes and options.
When customers change their phone carrier from one telecommunications company to another, this is termed “churning.” This is a common problem in the telecommunications industry. The files churn.schema and churn.data were used to generate this example. Windows users can find them in the directory in which MineSet was installed, under \data. IRIX users can find them in /usr/lib/MineSet/data.
Running the simple Column Importance mode yields the following three attributes:
Total Day Minutes
Number of Customer Service Calls
State
By running “compute improved purity” from the advanced mode, you can see that Total Day Charge and Total Day Minutes have the same purity ranking (48.67). By moving one of them to the right (for example, Total Day Minutes) and rerunning Compute Improved Purity, you can see that there is no value to the other (Total Day Charge). These two attributes are highly correlated.
Looking at the attributes when Total Day Minutes is on the right, we can see that the following are good:
International Plan (4.1)
Number of Customer Service Calls (8.1)
State (4.7)
You can choose to move International Plan to the right, because this information is readily available and easy to measure.
The other two attributes (Number of Customer Service Calls and State) remain highly important (in fact, their importance increases), so they are apparently not correlated with the International Plan.
By looking at the importance of attributes this way, you can determine which ones can be substituted with others that are equally good (or almost as good), but are easier to measure or understand. By looking at the purity, you can determine how much the additional attributes help. For example, in the above scenario, state significantly improves the purity. In the iris dataset, the third attribute chosen (sepal length) raises the purity only slightly higher. The simpler, two-dimensional scatterplot will give nearly as much information as a three-dimensional one.
The following examples illustrate cases in which the Decision Tree inducer can be useful. Each of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -dt.treeviz files described below.
The data files can be loaded into MineSet by opening the corresponding .schema file from the data directory (for example, churn.schema). The classifier visualization files, which have a -dt.treeviz extension, can be opened from the examples directory.
Windows users find these files in the data and examples directories of the directory in which MineSet was installed.
IRIX users find these files in the data and examples directories of /usr/lib/MineSet/treeviz/examples.
When customers change their phone carrier from one telecommunications company to another, this is termed “churning.” This is a common problem in the telecommunications industry. The file in the examples directory, churn-dt.treeviz, shows a Decision Tree classifier induced for this problem. The file was generated by running the inducer on the file in the data directory, churn.schema, with the label set to churn (yes, no). The file given is fictitious, but based on patterns found in real data.
In this tree the root split is on the amount of time the customers talk during the day (total day minutes). Customers who talk more than 264 minutes per day churn at a significantly higher rate than those who don't (60% versus 11%). These also are probably the most profitable customers.
The left subtree represents customers who talk less than 264 minutes per day. They have a churn rate of 11%; but if they make more than three customer service calls, the churn rate increases to 49%.
The right subtree represents customers who talk over 264 minutes per day. They have a churn rate of 59%; but if they have a voice-mail plan, the rate decreases to 9.3%. If they do not have a voice-mail plan, the churn rate is almost 75%.
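The two subtrees described above can be read as nested conditions. The following is a minimal sketch that returns the churn rate quoted in the text for each path. The function name `quoted_churn_rate` is hypothetical, and sending a customer with exactly 264 minutes to the left subtree is an assumption, since the text does not say which branch includes the boundary.

```python
from typing import Optional


def quoted_churn_rate(total_day_minutes: float,
                      customer_service_calls: int,
                      has_voice_mail_plan: bool) -> Optional[str]:
    """Return the churn rate quoted in the text for the matching tree path.

    Returns None where the text quotes no figure specific to that path.
    """
    if total_day_minutes > 264:
        # Right subtree: heavy daytime callers (59% churn overall).
        return "9.3%" if has_voice_mail_plan else "almost 75%"
    # Left subtree: lighter daytime callers (11% churn overall).
    if customer_service_calls > 3:
        return "49%"
    return None  # only the subtree-wide 11% is quoted for this path


print(quoted_churn_rate(300, 1, False))
print(quoted_churn_rate(120, 5, False))
```

The rates are returned as the strings quoted in the text, to avoid implying more precision than the description gives.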
The cars dataset contains information about different models of cars from the 1970s and early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file from the examples directory, cars-dt.treeviz, shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on the file from the data directory, cars.schema, with the label set to origin (Japan, U.S., Europe). If you have a dataset of car attributes, you might want to know what characterizes cars of different origins.
Windows users find these files in the directory in which MineSet was installed, under \examples\cars-dt.treeviz and \data\cars.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/cars-dt.treeviz and /usr/lib/MineSet/data/cars.schema
Note that in the tree the left split is on brand. The root split is not on brand because the Decision Tree inducer penalizes multi-way splits, and the split on cubic_inches was deemed a better discriminator. You can use the Tool Manager Remove Column transformation to hide the brand, thus making the problem more interesting.
In the Decision Tree, you can see that cubic inches is an excellent discriminator for U.S.-made cars. Cars with large engines (> 169.5 cubic inches) are almost all made in the U.S., but smaller cars are made everywhere. By choosing Selections > Show Original Data, you can see that the one car with a big engine that was not made in the U.S. is a Mercedes. Note that in this tree, the root node (that is, the entire training dataset) has many more U.S. cars (62.50%), yet after a single split on the cubic inches attribute, it is more difficult to predict the origin of cars with small engines. The purity of the root is 16.2, showing that there is one class (U.S., in this case) that is dominant. The right node (cubic inches > 169.5) has purity 96.81, indicating that we have identified a very pure subpopulation (almost all cars with large engines were made in the U.S.). Indeed, the error rate for the right subtree is estimated at 0% (green base). The left node from the root has purity 0.23 and a much higher error rate of 31.25% (orange base). This subproblem is much harder than the original one: the number of records for each class is approximately the same.
The adult dataset contains information about working adults. This dataset was extracted from the U.S. Census Bureau. It contains data about people older than 16, with a gross income of more than $100 per year, who work at least one hour a week. You might want to know how to characterize males and females. The file adult-sex-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on adult.schema, with the label set to sex. This dataset contains almost 50,000 records, so running the Decision Tree inducer on your workstation can take several minutes.
Windows users find these files in the directory in which MineSet was installed, under \examples\adult-sex-dt.treeviz and \data\adult.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/adult-sex-dt.treeviz and /usr/lib/MineSet/data/adult.schema
The resulting visualization provides the following insights:
Relationship is a giveaway attribute for some values. Husbands usually are male. (Interestingly, there is one husband who is female, showing data quality problems at the Census Bureau, which does not recognize same-sex marriages.) Similarly, if the person is a wife, the person is usually female, except for three records that show otherwise.
To make the problem more interesting, remove the relationship attribute and generate a new Decision Tree. In this case:
The most important attribute is marital status.
From the height of the bases, most people are either divorced, married to a civilian spouse, or never married. Few are married with spouse absent, separated, married to armed-forces spouse, or widowed.
The distribution at the root shows more males in this dataset. (This dataset contains information about working adults and is not representative of the entire population.)
The left-most node contains divorced working adults. We can see that the distribution is more balanced than at the root (60% female, 40% male). The second node contains married working adults. We can see that 89% are males. The third node contains working adults that have never married. Their numbers are approximately equal to those in the divorced group, with slightly more males. The right-most node contains working widowed adults, of which 81% are females (probably because of their higher life expectancy). The term “widowed” refers to anyone who has lost a spouse.
If you want to target working females for a new product, you can use the search panel to identify segments that have a large population of females. You can do this by choosing:
sex matches female (click female on the top portion of the window)
subtree weight > 1000
percent > 80
Three yellow spotlights show the matching nodes. Since two are on one path, look at the node closest to the root (on the right). The paths translate into the following rules:
marital status = Widowed implies that 81.23% are female
marital status = Divorced and occupation = administrative clerical implies that 87.67% are female
In this training set, 1233 (widowed) and 1045 (divorced and occupation) females satisfy these rules out of 16,192 at the root. This simple segment contains over 14% of the working women in the dataset.
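The "over 14%" figure follows directly from the counts quoted above. The short calculation below reproduces it; the variable names are illustrative, and the counts are taken from the text.

```python
# Counts quoted in the text for the two matching rules.
widowed_females = 1233
divorced_clerical_females = 1045
females_at_root = 16192

segment = widowed_females + divorced_clerical_females
share = segment / females_at_root

print(f"{segment} of {females_at_root} = {share:.2%}")  # 2278 of 16192 = 14.07%
```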
If you have a dataset of working adults, you might want to find out what factors affect salary. You might then divide the records into two classes: those adults earning under $50,000 a year, and those earning more. Each record then has an attribute with one of two values: “- 50,000” and “50,000+”. You can run a MineSet classifier to help determine what factors influence salary. The examples file adult-salary-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on the data file adult.schema with gross_income binned at the user-specified threshold of 50000 and the label set to gross_income_bin.
Windows users find these files in the directory in which MineSet was installed, under \examples\adult-salary-dt.treeviz and \data\adult.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/adult-salary-dt.treeviz and /usr/lib/MineSet/data/adult.schema
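The two-class binning of gross_income described above can be sketched as follows. This is a minimal illustration, not MineSet's binning code: the function name `gross_income_bin` is hypothetical, and assigning an income of exactly 50,000 to the upper bin is an assumption, since the text does not say on which side the boundary falls.

```python
THRESHOLD = 50_000  # user-specified threshold from the text


def gross_income_bin(gross_income: float) -> str:
    """Map a gross income to one of the two class labels used in the text."""
    # Assumption: the boundary value itself belongs to the upper bin.
    return "- 50,000" if gross_income < THRESHOLD else "50,000+"


for income in (18_500, 50_000, 72_000):
    print(income, "->", gross_income_bin(income))
```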
The resulting visualization provides the following insights:
The root, which represents the entire training set, shows 76.07% of the working adults earn under $50,000.
Age is the most important factor. Only 3.07% of the people under 27 years old earn more than $50,000. The base color is green, indicating a very accurate rule (about 3% error rate).
Education is an important factor for predicting salary for people over 27 years old. The Census Bureau assigns education levels to each person. The Decision Tree classifier splits on 12.5; level 13 corresponds to a Bachelor's degree. People with a Bachelor's degree or higher go right, to the node where about 55% earn over $50,000.
Of the segment that is older than 27 years and well educated, relationship is an important predictor of salary. For those persons that are married, chances of earning $50,000 or more increase to 73% for husbands and 75% for wives. (However, the node containing wives has a small base, indicating that few females match this rule.) If the person in this group is not married, chances of earning $50,000 or more decrease to 27% for males and 25% for females.
In this dataset, each record describes four characteristics of iris flowers: petal width, petal length, sepal width, and sepal length. Each iris was further classified into the types iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes each iris type.
Before running a classifier, click the Importance tab in the Tool Manager's Classifiers tab; then click Go. You obtain a ranking of the importance of the features: petal_width, petal_length, and sepal_length. You can map these to the axes in the Scatter Visualizer, with the iris_type mapped to the color, and see the clusters.
The file iris-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on iris.schema.
Windows users find these files in the directory in which MineSet was installed, under \examples\iris-dt.treeviz and \data\iris.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/iris-dt.treeviz and /usr/lib/MineSet/data/iris.schema
Running the Tree Visualizer, you can see that the root has a 6% error rate, even though the purity is very low (0). The purity measures the skewness of the distribution, and, at the root, the distribution is perfectly uniform: 50 records for each label value. The left branch (petal-length <= 2.6 inches) goes to a green node (zero error) containing only iris-setosas. The other branches are also quickly able to separate the classes using another test on the petal_width. The path petal-length > 2.6 and petal-width <= 1.65 and petal-length > 5 ends with an impure leaf containing 4 records. There are three records of type iris-virginica and one of iris-versicolor. The Decision Tree did not split this node because it was deemed insignificant (by default, every split must contain two children with at least a weight of two). The node color is also black, indicating that no test instances reach this node, so we do not have an estimated error rate for it.
To summarize: the flowers with petal length <= 2.6 inches are predicted as iris-setosa; those with petal length > 2.6 inches and <= 5 inches and petal width <= 1.65 inches are predicted as iris-versicolor; and those with petal length > 2.6 inches and petal width > 1.65 inches, or with petal length > 5 inches and petal width <= 1.65 inches, are predicted as iris-virginica.
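The summary above can be written as a small rule function. This is a sketch of the rules exactly as stated in the text, not output generated by MineSet; the function name `predict_iris` is hypothetical, and the measurements use the units given in the text.

```python
def predict_iris(petal_length: float, petal_width: float) -> str:
    """Apply the three summarized Decision Tree rules to one flower."""
    if petal_length <= 2.6:
        return "iris-setosa"
    # From here on, petal_length > 2.6.
    if petal_width > 1.65:
        return "iris-virginica"
    # petal_width <= 1.65: the second petal-length test decides.
    if petal_length > 5:
        return "iris-virginica"
    return "iris-versicolor"


for length, width in [(1.4, 0.2), (4.5, 1.4), (5.5, 1.5), (4.8, 1.8)]:
    print(length, width, "->", predict_iris(length, width))
```

Ordering the tests this way mirrors the path described for the impure leaf: petal-length > 2.6, then petal-width <= 1.65, then petal-length > 5.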
Because the Decision Tree makes binary splits on continuous attributes while Column Importance discretizes the data, the root split of the tree is different from the first attribute in Column Importance (see “Column Importance” in Chapter 1 for more details).
The file mushroom-dt.treeviz shows the Decision Tree classifier induced for the classification of mushrooms. This file was generated by running the inducer on mushroom.schema.
Windows users find these files in the directory in which MineSet was installed, under \examples\mushroom-dt.treeviz and \data\mushroom.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/mushroom-dt.treeviz and /usr/lib/MineSet/data/mushroom.schema
The goal is to understand which mushrooms are edible and which are poisonous, given this dataset. There are over 8000 records in this set; thus, running this inducer might take several seconds.
Each mushroom has many characteristics, including cap color, bruises, and odor. If you build a Decision Tree classifier, you can see that using only the odor attribute lets you determine in 50% of the cases whether the mushroom is poisonous or edible. If the mushroom has no odor, there is a 3.4% chance it is poisonous. The next attribute to look at is the shape of the stalk. If it tapers, the mushroom is edible; but if it enlarges, there is an 11.6% chance the mushroom is poisonous. There are 1032 mushrooms that reach this node. You can follow the tree down further nodes to see what other attributes to consider.
This dataset consists of voting records. The goal is to identify the party a congress person belongs to given data about key votes. The dataset includes votes for each member of the U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and announced for (these three are simplified to yes); voted against, paired against, and announced against (these three are simplified to no); voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three are simplified to an unknown disposition).
Before running a classifier, look at the 16 votes to see if you can perceive which features are important. Then run the Decision Tree classifier.
The file vote-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on vote.schema.
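The simplification of the nine CQA vote types into three dispositions, described above, amounts to a lookup table. The sketch below shows one way to express it. The names `CQA_VOTE_TYPES` and `simplify_vote` are hypothetical, and the strings "yes", "no", and "unknown" are illustrative labels; the actual values stored in the sample data file may be encoded differently.

```python
# Nine CQA vote types mapped to the three simplified dispositions.
CQA_VOTE_TYPES = {
    "voted for": "yes",
    "paired for": "yes",
    "announced for": "yes",
    "voted against": "no",
    "paired against": "no",
    "announced against": "no",
    "voted present": "unknown",
    "voted present to avoid conflict of interest": "unknown",
    "did not vote or otherwise make a position known": "unknown",
}


def simplify_vote(cqa_type: str) -> str:
    """Return the simplified disposition for a CQA vote type."""
    return CQA_VOTE_TYPES[cqa_type.strip().lower()]


print(simplify_vote("Paired for"))
print(simplify_vote("voted present"))
```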
The breast cancer dataset contains information about women undergoing breast cancer diagnosis. Each record is a patient with attributes such as cell size, clump thickness, and marginal adhesion. The final attribute is whether the diagnosis is malignant or benign. The file breast-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on breast.schema.
Windows users find these files in the directory in which MineSet was installed, under \examples\breast-dt.treeviz and \data\breast.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/breast-dt.treeviz and /usr/lib/MineSet/data/breast.schema
The Decision Tree shows that uniformity of cell size is a very strong discriminatory attribute. While the root distribution is about 65% versus 35% (purity is 7.07), the two children of the root are much more skewed, with the left node having an error rate of only 1.29%. The root alone is an excellent discriminator: if you limit the tree height to a single level, the error rate is 7.3%.
The hypothyroid diseases dataset is similar to the one for breast cancer, except that we are trying to predict hypothyroidism rather than cancer. The file hypothyroid-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on hypothyroid.schema.
Windows users find these files in the directory in which MineSet was installed, under \examples\hypothyroid-dt.treeviz and \data\hypothyroid.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/hypothyroid-dt.treeviz and /usr/lib/MineSet/data/hypothyroid.schema
There are 3163 records in this dataset, and most of them (95.23%) do not have hypothyroidism. This means that one can predict “negative” and be correct most of the time. However, we are worried about the people who have hypothyroidism but whom the model predicts to be healthy: the false negatives are very important. By selecting a confusion matrix from Further Inducer Options, you'll see that there are five patients with hypothyroidism who were misclassified.
Looking at the Decision Tree, you can see that the root node is green (highly accurate). The single test on fti at the root shows that it is relatively easy to identify many of the negative diagnoses. People with high fti are 99.7% negative, and all those where the value is unknown are also negative (perhaps the doctor decided not to measure this attribute because something else was obvious), but the rest (218 people) are difficult-to-diagnose cases. We started with 3163 records, but only 218 are really “interesting” to mine because it was very easy to determine the classification of most cases. In this example most of the data is uninteresting and you want to concentrate on a small part quickly. Of the 218 people, you can see that about 66% are positive and 34% negative.
As you move down the tree, increase the height scale (slider on the top left of the visualizer) to see the different heights. The node that catches most of the people with hypothyroidism has the conditions “fti <= 64.5 and tsh > 5.95.” It contains 140 of the 151 records that have hypothyroidism.
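The figures quoted for this dataset are consistent with each other, as the short calculation below shows. It uses only the counts given in the text; the variable names are illustrative.

```python
# Counts quoted in the text.
total_records = 3163
positive_records = 151   # records that have hypothyroidism
caught_by_node = 140     # positives reaching "fti <= 64.5 and tsh > 5.95"

# Accuracy of always predicting "negative".
negative_share = (total_records - positive_records) / total_records
# Fraction of all positive records that reach the highlighted node.
node_coverage = caught_by_node / positive_records

print(f"always-negative accuracy: {negative_share:.2%}")   # 95.23%
print(f"positives reaching the node: {node_coverage:.1%}")  # 92.7%
```

This is why the overall error rate alone is a poor guide here: a trivial classifier already scores about 95%, so the useful question is how many of the 151 positive records the model captures.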
This dataset is a diagnosis problem for diabetes using statistics gathered from a Native American tribe in Phoenix, Arizona. The task is to determine whether a patient has diabetes, given some medical attributes, such as blood pressure, body mass, glucose level, and age.
The file pima-dt.treeviz shows the Decision Tree classifier induced for this problem. This file was generated by running the inducer on pima.schema.
Windows users find these files in the directory in which MineSet was installed, under \examples\pima-dt.treeviz and \data\pima.schema
IRIX users find these files in /usr/lib/MineSet/treeviz/examples/pima-dt.treeviz and /usr/lib/MineSet/data/pima.schema
There are 3,186 records in this DNA dataset. The domain is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which “superfluous” DNA is removed during protein creation. The task is to recognize exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE sites; or neither. The IE borders are referred to as “acceptors” and the EI borders are “donors.” The records were originally taken from GenBank 64.1 (genbank.bio.net). The attributes provide a window of 60 nucleotides. The classification is the middle point of the window, thus providing 30 nucleotides at each side of the junction.
In this example, the root of the Decision Tree shows the distribution of the three classes. By pointing to the bars, you can see that the composition is about 24% exon/intron, 24% intron/exon, and 52% none. The “left_01” in front of the root node indicates that this is an important attribute to look at first. The “left_01” notation refers to the first nucleotide found to the left of the splice junction in question. The choices of attribute values for this first nucleotide (and all nucleotides in general) are the “A”, “G”, “T”, and “C” nucleotides. If the “left_01” nucleotide is a “G”, then the “G” branch is taken and followed to the next node, where the distribution now shows that such a nucleotide is more likely to be an “exon/intron” or an “intron/exon” than at the root: the distribution is 34% for “exon/intron,” 42% for “intron/exon”, and 24% for “none.” If the “left_01” nucleotide is an “A”, “T”, or “C”, then the corresponding “A”, “T”, or “C” branch is taken instead and in all three cases, the probability of “none” increases dramatically (87%, 87%, and 95% respectively). This testing and branching process is repeated until the final node with the predicted class (“exon/intron”, “intron/exon”, or “none”) is reached.
For this dataset, the Evidence Classifier is more appropriate than a Decision Tree due to the probabilistic nature of this domain. This can be verified by comparing the estimated error rates.
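To make the attribute naming concrete, the sketch below splits a 60-nucleotide window into per-position attributes around the junction at its midpoint. The name left_01 comes from the text; the function name `window_to_attributes`, the right_NN names, and the two-digit numbering for all 30 positions on each side are assumptions made for illustration.

```python
def window_to_attributes(window: str) -> dict[str, str]:
    """Name each nucleotide by its distance from the junction at the midpoint."""
    if len(window) != 60:
        raise ValueError("expected a window of exactly 60 nucleotides")
    half = 30
    attributes = {}
    for distance in range(1, half + 1):
        # left_01 is the first nucleotide to the left of the junction.
        attributes[f"left_{distance:02d}"] = window[half - distance]
        # right_01 (assumed name) is the first nucleotide to the right.
        attributes[f"right_{distance:02d}"] = window[half + distance - 1]
    return attributes


example = "A" * 29 + "G" + "T" + "C" * 29
attrs = window_to_attributes(example)
print(attrs["left_01"], attrs["right_01"], len(attrs))
```

In this example the nucleotide immediately left of the junction is "G", which corresponds to the "G" branch taken at the root of the tree described above.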
The following examples show cases in which the Decision Table can be useful. Each of these examples is associated with a sample data file provided with MineSet. By running the Decision Table inducer (with Suggest Using Feature Search turned on in the Further Inducer Options), you can generate the -dtab.dtableviz and -dtab.dtableviz.data files described below.
Windows users find these files in the directory in which MineSet was installed, under \examples and \data
IRIX users find these files in /usr/lib/MineSet/treeviz/examples and /usr/lib/MineSet/data
Churn is when a customer leaves one company for another. This example shows what causes customer churn for a telephone company. The files churn.schema and churn.data were used to generate this example. Windows users can find them in Program Files\SGI\MineSet\data. IRIX users can find them in /usr/lib/MineSet/data.
The file churn-dtab.dtableviz shows the structure of the classifier induced using the attribute churned as the label. The error rate for this classifier is 5.5%. Of the records, 14.3% represent customers who churned. The two attributes selected for the first level of detail were number of customer service calls and total day charge. By looking at the distribution over these two attributes, you can see that churn increases as total day charge increases, except when the total day charge is less than 29.75. Then the churn is high if the number of service calls is more than 3. About 3/4 of the records have total day charge less than 38 and 3 or fewer customer service calls.
Begin to drill down on regions where it's not clear when a customer churns. Figure A-1 shows drilling down on all cakes in which there was not a clear majority class. The next attributes considered are international plan and number of vmail messages. Among those with heavy day charge, it appears clear that having the international plan and having few voice mail messages correlates well with customer churn. Selecting the lower right cake in each of the drill-down regions, and then brushing with the mouse across the box next to “churned=yes” in the probability pane on the right, shows that only 3.4 percent of the customers in the selected regions churned.
Now drill down further on the cake in the upper left of each previous drill-down region. Doing so shows a very similar distribution for each. The pattern is consistent: high total evening charge and high total international charge correlate well with churn. You can even drill down another level to see the effect of total international calls, but doing so leaves so few records that you could not be confident of a prediction based on this sample. If you are interested in how total international calls affect churn and how they correlate with other variables, return to the Tool Manager, explicitly map total international calls to a higher level in the hierarchy, and rerun the decision table.
Although “state” is fairly well correlated with churn, it was not selected because the algorithm has a built-in preference for variables with few values. This prevents the algorithm from selecting attributes like social security number which uniquely identify each record, thus yielding high training set accuracy, but are not useful for classifying future unlabeled data.
The cars dataset contains information about different models of cars from the 1970s and early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file cars-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this dataset. This file was generated by running the inducer on cars.schema with the label set to “origin” (Japan, U.S., Europe).
Windows users find these files in the directory in which MineSet was installed, under \examples\cars-dtab.dtableviz and \data\cars.schema
IRIX users find these files in /usr/lib/MineSet/examples/dtableviz/cars-dtab.dtableviz and /usr/lib/MineSet/data/cars.schema
Since the brand attribute uniquely determines the origin, the structure of the classifier is extremely simple. The only two attributes shown are brand and cylinders. It is interesting to see which brands tend toward high or low cylinder types. For example, there are 21 different models of Mazda, and 18 models of Honda, but they all have 5 or fewer cylinders. Conversely, Cadillac only makes cars with six or more cylinders.
This example could probably be made more interesting by first removing the brand attribute. Another useful transformation might be to convert cylinders to string so each unique cylinder value is shown, rather than a bin. Alternatively, one can create additional levels of detail beyond brand and cylinder, by mapping them explicitly.
The adult dataset contains information about working adults. This dataset was extracted from the U.S. Census Bureau. It contains data about people older than 16, with a gross income of more than $100 per year who work at least one hour a week. You might want to know how to characterize males and females.
The file adult-sex-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on adult.schema, with the label set to sex, after removing the relationship column (which would have made the classifier trivial). To make it easier to see the distribution of records for each combination of values, you can scale the cake heights using the scale slider on the left.
Windows users find these files in the directory in which MineSet was installed, under \examples\adult-sex-dtab.dtableviz and \data\adult-sex.schema
IRIX users find these files in /usr/lib/MineSet/examples/dtableviz/adult-sex-dtab.dtableviz and /usr/lib/MineSet/data/adult-sex.schema
In the Decision Table Visualizer, the Label Probability Pane shows that the prior probability of working males is higher than that of females. The Evidence Visualizer showed us that marital status and occupation are very important attributes for determining gender; however, it did not show us the dependencies between these two attributes. The top level shows several interactions. For example, most people with occupation craft repair are married to civilian spouses (more specifically, 98.6% of them are husbands), while most people with occupation “Other-service” have marital status “Never-married” (48% male).
At first it may seem odd that most of the people with “marital_status = Married-civilian-spouse” are male, but once you consider that this data was probably gathered from tax returns, it seems reasonable that the wives of these males are not working, but filing jointly with their husbands.
The divorce rate seems highest among those with “occupation = Adm-clerical”. Those in “Other-service” also have high divorce rates, but they seem to prefer separation, as the number who are separated is even greater than that of “Adm-clerical.”
Suppose you wanted to find out the probability of being female given that a person is “widowed” and has “occupation=Adm-clerical”. In the Evidence Visualizer one can get an approximate answer (94.7% female) for this by clicking on these two attribute values. Here we can get the exact answer by clicking the left mouse button on the cake at the intersection of these two values (95.2% female).
Drill down to the lowest level on the cake for “Married-civilian-spouse” and “Occupation= Unknown.” There is a pattern evident here more than any of the other combinations of occupation and marital status. For this cake, the younger members tend to be women, and the older members tend to be men.
For a dataset of working adults, you might want to find out what factors affect salary. First bin gross income into two bins: those that earn less than 50,000, and those earning over 50,000. You can run a MineSet classifier to help determine what factors influence salary. The file adult-salary-dtab.dtableviz shows the Decision Table classifier induced for this problem. This file was generated by running the inducer on adult.schema with gross income divided into two bins using a user-specified threshold.
Windows users find these files in the directory in which MineSet was installed, under \examples\adult-salary-dtab.dtableviz and \data\adult-salary.schema
IRIX users find these files in /usr/lib/MineSet/examples/dtableviz/adult-salary-dtab.dtableviz and /usr/lib/MineSet/data/adult-salary.schema
Since the label is numeric, a continuous spectrum is used to assign colors to each class. Also the classes in the probability pane on the right are not sorted by slice size because they have a numeric order. Red is assigned to the highest bin (50,000+).
The two attributes chosen at the top of the hierarchy are relationship and “education_num.” The attribute education_num is not particularly useful because it is simply an enumeration of the different educations possible, not years of education, as you might think. However, there is an approximate correlation. Replace this column with education if you prefer to see the actual string values. If you simply remove the column education_num and rerun using feature search, the algorithm may not pick education at the top of the hierarchy because it has so many values.
The order of the attributes in this model was selected automatically to increase accuracy. Often a model in which domain knowledge is used to perform the mappings can give a more useful visualization. Such a model is provided by adult-salary3-dtab.dtableviz, and shown in Figure A-2. Here the salary has first been binned into three ranges (20,000 and 60,000 are the thresholds). The attributes mapped at level one are relationship and sex; level two has education and occupation; and level three has hrswk (hours worked per week) and age.
At the top level we know “relationship” and “sex” have a strong correlation. Of course we expect all husbands to be male, and wives to be female, but we can see right away this is not the case. By picking on the cake for male wives we see that there are 3 of them and all their salaries fall in the 20,000-60,000 range. Drilling down on these cakes reveals more information about these anomalous records. You may wish to select these cakes (using the left mouse button) and drill through to the underlying data so you can find the values of all the other fields.
Click the right mouse button on the background; this will drill down globally to the next level. Every cake is now replaced with a matrix for every combination of education and occupation. The ordering is the same for every matrix, and overall the ordering is by correlation with income. If you choose Nominal Order > By Weight from the pulldown menu, the overall ordering will be by record weights. The most prevalent occupations and education levels will appear in the lower left corner of each matrix. “High school grads” is the most prevalent education level, and “Professional-specialty” is the most common occupation, but there are not many high school graduates whose occupation is professional specialty.
Returning to ordering nominals by income, you can see distinct distributions for each combination of sex and relationship. There is not much difference between males and females whose relationship is not-in-family. The difference between unmarried males and females, however, is very pronounced. (See Figure A-3). There is a very distinctive cluster of red cakes in the lower left of the male matrix that does not exist in the female matrix. By scaling up the heights somewhat, you will notice that the female matrix has obvious spikes at occupation = “admin clerical” and “other service.” No such spikes are visible in the corresponding male matrix.
Click with the right mouse button on the most populous cake (male husbands). This operation may take a minute to perform because the visualizer needs to construct all the geometry for the next level for this cake. The geometry is constructed on demand because the time needed to create it all at the beginning would be excessive, and wasteful since the user rarely explores many of the high detail regions. If you right click on the background by mistake you may be forced to wait a very long time if the amount of detail at the next level is very large, as it is in this case. If the drill-down will take a long time, a progress bar with a cancel button is displayed.
Consider the many age by hrswk distributions that are displayed for every combination of occupation and education. The first surprising fact is that, in spite of the many hundreds of cakes shown, there is a single spike that accounts for 2.5% of all husbands! (See Figure A-3) If you had to pick characteristics for a typical husband, it would be reasonable to say he is a HS-grad doing craft-repair, aged between 41 and 59 and working between 38 and 41 hours a week.
Compare the salary distributions for husbands who are HS-grads in sales with those who are HS-grads and executive managers. Although the distribution of age and hours worked is similar, the probability of being in the income greater than 60,000 class is 34% for this group of managers, compared with 27% for the salesmen. To see these probabilities shown at the top, first click the button next to the 60000+ income class in the label probability pane on the right, then pick cakes on the left.
In this dataset, each record describes four characteristics of iris flowers: petal width, petal length, sepal width, and sepal length. Each iris was further classified into the types iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes each iris type.
Before running a classifier, click the Column Importance tab in the Tool Manager's Classifiers tab; then click Suggest then Go. You obtain a ranking of the importance of the features: petal width, petal length, and sepal length. You can map these to the axes in the Scatter Visualizer, with the iris type mapped to the color and see the clusters.
The file iris-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on iris.schema.
In the Decision Table Visualizer, we can see that petal width is an excellent discriminatory attribute. When you add sepal width you see that all instances of iris versicolor appear in the “sepal width < 3.05” bin for those records which have petal width of between 0.75 and 1.65.
Drill down on the three cakes which are not 100% pure. The top two cakes each contain a single instance of iris-versicolor, which prevents them from being pure. For the “sepal width < 3.05” cake it is very difficult to isolate the anomalous iris-versicolor. For that particular cake, however, the iris-versicolor is isolated by using petal length.
The file mushroom-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on mushroom.schema.
The goal is to understand which mushrooms are edible and which ones are poisonous, given this dataset. There are over 8000 records in this set; thus, running this inducer might take several minutes. Note that under the default mode of the one-third holdout for accuracy estimation, a third of the records are kept for testing.
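The one-third holdout mentioned above can be pictured with a short sketch. This is not MineSet's implementation: the function name holdout_split, the fixed random seed, and the 9000 integer stand-in records are assumptions made only for illustration.

```python
import random

def holdout_split(records, test_fraction=1 / 3, seed=0):
    """Shuffle a copy of the records and set aside test_fraction for testing.

    Returns (training_records, test_records). Hypothetical helper, not a
    MineSet function.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for a dataset: 9000 record identifiers (hypothetical size).
records = list(range(9000))
train, test = holdout_split(records)
print(len(train), len(test))
```

The classifier is induced from the training portion only, and the accuracy estimate comes from how well it predicts the records that were held out.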
Each mushroom has many characteristics, including cap-color, bruises, and odor. In the Decision Table Visualizer odor and stalk-shape appear at the top level. Note that odor alone does an excellent job of discriminating edibility. Only when there is no odor and the stalk-shape is “enlarging” is there any ambiguity. So naturally we drill down on this lone cake. Now we see just the records with these 2 values broken down by their values for bruises and gill-size. Notice the interaction between gill-size and bruises. This interaction is difficult to discern using any other classifier.
Since all the attributes in this dataset are nominal, all the values are sorted by how well they predict edibility. You might want to order the values alphabetically or by weight (prevalence). To do this, select the appropriate method from the nominal order menu. If you considered either bruises or gill_size alone you would not be able to predict large classes of completely edible or poisonous mushrooms, but by considering them together, we see that if there are no bruises and the gill-size is broad, then all 814 mushrooms of this type are edible. Conversely, if there are bruises and the gill-size is narrow, then all 11 mushrooms of this type are poisonous. To disambiguate the other two cases we would have to drill down further.
In the Decision Table Visualizer, move the % Weight Threshold slider to the right. Eventually those with musty odor will be deleted from the scene. The reason for this is that fewer than 1% of the records are labeled “odor=musty.”
This dataset consists of voting records. The goal is to identify the party to which a congress person belongs given data about key votes. The dataset includes votes for each member of the U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and announced for (these three are simplified to yes), voted against, paired against, and announced against (these three are simplified to no), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three are simplified to an unknown disposition).
Before running a classifier, look at the 16 votes to see if you can perceive which features are important. Then run the Decision Table Visualizer. For this dataset, you may wish to order the values alphabetically, so that all “yes” votes appear in the upper right and “no” votes appear in the lower left.
At the top level we see “synfuels corporation cutback” and “physician fee freeze.” There is a fascinating relationship between these variables that would be next to impossible to point out with any other model. All but three of those who voted against physicians were Democrats. Nearly every Democrat who voted against physicians also voted against the synfuels cutback (only 3 of 206 did not fit this pattern). Surprisingly, all but five Republicans who voted for the physicians voted against the synfuels cutback. This odd relationship between these very different issues hints that these bills may have been connected in ways that would require further investigation.
Most of the cakes at the top level are nearly pure except for the middle one (which contains only 6 records) and the one where the representatives voted yes on both issues. Here we can drill-down another level to discriminate the political affiliation of the representatives in this group. Doing so uses “anti-satellite test ban” and “adoption of the budget resolution” to further discriminate among the 55 representatives in this group.
The file vote-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on vote.schema.
The breast cancer dataset contains information about women undergoing breast cancer diagnosis. Each record represents a patient with attributes such as cell size, clump thickness, and marginal adhesion. The final attribute is whether the diagnosis is malignant or benign. The file breast-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on breast.schema.
Windows users find these files in the directory in which MineSet was installed under \examples\breast-dtab.dtableviz and \data\breast.schema
IRIX users find these files in /usr/lib/MineSet/examples/dtableviz/breast-dtab.dtableviz and /usr/lib/MineSet/data/breast.schema
In the Decision Table Visualizer, mitosis and uniformity of cell shape are shown at the top of the hierarchy. If both attributes have low values at the same time, the given sample is 99.2% likely to be benign. On the other hand, 100% of the training records that had high values for both were malignant.
Drill down on the four cakes that are not so pure. Now marginal adhesion and bare-nuclei are used to discriminate. There are far fewer records in each cake at this level; as a result there is more noise, and trends are more difficult to detect. High values for both marginal-adhesion and bare-nuclei seem to contribute to malignancy, but this is uncertain. Note that the first value of bare-nuclei is null. The distributions for these null cakes are more suspect than others, so you may wish to hide them by unchecking View > Show Nulls.
If you drill down globally two more levels, you can note a few interesting features. The cakes get very small, and there are large regions of the multi-dimensional space which are empty. There are a few tiny regions where many records are clustered. There is one huge spike (100% benign) where all the values are low. This spike alone accounts for about 20% of the data.
The hypothyroidism dataset is similar to the one for breast cancer. The file hypothyroid-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on hypothyroid.schema.
There are 3,163 records in this dataset and most of them do not have hypothyroidism (95.45%). This means that one can predict “negative” and be correct most of the time. However, we are worried about those people who have hypothyroidism but whom the model predicts to be healthy. These false negatives are very important.
This is a case where you might want to adjust the loss matrix to skew the posterior probability toward predicting hypothyroidism in order to avoid false negatives. There might be a high cost associated with predicting that someone is healthy when they actually have the disease; predicting them sick when they are actually healthy means they merely have to take a more accurate test or a treatment they do not need.
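As a rough illustration of how a loss matrix can shift a prediction, the sketch below picks the class with the smallest expected loss. The posterior probabilities, the cost of 10 for a false negative, and the function name min_expected_loss are all hypothetical; MineSet's own loss matrix handling may differ in detail.

```python
def min_expected_loss(posterior, loss):
    """Return (best_prediction, expected_losses).

    posterior: dict mapping actual class -> probability
    loss:      dict mapping (actual, predicted) -> cost
    Hypothetical helper for illustration only.
    """
    classes = list(posterior)
    expected = {
        predicted: sum(posterior[actual] * loss[(actual, predicted)]
                       for actual in classes)
        for predicted in classes
    }
    return min(expected, key=expected.get), expected

# Hypothetical posterior for one patient.
posterior = {"hypothyroid": 0.20, "negative": 0.80}

# Equal costs: every mistake costs 1.
equal_loss = {
    ("hypothyroid", "hypothyroid"): 0, ("hypothyroid", "negative"): 1,
    ("negative", "hypothyroid"): 1, ("negative", "negative"): 0,
}

# Skewed costs: a false negative is assumed to be 10 times worse.
skewed_loss = dict(equal_loss)
skewed_loss[("hypothyroid", "negative")] = 10

print(min_expected_loss(posterior, equal_loss)[0])   # negative
print(min_expected_loss(posterior, skewed_loss)[0])  # hypothyroid
```

With equal costs the majority class wins; once a missed diagnosis is made expensive, the same posterior leads to a prediction of hypothyroidism.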
Using the Decision Table Visualizer on this dataset, we note:
This dataset is a diagnosis problem for diabetes using statistics gathered from an Indian tribe in Phoenix, Arizona. The task is to determine whether a patient has diabetes, given some medical attributes, such as blood pressure, body mass, glucose level, and age.
The file pima-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on pima.schema.
Using the Decision Table Visualizer we note:
The file dna-dtab.dtableviz shows the structure of the Decision Table Classifier induced for this problem. This file was generated by running the inducer on dna.schema.
There are 3,186 records in this DNA dataset. The domain is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which “superfluous” DNA is removed during protein creation. The task is to recognize exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE sites; or neither. The IE borders are referred to as “acceptors” and the EI borders are “donors.” The records were originally taken from GenBank 64.1 (genbank.bio.net). The attributes provide a window of 60 nucleotides. The classification is the middle point of the window, thus providing 30 nucleotides at each side of the junction.
From the Decision Table Visualizer, you can see a surprising pattern not nearly as evident in any other classifier model. At the top level there is a pronounced interaction between left_01 and right_02. Exon/intron is only present if right_02 = T. Intron/exon is only present if left_01 = G. For other values of left_01 and right_02 there are very few splice junctions.
Drill down globally to the next level (left_02 and right_01). Among the records where right_02 = T and left_01 = G we see a pattern which is consistent with the patterns along each edge.
The following examples show cases in which classifiers might be useful. Each of these examples is associated with a sample dataset provided with MineSet. By running the inducer, you can generate the .eviviz files described below.
The data files can be loaded into MineSet by opening the corresponding .schema file from the data directory (for example, churn.schema). The classifier visualization files, which have a .eviviz extension, can be opened from the examples directory.
Churn is when a customer leaves one company for another. This example shows what causes customer churn for a telephone company.
The files churn.schema and churn.data were used to generate this example. To load a data file into MineSet, open the .schema file.
The file churn.eviviz shows the structure of the classifier induced using the attribute churned as the label. The error rate for this classifier is 12%. 14.1% of the records represent customers who churned. The two most important attributes, total day minutes and total day charge, are clearly correlated. If you run the inducer after selecting Automatic Feature Selection from the Further Inducer Options, the error-rate drops to 10.5% using only 4 attributes (total day charge, number of service calls, voice mail plan, and number of voice mail messages). All 29 customers who had a total day charge above 53.78 churned.
A high number of customer service calls is a predictor of churn. Many customer service calls might indicate frustration in using complicated equipment or receiving unreliable service. Customers with the International plan are also more likely to churn. The people in some states were much more likely to churn than those in others; for example, California and New Jersey have the most churn, Virginia the least. To see just those states that have more than 2% of the total number of records, slide the % Weights Threshold slider all the way to the right. This eliminates most of the values for state from the display. If you also select Nominal Order > Weight, then the state with the most records, West Virginia (WV), is left-most. Many of the attributes (at the bottom of the list) are not useful in discriminating churn. Note that day charge is a great predictor, but night charge is not.
The cars dataset contains information about different models of cars from the 1970s and early 1980s. Attributes include weight, acceleration, and miles per gallon (mpg). The file cars.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on cars.schema with the label set to origin (Japan, U.S., Europe) and the cylinders column changed to type string. The cylinders were changed to type string in order to see all values and avoid the automatic discretization.
Windows users find these files in the directory in which MineSet was installed, under \examples\cars.eviviz and \data\cars.schema
IRIX users find these files in /usr/lib/MineSet/examples/eviviz/cars.eviviz and /usr/lib/MineSet/data/cars.schema
If you have a dataset of car attributes, you might want to know what characterizes cars of different origins. From the distribution of label values in the pie on the right we can see that most cars in this dataset were made in the U.S. (62.5%) and a smaller number in Japan (20.2%) and Europe (17.3%). Clearly brand is the best predictor of origin, since each brand is associated with only one country of origin. For this reason, it has the highest importance and is at the top of the list. By looking at the height of the pies, it can be seen that many cars have four cylinders, most weigh less than 3000 lbs, and most can reach 60 miles per hour in less than 20 seconds but more than 13.
Look at the distribution of slices for individual attribute values. If a car has an engine size >169 cubic inches, it is almost certainly made in the U.S.; it certainly was not made in Japan. Other pies show that U.S. cars generally have six or eight cylinders, low miles per gallon, high horsepower (over 134), heavy weight (over 2981 lbs), and fast acceleration. Japanese cars have better gas mileage, three or four cylinders (and a few six cylinders), and smaller engines. If you click “Europe” in the Label Probability Pane, you can see bars representing evidence for a car being European. For example, five cylinders strongly indicates that a car is European. The height of the corresponding pie, however, shows that there were only three cars with five cylinders in the data. If a car's mileage is good, there is much evidence for it being European. If a car's mileage is better than 41 mpg, then there is an 83% chance that it's European. If a car is European, there is only a 10.4% chance that its mileage is better than 41 mpg. But only 2% of Japanese cars (and no U.S. cars) have mpg in this range, so Europe gets the most evidence.
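The relationship between these conditional probabilities and the resulting chance of a car being European can be checked with a few lines of arithmetic. The sketch below applies Bayes' rule to the percentages quoted in this example (the priors 62.5%, 20.2%, and 17.3%, and the shares of cars with mileage above 41 mpg); the function name posterior is an assumption, and the code is only a worked calculation, not MineSet's implementation.

```python
def posterior(priors, likelihoods):
    """Combine prior and conditional probabilities with Bayes' rule.

    priors:      dict class -> P(class)
    likelihoods: dict class -> P(attribute value | class)
    Returns dict class -> P(class | attribute value).
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

# Percentages quoted in the text for the cars dataset.
priors = {"U.S.": 0.625, "Japan": 0.202, "Europe": 0.173}
# Share of each origin's cars with mileage above 41 mpg.
likelihoods = {"U.S.": 0.0, "Japan": 0.02, "Europe": 0.104}

result = posterior(priors, likelihoods)
for origin, p in result.items():
    print(f"{origin}: {p:.1%}")
```

With these rounded inputs the European share comes out at roughly 82%, close to the 83% shown by the visualizer; the small gap is presumably due to rounding in the quoted percentages. Although only about one European car in ten has such high mileage, the value is even rarer for the other origins, which is why Europe receives most of the evidence.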
Suppose you wanted to predict where a car came from knowing only that it got 40 mpg and weighed 3000 lbs. Select the appropriate pies (or bars): mpg=30.95-41.15 and weightlbs=2981.5+. The resulting probability distribution on the right shows 84% U.S., 16% European. There is no possibility it is Japanese because there were no Japanese cars in the training set with weightlbs>2981.5. If you run the inducer again with Laplace correction turned on (with a value of 0.5), you get a different answer: 16% chance for European, 82% chance for U.S., and a 2% chance for Japanese. This is because Laplace correction prevents any slice in the cake charts from going completely to zero. Certainly, there is no fundamental reason why the Japanese could not make a car that weighs more than 2981 lbs; hence, when the probabilities (pies) are multiplied together, the possibility of predicting a Japanese car is not eliminated.
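The effect of the Laplace correction can be sketched with a small calculation. The formula below, which adds the correction value to each count and a matching amount to the class total, is one common form of the correction and is an assumption here; the exact formula MineSet uses is not stated in this section. The counts (80 Japanese cars spread over three weight bins) are hypothetical.

```python
def conditional_prob(count, class_total, n_values, laplace=0.0):
    """Estimate P(attribute value | class) with an optional Laplace correction.

    count:       records of this class having the attribute value
    class_total: all records of this class
    n_values:    number of distinct values (bins) of the attribute
    """
    return (count + laplace) / (class_total + laplace * n_values)

# Hypothetical counts of Japanese cars in three weight bins;
# none falls in the heaviest bin.
weight_counts = [50, 30, 0]
total = sum(weight_counts)

uncorrected = [conditional_prob(c, total, len(weight_counts))
               for c in weight_counts]
corrected = [conditional_prob(c, total, len(weight_counts), laplace=0.5)
             for c in weight_counts]

print(uncorrected[-1])  # 0.0, so any product including it is also zero
print(corrected[-1])    # small but nonzero
```

Because the evidence from each attribute is multiplied together, a single zero estimate rules a class out entirely. The corrected estimate stays slightly above zero, so the class remains possible, only unlikely.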
The adult dataset contains information about working adults. This dataset was extracted from the U.S. Census Bureau. It contains data about people older than 16, with a gross income of more than $100 per year who work at least one hour a week. You might want to know how to characterize males and females. The file adult-sex.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on adult.schema, with the label set to sex, after removing the relationship column (which would have made the classifier trivial).
Windows users find these files in the directory in which MineSet was installed, under \examples\adult-sex.eviviz and \data\adult-sex.schema
IRIX users find these files in /usr/lib/MineSet/examples/eviviz/adult-sex.eviviz and /usr/lib/MineSet/data/adult.schema
In the Evidence Visualizer, the Label Probability Pane shows that the prior probability of working males is higher than that of females.
Marital status is the most important predictor of gender. If a worker is a married-civilian-spouse there is a greater probability of being male. A worker who is widowed and working, however, is much more likely to be female.
The second attribute listed shows occupation. Study this to learn which occupations are popular with a particular gender. The various occupations are listed from left to right in order of decreasing male dominance: Armed forces (100%), Craft-repair (95%), Transport-moving (95%), and Farming-fishing (94%). Female trades are Private-house-service (94%) and Adm-clerical (67%). By clicking on the button next to “Female” in the Label Probability Pane, and then moving the mouse over occupation = Adm-clerical, one can see that 23% of females have an Adm-clerical job. Conversely, given that one's job is Adm-clerical, there is a 67% chance that the gender is Female.
Suppose you wanted to find out the probability of being female given that a person is widowed and has occupation = Adm-clerical. This can be done by clicking on the values and reading 95% from the text at the top when you move the mouse over the box next to “Female” (in select mode).
If the working class is either self-employed-incorporated or self-employed-not-incorporated, the probability that the person is a male is higher. Conversely, if the working class is state-gov, the conditional probability that the person is a female is higher, but the posterior probability (after taking into account the prior probability) is not higher (click it and look at the posterior probability on the right). The size of the female slice increased by selecting state-gov, but not so much that it would lead you to predict that a person was female, given only that they worked for the state.
By rotating the view, you can see that most people work in private industry by looking at the height of the charts.
By looking at the gross-income attribute, you can see that the higher the income range, the higher the probability of being male.
Education generally does not indicate much about gender, except for doctorate degrees, where you are more likely to find males.
Different occupations have different distributions for males and females.
The race attribute shows that, in the conditional probabilities, the percentage of working females is higher among African-Americans than among other races. Click the value to see that the posterior is about equal between males and females.
Males in this dataset work more hours per week than do females.
If you have a dataset of working adults, you might want to find out what factors affect salary. First bin gross_income into five bins, with thresholds at 10,000, 20,000, 30,000, and 60,000. Each record then has an attribute with one of five values. You can run a MineSet classifier to help determine what factors influence salary. The file adult-salary.eviviz shows the Evidence classifier induced for this problem. This file was generated by running the inducer on adult.schema with gross_income divided into five bins using user-specified thresholds.
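The five-bin discretization described above can be sketched as follows. The thresholds are those given in the text; the function names, the label format, and the choice to place a value that equals a threshold in the upper bin are assumptions for illustration and may not match how MineSet treats boundary values.

```python
from bisect import bisect_right

# Thresholds from the text; four thresholds yield five bins.
THRESHOLDS = [10_000, 20_000, 30_000, 60_000]

def income_bin(gross_income, thresholds=THRESHOLDS):
    """Return the bin index (0 to 4) for a gross income value.

    Assumption: a value exactly on a threshold falls in the upper bin.
    """
    return bisect_right(thresholds, gross_income)

def bin_label(index, thresholds=THRESHOLDS):
    """Build a readable label for a bin index (hypothetical label format)."""
    if index == 0:
        return f"-{thresholds[0]}"
    if index == len(thresholds):
        return f"{thresholds[-1]}+"
    return f"{thresholds[index - 1]}-{thresholds[index]}"

for income in (5_000, 25_000, 75_000):
    print(income, bin_label(income_bin(income)))
```

Each record then carries one of the five bin labels in place of its numeric income, and that label serves as the class for the classifier.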
The attributes in the Evidence Visualizer are ranked by importance; thus, relationship, marital status, age, occupation, education, hours per week, and sex are considered most important. Since the label is numeric, a continuous spectrum is used to assign colors to each class. Red is assigned to the highest bin (60,000+). The class labels are listed in the Label Probability Pane according to slice size. As you click on values in the Main Window, the order of the class labels changes to keep the label for the largest predicted class at the top.
Relationship shows that husbands and wives are likely to make more money than unmarried workers or workers not in a family. Wives have slightly higher incomes than husbands.
Marital status shows that most people are married (the second chart from the left is tall). Married workers earn more money than unmarried people.
Age is a crucial factor. Until the age of 61, when many people retire, the probability of making over $60,000 increases as workers get older.
Different occupations yield different probabilities. Executive and professional jobs raise the evidence for making over $60,000 per year.
Education is an important factor. When considering just education, the highest evidence for earning over $60,000 is given to workers whose educational level includes a masters or doctoral degree, or matriculation from professional schools.
Hours per week show that the more hours worked, the higher the evidence for earning more money.
Sex shows that being a female gives evidence for making less than $60,000 per year.
Adjust the Percent Weights slider to remove values of native_country, education, and occupation that have low weights.
In this dataset, each record describes four characteristics of iris flowers: petal width, petal length, sepal width, and sepal length. Each iris was further classified into the types iris-setosa, iris-versicolor, or iris-virginica. The goal is to understand what characterizes each iris type.
Before running a classifier, click the Column Importance tab in the Tool Manager's Classifiers tab; then click Go. You obtain a ranking of the importance of the features: petal width, petal length, and sepal length. You can map these to the axes in the Scatter Visualizer, with the iris_type mapped to the color, and see a natural clustering.
The file iris.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on iris.schema.
In the Evidence Visualizer, we can see that petal length and petal width are excellent discriminatory attributes, while sepal length and sepal width are not as good. Move the importance threshold slider to the right to see that the sepal-based attributes disappear first.
The file mushroom.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on mushroom.schema.
The goal is to understand which mushrooms are edible and which ones are poisonous, given this dataset. There are over 8000 records in this set; thus, running this inducer might take several seconds. Note that under the default mode of the one-third holdout for accuracy estimation, a third of the records are kept for testing.
Each mushroom has many characteristics, including cap color, bruises, and odor. By default the Evidence Visualizer orders attributes by importance (that is, usefulness in predicting the label). Odor and spore print color appear at the top of the list because the distributions in the cake charts are most different from value to value for these attributes. Since all the attributes in this dataset are nominal, all the values are sorted from left to right by how well they predict edibility. You might want to order the values alphabetically or by weight (prevalence). To do this, select the appropriate method from the nominal order menu. You can see a characterization of poisonous mushrooms by changing the pointer to an arrow (click the arrow icon at the top right of the main screen or press the Esc key), then clicking the button by that class label in the right pane. High bars are associated with values that indicate the mushrooms are poisonous.
In the Evidence Visualizer, move the Detail slider to the right. The attributes with the lowest importance are removed from the scene. The most important attribute by far is odor, as its importance is 92; all other attributes have importance less than 48. Almost all values are good discriminators, but if there is no odor (none), then there is a mix of both classes. The Evidence Visualizer lets you see specific values that might be critical, even if the attribute itself is not always important. For example, stalk_color_below_ring is not a good discriminatory attribute because most of the time it takes on the value white. White offers no predictive power because there are equal amounts of edible and poisonous mushrooms with this value. When stalk_color_below_ring takes the value gray or buff, it provides excellent discrimination, but there are very few mushrooms with these values.
This dataset consists of voting records. The goal is to identify the party a member of Congress belongs to, given data about key votes. The dataset includes votes for each member of the U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine types of votes: voted for, paired for, and announced for (these three are simplified to yes), voted against, paired against, and announced against (these three are simplified to no), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three are simplified to an unknown disposition).
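The nine-to-three simplification described above can be sketched as a lookup table. This is an illustration only: the string labels and the simplify_vote helper are assumptions, not names used by MineSet or by the sample data file.

```python
# Hypothetical labels for the nine CQA vote types; the actual encoding in
# the sample data file may differ.
CQA_TO_SIMPLIFIED = {
    "voted for": "yes",
    "paired for": "yes",
    "announced for": "yes",
    "voted against": "no",
    "paired against": "no",
    "announced against": "no",
    "voted present": "unknown",
    "voted present to avoid conflict of interest": "unknown",
    "did not vote": "unknown",
}


def simplify_vote(cqa_vote: str) -> str:
    """Collapse one of the nine CQA vote types into yes, no, or unknown."""
    return CQA_TO_SIMPLIFIED[cqa_vote.strip().lower()]


if __name__ == "__main__":
    for vote in ("Paired for", "announced against", "Voted present"):
        print(vote, "->", simplify_vote(vote))
```

Each of the three simplified values is reached from exactly three of the original vote types.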
Before running a classifier, look at the 16 votes to see if you can perceive which features are important. Then run the Evidence Visualizer. For this dataset, you might want to order the values alphabetically, so that all no votes are on the left, undecided is in the middle, and yes is on the right.
Some issues clearly define one's party affiliation. Republicans tended to vote for a physician fee freeze and aid for El Salvador, while Democrats voted for adoption of a budget resolution and aid to the Contras in Nicaragua.
Immigration was an issue not split along party lines; nevertheless, politicians had strong positions on it: only 7 out of the 435 were undecided on this issue.
The file vote.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on vote.schema.
The breast cancer dataset contains information about women undergoing breast cancer diagnosis. Each record represents a patient with attributes such as cell size, clump thickness, and marginal adhesion. The final attribute is whether the diagnosis is malignant or benign. The file breast.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on breast.schema.
In the Evidence Visualizer, you can see that sample_code_number was discretized into one range that is equally split, meaning that it does not indicate whether the breast cancer is benign or malignant.
The hypothyroidism dataset is similar to the one for breast cancer. The file hypothyroid.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on hypothyroid.schema.
There are 3,163 records in this dataset, and most of them (95.45%) do not have hypothyroidism. This means that one can predict “negative” and be correct most of the time. However, the concern is people who have hypothyroidism but whom the model predicts to be healthy. These false negatives are very important.
Look at the cake chart for tsh between 6.35 and 27.5. It shows much evidence for hypothyroidism. When you click on it, however, the posterior probability pie on the right still predicts “negative” because the prior probability for “negative” was so great.
This is a case where you might want to adjust the Loss Matrix to skew the posterior probability toward predicting hypothyroidism in order to avoid false negatives. There might be a high cost associated with predicting that someone is healthy when they actually have the disease; predicting them sick when they are actually healthy means they take a more accurate test or a treatment they do not need.
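The effect of a loss matrix on the predicted class can be illustrated with a small expected-loss calculation. This is a sketch under stated assumptions: the posterior probabilities, the cost of 100 for a false negative, and the function names are all hypothetical, and MineSet's internal computation is not shown here.

```python
def expected_losses(posterior, loss):
    """Return the expected loss of each possible prediction.

    posterior: dict mapping actual class -> probability
    loss: dict mapping (predicted, actual) -> cost
    """
    return {
        predicted: sum(posterior[actual] * loss[(predicted, actual)]
                       for actual in posterior)
        for predicted in posterior
    }


def best_prediction(posterior, loss):
    """Pick the prediction with the smallest expected loss."""
    losses = expected_losses(posterior, loss)
    return min(losses, key=losses.get)


# Hypothetical posterior after clicking a chart: still mostly "negative".
posterior = {"negative": 0.70, "hypothyroid": 0.30}

# 0/1 loss: every mistake costs the same.
zero_one = {
    ("negative", "negative"): 0, ("negative", "hypothyroid"): 1,
    ("hypothyroid", "negative"): 1, ("hypothyroid", "hypothyroid"): 0,
}

# Skewed loss: a false negative is assumed to cost 100 times more.
skewed = dict(zero_one)
skewed[("negative", "hypothyroid")] = 100

print(best_prediction(posterior, zero_one))  # negative
print(best_prediction(posterior, skewed))    # hypothyroid
```

With equal costs the larger posterior wins; once a false negative is penalized heavily, predicting hypothyroidism has the lower expected loss even though its probability is smaller.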
In the Evidence Visualizer, you can see that fti is very important. The first two ranges (besides the unknown) give a lot of evidence for hypothyroidism.
This dataset is a diagnosis problem for diabetes using statistics gathered from an Indian tribe in Phoenix, Arizona. The task is to determine whether a patient has diabetes, given some medical attributes, such as blood pressure, body mass, glucose level, and age.
The file pima.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on pima.schema.
In the Evidence Visualizer, you can see that many attributes are irrelevant by themselves. As plasma_glucose increases, the probability of having diabetes increases. The number of pregnancies is also a good indicator when it is high (above 6), as is age (above 27).
The file dna.eviviz shows the structure of the Evidence Classifier induced for this problem. This file was generated by running the inducer on dna.schema.
There are 3,186 records in this DNA dataset. The domain is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which “superfluous” DNA is removed during protein creation. The task is to recognize exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE sites; or neither. The IE borders are referred to as “acceptors” and the EI borders are “donors.” The records were originally taken from GenBank 64.1 (genbank.bio.net). The attributes provide a window of 60 nucleotides. The classification is the middle point of the window, thus providing 30 nucleotides at each side of the junction.
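The 60-nucleotide window can be pictured as 30 attributes on each side of a candidate junction. The sketch below is illustrative only: the attribute names follow the left_01 style used in the Evidence Visualizer, but the right_NN names, the indexing convention, and the example sequence are assumptions.

```python
def window_attributes(sequence: str, junction: int, half_width: int = 30) -> dict:
    """Return the nucleotides on each side of a candidate junction.

    `junction` is the index of the first nucleotide to the right of the
    junction, so left_01 is sequence[junction - 1] and right_01 is
    sequence[junction].
    """
    if junction < half_width or junction + half_width > len(sequence):
        raise ValueError("window does not fit inside the sequence")
    attrs = {}
    for k in range(1, half_width + 1):
        attrs[f"left_{k:02d}"] = sequence[junction - k]
        attrs[f"right_{k:02d}"] = sequence[junction + k - 1]
    return attrs


# Made-up sequence, used only to show the indexing.
sequence = "C" * 28 + "AG" + "GT" + "A" * 28
attrs = window_attributes(sequence, junction=30)
print(len(attrs))                          # 60
print(attrs["left_02"], attrs["left_01"])  # A G
```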
From the Evidence Visualizer, you can see that attributes near the center are chosen as very important. Attributes further away from the splice junction are less important.
If you click and select the charts in the left pane corresponding to “left_01: G” and “left_02: A”, then the pie chart in the label probability pane on the right will change to show the probability distribution of each class as predicted by the evidence classifier. Given these two values, the pie chart shows that the evidence model built assigns the highest probability to “intron/exon”, followed by “exon/intron” and “none”.
The accuracy improves slightly if you invoke automatic feature selection, although running time increases dramatically (sometimes hours). In such cases, run feature selection once, and continue mining only with the chosen features.
The provided sample configuration and data files demonstrate the Map Visualizer's features and capabilities.
Windows users find these files in the directory in which MineSet was installed, under \examples\mapviz. The .gfx and .hierarchy files can be found in \config\mapviz.
IRIX users find these files in /usr/lib/MineSet/examples/mapviz. The .hierarchy and .gfx can be found in /usr/lib/MineSet/mapviz/gfx_files.
blocks.mapviz, blocks.data, blocks.gfx, and blocks.hierarchy
This simple example shows four adjacent blocks. The height and color of each block vary based on the underlying data in blocks.data. You can drill up using the middle mouse button (see the section) to see the upper pair and the lower pair of blocks aggregate; then drill up again to see these upper and lower blocks aggregate into a single block. You can drill down using the right mouse button to see the objects of finer granularity reappear.
population.australia.mapviz, population.australia.data, australia.states.gfx, and australia.states.hierarchy
The data file contains one row for each Australian state and territory. Each row contains three tab-separated items: a keyword name for the state or territory, the population value, and the size of the territory.
This sample graphically displays the 1991 population and population density of the Australian states and territories. Heights of the graphical objects represent the relative population; color represents the relative population density. A legend at the bottom of the display describes the color range and the associated values.
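As a rough illustration of how the three tab-separated columns relate to the two displayed quantities, the following sketch parses rows of that shape and derives a density. The rows are placeholders rather than the contents of population.australia.data, and the assumption that density is population divided by size is ours; the actual mapping is defined by the .mapviz configuration file.

```python
import csv
import io

# Placeholder rows, not the contents of population.australia.data.
SAMPLE = "state_a\t1000000\t250000\nstate_b\t300000\t60000\n"


def read_density(text: str) -> dict:
    """Parse name, population, and area columns and compute density."""
    densities = {}
    for name, population, area in csv.reader(io.StringIO(text), delimiter="\t"):
        densities[name] = float(population) / float(area)
    return densities


print(read_density(SAMPLE))  # {'state_a': 4.0, 'state_b': 5.0}
```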
population.canada.mapviz, population.canada.data, canada.provinces.gfx, and canada.provinces.hierarchy
The data file contains one row for each Canadian province and territory. In this example, each row contains 13 blank-separated values (one for each decade between 1871 and 1991).
This sample graphically displays the population and population density of the Canadian provinces and territories from 1871 to 1991, in 10-year increments. The animation control panel lets you dynamically view the datasets across a range of time. Animation operation is explained in “Animation Control Panel” in Chapter 1.
population.europe.mapviz, population.europe.data, europe.countries.hierarchy, and europe.countries.gfx
When graphically displayed, this shows the 1992 population and population density of countries in Western and Central Europe.
population.usa.mapviz, population.usa.data, usa.state.gfx, and usa.state.hierarchy
When graphically displayed, this shows the population and population density of the United States from 1770 to 1990. The animation controls let you dynamically view population and density changes across time.
population.usa.city.mapviz, population.usa.city.data, usa.state.gfx, usa.state.hierarchy, usa.city.gfx, and usa.city.hierarchy
The usa.state.gfx file specifies the United States, which is displayed as a background. The usa.city.gfx file specifies the location of the cities on this background. The .data file specifies the population of each city.
This sample graphically displays the population of the 48 largest U.S. cities from 1950 to 1990. No data has been mapped to the colors. The animation controls let you dynamically view changes across time.
perhouse.perage.mapviz, perhouse.perage.data, usa.state.gfx, and usa.state.hierarchy
This sample graphically displays consumer household spending data from July-August 1988 to May-June 1991. Color is mapped to the gender of the spending household member; height represents the average dollar amount spent per household for a given time period and age group. This data has two independent dimensions: time and age. The highest spending is indicated in the summary window by the areas with the greatest color density, namely “May-June 1989 (Age: 30-39)” and “May-June 1990 (Age: 30-39).”
telecom.mapviz, telecom.data, usa.city.lines.gfx, usa.city.lines.hierarchy, usa.state.gfx, and usa.state.hierarchy
This sample graphically displays a flat map with arched lines on it. These lines connect two endpoints. The lines can have variable width and color. In this example, the widths and colors are random; however, they could relate to the volume and duration of the connections between the endpoints.
fasta.m.data, fasta.m.mapviz, fasta.m.gfx, and fasta.m.hierarchy
The data file for this example contains the partial results of a full biological sequence comparison between two complete genomes (courtesy of Dr. Tom Flores, European Bioinformatics Institute). When graphically displayed, scientists can quickly identify and locate the regions of similarity between the two genomes. The ability to display such large amounts of information in a visual data exploration method such as this could be extended to include much more information about the individual genomes. Scientists could explore this data more easily and thereby perhaps better understand the function and purpose of the similar genetic sequences.
In this example, the “map” is the circular-shaped genome of a biological organism called Mycoplasma genitalium (MG). The MG genome is divided into 500 equal segments, each representing a 1000-nucleotide sequence in the genome. The slider selects one of the segments of the second genome, called Haemophilus influenzae (HI), for cross-comparison between the two genomes. The Summary Window in the Animation Control Panel indicates which segments show the greatest similarities, and you can move the slider to examine those particular segments of interest. The bar heights and colors on the “map” therefore indicate the relative similarity of each MG segment to each HI segment, where higher bars correspond to greater measures of similarity. This similarity is measured by the “Reciprocal Evalues,” which ranges from 0.0 to 1.0.
The following examples show cases in which the Option Tree inducer can be useful. Each of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -odt.treeviz files described below. The scenario and goal for each task are described in Tree Visualizer Sample Files. Here we describe the specific advantages and disadvantages of Option Trees for several of the example datasets.
| Note: The data files, which have a .schema extension, are located in the data directory on the client workstation. The classifier visualization files, which have a -odt.treeviz extension, reside on the client workstation in the examples directory. To load a data file into MineSet, open the .schema file. |
Windows users find these files in the directory in which MineSet was installed, under \examples and \data.
IRIX users find these files in /usr/lib/MineSet/examples/treeviz and /usr/lib/MineSet/data.
The Option Tree for this dataset shows that total day charge, total day minutes, and customer service calls are all good attributes for the root: they all have approximately the same estimated error rate. You can choose to fly down to one subtree or another, based on your preferences and understanding of the data. Note that while the right subtree starts with customer service calls, the second test is on total day charge or total day minutes (as in the root's left option). However, because a split already occurred on an attribute, the thresholds are different.
The Option Tree for this dataset shows several good attributes for the root, including: cubic inches, cylinders, weight lbs, mpg, and brand. Note that the root has a lower estimated error rate than any of the children.
This is an example where Option Trees seem to be performing worse than Decision Trees. The root for the Decision Tree shows 6% error and the root for the Option Tree shows 8% error, so it seems that Option Trees perform worse. However:
The standard deviation of the error estimate is fairly high: 3.88% and 3.39%. A rule of thumb in statistics is that if the difference is less than two standard deviations, the difference is not statistically significant at the 95% confidence level. A difference of 2% is not larger than even a single standard deviation; hence, the classifier error rates are probably not statistically different at the 95% confidence level.
For small files (Iris has 150 records), different random seeds give different results. For example, changing the seed to 3 improves the Option Tree classifier's error from 8% to 4% without changing the Decision Tree classifier's error rate (remember to reset the seed). This does not imply that a more accurate classifier has been generated, rather that the error estimate is not stable. Because only 50 records are used for testing, each mistake is 2%. The difference between 4% and 8% is making two more mistakes.
For small files (Iris has 150 records), use the “Estimate Error” option in MineSet. It results in better estimates that have narrower confidence intervals. When you run this mode, the status window shows that the Decision Tree classifier has an estimated error of 4.67% +/- 1.73%, and the Option Tree classifier has an estimated error of 4.00% +/- 1.61%. The difference is not significant in this case either, but the Option Tree is slightly superior.
Even if the error rate is higher for Option Trees, they might be (and usually are) better at assigning probability estimates. For this dataset, the estimated mean squared error for Decision Trees is 3.94; for Option Trees it is 3.67 (although the difference is not significant at the 95% confidence level).
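The rule of thumb used in the points above can be checked with a few lines of arithmetic. The sketch below assumes the standard deviation of an error estimate is sqrt(p(1 - p) / (n - 1)); this formula is an assumption that happens to reproduce the 3.88% and 3.39% figures for 50 test records, not a documented description of MineSet's computation. The function names are hypothetical.

```python
import math


def error_std(error_rate: float, n_test: int) -> float:
    """Standard deviation of an error estimate from n_test test records.

    Assumes the binomial formula sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(error_rate * (1.0 - error_rate) / (n_test - 1))


def differs_by_rule_of_thumb(err_a, err_b, n_test):
    """Apply the two-standard-deviation rule of thumb from the text."""
    widest = max(error_std(err_a, n_test), error_std(err_b, n_test))
    return abs(err_a - err_b) > 2.0 * widest


n_test = 50  # one third of the 150 Iris records
print(round(100 * error_std(0.08, n_test), 2))  # 3.88
print(round(100 * error_std(0.06, n_test), 2))  # 3.39
print(differs_by_rule_of_thumb(0.06, 0.08, n_test))  # False
```

With only 50 test records, one misclassified record moves the error rate by 1/50 = 2%, which is why a gap of two or even four percentage points stays well inside two standard deviations.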
The Option Tree for this dataset shows that all five options chosen at the root have zero error rate estimates. Looking at the result, you might prefer the left option (bruises) because it is as accurate but is easier to measure than odor (the root test of the induced Decision Tree). You might want to remove odor and gill size, then build a regular Decision Tree that turns out to be just as accurate (0% estimated error rate).
Note, however, that removal of a root option to have a sibling option selected by the Decision Tree might not necessarily result in the same accurate classifier that is shown in the Option Tree. The removed attribute might have been used lower down in the tree. For example, removing brand from the cars dataset significantly increases the error rate, even though four out of five options do not use it at the root.
This dataset behaves very similarly to the Iris dataset. The Option Tree has the same error rate as the Decision Tree. Under “Estimate error,” the cross-validated estimate shows that it is slightly better than the Decision Tree (but not significantly so at the 95% level) both on error rate and on mean squared error.
The error rate for Option Trees is slightly lower than that for Decision Trees, both for Classifier & Error and for Estimate Error; however, the difference is not significant (at 95%).
The error rates for this dataset are very low (less than 1%), but this is because most people who were tested for hypothyroid (95%) did not suffer from it. If we use a loss matrix that attempts to avoid false negatives (by penalizing by 100 a prediction of negative when the actual value is hypothyroid), we can see that the loss for Option Trees is significantly lower than that of Decision Trees: 182 versus 523 (total), or 0.17 versus 0.5 (per record). This difference is significant at the 95% confidence level.
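The per-record figures follow from dividing the total loss by the number of test records. The sketch below assumes the default one-third holdout, which gives roughly 1,054 test records out of 3,163; that test-set size is an inference, not a number stated for this run, and per_record_loss is a hypothetical helper.

```python
def per_record_loss(total_loss: float, n_test: int) -> float:
    """Average loss per test record."""
    return total_loss / n_test


n_records = 3163               # dataset size quoted for the hypothyroid data
n_test = round(n_records / 3)  # assumption: one-third holdout, 1054 records

print(round(per_record_loss(182, n_test), 2))  # 0.17
print(round(per_record_loss(523, n_test), 2))  # 0.5
```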
For this dataset, the Option Tree is slightly more accurate than the Decision Tree; however, looking at the root options, you might notice that it chooses left 1,2, and right 1,2,5. Given the background knowledge that attributes closer to the boundary can be more important, you might want to exclude the option split on right 5. After updating the maximum number of root options to 4 (down from 5), the error rate increases from 5.65% to 6.59%. This might be surprising, given that the root no longer uses right 5 as an option; another effect of changing the number of root options from 5 to 4 was to also reduce the number of options that appear further down the tree (because of the decrease parameter). This caused the individual error rates for each of the other 4 subtrees to increase. Still, the option tree's error rate is significantly better (at the 95% confidence level) than the Decision Tree error rate of 7.06% +/- 0.79%.
The following examples show cases in which regression might be useful and highlight some of the capabilities of the Regression Tree Inducer. Each of these examples is associated with a sample data file provided with MineSet. By running the inducer, you can generate the -rt.treeviz files described below.
| Note: The data files, which have a .schema extension, are located in the data directory on the client workstation. The regressor visualization files, which have a -rt.treeviz extension, reside on the client workstation in the examples directory. To load a data file into MineSet, open the .schema file. |
Windows users find these files in the directory in which MineSet was installed, under \examples and \data.
IRIX users find these files in /usr/lib/MineSet/examples/treeviz and /usr/lib/MineSet/data.
The churn dataset contains generated information on the calling patterns of a telecommunication company's customers. In the classification examples, this dataset is used to determine which factors lead a customer to churn, or leave the company for one of its competitors. In this regression example, we will try to determine what factors influence how much the company charges each customer per day.
The file churn-rt.treeviz shows the Regression Tree generated on this data set to predict the total day charge. Interestingly, the tree branches on only one attribute throughout, total day minutes, dividing this attribute into progressively smaller ranges. This is because total day minutes is directly proportional to total day charge: customers are charged only for the minutes they use the system. The Regression Tree is able to adapt to this fact.
The cars dataset contains information about different models of cars from the 1970s and the early 1980s. Attributes in this data set include weight, acceleration and miles per gallon. The file cars-rt.treeviz shows the Regression Tree regressor induced on this data set, using miles per gallon as the continuous label.
By clicking on the top node, we see that the average mpg of cars in this dataset is around 23.5. The first split in the Regression Tree for this dataset shows that the most important factor contributing to the mileage of a car is its weight. The Regression Tree has uncovered the well-known fact that heavier cars get lower mileage. By looking at the two children of the base node, we note that the right child is bluer than the left one; that is, it gets fewer miles per gallon. By highlighting the nodes, we see that cars that weigh less than 3018 lbs. get around 28.3 mpg, while cars weighing more get around 16.6 mpg.
Heading over to the heavier cars, we see that the next split is on the horsepower of the car, and that more powerful cars tend to get lower mileage. The split at the next level is on the year the car was made, with newer cars getting better mileage. Now, let's look for an unusual car. Using the filter panel, let's try to find a node with a mean mpg < 24 but with a maximum > 30. Applying this filter, we quickly reduce the tree to one node: cars weighing less than 3018 lbs, with more than 77 horsepower, and made before 1980. In this category, there is an unusual car; by selecting the rightmost bar on that node and drilling through via the Selections > Show Original Data menu item, we see that this car is a 1978 Dodge, weighing around 2000 lbs, with 83 hp, but getting a high 33.5 miles per gallon.
The adult dataset contains information about working adults, extracted from the U.S. Census Bureau. It contains data about people older than 16, with a gross income of more than $100 per year, who work at least one hour a week. We can use the Regression Tree Inducer to determine which factors influence a person's salary, as well as to give a rough prediction of what that person's salary would be, given the other information.
The file adult-rt.treeviz shows the Regression Tree regressor induced on the adult dataset, using gross income as the continuous label. Note that this data set is large (around 50000 records), and therefore inducing the regressor on this data set may take a few minutes on your workstation.
The bars at the top node provide a histogram of salary values in the Census Bureau's data. Note that the amount of data available decreases as the salary level increases. We have a lot of data for people earning around $3,000 a year, but less so as that figure increases. This trend is reversed in the last part of the histogram that indicates a sizeable amount of data on people earning roughly $100,000 per year. This discrepancy might be the result of either a genuine trend in the data, or a biased sampling.
The first division in the Regression Tree is on the age attribute. As expected, younger people generally make less money than older people. Brushing the top node and its two children nodes, we can see some summary statistics for these three groupings of people. We note that the mean salary for everyone in this study is around $33,500, while the mean salary of people under 27 is around $14,300 and the mean salary of those over 27 is around $40,000.
Following the next two divisions of people under 27, we see that the tree again splits them into two categories: those 23 and under, and those over 23. Interestingly, the split past these two divisions is the same, and on the hours per week attribute, indicating that for both age ranges the more hours worked, the higher one's gross income.
Now, focusing on those over 27, we find that the tree splits immediately on the amount of education a person has had. Those with an education number of 13 or higher (which corresponds to a bachelor's degree) tend to make more money. By looking over the two children of the education number split, we can see that most of the people making around $90,000 a year have at least some advanced education.
We can use the filter panel to quickly locate those categories of people making on average over $50,000 a year. In the filter panel, select mean > 50000. Top level nodes disappear in this filter, as making that amount of money is a rare occurrence. People with a bachelor's degree who are over 27 fall in this category. By following the left branch of the first split to the end, we find another group of people in this category: married men over 36 years old, who work over 35 hours a week and have a good education (10 years or more).
If we revisit the filter and look for nodes with an absolute deviation larger than $25,000, we can find those people whose economic condition offers the widest variability. The first remaining node in this filter is those people over 27 and with a bachelor's degree. The histogram above this node shows a distribution centered around its mean, but with an unusual number of people making around $100,000 a year.
Each record in this dataset describes five characteristics of iris flowers: petal width, petal length, sepal width, sepal length, and iris type. Our goal in this regression is to predict the petal width based on the other characteristics. The file iris-rt.treeviz shows the results of the Regression Tree Inducer run on this dataset in order to predict petal width.
Looking at the top node, we see a gap in the petal width values, where no flowers exist. The Regression Tree Inducer splits on this data set first using the petal length variable. If the petal length is less than 2.6, only a restricted set of petal widths seems possible. On the other hand, petal length values greater than 2.6 indicate a more even distribution with larger corresponding petal widths. The mean petal width for those irises with petal lengths less than 2.6 is 0.24, while the corresponding mean petal width for those with lengths greater than 2.6 is 1.68. Following the large petal length irises, we see that the tree splits again on petal length, this time on the value 4.85. These two consecutive splits on the same variable point to some kind of restricted functional relationship between these two variables.
Going back to those irises with a petal length less than 2.6, we see that the following split is on the sepal width attribute. Interestingly, the values in this part of the tree seem segregated, with those irises with sepal width less than 3.25 taking on values in three narrow but separated ranges.
This dataset is a diabetes diagnosis problem using statistics gathered from an Indian tribe in Phoenix, Arizona. The file pima-rt.treeviz shows the results of running the Regression Tree Inducer on this dataset, using the plasma glucose level as the predicted continuous variable.
The first split in this tree is on the diabetes indicator, showing that people with diabetes tend to have a higher plasma glucose level than those without (141 versus 110). The next split is the 2-hour serum insulin attribute, where values greater than 125 lead to higher plasma glucose.
The Regression Tree predicts a value for a new record by following the decisions at each node, starting from the top node. For example, a diabetic patient with 2-hour serum insulin of 110 would have a predicted plasma glucose level of 105.
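Prediction by following node decisions can be sketched with a small tree structure. This is a simplified illustration, not the induced tree in pima-rt.treeviz: the Node class, the attribute names, and the leaf values other than the 105 quoted in the example are hypothetical, and the non-diabetic branch is collapsed into a single leaf.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A regression tree node: a leaf holds a value, an inner node a test."""
    value: Optional[float] = None
    test: Optional[Callable[[dict], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None


def predict(node: Node, record: dict) -> float:
    """Follow the tests from the top node down to a leaf."""
    while node.test is not None:
        node = node.if_true if node.test(record) else node.if_false
    return node.value


# Hypothetical tree: split on the diabetes indicator, then on serum insulin.
tree = Node(
    test=lambda r: r["diabetes"],
    if_true=Node(
        test=lambda r: r["serum_insulin"] > 125,
        if_true=Node(value=160.0),   # placeholder leaf value
        if_false=Node(value=105.0),  # value quoted in the example above
    ),
    if_false=Node(value=110.0),      # simplified: one leaf for non-diabetics
)

print(predict(tree, {"diabetes": True, "serum_insulin": 110}))  # 105.0
```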
The provided sample data and configuration files demonstrate the Scatter Visualizer's features and capabilities. The following .data and .scatterviz files are in the examples directory. To load a data file into MineSet, open the .schema file.
Windows users find these files in the directory in which MineSet was installed, under \examples.
IRIX users find these files in /usr/lib/MineSet/examples/scatterviz.
The Scatter Visualizer sample files are as follows:
company.data
This file contains fictitious sales data of several insurance companies in three product categories: life insurance, auto insurance, and home insurance. The data span ten years (in increments of one year) and include five income brackets (the customer's annual income).
company.scatterviz
This file specifies that the years form one slider dimension and the income brackets form the other slider. Sales of life insurance, auto insurance, and home insurance become the three dimensions in the Scatter Visualizer landscape. The color density in the slider summary window represents the total sales of all companies across all categories of insurance.
company-total.scatterviz
This file contains the same specifications as company.scatterviz, except that the size of each company is determined by the total sales of that company across all the categories of insurance.
company-life.scatterviz
This file contains the same specifications as company.scatterviz, except that the color of each object indicates the life insurance sales as a fraction of total sales.
store-type.data and store-type.scatterviz
These files show sales of various product groups by store type during a three-year period. The single independent variable for which a slider appears is time. Each entity represents a store type (such as Food Store, Drug Store, Service Station, and so forth). For each store type, the data file contains the total sales of several product groups, such as alcoholic beverages, cereal, and so forth. The data spans 36 months, in increments of one month.
The configuration file uses the month as the single slider dimension. One axis is sales of alcoholic beverages, the other is sales of tobacco products. A third axis is not used.
brand.data and brand.scatterviz
These files show sales of several soft-drink brands in a variety of store types. In this dataset the brands form the entities, and the store types are associated with the axes. The total sales are mapped to the size of each brand. The color mapping is random. Since there are no independent variables, no slider is present.
cars.data and cars.scatterviz
These files show the weight, horsepower, model year, and acceleration of several car models. The axes are cubic inches, mpg, and time to 60. Weight has been mapped to size.
people.data and people.scatterviz
These files show the height, weight, density, and cholesterol level for a fictitious population sample.
nl.births.data and nl.births.scatterviz
These files show birth patterns in the Netherlands. For each region, the population density, birth rate, and population are shown. The animation sliders are mapped to the age of the mother and the year.
adult94.data and adult94.scatterviz
These files show a complex example with scatterviz applied to adult.data. The three axes in the visualization are avg_hrswk (that is, average hours worked per week), avg_gross_income, and avg_education_num. Unfortunately “education num” does not correspond exactly to number of years of education, but it is close. The slider on the right side animates across different age ranges. Each aggregate was created by grouping by occupation, race and sex. This means that there is an entity for every combination of values for these three attributes. The color shows different occupations, as shown in the legend. The size of each entity corresponds to record counts. The summary slider is also colored by data density. To find out how this visualization was created, you may select Start Tool Manager from the File menu. This will bring up the Tool Manager with the session used to create this example.
Initially the scene shows information for people under 20 years of age. Note that the average hours worked (about 14) and the average income (about $4000) are low. If you animate over age using the slider, and examine the scene from the three orthogonal views (try using the lower 3 buttons to the right of the main window), you will notice various trends emerge. For example, if you orient the scene so you see only income by hours per week, you can see that people start to work longer hours as they age, until about age 25, then they seldom work more than 49 hours per week until they retire. Income, however, grows until age 50, then plateaus, then goes lower again. The actual trend depends somewhat on the career choice and other factors.
Suppose you were interested in comparing trends between the occupations craft-repair and prof-specialty. Open the Filter panel (View > Show Filter Panel) and select just “craft-repair” and “prof-specialty” from the list of occupations. Now when you animate, you can see that “prof-specialty” actually starts with lower incomes, but quickly outpaces “craft-repair” as people age. “Prof-specialty” is much higher on the education axis than “craft-repair”. You may wish to limit your filter further by showing just females, or those of a certain race. Also try selecting some of the different motion trail options while animating.
census.data and census.scatterviz
These files also show a plot of aggregated census data. The original dataset contained about 150,000 rows. After aggregation, there is a cube (an aggregate) for every combination of education, sex, industry1, and occupations (as these were the group-by columns).
The provided sample data and configuration files demonstrate the Splat Visualizer's features and capabilities. The files are in the examples directory.
Windows users find these files in the directory in which MineSet was installed, under \examples.
IRIX users find these files in /usr/lib/MineSet/examples/splatviz.
mushroom
The mushroom.data file contains pre-aggregated data concerning more than 5,000 mushrooms. The group-by columns were odor, gill_color, and cap_color. For every combination of these three columns in the original data, there is a count and an average edibility, where 0 is edible and 1 is poisonous. An average edibility between 0 and 1 means that some of the mushrooms in that aggregate are edible and some are poisonous, since a mushroom cannot be partially poisonous.
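The aggregation described above can be sketched in a few lines of Python. This is only an illustration of the group-by step, not the way MineSet itself produces mushroom.data; the sample records and the function name `aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records: (odor, gill_color, cap_color, poisonous),
# where poisonous is 0 for edible and 1 for poisonous.
records = [
    ("none", "brown", "gray", 0),
    ("none", "brown", "gray", 0),
    ("foul", "buff", "red", 1),
    ("almond", "white", "yellow", 0),
    ("almond", "white", "yellow", 1),
]

def aggregate(rows):
    """Group by the three columns; return {key: (count, average edibility)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [count, sum of poisonous flags]
    for odor, gill_color, cap_color, poisonous in rows:
        entry = totals[(odor, gill_color, cap_color)]
        entry[0] += 1
        entry[1] += poisonous
    return {key: (count, total / count) for key, (count, total) in totals.items()}

splats = aggregate(records)
```

In this made-up sample, the ("almond", "white", "yellow") aggregate has an average of 0.5 because it mixes one edible and one poisonous mushroom, while the other two aggregates are pure.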
The visualization shows that the unique values for each of these columns have been sorted along the axes according to average edibility. Odor is clearly the best determinant of edibility. Also note that most splats are either all 0 or all 1, meaning these three columns are useful in segmenting the two classes of mushrooms. In fact, the column importance feature was used to select the columns mapped to axes. Lower the opacity slider to determine which splats have the highest counts. The most opaque splat represents 288 mushrooms having common values for odor, gill_color, and cap_color. To confirm this try filtering based on sum_count_poison>280 and picking on the remaining splats to see their counts. Note that all mushrooms with gill_color=buff are poisonous.
adultJobs
The adultJobs.data file was derived from adult94, a dataset provided with the distribution. It was created using an aggregation that grouped by education, occupation, hours_worked_per_week (binned), and age (binned). The gross_income column was aggregated by count and average. For a display using the Splat Visualizer, age_bin was mapped to a slider, while the other group-by columns were mapped to axes. The count_gross_income column was mapped to opacity, and avg_gross_income was mapped to color.
When the slider is in the left-most position, the color of the plot is almost entirely blue. This means that regardless of occupation, education, or number of hours worked, most people younger than 20 have low incomes. Move the slider to the right, and note how incomes rise faster for higher education and occupations toward the end of the axis. By the opacity variation you can see that the most common types of education are HS, some college and Bachelors degree.
Moving the Summary slider shows how the distribution of income changes with respect to the axis columns as people age.
adultJobs2
The adultJobs2 file is also based on the adult94 dataset. Here, the axis columns are working_class, education, and occupation. The two columns mapped to sliders are age (binned) and hours_worked_per_week (binned). Again, income was aggregated by count and average for use with opacity and color, respectively. Since there are more positions on the 2D slider, there are fewer records represented by each position. This causes greater variation of color and opacity. The red region in the center of the hrs_per_week dimension of the Summary slider shows that nearly everyone works between 35 and 45 hours per week. Note that some occupations are aligned with specific working classes. For example, everyone in the Armed-forces has Fed-Government for their working class.
censusIncome
This example is based on a dataset similar to adult94, but was not included with the distribution because of its size. In an attempt to understand the differences between gross income and total income, gross_income, total_income, and hrs_per_week have been mapped to axes. Color shows age. By studying the image we can learn that there are many records where total_income=gross_income, but there is also a large portion of records with high total_income but 0 gross_income. It is surprising that in many cases gross_income is greater than total_income.
Note where the people of different ages are concentrated. Many old people (yellow) are in the hrs_per_wk=0 plane. They are probably retirees. Many children and young adults (blue) are in the line gross_income=total_income=0. Note the fairly opaque splats near the outside edges of the volume. These positions include all points that fell in the maximum bin shown for an axis. For example, the highest bin for total_income is 70,300+. Any point higher than 70,300 goes in this bin.
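The open-ended top bin can be illustrated with a short Python sketch. Only the 70,300 threshold comes from the text; the lower bin edges, the function name `bin_index`, and the choice to place a value of exactly 70,300 in the top bin are assumptions made for illustration.

```python
import bisect

# Hypothetical lower edges of each bin; only the last edge (70,300) is from the text.
EDGES = [0, 10_000, 30_000, 70_300]

def bin_index(value, edges=EDGES):
    """Return the index of the bin containing value.

    The last bin is open-ended, so every value at or above the final edge
    lands in it. Values below the first edge are clamped into bin 0.
    """
    idx = bisect.bisect_right(edges, value) - 1
    return max(0, min(idx, len(edges) - 1))

# Two very different incomes fall in the same top bin.
top_a = bin_index(75_000)
top_b = bin_index(250_000)
```

Because all values beyond the last edge collapse into one bin, that bin can accumulate many records, which is consistent with the fairly opaque splats seen at the outside edges of the volume.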
To better see the varying density, adjust the opacity slider. At low opacity scales, the diagonal lines show that for most people gross_income=total_income, or they have just total_income and no gross_income. As you raise the scale, you can see that almost the entire volume contains data. This dataset contains 150,000 records.
churn
Churn is when a customer leaves one company for another. This example shows customer churn for a telephone company. The data used to generate this example is churn.schema.
Using column importance, we found that total_day_charge, number_customer_service_calls, and international_plan were important discriminators. These columns were mapped to axes. We then created a new numeric column, churn, which equals churned==Yes, and mapped it to color.
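As a rough sketch of the derived column, the Python below converts the churned flag to a number and averages it per combination of two of the axis columns. The column names follow the text, but the sample values and the helper names `add_churn_column` and `average_churn` are hypothetical, and this is not MineSet's own expression syntax.

```python
from collections import defaultdict

# Hypothetical customer records; the values are made up for illustration.
customers = [
    {"number_customer_service_calls": 4, "international_plan": "no", "churned": "Yes"},
    {"number_customer_service_calls": 4, "international_plan": "no", "churned": "No"},
    {"number_customer_service_calls": 1, "international_plan": "no", "churned": "No"},
    {"number_customer_service_calls": 1, "international_plan": "yes", "churned": "Yes"},
]

def add_churn_column(rows):
    """Add a numeric churn column: 1 where churned == "Yes", otherwise 0."""
    return [{**row, "churn": 1 if row["churned"] == "Yes" else 0} for row in rows]

def average_churn(rows):
    """Average the numeric churn column per (service calls, international plan)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [sum of churn, record count]
    for row in rows:
        key = (row["number_customer_service_calls"], row["international_plan"])
        sums[key][0] += row["churn"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

rates = average_churn(add_churn_column(customers))
```

Averaging a 0/1 column yields the fraction of churned customers in each group, which is the kind of quantity a color mapping can display.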
In the resulting visualization, red areas of the volume indicate high churn. The area corresponding to three or more customer service calls and low total_day_charge corresponds to high churn. You might want to weight big-spending customers more heavily than others. To do this, create a new column, total_charge, equal to
`total_day_charge`+`total_eve_charge`+`total_night_charge`
or some power of this sum. Then map this total_charge column to opacity. This means every record is weighted by total_charge. Now the visualization shows additional areas of interest near the high end of the total_day_charge axis.
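The weighting idea can be sketched in Python as follows. The charge column names come from the text; the sample values, the `power` parameter name, and the way the weights are combined into a total weight and a weighted churn rate are assumptions for illustration, not a description of how the Splat Visualizer computes opacity internally.

```python
def total_charge(row, power=1.0):
    """Sum the three charge columns, optionally raised to a power."""
    base = row["total_day_charge"] + row["total_eve_charge"] + row["total_night_charge"]
    return base ** power

# Hypothetical records falling in the same region of the plot.
rows = [
    {"total_day_charge": 45.0, "total_eve_charge": 20.0, "total_night_charge": 10.0, "churn": 1},
    {"total_day_charge": 15.0, "total_eve_charge": 5.0, "total_night_charge": 5.0, "churn": 0},
]

def weighted_churn(records, power=1.0):
    """Return (total weight, weighted churn rate) for a group of records."""
    weights = [total_charge(r, power) for r in records]
    total_weight = sum(weights)
    weighted_sum = sum(w * r["churn"] for w, r in zip(weights, records))
    return total_weight, weighted_sum / total_weight

weight, rate = weighted_churn(rows)
```

With equal weights the churn rate of these two records would be 0.5; weighting by total_charge raises it to 0.75 because the churned customer spends more, and a higher power emphasizes big spenders further.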
The provided sample configuration and data files demonstrate the Tree Visualizer's features and capabilities. The following files are in the examples directory.
Windows users find these files in the directory in which MineSet was installed, under \examples.
IRIX users find these files in /usr/lib/MineSet/examples/treeviz and /usr/lib/MineSet/data.
The Tree Visualizer sample files are as follows:
store.data and store.treeviz
When graphically displayed, these files show hypothetical sales data for a store chain. The hierarchy includes the entire chain, regions, states, cities, and individual stores. Four products are shown for each level in the hierarchy. In this configuration, heights represent sales in dollars; colors represent the percentage of the target dollar amount.
stateRevenue.data and stateRevenue.treeviz
When graphically displayed, these files show the revenue components of every state's budgets for 1992, as obtained from the United States Census Bureau (from http://www.census.gov/govs/state/stfin92.dat). Heights represent the dollar amounts in taxes. The descendent nodes in the background show the contribution of various taxes to the total revenues shown in the root node.
beer.data and beer2.data, and beer.treeviz and beer2.treeviz
When graphically displayed, these files show fictitious data based on consumer research of beer purchases. The hierarchy contains three levels:
The first is category (for example, beer or ale).
The second level is brand codes (randomly assigned).
The third is the individual product codes; for example, twelve-pack versus six-pack (randomly assigned).
Each chart contains seven bars, representing seven age groups. Bar height represents the total dollars spent by that age group. Colors represent the percentage of dollars spent by males and females. Brands, products, and data used in these files are samples only.
Both beer.treeviz and beer2.treeviz produce the same graphical output, but they have been constructed differently. In beer.treeviz, each type of beer is represented by a single record, with values for male and for female consumption; these values are stored in an enumerated array.
In beer2.treeviz, there are seven records for each beer, with each record representing one age group. Note that in the beer file, the age groups are represented in the configuration file; in the beer2 file, they are included in the data file.
The beer file requires less storage space than the beer2 file; however, the configuration file is a little more complicated. In some cases, it might be easier to produce data in the form used by the beer2 file.
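The difference between the two layouts can be sketched in Python. This is not the actual MineSet data file format; the field names, the age-group labels, the dollar values, and the function `to_long` are hypothetical and only illustrate one record per beer (with an array of values) versus one record per beer and age group.

```python
# Hypothetical age-group labels. In the beer layout, labels like these belong to
# the configuration file; in the beer2 layout, they appear in every data record.
AGE_GROUPS = ["21-24", "25-29", "30-34", "35-44", "45-54", "55-64", "65+"]

# beer-style record: one record per product, one array slot per age group.
wide = {"product": "P001", "dollars": [120, 340, 280, 410, 300, 150, 60]}

def to_long(record, age_groups=AGE_GROUPS):
    """Expand one array-valued record into one record per age group (beer2 style)."""
    if len(record["dollars"]) != len(age_groups):
        raise ValueError("expected one value per age group")
    return [
        {"product": record["product"], "age_group": group, "dollars": amount}
        for group, amount in zip(age_groups, record["dollars"])
    ]

long_rows = to_long(wide)
```

The expanded form repeats the product code and age-group label in every row, which illustrates why the beer2 layout takes more storage while needing a simpler configuration.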