**Homework 3**

Answer the following questions: (10 points each)

1. The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree.

| A | B | C | Number of Instances (−) | Number of Instances (+) |
|:-:|:-:|:-:|:-----------------------:|:-----------------------:|
| T | T | T | 5  | 0  |
| F | T | T | 0  | 10 |
| T | F | T | 10 | 0  |
| F | F | T | 0  | 5  |
| T | T | F | 0  | 10 |
| F | T | F | 25 | 0  |
| T | F | F | 10 | 0  |
| F | F | F | 0  | 25 |

a. According to the classification error rate, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the gains in classification error rate.

b. Repeat for the two children of the root node.

c. How many instances are misclassified by the resulting decision tree?

d. Repeat parts (a), (b), and (c) using C as the splitting attribute.

e. Use the results in parts (c) and (d) to conclude about the greedy nature of the decision tree induction algorithm.
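The split quality in parts (a) and (b) is the drop in classification error rate, error(t) = 1 − max(p₊, p₋), from the parent node to the weighted children. A minimal Python sketch for checking the arithmetic (the helper names are our own, and the example counts for attribute A at the root are read off the table above):

```python
def error_count(neg, pos):
    """Misclassified instances if the node predicts its majority class."""
    return min(neg, pos)

def error_rate_gain(parent, children):
    """parent and children are (neg, pos) count pairs for one split."""
    n = sum(parent)
    before = error_count(*parent) / n
    after = sum(error_count(*c) for c in children) / n
    return before - after

# Example: splitting the root on A. From the table above,
# A = T covers (25 -, 10 +) and A = F covers (25 -, 40 +).
print(error_rate_gain((50, 50), [(25, 10), (25, 40)]))  # 0.5 - 0.35 = 0.15
```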

2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. Example: Age in years. Answer: Discrete, quantitative, ratio

a. Gender in terms of M or F.

b. Temperature as measured by people’s judgments.

c. Height as measured in meters.

d. Body Mass Index (BMI) as an index of weight-for-height that is commonly used to classify underweight, overweight and obesity in adults.

e. States of matter are solid, liquid, and gas.

3. For the following vectors, x and y, calculate the indicated similarity or distance measures.

a. x = (1, 0, 0, 1), y = (2, 1, 1, 2): cosine, correlation, Euclidean distance

b. x = (1, 1, 0, 0), y = (1, 1, 1, 0): cosine, correlation, Euclidean distance, Jaccard
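As a check on the hand calculations, here is a minimal NumPy sketch of the four measures (the helper names are our own); note that the Jaccard coefficient applies only to the binary vectors in part (b):

```python
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    # Pearson correlation = cosine similarity of the mean-centered vectors.
    return cosine(x - x.mean(), y - y.mean())

def euclidean(x, y):
    return np.linalg.norm(x - y)

def jaccard(x, y):
    # Binary vectors only: f11 / (f01 + f10 + f11).
    return np.sum((x == 1) & (y == 1)) / np.sum((x == 1) | (y == 1))

x, y = np.array([1, 0, 0, 1]), np.array([2, 1, 1, 2])
print(cosine(x, y), correlation(x, y), euclidean(x, y))            # part (a)

x, y = np.array([1, 1, 0, 0]), np.array([1, 1, 1, 0])
print(cosine(x, y), correlation(x, y), euclidean(x, y), jaccard(x, y))  # part (b)
```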

4. Construct a data cube from the Fact Table. Is this a dense or a sparse data cube? If it is sparse, identify the cells that are empty. The data cube is shown in the Data Cube Table.

Fact Table

| Product ID | Location ID | Number Sold |
|:----------:|:-----------:|:-----------:|
| 1 | 1 | 10 |
| 1 | 3 | 6  |
| 2 | 1 | 5  |
| 2 | 2 | 22 |
| 3 | 2 | 2  |

Data Cube Table (Number Sold, Product ID × Location ID)

| Product ID \ Location ID | 1 | 2 | 3 | Total |
|:------------------------:|:--:|:--:|:-:|:----:|
| 1     | 10 | 0  | 6 | 16 |
| 2     | 5  | 22 | 0 | 27 |
| 3     | 0  | 2  | 0 | 2  |
| Total | 15 | 24 | 6 | 45 |
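Hand construction is expected here, but the cube is just a Product ID × Location ID cross-tabulation of Number Sold with marginal totals, so a short pandas sketch (our own, not part of the assignment) can verify it. Product/location combinations missing from the fact table surface as zeros, i.e., the empty cells:

```python
import pandas as pd

# Fact table rows as (Product ID, Location ID, Number Sold).
fact = pd.DataFrame(
    [(1, 1, 10), (1, 3, 6), (2, 1, 5), (2, 2, 22), (3, 2, 2)],
    columns=["Product ID", "Location ID", "Number Sold"],
)

cube = pd.pivot_table(
    fact, values="Number Sold", index="Product ID", columns="Location ID",
    aggfunc="sum", fill_value=0, margins=True, margins_name="Total",
)
print(cube)  # zero cells correspond to empty fact-table combinations
```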

5. Consider the decision tree shown below:

*(Figure: a two-level decision tree. The root tests attribute A; its branches, labeled 0 and 1, lead to internal nodes C and B, whose branches, also labeled 0 and 1, lead to leaves labeled +, −, +, −.)*

a. Compute the generalization error rate of the tree using the optimistic approach.

b. Compute the generalization error rate of the tree using the pessimistic approach. (For simplicity, use the strategy of adding a factor of 0.5 to each leaf node.)

c. Compute the generalization error rate of the tree using the validation set shown below. This approach is known as reduced error pruning.

Training:

| Instance | A | B | C | Class |
|:--------:|:-:|:-:|:-:|:-----:|
| 1  | 0 | 0 | 0 | + |
| 2  | 0 | 0 | 1 | + |
| 3  | 0 | 1 | 0 | + |
| 4  | 0 | 1 | 1 | − |
| 5  | 1 | 0 | 0 | + |
| 6  | 1 | 0 | 0 | + |
| 7  | 1 | 1 | 0 | − |
| 8  | 1 | 0 | 1 | + |
| 9  | 1 | 1 | 0 | + |
| 10 | 1 | 1 | 0 | + |

Validation:

| Instance | A | B | C | Class |
|:--------:|:-:|:-:|:-:|:-----:|
| 11 | 0 | 0 | 0 | + |
| 12 | 0 | 1 | 1 | + |
| 13 | 1 | 1 | 0 | + |
| 14 | 1 | 0 | 1 | − |
| 15 | 1 | 0 | 0 | + |
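The three estimates in parts (a)–(c) reduce to simple arithmetic once the tree’s errors are counted: the optimistic estimate is e(T)/N_train, the pessimistic estimate is (e(T) + 0.5·k)/N_train for a tree with k leaves, and reduced error pruning uses the error on the validation set. A minimal Python sketch with placeholder counts (e_t, e_v, and n_leaves are hypothetical; substitute the numbers you obtain from the tree and tables above):

```python
def optimistic_error(train_errors, n_train):
    # (a) Optimistic: resubstitution error on the training set.
    return train_errors / n_train

def pessimistic_error(train_errors, n_leaves, n_train, penalty=0.5):
    # (b) Pessimistic: charge a 0.5 complexity penalty per leaf node.
    return (train_errors + penalty * n_leaves) / n_train

def reduced_error(val_errors, n_val):
    # (c) Reduced error pruning: error on the held-out validation set.
    return val_errors / n_val

# Placeholder counts -- replace with your own from the tree above
# (10 training instances, 5 validation instances).
e_t, e_v, n_leaves = 2, 1, 4
print(optimistic_error(e_t, 10), pessimistic_error(e_t, n_leaves, 10), reduced_error(e_v, 5))
```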
