CHAPTER 9 Working with SSAS Data Mining

chapter summary

  • Creating a data mining model is easy with the Data Mining Wizard and Data Mining Designer tools in BIDS. And Data Mining Viewers display a model’s findings in intuitive ways.
  • Preparing data for mining can be a complex and time-consuming task.
  • You can mine relational data or cube data.
  • You can create models that use simple cases based on a single table or complex cases with nested tables.
  • Nested tables provide a lot of flexibility for modeling business questions.
  • You can influence algorithms by setting their parameters.
  • After creating the mining models, it is important to evaluate their accuracy.
  • For evaluating predictive models, you make predictions on the test dataset. You can show the quality of predictions with lift chart, profit chart, or classification matrix tools.
  • You browse the mining models by using the DMX language. You can write DMX queries in SSMS. You can also use the Prediction Query Builder in SSMS and BIDS to create prediction DMX queries.
  • An SSRS report can use a mining model as its source.
  • You control access to data sources, mining structures, and mining models through SSAS roles.
  • SSAS supports Windows Authentication only.


Lesson 1: Preparing and Creating Data Mining Structures

1. you need to predict which customers are about to leave your company. How can you accomplish this task?

Predicting which customers might leave your company is a typical task for data mining. you should use a predictive algorithm such as Decision Trees to discover patterns in the data that you have about customers who have already left your company. you should then use the same algorithm to predict the churn based on data from existing customers.

2. What are the different components that make up the SQL Server BI suite?

The SQL Server BI suite includes SSAS cubes, SSAS data mining, SSRS, and SSIS

3. Why would you prepare separate training and test sets?

Whenever you use predictive models, you have to test the predictions. you train the models by using the training sets and then test the models by making predictions with the trained models on the test set.

4. Which algorithm is appropriate for fraud detection?

for fraud detection, you can use the Clustering algorithm. you can cluster the transactions and then find the transactions that do not fit well in any of the clusters.

5. In the practice, why do you think you had to discretize the continuous attributes?

you had to discretize the continuous attributes because of the naive Bayes algorithm, which accepts discrete attributes only. If you did not discretize them, you would have to ignore them in the naïve Bayes model.

6. Do you get the same results when you use different algorithms for predictions?

no, different algorithms and even different parameters of a single algorithm give you different results—for example, slightly different predictions. That is why you have to evaluate the models by using them to make predictions on the test set.

7. How can you prepare training and test sets?

To prepare training and test sets, you can use the Data Mining Wizard and Data Mining Designer in BIDS to specify the percentage of the holdout data for the test set. you can also use the

TABLESAMPLE option of the T-SQL SELECT statement

or the SSIS Row Sampling Transformation and Percentage Sampling Transformation.

Lesson 2: Creating Models and Applying Algorithms

1. When you created a new cube and a dimension from the mining model, did you notice any objects other than the database and a cube dimension that were created in the Analysis Services project?

BIDS also created an additional DSV for this dimension.

2. How can you limit the depth of a Decision Tree?

you can use the MInIMuM_SuPPoRT and CoMPLEXITy_PEnALTy parameters to control the growth of the tree.

3. Can you precisely control the number of clusters in the Sequence Clustering algorithm?

no, the Sequence Clustering algorithm can add additional clusters if it finds distinctive and important sequences.

4. Can you always use order line items for sequences?

It depends on the way you do sales. With Internet sales, you typically can collect data that shows the sequence of items a customer placed into a market basket. However, in retail stores, you cannot rely on sequences because the sequence of purchasing gets completely remixed at the cashier.

5. for how many time points in the future should you do forecasting?

The further you go into the future, the less you can rely on the forecasts. However, this also depends on the algorithm you use for forecasting. The ARIMA algorithm is much better than ART for long-term forecasting.

6. Which algorithm would you use to find the best way to arrange products on shelves in a retail store?

you should use the Association Rules algorithm for this task. for example, you can order the products that are commonly in a basket close together to help your customers remember to buy all, rather than just one, of them in a single purchase.

Lesson 3: Validating Models, Using DMX Queries, and Using Prediction Queries in Reports

1. What are the three types of charts you can use to evaluate predictive models?

you can use a lift chart for global statistics, a lift chart for a single value, and a profit chart to evaluate predictive models.

2. How can you evaluate a Time Series model?

you can make historical predictions to evaluate a Time Series model.

3. How do you evaluate Clustering models?

you should evaluate Clustering models from a business perspective.

4. using DMX, how can you add a mining model to an existing structure so that you can share the structure with other models?

you can use the ALTER MINING STRUCTURE DMX statement to add a mining model to an existing structure so that it can be shared with other models.

5. Can you use DMX to drill through to the sample cases you used for training a mining model?

yes, you can use the DMX SELECT FROM <model>.CASES syntax to drill through to the sample cases you used to train a mining model.

Lesson 4: Securing and Processing Data Mining Models

1. Can you use SQL Server logins for SSAS authentication?

no, you cannot use SQL Server logins for SSAS authentication. SSAS supports Windows Authentication only.

2. Do end users need the Process permission on a mining structure?

no, end users typically do not need any Process permission.

3. As an administrator, how would you prevent usage of the Clustering data mining algorithm?

you can disable the Clustering algorithm by using the Analysis Services Properties dialog box in SSMS.

4. What processing option deletes the training data in a mining structure without affecting its mining models?

use the Process Clear Structure option to purge the structure data without affecting the models inside the structure.

Working with SSAS Data Mining

 

Case scenario

 

The Adventure Works Sales Department wants to know the reasons for its customer churn. The department has requested that you implement one or more mining models to uncover these reasons. You have to prepare the data as well. The Adventure Works Finance Department is satisfied with the Decision Trees model it uses to predict the payment discipline of new customers. However, to better understand the reasons behind the predictions, staff members would like to see the source cases that the model used for training. The Finance Department also gets information about potential customers in the form of Excel 2007 worksheets. They would like to perform advanced checks on this data. For example, they would like to test the data for suspicious rows.

  1. How would you prepare the data for finding the churn information?
  2. Which algorithms would you use?
  3. How many models would you create?
  4. How can you allow the Finance Department to see the source cases that the data mining model used for training?
  5. How can you use SSAS data mining in Excel 2007 to find the suspicious rows?

Answers

 

1. Preparing the data for customer churn is not an easy task. The problem is that you typically do not have a simple Churn attribute in your source data. You should find out from the Sales Department how it wants to define churn. For example, you could create a new attribute that shows whether a customer has purchased anything from Adventure Works in the past six months and then use this as a predictable attribute. Do not forget to randomly split the existing data into training sets and test sets.


2. You can use Decision Trees, Naive Bayes, Neural Network, Clustering, Logistic Regression, and Linear Regression algorithms for predictions.


3. In a real-life project, you should create multiple mining models by using different algorithms and different parameters of the algorithms. Alternatively, you could try to define the attribute that measures the churn differently and apply the models on the new predictable variable. Last, you could use different input attributes for different models. You could have a single mining structure and ignore some attributes in some models. You would then deploy the model with the best performance into the production environment.


4. To allow the Finance Department to see the source cases that the data mining model used for training, you have to enable drillthrough on your mining model. To do this, first give the Drill Through permission to the SSAS database role that includes the Finance Department users. In addition, you should give this role the Read Access permission and the Read Definition permission for the data source. Last, the Finance Department users must have permissions that allow access to the source data. For example, if the source is in a SQL Server table, the users need Select permission on that table.


5. You can download the Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (see References) and install them on the computers in the Finance Department. Then, with the help of an ad hoc Clustering model, the Finance Department users will be able to use Excel 2007 Add-ins to analyze the worksheet data. With the Clustering model, you can easily find outliers, which are cases that do not fit well in any cluster. Outliers typically include suspicious data. Of course, you must allow session mining models on your SSAS.