Friday, 19 October 2018

Data Mining Process in R Language

Irawen, October 19, 2018

Phases in a typical Data Mining effort:

1. Discovery
     Frame business problem
     Identify analytics component
     Formulate initial hypotheses

2. Data Preparation
     Obtain datasets from internal and external sources
     Check data consistency: field definitions, units of measurement, time periods, etc.
     Sample

3. Data Exploration and Conditioning
     Missing data handling, range reasonability checks, outlier detection
     Graphical or visual analysis
     Transformation, creation of new variables, and normalization
     Partitioning into training, validation, and test datasets

4. Model Planning
     Determine the data mining task, such as prediction or classification
     Select appropriate data mining methods and techniques, such as regression, neural networks, or clustering

5. Model Building
     Build candidate models with the selected techniques and their variants, using training data
     Refine and select the final model using validation data
     Evaluate the final model on test data

6. Results Interpretation
     Model evaluation using key performance metrics

7. Model Deployment
     Pilot project to integrate and run the model on operational systems
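
The partitioning step in phase 3 and the build, select, and evaluate steps in phase 5 can be sketched in a few lines. This is a minimal illustration in plain Python, not a prescribed workflow: the 60/20/20 split, the synthetic data, and the two toy candidate models are all assumptions made for the example.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def mean_squared_error(model, rows):
    """Average squared gap between model(x) and the observed y."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical dataset: (x, y) pairs following y = 2x + 1
data = [(x, 2 * x + 1) for x in range(100)]
train_set, valid_set, test_set = partition(data)

# Build candidate models using the training data only
mean_y = sum(y for _, y in train_set) / len(train_set)
slope = (sum(x * y for x, y in train_set)
         / sum(x * x for x, _ in train_set))
candidates = {
    "constant": lambda x: mean_y,
    "line through origin": lambda x: slope * x,
}

# Select the final model using the validation data
best_name = min(candidates,
                key=lambda name: mean_squared_error(candidates[name], valid_set))

# Evaluate the selected model once on the held-out test data
test_error = mean_squared_error(candidates[best_name], test_set)
print(best_name, test_error)
```

The test set is touched only once, after the final model has been chosen, so its error is an honest estimate of performance on new data.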

Similar data mining methodologies include SEMMA, developed by SAS, and CRISP-DM, associated with IBM SPSS Modeler (formerly SPSS Clementine)

Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods

Supervised Learning
- In supervised learning, algorithms learn a function 'f' that maps input variables (X) to output variables (Y)
                        Y = f(X)
- The idea is to approximate 'f' so that, given new data on the input variables (X), the output variables (Y) can be predicted with the minimum possible error (ε)
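
As a concrete illustration of approximating 'f', the sketch below fits a straight line by ordinary least squares, which minimizes the sum of squared errors on the known (X, Y) pairs. The choice of a linear 'f', the helper name `fit_line`, and the data points are assumptions for this example only.

```python
def fit_line(xs, ys):
    """Return a function x -> intercept + slope * x fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Hypothetical labelled data: inputs X with known outputs Y
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

f_hat = fit_line(xs, ys)   # learned approximation of f
prediction = f_hat(6)      # predicted Y for a new input X = 6
print(prediction)
```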

Supervised Learning problems can be grouped into prediction and classification problems

Unsupervised Learning
- In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data

Unsupervised Learning problems can be grouped into clustering and association rule learning problems
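
Clustering can be illustrated with a small k-means sketch on one-dimensional data. There is no output variable here; the algorithm only groups values that lie close together. The function name, the data, and the starting centers are assumptions chosen for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center
    and moving each center to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabelled measurements forming two visible groups
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(values, centers=[0.0, 6.0])
print(centers, clusters)
```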

Target Population
- Subset of the population under study
- Results are generalized to the target population

Sample
- Subset of the target population

Simple Random Sampling
- A sampling method in which each observation has an equal chance of being selected

Random Sampling
- A sampling method in which each observation does not necessarily have an equal chance of being selected
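
The difference between the two definitions can be sketched with Python's standard `random` module. The population of 100 numbered units and the weights are hypothetical. Note that `choices` draws with replacement, so the weighted sample may contain repeats.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical units numbered 1..100

# Simple random sampling: every unit has the same chance of selection
simple_sample = rng.sample(population, k=10)

# Unequal-probability sampling: here a unit's chance is proportional
# to its number, so high-numbered units are selected more often
weights = list(population)
weighted_sample = rng.choices(population, weights=weights, k=10)

print(simple_sample)
print(weighted_sample)
```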

Sampling with Replacement
- Sample values are independent

Sampling without Replacement
- Sample values aren't independent
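
A small worked example shows why replacement matters for independence. It enumerates every equally likely pair of draws from a hypothetical urn holding two red balls and one blue ball, then compares the chance that the second draw is red with the same chance given that the first draw was red. The two draws are independent exactly when these match.

```python
from fractions import Fraction
from itertools import permutations, product

urn = ["red", "red", "blue"]  # hypothetical population of three balls

def prob_second_red(pairs):
    """P(second draw is red) over equally likely ordered pairs."""
    return Fraction(sum(1 for p in pairs if p[1] == "red"), len(pairs))

def prob_second_red_given_first_red(pairs):
    """P(second draw is red | first draw is red)."""
    first_red = [p for p in pairs if p[0] == "red"]
    both_red = [p for p in first_red if p[1] == "red"]
    return Fraction(len(both_red), len(first_red))

with_replacement = list(product(urn, repeat=2))    # 9 ordered pairs
without_replacement = list(permutations(urn, 2))   # 6 ordered pairs

for label, pairs in [("with replacement", with_replacement),
                     ("without replacement", without_replacement)]:
    print(label, prob_second_red(pairs), prob_second_red_given_first_red(pairs))
```

With replacement both probabilities are 2/3, so the first draw tells us nothing about the second. Without replacement the conditional probability drops to 1/2 while the unconditional one stays at 2/3, so the draws are dependent.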

Sampling yields fewer observations than the total number of observations in the dataset

Data Mining algorithms
- Varying limitations on the number of observations and variables
- Limitations due to computing power and storage capacity
- Limitations due to the statistical technique being used
- How many observations are needed to build accurate models?

Rare Event, e.g., low response rate in advertising by traditional mail or email
- Oversampling of 'success' cases
- Arises mainly in classification tasks
- Costs of misclassification
- The cost of failing to identify 'success' cases is generally higher than the cost of a detailed review of all cases
- Predicting 'success' cases is likely to come at the cost of misclassifying more 'failure' cases as 'success' cases than usual
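
The two ideas above, oversampling the rare 'success' cases and weighing misclassification costs, can be sketched as follows. The mailing data (2 responders out of 100), the helper names, and the cost figures (50 per missed success, 1 per false alarm) are all hypothetical values chosen for illustration.

```python
import random

def oversample_successes(rows, seed=0):
    """Add randomly repeated success rows (label 1) until both classes are equal in size."""
    rng = random.Random(seed)
    successes = [r for r in rows if r[1] == 1]
    failures = [r for r in rows if r[1] == 0]
    shortfall = max(0, len(failures) - len(successes))
    return rows + rng.choices(successes, k=shortfall)

def total_cost(actual, predicted, cost_missed_success, cost_false_alarm):
    """Total misclassification cost under asymmetric costs."""
    missed = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    false_alarms = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return missed * cost_missed_success + false_alarms * cost_false_alarm

# Hypothetical mailing: 100 recipients (id, label), only ids 0 and 1 respond
rows = [(i, 1 if i < 2 else 0) for i in range(100)]
balanced = oversample_successes(rows)

actual = [label for _, label in rows]
predict_all_failure = [0] * 100                        # never flags a success
flag_ten = [1 if i < 10 else 0 for i in range(100)]    # flags 10 cases, 8 wrongly

cost_a = total_cost(actual, predict_all_failure, 50, 1)
cost_b = total_cost(actual, flag_ten, 50, 1)
print(len(balanced), cost_a, cost_b)
```

Under these assumed costs, the classifier that wrongly flags eight 'failure' cases is still far cheaper than the one that misses both 'success' cases, which is the trade-off the last bullet describes.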