Phases in a typical Data Mining effort:
1. Discovery
Frame business problem
Identify analytics component
Formulate initial hypotheses
2. Data Preparation
Obtain datasets from internal and external sources
Data consistency checks in terms of definitions of fields, units of measurement, time periods, etc.
Sample
3. Data Exploration and Conditioning
Missing data handling, range reasonableness checks, outliers
Graphical or Visual Analysis
Transformation, Creation of new variables, and Normalization
Partitioning into training, validation, and test datasets (a partitioning and model-selection sketch follows this outline)
4. Model Planning
- Determine data mining task such as prediction, classification etc.
- Select appropriate data mining methods and techniques such as regression, neural networks, clustering etc.
5. Model Building
Build candidate models using the selected techniques and their variants on the training data
Refine and select the final model using validation data
Evaluate the final model on test data
6. Results Interpretation
Model evaluation using key performance metrics
7. Model Deployment
Pilot project to integrate and run the model on operational systems
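As a minimal sketch of phases 3 and 5 (partitioning into training/validation/test sets, building candidate models, and selecting the final one), the snippet below uses scikit-learn with synthetic data. The 60/20/20 split, the toy data, and the two candidate models are illustrative assumptions, not part of the methodology itself.

```python
# Sketch only: partition data into training/validation/test sets and
# select among candidate models (phases 3 and 5). Library choice
# (scikit-learn), split ratios, and models are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy input variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy output variable

# 60% training, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Build candidate models on training data (phase 5)
candidates = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=3),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Refine/select the final model using validation data
best_name = max(candidates, key=lambda k: accuracy_score(y_val, candidates[k].predict(X_val)))
final_model = candidates[best_name]

# Evaluate the final model once on test data
print(best_name, accuracy_score(y_test, final_model.predict(X_test)))
```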
Similar data mining methodologies developed by SAS and IBM SPSS Modeler (formerly SPSS Clementine) are called SEMMA and CRISP-DM, respectively
Data mining techniques can be divided into Supervised Learning Methods and Unsupervised Learning Methods
Supervised Learning
- In supervised learning, algorithms are used to learn the function 'f' that maps input variables (X) to output variables (Y)
Y = f(X)
- The idea is to approximate 'f' so that, for new data on the input variables (X), the output variables (Y) can be predicted with the minimum possible error (ε)
Supervised Learning problems can be grouped into prediction and classification problems
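A minimal sketch of the supervised idea above: approximate 'f' in Y = f(X) from labelled examples and then predict Y for new X. The synthetic data and the choice of scikit-learn's LinearRegression are assumptions made only for illustration; any supervised method could stand in for 'f'.

```python
# Sketch of supervised learning: approximate f in Y = f(X) from labelled
# examples, then predict Y for new X. Synthetic data and LinearRegression
# are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)   # Y = f(X) + noise

model = LinearRegression().fit(X, y)       # learn an approximation of f
X_new = np.array([[4.0], [7.5]])
print(model.predict(X_new))                # predicted Y for new inputs
```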
Unsupervised Learning
- In Unsupervised Learning, algorithms are used to learn the underlying structure or patterns hidden in the data
Unsupervised Learning problems can be grouped into clustering and association rule learning problems
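For the unsupervised case, a clustering sketch: no output variable is supplied, and the algorithm recovers structure from the inputs alone. KMeans with k = 3 on synthetic data is an illustrative choice, not a prescribed method.

```python
# Sketch of unsupervised learning: discover cluster structure without any
# output variable. KMeans with k=3 on synthetic data is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three synthetic groups with no labels attached
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignments learned from structure alone
print(kmeans.cluster_centers_)
```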
Target Population
- Subset of the population under study
- Results are generalized to the target population
Sample
- Subset of the target population
Simple Random Sampling
- A sampling method wherein each observation has an equal chance of being selected
Random Sampling
- A sampling method wherein each observation does not necessarily have an equal chance of being selected
Sampling with Replacement
- Sample values are independent
Sampling without Replacement
- Sample values are not independent, since each draw removes an observation from the remaining pool
Sampling results in fewer observations than the total number of observations in the dataset
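A small sketch of simple random sampling with and without replacement. NumPy's random generator is an assumed, illustrative choice, and the "dataset" here is just a toy range of row indices.

```python
# Sketch of simple random sampling with and without replacement using
# numpy (an illustrative choice); the dataset here is a toy range of IDs.
import numpy as np

rng = np.random.default_rng(3)
population = np.arange(100)          # pretend these are row indices of the dataset

# Without replacement: each observation can appear at most once,
# so successive draws are not independent
sample_wor = rng.choice(population, size=10, replace=False)

# With replacement: the same observation can be drawn again,
# so draws are independent of one another
sample_wr = rng.choice(population, size=10, replace=True)

print(sample_wor)
print(sample_wr)
```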
Data Mining algorithms
- Varying limitations on the number of observations and variables
Limitations due to computing power and storage capacity
Limitations due to the statistical methods being used
How many observations are needed to build accurate models?
Rare events, e.g., low response rates in advertising by traditional mail or email
- Oversampling of 'success' cases
- Arise mainly in classification tasks
- Costs of misclassification
- Costs of failing to identify 'success' cases are generally more than costs of detailed review of all cases
- Prediction of 'success' is likely to come at the cost of misclassifying more 'failure' cases as 'success' cases than usual
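A sketch of oversampling rare 'success' cases for a classification task. The 2% success rate and the resample-to-balance strategy are illustrative assumptions; dedicated libraries such as imbalanced-learn offer ready-made alternatives.

```python
# Sketch of oversampling rare 'success' cases for a classification task.
# The 2% success rate and the resampling-to-balance strategy are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
y = (rng.uniform(size=n) < 0.02).astype(int)    # ~2% 'success' cases
X = rng.normal(size=(n, 3)) + y[:, None]        # toy features

success_idx = np.where(y == 1)[0]
failure_idx = np.where(y == 0)[0]

# Resample 'success' rows with replacement until classes are balanced
boosted_idx = rng.choice(success_idx, size=len(failure_idx), replace=True)
keep_idx = np.concatenate([failure_idx, boosted_idx])

X_bal, y_bal = X[keep_idx], y[keep_idx]
print(y.mean(), y_bal.mean())    # success share before vs. after oversampling
```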