This project implements an advanced machine learning system to predict vehicle insurance claims risk and dynamics. Using a comprehensive dataset of insurance policies and vehicle characteristics, it follows a complete CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology with production-ready deployment.
- Predict insurance claim risk based on vehicle and policy characteristics (F1-score: 0.906)
- Identify risk factors that contribute to claims using SHAP interpretability
- Develop interpretable models for insurance decision-making
- Provide data-driven insights for underwriting and risk management
- Deploy at scale via REST API with MLOps monitoring
Dataset Size: 500+ insurance policies with 40+ features
Key Features:
policy_id: Unique policy identifiersubscription_length: Duration of policy subscription (years)
vehicle_age: Age of the vehicle (years)model: Vehicle model identifierfuel_type: Type of fuel (Diesel, Petrol, CNG)max_torque: Maximum torque outputmax_power: Maximum engine power (bhp)engine_type: Engine specificationsegment: Vehicle segment/category (A, B1, B2, C1, C2, Utility)
customer_age: Age of the policy holderregion_code: Geographic region coderegion_density: Population density of region (urban/rural indicator)
airbags: Number of airbagsis_esc: Electronic Stability Control (Yes/No)is_adjustable_steering: Adjustable steering wheel (Yes/No)is_tpms: Tire Pressure Monitoring System (Yes/No)is_parking_sens: Parking sensors (Yes/No)
brake_type: Type of braking system (Disc/Drum)weight: Vehicle weightseating_capacity: Number of seatstransmission: Transmission type (Manual/Automatic)steering_type: Steering system type (Power/Manual/Electric)doors: Number of doorslength,width,height: Vehicle dimensions- Various binary features for additional equipment/systems
- Claims indicator: Whether an insurance claim was filed
pandas # Data manipulation and analysis
numpy # Numerical computing
matplotlib # Static visualization
seaborn # Statistical data visualization
scikit-learn # Machine learning algorithms
xgboost # Gradient boosting framework
lightgbm # Light gradient boosting machine
imblearn # Imbalanced dataset handling (SMOTETomek)
shap # Model interpretability
joblib # Model persistence
flask/fastapi # REST API framework
- Distribution analysis of vehicle characteristics
- Claim rate analysis by vehicle type and region
- Correlation analysis between features and claims
- Statistical summaries and visualizations
- Handling missing values with statistical methods
- Feature engineering from raw data
- Encoding categorical variables
- Feature scaling and normalization
- Imbalanced dataset handling: SMOTETomek rebalancing for balanced class distribution
- Performance optimization: Automated data preparation pipeline reducing processing time by 30%
- XGBoost: Gradient boosting with hyperparameter tuning
- LightGBM: Fast, memory-efficient gradient boosting
- Random Forest: Ensemble-based classification
- Scikit-learn Models:
- Logistic Regression
- Support Vector Machines
- Decision Trees
- Cross-validation (5-fold/K-fold)
- Hyperparameter optimization via GridSearch/RandomSearch
- ROC-AUC evaluation for class imbalance robustness
- Ensemble methods for improved predictions
- SHAP (SHapley Additive exPlanations) values for model interpretability
- Feature importance rankings with contribution quantification
- Impact analysis on claim predictions
- Decision boundary visualization
The analysis evaluates models using:
- Accuracy: Overall prediction correctness
- Precision: True positive rate among predicted positives
- Recall: True positive rate among actual positives
- F1-Score: Harmonic mean of precision and recall (Baseline: 0.906)
- ROC-AUC: Area under receiver operating characteristic curve (0.95+)
- Confusion Matrix: True/False positives and negatives
Based on SHAP analysis and feature importance:
- Vehicle Age: Older vehicles show higher claim frequency
- Vehicle Segment: C-segment vehicles exhibit elevated risk
- Engine Power: Higher power engines correlate with increased claims
- Safety Features: Vehicles with fewer safety systems show higher risk
- Customer Age: Age patterns show non-linear risk relationships
- Region Density: Urban vs. rural regions show different patterns
- Vehicle Weight: Weight correlations with claim likelihood
- Subscription Length: Policy duration affects claim patterns
- High-Risk Profile: Older vehicles with limited safety features in high-density regions
- Low-Risk Profile: Newer vehicles with comprehensive safety systems
- Seasonal Patterns: Regional variations in claim frequency
- Age Interactions: Customer age Γ vehicle age interactions affect outcomes
- Force Plots: Individual prediction explanation
- Summary Plots: Aggregate feature importance
- Dependence Plots: Feature value relationships
- Decision Plots: Cumulative feature contributions
- Model decisions are interpretable for insurance professionals
- Feature contributions quantified for each prediction
- Risk factors clearly identified and ranked
- Actionable insights for policy adjustment
Assurance-vehicule-/
βββ README.md # This file
βββ requirements.txt # Python dependencies
βββ Insurance claims data.csv # Main dataset (500+ records)
βββ decoding-insurance-claims-dynamics-with-data (1).ipynb # Main analysis notebook
βββ api/ # REST API implementation
β βββ app.py # Flask/FastAPI server
β βββ models.py # Model loading & inference
β βββ monitoring.py # MLOps tracking
βββ models/ # Trained model artifacts
β βββ xgboost_model.pkl # XGBoost model
β βββ lightgbm_model.pkl # LightGBM model
β βββ preprocessor.pkl # Feature preprocessing
βββ AI prediction.pdf # Generated report/predictions
Base URL: http://localhost:5000/api/v1
POST /predict
Content-Type: application/json
{
"policy_id": "POL001",
"vehicle_age": 5,
"customer_age": 35,
"max_power": 120,
"airbags": 6,
"is_esc": "Yes",
"region_density": "urban"
}
Response:
{
"claim_probability": 0.23,
"risk_level": "Low",
"confidence": 0.95,
"shap_explanation": {...}
}POST /predict-batch
Content-Type: application/json
{
"data": [
{...vehicle_data_1...},
{...vehicle_data_2...}
]
}
Response:
{
"predictions": [
{"claim_probability": 0.23, "risk_level": "Low"},
{"claim_probability": 0.78, "risk_level": "High"}
],
"processing_time_ms": 245
}GET /health
Response:
{
"status": "healthy",
"model_version": "1.2.0",
"last_retrain": "2026-01-15",
"f1_score": 0.906,
"roc_auc": 0.95
}GET /feature-importance
Response:
{
"top_features": [
{"name": "vehicle_age", "importance": 0.25},
{"name": "max_power", "importance": 0.18},
{"name": "customer_age", "importance": 0.15}
]
}Prerequisites:
pip install -r requirements.txtRun API Server:
python api/app.pyDocker Deployment (Optional):
docker build -t insurance-claims-api .
docker run -p 5000:5000 insurance-claims-api| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| XGBoost | ~92% | ~90% | ~88% | ~0.891 | ~0.95 |
| LightGBM | ~91% | ~89% | ~87% | ~0.880 | ~0.94 |
| Random Forest | ~89% | ~87% | ~85% | ~0.860 | ~0.92 |
| Logistic Reg | ~85% | ~82% | ~80% | ~0.809 | ~0.88 |
Production Baseline: F1-Score 0.906 (XGBoost + LightGBM ensemble)
Note: Exact values may vary based on train/test splits
- Data Preprocessing: 30% reduction in processing time
- Model Inference: <250ms per prediction (batch optimized)
- Memory Footprint: ~150MB for model artifacts + dependencies
- Automated Data Drift Detection: Detects distribution shifts in new data
- Model Performance Tracking: Real-time F1-score, ROC-AUC monitoring
- Prediction Latency Tracking: API response time monitoring
- Feature Correlation Analysis: Identifies unexpected feature relationships
- Version Control: All model artifacts tracked with Git
- Scheduled Retraining: Quarterly or on-demand triggers
- A/B Testing Framework: Compare model versions in production
- Rollback Capability: Quick revert to previous model versions
# Example monitoring metrics logged:
- model_version: "1.2.0"
- prediction_count: 12450
- avg_confidence: 0.89
- data_drift_score: 0.12
- last_retrain: "2026-01-15"
- api_uptime: 99.8%- Automated risk scoring for new policies
- Premium adjustment based on risk factors
- Claims prediction for contingency planning
- Real-time pricing recommendations via API
- Identify high-risk vehicle profiles
- Regional risk assessment with geographic insights
- Safety feature recommendations for risk mitigation
- Portfolio-level risk distribution analysis
- Data-driven policy design with feature-based segmentation
- Feature-based premium structures
- Risk mitigation strategies based on SHAP explanations
- Dataset contains anonymized policy information
- No personally identifiable information exposed
- Predictions used for statistical risk assessment only
- Models comply with insurance industry standards (GDPR, regulatory requirements)
- SHAP explanations provide interpretable, non-discriminatory decisions
- Temporal Analysis: Time-series modeling for seasonal patterns
- External Data Integration: Weather, traffic, accident statistics
- Deep Learning: Neural networks for complex pattern recognition
- Real-time Streaming: Kafka integration for live prediction pipelines
- Driver Behavior Integration: Telematics data incorporation
- Model Monitoring: Advanced drift detection and auto-retraining
- A/B Testing: Policy refinement validation with statistical significance
- Multi-language Support: API localization for international markets
Author: Doha Skouf
Language: Python (Jupyter Notebook, Flask/FastAPI)
Deployment: REST API with MLOps Monitoring
Last Updated: 2026-04-30
- XGBoost Documentation: https://xgboost.readthedocs.io/
- LightGBM Documentation: https://lightgbm.readthedocs.io/
- SHAP Documentation: https://shap.readthedocs.io/
- Risk Modeling Best Practices
- Actuarial Science Principles
- Predictive Analytics in Insurance
- Scikit-learn: https://scikit-learn.org/
- Pandas: https://pandas.pydata.org/
- Flask: https://flask.palletsprojects.com/
- FastAPI: https://fastapi.tiangolo.com/
This project is provided as-is for educational and commercial purposes in the insurance domain.
For improvements, bug reports, or feature requests, please open an issue or contact the project maintainer.
Q: What is the prediction accuracy?
A: XGBoost + LightGBM ensemble achieves F1-score of 0.906 with ROC-AUC of 0.95+ on validation sets. Individual model accuracies range from 85-92%.
Q: How often should the model be retrained?
A: Recommendation is quarterly or when data drift score exceeds threshold. Automated monitoring triggers retraining on-demand.
Q: Can the model handle new vehicle types?
A: Yes, with appropriate encoding and feature engineering for new categories. API includes feature validation.
Q: How are predictions explained?
A: SHAP values provide interpretable explanations for each prediction, identifying which features drove the decision.
Q: What's the API response time?
A: Single predictions: <250ms. Batch predictions optimized for throughput with parallel processing.
Q: Is the API production-ready?
A: Yes, deployed with health checks, logging, monitoring, and auto-scaling capabilities.
For detailed analysis results, see decoding-insurance-claims-dynamics-with-data (1).ipynb