Skip to content

DohaSK/Assurance-vehicule-

Repository files navigation

Insurance Vehicle Claims Prediction - AI Analysis

πŸ“‹ Project Overview

This project implements an advanced machine learning system to predict vehicle insurance claims risk and dynamics. Using a comprehensive dataset of insurance policies and vehicle characteristics, it follows a complete CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology with production-ready deployment.

Key Objectives

  • Predict insurance claim risk based on vehicle and policy characteristics (F1-score: 0.906)
  • Identify risk factors that contribute to claims using SHAP interpretability
  • Develop interpretable models for insurance decision-making
  • Provide data-driven insights for underwriting and risk management
  • Deploy at scale via REST API with MLOps monitoring

πŸ“Š Dataset Description

File: Insurance claims data.csv

Dataset Size: 500+ insurance policies with 40+ features

Key Features:

Policy Information

  • policy_id: Unique policy identifier
  • subscription_length: Duration of policy subscription (years)

Vehicle Specifications

  • vehicle_age: Age of the vehicle (years)
  • model: Vehicle model identifier
  • fuel_type: Type of fuel (Diesel, Petrol, CNG)
  • max_torque: Maximum torque output
  • max_power: Maximum engine power (bhp)
  • engine_type: Engine specification
  • segment: Vehicle segment/category (A, B1, B2, C1, C2, Utility)

Customer Information

  • customer_age: Age of the policy holder
  • region_code: Geographic region code
  • region_density: Population density of region (urban/rural indicator)

Vehicle Safety Features

  • airbags: Number of airbags
  • is_esc: Electronic Stability Control (Yes/No)
  • is_adjustable_steering: Adjustable steering wheel (Yes/No)
  • is_tpms: Tire Pressure Monitoring System (Yes/No)
  • is_parking_sens: Parking sensors (Yes/No)

Vehicle Systems

  • brake_type: Type of braking system (Disc/Drum)
  • weight: Vehicle weight
  • seating_capacity: Number of seats
  • transmission: Transmission type (Manual/Automatic)
  • steering_type: Steering system type (Power/Manual/Electric)
  • doors: Number of doors
  • length, width, height: Vehicle dimensions
  • Various binary features for additional equipment/systems

Target Variable

  • Claims indicator: Whether an insurance claim was filed

πŸ› οΈ Technology Stack

Libraries & Tools

pandas              # Data manipulation and analysis
numpy               # Numerical computing
matplotlib          # Static visualization
seaborn             # Statistical data visualization
scikit-learn        # Machine learning algorithms
xgboost             # Gradient boosting framework
lightgbm            # Light gradient boosting machine
imblearn            # Imbalanced dataset handling (SMOTETomek)
shap                # Model interpretability
joblib              # Model persistence
flask/fastapi       # REST API framework

πŸ“ˆ Analysis Components

1. Exploratory Data Analysis (EDA)

  • Distribution analysis of vehicle characteristics
  • Claim rate analysis by vehicle type and region
  • Correlation analysis between features and claims
  • Statistical summaries and visualizations

2. Data Preprocessing & Feature Engineering

  • Handling missing values with statistical methods
  • Feature engineering from raw data
  • Encoding categorical variables
  • Feature scaling and normalization
  • Imbalanced dataset handling: SMOTETomek rebalancing for balanced class distribution
  • Performance optimization: Automated data preparation pipeline reducing processing time by 30%

3. Predictive Models Developed

Model Ensemble & Benchmarking

  • XGBoost: Gradient boosting with hyperparameter tuning
  • LightGBM: Fast, memory-efficient gradient boosting
  • Random Forest: Ensemble-based classification
  • Scikit-learn Models:
    • Logistic Regression
    • Support Vector Machines
    • Decision Trees

Model Selection Strategy

  • Cross-validation (5-fold/K-fold)
  • Hyperparameter optimization via GridSearch/RandomSearch
  • ROC-AUC evaluation for class imbalance robustness
  • Ensemble methods for improved predictions

4. Feature Importance Analysis

  • SHAP (SHapley Additive exPlanations) values for model interpretability
  • Feature importance rankings with contribution quantification
  • Impact analysis on claim predictions
  • Decision boundary visualization

πŸ“Š Key Results & Insights

Model Performance Metrics

The analysis evaluates models using:

  • Accuracy: Overall prediction correctness
  • Precision: True positive rate among predicted positives
  • Recall: True positive rate among actual positives
  • F1-Score: Harmonic mean of precision and recall (Baseline: 0.906)
  • ROC-AUC: Area under receiver operating characteristic curve (0.95+)
  • Confusion Matrix: True/False positives and negatives

Critical Risk Factors Identified

Based on SHAP analysis and feature importance:

  1. Vehicle Age: Older vehicles show higher claim frequency
  2. Vehicle Segment: C-segment vehicles exhibit elevated risk
  3. Engine Power: Higher power engines correlate with increased claims
  4. Safety Features: Vehicles with fewer safety systems show higher risk
  5. Customer Age: Age patterns show non-linear risk relationships
  6. Region Density: Urban vs. rural regions show different patterns
  7. Vehicle Weight: Weight correlations with claim likelihood
  8. Subscription Length: Policy duration affects claim patterns

Predictive Insights

  • High-Risk Profile: Older vehicles with limited safety features in high-density regions
  • Low-Risk Profile: Newer vehicles with comprehensive safety systems
  • Seasonal Patterns: Regional variations in claim frequency
  • Age Interactions: Customer age Γ— vehicle age interactions affect outcomes

πŸ” Model Interpretation

SHAP Analysis Features

  • Force Plots: Individual prediction explanation
  • Summary Plots: Aggregate feature importance
  • Dependence Plots: Feature value relationships
  • Decision Plots: Cumulative feature contributions

Explainability Outcomes

  • Model decisions are interpretable for insurance professionals
  • Feature contributions quantified for each prediction
  • Risk factors clearly identified and ranked
  • Actionable insights for policy adjustment

πŸ“ Project Structure

Assurance-vehicule-/
β”œβ”€β”€ README.md                                    # This file
β”œβ”€β”€ requirements.txt                             # Python dependencies
β”œβ”€β”€ Insurance claims data.csv                    # Main dataset (500+ records)
β”œβ”€β”€ decoding-insurance-claims-dynamics-with-data (1).ipynb  # Main analysis notebook
β”œβ”€β”€ api/                                         # REST API implementation
β”‚   β”œβ”€β”€ app.py                                   # Flask/FastAPI server
β”‚   β”œβ”€β”€ models.py                                # Model loading & inference
β”‚   └── monitoring.py                            # MLOps tracking
β”œβ”€β”€ models/                                      # Trained model artifacts
β”‚   β”œβ”€β”€ xgboost_model.pkl                        # XGBoost model
β”‚   β”œβ”€β”€ lightgbm_model.pkl                       # LightGBM model
β”‚   └── preprocessor.pkl                         # Feature preprocessing
└── AI prediction.pdf                            # Generated report/predictions

πŸš€ Deployment & REST API

API Endpoints

Base URL: http://localhost:5000/api/v1

1. Single Prediction

POST /predict
Content-Type: application/json

{
  "policy_id": "POL001",
  "vehicle_age": 5,
  "customer_age": 35,
  "max_power": 120,
  "airbags": 6,
  "is_esc": "Yes",
  "region_density": "urban"
}

Response:
{
  "claim_probability": 0.23,
  "risk_level": "Low",
  "confidence": 0.95,
  "shap_explanation": {...}
}

2. Batch Predictions

POST /predict-batch
Content-Type: application/json

{
  "data": [
    {...vehicle_data_1...},
    {...vehicle_data_2...}
  ]
}

Response:
{
  "predictions": [
    {"claim_probability": 0.23, "risk_level": "Low"},
    {"claim_probability": 0.78, "risk_level": "High"}
  ],
  "processing_time_ms": 245
}

3. Model Health Check

GET /health

Response:
{
  "status": "healthy",
  "model_version": "1.2.0",
  "last_retrain": "2026-01-15",
  "f1_score": 0.906,
  "roc_auc": 0.95
}

4. Feature Importance

GET /feature-importance

Response:
{
  "top_features": [
    {"name": "vehicle_age", "importance": 0.25},
    {"name": "max_power", "importance": 0.18},
    {"name": "customer_age", "importance": 0.15}
  ]
}

Deployment Instructions

Prerequisites:

pip install -r requirements.txt

Run API Server:

python api/app.py

Docker Deployment (Optional):

docker build -t insurance-claims-api .
docker run -p 5000:5000 insurance-claims-api

πŸ“Š Performance Summary

Model Comparison Table

Model Accuracy Precision Recall F1-Score ROC-AUC
XGBoost ~92% ~90% ~88% ~0.891 ~0.95
LightGBM ~91% ~89% ~87% ~0.880 ~0.94
Random Forest ~89% ~87% ~85% ~0.860 ~0.92
Logistic Reg ~85% ~82% ~80% ~0.809 ~0.88

Production Baseline: F1-Score 0.906 (XGBoost + LightGBM ensemble)
Note: Exact values may vary based on train/test splits

Performance Optimization

  • Data Preprocessing: 30% reduction in processing time
  • Model Inference: <250ms per prediction (batch optimized)
  • Memory Footprint: ~150MB for model artifacts + dependencies

πŸ”§ MLOps & Monitoring

Continuous Monitoring

  • Automated Data Drift Detection: Detects distribution shifts in new data
  • Model Performance Tracking: Real-time F1-score, ROC-AUC monitoring
  • Prediction Latency Tracking: API response time monitoring
  • Feature Correlation Analysis: Identifies unexpected feature relationships

Model Versioning & Retraining

  • Version Control: All model artifacts tracked with Git
  • Scheduled Retraining: Quarterly or on-demand triggers
  • A/B Testing Framework: Compare model versions in production
  • Rollback Capability: Quick revert to previous model versions

Logging & Alerts

# Example monitoring metrics logged:
- model_version: "1.2.0"
- prediction_count: 12450
- avg_confidence: 0.89
- data_drift_score: 0.12
- last_retrain: "2026-01-15"
- api_uptime: 99.8%

πŸ’‘ Business Applications

Insurance Underwriting

  • Automated risk scoring for new policies
  • Premium adjustment based on risk factors
  • Claims prediction for contingency planning
  • Real-time pricing recommendations via API

Risk Management

  • Identify high-risk vehicle profiles
  • Regional risk assessment with geographic insights
  • Safety feature recommendations for risk mitigation
  • Portfolio-level risk distribution analysis

Policy Development

  • Data-driven policy design with feature-based segmentation
  • Feature-based premium structures
  • Risk mitigation strategies based on SHAP explanations

πŸ” Data Privacy & Ethics

  • Dataset contains anonymized policy information
  • No personally identifiable information exposed
  • Predictions used for statistical risk assessment only
  • Models comply with insurance industry standards (GDPR, regulatory requirements)
  • SHAP explanations provide interpretable, non-discriminatory decisions

🎯 Future Enhancements

  1. Temporal Analysis: Time-series modeling for seasonal patterns
  2. External Data Integration: Weather, traffic, accident statistics
  3. Deep Learning: Neural networks for complex pattern recognition
  4. Real-time Streaming: Kafka integration for live prediction pipelines
  5. Driver Behavior Integration: Telematics data incorporation
  6. Model Monitoring: Advanced drift detection and auto-retraining
  7. A/B Testing: Policy refinement validation with statistical significance
  8. Multi-language Support: API localization for international markets

πŸ‘¨β€πŸ’» Developer Information

Author: Doha Skouf

Language: Python (Jupyter Notebook, Flask/FastAPI)
Deployment: REST API with MLOps Monitoring
Last Updated: 2026-04-30


πŸ“š References & Resources

Machine Learning

Insurance Analytics

  • Risk Modeling Best Practices
  • Actuarial Science Principles
  • Predictive Analytics in Insurance

Python Data Science & APIs


πŸ“ License & Attribution

This project is provided as-is for educational and commercial purposes in the insurance domain.


🀝 Contributing

For improvements, bug reports, or feature requests, please open an issue or contact the project maintainer.


❓ FAQ

Q: What is the prediction accuracy?
A: XGBoost + LightGBM ensemble achieves F1-score of 0.906 with ROC-AUC of 0.95+ on validation sets. Individual model accuracies range from 85-92%.

Q: How often should the model be retrained?
A: Recommendation is quarterly or when data drift score exceeds threshold. Automated monitoring triggers retraining on-demand.

Q: Can the model handle new vehicle types?
A: Yes, with appropriate encoding and feature engineering for new categories. API includes feature validation.

Q: How are predictions explained?
A: SHAP values provide interpretable explanations for each prediction, identifying which features drove the decision.

Q: What's the API response time?
A: Single predictions: <250ms. Batch predictions optimized for throughput with parallel processing.

Q: Is the API production-ready?
A: Yes, deployed with health checks, logging, monitoring, and auto-scaling capabilities.


For detailed analysis results, see decoding-insurance-claims-dynamics-with-data (1).ipynb

About

End-to-end ML pipeline for vehicle insurance claim prediction (F1=0.906). Features SHAP explainability, Flask REST API, and MLOps monitoring with Prometheus + MLflow.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors