This project provides a real-time data generator that simulates user activity for an e-commerce platform. It is a significant update to the original theLook eCommerce dataset, a popular public dataset from Looker available on the Google Cloud Marketplace.
While the original dataset is static and designed for batch analytics, this generator has been re-engineered to support a continuous, real-time stream of data. It inserts events directly into a PostgreSQL database, making it an ideal source for Change Data Capture (CDC) pipelines using tools like Debezium to stream database changes into Apache Kafka. This allows you to model and test modern, event-driven data architectures using a familiar and comprehensive e-commerce schema.
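For the CDC path, a Debezium PostgreSQL source connector registration might look roughly like the following. This is a hedged sketch: the connector name, hostnames, credentials, and schema/topic names are placeholders, not values from this project, and exact property names depend on your Debezium version.

```json
{
  "name": "ecomm-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "ecomm",
    "schema.include.list": "ecomm",
    "topic.prefix": "ecomm",
    "plugin.name": "pgoutput"
  }
}
```

Posting a config like this to Kafka Connect's REST API would stream each captured table to its own Kafka topic under the configured prefix.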
- Real-Time Event Simulation: Utilizes Python's `asyncio` to generate a continuous stream of e-commerce events with a configurable average frequency (QPS).
- Comprehensive Data Models: Creates realistic and interconnected data for Users, Orders, Order Items, and web-style Events, reflecting a true e-commerce environment.
- Realistic User Behavior: Simulates various user journeys, from initial purchase to order cancellations and returns, and even anonymous browsing ("ghost") sessions.
- Direct PostgreSQL Integration: Connects directly to a PostgreSQL database to insert data, managing tables within a specified schema.
- Resilient & Robust: Includes an automatic retry mechanism with exponential backoff to handle transient database connection errors gracefully.
- Highly Configurable: Offers extensive command-line options to control every aspect of the simulation, from event rates and user geography to database credentials.
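The retry-with-backoff behavior can be sketched as a small helper. This is a minimal illustration, not the generator's actual code: the function name is invented here, and it assumes transient failures surface as `ConnectionError`.

```python
import asyncio
import random

async def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run an async operation, retrying transient errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with a little jitter: 0.5s, 1s, 2s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

The jitter term spreads out retries so many clients recovering from the same outage don't hammer the database in lockstep.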
The generator's logic is managed by the `TheLookECommSimulator` class, centered around two key methods: `initialize()` and `run()`.
The `initialize()` method performs the one-time setup required to prepare the environment before the simulation begins.
- Schema and Table Creation: It executes DDL statements to create all necessary tables (`users`, `orders`, `order_items`, `events`, `products`, `dist_centers`, `heartbeat`) within the specified schema if they don't already exist.
- Initial Data Seeding: To ensure the simulation can start with a realistic state, it populates the database with initial data:
  - Users: Creates and inserts an initial set of users based on the `--init-num-users` argument.
  - Products: Loads and inserts all product information from `products.csv`.
  - Distribution Centers: Loads and inserts data for distribution centers from `distribution_centers.csv`.
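The "create if missing, seed once" pattern behind this setup can be illustrated in miniature. SQLite is used here only so the sketch is self-contained; the generator itself targets PostgreSQL, and the real tables have many more columns than shown.

```python
import sqlite3

# Illustrative DDL only -- the real schema has more tables and columns.
DDL = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    email TEXT,
    country TEXT
);
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY,
    name TEXT
);
"""

def initialize(conn):
    # Idempotent setup: IF NOT EXISTS makes repeated runs safe.
    conn.executescript(DDL)
    # Seed only when the table is empty, mirroring one-time seeding.
    if conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 0:
        conn.executemany(
            "INSERT INTO users (email, country) VALUES (?, ?)",
            [("a@example.com", "US"), ("b@example.com", "DE")],
        )
```

Because both the DDL and the seeding guard are idempotent, restarting the generator against an already-populated database leaves the existing data untouched.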
The `run()` method is the main event loop that continuously generates data.
- Event Pacing: The loop is paced using `random.expovariate(avg_qps)`, which creates a realistic, variable delay between events that averages out to the desired queries-per-second (`--avg-qps`).
- Error Resilience: The loop tracks consecutive database errors. If an operation fails, it logs a warning and retries after a short delay. If the number of consecutive errors exceeds a threshold (3), the simulation stops to prevent runaway failures. A successful operation resets the counter.
- Core Task (Purchases) - `_simulate_purchases()`: This task runs in every iteration of the loop and represents the primary activity of the simulation:
  - It begins by selecting a random user from the database.
  - There is a chance (`--user-update-prob`) that the selected user's address will be updated before the purchase, simulating a user moving or correcting their information.
  - A new `Order` is created for the user with a `Processing` status.
  - One or more `OrderItem` records are generated for the order, each linked to a random product.
  - A sequence of `Event` records is created to mimic the user's path to purchase (e.g., visiting the homepage, a department page, the product page, adding to the cart, and completing the purchase).
  - All new records (the order, its items, and the associated events) are written to the database concurrently.
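The purchase flow above can be sketched with simplified stand-in models. The dataclasses and `simulate_purchase` function below are illustrative assumptions, not the generator's real classes or fields.

```python
import random
from dataclasses import dataclass

@dataclass
class Order:
    user_id: int
    status: str = "Processing"  # every new order starts as Processing

@dataclass
class OrderItem:
    order_id: int
    product_id: int

# The simulated path to purchase, one event per funnel step.
FUNNEL = ["home", "department", "product", "cart", "purchase"]

def simulate_purchase(user_id, order_id, product_ids, rng=random):
    """Build the records for one purchase: an order, its items, and funnel events."""
    order = Order(user_id=user_id)
    items = [
        OrderItem(order_id=order_id, product_id=rng.choice(product_ids))
        for _ in range(rng.randint(1, 4))  # one or more items per order
    ]
    events = [{"user_id": user_id, "uri": step} for step in FUNNEL]
    return order, items, events
```

In the real generator these records would then be written to PostgreSQL in one concurrent batch; here they are simply returned for inspection.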
- Probabilistic Side Tasks - `_simulate_side_tasks()`: After a purchase is simulated, a set of secondary, probabilistic tasks is run to add variety to the data:
  - New User Creation: Based on `--user-create-prob`, a completely new user may be created and inserted.
  - Ghost Event Generation: Based on `--ghost-create-prob`, the generator simulates an anonymous user browsing session by creating a series of `Event` records that are not associated with any `user_id`.
  - Order Status Update: Based on `--order-update-prob`, the generator will select a random historical order and advance its status (e.g., from `Processing` to `Shipped`, or `Delivered` to `Returned`). This also updates the associated order items and generates new `Event` records for cancellations or returns.
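The side-task dispatch boils down to independent probability checks, roughly like this. The function name is invented for illustration; the defaults mirror the documented CLI defaults.

```python
import random

def pick_side_tasks(user_create_prob=0.05, ghost_create_prob=0.05,
                    order_update_prob=0.0, rng=random):
    """Decide independently which side tasks run in this iteration."""
    return {
        "create_user": rng.random() < user_create_prob,   # --user-create-prob
        "ghost_events": rng.random() < ghost_create_prob, # --ghost-create-prob
        "update_order": rng.random() < order_update_prob, # --order-update-prob
    }
```

Because each check draws its own random number, setting any probability to 0 disables that task without affecting the others, matching the `Set to 0 to disable` behavior in the CLI help.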
- Python 3.8+
- A running PostgreSQL instance
- (Optional) A running Kafka cluster if you intend to use the Debezium/Kafka integration.
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r datagen/look-ecomm/requirements.txt
  ```

- Run the generator and view all available options:

  ```bash
  python datagen/look-ecomm/data_generator.py --help
  ```
```text
usage: data_generator.py [-h] [--avg-qps AVG_QPS] [--max-iter MAX_ITER] [--init-num-users INIT_NUM_USERS] [--country COUNTRY] [--state STATE]
                         [--postal-code POSTAL_CODE] [--user-create-prob USER_CREATE_PROB] [--user-update-prob USER_UPDATE_PROB]
                         [--order-update-prob ORDER_UPDATE_PROB] [--ghost-create-prob GHOST_CREATE_PROB] [--host HOST] [--user USER]
                         [--password PASSWORD] [--db-name DB_NAME] [--schema SCHEMA] [--batch-size BATCH_SIZE] [--echo | --no-echo]
                         [--bootstrap-servers BOOTSTRAP_SERVERS] [--topic-prefix TOPIC_PREFIX] [--create-topic | --no-create-topic]

Generate theLook eCommerce data

options:
  -h, --help            show this help message and exit
  --avg-qps AVG_QPS     Average events per second.
  --max-iter MAX_ITER   Max number of successful iterations. Default -1 for infinite.
  --init-num-users INIT_NUM_USERS
                        Initial number of users to create.
  --country COUNTRY     User country.
  --state STATE         User state.
  --postal-code POSTAL_CODE
                        User postal code.
  --user-create-prob USER_CREATE_PROB
                        Probability of generating a new user. Default is 0.05. Set to 0 to disable.
  --user-update-prob USER_UPDATE_PROB
                        Probability of updating a user address. Default is 0. Set to 0 to disable.
  --order-update-prob ORDER_UPDATE_PROB
                        Probability of updating an order status. Default is 0. Set to 0 to disable.
  --ghost-create-prob GHOST_CREATE_PROB
                        Probability of generating a ghost event. Default is 0.05. Set to 0 to disable.
  --host HOST           Database host.
  --user USER           Database user.
  --password PASSWORD   Database password.
  --db-name DB_NAME     Database name.
  --schema SCHEMA       Database schema.
  --batch-size BATCH_SIZE
  --echo, --no-echo
  --bootstrap-servers BOOTSTRAP_SERVERS
                        Bootstrap server addresses.
  --topic-prefix TOPIC_PREFIX
                        Kafka topic prefix.
  --create-topic, --no-create-topic
                        Enable or disable automatic topic creation.
```
