Skip to content

dvcoolarun/web2pdf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

45 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Production Deployment

For production use cases requiring managed infrastructure and scaling:

  • Production API: DocuQueue offers a managed document generation service ($5/mo) at docuqueue.com β€” no infrastructure maintenance required.
Screenshot 2026-06-14 at 2 39 33β€―PM

Web2pdf

CLI to convert webpages to PDFs

Web2pdf is a command line tool that converts webpages to Beautifully formatted pdfs.

webp2pdf

Features

  • Features
    • πŸ’₯ Batch Conversion: Convert multiple webpages to PDFs in one go.
    • πŸ•·οΈ Recursive Crawling: Crawl and convert entire websites starting from a single URL with configurable depth.
    • πŸ”„ Custom Styling: Tailor the appearance of your PDFs with customizable CSS, allowing you to adjust everything from fonts to background colors.
    • πŸ“„ Additional CSS: Flexibility to add custom CSS for further customization.
    • πŸ”— Multi-column Support: Benefit from multi-column support for more complex PDF layouts.
    • πŸ“š Page Numbers: Add page numbers to your PDFs for easier navigation.
    • πŸ”’ Table of Contents: Automatically generate a table of contents based on the headings in your HTML.
    • 🚦 Page Breaks: Control page breaks to ensure your PDFs are formatted exactly as you want them.
    • 🌐 Smart Link Filtering: Automatically filters to same-domain links to avoid external sites.
    • πŸ‘ Much more

Usage/Installation

To get started run:

git clone https://github.com/dvcoolarun/web2pdf.git

System Dependencies

WeasyPrint requires system libraries that must be installed before installing Python packages. Install them based on your operating system:

macOS (using Homebrew)

brew install cairo pango gdk-pixbuf libffi pkg-config

Note: If you encounter library loading errors after installation, see the Troubleshooting section below.

Linux (Ubuntu/Debian)

sudo apt-get update && sudo apt-get install -y \
    build-essential \
    python3-dev \
    libffi-dev \
    libssl-dev \
    libxml2-dev \
    libxslt1-dev \
    libjpeg-dev \
    libpango1.0-dev \
    libcairo2-dev \
    libgirepository1.0-dev \
    gobject-introspection

Linux (Fedora/RHEL/CentOS)

sudo dnf install -y \
    gcc \
    python3-devel \
    libffi-devel \
    openssl-devel \
    libxml2-devel \
    libxslt-devel \
    libjpeg-turbo-devel \
    pango-devel \
    cairo-devel \
    gobject-introspection-devel

With Conda (Recommended)

You need to install system dependencies (see above) PRIOR TO creating the conda environment. The order matters. If you jumped the gun and created your conda environment already like a boss, that unfortunately will backfire and land you in the troubleshooting section below. Sorry!

# Install system dependencies first (see System Dependencies section above)
# Then create and activate conda environment
conda env create -f environment.yml
conda activate web2pdf

# Run the tool
python main.py

Troubleshooting

macOS: "cannot load library 'libgobject-2.0-0'" Error

If you encounter this error after installing system dependencies, WeasyPrint may not be able to find the libraries. Try one of these solutions:

Option 1: Set environment variables (Recommended) Add to your shell configuration file (~/.zshrc for zsh or ~/.bash_profile for bash):

export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"

Then reload: source ~/.zshrc (or source ~/.bash_profile)

Option 2: Reinstall WeasyPrint After installing system dependencies, reinstall WeasyPrint:

conda activate web2pdf
pip uninstall weasyprint -y
pip install weasyprint

Option 3: Use conda-forge WeasyPrint (Alternative) You can try installing WeasyPrint from conda-forge which may handle dependencies better:

conda activate web2pdf
conda install -c conda-forge weasyprint

Usage Examples

Interactive Mode (Original)

python main.py
# Enter URLs when prompted

Single URL

python main.py https://example.com

Recursive Crawling

# Crawl a website with default depth of 2
python main.py https://example.com --recursive

# Crawl with custom depth
python main.py https://example.com --recursive --depth 3

# Short form
python main.py https://example.com -r -d 3

# Use default output directory (~/web2pdf_output)
python main.py https://example.com --recursive

# Specify custom output directory (will be created if it doesn't exist)
python main.py https://example.com --recursive --output ./my_pdfs

# Adjust rate limiting (default: 5 seconds between batches of 10 requests)
python main.py https://example.com --recursive --rate-limit 3

# Create a single assembled PDF with all pages combined
python main.py https://example.com --recursive --assemble

# Force re-processing of existing PDFs (default: skip existing files)
python main.py https://example.com --recursive --force

The recursive mode will:

  • Start from the provided URL
  • Crawl all links on the same domain
  • Go up to the specified depth (default: 2)
  • Convert each found page into a separate PDF file
  • Generate filenames based on the URL structure (e.g., web2pdf.pdf for github.com/zaphodbeeblebrox3rd/web2pdf)

Development

You can clone the repository and install the package using the following commands

git clone
cd webp2pdf
pipenv install

Contributing

This CLI is in its early version, and we encourage the community to help improve code, testing, and additional features. Feel free to contribute to the project by submitting pull requests, reporting issues, or suggesting new features. Your contributions are highly appreciated!

About

πŸ”„ CLI to convert Webpages to PDFs πŸš€

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages