From c041defba84b5addcdb7a87ab3a51860ba9d932d Mon Sep 17 00:00:00 2001
From: Simone Gasparini <simone.gasparini@gmail.com>
Date: Tue, 26 Aug 2025 18:36:10 +0200
Subject: [PATCH 1/4] first draft for AI coding conventions

---
 .github/copilot-instructions.md |   1 +
 AI_DEVELOPMENT_GUIDE.md         | 107 ++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 .github/copilot-instructions.md
 create mode 100644 AI_DEVELOPMENT_GUIDE.md
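
The error-handling rule in the guide added by this patch (exceptions in host code, checked and propagated error codes for CUDA calls) could look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeStatus`, `fake_status_string`, `fake_alloc`, `check_status`, `alloc_throws` and the `POPSIFT_CHECK` macro are hypothetical names, and `FakeStatus` merely stands in for `cudaError_t` so that the sketch builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t (assumption: lets the sketch build without CUDA).
enum class FakeStatus { success = 0, invalid_value = 1, out_of_memory = 2 };

// Hypothetical stand-in for cudaGetErrorString().
inline const char* fake_status_string(FakeStatus status)
{
    switch (status) {
    case FakeStatus::success:       return "success";
    case FakeStatus::invalid_value: return "invalid value";
    case FakeStatus::out_of_memory: return "out of memory";
    }
    return "unknown error";
}

// Converts a failing status into a host-side exception that records the call site.
inline void check_status(FakeStatus status, const char* expr, const char* file, int line)
{
    assert(expr != nullptr && file != nullptr);
    if (status == FakeStatus::success) {
        return;
    }
    std::string message = std::string(file) + ":" + std::to_string(line) + ": ";
    message += std::string(expr) + " failed: " + fake_status_string(status);
    throw std::runtime_error(message);
}

// ALL_CAPS macro (hypothetical name) that captures the expression text and location.
#define POPSIFT_CHECK(expr) check_status((expr), #expr, __FILE__, __LINE__)

// Hypothetical API call that reports an error for a non-positive request size.
inline FakeStatus fake_alloc(int bytes)
{
    return bytes > 0 ? FakeStatus::success : FakeStatus::invalid_value;
}

// Returns true if the checked call threw for the given request size.
inline bool alloc_throws(int bytes)
{
    try {
        POPSIFT_CHECK(fake_alloc(bytes));
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

With a real toolkit the same shape would presumably wrap CUDA runtime calls and use `cudaGetErrorString` for the message; the point of the sketch is only that no status value is dropped silently.
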
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
new file mode 100644
index 00000000..3f010848
--- /dev/null
+++ b/.github/copilot-instructions.md
@@ -0,0 +1 @@
+See [AI_DEVELOPMENT_GUIDE.md](../AI_DEVELOPMENT_GUIDE.md) for full coding conventions.
\ No newline at end of file
diff --git a/AI_DEVELOPMENT_GUIDE.md b/AI_DEVELOPMENT_GUIDE.md
new file mode 100644
index 00000000..29854d4b
--- /dev/null
+++ b/AI_DEVELOPMENT_GUIDE.md
@@ -0,0 +1,107 @@
+# AI Development Guide for PopSift
+
+This guide defines how AI-assisted code generation should be done in this repository.  
+It ensures that contributions (from GitHub Copilot, ChatGPT, Claude, etc.) follow a **consistent, modern, and maintainable style**.
+
+---
+
+## General Principles
+
+- Always prioritize **readability** and **clarity** over micro-optimizations.
+- Follow **modern C++17 best practices**.
+- Keep host-side C++ and CUDA device code **cleanly separated**.
+- Prefer **modularity**: each class or major component should live in its own file.
+- Code should be **self-documenting** whenever possible, with clear naming and structure.
+
+---
+
+## C++ Guidelines
+
+- **Standard**: Use **C++17**. Prefer `constexpr`, `auto`, `enum class`, range-based for loops, and smart pointers (`std::unique_ptr`, `std::shared_ptr`).
+- **Memory Management**: Use RAII. Avoid raw `new`/`delete` except in CUDA contexts where unavoidable.
+- **Error Handling**:
+  - Use exceptions in host C++ code.  
+  - In CUDA, check and propagate error codes using helper utilities/macros. Never ignore errors.
+- **Namespaces**: Group related functions/classes logically. Avoid polluting the global namespace.
+- **Headers**:
+  - Keep headers minimal; forward declare instead of including heavy dependencies.
+  - Each header should be guarded with `#pragma once`.
+- **Style**:
+  - `snake_case` for variables and functions.  
+  - `CamelCase` for class and struct names.  
+  - `ALL_CAPS` for macros and compile-time constants.
+
+---
+
+## CUDA Guidelines
+
+- Separate **kernels** from host orchestration code.
+- Name kernels descriptively, e.g. `compute_gradient_kernel`.
+- Document assumptions about:
+  - Thread/block layout
+  - Shared memory usage
+  - Synchronization requirements
+- Use `__restrict__` and `constexpr` where appropriate for performance and clarity.
+- Prefer small, focused kernels over overly complex ones.
+- Always validate CUDA API calls.
+
+---
+
+## Threading Guidelines
+
+- **Host Threading**: Use `std::thread` and synchronization primitives from `<mutex>`.
+- **CUDA Streams**: Use multiple streams for concurrent kernel execution.
+- **Thread Safety**: Document thread safety guarantees for all public APIs.
+- **Avoid**: Raw pthreads or platform-specific threading APIs.
+
+---
+
+## Modularity and Organization
+
+- Keep code **organized by functionality** (e.g., detection, description, GPU utilities).
+- Avoid very long functions (>50 lines); refactor into helpers when possible.
+- Prefer **free functions** in namespaces over singletons or unnecessary wrapper classes.
+- Keep algorithms and data structures reusable when possible.
+
+---
+
+## Performance Guidelines
+
+- **Memory Access Patterns**: Prefer coalesced memory access in CUDA kernels. Document stride patterns.
+- **Shared Memory**: Use shared memory for data reuse within thread blocks. Document bank conflicts.
+- **Register Usage**: Monitor register pressure in kernels. Aim for high occupancy.
+- **Asynchronous Operations**: Use CUDA streams for overlapping computation and memory transfers.
+- **Profiling**: Profile with Nsight Systems or Nsight Compute (or the deprecated `nvprof` on older GPUs) before optimizing. Document performance assumptions.
+- **Memory Bandwidth**: Consider memory bandwidth as the primary bottleneck for most kernels.
+
+---
+
+## Documentation
+
+- Use **Doxygen-style comments** for public APIs, classes, and CUDA kernels.
+- Document algorithm choices and any CUDA-specific design tradeoffs.
+- Update examples and README when new features are introduced.
+- With each change, update the changelog following the [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
+  - For each new feature, bug fix, or breaking change, add a corresponding entry in the changelog.
+  - Keep the description short but informative, followed by the relevant PR link.
+
+---
+
+## Git Guidelines
+
+- **Branch Names**: `feature/description`, `fix/issue-number`, `refactor/component`.
+- **Commit Messages**: Prefix the subject with a bracketed type tag modeled on Conventional Commits: `[feat]`, `[fix]`, `[refactor]`, `[doc]`, etc.
+- **File Organization**: Keep related files in logical directories.
+- **Ignore Patterns**: Update `.gitignore` for build artifacts and IDE files.
+
+---
+
+## Commit & PR Guidelines
+
+- Keep commits small and focused (one feature or fix per commit).
+- Do not add untracked files to a commit unless they are relevant to the change.
+- PRs should include:
+  - Clear description of changes
+  - Explanations for algorithmic choices or CUDA-specific design decisions
+  - Updated tests or examples if applicable
+- Code must pass existing CI checks before merging.

From e38c5afd18e4f8bfe2d341c4014477b7457e59fa Mon Sep 17 00:00:00 2001
From: Simone Gasparini <simone.gasparini@gmail.com>
Date: Wed, 27 Aug 2025 09:02:41 +0200
Subject: [PATCH 2/4] Add markdownlint configuration file

---
 .markdownlint.json | 6 ++++++
 1 file changed, 6 insertions(+)
 create mode 100644 .markdownlint.json

diff --git a/.markdownlint.json b/.markdownlint.json
new file mode 100644
index 00000000..95b2714e
--- /dev/null
+++ b/.markdownlint.json
@@ -0,0 +1,6 @@
+{
+  "default": true,
+  "MD013": false,
+  "MD024": false,
+  "MD033": false
+}

From 936bcdeb3b00a05938cdfd273b3d8f71cf505633 Mon Sep 17 00:00:00 2001
From: Carsten Griwodz <griff@ifi.uio.no>
Date: Fri, 29 Aug 2025 14:40:05 +0200
Subject: [PATCH 3/4] some initial thoughts on the AI guide

---
 AI_DEVELOPMENT_GUIDE.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
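
The smart-pointer rules added in this patch (own memory with smart pointers on the host, never pass them to `__global__` functions) might be sketched as follows. This is a host-only illustration under stated assumptions: `BufferView`, `HostBuffer`, `scale_kernel_stand_in` and `scale_first_element` are hypothetical names, and the "kernel" is an ordinary function standing in for a real `__global__` function and its launch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Plain, trivially copyable view: raw pointer plus size, safe to pass by value.
struct BufferView
{
    float* data;
    std::size_t size;
};

// Hypothetical stand-in for a __global__ kernel and its launch; it runs on the host here.
// It only ever sees the raw view, never the owning smart pointer.
inline void scale_kernel_stand_in(BufferView view, float factor)
{
    for (std::size_t i = 0; i < view.size; ++i) {
        view.data[i] *= factor;
    }
}

// Host-side owner: RAII through std::unique_ptr, released when the object goes out of scope.
class HostBuffer
{
public:
    explicit HostBuffer(std::size_t size)
        : data_(std::make_unique<float[]>(size)), size_(size) {}

    // Non-owning view handed to the kernel stand-in.
    BufferView view() { return BufferView{data_.get(), size_}; }

    float& operator[](std::size_t i)
    {
        assert(i < size_);
        return data_[i];
    }

private:
    std::unique_ptr<float[]> data_;
    std::size_t size_;
};

// Fills one element, runs the stand-in kernel on the view, and returns the result.
inline float scale_first_element(float value, float factor)
{
    HostBuffer buffer(4);
    buffer[0] = value;
    scale_kernel_stand_in(buffer.view(), factor);
    return buffer[0];
}
```

In real CUDA code the owner would presumably hold a device allocation (for example a `std::unique_ptr` with a deleter that calls `cudaFree`), and the host would have to keep that owner alive until the asynchronous kernel has finished.
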

diff --git a/AI_DEVELOPMENT_GUIDE.md b/AI_DEVELOPMENT_GUIDE.md
index 29854d4b..db113654 100644
--- a/AI_DEVELOPMENT_GUIDE.md
+++ b/AI_DEVELOPMENT_GUIDE.md
@@ -9,7 +9,8 @@ It ensures that contributions (from GitHub Copilot, ChatGPT, Claude, etc.) follo
 
 - Always prioritize **readability** and **clarity** over micro-optimizations.
 - Follow **modern C++17 best practices**.
-- Keep host-side C++ and CUDA device code **cleanly separated**.
+- Keep device-side `__global__` functions in the same source file as the host-side C++ code that launches them.
+- Always compile `__device__` functions in the same translation unit as the functions that call them. Preferably declare them `static inline`.
 - Prefer **modularity**: each class or major component should live in its own file.
 - Code should be **self-documenting** whenever possible, with clear naming and structure.
 
@@ -17,7 +18,12 @@ It ensures that contributions (from GitHub Copilot, ChatGPT, Claude, etc.) follo
 
 ## C++ Guidelines
 
-- **Standard**: Use **C++17**. Prefer `constexpr`, `auto`, `enum class`, range-based for loops, and smart pointers (`std::unique_ptr`, `std::shared_ptr`).
+- **Standard**:
+  - Use **C++17**. Prefer `constexpr`, `auto` and `enum class`.
+  - Use range-based for loops on the host side.
+  - Use smart pointers (`std::unique_ptr`, `std::shared_ptr`) on the host side.
+  - Never pass smart pointers as parameters to __global__ functions.
+  - Avoid dynamic memory allocation on the device side.
 - **Memory Management**: Use RAII. Avoid raw `new`/`delete` except in CUDA contexts where unavoidable.
 - **Error Handling**:
   - Use exceptions in host C++ code.  

From 0c5fdf251a110090fa12eeb181ae2ae010794bae Mon Sep 17 00:00:00 2001
From: Carsten Griwodz <griff@ifi.uio.no>
Date: Mon, 1 Sep 2025 17:00:12 +0200
Subject: [PATCH 4/4] some more CUDA calling considerations

---
 AI_DEVELOPMENT_GUIDE.md | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/AI_DEVELOPMENT_GUIDE.md b/AI_DEVELOPMENT_GUIDE.md
index db113654..5416d3f8 100644
--- a/AI_DEVELOPMENT_GUIDE.md
+++ b/AI_DEVELOPMENT_GUIDE.md
@@ -22,16 +22,23 @@ It ensures that contributions (from GitHub Copilot, ChatGPT, Claude, etc.) follo
   - Use **C++17**. Prefer `constexpr`, `auto` and `enum class`.
   - Use range-based for loops on the host side.
   - Use smart pointers (`std::unique_ptr`, `std::shared_ptr`) on the host side.
+  - Dynamic memory allocation on the device side is strongly discouraged.
   - Never pass smart pointers as parameters to __global__ functions.
-  - Avoid dynamic memory allocation on the device side.
-- **Memory Management**: Use RAII. Avoid raw `new`/`delete` except in CUDA contexts where unavoidable.
+- **Memory Management**:
+  - Use RAII on the host side.
+  - Avoid all dynamic memory allocation on the device side.
+  - Understand that reference-counting smart pointers cannot be kept consistent between
+    host and device, and that kernels run asynchronously from host code.
 - **Error Handling**:
   - Use exceptions in host C++ code.  
   - In CUDA, check and propagate error codes using helper utilities/macros. Never ignore errors.
 - **Namespaces**: Group related functions/classes logically. Avoid polluting the global namespace.
 - **Headers**:
   - Keep headers minimal; forward declare instead of including heavy dependencies.
-  - Each header should be guarded with `#pragma once`.
+    However, small helper functions declared `static inline __device__` that are used several times should be
+    included from a header instead of copying the code.
+  - Each header should be guarded with `#pragma once`. `#ifndef`/`#endif` guards should be used in special
+    circumstances only.
 - **Style**:
   - `snake_case` for variables and functions.  
   - `CamelCase` for class and struct names.  
@@ -41,14 +48,19 @@ It ensures that contributions (from GitHub Copilot, ChatGPT, Claude, etc.) follo
 
 ## CUDA Guidelines
 
-- Separate **kernels** from host orchestration code.
+- Separate **kernels** (`__global__` functions) from host orchestration code, but keep
+  them in the same module as the host code that launches them.
 - Name kernels descriptively, e.g. `compute_gradient_kernel`.
 - Document assumptions about:
   - Thread/block layout
   - Shared memory usage
   - Synchronization requirements
 - Use `__restrict__` and `constexpr` where appropriate for performance and clarity.
-- Prefer small, focused kernels over overly complex ones.
+- Avoid writing kernels that use local memory; keep variables in registers and shared
+  memory as much as possible. To achieve this, prefer focused kernels over complex ones.
+- To structure larger kernels, use helper functions declared `static inline __device__`.
+  Ensure that each caller and the device functions it calls are compiled together.
+- Avoid dynamic parallelism.
 - Always validate CUDA API calls.
 
 ---