Skip to content

fix(assets): repair double-encoded UTF-8 in asset names#2966

Open
vpetersson wants to merge 3 commits into
masterfrom
fix/repair-mojibake-asset-names
Open

fix(assets): repair double-encoded UTF-8 in asset names#2966
vpetersson wants to merge 3 commits into
masterfrom
fix/repair-mojibake-asset-names

Conversation

@vpetersson
Copy link
Copy Markdown
Contributor

Issues Fixed

Asset names that arrive double-encoded from an upstream uploader (the classic UTF-8 bytes decoded as Latin-1 mojibake — e.g. Formulários stored as Formulários) were saved verbatim. The garbled name then shows up both in the web UI asset list and in the viewer's Showing asset … log line.

Description

Anthias never produces this corruption itself — Django's multipart parser and DRF both decode as UTF-8, and the viewer's stderr is UTF-8 — but it stored whatever a misbehaving client sent in the request body.

  • Add a conservative, deterministic repair_mojibake helper. It only rewrites a string when every character is in the Latin-1 range and those bytes form a valid, different UTF-8 string — the unambiguous signature of double-encoded UTF-8. Correctly-stored names (Formulários, Café, 日本語) raise on one of the two steps and pass through untouched, so good data can't be corrupted. Idempotent.
  • Apply it as a single write-side guardrail in Asset.save, so every create/update path (web form, all four API versions, legacy Screenly import) stores a clean name.
  • Add a one-time, idempotent data migration to repair rows stored before the guardrail existed.

Checklist

  • I have performed a self-review of my own code.
  • New and existing unit tests pass locally and on CI with my changes.
  • I have done an end-to-end test for Raspberry Pi devices.
  • I have tested my changes for x86 devices.
  • I added a documentation for the changes I have made (when necessary).

- add `repair_mojibake` helper that safely undoes the classic
  `UTF-8 bytes decoded as Latin-1` corruption (e.g. `Formulários`
  → `Formulários`); no-op on correctly-encoded text
- apply it in `Asset.save` so every write path (web form, all API
  versions, legacy import) stores a clean name
- one-time data migration repairs names stored before the guardrail

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson requested a review from a team as a code owner June 1, 2026 07:08
@vpetersson vpetersson self-assigned this Jun 1, 2026
@vpetersson vpetersson requested a review from Copilot June 1, 2026 07:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a defensive repair for asset names that arrive as classic double-encoded UTF-8 mojibake (UTF-8 bytes decoded as Latin-1), ensuring newly saved assets store cleaned names and providing a one-time migration to fix previously stored corrupted rows.

Changes:

  • Add repair_mojibake() helper and apply it in Asset.save() as a write-side guardrail.
  • Add a data migration to repair existing Asset.name values stored before the guardrail.
  • Add unit tests covering the helper, model save behavior, and the migration logic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
tests/test_mojibake_repair.py Adds tests for mojibake repair helper, save-time guardrail, and migration repair pass.
src/anthias_server/app/models.py Introduces repair_mojibake() and applies it in Asset.save() to clean names on write.
src/anthias_server/app/migrations/0007_repair_mojibake_asset_names.py Adds a one-time data migration to repair pre-existing garbled asset names.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/anthias_server/app/models.py Outdated
Comment thread src/anthias_server/app/models.py Outdated
Comment thread src/anthias_server/app/migrations/0007_repair_mojibake_asset_names.py Outdated
- Soften the safety claim in repair_mojibake / migration docstrings: the
  Latin-1 + valid-UTF-8 test is a strong heuristic, not a proof. A
  genuinely Latin-1 name whose bytes are also valid UTF-8 (e.g. `©` →
  `©`) is indistinguishable from mojibake and gets rewritten; document
  the accepted trade-off instead of overclaiming "can't corrupt good
  data".
- Return the original string when the repair produces no change, to
  match the stated "only rewrites when different" contract.
- Stream the migration's rows with .only('asset_id', 'name').iterator()
  instead of caching the whole assets table in memory.
- Add a regression test pinning the documented `©` false positive.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread src/anthias_server/app/models.py
Asset.save() repaired self.name unconditionally, so a metadata-only or
reachability save (update_fields excluding 'name', e.g. v2 views'
save(update_fields=['metadata'])) would mutate the in-memory name
without writing the column — diverging the instance from the stored row
and leaving any DB-side garble unrepaired. Repair only when name is
actually persisted (update_fields is None or includes 'name'), and add a
regression test for the metadata-only path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 1, 2026

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +57 to +60
class Migration(migrations.Migration):
dependencies = [
('anthias_app', '0006_asset_metadata'),
]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants