-
-
Notifications
You must be signed in to change notification settings - Fork 711
fix(assets): repair double-encoded UTF-8 in asset names #2966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vpetersson
wants to merge
3
commits into
master
Choose a base branch
from
fix/repair-mojibake-asset-names
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+236
−0
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
66 changes: 66 additions & 0 deletions
66
src/anthias_server/app/migrations/0007_repair_mojibake_asset_names.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| """One-time repair of double-encoded UTF-8 in existing ``Asset.name``. | ||
|
|
||
| A misbehaving uploader that double-encodes a filename stores e.g. | ||
| ``Formulários`` instead of ``Formulários`` (the UTF-8 bytes of ``á`` | ||
| read back as the Latin-1 chars ``Ã`` + ``¡``). Anthias never produces | ||
| this itself, but it stored whatever the request body carried, so | ||
| already-uploaded assets keep the garbled name in the UI and in the | ||
| viewer's ``Showing asset …`` log line. ``Asset.save`` now repairs new | ||
| writes; this migration fixes the rows that pre-date that guardrail. | ||
|
|
||
| The repair logic is inlined rather than imported from | ||
| ``anthias_server.app.models`` on purpose — migrations are frozen | ||
| snapshots of intent, and a future change to the model helper must not | ||
| retroactively alter what this one-time data fix did. | ||
|
|
||
| Idempotent: a name is rewritten only when every character is in the | ||
| Latin-1 range *and* those bytes form a valid UTF-8 string that differs | ||
| from the input — a strong heuristic for double-encoded UTF-8, though not | ||
| a proof (a genuinely Latin-1 name whose bytes are also valid UTF-8, e.g. | ||
| ``©`` → ``©``, is indistinguishable and gets rewritten too; such | ||
| collisions are vanishingly rare in real filenames). Correctly stored | ||
| names (``Formulários``, ``Café``, ``日本語``) raise on the encode or | ||
| decode step and are left untouched, so a re-run changes nothing. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from django.db import migrations | ||
|
|
||
|
|
||
| def _repair_mojibake(text): # type: ignore[no-untyped-def] | ||
| if not text: | ||
| return text | ||
| try: | ||
| repaired = text.encode('latin-1').decode('utf-8') | ||
| except (UnicodeEncodeError, UnicodeDecodeError): | ||
| return text | ||
| return repaired if repaired != text else text | ||
|
|
||
|
|
||
| def _repair_names(apps, schema_editor): # type: ignore[no-untyped-def] | ||
| asset_model = apps.get_model('anthias_app', 'Asset') | ||
| # ``.only()`` + ``.iterator()`` streams rows in chunks instead of | ||
| # caching the whole table in memory; per-row ``save`` still works. | ||
| rows = ( | ||
| asset_model.objects.exclude(name__isnull=True) | ||
| .only('asset_id', 'name') | ||
| .iterator() | ||
| ) | ||
| for asset in rows: | ||
| repaired = _repair_mojibake(asset.name) | ||
| if repaired != asset.name: | ||
| asset.name = repaired | ||
| asset.save(update_fields=['name']) | ||
|
|
||
|
|
||
| class Migration(migrations.Migration): | ||
| dependencies = [ | ||
| ('anthias_app', '0006_asset_metadata'), | ||
| ] | ||
|
Comment on lines
+57
to
+60
|
||
|
|
||
| operations = [ | ||
| migrations.RunPython( | ||
| _repair_names, reverse_code=migrations.RunPython.noop | ||
| ), | ||
| ] | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,115 @@ | ||
| """Tests for the double-encoded-UTF-8 (mojibake) repair on asset names. | ||
|
|
||
| Covers the pure helper, the write-side ``Asset.save`` guardrail (which | ||
| catches new uploads from any API version or the web form), and the | ||
| data-migration logic that fixes rows stored before the guardrail | ||
| existed. | ||
| """ | ||
|
|
||
| import pytest | ||
|
|
||
| from anthias_server.app.models import Asset, repair_mojibake | ||
|
|
||
| # ``Formulários`` round-tripped through the classic ``UTF-8 bytes read | ||
| # as Latin-1`` corruption. Kept as the canonical mojibake fixture so the | ||
| # intent is obvious without sprinkling non-ASCII escapes through the | ||
| # assertions. | ||
| GARBLED = 'Formulários'.encode('utf-8').decode('latin-1') | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| 'given, expected', | ||
| [ | ||
| # Genuine mojibake — the one case we repair. | ||
| (GARBLED, 'Formulários'), | ||
| # Correctly-encoded text must survive untouched: a multi-byte | ||
| # accent, a name with several Latin-1 accents, and CJK each | ||
| # raise on the encode or decode step and short-circuit. | ||
| ('Formulários', 'Formulários'), | ||
| ('Café Über señor', 'Café Über señor'), | ||
| ('日本語', '日本語'), | ||
| # ASCII / empty / None are no-ops. | ||
| ('Plain Name 2', 'Plain Name 2'), | ||
| ('', ''), | ||
| (None, None), | ||
| # A lone Latin-1 lead byte is not valid UTF-8 once re-decoded, | ||
| # so it is left alone rather than mangled. | ||
| ('Ã', 'Ã'), | ||
| # Documented false positive: a genuinely Latin-1 ``©`` (U+00C2 | ||
| # U+00A9) has bytes that are also valid UTF-8 (``©``), so it is | ||
| # indistinguishable from mojibake and gets rewritten. Accepted | ||
| # trade-off — see ``repair_mojibake``'s docstring. | ||
| ('©', '©'), | ||
| ], | ||
| ) | ||
| def test_repair_mojibake(given: str | None, expected: str | None) -> None: | ||
| assert repair_mojibake(given) == expected | ||
|
|
||
|
|
||
| def test_repair_mojibake_is_idempotent() -> None: | ||
| once = repair_mojibake(GARBLED) | ||
| assert repair_mojibake(once) == once == 'Formulários' | ||
|
|
||
|
|
||
| @pytest.mark.django_db | ||
| def test_save_repairs_mojibake_name() -> None: | ||
| asset = Asset.objects.create(name=GARBLED, mimetype='image') | ||
| asset.refresh_from_db() | ||
| assert asset.name == 'Formulários' | ||
|
|
||
|
|
||
| @pytest.mark.django_db | ||
| def test_save_leaves_clean_name_untouched() -> None: | ||
| asset = Asset.objects.create(name='Café Über señor', mimetype='image') | ||
| asset.refresh_from_db() | ||
| assert asset.name == 'Café Über señor' | ||
|
|
||
|
|
||
| @pytest.mark.django_db | ||
| def test_save_skips_repair_when_name_not_in_update_fields() -> None: | ||
| """A metadata-only save must not silently rewrite ``name``. | ||
|
|
||
| With ``update_fields`` excluding ``name`` the column isn't written, | ||
| so repairing ``self.name`` would only diverge the in-memory instance | ||
| from the stored row. The repair is skipped and the stored value is | ||
| left as-is. | ||
| """ | ||
| asset = Asset.objects.create(name='placeholder', mimetype='image') | ||
| Asset.objects.filter(pk=asset.pk).update(name=GARBLED) | ||
| asset.refresh_from_db() | ||
|
|
||
| asset.metadata = {'foo': 'bar'} | ||
| asset.save(update_fields=['metadata']) | ||
|
|
||
| # In-memory name untouched, and the DB still holds the raw value. | ||
| assert asset.name == GARBLED | ||
| asset.refresh_from_db() | ||
| assert asset.name == GARBLED | ||
|
|
||
|
|
||
| @pytest.mark.django_db | ||
| def test_migration_repairs_existing_rows() -> None: | ||
| """The migration's repair pass fixes pre-existing garbled rows. | ||
|
|
||
| ``Asset.save`` now cleans names on write, so to exercise the | ||
| migration's own logic against a *stored* mojibake row we write the | ||
| column directly with ``QuerySet.update`` (which bypasses ``save``). | ||
| """ | ||
| import importlib | ||
|
|
||
| migration = importlib.import_module( | ||
| 'anthias_server.app.migrations.0007_repair_mojibake_asset_names' | ||
| ) | ||
|
|
||
| asset = Asset.objects.create(name='placeholder', mimetype='image') | ||
| Asset.objects.filter(pk=asset.pk).update(name=GARBLED) | ||
|
|
||
| class _Apps: | ||
| @staticmethod | ||
| def get_model(app_label: str, model_name: str) -> type[Asset]: | ||
| return Asset | ||
|
|
||
| migration._repair_names(_Apps(), None) | ||
|
|
||
| asset.refresh_from_db() | ||
| assert asset.name == 'Formulários' |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.