-
Notifications
You must be signed in to change notification settings - Fork 20
fix: Multi-byte UTF-8 character input #798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
777b2e3
8a1259c
5ec0efa
f300243
7a01d06
aef9186
3183ee9
f6fda1b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -143,10 +143,11 @@ public struct Terminal: Terminaling { | |
| } | ||
|
|
||
| public func readCharacter() -> Character? { | ||
| if let char = readRawCharacter() { | ||
| return Character(UnicodeScalar(UInt8(char))) | ||
| let reader = UTF8Reader { | ||
| guard let rawChar = readRawCharacter() else { return nil } | ||
| return UInt8(truncatingIfNeeded: rawChar) | ||
| } | ||
| return nil | ||
| return reader.readCharacter() | ||
| } | ||
|
|
||
| /// Returns the size of the terminal if available. | ||
|
|
@@ -269,3 +270,49 @@ public struct Terminal: Terminaling { | |
| } | ||
| } | ||
| } | ||
|
|
||
| /// A reader that decodes UTF-8 encoded bytes into characters. | ||
| struct UTF8Reader { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Swift has the primities to handle this covering more scenarios like Grapheme clusters. Have you tried to do something like this instead? struct UTF8Reader {
private var iterator: AnyIterator<UInt8>
private var codec = Unicode.UTF8()
private var buffer = ""
init(readByte: @escaping () -> UInt8?) {
self.iterator = AnyIterator(readByte)
}
mutating func readCharacter() -> Character? {
while true {
switch codec.decode(&iterator) {
case .scalarValue(let scalar):
buffer.unicodeScalars.append(scalar)
// When we have more than one grapheme cluster,
// we know the first one is complete
if buffer.count > 1 {
return buffer.removeFirst()
}
case .emptyInput:
// No more input, return whatever is buffered
return buffer.isEmpty ? nil : buffer.removeFirst()
case .error:
return nil
}
}
}
}
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried the Unicode.UTF8 codec, but it doesn't work for interactive terminal input. The codec tries to read the next byte even after a complete UTF-8 sequence. For file processing this is fine since EOF returns nil, but for terminal input, getchar() blocks waiting for the user's next input. To use the codec, we need to determine the byte length first and read those bytes upfront. But at that point, the UTF-8 decoding is essentially done, so String(bytes:encoding:) is all we need.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. A fix would be to avoid the streaming codec for TTY input: read the first byte, determine the UTF-8 sequence length from that, then read exactly that many bytes (non-blocking or blocking), and finally decode with String(bytes:encoding:). That prevents getchar() from blocking for an extra byte and keeps multi-byte characters intact.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Example sketch: // Read 1st byte, determine sequence length, then read exactly that many bytes.
guard let first = readRawCharacter() else { return nil }
let firstByte = UInt8(truncatingIfNeeded: first)
guard let length = UTF8Sequence.length(forFirstByte: firstByte) else { return nil }
var bytes: [UInt8] = [firstByte]
for _ in 1..<length {
guard let next = readRawCharacter() else { return nil }
let nextByte = UInt8(truncatingIfNeeded: next)
guard UTF8Sequence.isContinuationByte(nextByte) else { return nil }
bytes.append(nextByte)
}
return String(bytes: bytes, encoding: .utf8)?.firstThis avoids the streaming decoder and prevents
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pepicrft Thanks for the feedback. I believe my current implementation already follows the approach you suggested. it reads the first byte, determines the sequence length, reads exactly that many bytes, and decodes with String(bytes:encoding:). The only difference is that I've split the logic into helper methods. Do you see any issues with the current implementation? |
||
| private let readByte: () -> UInt8? | ||
|
|
||
| /// Creates a reader with the given byte source. | ||
| /// - Parameter readByte: A closure that returns the next byte, or `nil` if no more bytes are available. | ||
| init(readByte: @escaping () -> UInt8?) { | ||
| self.readByte = readByte | ||
| } | ||
|
|
||
| func readCharacter() -> Character? { | ||
| guard let firstByte = readByte() else { return nil } | ||
| guard let length = sequenceLength(forFirstByte: firstByte) else { return nil } | ||
| guard let bytes = bytes(forSequenceOfLength: length, startingWith: firstByte) else { return nil } | ||
| return character(from: bytes) | ||
| } | ||
|
|
||
| private func sequenceLength(forFirstByte byte: UInt8) -> Int? { | ||
| switch byte { | ||
| case 0x00 ... 0x7F: 1 // ASCII | ||
| case 0xC2 ... 0xDF: 2 // 2-byte sequence (0xC0-0xC1 are overlong encodings) | ||
| case 0xE0 ... 0xEF: 3 // 3-byte sequence | ||
| case 0xF0 ... 0xF4: 4 // 4-byte sequence (0xF5+ exceeds Unicode range) | ||
| default: nil | ||
| } | ||
| } | ||
|
|
||
| private func bytes(forSequenceOfLength length: Int, startingWith firstByte: UInt8) -> [UInt8]? { | ||
| var result: [UInt8] = [firstByte] | ||
| for _ in 1 ..< length { | ||
| guard let byte = readByte(), isContinuationByte(byte) else { return nil } | ||
| result.append(byte) | ||
| } | ||
| return result | ||
| } | ||
|
|
||
| private func isContinuationByte(_ byte: UInt8) -> Bool { | ||
| // UTF-8 continuation bytes have the pattern 10xxxxxx (0x80-0xBF) | ||
| (byte & 0xC0) == 0x80 | ||
| } | ||
|
|
||
| private func character(from bytes: [UInt8]) -> Character? { | ||
| String(bytes: bytes, encoding: .utf8).flatMap(\.first) | ||
|
Ryu0118 marked this conversation as resolved.
|
||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| import Testing | ||
|
|
||
| @testable import Noora | ||
|
|
||
| struct UTF8ReaderTests { | ||
| @Test(arguments: TestCase.allCases) | ||
| func decodesSingleCharacter(testCase: TestCase) { | ||
| var iter = testCase.bytes.makeIterator() | ||
| let reader = UTF8Reader { iter.next() } | ||
| #expect(reader.readCharacter() == testCase.expected) | ||
| } | ||
|
|
||
| @Test | ||
| func readsConsecutiveCharactersWithoutByteLeakage() { | ||
| let bytes: [UInt8] = [ | ||
| 0x41, // A (1-byte) | ||
| 0xC3, 0xA9, // é (2-byte) | ||
| 0xE4, 0xB8, 0xAD, // 中 (3-byte) | ||
| 0xF0, 0x9F, 0x98, 0x80, // 😀 (4-byte) | ||
| ] | ||
| var iter = bytes.makeIterator() | ||
| let reader = UTF8Reader { iter.next() } | ||
|
|
||
| #expect(reader.readCharacter() == "A") | ||
| #expect(reader.readCharacter() == "é") | ||
| #expect(reader.readCharacter() == "中") | ||
| #expect(reader.readCharacter() == "😀") | ||
| #expect(reader.readCharacter() == nil) | ||
| } | ||
|
|
||
| struct TestCase: CustomTestStringConvertible, Sendable { | ||
| let bytes: [UInt8] | ||
| let expected: Character? | ||
| let testDescription: String | ||
|
|
||
| static let allCases: [TestCase] = [ | ||
| // 1-byte sequences (ASCII) | ||
| TestCase(bytes: [0x41], expected: "A", testDescription: "ASCII letter"), | ||
| TestCase(bytes: [0x00], expected: "\0", testDescription: "null character"), | ||
| TestCase(bytes: [0x7F], expected: "\u{7F}", testDescription: "ASCII max (DEL)"), | ||
|
|
||
| // 2-byte sequences | ||
| TestCase(bytes: [0xC3, 0xA9], expected: "é", testDescription: "Latin: French e-acute"), | ||
| TestCase(bytes: [0xD0, 0x90], expected: "А", testDescription: "Cyrillic: Russian A"), | ||
|
|
||
| // 3-byte sequences | ||
| TestCase(bytes: [0xE3, 0x81, 0x82], expected: "あ", testDescription: "Japanese hiragana"), | ||
| TestCase(bytes: [0xE4, 0xB8, 0xAD], expected: "中", testDescription: "Chinese hanzi"), | ||
| TestCase(bytes: [0xEA, 0xB0, 0x80], expected: "가", testDescription: "Korean hangul"), | ||
| TestCase(bytes: [0xE2, 0x82, 0xAC], expected: "€", testDescription: "Euro sign"), | ||
|
|
||
| // 4-byte sequences | ||
| TestCase(bytes: [0xF0, 0x9F, 0x98, 0x80], expected: "😀", testDescription: "Emoji"), | ||
| TestCase(bytes: [0xF0, 0x9F, 0x87, 0xAF], expected: "🇯", testDescription: "Regional indicator J"), | ||
|
|
||
| // Invalid sequences | ||
| TestCase(bytes: [0x80], expected: nil, testDescription: "Invalid: lone continuation byte"), | ||
| TestCase(bytes: [0xFF], expected: nil, testDescription: "Invalid: 0xFF is never valid"), | ||
| TestCase(bytes: [0xC3], expected: nil, testDescription: "Invalid: incomplete 2-byte sequence"), | ||
| TestCase(bytes: [0xE3, 0x81], expected: nil, testDescription: "Invalid: incomplete 3-byte sequence"), | ||
| TestCase(bytes: [0xF0, 0x9F, 0x98], expected: nil, testDescription: "Invalid: incomplete 4-byte sequence"), | ||
| TestCase(bytes: [0xC0, 0x80], expected: nil, testDescription: "Invalid: overlong encoding"), | ||
| TestCase(bytes: [0xF5, 0x80, 0x80, 0x80], expected: nil, testDescription: "Invalid: exceeds Unicode range"), | ||
| TestCase(bytes: [0xC3, 0x00], expected: nil, testDescription: "Invalid: bad continuation byte"), | ||
|
|
||
| // Empty input | ||
| TestCase(bytes: [], expected: nil, testDescription: "Empty input"), | ||
| ] | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new UTF8Reader instance on every call to readCharacter() is inefficient. The closure captures readRawCharacter and gets wrapped each time. Consider making UTF8Reader a stored property of Terminal or caching it to avoid repeated allocation and closure creation overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the struct is lightweight (just a closure) and is created only on user keypress, so there's no measurable overhead.