Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 50 additions & 3 deletions cli/Sources/Noora/Utilities/Terminal.swift
Original file line number Diff line number Diff line change
Expand Up @@ -143,10 +143,11 @@ public struct Terminal: Terminaling {
}

public func readCharacter() -> Character? {
if let char = readRawCharacter() {
return Character(UnicodeScalar(UInt8(char)))
let reader = UTF8Reader {
guard let rawChar = readRawCharacter() else { return nil }
return UInt8(truncatingIfNeeded: rawChar)
}
return nil
return reader.readCharacter()
Comment on lines 145 to +150
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new UTF8Reader instance on every call to readCharacter() is inefficient. The closure captures readRawCharacter and gets wrapped each time. Consider making UTF8Reader a stored property of Terminal or caching it to avoid repeated allocation and closure creation overhead.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the struct is lightweight (just a closure) and is created only on user keypress, so there's no measurable overhead.

}

/// Returns the size of the terminal if available.
Expand Down Expand Up @@ -269,3 +270,49 @@ public struct Terminal: Terminaling {
}
}
}

/// A reader that decodes UTF-8 encoded bytes into characters.
struct UTF8Reader {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swift has the primities to handle this covering more scenarios like Grapheme clusters. Have you tried to do something like this instead?

struct UTF8Reader {
    private var iterator: AnyIterator<UInt8>
    private var codec = Unicode.UTF8()
    private var buffer = ""
    
    init(readByte: @escaping () -> UInt8?) {
        self.iterator = AnyIterator(readByte)
    }
    
    mutating func readCharacter() -> Character? {
        while true {
            switch codec.decode(&iterator) {
            case .scalarValue(let scalar):
                buffer.unicodeScalars.append(scalar)
                // When we have more than one grapheme cluster,
                // we know the first one is complete
                if buffer.count > 1 {
                    return buffer.removeFirst()
                }
            case .emptyInput:
                // No more input, return whatever is buffered
                return buffer.isEmpty ? nil : buffer.removeFirst()
            case .error:
                return nil
            }
        }
    }
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the Unicode.UTF8 codec, but it doesn't work for interactive terminal input.

The codec tries to read the next byte even after a complete UTF-8 sequence. For file processing this is fine since EOF returns nil, but for terminal input, getchar() blocks waiting for the user's next input.

To use the codec, we need to determine the byte length first and read those bytes upfront. But at that point, the UTF-8 decoding is essentially done, so String(bytes:encoding:) is all we need.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. A fix would be to avoid the streaming codec for TTY input: read the first byte, determine the UTF-8 sequence length from that, then read exactly that many bytes (non-blocking or blocking), and finally decode with String(bytes:encoding:). That prevents getchar() from blocking for an extra byte and keeps multi-byte characters intact.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example sketch:

// Read 1st byte, determine sequence length, then read exactly that many bytes.
guard let first = readRawCharacter() else { return nil }
let firstByte = UInt8(truncatingIfNeeded: first)

guard let length = UTF8Sequence.length(forFirstByte: firstByte) else { return nil }
var bytes: [UInt8] = [firstByte]
for _ in 1..<length {
  guard let next = readRawCharacter() else { return nil }
  let nextByte = UInt8(truncatingIfNeeded: next)
  guard UTF8Sequence.isContinuationByte(nextByte) else { return nil }
  bytes.append(nextByte)
}

return String(bytes: bytes, encoding: .utf8)?.first

This avoids the streaming decoder and prevents getchar() from blocking for an extra byte on interactive input.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pepicrft Thanks for the feedback. I believe my current implementation already follows the approach you suggested. it reads the first byte, determines the sequence length, reads exactly that many bytes, and decodes with String(bytes:encoding:). The only difference is that I've split the logic into helper methods.

Do you see any issues with the current implementation?

private let readByte: () -> UInt8?

/// Creates a reader with the given byte source.
/// - Parameter readByte: A closure that returns the next byte, or `nil` if no more bytes are available.
init(readByte: @escaping () -> UInt8?) {
self.readByte = readByte
}

func readCharacter() -> Character? {
guard let firstByte = readByte() else { return nil }
guard let length = sequenceLength(forFirstByte: firstByte) else { return nil }
guard let bytes = bytes(forSequenceOfLength: length, startingWith: firstByte) else { return nil }
return character(from: bytes)
}

private func sequenceLength(forFirstByte byte: UInt8) -> Int? {
switch byte {
case 0x00 ... 0x7F: 1 // ASCII
case 0xC2 ... 0xDF: 2 // 2-byte sequence (0xC0-0xC1 are overlong encodings)
case 0xE0 ... 0xEF: 3 // 3-byte sequence
case 0xF0 ... 0xF4: 4 // 4-byte sequence (0xF5+ exceeds Unicode range)
default: nil
}
}

private func bytes(forSequenceOfLength length: Int, startingWith firstByte: UInt8) -> [UInt8]? {
var result: [UInt8] = [firstByte]
for _ in 1 ..< length {
guard let byte = readByte(), isContinuationByte(byte) else { return nil }
result.append(byte)
}
return result
}

private func isContinuationByte(_ byte: UInt8) -> Bool {
// UTF-8 continuation bytes have the pattern 10xxxxxx (0x80-0xBF)
(byte & 0xC0) == 0x80
}

private func character(from bytes: [UInt8]) -> Character? {
String(bytes: bytes, encoding: .utf8).flatMap(\.first)
Comment thread
Ryu0118 marked this conversation as resolved.
}
}
70 changes: 70 additions & 0 deletions cli/Tests/NooraTests/Utilities/UTF8ReaderTests.swift
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
import Testing

@testable import Noora

struct UTF8ReaderTests {
@Test(arguments: TestCase.allCases)
func decodesSingleCharacter(testCase: TestCase) {
var iter = testCase.bytes.makeIterator()
let reader = UTF8Reader { iter.next() }
#expect(reader.readCharacter() == testCase.expected)
}

@Test
func readsConsecutiveCharactersWithoutByteLeakage() {
let bytes: [UInt8] = [
0x41, // A (1-byte)
0xC3, 0xA9, // é (2-byte)
0xE4, 0xB8, 0xAD, // 中 (3-byte)
0xF0, 0x9F, 0x98, 0x80, // 😀 (4-byte)
]
var iter = bytes.makeIterator()
let reader = UTF8Reader { iter.next() }

#expect(reader.readCharacter() == "A")
#expect(reader.readCharacter() == "é")
#expect(reader.readCharacter() == "中")
#expect(reader.readCharacter() == "😀")
#expect(reader.readCharacter() == nil)
}

struct TestCase: CustomTestStringConvertible, Sendable {
let bytes: [UInt8]
let expected: Character?
let testDescription: String

static let allCases: [TestCase] = [
// 1-byte sequences (ASCII)
TestCase(bytes: [0x41], expected: "A", testDescription: "ASCII letter"),
TestCase(bytes: [0x00], expected: "\0", testDescription: "null character"),
TestCase(bytes: [0x7F], expected: "\u{7F}", testDescription: "ASCII max (DEL)"),

// 2-byte sequences
TestCase(bytes: [0xC3, 0xA9], expected: "é", testDescription: "Latin: French e-acute"),
TestCase(bytes: [0xD0, 0x90], expected: "А", testDescription: "Cyrillic: Russian A"),

// 3-byte sequences
TestCase(bytes: [0xE3, 0x81, 0x82], expected: "あ", testDescription: "Japanese hiragana"),
TestCase(bytes: [0xE4, 0xB8, 0xAD], expected: "中", testDescription: "Chinese hanzi"),
TestCase(bytes: [0xEA, 0xB0, 0x80], expected: "가", testDescription: "Korean hangul"),
TestCase(bytes: [0xE2, 0x82, 0xAC], expected: "€", testDescription: "Euro sign"),

// 4-byte sequences
TestCase(bytes: [0xF0, 0x9F, 0x98, 0x80], expected: "😀", testDescription: "Emoji"),
TestCase(bytes: [0xF0, 0x9F, 0x87, 0xAF], expected: "🇯", testDescription: "Regional indicator J"),

// Invalid sequences
TestCase(bytes: [0x80], expected: nil, testDescription: "Invalid: lone continuation byte"),
TestCase(bytes: [0xFF], expected: nil, testDescription: "Invalid: 0xFF is never valid"),
TestCase(bytes: [0xC3], expected: nil, testDescription: "Invalid: incomplete 2-byte sequence"),
TestCase(bytes: [0xE3, 0x81], expected: nil, testDescription: "Invalid: incomplete 3-byte sequence"),
TestCase(bytes: [0xF0, 0x9F, 0x98], expected: nil, testDescription: "Invalid: incomplete 4-byte sequence"),
TestCase(bytes: [0xC0, 0x80], expected: nil, testDescription: "Invalid: overlong encoding"),
TestCase(bytes: [0xF5, 0x80, 0x80, 0x80], expected: nil, testDescription: "Invalid: exceeds Unicode range"),
TestCase(bytes: [0xC3, 0x00], expected: nil, testDescription: "Invalid: bad continuation byte"),

// Empty input
TestCase(bytes: [], expected: nil, testDescription: "Empty input"),
]
}
}