[AURON #2257] Avoid URI reparsing in JNI Hadoop paths#2264
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug in Auron’s JNI Hadoop file wrappers where paths containing a literal # could be truncated due to java.net.URI fragment parsing, and adds Java regression coverage to prevent recurrence.
Changes:
- Adjusted JNI bridge path handling to avoid fragment truncation when
#appears in the path string. - Added
JniBridgeTestregression tests covering literal#handling and percent-encoding behavior. - Added a test-scoped Hadoop runtime dependency to support the new unit test.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
auron-core/src/main/java/org/apache/auron/jni/JniBridge.java |
Changes how input/output paths are converted to Hadoop Path objects to avoid # fragment truncation. |
auron-core/src/test/java/org/apache/auron/jni/JniBridgeTest.java |
Adds regression tests asserting # is preserved and that read/write path encoding behavior is stable. |
auron-core/pom.xml |
Adds hadoop-client-runtime as a test dependency to compile/run the new test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception { | ||
| // the path is a URI string, so we need to convert it to a URI object | ||
| return FSDataInputWrapper.wrap(fs.open(new Path(new URI(path)))); | ||
| return FSDataInputWrapper.wrap(fs.open(toInputPath(path))); | ||
| } |
| public static FSDataOutputWrapper createFileAsDataOutputWrapper(FileSystem fs, String path) throws Exception { | ||
| return FSDataOutputWrapper.wrap(fs.create(new Path(new URI(path)))); | ||
| return FSDataOutputWrapper.wrap(fs.create(new Path(path))); | ||
| } | ||
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; | ||
| return new Path(new URI(safePath)); | ||
| } |
| } | ||
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; |
There was a problem hiding this comment.
This seems a little brittle; why do we need this? can we do something like so?
public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception {
return FSDataInputWrapper.wrap(fs.open(new Path(path)));
}
yew1eb
left a comment
There was a problem hiding this comment.
The fix for the read path is inconsistent with the write path, and the toInputPath workaround has an edge-case bug.
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; | ||
| return new Path(new URI(safePath)); |
There was a problem hiding this comment.
Two issues here:
-
Inconsistent fix:
createFileAsDataOutputWrapperwas changed tonew Path(path)(the correct simple fix), butopenFileAsDataInputWrapperstill goes throughnew URI(...)after escaping#. Ifnew Path(path)is correct for writes, the same change should work for reads — the PR description itself says the fix is "change fromnew Path(new URI(path))tonew Path(path)". Why does the read path need a different approach? -
Double-encoding bug:
path.replace("#", "%23")will corrupt a path that already contains a literal%23(i.e. a percent-encoded#) by turning it into%2523. If the simplernew Path(path)works for writes, applying it uniformly to reads as well would fix both issues at once.
Which issue does this PR close?
Closes #2257
Rationale for this change
Auron's JNI Hadoop file wrappers currently reconstruct Hadoop paths with
new Path(new URI(path)).This does not preserve Hadoop
Path(String)semantics before the path is passed back toFileSystem.When a raw Hadoop path string contains a literal
#, Java URI parsing treats the suffix after#as a fragment, so the actual Hadoop path is truncated.For example, the intended path:
is opened as:
What changes are included in this PR?
This PR stops reparsing Hadoop path strings through java.net.URI in JniBridge.
The path reconstruction is changed from:
This preserves
Hadoop Path(String)semantics.Add a regression test for JNI Hadoop file wrapper path handling when the path contains a literal #.
Are there any user-facing changes?
This fixes a bug where Hadoop paths containing a literal
#could be truncated.No new APIs, configs, or migration steps are required.
How was this patch tested?
Ran the focused Java regression test:
Result: