Swartzn/feat/add xtreemstore rst #312
Conversation
iamjoemccormick
left a comment
Dropping a note that I finished my initial pass through this.
Introduce s3ApiClient and s3Provider abstractions along with newS3WithOptions for extending s3Client for new s3-compatible providers without changing the base S3 flow.
iamjoemccormick
left a comment
Submitting review feedback on the first two commits (up to "add xtreemstore provider").
```go
executeAfter := rst.GetJobExecuteAfter(j.Get())
j.Segments = make([]*Segment, 0, len(workRequests))
for _, wr := range workRequests {
	if executeAfter != nil {
		wr.SetExecuteAfter(executeAfter)
	}
}
```
todo: unless I'm missing something, doesn't this still do what we were trying to avoid? Generating a timestamp on one node (Remote) that is then passed to the Sync nodes, whose clocks may not be in sync?
I would prefer instead we make this a DelayExecution field on the job request that uses the protobuf duration type that is propagated to a DelayExecution field on the work requests.
I would also propose this is set by the GenerateWorkRequests method, if appropriate for that RST client type. It doesn't feel right to set this one field in GenerateSubmission().
I saw this but was distracted by the other comments and forgot to come back to it. Anyway, this required adding a DelayExecution field to both beeremote.JobRequest and flex.WorkRequest which allows the sync node to convert the delay to ExecuteAfter in SubmitWorkRequest. GenerateWorkRequests is still called by remote.
See e85cf75
```go
zap.Bool("hasWorkResult", entry.WorkResult != nil),
zap.Bool("hasStatus", status != nil),
)
m.scheduler.AddRescheduleWorkToken(submissionId, time.Time{})
```
question: so when this happens we still add a token because, when the journal is replayed later, this WR will still get picked up and presumably the invalid result sent back to Remote?
Edit: I see this was added with the "update job state to running..." commit. Was this a bug? If so it'd be worth a mention in the commit message.
Yes, the work needs to be processed in order to complete/update remote.
It was a bug. I missed adding m.scheduler.AddRescheduleWorkToken(submissionId, time.Time{}) the first time. Showed up in my testing.
iamjoemccormick
left a comment
Posting review comments for the remaining commits.
```go
if _, err := w.beeRemoteClient.UpdateWorkRequest(work.ctx, result.Work); err != nil {
	log.Warn("unable to update remote job status to running; continuing work request without retrying", zap.Error(err))
}
```
suggestion(blocking): we discussed this on Slack, but adding a note here so we don't lose track.
Sync intentionally never informed Remote when a work request shifted to "running" to reduce load on Remote and the jobs DB. I would prefer we keep it that way, as for the most part Remote never needs to know about this unless a user checks the job status.
What we could do is implement this issue https://github.com/ThinkParQ/bee-remote/issues/14 so if the user runs remote status or remote job list Remote will reach out and refresh the job and work request statuses. The GetJobsRequest message already has a UpdateWorkResults field that was intended to control if only the latest results known to Remote are returned, or if it also probes the Sync nodes.
We could either set UpdateWorkResults by default for CTL commands, or add a new --refresh-results flag for this. My thinking is for the most part users only care to know once jobs reach a terminal state, the intermediate states aren't that interesting unless you're trying to debug specific job issues.
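The refresh idea above could be sketched roughly as below. This is illustrative only: `GetJobsRequest`/`UpdateWorkResults` mirror the message described in the comment, but the surrounding function and types are hypothetical, not the actual bee-remote API.

```go
package main

import "fmt"

// getJobsRequest models the relevant part of the existing message: when
// UpdateWorkResults is set, Remote probes the Sync nodes for fresh results
// instead of returning only what it already knows.
type getJobsRequest struct {
	UpdateWorkResults bool
}

type job struct{ Status string }

// getJobs returns the cached status by default and only reaches out to the
// Sync nodes when UpdateWorkResults is set (e.g. via a --refresh-results
// flag on the CTL commands).
func getJobs(req getJobsRequest, cached job, probeSync func() job) job {
	if req.UpdateWorkResults {
		return probeSync()
	}
	return cached
}

func main() {
	cached := job{Status: "running"}
	fresh := func() job { return job{Status: "completed"} }
	fmt.Println(getJobs(getJobsRequest{}, cached, fresh).Status)
	fmt.Println(getJobs(getJobsRequest{UpdateWorkResults: true}, cached, fresh).Status)
}
```

This keeps the steady-state load on Remote low while still letting users opt in to up-to-date intermediate states when debugging.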
…on so they're always updated
Remove unused s3Provider related components. Add comments for s3ApiClient, S3Client, and newS3WithOptions.
…sync work manager's pullInWork function
…t supported yet so it should return ErrUnsupportedOpForRST
…After time on executing sync node to avoid time synchronization issues
…d emit flow

Replace the bulk submit API with an emit callback passed to BuildBulkRequest instead of returning job requests from submit. This lets BuildBulkRequests, append, and submit run under different contexts while providers continue emitting through the builder. Also prevents submission errors from being reported when jobs were not actually submitted because the context was canceled.
iamjoemccormick
left a comment
Just a handful of remaining observations, my comment on GenerateSubmission() is the main blocker.
```go
} else {
	workRequests = rst.RecreateWorkRequests(j.Get(), j.GetSegments())
```
todo: the RecreateWorkRequests() path was updated to handle SetDelayExecution(), but I think we also need to set this in the GenerateWorkRequests() path?
I'm actually not sure the RecreateWorkRequests() path in GenerateSubmission() is ever executed anymore; everywhere else calls RecreateWorkRequests() directly.
```go
if client.IncludeInBulkRequest(walkCtx, jobRequest) {
	rstId := jobRequest.GetRemoteStorageTarget()

	bulkRequestStatesMu.Lock()
```
suggestion(non-blocking): to avoid lock contention in a hot path, consider making this an RWMutex, since after the initial population of bulkRequestStates the map is only read.
Assisted-by: Claude:claude-opus-4-6
```go
type SubmitBulkRequestFn func(ctx context.Context)
type EmitBulkRequestFn func(ctx context.Context, request *beeremote.JobRequest)
type AppendBulkRequestFn func(ctx context.Context, request *beeremote.JobRequest)
```
suggestion(non-blocking): we should mention AppendBulkRequestFn is called under a lock, so it should be fast. For example, if a provider did I/O in append it could really slow everything down.
Assisted-by: Claude:claude-opus-4-6
…ng rst.Provider interface functions
ThinkParQ/protobuf#71 is required.
Here are a few flowcharts showing the bulk request workflow: BULK_XTREEMSTORE_FLOW.md
**What does this PR do / why do we need it?**
Required for all PRs.

**Related Issue(s)**
Required when applicable.

**Where should the reviewer(s) start reviewing this?**
Only required for larger PRs when this may not be immediately obvious.

**Are there any specific topics we should discuss before merging?**
Not required.

**What are the next steps after this PR?**
Not required.

**Checklist before merging:**
Required for all PRs. When creating a PR these are items to keep in mind that cannot be checked by GitHub actions. For more details refer to the Go coding standards and the pull request process.