Retain up to microsecond precision when importing sas7bdat files#330
Conversation
8fd1d3f to
2ab3a38
Compare
|
hi Thanks for the PR. I think the change is straightforward, but I need a bit more context. Explain why you need this in first place and submit a SAS file with fractional seconds as example, it need to be included in the test as well (Check contributing, issue should be discussed before sending a PR). 1- Changes to Readstat are not allowed, but I do not see any at the moment, so I guess it is fine. |
|
Hi @ofajardo, apologies for the confusing description and for the lack of issue. The description refers to the state before I submitted #332. As such, bullet point 2 can be ignored here and can be discussed in #332. |
|
OK, sounds good. Look at my comments in the other PR as well. For this one it would be important to provide a SAS example and writing a simple test that checks that the microseconds are read as expected. Probably one file and test would be enough for both PRs |
|
I have a test and a test file ready. Unfortunately, |
|
Hi, the vectorized version must stay by default, because the non vectorized version is very slow and I had complains in the past, and that is why the vectorized version was put in place. If there is no graceful way to make the higher precision work with the vectorized version, one option is not too capture the high precision by default but activate it by an argument and in such case avoid the vectorized path. Saying from the top of my head as I have not seen it. But in my experience so far, precision higher than seconds is rare (I have not seen it so far myself), so making it optional is an option. |
|
Does this still qualify as vectorized? From 186d2b17589f85bf95b9c8f05a2c6f4379aed277 Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Wed, 13 May 2026 16:40:19 +0200
Subject: [PATCH] Working concept for SAS using long long
---
pyreadstat/_readstat_parser.pyx | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..f2fb7ba 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -199,6 +199,8 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
cdef int secs
cdef double msecs
cdef int usecs
+ cdef long long days_usecs
+ cdef long long unix_to_origin_usecs
cdef object mydat
# For polars we are going to return an epoch from unix origin,
@@ -233,13 +235,16 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
return mydat.date()
elif var_format == DATE_FORMAT_DATETIME:
if output_format == "polars":
- # we want to return seconds from unix
+ unix_to_origin_usecs = <long long> (unix_to_origin_secs * 1e6)
+ # we want to return microseconds from unix
if file_format == FILE_FORMAT_STATA:
- # tstamp is in millisecons
- return (tstamp/1000) - unix_to_origin_secs
+ # tstamp is in milliseconds
+ return (tstamp * 1e3) - unix_to_origin_usecs
else:
# tstamp in seconds
- return tstamp - unix_to_origin_secs
+ days_usecs = <long long> (floor(tstamp) * 1e6)
+ usecs = <int> (round(tstamp % 1 * 1e6))
+ return days_usecs - unix_to_origin_usecs + usecs
if file_format == FILE_FORMAT_STATA:
# tstamp is in millisecons
@@ -1107,7 +1112,7 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
if var_format == DATE_FORMAT_DATE:
date_cols.append(column)
if datetime_cols:
- data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+ data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='us'))
if date_cols:
data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
--
2.54.0Or does the entire double to timestamp conversion need to happen in |
|
Another option would be to limit the feature to milliseconds and document that anything beyond that might me imprecise. |
|
milliseconds can be an option. As described one problem is the speed, it has to at least match the current one. Another problem with increasing the precision by default is that it may limit the oldest date that can be processed, and that is an issue when you have old datetimes in your data, in that sense seconds was very good. Please control for that as well. |
|
I will investigate. As things are currently in the master branch, the polars path will happily read fractional seconds from the SAS files, it is just that the results might differ in the last bit. Only the pandas path was explicitly discarding the fraction, which is what this PR changes. |
|
It looks like the way I created the file has no influence on the precision challenges. pyreadstat/pyreadstat/_readstat_parser.pyx Line 232 in 1adde46 1829-09-11T10:47:37.282617 becomes 1829-09-11T10:47:37.282618. Before the subtraction, the floating point value is -4111996342.7173829079. After subtraction, it becomes -4427615542.7173824310. The former rounds up whereas the latter rounds down. It appears that one needs to convert to integer microseconds before the subtraction for maximum precision as done in #330 (comment), but this moves the majority of operations out of the vectorized polars code.Given the nature of storing time as a floating point number, loss of precision due to rounding is expected at some point. How much is acceptable? Microsecond precision is only guaranteed for around 142 years around 1960 for SAS, anything beyond that is imprecise by nature. Using a test file with milliseconds only would pass the tests but this feels like cheating. Regarding the performance, is there a particular file you would like me to benchmark? Is the large dataset mentioned in the readme publicly available? |
|
Here is the vectorised version for reference: From 32b083c82485f9472246d0cad15abbc8c0a2892e Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Thu, 14 May 2026 01:24:12 +0200
Subject: [PATCH] vectorised version
---
pyreadstat/_readstat_parser.pyx | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..ef73a27 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -1107,7 +1107,14 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
if var_format == DATE_FORMAT_DATE:
date_cols.append(column)
if datetime_cols:
- data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+ data_frame = data_frame.with_columns(
+ [
+ pl.from_epoch(
+ (pl.col(c) % 1 * 1e6).round().cast(pl.Int64) + (pl.col(c).floor() * 1e6).cast(pl.Int64),
+ time_unit='us')
+ for c in datetime_cols if data_frame[c].len() > 0
+ ]
+ )
if date_cols:
data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
--
2.54.0 |
|
It turns out that generating a test file with precision limited to milliseconds only is not enough. I generated one with 1000 datetimes between years 1700 and 2200, and I still got one mismatch: |
|
I managed to get it working. |
|
hi @belegdol, below the code to generate the test data set. In my case, as you see I am saving it as parquet (but you can do as csv, or whatever), then I am reading that parquet file and saving the as an xport file using pyreadstat. Then I am benchmarking how long it takes to read that xport file using pyreadstat. On the current dev branch, it is 6 seconds using pandas and 3 seconds using polars. If I switch to your fork, it is 7 seconds using pandas (where did that extra second come from?) but 25 seconds using polars, so I am afraid your procedure is way to slow (assuming I am testing correctly). Please check it on your side, also curious to see how it is using sas7bdat if you manage to do that. So, if my measurement is good, I think speed would be really be an issue here. Just out of curiosity: what is the earliest date you can read using dev vs using your branch? I see in your test file you have dates back to 1700s, so I think that is enough, but curious about what would be the difference. A little detail I did not like too much in your code are lines 232 and 235 in _readstat_parser.pyx: you are returning a tuple. It was already bad before that I was returning sometimes a timestamp and sometimes a python date/time object, and now we are increasing the variability adding tupples (ideally I would like a function to return a constant type, rule that I have already broken before as I said). However, I do not see clearly why that is needed: unix_to_origin_secs is a constant and you have it stored in the datacontainer (see for example line 305), so you could also get this value after line 1051 and use it straight away. The code for the test dataset: import numpy as np
import pandas as pd
# import pyarrow as pa
n = 300_000
n_variables_per_type = 10
rng = np.random.default_rng(seed=42)
dat = pd.DataFrame({"id": np.arange(1, n + 1)})
fct_levels = ["A", "B", "C", "D", "E", "F", "G", "H"]
for i in range(1, n_variables_per_type + 1):
dat[f"fct_{i}"] = pd.Categorical(
rng.choice(fct_levels, size=n), categories=fct_levels
)
dat[f"chr_{i}"] = [
"".join(rng.choice(list("abcdefghijklmnopqrstuvwxyz0123456789"), size=15))
for _ in range(n)
]
dat[f"num_{i}"] = rng.uniform(-1_000_000, 1_000_000, size=n)
dat[f"date_{i}"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(
rng.integers(-1000, 1000, size=n, endpoint=True), unit="D"
#, dtype=pd.ArrowDtype(pa.date32())
)
dat[f"dt_{i}"] = pd.Timestamp("2020-01-01 12:00:01") + pd.to_timedelta(
rng.integers(-70_000_000, 70_000_000, size=n, endpoint=True), unit="s"
)
dat.to_parquet("./benchmark_py.parquet")note: notice that the columns date_* should actually be transformed to a pyarrow date32 so that they become a date type in the parquet file, here I am leaving them as datetime so that in practice we have more datetime columns, that makes the difference more salient. You can also check with the correct transformation in. |
Datetime values in SAS, SPSS and STATA are stored as a floating point number. Any operation risks introducing a rounding error. Use integer math in order to preserve original interpretation.
This prevents `transform_datetime()` from having to return a tuple.
444e946 to
03e52af
Compare
|
Hi @ofajardo, thank you for the feedback! I got benchmark data consistent with yours: This PR with I took your suggestion regarding fetching The extra second in the pandas path is probably a result of having to process the fractional part of the timestamp and the integer math required to do that. |
|
Regarding the datetime range, I have tested |
|
Perfect, thanks a lot! |
|
ok, it seems to me that it is not that bad, what happens is that the dataset you built to test the fractional second thing has very old dates, and in pandas 2 there is an overflow because it uses nanoseconds as default. One simple option would be to skip that test if pandas 2. Maybe there is a more elegant solution? (forcing it to microseconds?) |
|
I adapted the tests to pass on python 3.10, checking the CI/CD pipeline now. Let me know in case you have a better idea. |
|
I am looking into this but I am not sure what the culprit is. According to https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution, microsecond resolution should already be supported in 2.0. |
|
I think you should be able to test on python 3.14 if you install pandas 2.3.3 or older. But yes, what you read works, please look at my latest commit on dev, I solved it in that way and let me know what you think (all tests are passing now in all versions). |
|
I like your solution. It still preserves the ability to process the dates outside of the |
|
ok, thanks for reviewing, in such case I will do a release soon |
|
Have you tried the |
|
Nevermind, |
|
yeah, I tried that I think and it raised an error. |
|
version 1.3.5 is out with the latest changes! thanks a lot for your PR! |
As things stand now, time and datetime precision when importing .sas7bdat files is limited to whole seconds. With this change, up to microsecond precision is possible.
I am submitting this as a draft PR for the following reasons: