Retain up to microsecond precision when importing sas7bdat files by belegdol · Pull Request #330 · Roche/pyreadstat

belegdol · 2026-05-08T12:49:45Z

As things stand now, time and datetime precision when importing .sas7bdat files is limited to whole seconds. With this change, up to microsecond precision is possible.

I am submitting this as a draft PR for the following reasons:

1. The changes are (partially) dependent on upstream ReadStat changes, which are included here for reproducibility. Otherwise one needs SAS on Linux to generate appropriate files.
2. Once the format length is read properly for 32-bit files, like the ones created with SAS for Windows, the current approach of comparing the entire format name with a list will break whenever a format width is defined. Is there a particular reason for this approach? Ignoring the format width altogether would definitely be simpler. I can include that in this change if you would be willing to consider it.
3. Right now the same code is used for SPSS files, and I have none to test with. If the change needs to be made SAS-specific, please let me know.

belegdol · 2026-05-12T08:07:43Z

#1 is actually independent, I removed it.
#2 should be mitigated by #332
#3 I do not know how to test.

ofajardo · 2026-05-12T10:33:41Z

hi

Thanks for the PR. I think the change is straightforward, but I need a bit more context. Please explain why you need this in the first place and submit a SAS file with fractional seconds as an example; it needs to be included in the tests as well (check the contributing guide: an issue should be discussed before sending a PR).

1- Changes to Readstat are not allowed, but I do not see any at the moment, so I guess it is fine.
2- Explain why "the current approach of comparing the entirety of format name with a list will break whenever a format width is defined". I am not happy to change the current behavior of the package, so there must be a very strong reason to drop that mechanism that has worked for years by now.
3- At the very minimum tests should run without errors (they do for me). There are sample SPSS and STATA files in the test_data folder. However, we would need to generate SPSS and STATA test files with fractional second information.

belegdol · 2026-05-12T11:57:37Z

Hi @ofajardo,

apologies for the confusing description and for the lack of issue. The description refers to the state before I submitted #332. As such, bullet point 2 can be ignored here and can be discussed in #332.
The reason for needing this is to read the maximum precision available and to ensure the greatest possible parity with https://github.com/saurfang/spark-sas7bdat/commits/master/. There should be no difference between creating a spark data frame natively with spark-sas7bdat and creating a pandas dataframe with pyreadstat and subsequently converting it to a spark dataframe.
I will look into the test errors, and can also likely provide a test SAS file. An SPSS one can be a problem since I do not currently have access to SPSS.

ofajardo · 2026-05-12T12:22:09Z

OK, sounds good. Look at my comments in the other PR as well. For this one it would be important to provide a SAS example and to write a simple test that checks that the microseconds are read as expected. Probably one file and one test would be enough for both PRs.

belegdol · 2026-05-13T13:47:40Z

I have a test and a test file ready. Unfortunately, the polars path is failing on datetime because microseconds hit the limit of float64 precision. I will see if I can get the conversion to work in a way that makes the tests pass.
Interestingly enough, time column, which is not vectorized, passes fine.
The pandas route works for datetime column as well. Would disabling the vectorisation for datetime be worth the increased precision?

ofajardo · 2026-05-13T14:16:41Z

Hi, the vectorized version must stay the default, because the non-vectorized version is very slow and I have had complaints in the past; that is why the vectorized version was put in place.

If there is no graceful way to make the higher precision work with the vectorized version, one option is not to capture the high precision by default but to activate it with an argument, and in that case avoid the vectorized path. I am saying this off the top of my head, as I have not looked at it. But in my experience so far, precision higher than seconds is rare (I have not seen it myself), so making it optional is an option.

belegdol · 2026-05-13T14:44:24Z

Does this still qualify as vectorized?

From 186d2b17589f85bf95b9c8f05a2c6f4379aed277 Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Wed, 13 May 2026 16:40:19 +0200
Subject: [PATCH] Working concept for SAS using long long

---
 pyreadstat/_readstat_parser.pyx | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..f2fb7ba 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -199,6 +199,8 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
     cdef int secs
     cdef double msecs
     cdef int usecs
+    cdef long long days_usecs
+    cdef long long unix_to_origin_usecs
     cdef object mydat
 
     # For polars we are going to return an epoch from unix origin, 
@@ -233,13 +235,16 @@ cdef object transform_datetime(py_datetime_format var_format, double tstamp, py_
             return mydat.date()
     elif var_format == DATE_FORMAT_DATETIME:
         if output_format == "polars":
-            # we want to return seconds from unix
+            unix_to_origin_usecs = <long long> (unix_to_origin_secs * 1e6)
+            # we want to return microseconds from unix
             if file_format == FILE_FORMAT_STATA:
-                # tstamp is in millisecons
-                return (tstamp/1000) - unix_to_origin_secs
+                # tstamp is in milliseconds
+                return (tstamp * 1e3) - unix_to_origin_usecs
             else:
                 # tstamp in seconds
-                return tstamp - unix_to_origin_secs
+                days_usecs = <long long> (floor(tstamp) * 1e6)
+                usecs = <int> (round(tstamp % 1 * 1e6))
+                return days_usecs - unix_to_origin_usecs + usecs
 
         if file_format == FILE_FORMAT_STATA:
             # tstamp is in millisecons
@@ -1107,7 +1112,7 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='us'))
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

Or does the entire double to timestamp conversion need to happen in from_epoch()? The code above works, but it definitely adds complexity.

belegdol · 2026-05-13T14:50:49Z

Another option would be to limit the feature to milliseconds and document that anything beyond that might be imprecise.
Millisecond precision is what I have actually seen in the wild, and it is also how much precision you get by calling datetime() in SAS.

ofajardo · 2026-05-13T16:13:47Z

milliseconds can be an option.

As described one problem is the speed, it has to at least match the current one.

Another problem with increasing the precision by default is that it may limit the oldest date that can be processed, and that is an issue when you have old datetimes in your data, in that sense seconds was very good. Please control for that as well.

belegdol · 2026-05-13T16:41:03Z

I will investigate. As things are currently in the master branch, the polars path will happily read fractional seconds from the SAS files, it is just that the results might differ in the last bit. Only the pandas path was explicitly discarding the fraction, which is what this PR changes.
My test file happens to expose the imprecision in the polars route. There is a slight chance that the way I generated it might be partially responsible for that though.
Regarding the date range, you are absolutely correct. At large distances from 1960, there are not enough mantissa bits for microsecond precision. Ideally we would reproduce how SAS renders the values, if that is possible without losing performance.
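To make the mantissa limit mentioned above concrete, here is a minimal sketch that prints the gap between neighbouring float64 values at a few dates. It assumes SAS datetimes are stored as float64 seconds counted from 1960-01-01, as discussed in this thread. `float_spacing_secs` is a name invented for this illustration, and `math.ulp` needs Python 3.9 or newer.

```python
import math
from datetime import datetime

SAS_EPOCH = datetime(1960, 1, 1)  # origin of SAS datetime values


def float_spacing_secs(dt):
    """Gap, in seconds, between adjacent float64 values at this SAS timestamp."""
    secs = (dt - SAS_EPOCH).total_seconds()
    return math.ulp(secs)  # ulp depends only on the magnitude of secs


for year in (1600, 1700, 1829, 2026, 2200):
    gap = float_spacing_secs(datetime(year, 1, 1))
    print(f"{year}: neighbouring doubles are {gap * 1e6:.3f} microseconds apart")
```

The gap doubles each time the distance from 1960 crosses a power of two in seconds, so the further a value is from 1960, the fewer fractional digits survive.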

belegdol · 2026-05-14T07:12:10Z

It looks like the way I created the file has no influence on the precision challenges.
While I have managed to develop a vectorized version of the code in #330 (comment), it still produces an off-by-microsecond date for one of 100 values tested. At first glance it looks like the subtraction in `return (tstamp/1000) - unix_to_origin_secs` (pyreadstat/pyreadstat/_readstat_parser.pyx, line 232 in 1adde46) changes the LSB enough to influence the rounding: 1829-09-11T10:47:37.282617 becomes 1829-09-11T10:47:37.282618. Before the subtraction, the floating point value is -4111996342.7173829079. After subtraction, it becomes -4427615542.7173824310. The former rounds up whereas the latter rounds down. It appears that one needs to convert to integer microseconds before the subtraction for maximum precision as done in #330 (comment), but this moves the majority of operations out of the vectorized polars code.
Given the nature of storing time as a floating point number, loss of precision due to rounding is expected at some point. How much is acceptable? Microsecond precision is only guaranteed for around 142 years around 1960 for SAS, anything beyond that is imprecise by nature.
Using a test file with milliseconds only would pass the tests but this feels like cheating.
Regarding the performance, is there a particular file you would like me to benchmark? Is the large dataset mentioned in the readme publicly available?
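The off-by-one-microsecond case described above can be reproduced in plain Python floats. This is only a sketch of the arithmetic, not the code in the PR: the function names are made up for illustration, and the constant assumes the 3653 days between 1960-01-01 and 1970-01-01.

```python
import math
from datetime import datetime, timedelta

UNIX_TO_SAS_ORIGIN_SECS = 315_619_200  # 3653 days * 86400 s (assumed offset)


def split_usecs(tstamp):
    """Split float seconds into (whole seconds, rounded microseconds)."""
    whole = math.floor(tstamp)
    return whole, round((tstamp - whole) * 1e6)


def naive_unix_usecs(sas_secs):
    """Shift the origin in floating point first, then convert to integers."""
    whole, usecs = split_usecs(sas_secs - UNIX_TO_SAS_ORIGIN_SECS)
    return whole * 1_000_000 + usecs


def integer_unix_usecs(sas_secs):
    """Convert to integers first, then shift the origin with integer math."""
    whole, usecs = split_usecs(sas_secs)
    return (whole - UNIX_TO_SAS_ORIGIN_SECS) * 1_000_000 + usecs


def to_datetime(unix_usecs):
    return datetime(1970, 1, 1) + timedelta(microseconds=unix_usecs)


sas_secs = -4111996342.717383  # the value quoted in the comment above
print(to_datetime(naive_unix_usecs(sas_secs)).isoformat())    # 1829-09-11T10:47:37.282618
print(to_datetime(integer_unix_usecs(sas_secs)).isoformat())  # 1829-09-11T10:47:37.282617
```

The float subtraction pushes the magnitude past 2^32 seconds, where the spacing between doubles doubles, so the last bit is rounded away before the microseconds are extracted.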

belegdol · 2026-05-14T09:54:07Z

Here is the vectorised version for reference:

From 32b083c82485f9472246d0cad15abbc8c0a2892e Mon Sep 17 00:00:00 2001
From: Julian Sikorski <belegdol+github@gmail.com>
Date: Thu, 14 May 2026 01:24:12 +0200
Subject: [PATCH] vectorised version

---
 pyreadstat/_readstat_parser.pyx | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/pyreadstat/_readstat_parser.pyx b/pyreadstat/_readstat_parser.pyx
index 42a0390..ef73a27 100644
--- a/pyreadstat/_readstat_parser.pyx
+++ b/pyreadstat/_readstat_parser.pyx
@@ -1107,7 +1107,14 @@ cdef object dict_to_dataframe(object dict_data, data_container dc):
                 if var_format == DATE_FORMAT_DATE:
                     date_cols.append(column)
             if datetime_cols:
-                data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*datetime_cols), time_unit='s'))
+                data_frame = data_frame.with_columns(
+                    [
+                        pl.from_epoch(
+                            (pl.col(c) % 1 * 1e6).round().cast(pl.Int64) + (pl.col(c).floor() * 1e6).cast(pl.Int64),
+                            time_unit='us')
+                        for c in datetime_cols if data_frame[c].len() > 0
+                    ]
+                )
             if date_cols:
                 data_frame = data_frame.with_columns(pl.from_epoch(pl.col(*date_cols), time_unit='d'))
 
-- 
2.54.0

belegdol · 2026-05-14T12:43:00Z

It turns out that generating a test file with precision limited to milliseconds only is not enough. I generated one with 1000 datetimes between years 1700 and 2200, and I still got one mismatch: 1825-05-31T22:55:23.626 ended up as 1825-05-31T22:55:23.625999 when imported from SAS with the vectorized code. Same as previously, subtraction changed the seventh digit after the radix: -4247082276.3740000725 became -4562701476.3740005493.
It would appear that for polars one would have to explicitly round down to milliseconds in order to ensure a complete match.
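As a sketch of that idea, using the value quoted above and plain Python floats rather than polars expressions: extracting microseconds from the shifted float exposes the error, while rounding the same fraction to milliseconds absorbs it. `frac_units` and the offset constant are assumptions made for this illustration.

```python
import math

UNIX_TO_SAS_ORIGIN_SECS = 315_619_200  # assumed 1960 -> 1970 offset in seconds


def frac_units(tstamp, units_per_sec):
    """Fractional part of a float timestamp, rounded to the given resolution."""
    return round((tstamp - math.floor(tstamp)) * units_per_sec)


# Origin shift done in floating point, as in the vectorized path
shifted = -4247082276.374 - UNIX_TO_SAS_ORIGIN_SECS

print(frac_units(shifted, 1e6))  # 625999: one microsecond short of .626
print(frac_units(shifted, 1e3))  # 626: rounding to milliseconds hides the error
```

This only works when the source data truly has no more than millisecond precision; real microsecond values would be truncated by the same rounding.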

belegdol · 2026-05-14T16:48:58Z

I managed to get it working. tests/test_narwhalified.py ran in 3.081s with this PR applied on top of master vs 3.367s on master. If you have a heavier dataset to test, please let me know.
Please note that for this PR to work on dev, my other PR needs to be merged first, otherwise the formats will not be recognised properly. Alternatively, the changes can be applied on top of master.

ofajardo · 2026-05-18T14:26:54Z

hi @belegdol, below is the code to generate the test data set. In my case, as you can see, I am saving it as parquet (but you can use csv, or whatever), then I am reading that parquet file and saving it as an xport file using pyreadstat. Then I am benchmarking how long it takes to read that xport file using pyreadstat. On the current dev branch, it is 6 seconds using pandas and 3 seconds using polars. If I switch to your fork, it is 7 seconds using pandas (where did that extra second come from?) but 25 seconds using polars, so I am afraid your procedure is way too slow (assuming I am testing correctly). Please check it on your side; I am also curious to see how it does with sas7bdat if you manage to do that.

So, if my measurement is good, I think speed would really be an issue here.

Just out of curiosity: what is the earliest date you can read using dev vs using your branch? I see in your test file you have dates back to 1700s, so I think that is enough, but curious about what would be the difference.

A little detail I did not like too much in your code is lines 232 and 235 in _readstat_parser.pyx: you are returning a tuple. It was already bad before that I was sometimes returning a timestamp and sometimes a python date/time object, and now we are increasing the variability by adding tuples (ideally I would like a function to return a constant type, a rule that I have already broken before, as I said). However, I do not see clearly why that is needed: unix_to_origin_secs is a constant and you have it stored in the data container (see for example line 305), so you could also get this value after line 1051 and use it straight away.

The code for the test dataset:

import numpy as np
import pandas as pd
# import pyarrow as pa

n = 300_000
n_variables_per_type = 10

rng = np.random.default_rng(seed=42)

dat = pd.DataFrame({"id": np.arange(1, n + 1)})

fct_levels = ["A", "B", "C", "D", "E", "F", "G", "H"]

for i in range(1, n_variables_per_type + 1):
    dat[f"fct_{i}"] = pd.Categorical(
        rng.choice(fct_levels, size=n), categories=fct_levels
    )
    dat[f"chr_{i}"] = [
        "".join(rng.choice(list("abcdefghijklmnopqrstuvwxyz0123456789"), size=15))
        for _ in range(n)
    ]
    dat[f"num_{i}"] = rng.uniform(-1_000_000, 1_000_000, size=n)
    dat[f"date_{i}"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(
        rng.integers(-1000, 1000, size=n, endpoint=True), unit="D"
       #, dtype=pd.ArrowDtype(pa.date32())
    )
    dat[f"dt_{i}"] = pd.Timestamp("2020-01-01 12:00:01") + pd.to_timedelta(
        rng.integers(-70_000_000, 70_000_000, size=n, endpoint=True), unit="s"
    )

dat.to_parquet("./benchmark_py.parquet")

note: notice that the columns date_* should actually be transformed to a pyarrow date32 so that they become a date type in the parquet file, here I am leaving them as datetime so that in practice we have more datetime columns, that makes the difference more salient. You can also check with the correct transformation in.

belegdol · 2026-05-18T17:02:39Z

Hi @ofajardo,

thank you for the feedback! I got benchmark data consistent with yours:
dev branch:

Time taken pandas: 6.860337 seconds
Time taken polars: 3.666543 seconds

This PR with unix_to_origin_secs passed as a tuple:

Time taken pandas: 7.347710 seconds
Time taken polars: 36.542379 seconds

I took your suggestion regarding fetching unix_to_origin_secs from the data container. I was not aware it was possible, it also pretty much solves the slowness:

Time taken pandas: 7.200199 seconds
Time taken polars: 3.733002 seconds

The extra second in the pandas path is probably a result of having to process the fractional part of the timestamp and the integer math required to do that.
Regarding the earliest possible date, I will try to test it tomorrow.

belegdol · 2026-05-18T22:09:48Z

Regarding the datetime range, I have tested 0001-01-01 12:34:56 and 9999-12-31 23:45:56. Both survive being written to .xpt and read back into a pandas/polars dataframe.

ofajardo · 2026-05-19T10:06:20Z

Perfect, thanks a lot!

ofajardo · 2026-05-19T10:30:29Z

@belegdol there is an issue: tests are failing for python 3.10, see here for example, could you please take a look?

ofajardo · 2026-05-19T10:44:39Z

ok, it seems to me that it is not that bad, what happens is that the dataset you built to test the fractional second thing has very old dates, and in pandas 2 there is an overflow because it uses nanoseconds as default. One simple option would be to skip that test if pandas 2. Maybe there is a more elegant solution? (forcing it to microseconds?)
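The overflow comes down to how far a signed 64-bit tick counter reaches at each resolution. The sketch below checks this in pure Python for the two extreme datetimes tested earlier in this thread; `epoch_ticks` and `fits_in_int64` are hypothetical helpers, not pandas or pyreadstat functions.

```python
from datetime import datetime

UNIX_EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def epoch_ticks(dt, ticks_per_second):
    """Ticks since the Unix epoch at the given resolution, as an exact int."""
    delta = dt - UNIX_EPOCH
    whole_secs = delta.days * 86_400 + delta.seconds
    return (whole_secs * ticks_per_second
            + delta.microseconds * ticks_per_second // 1_000_000)


def fits_in_int64(dt, ticks_per_second):
    """True if the datetime can be stored as an int64 count of ticks."""
    return INT64_MIN <= epoch_ticks(dt, ticks_per_second) <= INT64_MAX


for dt in (datetime(1, 1, 1, 12, 34, 56), datetime(9999, 12, 31, 23, 45, 56)):
    print(dt.isoformat(),
          "ns:", fits_in_int64(dt, 1_000_000_000),  # False for both
          "us:", fits_in_int64(dt, 1_000_000))      # True for both
```

At nanosecond resolution the int64 range covers only about 292 years on either side of 1970, whereas at microsecond resolution it covers every date Python's `datetime` can represent.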

ofajardo · 2026-05-19T11:48:28Z

I adapted the tests to pass on python 3.10, checking the CI/CD pipeline now. Let me know in case you have a better idea.

belegdol · 2026-05-19T11:54:43Z

I am looking into this but I am not sure what the culprit is. According to https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution, microsecond resolution should already be supported in 2.0.
I will try to investigate this further, but I need a VM or something with python 3.10 to test properly. My development machine runs Fedora 44 and ships with python-3.14.

ofajardo · 2026-05-19T12:00:02Z

I think you should be able to test on python 3.14 if you install pandas 2.3.3 or older. But yes, what you read works, please look at my latest commit on dev, I solved it in that way and let me know what you think (all tests are passing now in all versions).

belegdol · 2026-05-19T12:04:27Z

I like your solution. It still preserves the ability to process the dates outside of the datetime64[ns] range, and likely just replicates what pandas 3.0 does automatically.

ofajardo · 2026-05-19T12:10:49Z

ok, thanks for reviewing, in such case I will do a release soon

belegdol · 2026-05-19T12:13:31Z

Have you tried the unit parameter of to_datetime? It would be slightly cleaner if it works. The behaviour change is described here:
https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#backwards-incompatible-api-changes

belegdol · 2026-05-19T12:15:10Z

Nevermind, unit parameter is for something else. Apologies for the noise.

ofajardo · 2026-05-19T12:18:44Z

yeah, I tried that I think and it raised an error.

ofajardo · 2026-05-19T12:23:37Z

version 1.3.5 is out with the latest changes! thanks a lot for your PR!

belegdol force-pushed the increased_datetime_time_precision branch from 8fd1d3f to 2ab3a38 Compare May 12, 2026 08:03

belegdol marked this pull request as ready for review May 12, 2026 08:07

belegdol changed the base branch from master to dev May 12, 2026 20:13

belegdol and others added 4 commits May 18, 2026 17:17

Read fractions of seconds from SAS datasets

ddb36d9

Add a test for a SAS file with fractional seconds time

b2fed94

Prevent rounding errors when processing datetime to polars

0818df5

Datetime values in SAS, SPSS and STATA are stored as a floating point number. Any operation risks introducing a rounding error. Use integer math in order to preserve original interpretation.

Fetch unix_to_origin_secs from data container

03e52af

This prevents `transform_datetime()` from having to return a tuple.

belegdol force-pushed the increased_datetime_time_precision branch from 444e946 to 03e52af Compare May 18, 2026 17:02

ofajardo merged commit 3cfc6ee into Roche:dev May 19, 2026
3 checks passed

belegdol deleted the increased_datetime_time_precision branch May 19, 2026 10:08
