notify-about-new-ACS-release script #438

Open
fiendskrah wants to merge 10 commits into oturns:main from fiendskrah:add-acs-release-check

Conversation

@fiendskrah

Here's a draft of a script (suggested in #425) that checks the Census FTP server for a new ACS release by looking for specific files (tracts and BGs), then opens a new issue on the repository notifying that these are ready to be processed. I tested this locally and was able to make it recognize the 2022 release and open this issue ticket.

Some potential issues:

  1. The most current year is currently hard-coded in the script (line 14); I wasn't sure how to infer it from the files in the repository.

  2. The script looks for specific file names; if the naming conventions happen to change, this will break.

  3. GitHub token usage: this is all pretty new to me, but I was able to make it work with a 'classic token' that I set up on my local machine. There is probably a more sophisticated way to set this up where a GitHub Action performs the check rather than my personal account.

  4. This just recognizes the new vintage; it doesn't download, process, or upload to Quilt.

I'm happy to iterate on this if there are any desired changes.
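For reference, a minimal sketch of the kind of check described above, assuming illustrative file URLs on the Census server and using the GitHub REST API via requests; the filenames, repo slug, and token handling are placeholders, not the actual script:

```python
import os
import requests

YEAR = 2023  # placeholder for the vintage being checked
# Illustrative URLs; the real script checks specific tract and block-group files.
CANDIDATE_FILES = [
    f"https://www2.census.gov/geo/tiger/TIGER_DP/{YEAR}ACS/ACS_{YEAR}_5YR_TRACT.gdb.zip",
    f"https://www2.census.gov/geo/tiger/TIGER_DP/{YEAR}ACS/ACS_{YEAR}_5YR_BG.gdb.zip",
]

def release_available(urls):
    """Return True only if every expected file responds with HTTP 200."""
    return all(requests.head(u, allow_redirects=True, timeout=30).ok for u in urls)

def open_issue(repo, title, body, token):
    """Open a GitHub issue via the REST API."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers={"Authorization": f"token {token}", "Accept": "application/vnd.github+json"},
        json={"title": title, "body": body},
        timeout=30,
    )
    resp.raise_for_status()

if release_available(CANDIDATE_FILES):
    open_issue(
        "oturns/geosnap",  # placeholder repo slug
        f"{YEAR} ACS release detected",
        "Tract and block-group files are available and ready to be processed.",
        os.environ["GITHUB_TOKEN"],
    )
```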

@knaaptime
Member

Nice. This is also a great start to #355. A couple of scattered thoughts:

To continue building the dataset and pushing up to quilt, this action would only need to run a few more functions (the AWS creds are already available as a repo secret). If we try to run the whole pipeline we might hit some resource limits, but it's worth a shot.

I think we can also get the LATEST_SUPPORTED_YEAR on line 14 to update dynamically. Once the script successfully downloads the census data, that line can be updated.

We should also add some checks to make sure the files have all the necessary data, not just that they're present. This could probably just be a size check, since the datasets are at least roughly 1.5 GB.
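One low-lift way to do that size check is a HEAD request against the remote file, comparing Content-Length to a rough floor. A sketch, assuming the Census server reports Content-Length and using the ~1.5 GB figure above as a placeholder threshold:

```python
import requests

MIN_EXPECTED_BYTES = 1_500_000_000  # rough floor from the comment above, not a verified number

def remote_file_looks_complete(url, min_bytes=MIN_EXPECTED_BYTES):
    """Return True if the server reports a Content-Length at or above min_bytes."""
    resp = requests.head(url, allow_redirects=True, timeout=30)
    resp.raise_for_status()
    size = int(resp.headers.get("Content-Length", 0))
    return size >= min_bytes
```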

@knaaptime
Member

The test failures are from #436, so don't worry about those. I was cautiously hopeful that might get fixed quickly because I think the spatial stuff in duckdb-spatial is kinda popular, but I think we may have to go ahead and pin <1.5.

@knaaptime
Member

Can we add a manual trigger too? We want to be able to fire this off on demand to check, and/or rerun it if a run is incomplete and needs to go again, etc.

Comment thread on build/examine_output.ipynb (outdated)
"missing21 = sorted(needed - present21)\n",
"missing22 = sorted(needed - present22)\n",
"\n",
"newly_missing_in_2022 = sorted((needed - present22) - (needed - present21))\n",
Member

why not just missing22 - missing21?

@fiendskrah
Author

fiendskrah commented Apr 9, 2026

I was able to build *a version* of the 2022 table, but upon inspection it doesn't seem to be valid. I initially thought it was just an issue with 'GEOID' being renamed in certain tables (hence the change to io.utils.py), but none of the geosnap variables appear to be constructed from the tables that I'm pinging.

I was initially pulling from this URL: https://www2.census.gov/geo/tiger/TIGER_DP/
But maybe this one is what I need?: https://www2.census.gov/programs-surveys/acs/summary_file/2022/table-based-SF/data/5YRData/

Before I get into trying to download and parse these few thousand .dat files, I figured you'd be able to tell me if I'm barking up the wrong tree. It's hard for me to know if my errors are due to the code, the location of the data tables, or recent changes to the ACS. Are the tools in io able to unpack these .dat files?
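If these .dat files are pipe-delimited text (which appears to be the case for the table-based summary file), pandas may be able to read one directly. A sketch under that assumption; the filename below is illustrative, not verified:

```python
import pandas as pd

# Illustrative filename; the actual names under 5YRData/ may differ.
URL = (
    "https://www2.census.gov/programs-surveys/acs/summary_file/2022/"
    "table-based-SF/data/5YRData/acsdt5y2022-b01001.dat"
)

# Assumes a pipe-delimited layout with a GEO_ID column plus the table's variables.
df = pd.read_csv(URL, sep="|", dtype=str)
print(df.head())
```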

@knaaptime left a comment
Member

One other thing: we want to keep the intermediate tables as well (save_intermediate=True in process_acs). We should look at how the structure of those tables has changed too (and also save and upload them).

Comment thread on build/examine_output.ipynb (outdated)
"Missing in 2021 but present in 2022: 0\n",
"\n",
"First 100 newly missing in 2022:\n",
"['B01001_003E', 'B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_018E', 'B01001_019E', 'B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_027E', 'B01001_028E', 'B01001_029E', 'B01001_030E', 'B01001_042E', 'B01001_043E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E', 'B01003_001E', 'B02001_006E', 'B03002_003E', 'B03002_004E', 'B03002_005E', 'B03002_006E', 'B03002_007E', 'B03002_012E', 'B12001_001E', 'B12001_005E', 'B12001_007E', 'B12001_009E', 'B12001_010E', 'B12001_016E', 'B12001_018E', 'B12001_019E', 'B15002_001E', 'B15002_003E', 'B15002_004E', 'B15002_005E', 'B15002_006E', 'B15002_007E', 'B15002_008E', 'B15002_009E', 'B15002_010E', 'B15002_015E', 'B15002_016E', 'B15002_017E', 'B15002_018E', 'B15002_020E', 'B15002_021E', 'B15002_022E', 'B15002_023E', 'B15002_024E', 'B15002_025E', 'B15002_026E', 'B15002_027E', 'B15002_032E', 'B15002_033E', 'B15002_034E', 'B15002_035E', 'B17010_001E', 'B17010_004E', 'B17010_011E', 'B17010_017E', 'B19001_001E', 'B19013_001E', 'B19301_001E', 'B21001_002E', 'B25002_001E', 'B25002_002E', 'B25002_003E', 'B25003_001E', 'B25003_002E', 'B25003_003E', 'B25024_001E', 'B25024_004E', 'B25024_005E', 'B25024_006E', 'B25024_007E', 'B25024_008E', 'B25024_009E', 'B25058_001E', 'B25077_001E', 'C24010_001E']\n"
Member

This is interesting, and not great. Just spot checking, these variables cover a ton of ground: age, marital status, housing tenure, multiunit structures, etc.

It's possible these have different analogues/variable names in the new ACS, but we'd need to dig more.

Member

Wait, that's every variable missing in 2022? If that's the case, it's more likely a code issue than truly missing data, as I don't think the ACS has changed that much in the 2022 release.

Member

Might want to loop @jvtcl into the conversation for confirmation either way. He might at least have an immediate answer if it's a red herring.

@knaaptime
Member

https://www2.census.gov/programs-surveys/acs/summary_file/2022/table-based-SF/data/5YRData/

Wow, is this every variable in the entire ACS? That could be handy.

I've never worked with .dat files, but let's try a couple of other avenues first. First, take a look at the intermediate tables like I said above. It may be that the "demographic profile" or whatever has changed, so the first thing to do is check out those files.

I really like having the whole DP available because there are some useful things up there that we don't include in geosnap because they're not in the LTDB (e.g. household earnings by income bin, which is what you use for measuring income segregation). Ideally we can figure that out. But alternatively (and maybe additionally), we should have a failover that grabs all these variables from the API using something like cenpy. This is what we used to do, before I discovered all the DP data sitting on the Census server. But if we do that, we should try to parallelize for performance, batch by state or something, and make sure there's some good logic to avoid duplication if the script needs to be run multiple times, etc. (you can see why I liked the DP). We also might want to try to manually recreate the DP tables using cenpy if we need to go that route.
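A minimal sketch of what that API failover could look like, using plain requests against the ACS 5-year endpoint rather than cenpy and batching by state; the variable list, year, and output handling below are placeholders:

```python
import requests

ACS_URL = "https://api.census.gov/data/2022/acs/acs5"  # 5-year detailed tables endpoint
VARIABLES = ["B01003_001E", "B19013_001E"]  # placeholder subset of the needed columns
STATE_FIPS = ["06", "36"]  # placeholder; iterate over all states in practice

def fetch_tracts_for_state(state_fips, variables, api_key=None):
    """Pull tract-level estimates for one state; returns the header row plus data rows."""
    params = {"get": ",".join(["NAME"] + variables), "for": "tract:*", "in": f"state:{state_fips}"}
    if api_key:
        params["key"] = api_key
    resp = requests.get(ACS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

for fips in STATE_FIPS:
    rows = fetch_tracts_for_state(fips, VARIABLES)
    print(fips, len(rows) - 1, "tract records")  # first row is the header
```

Batching by state keeps each request small, and caching responses to disk keyed by state would give the avoid-duplication-on-rerun behavior mentioned above.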

@knaaptime
Member

One other thing that occurs to me: instead of hardcoding the YEAR, the script could just check for the most recent file that exists in the spatial-ucr S3 bucket?
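A sketch of that check with boto3, assuming the processed files are keyed by year under some prefix; the prefix and filename pattern below are placeholders, not the bucket's actual layout:

```python
import re
import boto3

BUCKET = "spatial-ucr"
PREFIX = "census/acs/"  # placeholder prefix; adjust to the bucket's real layout

def latest_processed_year(bucket=BUCKET, prefix=PREFIX):
    """Scan keys under the prefix and return the largest 4-digit year found."""
    s3 = boto3.client("s3")
    years = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            match = re.search(r"(20\d{2})", obj["Key"])
            if match:
                years.add(int(match.group(1)))
    return max(years) if years else None
```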

@knaaptime
Member

The universe of columns we need is the full set in this file, and similarly for blockgroups.

@knaaptime
Member

knaaptime commented Apr 13, 2026

@fiendskrah ok, now that I've looked at one of the 2022 tables in the geodatabase, the reason you're getting no results is that the naming convention has changed. Your PR includes an update for the geoid column, but there are other systematic changes. In the new tables, the variables are named (as an example) B02001_E001. We need processing that anticipates this format, then converts it to the canonical form used in the JSON tables, B02001_001E (where E/M is the final character of the variable rather than the leading character).
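A minimal sketch of that renaming rule, inferred only from the single example above; edge cases like entirely new tables or other column types would still need checking:

```python
import re

# Matches e.g. "B02001_E001" or "B02001_M001": table id, E/M marker, 3-digit column number.
NEW_STYLE = re.compile(r"^([A-Z]\d{5}[A-Z]?)_([EM])(\d{3})$")

def to_canonical(varname):
    """Convert 2022-style names (B02001_E001) to the canonical form (B02001_001E)."""
    match = NEW_STYLE.match(varname)
    if not match:
        return varname  # leave GEO_ID, NAME, and already-canonical names untouched
    table, em, col = match.groups()
    return f"{table}_{col}{em}"

assert to_canonical("B02001_E001") == "B02001_001E"
assert to_canonical("B02001_M001") == "B02001_001M"
```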

@fiendskrah
Author

> In the new tables, the variables are named (as an example) B02001_E001. We need processing that anticipates this format, then converts it to the canonical form used in the JSON tables, B02001_001E (where E/M is the final character of the variable rather than the leading character).

The examine notebook now has a bunch of diagnostics looking into how systematic this is. To me, it looks like a simple renaming rule (moving the e) would suffice for most cases, but there are also new tables to consider, and in a handful of tables, new variables.

@knaaptime
Member

So we need a function analogous to this one that handles the new format when year >= 2022 and gets called similarly when processing the geodatabase.

@knaaptime
Member

In parallel, I would also love a little utility that builds all these tables from scratch by hitting the Census API directly (e.g. using cenpy). In that case we could always be up to date, not waiting around in the gap between the latest ACS release and when they publish the demographic profile on the FTP.

@fiendskrah
Author

> So we need a function analogous to this one that handles the new format when year >= 2022 and gets called similarly when processing the geodatabase.

The latest commit changes the io module to try to accomplish this, but I'm not sure how to verify the accuracy.

In short, the changes are:

  1. Adjusted reformat_acs_vars (now normalize_acs_vars) to recognize the variable name change and make appropriate adjustments
  2. Added a little helper function to do the same thing, but specific to the GEOID, just to move that if/elif block out of the main function.

The resulting demographic_profile.parquet and processed_acs output are too big to put in the cloud, but I updated the notebook (just to show they are read in) and the variables.csv (I will pull this out of the PR before merging, as I think I botched the indexing, but maybe it's helpful for you to assess the accuracy?). I ran convert_census_gdb and process_acs in isolation from bash, so maybe the next step is to get these into a GH workflow (e.g. update tools/check_acs_release.py)?

@knaaptime
Member

Hm. You can see in the notebook that the GEOID isn't being handled properly, which is screwing up the merge and subsequent calculations.

[Screenshot 2026-04-15 at 10 31 32 AM: notebook output showing fully qualified GEOIDs and rows with missing data]

We shouldn't have the 'fully qualified' FIPS code anywhere (drop those 14/150000US.. prefixes; there should be logic for that in the script?). Those rows with missing data are a problem.
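A tiny sketch of the kind of logic that could drop the fully qualified prefix, assuming the qualified form is always the summary-level prefix followed by "US" and then the bare FIPS code; the column name below is a placeholder:

```python
import pandas as pd

def strip_geoid_prefix(geoid):
    """Drop a '14...US'/'15...US'-style prefix, keeping the bare FIPS code."""
    if isinstance(geoid, str) and "US" in geoid:
        return geoid.split("US", 1)[1]
    return geoid

# Example usage on a hypothetical dataframe column named "GEOID"
df = pd.DataFrame({"GEOID": ["1400000US06037101110", "06037101122"]})
df["GEOID"] = df["GEOID"].map(strip_geoid_prefix)
print(df)
```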

Also, it's useful to look at it here, but we don't want to commit that diagnostic notebook to the repo (maybe put it in a gist instead?). And get rid of those changes to the variables.csv file. It looks like maybe you added some 2022-specific columns, but since we know the ACS hasn't changed, those columns wouldn't be necessary.
