notify-about-new-ACS-release script #438

Open
fiendskrah wants to merge 10 commits into oturns:main from fiendskrah:add-acs-release-check

Conversation

@fiendskrah

Here's a draft of a script (suggested in #425) that checks the Census FTP server for a new ACS release by looking for specific files (tracts and BGs), then opens a new issue on the repository notifying that these are ready to be processed. I tested this locally and was able to make it recognize the 2022 release and open this issue ticket.

Some potential issues:

  1. The most current year is currently hard-coded in the script (line 14); I wasn't sure how to infer it from the files in the repository.

  2. The script looks for specific file names; if the naming conventions happen to change, this will break.

  3. GitHub token usage: this is all pretty new to me, but I was able to make it work with a 'classic token' that I set up on my local machine. There is probably a more sophisticated way to set this up where a GitHub Action performs the check rather than my personal account.

  4. This just recognizes the new vintage; it doesn't download, process, or upload to Quilt.

I'm happy to iterate on this if there are any desired changes.
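For reference, a minimal sketch of the kind of check described above, assuming illustrative file URLs on the Census server and using the GitHub REST API via requests; the filenames, repo slug, and token handling are placeholders, not the actual script:

```python
import os
import requests

YEAR = 2023  # placeholder for the vintage being checked
# Illustrative URLs; the real script checks specific tract and block-group files.
CANDIDATE_FILES = [
    f"https://www2.census.gov/geo/tiger/TIGER_DP/{YEAR}ACS/ACS_{YEAR}_5YR_TRACT.gdb.zip",
    f"https://www2.census.gov/geo/tiger/TIGER_DP/{YEAR}ACS/ACS_{YEAR}_5YR_BG.gdb.zip",
]

def release_available(urls):
    """Return True only if every expected file responds with HTTP 200."""
    return all(requests.head(u, allow_redirects=True, timeout=30).ok for u in urls)

def open_issue(repo, title, body, token):
    """Open a GitHub issue via the REST API."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers={"Authorization": f"token {token}", "Accept": "application/vnd.github+json"},
        json={"title": title, "body": body},
        timeout=30,
    )
    resp.raise_for_status()

if release_available(CANDIDATE_FILES):
    open_issue(
        "oturns/geosnap",  # placeholder repo slug
        f"{YEAR} ACS release detected",
        "Tract and block-group files are available and ready to be processed.",
        os.environ["GITHUB_TOKEN"],
    )
```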

@knaaptime
Member

Nice. This is also a great start to #355. A couple of scattered thoughts:

To continue building the dataset and pushing up to quilt, this action would only need to run a few more functions (the AWS creds are already available as a repo secret). If we try to run the whole pipeline we might hit some resource limits, but it's worth a shot.

I think we can also get the LATEST_SUPPORTED_YEAR on line 14 to update dynamically. Once the script successfully downloads the census data, that line can be updated.

We should also add some checks to make sure the files have all the necessary data, not just that they're present. This could probably just be a size check, since the datasets are at least roughly 1.5 GB.
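One low-lift way to do that size check is a HEAD request against the remote file, comparing Content-Length to a rough floor. A sketch, assuming the Census server reports Content-Length and using the ~1.5 GB figure above as a placeholder threshold:

```python
import requests

MIN_EXPECTED_BYTES = 1_500_000_000  # rough floor from the comment above, not a verified number

def remote_file_looks_complete(url, min_bytes=MIN_EXPECTED_BYTES):
    """Return True if the server reports a Content-Length at or above min_bytes."""
    resp = requests.head(url, allow_redirects=True, timeout=30)
    resp.raise_for_status()
    size = int(resp.headers.get("Content-Length", 0))
    return size >= min_bytes
```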

@knaaptime
Member

The test failures are from #436, so don't worry about those. I was cautiously hopeful that might get fixed quickly because I think the spatial stuff in duckdb-spatial is kinda popular, but I think we may have to go ahead and pin <1.5.

@knaaptime
Member

Can we add a manual trigger too? We want to be able to fire this off on demand to check, and/or rerun it if a run is incomplete and needs to go again, etc.

Comment thread on build/examine_output.ipynb (outdated)
"missing21 = sorted(needed - present21)\n",
"missing22 = sorted(needed - present22)\n",
"\n",
"newly_missing_in_2022 = sorted((needed - present22) - (needed - present21))\n",
Member

why not just missing22 - missing21?

@fiendskrah
Author

fiendskrah commented Apr 9, 2026

I was able to build *a version* of the 2022 table, but upon inspection it doesn't seem to be valid. I initially thought it was just an issue with 'GEOID' being renamed in certain tables (hence the change to io.utils.py), but none of the geosnap variables appear to be constructed from the tables that I'm pinging.

I was initially pulling from this URL: https://www2.census.gov/geo/tiger/TIGER_DP/
But maybe this one is what I need?: https://www2.census.gov/programs-surveys/acs/summary_file/2022/table-based-SF/data/5YRData/

Before I get into trying to download and parse these few thousand .dat files, I figured you'd be able to tell me if I'm barking up the wrong tree. It's hard for me to know if my errors are due to the code, the location of the data tables, or recent changes to the ACS. Are the tools in io able to unpack these .dat files?
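If these .dat files are pipe-delimited text (which appears to be the case for the table-based summary file), pandas may be able to read one directly. A sketch under that assumption; the filename below is illustrative, not verified:

```python
import pandas as pd

# Illustrative filename; the actual names under 5YRData/ may differ.
URL = (
    "https://www2.census.gov/programs-surveys/acs/summary_file/2022/"
    "table-based-SF/data/5YRData/acsdt5y2022-b01001.dat"
)

# Assumes a pipe-delimited layout with a GEO_ID column plus the table's variables.
df = pd.read_csv(URL, sep="|", dtype=str)
print(df.head())
```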

@knaaptime left a comment
Member

One other thing: we want to keep the intermediate tables as well (save_intermediate=True in process_acs). We should look at how the structure of those tables has changed too (and also save and upload them).

Comment thread on build/examine_output.ipynb (outdated)
"Missing in 2021 but present in 2022: 0\n",
"\n",
"First 100 newly missing in 2022:\n",
"['B01001_003E', 'B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_018E', 'B01001_019E', 'B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_027E', 'B01001_028E', 'B01001_029E', 'B01001_030E', 'B01001_042E', 'B01001_043E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E', 'B01003_001E', 'B02001_006E', 'B03002_003E', 'B03002_004E', 'B03002_005E', 'B03002_006E', 'B03002_007E', 'B03002_012E', 'B12001_001E', 'B12001_005E', 'B12001_007E', 'B12001_009E', 'B12001_010E', 'B12001_016E', 'B12001_018E', 'B12001_019E', 'B15002_001E', 'B15002_003E', 'B15002_004E', 'B15002_005E', 'B15002_006E', 'B15002_007E', 'B15002_008E', 'B15002_009E', 'B15002_010E', 'B15002_015E', 'B15002_016E', 'B15002_017E', 'B15002_018E', 'B15002_020E', 'B15002_021E', 'B15002_022E', 'B15002_023E', 'B15002_024E', 'B15002_025E', 'B15002_026E', 'B15002_027E', 'B15002_032E', 'B15002_033E', 'B15002_034E', 'B15002_035E', 'B17010_001E', 'B17010_004E', 'B17010_011E', 'B17010_017E', 'B19001_001E', 'B19013_001E', 'B19301_001E', 'B21001_002E', 'B25002_001E', 'B25002_002E', 'B25002_003E', 'B25003_001E', 'B25003_002E', 'B25003_003E', 'B25024_001E', 'B25024_004E', 'B25024_005E', 'B25024_006E', 'B25024_007E', 'B25024_008E', 'B25024_009E', 'B25058_001E', 'B25077_001E', 'C24010_001E']\n"
Member

This is interesting, and not great. Just spot checking, these variables cover a ton of ground: age, marital status, housing tenure, multiunit structures, etc.

It's possible these have different analogues/variable names in the new ACS, but we'd need to dig more.

Member

Wait, that's every variable missing in 2022? If that's the case, it's more likely a code issue than truly missing data, as I don't think the ACS has changed that much in the 2022 release.

Member

Might want to loop @jvtcl into the conversation for confirmation either way. He might at least have an immediate answer if it's a red herring.

@knaaptime
Member

https://www2.census.gov/programs-surveys/acs/summary_file/2022/table-based-SF/data/5YRData/

Wow, is this every variable in the entire ACS? That could be handy.

I've never worked with .dat files, but let's try a couple of other avenues first. First, take a look at the intermediate tables like I said above. It may be that the "demographic profile" or whatever has changed, so the first thing to do is check out those files.

I really like having the whole DP available because there are some useful things up there that we don't include in geosnap because they're not in the LTDB (e.g. household earnings by income bin, which is what you use for measuring income segregation). Ideally we can figure that out. But alternatively (and maybe additionally), we should have a failover that grabs all these variables from the API using something like cenpy. This is what we used to do, before I discovered all the DP data sitting on the Census server. But if we do that, we should try to parallelize for performance, batch by state or something, and make sure there's some good logic to avoid duplication if the script needs to be run multiple times, etc. (you can see why I liked the DP). We also might want to try to manually recreate the DP tables using cenpy if we need to go that route.
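A minimal sketch of what that API failover could look like, using plain requests against the ACS 5-year endpoint rather than cenpy and batching by state; the variable list, year, and output handling below are placeholders:

```python
import requests

ACS_URL = "https://api.census.gov/data/2022/acs/acs5"  # 5-year detailed tables endpoint
VARIABLES = ["B01003_001E", "B19013_001E"]  # placeholder subset of the needed columns
STATE_FIPS = ["06", "36"]  # placeholder; iterate over all states in practice

def fetch_tracts_for_state(state_fips, variables, api_key=None):
    """Pull tract-level estimates for one state; returns the header row plus data rows."""
    params = {"get": ",".join(["NAME"] + variables), "for": "tract:*", "in": f"state:{state_fips}"}
    if api_key:
        params["key"] = api_key
    resp = requests.get(ACS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

for fips in STATE_FIPS:
    rows = fetch_tracts_for_state(fips, VARIABLES)
    print(fips, len(rows) - 1, "tract records")  # first row is the header
```

Batching by state keeps each request small, and caching responses to disk keyed by state would give the avoid-duplication-on-rerun behavior mentioned above.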

@knaaptime
Member

One other thing that occurs to me: instead of hardcoding the YEAR, the script could just check for the most recent file that exists in the spatial-ucr S3 bucket?
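A sketch of that check with boto3, assuming the processed files are keyed by year under some prefix; the prefix and filename pattern below are placeholders, not the bucket's actual layout:

```python
import re
import boto3

BUCKET = "spatial-ucr"
PREFIX = "census/acs/"  # placeholder prefix; adjust to the bucket's real layout

def latest_processed_year(bucket=BUCKET, prefix=PREFIX):
    """Scan keys under the prefix and return the largest 4-digit year found."""
    s3 = boto3.client("s3")
    years = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            match = re.search(r"(20\d{2})", obj["Key"])
            if match:
                years.add(int(match.group(1)))
    return max(years) if years else None
```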

@knaaptime
Member

The universe of columns we need is the full set in this file, and similarly for blockgroups.

@knaaptime
Member

knaaptime commented Apr 13, 2026

@fiendskrah ok, now that I've looked at one of the 2022 tables in the geodatabase, the reason you're getting no results is that the naming convention has changed. Your PR includes an update for the geoid column, but there are other systematic changes. In the new tables, the variables are named (as an example) B02001_E001. We need processing that anticipates this format, then converts it to the canonical form used in the JSON tables, B02001_001E (where E/M is the final character of the variable rather than the leading character).
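A minimal sketch of that renaming rule, inferred only from the single example above; edge cases like entirely new tables or other column types would still need checking:

```python
import re

# Matches e.g. "B02001_E001" or "B02001_M001": table id, E/M marker, 3-digit column number.
NEW_STYLE = re.compile(r"^([A-Z]\d{5}[A-Z]?)_([EM])(\d{3})$")

def to_canonical(varname):
    """Convert 2022-style names (B02001_E001) to the canonical form (B02001_001E)."""
    match = NEW_STYLE.match(varname)
    if not match:
        return varname  # leave GEO_ID, NAME, and already-canonical names untouched
    table, em, col = match.groups()
    return f"{table}_{col}{em}"

assert to_canonical("B02001_E001") == "B02001_001E"
assert to_canonical("B02001_M001") == "B02001_001M"
```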

@fiendskrah
Author

> In the new tables, the variables are named (as an example) B02001_E001. We need processing that anticipates this format, then converts it to the canonical form used in the JSON tables, B02001_001E (where E/M is the final character of the variable rather than the leading character).

The examine notebook now has a bunch of diagnostics looking into how systematic this is. To me, it looks like a simple renaming rule (moving the e) would suffice for most cases, but there are also new tables to consider, and in a handful of tables, new variables.

@knaaptime
Member

So we need a function analogous to this one that handles the new format when year >= 2022 and gets called similarly when processing the geodatabase.

@knaaptime
Member

In parallel, I would also love a little utility that builds all these tables from scratch by hitting the Census API directly (e.g. using cenpy). In that case we could always be up to date, not waiting around in the gap between the latest ACS release and when they publish the demographic profile on the FTP.

@fiendskrah
Author

> So we need a function analogous to this one that handles the new format when year >= 2022 and gets called similarly when processing the geodatabase.

The latest commit changes the io module to try to accomplish this, but I'm not sure how to verify the accuracy.

In short, the changes are:

  1. Adjusted reformat_acs_vars (now normalize_acs_vars) to recognize the variable name change and make appropriate adjustments
  2. Added a little helper function to do the same thing, but specific to the GEOID, just to move that if/elif block out of the main function.

The resulting demographic_profile.parquet and processed_acs output are too big to put in the cloud, but I updated the notebook (just to show they are read in) and the variables.csv (I will pull this out of the PR before merging, as I think I botched the indexing, but maybe it's helpful for you to assess the accuracy?). I ran convert_census_gdb and process_acs in isolation from bash, so maybe the next step is to get these into a GH workflow (e.g. update tools/check_acs_release.py)?

@knaaptime
Member

Hm. You can see in the notebook that the GEOID isn't being handled properly, which is screwing up the merge and subsequent calculations.

[Screenshot 2026-04-15 at 10 31 32 AM: notebook output showing fully qualified GEOIDs and rows with missing data]

We shouldn't have the 'fully qualified' FIPS code anywhere (drop those 14/150000US.. prefixes; there should be logic for that in the script?). Those rows with missing data are a problem.
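A tiny sketch of the kind of logic that could drop the fully qualified prefix, assuming the qualified form is always the summary-level prefix followed by "US" and then the bare FIPS code; the column name below is a placeholder:

```python
import pandas as pd

def strip_geoid_prefix(geoid):
    """Drop a '14...US'/'15...US'-style prefix, keeping the bare FIPS code."""
    if isinstance(geoid, str) and "US" in geoid:
        return geoid.split("US", 1)[1]
    return geoid

# Example usage on a hypothetical dataframe column named "GEOID"
df = pd.DataFrame({"GEOID": ["1400000US06037101110", "06037101122"]})
df["GEOID"] = df["GEOID"].map(strip_geoid_prefix)
print(df)
```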

Also, it's useful to look at it here, but we don't want to commit that diagnostic notebook to the repo (maybe put it in a gist instead?). And get rid of those changes to the variables.csv file. It looks like maybe you added some 2022-specific columns, but since we know the ACS hasn't changed, those columns wouldn't be necessary.
