notify-about-new-ACS-release script #438
Nice. This is also a great start to #355. A couple of scattered thoughts: to continue building the dataset and pushing up to Quilt, this action would only need to run a few more functions (the AWS creds are already available as a repo secret). If we try to run the whole pipeline we might hit some resource limits, but it's worth a shot. I think we can also get the `LATEST_SUPPORTED_YEAR` on line 14 to update dynamically: once the script successfully downloads the census data, that line can be updated. We should also add some checks to make sure the files have all the necessary data, not just that they're present. This could probably just be a size check, since the datasets are roughly (at least) 1.5 GB each.
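The size check suggested above could be a minimal sketch like the following. The function name and the 1.5 GB floor are assumptions taken from the rough figure in the thread, not part of the existing script:

```python
import os

# Hypothetical minimum size for a full ACS tract/blockgroup download; the
# thread suggests the datasets are at least ~1.5 GB, so anything smaller
# probably indicates a truncated or failed download.
MIN_BYTES = int(1.5 * 1024**3)

def looks_complete(path, min_bytes=MIN_BYTES):
    """Return True if the downloaded file exists and meets the size floor."""
    return os.path.exists(path) and os.path.getsize(path) >= min_bytes
```

A real check might also open the file and verify a few expected tables, but a size floor catches the common failure mode (partial downloads) cheaply.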
The test failures are from #436, so don't worry about those. I was cautiously hopeful that might get fixed quickly because I think the spatial stuff in duckdb-spatial is kinda popular, but I think we may have to go ahead and pin `<1.5`.
Can we add a manual trigger too? We want to be able to fire this off to check on demand, and/or rerun if a run is incomplete and needs to go again.
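In GitHub Actions that's a `workflow_dispatch` trigger alongside the scheduled one. A sketch (the cron expression here is just an example, not the workflow's actual schedule):

```yaml
on:
  schedule:
    - cron: "0 6 * * 1"    # example weekly check; keep whatever schedule the workflow already uses
  workflow_dispatch: {}     # adds a "Run workflow" button in the Actions tab for manual reruns
```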
| "missing21 = sorted(needed - present21)\n", | ||
| "missing22 = sorted(needed - present22)\n", | ||
| "\n", | ||
| "newly_missing_in_2022 = sorted((needed - present22) - (needed - present21))\n", |
Why not just `missing22 - missing21`?
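The two are equivalent as sets; the only wrinkle is that the notebook stores `missing21`/`missing22` as sorted lists, so they'd need converting back to sets first. A toy example (the sample values here are made up for illustration):

```python
needed = {"A", "B", "C", "D"}
present21 = {"A", "B"}
present22 = {"A"}

missing21 = sorted(needed - present21)   # ['C', 'D']
missing22 = sorted(needed - present22)   # ['B', 'C', 'D']

# The notebook's double set-difference...
newly_missing = sorted((needed - present22) - (needed - present21))

# ...gives the same answer as subtracting the missing sets directly:
assert newly_missing == sorted(set(missing22) - set(missing21))  # ['B']
```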
I was able to build *a version* of the 2022 table but upon inspection, it doesn't seem to be valid. I initially thought it was just an issue with `GEOID` being renamed in certain tables (hence the change to …). I was initially pulling from this URL: … Before I got into trying to download and parse these few thousand `.dat` files, I figured you'd be able to tell me if I was barking up the wrong tree. It's hard for me to know if my errors are due to the code, the location of the data tables, or recent changes to the ACS. Are the tools in …?
**knaaptime** left a comment:
One other thing: we want to keep the intermediate tables as well (the `save_intermediate=True` in `process_acs`). We should look at how the structure of those tables has changed too (and also save and upload them).
| "Missing in 2021 but present in 2022: 0\n", | ||
| "\n", | ||
| "First 100 newly missing in 2022:\n", | ||
| "['B01001_003E', 'B01001_004E', 'B01001_005E', 'B01001_006E', 'B01001_018E', 'B01001_019E', 'B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E', 'B01001_024E', 'B01001_025E', 'B01001_027E', 'B01001_028E', 'B01001_029E', 'B01001_030E', 'B01001_042E', 'B01001_043E', 'B01001_044E', 'B01001_045E', 'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E', 'B01003_001E', 'B02001_006E', 'B03002_003E', 'B03002_004E', 'B03002_005E', 'B03002_006E', 'B03002_007E', 'B03002_012E', 'B12001_001E', 'B12001_005E', 'B12001_007E', 'B12001_009E', 'B12001_010E', 'B12001_016E', 'B12001_018E', 'B12001_019E', 'B15002_001E', 'B15002_003E', 'B15002_004E', 'B15002_005E', 'B15002_006E', 'B15002_007E', 'B15002_008E', 'B15002_009E', 'B15002_010E', 'B15002_015E', 'B15002_016E', 'B15002_017E', 'B15002_018E', 'B15002_020E', 'B15002_021E', 'B15002_022E', 'B15002_023E', 'B15002_024E', 'B15002_025E', 'B15002_026E', 'B15002_027E', 'B15002_032E', 'B15002_033E', 'B15002_034E', 'B15002_035E', 'B17010_001E', 'B17010_004E', 'B17010_011E', 'B17010_017E', 'B19001_001E', 'B19013_001E', 'B19301_001E', 'B21001_002E', 'B25002_001E', 'B25002_002E', 'B25002_003E', 'B25003_001E', 'B25003_002E', 'B25003_003E', 'B25024_001E', 'B25024_004E', 'B25024_005E', 'B25024_006E', 'B25024_007E', 'B25024_008E', 'B25024_009E', 'B25058_001E', 'B25077_001E', 'C24010_001E']\n" |
This is interesting, and not great. Just spot-checking, these variables cover a ton of ground: age, marital status, housing tenure, multiunit structures, etc.
It's possible these have different analogues/variable names in the new ACS, but we'd need to dig more.
Wait, that's every variable missing in 2022? If that's the case, it's more likely to be a code issue than truly missing data, as I don't think the ACS has changed that much in the 2022 release.
Might want to loop @jvtcl into the conversation for confirmation either way. He might at least have an immediate answer if it's a red herring.
Wow, is this every variable in the entire ACS? That could be handy. I've never worked with `.dat` files, but a couple of other avenues first. First, let's take a look at the intermediate tables (like I said above). It may be that the "demographic profile" or whatever has changed, so the first thing to do is check out those files. I really like having the whole DP available because there are some useful things up there that we don't include in geosnap because they're not in the LTDB (e.g. household earnings by income bin, which is what you use for measuring income segregation). Ideally we can figure that out. But alternatively (and maybe additionally), we should have a failover that grabs all these variables from the API using something like …
One other thing that occurs to me: instead of hardcoding the YEAR, the script could just check which is the most recent file that exists in the spatial-ucr S3 bucket.
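A minimal sketch of that bucket check. The bucket name comes from the thread, but the key layout under `census/acs/` and the function names are assumptions; the year-parsing helper is kept pure so it's easy to test without AWS credentials:

```python
import re

YEAR_RE = re.compile(r"(?:19|20)\d{2}")

def max_year_from_keys(keys):
    """Pure helper: pull the largest 4-digit year out of a list of S3 keys."""
    years = [int(m.group(0)) for k in keys if (m := YEAR_RE.search(k))]
    return max(years, default=None)

def latest_acs_year(bucket="spatial-ucr", prefix="census/acs/"):
    """List keys under an assumed prefix and return the most recent year seen.

    Relies on the AWS creds already available as a repo secret.
    """
    import boto3  # imported lazily so the pure helper works without it

    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return max_year_from_keys(keys)
```

The script could then set its "latest supported year" to `latest_acs_year() + 1` as the next vintage to look for.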
The universe of columns we need is the full set in this file, and similarly for blockgroups.
@fiendskrah ok, now that I've looked at one of the 2022 tables in the geodatabase, the reason you're getting no results is that the naming convention has changed. Your PR includes an update for the geoid column, but there are other systematic changes. In the new tables, the variables are named (as an example): …
The examine notebook now has a bunch of diagnostics looking into how systematic this is. To me, it looks like a simple renaming rule (moving the …).
So we need a function analogous to this one that handles this new format if `year >= 2022`, and that gets called similarly when processing the geodatabase.
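The shape of that function might look like the sketch below. The exact renaming rule is elided in the thread, so the pattern here is purely hypothetical: suppose the new tables use names like `B01001e3` while the pipeline expects API-style names like `B01001_003E`. Whatever the real rule turns out to be, only `normalize_column` would need to change:

```python
import re

# HYPOTHETICAL rule for illustration only -- the actual 2022 naming
# convention needs to be confirmed against the geodatabase diagnostics.
NEW_STYLE = re.compile(r"^([A-Z]\d{5}[A-Z]?)e(\d+)$")

def normalize_column(name):
    """Map an assumed new-style column name onto the old convention;
    pass through anything that doesn't match (e.g. GEOID)."""
    m = NEW_STYLE.match(name)
    if m:
        table, idx = m.groups()
        return f"{table}_{int(idx):03d}E"
    return name

def normalize_columns(columns, year):
    """Per the thread, only tables from 2022 on use the new convention."""
    if year >= 2022:
        return [normalize_column(c) for c in columns]
    return list(columns)
```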
In parallel, I would also love a little utility that builds all these tables from scratch by hitting the Census API directly (e.g. using `cenpy`). In that case we could always be up to date, not waiting around in the gap between the latest ACS release and when they publish the demographic profile on the FTP.
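The core of such a utility, using the public ACS 5-year API (stdlib only rather than `cenpy`, to keep the sketch dependency-free). The helper names are mine; a production version would also need to chunk requests, since the API caps the number of variables per call, and would want an API key for heavy use:

```python
import json
from urllib.request import urlopen

ACS_BASE = "https://api.census.gov/data/{year}/acs/acs5"

def build_url(year, variables, state_fips):
    """Compose an ACS 5-year API request for all tracts in one state."""
    get = ",".join(["GEO_ID"] + list(variables))
    return (f"{ACS_BASE.format(year=year)}?get={get}"
            f"&for=tract:*&in=state:{state_fips}")

def rows_to_records(rows):
    """The API returns a list of lists with a header row; zip into dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

def fetch_tracts(year, variables, state_fips):
    """Pull the requested variables for every tract in a state."""
    with urlopen(build_url(year, variables, state_fips)) as resp:
        return rows_to_records(json.load(resp))
```

Looping `fetch_tracts` over all state FIPS codes and concatenating the records would rebuild the tract universe for any vintage the API serves, independent of the FTP release schedule.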
The latest commit changes the … In short, the changes are: …

The resulting …

Here's a draft of a script (suggested in #425) that checks the Census FTP server for a new ACS release by looking for specific files (tracts and BGs), then opens a new issue on the repository notifying that these are ready to be processed. I tested this locally and was able to make it recognize the 2022 release and open this issue ticket.
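The two halves of that check might be sketched as follows. The file paths, repo slug, and helper names here are assumptions, not the draft script's actual values; the important points are using a HEAD request so the multi-GB archives are never downloaded, and hitting the GitHub REST issues endpoint with a token:

```python
import json
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Assumed file locations -- the real script's templates may differ.
RELEASE_FILES = [
    "https://www2.census.gov/geo/tiger/TIGER_DP/{year}ACS/ACS_{year}_5YR_TRACT.gdb.zip",
    "https://www2.census.gov/geo/tiger/TIGER_DP/{year}ACS/ACS_{year}_5YR_BG.gdb.zip",
]

def release_urls(year):
    """Fill the year into each expected release-file template."""
    return [tmpl.format(year=year) for tmpl in RELEASE_FILES]

def url_exists(url):
    """HEAD the file rather than downloading it; 404 means not released yet."""
    try:
        with urlopen(Request(url, method="HEAD")) as resp:
            return resp.status == 200
    except HTTPError:
        return False

def open_issue(token, repo, year):
    """Open a notification issue via the GitHub REST API (assumed repo slug)."""
    payload = json.dumps({
        "title": f"ACS {year} release detected",
        "body": f"Tract and blockgroup files for {year} are on the Census server.",
    }).encode()
    req = Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=payload,
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

Running inside a workflow, the token could come from the built-in `GITHUB_TOKEN` rather than a personal classic token, which addresses the auth concern below.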
Some potential issues:
- The most current year is currently hardcoded in the script (line 14); I wasn't sure how to infer this from the files in the repository.
- The script looks for specific file names; if the naming conventions happen to change, this will break.
- GitHub token usage: this is all pretty new to me, but I was able to make it work with a classic token set on my local machine. There is probably a more sophisticated way to set this up where a GitHub Action performs the check rather than my personal account.
- This just recognizes the new vintage; it doesn't download, process, or upload to Quilt.
I'm happy to iterate on this if there are any desired changes.