updated_date partitions aren't important yet. You need all the entities, so for
Authors you would get
updated_date partition. Each is under 2 GB.
updated_date partitions make this easy, but the way they work may be unfamiliar. Unlike a set of dated snapshots that each contain the full dataset as of a certain date, each partition contains the records that last changed on that date.
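To make the difference from dated snapshots concrete, here is a minimal sketch in Python. It models partitions as an in-memory dictionary; the `upsert` helper, the record IDs, and the record contents are hypothetical and only illustrate the behavior described above, not how the dataset is actually produced.

```python
from collections import defaultdict

# Hypothetical in-memory model: partition date -> {record id -> record}.
partitions = defaultdict(dict)

def upsert(record_id, record, updated_date):
    """Store a record under its new updated_date, removing any older copy."""
    for records in partitions.values():
        records.pop(record_id, None)
    partitions[updated_date][record_id] = {**record, "updated_date": updated_date}

# Two records are created on the same date.
upsert("A1", {"name": "first author"}, "2021-12-30")
upsert("A2", {"name": "second author"}, "2021-12-30")

# A1 changes later, so it leaves the 2021-12-30 partition
# and appears only in the 2022-01-04 partition.
upsert("A1", {"name": "first author (renamed)"}, "2022-01-04")

counts = {date: len(records) for date, records in partitions.items()}
```

In this toy model each record lives in exactly one partition at a time, which is why the record count of an older partition can shrink as its records are updated.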
Authors, each being newly created on that date,
/data/authors/ looks like this:
Authors, they would come out of one of the files in
/data/authors/updated_date=2021-12-30 and go into one in
/data/authors/updated_date=2022-01-04 to get everything that was changed or added since then.
Author partitions and the number of records in each (in the actual dataset):
updated_date partition for an entity, we'll delete that entity's
manifest file. When we finish writing the partition, we'll recreate the manifest, including the newly created objects. So if
manifest is there, all the entities are there too.
url property of each item in the
updated_date you haven't seen before.
s3://openalex/data/authors/manifest again. If it hasn't changed since (1), no records moved around and any date partitions you downloaded are valid.
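The manifest check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the manifest is taken to be already parsed into a dictionary with an `entries` list whose items have a `url` property, the `part_000.gz` file names are made up, and the helper names are hypothetical. Fetching the manifest and the files themselves is left out.

```python
import re

# Matches the partition date embedded in a file URL.
DATE_RE = re.compile(r"updated_date=(\d{4}-\d{2}-\d{2})")

def partition_date(url):
    """Extract the updated_date value from a partition file URL, or None."""
    match = DATE_RE.search(url)
    return match.group(1) if match else None

def urls_to_download(manifest, seen_dates):
    """Return URLs of files whose updated_date is not in seen_dates."""
    # Assumption: manifest looks like {"entries": [{"url": ...}, ...]}.
    urls = []
    for entry in manifest["entries"]:
        date = partition_date(entry["url"])
        if date is not None and date not in seen_dates:
            urls.append(entry["url"])
    return urls

def download_is_valid(manifest_before, manifest_after):
    """True if the manifest did not change while files were being downloaded."""
    return manifest_before == manifest_after

# Made-up manifest contents, for illustration only.
before = {"entries": [
    {"url": "s3://openalex/data/authors/updated_date=2021-12-30/part_000.gz"},
    {"url": "s3://openalex/data/authors/updated_date=2022-01-04/part_000.gz"},
]}
new_urls = urls_to_download(before, seen_dates={"2021-12-30"})
```

After downloading `new_urls`, you would fetch the manifest a second time and pass both copies to `download_is_valid`. If it returns false, a reasonable response is to discard what was downloaded and start again from the first step.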