Harvard’s Library Releases Dataset from Old Books

Using scanned material in the public domain, the Harvard Library team has released a new LLM-focused dataset of more than 1 million volumes (nearly 250 billion tokens).

Harvard has been in the news of late, much of it for reasons I’d assume the university would rather avoid. But in the midst of that, Harvard librarians are demonstrating why we’ve long admired university work. As holders of a vast wealth of history and information, they’re looking for ways to share it with the world.
