Importing 25 million books into Bookends
I’m currently spending a few hours on the weekends working on Bookends, and the project has been moving along slowly. The past few days, I’ve had this looming thought on my mind: how do I import the millions of books available through OpenLibrary that I will use as the source for Bookends’ catalog? The search engine I’m using is Typesense, and I will use it to look up books and authors from the Bookends UI.
Problems
- The sheer number of books means that relying on over-the-wire requests for each one would cap indexing at somewhere between 80-100k books per day, which would leave me with almost a year of indexing time.
- Importing the whole 25GB text file as-is is not possible: it’s tab-delimited, and a number of the values in it are useless to me. I only care about the title of the book and the OpenLibraryId.
- I tried parallelizing #1, but that introduced other problems.
A more complicated issue was that I initially tried to solve the problem without thinking about how long processing each book would take; even at 100 ms per book, 25 million books works out to roughly a month of non-stop processing, which was still too long.
First attempt
At first, I tried solving all this by writing a JavaScript program that would go over each line, fetch the author’s name and the book’s description from OpenLibrary, and save that into a local SQLite database. This quickly proved to be slow. I tried to parallelize it, but SQLite doesn’t like too many concurrent writes; I would often get an error that the database was busy, which pushed me to think about other approaches. This approach was going through about 10,000 books per hour, which works out to roughly 2,500 hours, or more than three months of continuous processing, to get through 25 million books.
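To make the bottleneck concrete, here is a rough shell sketch of that per-book flow. The real version was a JavaScript program; the file and table names below are made up, and the column layout assumes the OpenLibrary works dump (tab-separated columns with the record’s JSON last) and the public works API.

```bash
#!/usr/bin/env bash
# Rough sketch of the first attempt's per-book flow: one HTTP request and
# one SQLite write per line of the dump. File/table names are assumptions.
sqlite3 books.db "CREATE TABLE IF NOT EXISTS books (ol_id TEXT, title TEXT, description TEXT);"

while IFS=$'\t' read -r type key revision modified json; do
  # One network round trip per book (the real program also resolved
  # author names, which needs additional requests)
  record=$(curl -s "https://openlibrary.org${key}.json")
  title=$(jq -r '.title // empty' <<< "$record")
  description=$(jq -r '.description.value? // .description // empty' <<< "$record")

  # One SQLite write per book; with parallel workers this is where the
  # "database is busy" errors show up (quoting kept naive for brevity)
  sqlite3 books.db "INSERT INTO books (ol_id, title, description)
                    VALUES ('$key', '$title', '$description');"
done < ol_dump_works.txt
```

Every iteration pays for a network round trip plus its own database transaction, so a ceiling of roughly 10,000 books per hour is not surprising.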
Second attempt
I realized that I don’t need the author info and the book’s description when indexing book titles; that can be a problem for another day. This led me to try another approach. In this simpler approach, books were processed a lot faster because the program only picked books off the top of the file and deleted each line as it went. It required 3 steps, in order: first read the book from the first line, then add the book to the SQLite DB, and finally delete the line from the file. Read speeds are fast, so reading the line did not matter, but both writing to SQLite and updating the file are limited by disk speed, and when you’re doing this with 3 parallel processes, it quickly becomes a race condition.
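A minimal shell sketch of that 3-step loop, under the same assumptions about the dump layout (the real version was again JavaScript, and the filenames are placeholders):

```bash
# Sketch of the second attempt: read the top line, store it, drop it.
# Assumes the record's JSON sits in the fifth tab-separated column.
while [ -s works.txt ]; do
  line=$(head -n 1 works.txt)                      # step 1: read the first line (cheap)
  key=$(cut -f2 <<< "$line")
  title=$(cut -f5 <<< "$line" | jq -r '.title // empty')

  # step 2: one SQLite write per book (disk-bound)
  sqlite3 books.db "INSERT INTO books (ol_id, title) VALUES ('$key', '$title');"

  # step 3: rewrite the whole file without its first line (also disk-bound)
  sed -i '1d' works.txt
done
```

Run three of these loops against the same file and steps 1 and 3 start racing each other for the same first line, which is the race condition described above.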
Attempt number 2.5 was to split the files into multiple folders and then run one process per folder. This resulted in a huge gain in performance compared to just running 3 processes on the same directory, but it was still limited in speed.
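The parallel variant boils down to one worker per folder so that no two processes ever touch the same file (the worker script name here is hypothetical):

```bash
# One worker per split directory; workers never share a file
for dir in split-1 split-2 split-3; do
  ./process-books.sh "$dir" &   # hypothetical script running the loop above on one folder
done
wait
```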
With this attempt, I was getting close to 50,000 books saved to the DB per hour. But…still not enough! That rate would’ve taken about 3 weeks. There was also a once-in-a-while issue with this approach: Typesense would complain about write queues being too long and would sometimes reject the index request.
A working solution
I should’ve paid attention to Typesense’s documentation when it recommended importing many documents instead of one document at a time.
So, after thinking about this for a bit and taking my lessons from the previous attempts, I did the following:
- Break up the large 25 gig file from OpenLibrary into smaller chunks; I went with files that were about 10MB in size.
- I converted the split files into Typesense-compatible JSONL files. This was done by dropping the details I didn’t need, like the per-line metadata and author info, and keeping only the title and openLibraryId in the resulting JSONL file (see the sketch after this list).
- Instead of writing a JS program to handle this, I used curl and a while loop like this:
while [ "$(ls -A lines)" ]; do
file=$(find lines -maxdepth 1 -type f -name "*.jsonl" | head -1)
if [ -n "$file" ]; then
curl -H "X-TYPESENSE-API-KEY: API-KEY-HERE" \
-X POST \
-T "$file" \
"http://localhost:8108/collections/books/documents/import?action=create"
rm "$file"
fi
sleep 1
done
- The `sleep 1` was used to keep the write queue in Typesense from growing too big.
- In each iteration, about 20,000 books were added to Typesense. So realistically, about 20,000 books every 3 seconds.
- With this approach, I was able to index 25 million books in under 25 minutes.
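For reference, the split-and-convert steps from the list above can be sketched like this, assuming the dump keeps its JSON record in the last tab-separated column and that each record has key and title fields; paths and prefixes are placeholders rather than my exact commands:

```bash
mkdir -p chunks lines

# 1. Break the 25GB dump into ~10MB pieces without cutting a line in half
split --line-bytes=10M ol_dump_works.txt chunks/chunk_

# 2. Keep only what Typesense needs: one {openLibraryId, title} object per line,
#    written into the lines/ directory that the curl loop drains
for chunk in chunks/chunk_*; do
  cut -f5 "$chunk" \
    | jq -c 'select(.title != null) | {openLibraryId: .key, title: .title}' \
    > "lines/$(basename "$chunk").jsonl"
done
```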
I will share more notes about other Bookends features and infra as I make progress.