Looking for new books?
Using the Goodreads dataset we can easily look for books in your domains (like Fantasy and Science Fiction) to find the highest rated books.
The dataset seems recent as the last update is from June, 23 2024. The total number of books is 6,389,859. Many more then I will ever read...
Download the data
First let's kick off a docker container running DuckDB. The following command will download the latest version and provide us with the duckdb cli interface. Note that you will need to change the $PWD/goodreads according to your system. Your provided folder will then have a file called goodreads.duckdb which is
the database.
Once inside the cli run the following commands. I'm using the httpfs extension
INSTALL httpfs;
LOAD httpfs;
SET enable_progress_bar = true;
CREATE TABLE goodreads AS
SELECT * FROM 'hf://datasets/BrightData/Goodreads-Books/Goodreads-Books.csv';
The create table command will run for awhile but after a few seconds you should at least see a progress bar. In total it took around 3 mins on my laptop.
Note that we don't need to create any indices. DuckDB is a columnar, vectorized engine that performs many query optimizations automatically. And in fact all queries I have run so far are instant.
Data Structure
Once downloaded we can get the schema via describe goodreads;
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| url | VARCHAR | YES | NULL | NULL | NULL |
| id | VARCHAR | YES | NULL | NULL | NULL |
| name | VARCHAR | YES | NULL | NULL | NULL |
| author | VARCHAR | YES | NULL | NULL | NULL |
| star_rating | DOUBLE | YES | NULL | NULL | NULL |
| num_ratings | BIGINT | YES | NULL | NULL | NULL |
| num_reviews | BIGINT | YES | NULL | NULL | NULL |
| summary | VARCHAR | YES | NULL | NULL | NULL |
| genres | VARCHAR | YES | NULL | NULL | NULL |
| first_published | VARCHAR | YES | NULL | NULL | NULL |
| about_author | VARCHAR | YES | NULL | NULL | NULL |
| community_reviews | VARCHAR | YES | NULL | NULL | NULL |
| kindle_price | VARCHAR | YES | NULL | NULL | NULL |
George R. R. Martin Books
Let's quickly check out all books from the creator of Game Of Thrones:
COPY (
select url, name, author, star_rating, num_ratings, summary, genres, first_published
from goodreads where author ilike '%george%r%martin%'
) TO '/data/george_r_r_martin.csv'
WITH (HEADER, DELIMITER ',', QUOTE '"', ESCAPE '"')
;
You can then load the data into vscode or Google Sheets OR we just quickly create an html table and load it into the browser. Just copy and paste the following into a command line.
python - <<'EOF'
import pandas as pd
import html
df = pd.read_csv("george_r_r_martin.csv")
url_col = df.columns[0] # first column
def linkify(u):
if pd.isna(u) or str(u).strip() == "":
return ""
u = str(u).strip()
safe_href = html.escape(u, quote=True)
safe_text = html.escape(u)
return f'<a href="{safe_href}" target="_blank" rel="noopener noreferrer">{safe_text}</a>'
df[url_col] = df[url_col].map(linkify)
table = df.to_html(
index=False,
classes="styled-table",
border=0,
escape=False, # allow our <a> tags to render
justify="left"
)
page = f"""<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>George R. R. Martin – Book Data</title>
<style>
body {{
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial;
background: #f5f7fa;
padding: 2rem;
}}
h1 {{ margin: 0 0 1rem 0; }}
.styled-table {{
border-collapse: collapse;
width: 100%;
background: white;
box-shadow: 0 2px 8px rgba(0,0,0,0.08);
}}
.styled-table thead tr {{
background-color: #2f3e46;
color: #fff;
text-align: left;
}}
.styled-table th, .styled-table td {{
padding: 10px 12px;
border-bottom: 1px solid #e0e0e0;
vertical-align: top;
}}
.styled-table tbody tr:nth-child(even) {{ background-color: #f8f9fa; }}
.styled-table tbody tr:hover {{ background-color: #eef2f7; }}
.styled-table th {{ position: sticky; top: 0; }}
.styled-table a {{
color: #0b5ed7;
text-decoration: none;
}}
.styled-table a:hover {{
text-decoration: underline;
}}
</style>
</head>
<body>
<h1>George R. R. Martin – Book Data</h1>
{table}
</body>
</html>
"""
with open("george_r_r_martin.html", "w", encoding="utf-8") as f:
f.write(page)
EOF
The "Best Recent" books by genre
I define the best recent books as:
- published after the year 2000
- must have 4 or more star rating
- must have more than 10,000 ratings
- it's number of 5 star ratings must be higher than the number of 4 star ratings
There are a few json strings in the data but the following query works nicely for the genre Space Opera:
SELECT
url
, name
, author
, extract(year FROM strptime(first_published, '%m/%d/%Y')) AS year
FROM goodreads
WHERE
json_contains(genres, '"Space Opera"')
and star_rating >= 4.0
and num_ratings > 10000
and extract(year FROM strptime(first_published, '%m/%d/%Y')) > 2000
and json_extract(community_reviews, '$.5_stars.reviews_num')::INT > json_extract(community_reviews, '$.4_stars.reviews_num')::INT
order by
star_rating desc
;
There are many more things do be done. A streamlit app would be an obvious next step but who has the time? Go, read books! ;-)