Skip to content

Looking for new books?

Using the Goodreads dataset we can easily look for books in your domains (like Fantasy and Science Fiction) to find the highest rated books.

The dataset seems recent as the last update is from June, 23 2024. The total number of books is 6,389,859. Many more then I will ever read...

Download the data

First let's kick off a docker container running DuckDB. The following command will download the latest version and provide us with the duckdb cli interface. Note that you will need to change the $PWD/goodreads according to your system. Your provided folder will then have a file called goodreads.duckdb which is the database.

docker run --rm -it -v "$PWD/goodreads:/data" duckdb/duckdb:latest duckdb /data/goodreads.duckdb

Once inside the cli run the following commands. I'm using the httpfs extension

INSTALL httpfs;
LOAD httpfs;
SET enable_progress_bar = true; 

CREATE TABLE goodreads AS
SELECT * FROM 'hf://datasets/BrightData/Goodreads-Books/Goodreads-Books.csv';

The create table command will run for awhile but after a few seconds you should at least see a progress bar. In total it took around 3 mins on my laptop.

Note that we don't need to create any indices. DuckDB is a columnar, vectorized engine that performs many query optimizations automatically. And in fact all queries I have run so far are instant.

Data Structure

Once downloaded we can get the schema via describe goodreads;

column_name column_type null key default extra
url VARCHAR YES NULL NULL NULL
id VARCHAR YES NULL NULL NULL
name VARCHAR YES NULL NULL NULL
author VARCHAR YES NULL NULL NULL
star_rating DOUBLE YES NULL NULL NULL
num_ratings BIGINT YES NULL NULL NULL
num_reviews BIGINT YES NULL NULL NULL
summary VARCHAR YES NULL NULL NULL
genres VARCHAR YES NULL NULL NULL
first_published VARCHAR YES NULL NULL NULL
about_author VARCHAR YES NULL NULL NULL
community_reviews VARCHAR YES NULL NULL NULL
kindle_price VARCHAR YES NULL NULL NULL

George R. R. Martin Books

Let's quickly check out all books from the creator of Game Of Thrones:

COPY (
  select url, name, author, star_rating, num_ratings, summary, genres, first_published
  from goodreads where author ilike '%george%r%martin%'
) TO '/data/george_r_r_martin.csv'
WITH (HEADER, DELIMITER ',', QUOTE '"', ESCAPE '"')
;

You can then load the data into vscode or Google Sheets OR we just quickly create an html table and load it into the browser. Just copy and paste the following into a command line.

python - <<'EOF'
import pandas as pd
import html

df = pd.read_csv("george_r_r_martin.csv")

url_col = df.columns[0]  # first column

def linkify(u):
    if pd.isna(u) or str(u).strip() == "":
        return ""
    u = str(u).strip()
    safe_href = html.escape(u, quote=True)
    safe_text = html.escape(u)
    return f'<a href="{safe_href}" target="_blank" rel="noopener noreferrer">{safe_text}</a>'

df[url_col] = df[url_col].map(linkify)

table = df.to_html(
    index=False,
    classes="styled-table",
    border=0,
    escape=False,      # allow our <a> tags to render
    justify="left"
)

page = f"""<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>George R. R. Martin – Book Data</title>
<style>
body {{
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial;
  background: #f5f7fa;
  padding: 2rem;
}}
h1 {{ margin: 0 0 1rem 0; }}

.styled-table {{
  border-collapse: collapse;
  width: 100%;
  background: white;
  box-shadow: 0 2px 8px rgba(0,0,0,0.08);
}}
.styled-table thead tr {{
  background-color: #2f3e46;
  color: #fff;
  text-align: left;
}}
.styled-table th, .styled-table td {{
  padding: 10px 12px;
  border-bottom: 1px solid #e0e0e0;
  vertical-align: top;
}}
.styled-table tbody tr:nth-child(even) {{ background-color: #f8f9fa; }}
.styled-table tbody tr:hover {{ background-color: #eef2f7; }}

.styled-table th {{ position: sticky; top: 0; }}

.styled-table a {{
  color: #0b5ed7;
  text-decoration: none;
}}
.styled-table a:hover {{
  text-decoration: underline;
}}
</style>
</head>
<body>
<h1>George R. R. Martin – Book Data</h1>
{table}
</body>
</html>
"""

with open("george_r_r_martin.html", "w", encoding="utf-8") as f:
    f.write(page)
EOF

The "Best Recent" books by genre

I define the best recent books as:

  • published after the year 2000
  • must have 4 or more star rating
  • must have more than 10,000 ratings
  • it's number of 5 star ratings must be higher than the number of 4 star ratings

There are a few json strings in the data but the following query works nicely for the genre Space Opera:

SELECT
    url
    , name
    , author
    , extract(year FROM strptime(first_published, '%m/%d/%Y')) AS year
FROM goodreads
WHERE 
    json_contains(genres, '"Space Opera"')
    and star_rating >= 4.0 
    and num_ratings > 10000
    and extract(year FROM strptime(first_published, '%m/%d/%Y')) > 2000
    and json_extract(community_reviews, '$.5_stars.reviews_num')::INT > json_extract(community_reviews, '$.4_stars.reviews_num')::INT
order by 
    star_rating desc
;

There are many more things do be done. A streamlit app would be an obvious next step but who has the time? Go, read books! ;-)