Contributing to LanceDB: from Python dict to SQL struct
How I added dict-to-SQL struct conversion to LanceDB's table.update() and the concepts behind it
How I found the project
I was reading about newer data formats when I came across LanceDB. It stores text, images, videos, embeddings, and structured data together in one place. Went through their GitHub issues, filtered by good first issue, found one that looked doable.
The issue
This one. table.update() couldn’t handle struct columns if you passed a Python dict.
Say you have a training data pipeline that stores thousands of image annotations like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
[
{
"sample_id": "img_991204",
"dataset": "laion-aesthetic",
"annotation": {
"caption": "sunset over mountains",
"aesthetic_score": 7.2,
"watermarked": false,
"source": "flickr"
}
},
{
"sample_id": "img_991205",
"dataset": "laion-aesthetic",
"annotation": {
"caption": "cat on a couch",
"aesthetic_score": 6.8,
"watermarked": true,
"source": "unsplash"
}
},
{
"sample_id": "img_991206",
"dataset": "laion-aesthetic",
"annotation": {
"caption": "street art mural",
"aesthetic_score": 5.1,
"watermarked": false,
"source": "flickr"
}
}
]
You ingest those into LanceDB. Later you find out a batch came with wrong aesthetic_score values because the scoring model was generating bad numbers. You need to update specific rows. You’d expect this to work:
1
2
3
4
5
6
7
8
9
10
11
12
13
table.update(
where="sample_id = 'img_991204'",
values={
"annotation": {
"caption": "sunset over mountains",
"aesthetic_score": 8.1,
"watermarked": False,
"source": "flickr"
}
}
)
# expected: annotation updated with aesthetic_score = 8.1
# actual: NotImplementedError
Nope. LanceDB didn’t know how to turn a dict into SQL. The values parameter only worked with scalar types.
You could get around it with values_sql, writing the SQL yourself:
1
2
3
4
5
6
7
8
table.update(
where="sample_id = 'img_991204'",
values_sql={
"annotation": "named_struct('caption', 'sunset over mountains', "
"'aesthetic_score', 8.1, 'watermarked', FALSE, "
"'source', 'flickr')"
}
)
It works, but you’re writing raw SQL strings by hand. Quotes, commas, nested structs. Gets ugly fast.
The issue pointed to named_struct from DataFusion as the way to solve this.
Background
DataFusion and named_struct
Apache DataFusion is the SQL engine behind LanceDB. When you call table.update(), your Python values get converted to SQL strings, and DataFusion runs them. Nobody had written the part that converts dicts.
The issue mentioned named_struct but I didn’t know where that function came from. That led me to find out that LanceDB uses DataFusion to run SQL queries. I had to read through the existing code and the DataFusion docs before it clicked.
named_struct is a SQL function that creates structs with named fields. You pass alternating name-value pairs:
1
2
3
4
5
6
7
8
9
10
SELECT named_struct(
'caption', 'sunset over mountains',
'aesthetic_score', 8.1,
'watermarked', FALSE,
'source', 'flickr'
);
-- {"caption": "sunset over mountains",
-- "aesthetic_score": 8.1,
-- "watermarked": false,
-- "source": "flickr"}
By the time I picked up the issue, named_struct was already available in LanceDB’s version of DataFusion.
singledispatch
The issue already said to use singledispatch from functools. It lets you register a different function for each type. So instead of a long chain of if isinstance(value, str) ... elif isinstance(value, float) ..., each type gets its own function that you can add without touching the rest.
Here’s how the whole flow works:
flowchart LR
A["Python dict<br/>{caption: sunset}"] -->|"table.update()"| B["value_to_sql()<br/>sees it's a dict"]
B --> C["named_struct(<br/>caption, sunset)"]
C -->|"DataFusion"| D["table updated"]
style A fill:#f0f4c3,stroke:#827717,color:#333
style B fill:#e1bee7,stroke:#6a1b9a,color:#333
style C fill:#bbdefb,stroke:#1565c0,color:#333
style D fill:#c8e6c9,stroke:#2e7d32,color:#333
The existing code
python/python/lancedb/util.py already had value_to_sql using singledispatch. One handler per type. That’s what made adding dict just one new function instead of rewriting the whole thing.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
@singledispatch
def value_to_sql(value):
raise TypeError(f"Cannot convert {type(value)} to SQL")
@value_to_sql.register(str)
def _(value: str):
value = value.replace("'", "''")
return f"'{value}'"
# value_to_sql("sunset") -> 'sunset'
# value_to_sql("it's raining") -> 'it''s raining'
@value_to_sql.register(float)
def _(value: float):
return str(value)
# value_to_sql(7.2) -> 7.2
@value_to_sql.register(bool)
def _(value: bool):
return str(value).upper()
# value_to_sql(False) -> FALSE
@value_to_sql.register(list)
def _(value: list):
return "[" + ", ".join(map(value_to_sql, value)) + "]"
# value_to_sql(["flickr", "unsplash"]) -> ['flickr', 'unsplash']
@value_to_sql.register(type(None))
def _(value):
return "NULL"
# value_to_sql(None) -> NULL
# value_to_sql({"caption": "sunset"}) -> ??? no handler for dict
What I added
1
2
3
4
5
6
7
8
@value_to_sql.register(dict)
def _(value: dict):
# https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct
return (
"named_struct("
+ ", ".join(f"'{k}', {value_to_sql(v)}" for k, v in value.items())
+ ")"
)
Goes through each key-value pair, puts the key as a string literal, and calls value_to_sql on the value. So our annotation dict:
1
2
3
4
5
6
value_to_sql({
"caption": "sunset over mountains",
"aesthetic_score": 8.1,
"watermarked": False,
"source": "flickr"
})
becomes:
1
2
3
4
5
6
named_struct(
'caption', 'sunset over mountains',
'aesthetic_score', 8.1,
'watermarked', FALSE,
'source', 'flickr'
)
Because value_to_sql calls itself, dicts inside dicts become named_struct inside named_struct. Lists and None work too since they already had their own handlers. The SQL comes out in the same order as the dict keys.
The tests
I wrote 6 test cases to cover what I could think of:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def test_value_to_sql_dict():
# flat annotation
assert (
value_to_sql({"caption": "sunset", "source": "flickr"})
== "named_struct('caption', 'sunset', 'source', 'flickr')"
)
# nested: metadata wrapping annotation
assert (
value_to_sql({"annotation": {"caption": "sunset"}})
== "named_struct('annotation', named_struct('caption', 'sunset'))"
)
# list inside struct
assert (
value_to_sql({"tags": ["landscape", "nature"]})
== "named_struct('tags', ['landscape', 'nature'])"
)
# mixed types
assert (
value_to_sql({
"caption": "sunset",
"aesthetic_score": 7.2,
"watermarked": False,
"source": "flickr"
})
== "named_struct('caption', 'sunset', "
"'aesthetic_score', 7.2, "
"'watermarked', FALSE, "
"'source', 'flickr')"
)
# null value
assert value_to_sql({"caption": None}) == "named_struct('caption', NULL)"
# empty dict
assert value_to_sql({}) == "named_struct()"
So instead of writing this:
1
2
3
4
5
6
7
8
table.update(
where="sample_id = 'img_991204'",
values_sql={
"annotation": "named_struct('caption', 'sunset over mountains', "
"'aesthetic_score', 8.1, 'watermarked', FALSE, "
"'source', 'flickr')"
}
)
You just pass a dict:
1
2
3
4
5
6
7
8
9
10
11
table.update(
where="sample_id = 'img_991204'",
values={
"annotation": {
"caption": "sunset over mountains",
"aesthetic_score": 8.1,
"watermarked": False,
"source": "flickr"
}
}
)
So
Most of the work was already done by whoever set up singledispatch in util.py. Because of that, supporting dict was just registering one new function without touching the rest.
LanceDB has good first issues and a contributing guide if you want to try something similar.