-
Notifications
You must be signed in to change notification settings - Fork 120
Description
As we all know, Databricks has a VARIANT data type that is more performant than standard JSON strings for nested data. However, I don't understand how I am supposed to insert into such a column using databricks-sql-connector==4.0.5 though.
If you want to write to a PostgreSQL JSON column in psycopg2, you use json dump. This trick did not seem to work here. I also tried custom escaped JSON and simply dict, but also no.
import databricks.sql
import json
# Replace with your Databricks SQL warehouse details
server_hostname = "<SERVER_HOSTNAME>"
http_path = "<HTTP_PATH>"
access_token = "<ACCESS_TOKEN>"
# Example data to insert
content = {
"age": 29,
"city": "New York"
}
data = [
(1, "Alice", json.dumps(content)),
2, "Bob", None), # (2, "Bob", data) # Raises error
(3, "Charlie", """{"age": 29, "city": "New York"}"""),
]
# Connect to Databricks SQL
with databricks.sql.connect(
server_hostname=server_hostname,
http_path=http_path,
access_token=access_token
) as connection:
with connection.cursor() as cursor:
# Insert data into Delta table
cursor.execute("CREATE OR REPLACE TABLE test_catalog.ad_hoc.variant_test (id INT, name STRING, content VARIANT) USING DELTA")
cursor.executemany(
"INSERT INTO prod_catalog.ad_hoc.variant_test (id, name, content) VALUES (?, ?, ?)",
data
)
If I query the resulting table,
SELECT *, content:age, parse_json("{\"age\": 29, \"city\": \"New York\"}") FROM ad_hoc.variant_test ORDER BY id;
1 and 3 succeeds, but it seems to end up as a "flat"STRING
(schema_of_variant(content)
) instead of OBJECT<age: BIGINT, city: STRING>
.

Attempt 2 (insert a dict) raises the error
databricks.sql.exc.ServerOperationError: [DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION] Cannot resolve "content" due to data type mismatch: cannot cast "MAP<VOID, VOID>" to "VARIANT". SQLSTATE: 42K09; line 1 pos 0
Could we add documentation for how this is supposed to work? I guess MAP, STRUCT and VARIANT all map to Python type dict in some sense. It is only that VARIANT don't have a predefined schema like MAP or STRUCT, and unlike MAP can be nested?