API documentation

dataset.connect(url=None, schema=None, reflect_metadata=True, engine_kwargs=None, reflect_views=True, ensure_schema=True, row_type=<class 'collections.OrderedDict'>)[source]

Opens a new connection to a database.

url can be any valid SQLAlchemy engine URL. If url is not defined, the DATABASE_URL environment variable will be used instead. Returns an instance of Database. Set reflect_metadata to False if you don't want the entire database schema to be pre-loaded; this significantly speeds up connecting to large databases with lots of tables. reflect_views can be set to False if you don't want views to be loaded. Additionally, engine_kwargs will be passed directly to SQLAlchemy, e.g. setting engine_kwargs={'pool_recycle': 3600} avoids DB connection timeouts. Set row_type to an alternate dict-like class to change the type of container rows are stored in:

db = dataset.connect('sqlite:///factbook.db')
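
For example, to pass extra options to the underlying SQLAlchemy engine (a sketch; pool_recycle is a standard create_engine argument, the URL is illustrative):

db = dataset.connect('sqlite:///factbook.db',
                     engine_kwargs={'pool_recycle': 3600})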

Database

class dataset.Database(url, schema=None, reflect_metadata=True, engine_kwargs=None, reflect_views=True, ensure_schema=True, row_type=<class 'collections.OrderedDict'>)[source]

A database object represents a SQL database with multiple tables.

begin()[source]

Enter a transaction explicitly.

No data will be written until the transaction has been committed. NOTICE: Schema modification operations, such as the creation of tables or columns, will not be part of the transactional context.

commit()[source]

Commit the current transaction.

Make all statements executed since the transaction was begun permanent.
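
A minimal sketch of an explicit transaction, assuming a db connection and a 'population' table as in the examples below:

db.begin()
try:
    db['population'].insert(dict(country='Germany', year=2010))
    db.commit()
except Exception:
    db.rollback()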

create_table(table_name, primary_id='id', primary_type='Integer')[source]

Create a new table.

The new table will automatically have an id column unless a different name is specified via the optional primary_id parameter, which will be used as the primary key of the table. The automatic id is an auto-incrementing integer, while a custom primary_id can be a String or an Integer as specified with the primary_type flag. The default length of a String primary key is 255, but the caller can specify a different length. The caller is responsible for the uniqueness of a manually supplied primary_id.

This custom id feature is only available via direct create_table call.

Returns a Table instance.

table = db.create_table('population')

# custom id and type
table2 = db.create_table('population2', 'age')
table3 = db.create_table('population3', primary_id='race', primary_type='String')
# custom length of String
table4 = db.create_table('population4', primary_id='race', primary_type='String(50)')
get_table(table_name, primary_id='id', primary_type='Integer')[source]

Smart wrapper around load_table and create_table.

Either loads a table or creates it if it doesn't exist yet. As a short-hand, you can also create a table with a custom id and type using the [] operator, passing table_name, primary_id, and primary_type as a tuple.

Returns a Table instance.

table = db.get_table('population')
# you can also use the short-hand syntax:
table = db['population']
load_table(table_name)[source]

Load a table.

This will fail if the table does not already exist in the database. If the table exists, its columns will be reflected and are available on the Table object.

Returns a Table instance.

table = db.load_table('population')
query(query, **kw)[source]

Run a statement on the database directly.

Allows for the execution of arbitrary read/write queries. A query can either be a plain text string, or a SQLAlchemy expression. If a plain string is passed in, it will be converted to an expression automatically.

Keyword arguments will be used for parameter binding. See the SQLAlchemy documentation for details.

The returned iterator will yield each result sequentially.

res = db.query('SELECT user, COUNT(*) c FROM photos GROUP BY user')
for row in res:
    print(row['user'], row['c'])
rollback()[source]

Roll back the current transaction.

Discard all statements executed since the transaction was begun.

tables

Get a listing of all tables that exist in the database.
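
For example (the table names shown are illustrative):

print(db.tables)
# ['population', 'photos']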

Table

class dataset.Table(database, table)[source]

Represents a table in a database and exposes common operations.

all()[source]

Return all rows of the table as simple dictionaries.

This is simply a shortcut to find() called with no arguments.

rows = table.all()
columns

Get a listing of all columns that exist in the table.
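
For example (the column names shown are illustrative):

print(table.columns)
# ['id', 'country', 'year']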

count(*_clauses, **kwargs)[source]

Return the count of results for the given filter set.
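
Filters use the same keyword syntax as find(); this sketch assumes a country column exists:

n = table.count(country='France')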

create_column(name, type)[source]

Explicitly create a new column name of a specified type.

type must be a SQLAlchemy column type.

table.create_column('created_at', sqlalchemy.DateTime)
create_index(columns, name=None, **kw)[source]

Create an index to speed up queries on a table.

If no name is given, a random name is created.

table.create_index(['name', 'country'])
delete(*_clauses, **_filter)[source]

Delete rows from the table.

Keyword arguments can be used to add column-based filters. The filter criterion will always be equality:

table.delete(place='Berlin')

If no arguments are given, all records are deleted.

distinct(*args, **_filter)[source]

Return all rows of a table, but remove rows with duplicate values in the given columns.

Internally this creates a DISTINCT statement.

# returns only one row per year, ignoring the rest
table.distinct('year')
# works with multiple columns, too
table.distinct('year', 'country')
# you can also combine this with a filter
table.distinct('year', country='China')
drop()[source]

Drop the table from the database.

Delete both the schema and all the contents within it. Note: the object will raise an Exception if you use it after dropping the table. If you want to re-create the table, make sure to get a fresh instance from the Database.
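
A short sketch, assuming the table was originally obtained via db['population']:

table.drop()
# the old object must not be reused; fetch a fresh instance instead
table = db['population']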

drop_column(name)[source]

Drop the column name.

table.drop_column('created_at')
find(*_clauses, **kwargs)[source]

Perform a simple search on the table.

Simply pass keyword arguments as filter.

results = table.find(country='France')
results = table.find(country='France', year=1980)

Using _limit:

# just return the first 10 rows
results = table.find(country='France', _limit=10)

You can sort the results by single or multiple columns. Append a minus sign to the column name for descending order:

# sort results by a column 'year'
results = table.find(country='France', order_by='year')
# return all rows sorted by multiple columns (by year in descending order)
results = table.find(order_by=['country', '-year'])

For more complex queries, please use db.query() instead.

find_one(*args, **kwargs)[source]

Get a single result from the table.

Works just like find() but returns one result, or None.

row = table.find_one(country='United States')
insert(row, ensure=None, types={})[source]

Add a row (type: dict) by inserting it into the table.

If ensure is set, any keys of the row that are not table columns will be created automatically.

During column creation, types will be checked for a key matching the name of a column to be created, and the given SQLAlchemy column type will be used. Otherwise, the type is guessed from the row value, defaulting to a simple unicode field.

data = dict(title='I am a banana!')
table.insert(data)

Returns the inserted row’s primary key.
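
A sketch of combining ensure and types to control automatic column creation (the weight column and its SQLAlchemy type are illustrative):

import sqlalchemy
data = dict(title='I am a banana!', weight=1.21)
table.insert(data, ensure=True, types={'weight': sqlalchemy.Float})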

insert_ignore(row, keys, ensure=None, types={})[source]

Add a row (type: dict) into the table if the row does not exist.

If rows with matching keys exist, they will not be added to the table.

If ensure is set, any keys of the row that are not table columns will be created automatically.

During column creation, types will be checked for a key matching the name of a column to be created, and the given SQLAlchemy column type will be used. Otherwise, the type is guessed from the row value, defaulting to a simple unicode field.

data = dict(id=10, title='I am a banana!')
table.insert_ignore(data, ['id'])
insert_many(rows, chunk_size=1000, ensure=None, types={})[source]

Add many rows at a time.

This is significantly faster than adding them one by one. By default the rows are processed in chunks of 1000 per commit, unless you specify a different chunk_size.

See insert() for details on the other parameters.

rows = [dict(name='Dolly')] * 10000
table.insert_many(rows)
update(row, keys, ensure=None, types={})[source]

Update a row in the table.

The update is managed via the set of column names stated in keys: they will be used as filters for the data to be updated, using the values in row.

# update all entries with id matching 10, setting their title columns
data = dict(id=10, title='I am a banana!')
table.update(data, ['id'])

If keys in row update columns not present in the table, they will be created based on the settings of ensure and types, matching the behavior of insert().

upsert(row, keys, ensure=None, types={})[source]

An UPSERT is a smart combination of insert and update.

If rows with matching keys exist they will be updated, otherwise a new row is inserted in the table.

data = dict(id=10, title='I am a banana!')
table.upsert(data, ['id'])

Data Export

dataset.freeze(result, format='csv', filename='freeze.csv', fileobj=None, prefix='.', mode='list', **kw)[source]

Perform a data export of a given result set. This is a very flexible exporter, allowing for various output formats, metadata assignment, and file name templating to dump each record (or a set of records) into individual files.

result = db['person'].all()
dataset.freeze(result, format='json', filename='all-persons.json')

Instead of passing in the file name, you can also pass a file object:

result = db['person'].all()
fh = open('/dev/null', 'wb')
dataset.freeze(result, format='json', fileobj=fh)

Be aware that this will disable file name templating and store all results to the same file.

If result is a table (rather than a result set), all records in the table are exported (as if result.all() had been called).

freeze supports two values for mode:

list (default)
The entire result set is dumped into a single file.
item
One file is created for each row in the result set.

You should set a filename for the exported file(s). If mode is set to item, the function will generate one file per row. In that case you can use values from the row as placeholders in filenames:

dataset.freeze(res, mode='item', format='json',
               filename='item-{{id}}.json')

The following output formats are supported:

csv
Comma-separated values, first line contains column names.
json
A JSON file containing a list of dictionaries for each row in the table. If a callback is given, JSON with padding (JSONP) will be generated.
tabson
Tabson is a smart combination of the space-efficiency of CSV and the parsability and structure of JSON.

You can pass additional named parameters specific to the used format.

As an example, you can freeze to JSON with an indent of 4 and without the wrapping object:

dataset.freeze(res, format='json', indent=4, wrap=False,
               filename='output.json')
json and tabson
callback:
if provided, generate a JSONP string using the given callback function, i.e. something like callback && callback({...})
indent:
if indent is a non-negative integer (it is 2 by default when you call dataset.freeze, and None via the datafreeze command), then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
meta:
if meta is not None (default: {}), it will be included in the JSON output (for json, only if wrap is True).
wrap (only for json):
If wrap is True (default), the JSON output is an object of the form {"count": 2, "results": [...]}. If meta is not None, a third property meta is added to the wrapping object with this value.
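
A sketch of generating a JSONP export using the callback option (the callback name and file name are illustrative):

dataset.freeze(res, format='json', callback='handleData',
               filename='output.jsonp')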