API documentation¶
Connecting¶
- dataset.connect(url: str | None = None, schema: str | None = None, engine_kwargs: dict[str, ~typing.Any] | None = None, ensure_schema: bool = True, row_type: ~collections.abc.Callable[[~collections.abc.Iterable[tuple[str, ~typing.Any]]], ~collections.abc.Mapping[str, ~typing.Any]] = <class 'collections.OrderedDict'>, sqlite_wal_mode: bool = True, on_connect_statements: list[str] | None = None) Database[source]¶
Opens a new connection to a database.
url can be any valid SQLAlchemy engine URL. If url is not defined it will try to use DATABASE_URL from environment variable. Returns an instance of
Database. Additionally, engine_kwargs will be directly passed to SQLAlchemy, e.g. set engine_kwargs={‘pool_recycle’: 3600} will avoid DB connection timeout. Set row_type to an alternate dict-like class to change the type of container rows are stored in.:db = dataset.connect('sqlite:///factbook.db')
One of the main features of dataset is to automatically create tables and columns as data is inserted. This behaviour can optionally be disabled via the ensure_schema argument. It can also be overridden in a lot of the data manipulation methods using the ensure flag.
If you want to run custom SQLite pragmas on database connect, you can add them to on_connect_statements as a set of strings. You can view a full list of PRAGMAs here.
Database¶
- class dataset.Database(url: str, schema: str | None = None, engine_kwargs: dict[str, ~typing.Any] | None = None, ensure_schema: bool = True, row_type: ~collections.abc.Callable[[~collections.abc.Iterable[tuple[str, ~typing.Any]]], ~collections.abc.Mapping[str, ~typing.Any]] = <class 'collections.OrderedDict'>, sqlite_wal_mode: bool = True, on_connect_statements: list[str] | None = None)[source]¶
A database object represents a SQL database with multiple tables.
- begin() None[source]¶
Enter a transaction explicitly.
No data will be written until the transaction has been committed.
- close() None[source]¶
Close all database connections and dispose of the engine.
Releases all pooled connections and makes this object unusable. This should be called when the database is no longer needed, especially in multi-threaded or connection-pooled setups.
- commit() None[source]¶
Commit the current transaction.
Make all statements executed since the transaction was begun permanent.
- create_table(table_name: str, primary_id: str | Literal[False] | None = None, primary_type: TypeEngine | type[TypeEngine] | None = None, primary_increment: bool | None = None) Table[source]¶
Create a new table.
Either loads a table or creates it if it doesn’t exist yet. You can define the name and type of the primary key field, if a new table is to be created. The default is to create an auto-incrementing integer,
id. You can also set the primary key to be a string or big integer. The caller will be responsible for the uniqueness ofprimary_idif it is defined as a text type. You can disable auto-increment behaviour for numeric primary keys by setting primary_increment to False.Returns a
Tableinstance.table = db.create_table('population') # custom id and type table2 = db.create_table('population2', 'age') table3 = db.create_table('population3', primary_id='city', primary_type=db.types.text) # custom length of String table4 = db.create_table('population4', primary_id='city', primary_type=db.types.string(25)) # no primary key table5 = db.create_table('population5', primary_id=False)
- get_table(table_name: str, primary_id: str | Literal[False] | None = None, primary_type: TypeEngine | type[TypeEngine] | None = None, primary_increment: bool | None = None) Table[source]¶
Load or create a table.
This is now the same as
create_table.table = db.get_table('population') # you can also use the short-hand syntax: table = db['population']
- load_table(table_name: str) Table[source]¶
Load a table.
This will fail if the tables does not already exist in the database. If the table exists, its columns will be reflected and are available on the
Tableobject.Returns a
Tableinstance.table = db.load_table('population')
- query(query: str | Executable, **kwargs: Any) ResultIter[source]¶
Run a statement on the database directly.
Allows for the execution of arbitrary read/write queries. A query can either be a plain text string, or a SQLAlchemy expression. If a plain string is passed in, it will be converted to an expression automatically.
Keyword arguments will be used for parameter binding. Use a named bind parameter in the query (i.e.
SELECT * FROM tbl WHERE a = :foo) and pass the value as a keyword argument (i.e.foo='bar').statement = 'SELECT user, COUNT(*) c FROM photos GROUP BY user' for row in db.query(statement): print(row['user'], row['c'])
The returned iterator will yield each result sequentially.
- rollback() None[source]¶
Roll back the current transaction.
Discard all statements executed since the transaction was begun.
- property tables: list[str]¶
Get a listing of all tables that exist in the database.
- property views: list[str]¶
Get a listing of all views that exist in the database.
Table¶
- class dataset.Table(database: Database, table_name: str, primary_id: str | Literal[False] | None = None, primary_type: TypeEngine | type[TypeEngine] | None = None, primary_increment: bool | None = None, auto_create: bool = False)[source]¶
Represents a table in a database and exposes common operations.
- __iter__() ResultIter[source]¶
Return all rows of the table as simple dictionaries.
Allows for iterating over all rows in the table without explicitly calling
find().for row in table: print(row)
- all(*_clauses: ColumnElement[bool], _limit: int | None = None, _offset: int = 0, order_by: str | Sequence[str] | None = None, _streamed: bool = False, _step: int | None = 1000, **kwargs: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) ResultIter¶
Perform a simple search on the table.
Simply pass keyword arguments as
filter.results = table.find(country='France') results = table.find(country='France', year=1980)
Using
_limit:# just return the first 10 rows results = table.find(country='France', _limit=10)
You can sort the results by single or multiple columns. Append a minus sign to the column name for descending order:
# sort results by a column 'year' results = table.find(country='France', order_by='year') # return all rows sorted by multiple columns (descending by year) results = table.find(order_by=['country', '-year'])
You can also submit filters based on criteria other than equality, see Advanced filters for details.
To run more complex queries with JOINs, or to perform GROUP BY-style aggregation, you can also use
db.query()to run raw SQL queries instead.
- property columns: list[str]¶
Get a listing of all columns that exist in the table.
- count(*_clauses: ColumnElement[bool], **kwargs: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) int[source]¶
Return the count of results for the given filter set.
- create_column(name: str, type: TypeEngine | type[TypeEngine], **kwargs: object) None[source]¶
Create a new column
nameof a specified type.table.create_column('created_at', db.types.datetime)
type corresponds to an SQLAlchemy type as described by dataset.db.Types. Additional keyword arguments are passed to the constructor of Column, so that default values, and options like nullable and unique can be set.
table.create_column('key', unique=True, nullable=False) table.create_column('food', default='banana')
- create_column_by_example(name: str, value: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) None[source]¶
Explicitly create a new column
namewith a type that is appropriate to store the given examplevalue. The type is guessed in the same way as for the insert method withensure=True.table.create_column_by_example('length', 4.2)
If a column of the same name already exists, no action is taken, even if it is not of the type we would have created.
- create_index(columns: Sequence[str], name: str | None = None, **kw: object) None[source]¶
Create an index to speed up queries on a table.
If no
nameis given a random name is created.table.create_index(['name', 'country'])
- delete(*clauses: ColumnElement[bool], **filters: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) bool[source]¶
Delete rows from the table.
Keyword arguments can be used to add column-based filters. The filter criterion will always be equality:
table.delete(place='Berlin')
If no arguments are given, all records are deleted.
- distinct(*args: str | ColumnElement[bool], _limit: int | None = None, _offset: int | None = 0, **kwargs: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) ResultIter[source]¶
Return all the unique (distinct) values for the given
columns.# returns only one row per year, ignoring the rest table.distinct('year') # works with multiple columns, too table.distinct('year', 'country') # you can also combine this with a filter table.distinct('year', country='China')
- drop() None[source]¶
Drop the table from the database.
Deletes both the schema and all the contents within it.
- property exists: bool¶
Check to see if the table currently exists in the database.
- find(*_clauses: ColumnElement[bool], _limit: int | None = None, _offset: int = 0, order_by: str | Sequence[str] | None = None, _streamed: bool = False, _step: int | None = 1000, **kwargs: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) ResultIter[source]¶
Perform a simple search on the table.
Simply pass keyword arguments as
filter.results = table.find(country='France') results = table.find(country='France', year=1980)
Using
_limit:# just return the first 10 rows results = table.find(country='France', _limit=10)
You can sort the results by single or multiple columns. Append a minus sign to the column name for descending order:
# sort results by a column 'year' results = table.find(country='France', order_by='year') # return all rows sorted by multiple columns (descending by year) results = table.find(order_by=['country', '-year'])
You can also submit filters based on criteria other than equality, see Advanced filters for details.
To run more complex queries with JOINs, or to perform GROUP BY-style aggregation, you can also use
db.query()to run raw SQL queries instead.
- find_one(*args: ColumnElement[bool], **kwargs: None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]) Mapping[str, Any] | None[source]¶
Get a single result from the table.
Works just like
find()but returns one result, orNone.row = table.find_one(country='United States')
- has_column(column: str | None) bool[source]¶
Check if a column with the given name exists on this table.
- has_index(columns: Iterable[str]) bool[source]¶
Check if an index exists to cover the given
columns.
- insert(row: Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]], ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) Any[source]¶
Add a
rowdict by inserting it into the table.If
ensureis set, any of the keys of the row are not table columns, they will be created automatically.During column creation,
typeswill be checked for a key matching the name of a column to be created, and the given SQLAlchemy column type will be used. Otherwise, the type is guessed from the row value, defaulting to a simple unicode field.data = dict(title='I am a banana!') table.insert(data)
Returns the inserted row’s primary key.
- insert_ignore(row: Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]], keys: Sequence[str], ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) Any[source]¶
Add a
rowdict into the table if the row does not exist.If rows with matching
keysexist no change is made.Setting
ensureresults in automatically creating missing columns, i.e., keys of the row are not table columns.During column creation,
typeswill be checked for a key matching the name of a column to be created, and the given SQLAlchemy column type will be used. Otherwise, the type is guessed from the row value, defaulting to a simple unicode field.data = dict(id=10, title='I am a banana!') table.insert_ignore(data, ['id'])
- insert_many(rows: Sequence[Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]]], chunk_size: int = 1000, ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) None[source]¶
Add many rows at a time.
This is significantly faster than adding them one by one. Per default the rows are processed in chunks of 1000 per commit, unless you specify a different
chunk_size.See
insert()for details on the other parameters.rows = [dict(name='Dolly')] * 10000 table.insert_many(rows)
- update(row: Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]], keys: Sequence[str], ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None, return_count: bool = False) bool | int[source]¶
Update a row in the table.
The update is managed via the set of column names stated in
keys: they will be used as filters for the data to be updated, using the values inrow.# update all entries with id matching 10, setting their title # columns data = dict(id=10, title='I am a banana!') table.update(data, ['id'])
If keys in
rowupdate columns not present in the table, they will be created based on the settings ofensureandtypes, matching the behavior ofinsert().
- update_many(rows: Sequence[Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]]], keys: Sequence[str], chunk_size: int = 1000, ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) None[source]¶
Update many rows in the table at a time.
This is significantly faster than updating them one by one. Per default the rows are processed in chunks of 1000 per commit, unless you specify a different
chunk_size.See
update()for details on the other parameters.
- upsert(row: Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]], keys: Sequence[str], ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) Any[source]¶
An UPSERT is a smart combination of insert and update.
If rows with matching
keysexist they will be updated, otherwise a new row is inserted in the table.data = dict(id=10, title='I am a banana!') table.upsert(data, ['id'])
- upsert_many(rows: Sequence[Mapping[str, None | bool | int | float | str | bytes | Decimal | date | datetime | dict[str, None | bool | int | float | str | bytes | Decimal | date | datetime] | list[None | bool | int | float | str | bytes | Decimal | date | datetime]]], keys: Sequence[str], chunk_size: int = 1000, ensure: bool | None = None, types: dict[str, TypeEngine | type[TypeEngine]] | None = None) None[source]¶
Sorts multiple input rows into upserts and inserts. Inserts are passed to insert and upserts are updated.
See
upsert()andinsert_many().
Data Export¶
Note: Data exporting has been extracted into a stand-alone package, datafreeze. See the relevant repository here.