SQLAlchemy Dialect for BigQuery
===============================

|GA| |pypi| |versions|

`SQLAlchemy Dialects`_

- `Dialect Documentation`_
- `Product Documentation`_

.. |GA| image:: https://img.shields.io/badge/support-GA-gold.svg
   :target: https://github.com/googleapis/google-cloud-python/blob/main/README.rst#general-availability
.. |pypi| image:: https://img.shields.io/pypi/v/sqlalchemy-bigquery.svg
   :target: https://pypi.org/project/sqlalchemy-bigquery/
.. |versions| image:: https://img.shields.io/pypi/pyversions/sqlalchemy-bigquery.svg
   :target: https://pypi.org/project/sqlalchemy-bigquery/
.. _SQLAlchemy Dialects: https://docs.sqlalchemy.org/en/14/dialects/
.. _Dialect Documentation: https://googleapis.dev/python/sqlalchemy-bigquery/latest
.. _Product Documentation: https://cloud.google.com/bigquery/docs/


Quick Start
-----------

In order to use this library, you first need to go through the following steps:

1. `Select or create a Cloud Platform project.`_
2. [Optional] `Enable billing for your project.`_
3. `Enable the BigQuery Storage API.`_
4. `Setup Authentication.`_

.. _Select or create a Cloud Platform project.: https://console.cloud.google.com/project
.. _Enable billing for your project.: https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project
.. _Enable the BigQuery Storage API.: https://console.cloud.google.com/apis/library/bigquery.googleapis.com
.. _Setup Authentication.: https://googleapis.dev/python/google-api-core/latest/auth.html


Installation
------------

Install this library in a `virtualenv`_ using pip. `virtualenv`_ is a tool to
create isolated Python environments. The basic problem it addresses is one of
dependencies and versions, and indirectly permissions.

With `virtualenv`_, it's possible to install this library without needing system
install permissions, and without clashing with the installed system
dependencies.

.. _`virtualenv`: https://virtualenv.pypa.io/en/latest/


Supported Python Versions
^^^^^^^^^^^^^^^^^^^^^^^^^
Python >= 3.9, < 3.14

Unsupported Python Versions
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Python <= 3.8.


Mac/Linux
^^^^^^^^^

.. code-block:: console

    pip install virtualenv
    virtualenv <your-env>
    source <your-env>/bin/activate
    <your-env>/bin/pip install sqlalchemy-bigquery


Windows
^^^^^^^

.. code-block:: console

    pip install virtualenv
    virtualenv <your-env>
    <your-env>\Scripts\activate
    <your-env>\Scripts\pip.exe install sqlalchemy-bigquery


Installations when processing large datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When handling large datasets, you may see speed increases by also installing the
``bqstorage`` dependencies. See the instructions above about creating a virtual
environment and then install ``sqlalchemy-bigquery`` using the ``bqstorage`` extras:

.. code-block:: console

    source <your-env>/bin/activate
    <your-env>/bin/pip install sqlalchemy-bigquery[bqstorage]


Usage
-----

SQLAlchemy
^^^^^^^^^^

.. code-block:: python

    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *

    engine = create_engine('bigquery://project')
    table = Table('dataset.table', MetaData(bind=engine), autoload=True)
    print(select([func.count('*')], from_obj=table).scalar())

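The snippet above uses the older SQLAlchemy 1.x idioms (bound metadata, ``autoload=True``). A minimal sketch of the same count in the 1.4+ style (``project`` and ``dataset.table`` are placeholders for your own project and table):

.. code-block:: python

    from sqlalchemy import MetaData, Table, create_engine, func, select

    engine = create_engine('bigquery://project')
    # Reflect the table without binding the metadata to the engine
    table = Table('dataset.table', MetaData(), autoload_with=engine)

    with engine.connect() as conn:
        print(conn.execute(select(func.count()).select_from(table)).scalar())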
|
Project
^^^^^^^

``project`` in ``bigquery://project`` is used to instantiate the BigQuery client with that specific project ID. To infer the project from the environment, use ``bigquery://`` without ``project``.

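For example, a minimal sketch (``my-project`` is a placeholder; the second form picks the project up from the environment, e.g. the active credentials or the ``GOOGLE_CLOUD_PROJECT`` variable):

.. code-block:: python

    from sqlalchemy.engine import create_engine

    # Explicit project ID in the connection string
    engine_explicit = create_engine('bigquery://my-project')

    # No project ID: infer it from the environment
    engine_inferred = create_engine('bigquery://')
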
Authentication
^^^^^^^^^^^^^^

Follow the `Google Cloud library guide <https://google-cloud-python.readthedocs.io/en/latest/core/auth.html>`_ for authentication.

Alternatively, you can choose either of the following approaches:

* provide the path to a service account JSON file in ``create_engine()`` using the ``credentials_path`` parameter:

.. code-block:: python

    # provide the path to a service account JSON file
    engine = create_engine('bigquery://', credentials_path='/path/to/keyfile.json')

* pass the credentials in ``create_engine()`` as a Python dictionary using the ``credentials_info`` parameter:

.. code-block:: python

    # provide credentials as a Python dictionary
    credentials_info = {
        "type": "service_account",
        "project_id": "your-service-account-project-id"
    }
    engine = create_engine('bigquery://', credentials_info=credentials_info)

Location
^^^^^^^^

To specify the location of your datasets, pass ``location`` to ``create_engine()``:

.. code-block:: python

    engine = create_engine('bigquery://project', location="asia-northeast1")


Table names
^^^^^^^^^^^

To query tables from non-default projects or datasets, use the following format for the SQLAlchemy schema name: ``[project.]dataset``, e.g.:

.. code-block:: python

    # If neither dataset nor project are the default
    sample_table_1 = Table('natality', schema='bigquery-public-data.samples')
    # If just dataset is not the default
    sample_table_2 = Table('natality', schema='bigquery-public-data')

Batch size
^^^^^^^^^^

By default, ``arraysize`` is set to ``5000``. ``arraysize`` is used to set the batch size for fetching results. To change it, pass ``arraysize`` to ``create_engine()``:

.. code-block:: python

    engine = create_engine('bigquery://project', arraysize=1000)

Page size for dataset.list_tables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, ``list_tables_page_size`` is set to ``1000``. ``list_tables_page_size`` is used to set ``max_results`` for the `dataset.list_tables`_ operation. To change it, pass ``list_tables_page_size`` to ``create_engine()``:

.. _`dataset.list_tables`: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

.. code-block:: python

    engine = create_engine('bigquery://project', list_tables_page_size=100)

Adding a Default Dataset
^^^^^^^^^^^^^^^^^^^^^^^^

If you want to have the ``Client`` use a default dataset, specify it as the "database" portion of the connection string.

.. code-block:: python

    engine = create_engine('bigquery://project/dataset')

When using a default dataset, don't include the dataset name in the table name, e.g.:

.. code-block:: python

    table = Table('table_name')

Note that specifying a default dataset doesn't restrict execution of queries to that particular dataset when using raw queries, e.g.:

.. code-block:: python

    # Set default dataset to dataset_a
    engine = create_engine('bigquery://project/dataset_a')

    # This will still execute and return rows from dataset_b
    engine.execute('SELECT * FROM dataset_b.table').fetchall()


Connection String Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are many situations where you can't call ``create_engine`` directly, such as when using tools like `Flask SQLAlchemy <http://flask-sqlalchemy.pocoo.org/2.3/>`_. For situations like these, or for situations where you want the ``Client`` to have a `default_query_job_config <https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client>`_, you can pass many arguments in the query of the connection string.

The ``credentials_path``, ``credentials_info``, ``credentials_base64``, ``location``, ``arraysize`` and ``list_tables_page_size`` parameters are used by this library, and the rest are used to create a `QueryJobConfig <https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig>`_.

Note that if you want to use query strings, it will be more reliable if you use three slashes, so ``'bigquery:///?a=b'`` will work reliably, but ``'bigquery://?a=b'`` might be interpreted as having a "database" of ``?a=b``, depending on the system being used to parse the connection string.

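For instance, a minimal sketch of the three-slash form (``arraysize`` here stands in for any supported query parameter):

.. code-block:: python

    # No project/"database" in the URL, so use three slashes before the query string
    engine = create_engine('bigquery:///?arraysize=1000')
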
Here are examples of all the supported arguments. Any not present are either for legacy SQL (which isn't supported by this library), or are too complex and are not implemented.

.. code-block:: python

    engine = create_engine(
        'bigquery://some-project/some-dataset' '?'
        'credentials_path=/some/path/to.json' '&'
        'location=some-location' '&'
        'arraysize=1000' '&'
        'list_tables_page_size=100' '&'
        'clustering_fields=a,b,c' '&'
        'create_disposition=CREATE_IF_NEEDED' '&'
        'destination=different-project.different-dataset.table' '&'
        'destination_encryption_configuration=some-configuration' '&'
        'dry_run=true' '&'
        'labels=a:b,c:d' '&'
        'maximum_bytes_billed=1000' '&'
        'priority=INTERACTIVE' '&'
        'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
        'use_query_cache=true' '&'
        'write_disposition=WRITE_APPEND'
    )

In cases where you wish to include the full credentials in the connection URI, you can base64-encode the credentials JSON file and supply the encoded string to the ``credentials_base64`` parameter.

.. code-block:: python

    engine = create_engine(
        'bigquery://some-project/some-dataset' '?'
        'credentials_base64=eyJrZXkiOiJ2YWx1ZSJ9Cg==' '&'
        'location=some-location' '&'
        'arraysize=1000' '&'
        'list_tables_page_size=100' '&'
        'clustering_fields=a,b,c' '&'
        'create_disposition=CREATE_IF_NEEDED' '&'
        'destination=different-project.different-dataset.table' '&'
        'destination_encryption_configuration=some-configuration' '&'
        'dry_run=true' '&'
        'labels=a:b,c:d' '&'
        'maximum_bytes_billed=1000' '&'
        'priority=INTERACTIVE' '&'
        'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
        'use_query_cache=true' '&'
        'write_disposition=WRITE_APPEND'
    )

To create the base64-encoded string, you can use the command line tool ``base64``, or ``openssl base64``, or ``python -m base64``.

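For example, a minimal Python sketch (the key file path is a placeholder) that encodes a service account key file and passes it in the connection string:

.. code-block:: python

    import base64

    from sqlalchemy.engine import create_engine

    # Read and base64-encode the downloaded service account key file
    with open('/path/to/keyfile.json', 'rb') as f:
        credentials_base64 = base64.b64encode(f.read()).decode('utf-8')

    engine = create_engine(
        'bigquery://some-project/some-dataset'
        '?credentials_base64=' + credentials_base64
    )
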
Alternatively, you can use an online generator like `www.base64encode.org <https://www.base64encode.org>`_ and paste in your credentials JSON file to be encoded.


Supplying Your Own BigQuery Client
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above connection string parameters allow you to influence how the BigQuery client used to execute your queries will be instantiated.
If you need additional control, you can supply a BigQuery client of your own:

.. code-block:: python

    from google.cloud import bigquery

    custom_bq_client = bigquery.Client(...)

    engine = create_engine(
        'bigquery://some-project/some-dataset?user_supplied_client=True',
        connect_args={'client': custom_bq_client},
    )


Creating tables
^^^^^^^^^^^^^^^

To add metadata to a table:

.. code-block:: python

    table = Table('mytable', ...,
        bigquery_description='my table description',
        bigquery_friendly_name='my table friendly name',
        bigquery_default_rounding_mode="ROUND_HALF_EVEN",
        bigquery_expiration_timestamp=datetime.datetime.fromisoformat("2038-01-01T00:00:00+00:00"),
    )

To add metadata to a column:

.. code-block:: python

    Column('mycolumn', doc='my column description')

To create a clustered table:

.. code-block:: python

    table = Table('mytable', ..., bigquery_clustering_fields=["a", "b", "c"])

To create a time-unit column-partitioned table:

.. code-block:: python

    from google.cloud import bigquery

    table = Table('mytable', ...,
        bigquery_time_partitioning=bigquery.TimePartitioning(
            field="mytimestamp",
            type_="MONTH",
            expiration_ms=1000 * 60 * 60 * 24 * 30 * 6,  # 6 months
        ),
        bigquery_require_partition_filter=True,
    )

To create an ingestion-time partitioned table:

.. code-block:: python

    from google.cloud import bigquery

    table = Table('mytable', ...,
        bigquery_time_partitioning=bigquery.TimePartitioning(),
        bigquery_require_partition_filter=True,
    )

To create an integer-range partitioned table:

.. code-block:: python

    from google.cloud import bigquery

    table = Table('mytable', ...,
        bigquery_range_partitioning=bigquery.RangePartitioning(
            field="zipcode",
            range_=bigquery.PartitionRange(start=0, end=100000, interval=10),
        ),
        bigquery_require_partition_filter=True,
    )


Threading and Multiprocessing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because this client uses the ``grpc`` library, it's safe to
share instances across threads.

In multiprocessing scenarios, the best
practice is to create client instances *after* the invocation of
``os.fork`` by ``multiprocessing.pool.Pool`` or
``multiprocessing.Process``.
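
For example, a minimal sketch of this multiprocessing pattern (the project, dataset, and table names are placeholders):

.. code-block:: python

    from multiprocessing import Pool

    from sqlalchemy import create_engine, text

    def count_rows(table_name):
        # Create the engine (and the BigQuery client behind it) inside the
        # worker process, i.e. after the fork, not in the parent process.
        engine = create_engine('bigquery://some-project/some-dataset')
        with engine.connect() as conn:
            return conn.execute(text(f'SELECT COUNT(*) FROM {table_name}')).scalar()

    if __name__ == '__main__':
        with Pool(processes=2) as pool:
            print(pool.map(count_rows, ['table_a', 'table_b']))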