Loading Data into BigQuery (Google Professional Data Engineer, GCP)
- Ways to load data into BigQuery:
- From Cloud Storage
- From other Google services
- From a readable data source (such as your local machine)
- By inserting individual records using streaming inserts
- Using DML statements to perform bulk inserts
- Using a BigQuery I/O transform in a Dataflow pipeline to write data to BigQuery (see the Beam sketch below)
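A minimal Apache Beam sketch of the Dataflow route, assuming a hypothetical table my-project:my_dataset.events with fields user_id and event_name:

```python
# A sketch, not the only way: write rows to BigQuery from a Beam/Dataflow
# pipeline. Table and field names below are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "CreateRows" >> beam.Create([
            {"user_id": "u1", "event_name": "login"},
            {"user_id": "u2", "event_name": "purchase"},
        ])
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",            # hypothetical table
            schema="user_id:STRING,event_name:STRING",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```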
- can load data into a new table or partition
- can append data to an existing table or partition
- can overwrite a table or partition
- the load job's write disposition controls which of these happens (see the sketch below)
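A minimal sketch of controlling create/append/overwrite behavior with the Python client, assuming a hypothetical CSV file gs://my-bucket/data.csv and table my_dataset.my_table:

```python
# Hypothetical bucket/table names; write_disposition picks between
# creating, appending to, or overwriting the destination.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # WRITE_EMPTY    -> only write if the table is empty (default)
    # WRITE_APPEND   -> append to an existing table or partition
    # WRITE_TRUNCATE -> overwrite the table or partition
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",      # hypothetical source file
    "my_dataset.my_table",          # hypothetical destination
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```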
- Methods for ingesting data into BigQuery:
- the BigQuery Jobs API
- streaming writes
- writing query results into a table (see the sketch after this list)
- loading CSV files from Cloud Storage
- using BigQuery as a Cloud Dataflow sink
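A minimal sketch of ingesting by writing query results into a table, assuming a hypothetical destination table my-project.my_dataset.top_names (the dataset being queried is a real public one):

```python
# Ingest by writing query results into a destination table. The
# destination name is hypothetical; the queried dataset is public.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.top_names"  # hypothetical table
)
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
client.query(sql, job_config=job_config).result()  # results land in the table
```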
- The default source format for loading data is CSV.
- Streaming inserts via the BigQuery API are also supported; BigQuery buffers records before committing them to storage (see the sketch below).
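A minimal streaming-insert sketch, assuming a hypothetical table my_dataset.events whose schema matches the row dictionaries:

```python
# Stream individual records; rows sit in the streaming buffer before
# being committed to managed storage. Table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"user_id": "u1", "event_name": "login"},
    {"user_id": "u2", "event_name": "purchase"},
]
errors = client.insert_rows_json("my_dataset.events", rows)
if errors:
    print(f"Some rows failed to insert: {errors}")
```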
Load data from Cloud Storage and readable data sources in the following formats:
Cloud Storage:
- Avro
- CSV
- JSON (newline delimited only)
- ORC
- Parquet
- Datastore exports
- Firestore exports
Readable data source (such as your local machine; see the local-file sketch after this list):
- Avro
- CSV
- JSON (newline delimited only)
- ORC
- Parquet
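A minimal sketch of loading from a readable data source (a local CSV file), assuming hypothetical names data.csv and my_dataset.my_table:

```python
# Load from a readable data source: a local CSV file. File and table
# names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

with open("data.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, "my_dataset.my_table", job_config=job_config
    )
load_job.result()  # block until the load job completes
```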
Choosing a data ingestion format
- can load data in a variety of formats
- Loaded data is converted into Capacitor, BigQuery's columnar storage format.
- During loading, select a data ingestion format based on:
- your data's schema (e.g., CSV cannot represent nested or repeated fields)
- embedded newlines (JSON must be newline delimited, one record per line)
- external limitations (e.g., the formats your source system can export)
Loading encoded data
- BigQuery supports UTF-8 encoding for both flat data and nested or repeated data.
- ISO-8859-1 encoding is supported for flat data only, and only for CSV files.
- By default, the BigQuery service expects all source data to be UTF-8 encoded (see the sketch below).
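A minimal sketch of loading a Latin-1 encoded CSV file, assuming a hypothetical gs://my-bucket/latin1.csv; the encoding setting on the load job config overrides the UTF-8 default:

```python
# Load a CSV file that is ISO-8859-1 (Latin-1) encoded. Without the
# encoding setting, BigQuery assumes UTF-8. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",  # only supported for CSV; default is UTF-8
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/latin1.csv", "my_dataset.my_table", job_config=job_config
)
load_job.result()
```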
Loading compressed and uncompressed data
- The Avro binary format is preferred for loading both compressed and uncompressed data, because Avro data blocks can be read in parallel even when compressed (see the sketch below).
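A minimal Avro load sketch, assuming a hypothetical gs://my-bucket/data.avro; no schema is specified because Avro files are self-describing:

```python
# Load an Avro file; the schema comes from the file itself, and
# compressed data blocks load without extra flags. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.avro", "my_dataset.my_table", job_config=job_config
)
load_job.result()
```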
Loading denormalized, nested, and repeated data
- BigQuery performs best when data is denormalized; rather than star or snowflake schemas, use nested and repeated fields (see the schema sketch below).
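A minimal sketch of a denormalized schema with nested and repeated fields, assuming a hypothetical my-project.my_dataset.orders table where each order embeds its line items instead of joining to a separate table:

```python
# Denormalized table: each order embeds its line items as a repeated
# RECORD rather than living in a separate joined table. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField(
        "line_items",
        "RECORD",
        mode="REPEATED",  # REPEATED -> an array of nested records per row
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
        ],
    ),
]

table = bigquery.Table("my-project.my_dataset.orders", schema=schema)
client.create_table(table)
```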
Schema auto-detection
- Available when you load data into BigQuery, and also when you query an external data source.
- Steps
- BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a sample.
- It then examines each field and attempts to assign a data type to that field based on the values in the sample.
- Use schema auto-detection for JSON or CSV files (see the sketch below).
- Not available for Avro, ORC, or Parquet files, or for Datastore or Firestore exports, because their schemas are retrieved automatically from the self-describing source data.
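A minimal schema auto-detection sketch, assuming a hypothetical newline-delimited JSON file gs://my-bucket/data.json:

```python
# Let BigQuery sample the file and infer the schema. File and table
# names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # only meaningful for CSV and JSON sources
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.json", "my_dataset.my_table", job_config=job_config
)
load_job.result()
print(client.get_table("my_dataset.my_table").schema)  # inspect what was inferred
```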
BigQuery Data Transfer Service
It automates loading data into BigQuery from these services (see the sketch after the list):
Google Software as a Service (SaaS) apps
- Campaign Manager
- Cloud Storage
- Google Ad Manager
- Google Ads
- Google Merchant Center (beta)
- Google Play
- Search Ads 360 (beta)
- YouTube Channel reports
- YouTube Content Owner reports
External cloud storage providers
- Amazon S3
Data warehouses
- Teradata
- Amazon Redshift
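A minimal Data Transfer Service sketch, assuming a hypothetical project, bucket, and dataset; the data_source_id and its params vary per source, and the param keys shown here are illustrative assumptions rather than a checked parameter list:

```python
# Create a scheduled transfer. The data_source_id and params depend on
# the source; the param keys below are illustrative assumptions, so
# check the documented parameters for your data source.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",          # hypothetical dataset
    display_name="Nightly GCS load",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://my-bucket/*.csv",    # assumed key
        "destination_table_name_template": "my_table",   # assumed key
        "file_format": "CSV",                            # assumed key
    },
    schedule="every 24 hours",
)

created = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {created.name}")
```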