Schema and data model Google Professional Data Engineer GCP
- Can contain one or more tables
- Tables look like relational database tables
- Table structured with rows, columns, and values
- table have primary keys.
- Data is strongly typed
- must define a schema for each database and specify the data types of each column.
- Allowable data types include scalar and array types
- Avoid hot spotting like bigtable
- Use nested tables with primary keys called interleaving for faster access.
- Do not use sequential numbers, timestamps.
- If needed, store timestamps in descending order.
Parent-child table relationships
- Two ways to define parent-child relationships – table interleaving and foreign keys.
- Table interleaving is good if child table’s primary key includes the parent table’s primary key columns.
- Foreign keys
- are not limited to primary key columns
- tables can have multiple foreign key relationships
Primary keys
- Every table must have a primary key
- primary key can be composed of zero or more columns of that table.
- For child, the primary key column(s) of the parent table must be the prefix of the primary key of the child table.
- Spanner stores rows in sorted order by primary key values
- child rows also inserted between parent rows that share the same primary key prefix.
- Spanner can physically co-locate rows of related tables.
.
Choosing a primary key
- primary key uniquely identifies each row in a table.
- Options for primary key
- Hash the key and store it in a column.
- Use a Universally Unique Identifier (UUID).
- Bit-reverse sequential values.
Database splits
- can define hierarchies of parent-child relationships between tables up to seven layers deep
- Spanner divides data into chunks called “splits”
- individual splits can move independently from each other and assigned to different servers
- A split is a range of rows in a top-level (in other words, non-interleaved) table, where the rows are ordered by primary key.
- The start and end keys of this range are called “split boundaries”.
- Spanner automatically adds and removes split boundaries
Key columns
- The keys of a table can’t change
- can’t add a key column to an existing table
- cannot remove a key column from an existing table.
Storing NULLs
- Primary key columns can be defined to store NULLs.
- to store NULLs in a primary key column, omit the NOT NULL clause
Disallowed types
These cannot be of type ARRAY:
- A table’s key columns.
- An index’s key columns.
Replication
- Spanner automatically gets replication at the byte level
- Spanner writes database mutations to files in this filesystem
- Spanner creates replicas of each database split
- 3 types of replicas: read-write replicas, read-only replicas, and witness replicas.
- Single-region instances use only read-write replicas,
- multi-region instance configurations use a combination of all three types
For multi-region
Replica type | Can vote | Can become leader | Can serve reads |
Read-write | yes | yes | yes |
Read-only | no | no | yes |
Witness | yes | no | no |
Schema Design Best Practices
- Design a schema that prevents hotspots and other performance issues.
- Place compute resources for write-heavy workloads within or close to the default leader region for optimal write latency
- Use staleness of at least 15 seconds for optimal read performance outside of the default leader region
- place critical compute resources in at least two regions to avoid single-region dependency
- keep high priority total CPU utilization under 45% in each region.
Google Professional Data Engineer (GCP) Free Practice TestTake a Quiz