Schema and data model Google Professional Data Engineer GCP

  1. Home
  2. Schema and data model Google Professional Data Engineer GCP
  • Can contain one or more tables
  • Tables look like relational database tables
  • Table structured with rows, columns, and values
  • table have primary keys.
  • Data is strongly typed
  • must define a schema for each database and specify the data types of each column.
  • Allowable data types include scalar and array types
  • Avoid hot spotting like bigtable
  • Use nested tables with primary keys called interleaving for faster access.
  • Do not use sequential numbers, timestamps.
  • If needed, store timestamps in descending order.

 

Parent-child table relationships

  • Two ways to define parent-child relationships – table interleaving and foreign keys.
  • Table interleaving is good if child table’s primary key includes the parent table’s primary key columns.
  • Foreign keys
    • are not limited to primary key columns
    • tables can have multiple foreign key relationships

 

Primary keys

  • Every table must have a primary key
  • primary key can be composed of zero or more columns of that table.
  • For child, the primary key column(s) of the parent table must be the prefix of the primary key of the child table.
  • Spanner stores rows in sorted order by primary key values
  • child rows also inserted between parent rows that share the same primary key prefix.
  • Spanner can physically co-locate rows of related tables.

.

Choosing a primary key

  • primary key uniquely identifies each row in a table.
  • Options for primary key
    • Hash the key and store it in a column.
    • Use a Universally Unique Identifier (UUID).
    • Bit-reverse sequential values.

 

Database splits

  • can define hierarchies of parent-child relationships between tables up to seven layers deep
  • Spanner divides data into chunks called “splits”
  • individual splits can move independently from each other and assigned to different servers
  • A split is a range of rows in a top-level (in other words, non-interleaved) table, where the rows are ordered by primary key.
  • The start and end keys of this range are called “split boundaries”.
  • Spanner automatically adds and removes split boundaries

 

Key columns

  • The keys of a table can’t change
  • can’t add a key column to an existing table
  • cannot remove a key column from an existing table.

 

Storing NULLs

  • Primary key columns can be defined to store NULLs.
  • to store NULLs in a primary key column, omit the NOT NULL clause

 

Disallowed types

These cannot be of type ARRAY:

  • A table’s key columns.
  • An index’s key columns.

Replication

  • Spanner automatically gets replication at the byte level
  • Spanner writes database mutations to files in this filesystem
  • Spanner creates replicas of each database split
  • 3 types of replicas: read-write replicas, read-only replicas, and witness replicas.
  • Single-region instances use only read-write replicas,
  • multi-region instance configurations use a combination of all three types

For multi-region

Replica type Can vote Can become leader Can serve reads
Read-write yes yes yes
Read-only no no yes
Witness yes no no

 

 

 

Schema Design Best Practices

  • Design a schema that prevents hotspots and other performance issues.
  • Place compute resources for write-heavy workloads within or close to the default leader region for optimal write latency
  • Use staleness of at least 15 seconds for optimal read performance outside of the default leader region
  • place critical compute resources in at least two regions to avoid single-region dependency
  • keep high priority total CPU utilization under 45% in each region.
Menu