Create an ETL Process with Google BigQuery and Google Data Studio

This gist is a hybrid between a 'go-to' cheat sheet and a tutorial for setting up a new Data Science project.

Its purpose is to show how to create a basic Google BigQuery data source and use Google Data Studio to build a dashboard.

These tools are part of the Google Cloud Platform suite.

System Settings

Settings at the time of writing this gist (20th of March 2021).

Microsoft Windows Operating System


Edition: Windows 10 Home
Version: 1909
OS build: 18363.1256
System type: 64-bit operating system, x64-based processor

Microsoft Visual Studio Code


Version: 1.52.1 (user setup)
Date: 2020-12-16T16:34:46.910Z
Electron: 9.3.5
Chrome: 83.0.4103.122
Node.js: 12.14.1
V8: 8.3.110.13-electron.0
OS: Windows_NT x64 10.0.18363

Google Cloud Platform

Main Console: https://console.cloud.google.com

Data Studio: https://datastudio.google.com/u/0/navigation/reporting

BigQuery: https://console.cloud.google.com/bigquery?project=imdb-project-307217

Start a New Project

Go to Google Cloud Platform main console, e.g.
https://console.cloud.google.com/home/dashboard?project=project-name
NOTE: in the URL above, project-name represents the name of a project, e.g. imdb-project-307217:
https://console.cloud.google.com/home/dashboard?project=imdb-project-307217
Click the project selector button at the top-left of the window; it contains the list of all the projects.
Click the New project button.
Enter the project name and click the Create button.
The new project name is IMDB-project and the project ID is imdb-project-307217. The project number is 482553292490.
NOTE 1: alternatively, click the Create resource button on the right-hand side of the window and select New project from the drop-down menu.
NOTE 2: an underscore (_) CANNOT be used in the project name; use a dash (-) instead.
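
The naming rule and the URL pattern above can be checked with a short Python sketch. It assumes the documented rule for project IDs (6 to 30 characters; lowercase letters, digits and dashes only; starting with a letter and not ending with a dash), which is stricter than the rule for display names. The helper names `is_valid_project_id` and `console_url` are hypothetical and only serve the illustration.

```python
import re
from urllib.parse import urlencode

# Assumed project ID rule: 6-30 characters, lowercase letters, digits and
# dashes only, starting with a letter and not ending with a dash.
PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if project_id matches the assumed rule above."""
    return PROJECT_ID_RE.fullmatch(project_id) is not None


def console_url(project_id: str, page: str = "home/dashboard") -> str:
    """Build a console URL following the pattern used in this gist."""
    if not is_valid_project_id(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    return f"https://console.cloud.google.com/{page}?" + urlencode({"project": project_id})


print(is_valid_project_id("imdb-project-307217"))  # True
print(is_valid_project_id("imdb_project"))         # False: underscore not allowed
print(console_url("imdb-project-307217", "bigquery"))
```

The same helper builds the BigQuery console URL used in the next sections by passing `"bigquery"` as the page.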

Create a Dataset

Go to BigQuery console, e.g.
https://console.cloud.google.com/bigquery?project=imdb-project-307217
Select the project name imdb-project-307217 from the side panel.
Click the Create Dataset button. The new dataset is named imdb_metadata.

Create a Table using BigQuery

Select the dataset name imdb_metadata from the side panel.
Click the + button to create a new table. The new table is named imdb_movie_metadata, and is empty until a source file is imported.
In the Source section:
- choose how to create a table from the Create a table from drop-down menu.
- select the Import option.
- select the appropriate file (e.g. IMDB_movie_metadata.csv).
NOTE: do not forget to select the file format in the File format drop-down menu.
In the Destination section:
- select the Project name (e.g. IMDB-project) from the drop-down menu.
- select the Dataset name (e.g. imdb_metadata) from the drop-down menu.
- choose a Table name (e.g. imdb_movie_metadata).
In the Schema section:
- Tick the Automatic detection option.
  NOTE: if the selected .csv file uses a delimiter other than a comma (,), e.g. a semicolon (;), go down to the Advanced Options section:
  - select Customised from the Field delimiter drop-down menu.
  - enter ; in the Customised Field Delimiter field.
- Click Create table.
Several columns are displayed in the Schema tab:
- Field name.
- Type.
- Mode.
- Description.
The Details tab displays information about the table such as its size, number of rows, etc.
The Preview tab displays the first 100 rows of the dataset.
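
Before uploading, the delimiter mentioned in the Schema note can be guessed locally. The following is a minimal sketch using Python's standard `csv.Sniffer`; the sample content is hypothetical and the helper name `detect_delimiter` is not part of any Google tool.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the field delimiter from a text sample of a CSV file.

    Raises csv.Error if no candidate delimiter fits the sample.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# Hypothetical sample; for a real file, read the first few kilobytes instead,
# e.g. open("IMDB_movie_metadata.csv", newline="").read(4096).
sample = "movie_title;title_year;duration\nMovie A;2001;95\nMovie B;1955;120\n"

delimiter = detect_delimiter(sample)
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))

print(repr(delimiter))  # ';'
print(rows[0])          # ['movie_title', 'title_year', 'duration']
```

If the detected delimiter is not a comma, it is the value to enter in the Customised Field Delimiter field.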

The table ID is imdb-project-307217:imdb_metadata.imdb_movie_metadata. This is CRUCIAL as the table ID is used to write the SQL queries.
IMPORTANT: in SQL queries, the : must be replaced by a . (see example below).
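
The conversion from the table ID to the form used in SQL can be sketched in Python as follows. The function name `to_sql_table_ref` is hypothetical; wrapping the result in backticks matches the queries shown in this gist.

```python
def to_sql_table_ref(table_id: str) -> str:
    """Convert 'project:dataset.table' into a backtick-quoted 'project.dataset.table'."""
    project, sep, rest = table_id.partition(":")
    if not sep or rest.count(".") != 1:
        raise ValueError(f"expected 'project:dataset.table', got {table_id!r}")
    dataset, table = rest.split(".")
    return f"`{project}.{dataset}.{table}`"


table_ref = to_sql_table_ref("imdb-project-307217:imdb_metadata.imdb_movie_metadata")
query = f"SELECT * FROM {table_ref} LIMIT 1000"
print(query)
# SELECT * FROM `imdb-project-307217.imdb_metadata.imdb_movie_metadata` LIMIT 1000
```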

Query the Dataset

From the Schema display, click the Run a query on the table button to create a query template that looks as follows (the column list after SELECT is left blank and must be filled in):


SELECT  FROM `imdb-project-307217.imdb_metadata.imdb_movie_metadata` LIMIT 1000

NOTE: the final ; to close a query is NOT needed.

In the (new) query tab, add the variables to include in the query, e.g. just use * to select ALL variables, and click Run:


SELECT * FROM `imdb-project-307217.imdb_metadata.imdb_movie_metadata` LIMIT 1000

Click the + Compose New Query button to the right-hand side of the window.

Write a new SQL query. For example:


SELECT
  movie_title,
  title_year,
  duration,
  country,
  budget,
  color
FROM
  `imdb-project-307217.imdb_metadata.imdb_movie_metadata`
WHERE
  color != "Color"
LIMIT
  100;

NOTE: click the More button and select Format query to automatically format the query according to good practices.
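
One detail of the filter in the query above is worth checking: a comparison with != does not return rows where the column is missing (NULL). The sketch below reproduces this locally with Python's built-in sqlite3 module, used here only as a stand-in for BigQuery; the comparison rule is standard SQL and applies to BigQuery as well. The three rows are hypothetical and the table is reduced to three columns.

```python
import sqlite3

# Hypothetical rows for illustration only; None stands for a missing (NULL) colour.
rows = [
    ("Movie A", 2001, "Color"),
    ("Movie B", 1955, "Black and White"),
    ("Movie C", 1948, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE imdb_movie_metadata (movie_title TEXT, title_year INTEGER, color TEXT)"
)
conn.executemany("INSERT INTO imdb_movie_metadata VALUES (?, ?, ?)", rows)

# Same filter as the query above: the row with a NULL colour is NOT returned,
# because NULL != 'Color' evaluates to NULL rather than TRUE.
not_color = conn.execute(
    "SELECT movie_title FROM imdb_movie_metadata WHERE color != 'Color' LIMIT 100"
).fetchall()

# An explicit IS NULL test keeps the rows with a missing colour as well.
not_color_or_missing = conn.execute(
    "SELECT movie_title FROM imdb_movie_metadata "
    "WHERE color != 'Color' OR color IS NULL "
    "ORDER BY movie_title LIMIT 100"
).fetchall()
conn.close()

print(not_color)             # [('Movie B',)]
print(not_color_or_missing)  # [('Movie B',), ('Movie C',)]
```

If the color column of the real table contains missing values, adding `OR color IS NULL` to the WHERE clause is one way to keep them in the result.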

Query Output

After running the query, several options are available.

At the top of the "query tab":
- Save the query:
  - Click Save the query from the drop-down menu.
  - Give it a meaningful name (e.g. movie_budget_query).
- Save the query as a view:
  - Click Save the view from the drop-down menu.
  - Choose the Project name and Dataset ID.
  - Give it a meaningful Table name (e.g. movie_budget_query) and click Save.
- Schedule a query (e.g. monthly, weekly, etc.).
- Access Query Settings to customise how the query runs.
At the bottom of the "query tab":
- Save output:
  - Click Save Results
  - Choose the format and location, e.g. a .csv or JSON file (on Google Drive or as a local file), a BigQuery table, etc.
    For a BigQuery table, fill in:
    - the project name from the drop-down menu.
    - the dataset name from the drop-down menu.
    - the table name (e.g. imdb_black_white_movies).
  - Click Save.
  NOTE: the new table will be saved within the imdb_metadata dataset.
- Explore output data:
  - Click Explore data.
  - Select Explore data with Data Studio. This opens a new window to Google Data Studio Explorer, e.g.
    https://datastudio.google.com/u/0/explorer
  - Rename the new file (e.g. IMDB_query - 10/03/2021 22:54).
  - Perform basic data analyses, e.g. create tables, histograms, etc.
    NOTE: it is not possible to build a fully-fledged "dashboard" from within the Data Studio Explorer. Only one analysis can be performed at a time, e.g. a table or a graph.
  - Filters can be added to the charts.
  - Click Save once done.
  NOTE 1: to find the new "Explorer", one must go to Data Studio main console and to the Explorer tab (see next section).
  NOTE 2: the data table can be downloaded by clicking the options of the graph and choosing Download as CSV format.
  Then, click the Share button and choose one of two options:
  - Create a new report and share. This will paste the chart into a new report under Google Data Studio to build a new dashboard (see next section). Name the report and share/save it.
    NOTE: to find the new report, one must go to Data Studio main console and to the Reports tab (see next section).
  - Copy to existing report and share. WARNING: this will overwrite the content of the "Explorer"!

Back to the "query output table tab" (i.e. imdb_black_white_movies):

Click the Export button to the right-hand side of the window.
Choose an option:
- Explore with Data Studio.
  This is the same as above.
- Export to GCS (Google Cloud Storage).
- Analyse with DLP (Data Loss Prevention).

Prepare a Dashboard

Go to Google Data Studio page.
There are 3 tabs in "Recent Elements":
- Reports: list of all existing reports.
- Data Sources: list of all data sources available.
- Explorer: list of all explorers produced.
Click the Empty Report (large) icon to create a new report.
Give a name to the report (to the top-left side of the window), e.g. IMDB-project.
In the Add data to report drop-down window, there are 2 tabs:
- Click Associate to data to select a new dataset to add to Google Cloud Platform.
- Click My data sources to select a dataset that is already on Google Cloud Platform.
With Associate to data, click the BigQuery icon to import to Google Data Studio. Follow the below steps:
- Select a project name from the list (e.g. IMDB-project).
- Select a dataset name (e.g. imdb_metadata).
- Select a table name (e.g. imdb_black_white_movies).
- Click the Add button.
It is possible to amend the variable names and types as well as to transform variables (e.g. convert a date, aggregate data, etc.):
- When a graph is selected and the Data and Style tabs are displayed to the right-hand side of the window, follow the steps below to manipulate the variable properties:
  - Click the "pencil" just below the Data Source tag.
  - Modify the variables displayed: rename, change type, etc.
  - Click OK.

Once the dashboard is built, go to File to download it as a .pdf file.

IMDB Dashboard

Also, click the following URL to access the Cloud IMDB Dashboard once permission has been granted.

Connect a Data Source

Click the Create button to the top-left side of the window.
Click Data Source.

DS Variable Setup

A list of (18) Google Connectors appears. After that, Partner Connectors are listed.

Give a name to the data source (e.g. bigquery-data-source).
NOTE: an underscore (_) CANNOT be used; use a dash (-) instead.
Choose a connector (e.g. BigQuery).

DS Data Source

Choose a Project, Dataset and Table.
Click Associate to the top-right side of the window.
A Data Studio window appears where it is possible to amend the variable names and types as well as to transform variables (e.g. convert a date, aggregate data, etc.). Calculated fields can also be added.

There are 2 main types of data processed in Data Studio:
- Metrics are highlighted in blue and are numerical data. They will be used for statistics.
- Dimensions are highlighted in green and are categorical data. They can be strings, dates, URLs, booleans, geographical positions, etc.
Click Create a new report to prepare a Dashboard.
Click Explore to open an Explorer window.

Explore Public Datasets

Alternatively, public datasets can be explored.

Click the Explorer menu, and click the + Add Data button. Then, select the Explore Public Datasets option. Choose the Stack Overflow dataset.

Stack Overflow Dataset

When View Dataset is clicked, the following happens:

A new BigQuery window opens.
A new project name appears: "bigquery-public-data".
Inside, there is a long list of datasets.
Select the Stack Overflow dataset.
The dataset ID is bigquery-public-data:stackoverflow.
Select the posts_questions table.

IMPORTANT: to avoid incurring any unwanted costs, do NOT run heavy queries on TBs or PBs of data. Stick to GBs for learning!
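
To put the warning above into numbers, here is a rough Python sketch. It assumes on-demand billing proportional to the bytes a query processes (the BigQuery editor shows this estimate before a query is run). The price per TiB used below is a placeholder, NOT an official figure, and any free monthly allowance is ignored; check the current pricing page before relying on the result. The function name `estimate_query_cost` is hypothetical.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte
GIB = 2 ** 30  # number of bytes in one gibibyte


def estimate_query_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate the cost of one query under a simple pay-per-byte model."""
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    return bytes_processed / TIB * price_per_tib


# Placeholder price (assumption for illustration only).
PRICE_PER_TIB = 5.0

for label, size in [("5 GiB", 5 * GIB), ("2 TiB", 2 * TIB), ("1 PiB", 1024 * TIB)]:
    print(f"{label}: {estimate_query_cost(size, PRICE_PER_TIB):.2f}")
```

With any realistic price, the gap between a GB-sized and a PB-sized query is a factor of about one million, which is the reason for staying with small tables while learning.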