How to combine data from two BigQuery public datasets using SQL

By 24 Sep 2020BigQuery, Blog, SQL
How to combine data from two BigQuery public datasets using SQL

So in this blog post I’ll show how you can combine data from two BigQuery public datasets using SQL and to visualize the data in Data Studio.

So let’s say we want to combine data from two different public datasets available in Google BiqQuery; the UN’s Sustainable Development Goals (SDG) Indicators and the World Bank’s World Development Indicators.

UN's Sustainable Development Goals (SDG) Indicators and the World Bank's World Development Indicators.

Both contain a wide variety of data about various countries and we might want to join the data together when we visualize it in Google Data Studio.

I’ll show you a method for joining them together in BigQuery using SQL (Structured Query Language). I find W3 Schools a helpful guide if you’d like to learn some SQL.

Step 1: Decide what data you want from each table

So let’s choose what data we want from each table. To keep things simple I’ll just want one measure from each of the data sets.

UN SDG = Annual growth rate of real GDP per capita (%)
World Bank WDI = Population

I also only want this data from 2016.

For the UN SDG dataset, this is the query I’d use to get only this relevant data;

SELECT geoareaname, timeperiod, value 
FROM `bigquery-public-data.un_sdg.indicators` as UN_SDG
WHERE seriesdescription = 'Annual growth rate of real GDP per capita (%)'  
AND timeperiod = '2016'

Below is a screenshot of what I’d get after using the query in BigQuery.

SELECT geoareaname, timeperiod, value 
FROM `bigquery-public-data.un_sdg.indicators` as UN_SDG
WHERE seriesdescription = 'Annual growth rate of real GDP per capita (%)'  
AND timeperiod = '2016'

For the World Bank WDI data set, I’d use the following SQL query.

SELECT country_name, year, value FROM `bigquery-public-data.world_bank_wdi.indicators_data` as WB_WDI
WHERE indicator_name = 'Population, total' 
AND year = 2016

After running the query I’d see the following data.

SELECT country_name, year, value FROM `bigquery-public-data.world_bank_wdi.indicators_data` as WB_WDI
WHERE indicator_name = 'Population, total' 
AND year = 2016

So we can confirm that these are the types of results we’d like to get from each of our data sets. Now we need to combine them together.

Step 2: Write a SQL query to join the data together

So we will create a SQL query that joins the UN country onto the World Bank country.

Here is the SQL query I wrote which worked.

SELECT UN_SDG.geoareaname, UN_SDG.timeperiod, UN_SDG.value as GDP_per_Capita_growth, WB_WDI.country_name, WB_WDI.year, WB_WDI.value as WB_Population 
FROM `bigquery-public-data.un_sdg.indicators` as UN_SDG
JOIN `bigquery-public-data.world_bank_wdi.indicators_data` as WB_WDI on WB_WDI.country_name = UN_SDG.geoareaname
WHERE UN_SDG.seriesdescription = 'Annual growth rate of real GDP per capita (%)'  
AND UN_SDG.timeperiod = '2016'
AND WB_WDI.indicator_name = 'Population, total'
AND WB_WDI.year = 2016 

Once you’ve run the query your results should be similar to the screenshot below.

a query to combine data from two BigQuery public datasets using SQL

The next step is to visualize the data in Google Data Studio.

Step 3: Connect your BigQuery data to Google Data Studio

click explore data to view and edit the data in google data studio

After you’ve clicked ‘Explore Data’ it will open a new tab in Data Studio where you can take a look and mess around with the data, as shown below.

exploring data from bigquery in google data studio

The next thing to do is to save your new data source. Click on Save in the top right hand corner of the screen.

Then click on ‘Share’ and you will be able to edit and interact with your report as you normally would any other Data Studio report.

You can also save your query in BigQuery. Simply click on ‘Save Results’ and give it a new Table name.

save bigquery results page

Step 4: Display your data in Google Data Studio

Now that our data is in Google Data Studio we can visualize it. Below is the finished Google Data Studio report that you can interact with.

Limitations and Issues with the Data

You’ll notice that a few countries are conspicuously absent from the data results. These include the US and UK.

a map in data studio without the USA

This is because in the UN Sustainable Development Goals data the US is labelled ‘United States of America’ while in the World Bank data it is labelled ‘United States’.

Similarly the UK in the UN data is ‘United Kingdom of Great Britain and Northern Ireland’ while in the World Bank data it is simply ‘The United Kingdom’.

united states does not match united states of america in google bigquery data sets

I haven’t figure out how to overcome these issues! If you know how please leave a comment!

I hope you found this blog post about combine data from two BigQuery public datasets using SQL useful and interesting!

My Other Blog Posts

You might enjoy these other blog posts.

Do NOT follow this link or you will be banned from the site!