How to combine data from two BigQuery public datasets using SQL
So in this blog post I’ll show how you can combine data from two BigQuery public datasets using SQL and to visualize the data in Data Studio.
So let’s say we want to combine data from two different public datasets available in Google BiqQuery; the UN’s Sustainable Development Goals (SDG) Indicators and the World Bank’s World Development Indicators.
Both contain a wide variety of data about various countries and we might want to join the data together when we visualize it in Google Data Studio.
I’ll show you a method for joining them together in BigQuery using SQL (Structured Query Language). I find W3 Schools a helpful guide if you’d like to learn some SQL.
Step 1: Decide what data you want from each table
So let’s choose what data we want from each table. To keep things simple I’ll just want one measure from each of the data sets.
UN SDG = Annual growth rate of real GDP per capita (%)
World Bank WDI = Population
I also only want this data from 2016.
For the UN SDG dataset, this is the query I’d use to get only this relevant data;
SELECT geoareaname, timeperiod, value
FROM `bigquery-public-data.un_sdg.indicators` as UN_SDG
WHERE seriesdescription = 'Annual growth rate of real GDP per capita (%)'
AND timeperiod = '2016'
Below is a screenshot of what I’d get after using the query in BigQuery.
For the World Bank WDI data set, I’d use the following SQL query.
SELECT country_name, year, value FROM `bigquery-public-data.world_bank_wdi.indicators_data` as WB_WDI
WHERE indicator_name = 'Population, total'
AND year = 2016
After running the query I’d see the following data.
So we can confirm that these are the types of results we’d like to get from each of our data sets. Now we need to combine them together.
Step 2: Write a SQL query to join the data together
So we will create a SQL query that joins the UN country onto the World Bank country.
Here is the SQL query I wrote which worked.
SELECT UN_SDG.geoareaname, UN_SDG.timeperiod, UN_SDG.value as GDP_per_Capita_growth, WB_WDI.country_name, WB_WDI.year, WB_WDI.value as WB_Population
FROM `bigquery-public-data.un_sdg.indicators` as UN_SDG
JOIN `bigquery-public-data.world_bank_wdi.indicators_data` as WB_WDI on WB_WDI.country_name = UN_SDG.geoareaname
WHERE UN_SDG.seriesdescription = 'Annual growth rate of real GDP per capita (%)'
AND UN_SDG.timeperiod = '2016'
AND WB_WDI.indicator_name = 'Population, total'
AND WB_WDI.year = 2016
Once you’ve run the query your results should be similar to the screenshot below.
The next step is to visualize the data in Google Data Studio.
Step 3: Connect your BigQuery data to Google Data Studio
After you’ve clicked ‘Explore Data’ it will open a new tab in Data Studio where you can take a look and mess around with the data, as shown below.
The next thing to do is to save your new data source. Click on Save in the top right hand corner of the screen.
Then click on ‘Share’ and you will be able to edit and interact with your report as you normally would any other Data Studio report.
You can also save your query in BigQuery. Simply click on ‘Save Results’ and give it a new Table name.
Step 4: Display your data in Google Data Studio
Now that our data is in Google Data Studio we can visualize it. Below is the finished Google Data Studio report that you can interact with.
Limitations and Issues with the Data
You’ll notice that a few countries are conspicuously absent from the data results. These include the US and UK.
This is because in the UN Sustainable Development Goals data the US is labelled ‘United States of America’ while in the World Bank data it is labelled ‘United States’.
Similarly the UK in the UN data is ‘United Kingdom of Great Britain and Northern Ireland’ while in the World Bank data it is simply ‘The United Kingdom’.
I haven’t figure out how to overcome these issues! If you know how please leave a comment!
I hope you found this blog post about combine data from two BigQuery public datasets using SQL useful and interesting!
My Other Blog Posts
You might enjoy these other blog posts.
Hey thanks for the tips.
I think you can use a CASE Statement to create a new column that would uniform country name
CASE
WHEN table1.country = “United Kingdom of Great Britain and Northern Ireland” THEN “The United Kingdom”
WHEN table1.country = “United States of America” THEN “United States”
ELSE table1.country
END