One of my favourite things is working on data warehouses using BigQuery; in fact, BigQuery provides a wealth of tools to automate and process data to build a robust ETL.
BigQuery has some super powerful features which allow us to build, manipulate and even run machine learning algorithms within the interface. However, we're all human, and things can go wrong, so what do you do when that happens?
Imagine one morning, you're working in the BigQuery interface and want to aggregate your table to check the volumes, but you forget to clear down the options before running.
COUNT(DISTINCT(user_psuedo_id) ) AS distinct count
Just as the code finishes running, you realise you've lost months of daily updates after accidentally overwriting the table and even worse, any reporting in dashboard tools like Data Studio or Tableau using this table will now fail.
Before you look to completely re-run all your historical data (which could cost a lot of your time and increase your BigQuery spend), Google Cloud Platform has some great tools that allow you to recover and revert your datasets.
The world has ended; what can, or can't you do?
- If you have deleted a table or have an error on your table, you have seven days to recover - effectively you can time travel.
- Remember, at present, you cannot restore a deleted dataset.
The syntax fix
The syntax itself for reverting the tables is relatively simple and is not expensive to run. The interval can be seconds/hours/days; however, it cannot be over seven days due to the reversion time limit.
You must remember when you are reverting if you have added correct data in the period you are reverting to, you will lose this, so it’s crucial to check when the error occurred if possible.
You can use Data Manipulation Language (DML)* or the output functionality to overwrite the table you are reverting to. You will simply see the data in a temp table without a destination.
Here are the get-out-of-jail cards
Here are step-by-step instructions from Google Link.
CREATE OR REPLACE TABLE
testing.out_table FOR system_time AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 75 second)
Here is a link to the script on GitHub.
Then voilà, check your table has reverted, and if all looks well, have a tea and relax in the knowledge the reports feeding on this table will be back up and running when they next process (phew).
Whilst we should always be super careful when amending or overwriting tables if something does go wrong, remember to look at this syntax before you re-run the entire history.
* A quick refresher on DML
DML stands for Data Manipulation Language. DML statement is used to insert, update or delete the records on tables and is super powerful when creating tables. With DML and DDL, you can both effectively script and manage your tables and code effectively.