In this post, we would like to walk through the process of how Lore IO imports, transforms, and aggregates data for the COVID-19 Case and Mobility dashboard. We will begin by importing local CSV files into Lore IO. From there, we will transform common identifiers into the same format to prepare the merging of data and, finally, build out the final output table as the input source for the dashboard.
We have CSV data downloaded from Google Community Mobility Reports (https://www.google.com/covid19/mobility/) and COVID-19 Data (https://data.ca.gov/dataset/covid-19-cases, https://data.ca.gov/dataset/covid-19-hospital-data) to help us glean better insights into how COVID-19 impacts our daily life. The community mobility data shows the change in visits to different places and establishments where we expect to see differences before and after stay-at-home orders went into effect. With COVID-19 Hospital and Case data, we would like to know if there is any relationship between mobility patterns and the transmission of COVID-19. Our process of taking CSV raw data files, unifying them, and building dashboards from the data is structured as follows:
Data import is usually very time-consuming and frustrating, especially when you need to ingest a lot of data from different sources. However, this process is important because it’s not only the first, but also the essential step for all data preparation platforms. Getting your data into Lore IO is simple. We support importing data from a local file, folder or major cloud storage. It only takes a few clicks and your data is clean and organized from the start.
In our case, we would like to create three Source Tables based on our three local files. Notice the Operation is “Create New Table” by default and Source Table is populated with the file name. Lore IO recommends those settings for your convenience. It can save a lot of time especially when you are importing many files in the similar way. Of course you can manually adjust those settings if you want, and you can customize your import configuration from Advanced Options.
The advanced settings are self-explained, you can also hover over the info icons to view a short description of its behavior. Our CSV files are very standard so we keep all settings by default. Click Import File(s) and wait a few seconds for the progress bar to complete. Once finished, a success message will show our import summary and guide us to explore our data.
Data standardization is about making sure that data is internally consistent, each data type has the same content and format. It’s necessary if our data is from various sources and may not in the same way.
Transform dates to one format
When we check the date Column from covid19cases Table, we find sample values are in timestamp format: “2020-05-27T00:00:00”. However, the date Column in Table Global_Mobility_Report has value in date format: “2020-05-05”. We want the date information in just “YYYY-MM-DD” format. One solution is extracting the first 10 characters from timestamp format, eg: “2020-08-31” from “2020-08-31T00:00:00”. In the date Column definition page, we implement this with the “substr” function. Now the date Columns have the same format of dates.
Remove suffix from value
sub_region_2 Column from Global_Mobility_Report Table contains County information with suffix “County”: “Santa Clara County”. While county Column from covid19cases Table has values without the string “County”: “Santa Clara”. In order to let the system know “Santa Clara County” and “Santa Clara” are the same value, we would need to remove “County” substring from all values in sub_region_2 Column. Pick the “rtrim” function with input “ County”. Now the sub_region_2 Column and county Column would have the same format of county information.
While the different data sources share a common identifying variable, we can combine them into a single and organized output. If these information do not match perfectly - in our case, “Santa Clara County” in one source and “Santa Clara” in another, and “2020-05-20T00:00:00” in one source and “2020-05-20” in another, the system will not be able to identify that these various refer to the same one. So in the previous section we transformed date and county information into the same format and now we are ready to merge our data.
Create Aggregation Table
To have a Table that contains all information from three Source Tables, we can create an Aggregation Table in Lore IO. This Aggregation Table maps to three Source Tables covid19cases, covid19hospitals and Global_mobility_Report aggregated by the Columns, County Name and Date. We also put conditions to filter our data only in those ten counties.
Implement Key Columns
Once the table is created, we need to define our Key Column first. Key Column contains the common information across different sources. For the Column, County Name, we map it to three Columns that contain county name data from each Source Table. Similar to Column Date, we map it to three Columns that contain date information from each Source Table.
Bulk create columns
Besides aggregation keys we also create Columns to contain data for mobility changes and covid-19 cases. We can bulk create those 14 columns by entering names of Columns. Solid green circle means the Columns are defined like County Name and Date; while green circle means the Columns are not implemented.
The remaining columns don’t require any transformations. We use the “Any” function to pick any non-Null value in the given column within the group.
Data from three Source Tables has been merged to one Aggregation Table where there is only one record for mobility data and COVID-19 data per county per day.
Verify and Export Data
Verify data involves checking accuracy and inconsistencies of data. It helps to determine whether data was accurately and completely translated.
Preview Columns in the Explore page
Click multiple columns and pick Explore Columns in the action dropdown menu, we will see data records in tabular form. Now, each row represents every county, every day’s COVID-19 cases, COVID-19 hospitals data and mobility data.
Once the data looks clean and organized, we would export records as CSV files and import to Tableau to build our dashboard. We create a dashboard Bay Area: Covid-19 Cases and Mobility to display the Mobility Report and COVID-19 cases.
What happens after you import data, create tables, do the cleanup and then your data source gets updated? As in today’s fast-paced world, we expect our information is being updated in real-time. It's a very common use case and Lore IO already has solutions for this. If your source table is connected with an external source table or cloud storage, all data will automatically update whenever your source gets updated. If your source table is created by uploading local files like our case, you can import updated CSV files to those source tables with the “Overwrite file Data” Option, replacing the new records with old records to the source tables. Any data cleaning or transformation on the old data are automatically applied to the new data.
Lore IO helps us clean data in direct and visual ways. We transform our raw data from three different sources into a unified output table.
We start with importing local CSV files into Lore IO. This process is quick and straightforward. Three Source Tables are created based on our uploaded files: covid19cases Source Table contains the case data for California COVID-19 response, covid19hospitals Source Table contains hospital data for the COVID-19 response and Global_Mobility_Report Source Table contains Google Community Mobility Reports. Then we transform common identifiers County and Date Columns into the same format to prepare merging data. Next, we create covid19_bay_area_trends Aggregation Table to combine data from covid19hospitals, covid19cases and Global_Mobility_Report Source Tables aggregated by Key Columns County and Date.
Once we build the final output table, we can go to the Explore page in Lore IO to check if our data is accurate and complete. After confirming the result is as expected, we export the final output table into a CSV file. This is the input source for our dashboard in Tableau.
If we have an extra data source and would like to add into the Tableau dashboard, let’s say we want to know is there any relationship between COVID-19 cases and weather, we can repeat the above steps to create a new Source Table and map it to covid19_bay_area_trends Aggregation Table. When we explore the data again, we will notice each record contains COVID-19 data, COVID-19 hospital data, mobility data, and weather information by county per day.
Table: COVID-19 Cases
Table: COVID-19 Hospital Data
Table: Google COVID-19 Community Mobility Reports
Changes for each day are compared to a baseline value for that day of the week:
- The baseline is the median value, for the corresponding day of the week, during the 5-week period Jan 3–Feb 6, 2020.
- The datasets show trends over several months with the most recent data representing approximately 2-3 days ago—this is how long it takes to produce the datasets.
Interact with the COVID-19 Case and Mobility Data dashboard.