After getting data into Power BI from different sources, it is better to transform the dataset before loading it, since this improves query performance and saves Power BI's internal memory.
Here are five golden rules for cleaning data efficiently in Power BI:
Data Cleaning Flowchart
Dealing with Column headers
The first step of any cleaning process is to identify the columns that the data analysis actually requires.
Five steps to clean Column Headers
Promote Headers and Remove Top Rows
Not every data source comes with the right column names as headers. The actual headers may sit in the first row of the dataset, or sometimes only after several empty rows. Hence, the first step is to make the right row the column headers. This can be done using the Use First Row as Headers option.
The Use First Row as Headers option promotes the first row of the table to the column headers.
When there are empty rows in the dataset before the header row, remove them first so that the header row becomes the first row, and then promote it to column headers. These empty rows can be removed with the Remove Top Rows option, accessed by clicking Remove Rows on the top pane of the application.
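In the Power Query Editor, these two UI steps are recorded as M formulas behind the scenes. A minimal sketch of what the generated steps look like (the sample table and the row count to skip are illustrative):

```m
let
    // Sample raw data: one empty row sits above the real header row
    Source = #table({"Column1", "Column2"},
        {{null, null}, {"Country", "Population"}, {"India", 1400}}),
    // Remove Top Rows: drop the leading empty row(s)
    RemovedTopRows = Table.Skip(Source, 1),
    // Use First Row as Headers
    PromotedHeaders = Table.PromoteHeaders(RemovedTopRows, [PromoteAllScalars = true])
in
    PromotedHeaders
```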
A dataset can contain non-user-friendly column names. For instance, let's consider that our dataset is sourced from a fitness band. The column names might contain abbreviations or spelling errors, or there might not be proper spacing between the words of a column name. A Power BI analyst should keep in mind that the end-user of the reports or dashboards may not come from a technical background, so it's better if the column names do not contain any technical jargon.
# Correct spelling errors
Saels --> Sales
# Add spaces between words
Targetcountry --> Target Country
Target_Country --> Target Country
# Expand abbreviations
Dept --> Department
# Remove technical jargon
TimeStamp --> Date and Time, Purchase Date
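Renaming headers in the UI generates a single Table.RenameColumns step in M. A sketch with illustrative old/new name pairs:

```m
let
    Source = #table({"Saels", "Targetcountry", "Dept"}, {{100, "India", "HR"}}),
    // Rename non-user-friendly headers; each pair is {old name, new name}
    Renamed = Table.RenameColumns(Source, {
        {"Saels", "Sales"},
        {"Targetcountry", "Target Country"},
        {"Dept", "Department"}
    })
in
    Renamed
```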
There are two ways to connect to a data source in Power BI. One is by importing data into Power BI and the other is by establishing a live connection to the dataset (a database). In the case of importing data into Power BI, the data is temporarily copied and stored in the internal memory of the application. In the case of large data, the application might face lags or performance issues. In order to prevent this, it is highly recommended to clean and transform the dataset before loading it to the local memory of Power BI.
During transformation, the analyst should be well aware of the business requirements. We should be able to list the columns that are unnecessary for the business use case and remove them. This can be done in two ways:
Using Remove Columns option, we can select certain columns and remove them from the Power BI local dataset. Removing these columns won’t make any changes to the original data source.
Another way is to use the Remove Other Columns option, also accessed by clicking Remove Columns on the top pane of the application. It keeps only the columns you select and removes everything else from the dataset.
Tip: I would personally suggest going with the Remove Other Columns option. After loading the dataset into Power BI, we sometimes need to go back and re-add a column that was removed earlier. With Remove Other Columns, we can simply reopen the same step, tick the newly required columns, and load again. It is quite simple and time-efficient. With Remove Columns, however, we would have to delete a step from the transformation steps pane. If we removed 5 columns in one step and need only one of them back, it is not directly possible to restore just that column: we would have to delete the whole step and remove the unnecessary columns again.
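The difference is visible in the M the two options generate: one records the columns to drop, the other records the columns to keep. A sketch with illustrative column names:

```m
let
    Source = #table({"Country", "Sales", "Fax", "Notes"},
        {{"India", 100, "n/a", "x"}}),
    // Remove Columns: records the columns to drop
    Removed = Table.RemoveColumns(Source, {"Fax", "Notes"}),
    // Remove Other Columns: records the columns to keep,
    // which is easier to revisit and extend later
    Kept = Table.SelectColumns(Source, {"Country", "Sales"})
in
    Kept
```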
A dataset can contain some rows and columns that might not make sense to visualize. In some cases, you might feel that the columns bring more value to the analysis if they are represented as rows.
Let's consider the example above: here we have Country and each year as a column name. It would make more sense to create two new columns, Year and Population, holding each year and its corresponding value. This can be implemented with the Unpivot Columns option under the Transform tab. Select a column and choose Unpivot Other Columns to convert the other columns into rows based on the selected column.
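Unpivot Other Columns generates a Table.UnpivotOtherColumns step in M. A minimal sketch with illustrative data:

```m
let
    Source = #table({"Country", "2020", "2021"},
        {{"India", 1380, 1393}, {"USA", 331, 332}}),
    // Keep Country as-is; turn every other (year) column into rows,
    // naming the new attribute/value columns Year and Population
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Country"}, "Year", "Population")
in
    Unpivoted
```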
In some use cases, we are required to find an aggregated value over certain rows. In those scenarios, the Pivot Column option can be used to convert flat data (rows) into columns containing aggregated values (such as SUM, MIN, or MAX).
Let’s consider our previous Unpivoted data. Assume that we are required to find the total population of each Country. In order to achieve that, we would need to sum the Population column grouped by the country names. This can be achieved by Pivot Column option under the Transform tab.
Select the column by which the data should be grouped (in our case, it is the Country column). Then click on Pivot Column to get more options on how to pivot. We can select the column that is going to be pivoted and also the aggregation function preferred.
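Pivot Column generates a Table.Pivot step in M, with the chosen aggregation function as its last argument. A minimal sketch with illustrative data:

```m
let
    Source = #table({"Country", "Year", "Population"},
        {{"India", "2020", 1380}, {"India", "2021", 1393}}),
    // Pivot the Year values into columns, aggregating Population with SUM
    // wherever a Country/Year pair occurs more than once
    Pivoted = Table.Pivot(Source, List.Distinct(Source[Year]),
        "Year", "Population", List.Sum)
in
    Pivoted
```

Note that if all you need is a single total per country rather than one column per year, the Group By option (Table.Group in M) is often the simpler choice.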
Dealing with Row Values
Options to clean row values
Rename a query
Tables stored in a database or tables generated from flat files may not have helpful names relevant to the business use case. Hence, use Rename option to change the name of the tables to more meaningful ones.
In Power Query Editor, the left pane contains the list of tables. Right-click on the table to rename it.
Some values might be misspelled or abbreviated. End-users would find it difficult to decipher the values. So, it is a good practice to replace those values with meaningful expanded ones.
To replace values, select the column in which these values are present. Then use Replace Values option from the top pane, to find and replace the values.
Replace null values
Data can contain null values. An analyst should expect this case and always take measures to handle it.
In a numeric column, if the value stays null, it might cause an issue when applying aggregation functions to it. So, it’s better to change these null values of a numeric column with 0.
In a non-numeric column, the null value can be changed to a more meaningful representation of it. For example, null → Not Available, Not Applicable
Use Replace Values option from the top pane to find and replace the null values.
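Both replacements map to the same Table.ReplaceValue function in M. A sketch, with illustrative column names and replacement labels:

```m
let
    Source = #table({"Region", "Sales"},
        {{"North", 100}, {null, null}}),
    // null in a numeric column --> 0, so aggregations keep working
    FixedSales = Table.ReplaceValue(Source, null, 0,
        Replacer.ReplaceValue, {"Sales"}),
    // null in a text column --> a meaningful label
    FixedRegion = Table.ReplaceValue(FixedSales, null, "Not Available",
        Replacer.ReplaceValue, {"Region"})
in
    FixedRegion
```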
Let’s assume that our database table doesn’t have any Primary key or Unique Key constraint. Our dataset is prone to contain duplicate rows. These duplicate values when imported into Power BI can cause errors in calculations and lead to misleading visualization results.
To remove these duplicate rows use the option Remove Duplicates that is found when we right-click the table or a column.
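Remove Duplicates generates a Table.Distinct step in M. Right-clicking the table de-duplicates across all columns; right-clicking a column passes that column as the key. A sketch:

```m
let
    Source = #table({"OrderID", "Sales"},
        {{1, 100}, {1, 100}, {2, 200}}),
    // Remove rows that are duplicated across every column
    Deduplicated = Table.Distinct(Source),
    // Or de-duplicate on selected key column(s) only
    DeduplicatedByKey = Table.Distinct(Source, {"OrderID"})
in
    DeduplicatedByKey
```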
Choosing right column data types
Choosing Right Column Types
Assigning Right Data Type to Columns
While importing data from flat files, columns often end up assigned the wrong data types. If a column is not assigned its correct data type, errors occur while performing calculations or grouping by value, and they usually show up during calculations or while loading the data. To prevent this, each column should be assigned a correct data type, and the values of the column should be formatted with respect to the business logic.
To check and change the current data type of a column, right-click on the column header and hover over Change Type. Select the data type that best matches the available data. It's a good practice to choose Fixed Decimal Number for columns with monetary values.
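Change Type generates a Table.TransformColumnTypes step in M; Currency.Type is the Fixed Decimal Number type. A sketch with illustrative columns:

```m
let
    Source = #table({"Order Date", "Quantity", "Unit Price"},
        {{"2023-01-15", "2", "19.99"}}),
    // Assign each column its correct data type
    Typed = Table.TransformColumnTypes(Source, {
        {"Order Date", type date},
        {"Quantity", Int64.Type},
        {"Unit Price", Currency.Type}  // Fixed Decimal Number
    })
in
    Typed
```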
It's crucial to format the values of a column with respect to business logic. For instance, assume that a column holds a profit percentage. A value like 15.920144 is not easy to read. We can round it to two decimal places by right-clicking on the column and selecting Transform, then choosing Round to define the number of decimal places.
The case of value can also be changed (from uppercase to lowercase or lowercase to uppercase or Capitalize) according to the business logic. To format cases, trim or add prefix/suffix, select the column, and switch to Transform tab on the top pane. Then click on the Format option to perform the above-mentioned operations.
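Rounding and case formatting both map to Table.TransformColumns in M. A sketch with illustrative column names:

```m
let
    Source = #table({"Profit %", "Customer Name"},
        {{15.920144, "jane doe"}}),
    Transformed = Table.TransformColumns(Source, {
        // Round to two decimal places
        {"Profit %", each Number.Round(_, 2), type number},
        // Capitalize Each Word
        {"Customer Name", Text.Proper, type text}
    })
in
    Transformed
```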
In some cases, we might have a Full Name column, and our business use case requires the name to be split into First Name and Last Name. In that case, we can split the column using the Split Column option. A column can be split by delimiter, number of characters, positions, cases, and digits.
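Split Column by delimiter generates a Table.SplitColumn step in M. A sketch for the First Name/Last Name case:

```m
let
    Source = #table({"Full Name"}, {{"Jane Doe"}}),
    // Split on the space into two new columns
    Split = Table.SplitColumn(Source, "Full Name",
        Splitter.SplitTextByDelimiter(" "),
        {"First Name", "Last Name"})
in
    Split
```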
In several scenarios, we might need to combine tables for ease of querying or visualization. If we need to group certain data from several tables, it is beneficial to join them together. If we connect to a database, chances are that the data is normalized and split into many tables for storage purposes. But, in Power BI it becomes complex to query from so many tables. Hence, we can join them or denormalize them based on our use case. There are two ways in which tables can be combined.
Ways to Combine tables
Let’s consider that we work for a company that stores the sales data in a new spreadsheet every day. Assume that the management asks for a yearly Sales Analysis report. We have 365 files containing sales data for each day of the year. We can simply append all the data to a single table and then proceed with the analysis.
The Append Queries option on the top pane of the application appends the rows of one table to those of another. For the append to line up correctly, the two tables should have matching column names.
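Append Queries generates a Table.Combine step in M, which matches columns by name while stacking the rows. A sketch with two illustrative daily tables:

```m
let
    Day1 = #table({"Date", "Sales"}, {{"2023-01-01", 100}}),
    Day2 = #table({"Date", "Sales"}, {{"2023-01-02", 150}}),
    // Rows are stacked; columns are matched by name
    Appended = Table.Combine({Day1, Day2})
in
    Appended
```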
To learn more about the implementation of Append Queries, check out my article Combining Data Sources in Power BI.
Merge Query works like a JOIN operation in SQL. Tables are merged based on a common column and new columns get added from the secondary table to the main table. Like SQL, we have an option to choose how to merge the tables.
Left Outer join is used when all the rows from the first table and the matching rows from the second table need to be displayed.
Full Outer join is used when all rows from both tables need to be displayed.
Inner join is used when only the matching rows from both tables need to be displayed.
To learn more about the implementation of Merge Queries, check out my article Combining Data Sources in Power BI.
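Merge Queries generates a Table.NestedJoin step in M, with the join kind as its last argument; the joined table then needs to be expanded to bring in the new columns. A sketch with illustrative tables:

```m
let
    Orders = #table({"CustomerID", "Sales"}, {{1, 100}, {2, 150}}),
    Customers = #table({"CustomerID", "Customer Name"}, {{1, "Jane"}, {2, "Arun"}}),
    // Left Outer join on CustomerID, like a SQL LEFT JOIN
    Merged = Table.NestedJoin(Orders, {"CustomerID"},
        Customers, {"CustomerID"}, "Customer", JoinKind.LeftOuter),
    // Expand the nested table column to pull in fields from Customers
    Expanded = Table.ExpandTableColumn(Merged, "Customer", {"Customer Name"})
in
    Expanded
```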
It is important for an analyst to verify the correctness of the data against the business logic. Someone who handles the business regularly would be aware of anomalous values. In certain use cases, a column cannot have any duplicate values; in others, a certain number of distinct values is expected for a categorical column. Any discrepancy in the data might lead to wrong output and misleading visualizations. To verify the correctness of the data, Power BI provides a few built-in options.
To check the count of distinct and unique values in each column, the Column Distribution option can be used. Switch to the View tab on the top pane and check Column Distribution. A distribution graph appears on top of each column's rows, showing the number of distinct and unique values present in that column.
Sample Column Distribution graph
Column values might be mistyped, belong to another field, or be empty. Some values might have a different data type; for instance, if a numeric column contains a string value, it is considered an error. The percentage of Valid, Error, and Empty values can be computed for each column using the Column Quality option on the View tab.
Sample Column Quality
For a numeric column like Sales, it is crucial to know its statistics to verify correctness. Statistics such as Count, Error, Empty, Distinct, Unique, NaN, Zero, Min, Max, Average, Standard Deviation, Even, and Odd are available under the Column Profile option on the View tab. Apart from these, the value distribution graph is very useful for detecting anomalies and verifying the data before modeling.
Sample Column profile