The “Process” Phase: Cleaning
Now that we have the data that we want to work with prepared, the next step is to process all of it into a form and format that will be appropriate for our analysis. To start the “Process” phase, we will want to clean up and remove any data that is incomplete or invalid to our dataset.
We first want to examine the quality of the data in each column. Here in the file for January, we can see from the first few rows of data that there are some blank cells in the “end_station_name” and “end_station_id” columns. If we apply a filter to the “end_station_name” column, we will see that every value in the corresponding “end_lat” and “end_lng” columns is cut off after 2 decimal places. This suggests that they are faulty coordinates, and might explain why there is no name or id for these stations.
In an actual work environment, we would ask someone within the department in charge of data collection why these fields are blank and why are the coordinates cut short. They might tell us that there was some sort of error in the data collection for those rides, or that these rides did not end at a location that is cataloged in our list of station names and ids.
For the purposes of our analysis, we will get rid of every row that is blank in our filter. To do this, start by going through every row and applying a filter through the Home menu in Excel. Select your filter and deselect all of the values in it’s panel. Type in “blanks” in the search box so that only the blank cells will be shown.
Once your filter is in place, click the left corner of your rows panel in Excel select all of your rows. Use Ctrl+Click on the very top row of your dataset to make sure that your header row is deselected. With every other row selected, you can right click on the selection and click the “Delete Row” option.
Repeat this process with every column in each of our monthly datasets. (Note that starting in September we see coordinates with only two decimal places that actually have names and ids assigned to them. We will include those, even though such coordinates are typically not very precise to the location).
Once we’ve gone through every month’s dataset like this, we’ll know that every invalid row has been removed from our overall dataset. The next step in our Process phase is to get our spreadsheets into a uniform format that we will continue to use throughout the project.