The “Process” Phase: Creating a Format For Our Dataset
In the previous post, we deleted the rows in our dataset that contained incomplete data and couldn’t be used in our analysis. To continue the “Process” phase, we need to settle on the format that all of our spreadsheets will share as we move forward. This means deciding now which columns of data will be in our dataset when we move on to the next phase.
If we take another look at our edited data, we can see that certain data, such as the location of the stations, is listed in several different ways. Some of this data is redundant, so it makes sense to drop some of these columns. We will delete the “start_station_id” and “end_station_id” columns and identify the stations by their names instead.
Select the start_station_id column by clicking the “F” header above the top row. Then hold Ctrl and click the “H” header to add the end_station_id column to the selection. Right-click the selection and then click “Delete”.
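For anyone who would rather script this step than click through it, here is a minimal sketch in Python (not the spreadsheet method used in this post) that drops the same two columns from a CSV export. It uses only the standard library. The function name drop_columns, the sample row, and the other column names in the sample are made up for illustration; only “start_station_id” and “end_station_id” come from the steps above.

```python
import csv
import io

# The two columns named in the steps above; every other column is kept.
COLUMNS_TO_DROP = ("start_station_id", "end_station_id")


def drop_columns(source, target, columns=COLUMNS_TO_DROP):
    """Copy CSV rows from source to target, leaving out the given columns."""
    reader = csv.DictReader(source)
    kept = [name for name in reader.fieldnames if name not in columns]
    # extrasaction="ignore" lets us pass full rows; dropped keys are skipped.
    writer = csv.DictWriter(
        target, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Tiny made-up sample standing in for one monthly sheet (hypothetical data).
sample = io.StringIO(
    "ride_id,start_station_name,start_station_id,end_station_name,end_station_id\n"
    "A1,Station One,101,Station Two,202\n"
)
result = io.StringIO()
drop_columns(sample, result)
print(result.getvalue())
```

With real files, the same function could be given two file objects opened with open(..., newline=""), one for reading and one for writing, so the original sheet is left untouched.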
We now have the first phase of edits completed on this sheet, so we should repeat the same edits on the rest of the sheets for 2021. The next phase of edits will add data to our spreadsheet. This data will be based on what we already have in the sheets, and is meant to expand our knowledge of the data as well as increase our capabilities for analysis.
To complete our sheets, we will need to add three more columns to them. One column will show the duration of each ride. Another will show the geospatial distance traveled on each ride, from start point to end point, using the latitude and longitude coordinates. The third column will show the day of the week on which each ride took place. We can use Excel functions to get the duration. We will use the R console to get both the geospatial distance between the start point and end point and the day of the week on which each ride starts.
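To make the three calculations concrete, here is a rough sketch in Python. It is not the Excel and R workflow this series uses, just an illustration of the arithmetic behind each column. It assumes timestamps formatted like “2021-06-01 08:00:00” and uses the haversine formula for distance, which is an assumed choice and may not match what the R approach computes. All function names are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def ride_duration_minutes(started_at, ended_at):
    """Length of a ride in minutes, from its start and end timestamps."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(ended_at, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two coordinate pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def day_of_week(started_at):
    """Name of the weekday on which a ride starts."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    return DAY_NAMES[start.weekday()]  # weekday() counts Monday as 0


# Made-up ride used only to exercise the three functions.
print(ride_duration_minutes("2021-06-01 08:00:00", "2021-06-01 08:30:00"))
print(haversine_km(41.88, -87.63, 41.90, -87.65))
print(day_of_week("2021-06-01 08:00:00"))
```

Note that this distance is a straight line between the start and end coordinates, not the route the rider actually took, so a ride that returns to its starting station comes out as zero.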
There is a distinct and somewhat detailed process that goes into making each of these new columns. Each one will be covered in depth on its own page.