Finding Max and Mean in SQL
So far in the Analysis Phase, we’ve used SQL queries to create subsets of data that can be used to create visualizations of the data to help us understand the story that our dataset tells. We’ve also been able to collect smaller insights on our data, like how many member rides there have been in total. Before we move on to creating visualizations with our data in R and Tableau, we our going to extract some more insightful information in SQL. We are going to find the Max, Mean, and Mode values of certain columns within our dataset.
The Max value represents the largest number or amount found in a given column. It is found in SQL by using the MAX statement. The MAX statement belongs in the SELECT segment of our query. We nest the column name that we wish to find the Max value within our MAX statement. This would typically be followed with an AS statement to give a name to the resulting value when we run our query. Here is a sample of a SELECT statement that uses a MAX statement for one of it’s values:
SELECT MAX(ride_length) AS MaxRideLength
We could just complete the query as is with a FROM and a WHERE statement, but that would create just the one value in our results. What if we also want to see the Mean ride_length in the same query? We can do this easily by putting a comma at the end of the statement above and writing an AVG statement next to it.
A Mean value is the average number or amount taken from all of the values within a given column. We find the Mean value by running an AVG statement. AVG stands for the average value in a column. Our AVG statement will be run just like the MAX statement, placed after SELECT and with the column name nested inside of it. Here is an example of a SELECT statement that includes a MAX statement followed by a AVG statement:
SELECT MAX(ride_length) AS MaxRideLength , AVG(ride_length) AS MeanRideLength
Theoretically, we would be able to run this statement in a query with the appropriate FROM and WHERE statements to produce two values: the Maximum ride length and the Mean ride length.
However, things in SQL are not always so simple. Here is what happens when we try to run a query with the statement that is written above:
Instead of getting the values that we desired, an error message came up. Notice the line that I highlighted in yellow. It says “Operand data type varchar is invalid for avg operator.” This means that the datatype for our column is not compatible with the AVG operator.
Issues like this are a pretty common occurrence in SQL. If you’ll recall from one of our previous posts, we spent some time changing the datatypes of some of the columns in our dataset. Every column that was not changed remained a varchar datatype, which is a more technical name for a string.
Since the data in our ride_length column is structured in a “hh:mm:ss” format, the best datatype for it would be TIME(0). The (0) portion of our TIME(0) designates that the TIME format that we will be working with is “hh:mm:ss”.
When we try to alter the ride_length column into a TIME(0) datatype, we get this error message:
This is only one hurdle that we would have to get over when it comes to using AVG on our ride_length column. Even if we could change the datatype to TIME(0), this column would still not be compatible with AVG. We would want to turn this data into the float datatype in order to use AVG on ride_length. However, when we try to do this, we get the same error message:
It’s now clear that altering the table to suit our needs is not an option. We may have hit a snag here, but there is still a solution! If we can’t alter the column itself to get our AVG function to work, then we can write a query that tricks SQL into thinking that our column is the right datatype.
This will get a bit complex, but we’ve gotten this far without a hitch, so I know we can get over this hurdle easily enough. We’ll take our time going through the steps needed to get our AVG operator to work with our ride_length column.
There are two new functions that we will use to find the Mean of our ride_length column. These are the CAST() function and the TRY_CONVERT() function. They are both used to do essentially the same thing, and that is to convert data from one datatype to another.
TRY_CONVERT() is a variation of the CONVERT() function, which is essentially the same thing as CAST(). One big difference between CAST() and CONVERT() is that CONVERT() is used especially for SQL, and can handle specialized datatypes like smalldatetime. CAST() is a function found in several other languages, and it works by attempting to convert the data into a datatype that is less complex.
The “TRY” prefix in TRY_CONVERT indicates that the function will try to convert each cell of data in the column into the datatype we wish. If it is not able to convert that cell, it will be marked as “null” in the query.
Getting AVG to work on our ride_length column is essentially going to require two queries rolled into one. The main query will involve using the CAST() function several times in a series of nested functions. The second query will be nested within the main query and will act as it’s FROM statement, meaning it will act as the source of our query’s data.
Before we dive deeper into these queries, we’ll need to go over the syntax of the CAST() and the TRY_CONVERT() functions.
When using CAST(), we will nest the name of the column we wish to change, and follow that with an AS statement that has the name of the datatype we wish to use. Here is an example of a basic CAST() function:
CAST(ride_length AS TIME(0))
For TRY_CONVERT(), we will borrow from the base CONVERT() syntax. This means that we will simply nest the name of the datatype we would like to use, followed by a comma and the name of the column that we wish to change. Here is an example of a TRY_CONVERT function:
TRY_CONVERT(TIME(0), ride_length)
Before I can explain our main query, it’s best that we go over the secondary, nested query and why it is needed. All queries that pull data from a dataset in SQL need to have that dataset summoned by using a FROM statement. The dataset that we will be using is our main, all-year dataset, except this time it will be using the TRY_CONVERT function to turn our ride_length column into a datatype of TIME(0). Remember that TRY_CONVERT will make the effort convert each cell in the column, and will change any cell that it can’t convert into “null”.
In the FROM statement of our main query, we will use parenthesis to nest our secondary query. Within this secondary query, the SELECT statement will list the ride_length column in a TRY_CONVERT function. We will use a TRY_CONVERT function to convert ride_length into TIME(0), and will also use an AS statement to name our function as ride_length in the results. Here is what the SELECT statement we just made will look like.
SELECT TRY_CONVERT(TIME(0), ride_length) AS ride_length
The rest of the query follows the SELECT FROM WHERE format that we have used before. In the WHERE statement, we would write the words “is not null” in order to remove the null values which our TRY_CONVERT function could not convert properly. Here is what the complete secondary query looks like:
SELECT TRY_CONVERT(TIME(0), ride_length) AS ride_length
FROM dbo.[2021-all-year-tripdata-v2]
WHERE ride_length IS NOT NULL
This secondary query will be nested within the FROM statement of our main query. After nesting it in parentheses, it’s important to put a letter of our choosing right behind the parenthesis. This will act as a name for the secondary query that we just nested into our main FROM statement. SQL needs us to do this in order for this to work. We’ll just put an “a” after the parentheses and move on. The FROM statement of our main query should look like this:
FROM (
SELECT TRY_CONVERT(TIME(0), ride_length) AS ride_length
FROM dbo.[2021-all-year-tripdata-v2]
WHERE ride_length IS NOT NULL
) a ;
Now, the only thing left for our main query is the SELECT statement. To get the Mean ride_length using the AVG function, we will nest several CAST(0) functions within each other, getting SQL to switch datatypes for ride_length until it can use AVG, and then change it back again until it is in an acceptable datatype for SQL to use.
We will first cast ride_length as datetime. Then, we will nest that function into another CAST(), with the entire function acting as the data source (where ride_length would usually go), and use the float datatype. We can now nest all of this into an AVG function.
Our SELECT statement is only halfway done. We will know need to nest our AVG function and everything inside of it into another CAST() function. This CAST() function will change it all to the datetime datatype. Finally, we can use one last CAST() function to turn this all into TIME(0). With a name “MeanRideLength” added behind the function, our SELECT statement should look like this:
SELECT cast(cast(avg(cast(cast(ride_length as datetime) as float)) as datetime) as TIME(0)) MeanRideLength
We can add a simple MAX() function to our SELECT statement in order to find out the longest ride recorded in our dataset. We’ll nest the ride_length column our MAX() function and use an AS statement to name it “MaxRideLength”.
With all of our steps completed, we should have created a query which will show use the average ride length in our dataset, as well as the longest ride length. Here is what this query should look like:
Congratulations! We just got through the most complex SQL code that we will use in this project!
Be sure to save this file as “ALL-YEAR-2021-max-mean-mode”. In the next post, we will go over the much simpler process of finding the Mode of certain columns in our dataset. We will also list different variations of these queries that will give us more insightful summary information.