# Avocado Hass Market: United States

# Project Overview

# Section 1: Introduction — Business Understanding

In this project we are interested in answering questions regarding the Avocado business and its commercialization. A few of the questions we want to answer are: What are the average prices of avocado in different states? Do some states have different prices or are the prices the same everywhere? What is the time of the year with have higher sales or lowest sales? From highest sales of the year, can we figure it out what week was the most profitable, the least? Have the prices have been increasing as years go by or is there a variation?

# Problem Statement

In order to answer some of these possible questions we must have a better understanding of the data we have access to. For this we use open source data that is available to us on the internet in Kaggle website. In this site we found data that is related to Avocado prices in the United States. Looking at the avocado csv file, we can see data columns such as: average_price, total_volume (Total number of avocados sold), 4046 (Total number of avocados with PLU 4046 sold), 4225 (Total number of avocados with PLU 4225 sold), 4770 (Total number of avocados with PLU 4770 sold), total_bags, small_bags, large_bags, xlarge_bags, type (organic or conventional avocado), year, geography (cities where avocado sales are followed) and month. So far this is a good starting point to try and answer a few questions.

- What is the price to charge for an avocado?
- What is the comparative average price people in various cities are charging for similar units?
- What are the greatest factors that determine prices?

We will prepare the data next and start to visualize some feedback with the help of Data Modeling techniques.

# Section 2: Data Understanding

In order to answer the questions we have in mind, we need to understand the data and see what it can reveal to us after we analyze it.

We have our data from internet. We are going to use avocado-prices csv file that we found in Kaggle.

First read in the dataset and take a look or data.

We have 3 categorical column values and one int that furthermore will need to be converted to categorical. The rest of the columns are numerical. The last column will need to be removed since it will not be relevant for our analysis.

4046, 4225 and 4770 columns refer to the sizes of avocados sold. These are small, medium and large respectively. Total bags is data associated with the amount of different sizes avocados bags were sold in a certain period. Type can be either conventional or organic avocado.

Then we wonder if there are any null values in our dataset.

Fortunately for us our dataset returned false, meaning that we do not have any missing values. Also after some observation we found that our data looks clean. There are some columns though that will not be much used in our analysis. For example, columns 4046, 4225 and 4770 refer to small, medium and large sale of avocados, but these columns also mean the same as total_volume column. Total Volume column is the sum of all avocados. We can use size avocados columns to explore any question related to size.

Let’s look an overall graph of the columns of our dataframe.

# Metrics

# Section 3: Preparing the Data — Clean and analyze

In this dataset, we are interested in finding out more about the data and what we can deduce from it. A quick analysis below that will help us in our exploring.

Columns: date, average_price, total_volume, 4046, 4225, 4770, total_bags, small_bags, large_bags, xlarge_bags, type, year, geography.

** Date** can give us information about the evolution of the product throughout time. We can find out if there is an important change in product consumption during a certain time of the year. Some possible questions that we are thinking about include:

- What is the time of the year that U.S sells more, less avocados?
- Do average prices increase during a specific time of the year?
- Is there any preference in consumption size during a certain time of the year?
- Is there a prevalence of type on a certain time?
- Has it always been the case that size had a direct impact on price fluctuation?

*average_price*

- Are average prices related to geography?
- Do some regions have a tendency to have higher prices?
- Is there a strong relation to the type of product and price?
- What is the tendency in prices over time. Have they increased, decreased?

*total_volume*

- When was the highest sale of the product?
- At what price did the biggest sale occur? the smallest?
- In what region did the biggest, smallest sale took place?

** 4046, 4225, 4770** are the sizes of the avocados: small, medium and large respectively.

- Is there a place where a specific size is more popular than others?
- Is there a time of the year that one size is more prevalent than others?

** type** conventional and organic.

- Overall is one product preferred over the other?
- What are the locations of preference of both products?
- What is the price relationship to these types?

# Data Exploration and Data Visualization

Now we will start analyzing the data to answer some of these questions and have a broader understanding of our dataset.

We know that we have three sizes of avocados in our dataset. Now, we will like to know what is their relationship to average price

It seems that small avocado average prices in its majority fluctuate between 0.8 and 1.5 USD, approximately. This size of avocados is the one that presents highest sales.

Medium avocados have its breaking point about 1.7. Anytime the price goes above this mark sales tend to decrease.

Large avocados have their majority on its sales when prices drop. Generally when prices go over 1.7 the sales decrease dramatically just as medium size do.

We are also interested in what are the most common prices that all avocados sizes get sold at?

The majority of avocados get sold in a range from 0.75 and 1.8 USD, approximately. This information is important because if we were to get into the avocado business we can calculate our profits expectation according to our supply of product at hand.

What year were avocados the most expensive? What is the price tendency since year 2015?

It seems average prices were higher in the year 2017. The tendency of prices was increasing steadily up until then when prices rose considerably. After 2017, prices show a steep decrease up until now. Therefore the tendency up until April 2021 is a decline in the prices. Let’s not forget to mention that we had COVID pandemic in 2020, which could or should be one of the greatest influence in this tendency, but that is another topic.

In what parts of the country avocado prices are the highest and the lowest?

`geography`

Dallas/Ft. Worth 1.073419

Houston 1.073484

South Central 1.102484

Phoenix/Tucson 1.205677

Nashville 1.214984

Columbus 1.218113

New Orleans/Mobile 1.225597

Roanoke 1.230161

Cincinnati/Dayton 1.235935

Richmond/Norfolk 1.244565

West Tex/New Mexico 1.248622

Detroit 1.251065

Indianapolis 1.256129

Denver 1.259290

Louisville 1.267355

Tampa 1.293968

Atlanta 1.300242

Great Lakes 1.307500

Los Angeles 1.310839

West 1.320242

Total U.S. 1.320871

Southeast 1.322016

Pittsburgh 1.324984

Miami/Ft. Lauderdale 1.334387

Plains 1.346371

Orlando 1.354677

Las Vegas 1.354887

South Carolina 1.362661

Midsouth 1.368968

Jacksonville 1.381516

Harrisburg/Scranton 1.381645

Buffalo/Rochester 1.392935

Portland 1.405984

Syracuse 1.415113

Boise 1.433677

Grand Rapids 1.441452

California 1.445081

Northern New England 1.446919

St. Louis 1.454000

San Diego 1.455419

Baltimore/Washington 1.470419

Albany 1.487887

Spokane 1.499226

Philadelphia 1.519371

Chicago 1.519887

Boston 1.534629

Northeast 1.535000

Raleigh/Greensboro 1.539274

Seattle 1.541452

Charlotte 1.544274

Sacramento 1.581984

New York 1.654258

Hartford/Springfield 1.755726

San Francisco 1.757339

The three highest prices are in New York, Hartford/Springfield and San Francisco. The lowest are Dallas/Ft. Worth, Houston and South Central. South Central United States is a region that includes the states of Arkansas, Louisiana, Oklahoma, and Texas.

The highest average prices of avocados are found in San Francisco and the lowest are in Dallas/Ft. Worth. The difference in average price is $0.68 USD.

We are interested to know at what prices in general are the most amount of avocados sold according to our dataset.

`average_price`

3.05 2.068260e+03

2.95 2.417550e+03

3.17 3.018560e+03

3.03 3.714710e+03

2.91 4.103970e+03

...

0.66 2.743026e+06

1.04 2.759713e+06

0.94 2.940731e+06

1.02 2.961101e+06

0.89 3.154869e+06

Name: total_volume, Length: 260, dtype: float64

The highest volume of avocados were sold when the average price was 0.89 USD per unit. The least amount of avocados were sold when the price was 3.05 USD per unit.

We want to know if there is any tendency in prices associated with month and year.

date

2016-10-30 5.292655e+05

2018-11-25 5.649694e+05

2015-11-29 5.692443e+05

2016-11-06 5.722733e+05

2016-11-27 5.847638e+05

...

2018-02-04 1.566461e+06

2019-02-03 1.569369e+06

2020-05-03 1.572185e+06

2018-05-06 1.598463e+06

2020-02-02 1.600011e+06

Name: total_volume, Length: 310, dtype: float64date

2021-01-04 1.050370

2021-01-24 1.104630

2021-01-10 1.117963

2020-02-02 1.131389

2019-02-03 1.138796

...

2017-09-24 1.833333

2017-10-08 1.843333

2017-09-10 1.843889

2017-09-03 1.855185

2017-10-01 1.871296

Name: average_price, Length: 310, dtype: float64

In February 2020, we can see that there was the highest sale of avocados, followed by May 2018 and March 2020. The lowest wast on October 2016, followed by November 2018.

The year we experienced the highest prices was in 2017, while the lowest so far seems to be January 2021. This document is written in April 2021.

What have been the averages prices per year?

We now think we can get more information if we have month as one of our columns. We already have the year column in our dataframe. Le’s extract month from our date column.

What is the month of the year that the U.S sells more and less avocados?

`year month`

2016 11 6.023364e+05

2017 9 6.592119e+05

2016 10 6.670404e+05

2017 10 6.718110e+05

2015 11 6.841482e+05

...

2020 8 1.287638e+06

7 1.342336e+06

2021 1 1.356068e+06

2020 6 1.359096e+06

5 1.458038e+06

Name: total_volume, Length: 72, dtype: float64

May 2020 was the date that U.S. sold most avocados and October 2016 was the date with the least avocados sold.

What is the total volume association with sales by year?

`year`

2015 7.810274e+05

2016 8.584206e+05

2017 8.623393e+05

2019 1.036816e+06

2018 1.040919e+06

2020 1.259413e+06

2021 1.356068e+06

What is the total volume association with sales by month?

`month`

12 8.076695e+05

11 8.223470e+05

10 8.446863e+05

9 9.064773e+05

8 9.499249e+05

7 9.832681e+05

3 9.957150e+05

4 1.016582e+06

1 1.033152e+06

6 1.051247e+06

2 1.095374e+06

5 1.123632e+06

Name: total_volume, dtype: float64

The results above are consistent with our previous results. We observed that the highest volume of sales was in 2021 with 1.356068e+06 units sold, this makes sense since 2021 we also had the lowest prices. As prices decreases volume of sales increases.

It looks like the total amount of sales have been increasing each year, with the exception of of year 2018. 2021 shows to be the year up until March that has had the most amount of sales so far since 2015. 2018 had more sales than 2019. The data also shows that May is the month that has the most sale volume followed by February. The year that had less sales was 2015 and the month with less total volume is December followed by November.

The lowest amount of sales was in 2015, followed by 2016.

May, February and June seem to be the best months for sales, while December, November and October are the slowest.

But what about average prices per year or month?

`month`

2 1.248885

1 1.263919

3 1.316207

12 1.321063

11 1.361806

5 1.362226

4 1.374765

6 1.403636

10 1.448109

7 1.457877

8 1.470107

9 1.485673

Name: average_price, dtype: float64

Prices have been decreasing steadily since 2019 and again the graph shows that highest avocado prices seen were in 2019. July, August, September approximately show the highest prices in general.

Our graph show us that prices have been having a tendency to decrease ever since 2015. We can also see that since 2015, the highest average prices were in 2017. As prices decline total volume of sales increase.

Next, We would like to know what is the avocado size that sells the most.

Small size avocados (4046) sell the most and medium size (4225) sell the least.

We can observe that the most popular size sold since year 2015 until 2021 is PLU 4046. The least sold is extra large with PLU 4770. Therefore the most popular size is small avocado. The smallest size avocado is generally sold most from February-July and medium size the rest of the year.

Again, small bags is what is most popular for sale.

What are the months that have the highest prices and lowest?

`month`

2 1.248885

1 1.263919

3 1.316207

12 1.321063

11 1.361806

5 1.362226

4 1.374765

6 1.403636

10 1.448109

7 1.457877

8 1.470107

9 1.485673

Name: average_price, dtype: float64

September shows to hold the higher prices in average, followed by August and July. The lowest prices in general of our dataset is February.

Is there a marked relation to the type of product and price?

`type`

conventional 1.141474

organic 1.611331

Name: average_price, dtype: float64

What type of avocado sells more?

`type`

organic 6.441855e+04

conventional 1.882223e+06

Name: total_volume, dtype: float64

Overall is one product preferred over the other?

type year

organic 2015 3.165993e+04

2016 4.652424e+04

2017 5.879018e+04

2019 7.262982e+04

2018 7.695845e+04

2020 9.894945e+04

2021 1.225478e+05

conventional 2015 1.530128e+06

2017 1.665327e+06

2016 1.670317e+06

2019 2.001003e+06

2018 2.004880e+06

2020 2.419877e+06

2021 2.589589e+06

Name: total_volume, dtype: float64type year

conventional 2021 0.891944

2020 1.051420

2015 1.077963

2016 1.105595

2018 1.141092

2019 1.187179

2017 1.294888

organic 2021 1.316250

2020 1.513098

2018 1.568946

2016 1.571684

2019 1.615830

2015 1.673324

2017 1.735521

Name: average_price, dtype: float64

Organic type avocado has generally a higher price than conventional. Conventional avocados sell more than organic.

Non organic avocado has been the most sold type in comparison with organic from 2015–2021 and the price of organic has been higher.

Notice that we did not need to do any predictive modeling for the analysis above. We only used descriptive and a little inferential statistics to retrieve the results. We do have a greater understanding now about our data though. The next step for us is to predict prices of avocados.

If a person decides to go into the avocado business it will be important to understand what statistics can predict about the price of the product. We want to know what we expect the price to be, the total volume we expect to sell at a certain time of the year, the price most likely we will want to charge, etc. All this information prior can make a a big difference in the planing of a business venture, we would want to know if we have a competitive advantage or disadvantage while considering all variables associated with a business.

# Refinement

In the next part of the analysis, we are going to look at the categorical variables that we did not use much in our descriptive analysis. We are interested in the “geography” column, since this data has the name of the regions were the sales took place. Mostly we are interested in California, because this is the place we will want to have an avocado plantation or maybe buy and distribute avocados to the population.

California in our dataset has 4 cities that we will focus on: Los Angeles, Sacramento, San Diego and San Francisco. We will like to know which seems to be the best place for us to sell avocados. Would also like to know what prices we can charge at certain times os the year and more or less see what machine learning models can tell us about the the data and its predictions.

Let’s look at this categorical column and see what it tells us. It looks that we will have to separate different columns and create a matrix for analysis.

# Section 4: Data Modeling and Results

Next we move our quantitative variables to an X matrix, which we will use to predict our response. We also create our response. We then split our data into training and testing data.

In our first try we are going to look at the r-square value that we get. This is going to be an important data that we will get and will tell us how well our model is doing.

The actual calculation of R-squared requires several steps. This includes taking data points from dependent variable(y) and independent variables (x) and finding the line of best fit that we will get from our model. Then we can calculate predicted values, subtract actual values and square the results. This will give us a list of errors squared, which we are going to sum and equalize it to the unexplained variance.

To calculate the total variance, we will subtract the average actual value from each of the actual values, then we will square them and sum. We will then divide the first sum of errors by the second sum to finally subtract the results. This is going to give us the r-squared.

And by looking at the actual results versus predictions we get:

We built a model that predicts… but we are missing a lot. A r-squared error value of 0.050 suggests that 5% of the variability in prices can be explained by these variables. The R-squared value is the square of the correlation coefficient.

A common definition for the R-squared variable is that it is the amount of variability in the response variable that can be explained by the x-variable in our model. In general, the closer this value is to 1, the better our model fits the data.

We can see about how well our model perform in our test data. But we see that the R-squared value is not as good as we would expected. We will now see if we can do better by adding categorical values to our model. We could do this by creating an X matrix that would (potentially) allow us to predict better than just the numeric columns we have been using thus far.

We will also want to remove total_us sales as well as columns representing the sales by state. These columns will have redundant sales information that can potentially cause issues in our models and predictions. We also want to get rid of the “date” column since we already have month and year in our matrix. We will not need it for this analysis.

We are going to try again to get a better r-square value. This time we are using our new matrix and a linear regression model.

It looks like we did better, we do have a higher r-squared value and a lower mean squared error.

We are going to see if we can do even better with an optimized linear model. This time we are going to predict using a matrix holding all of the variables we want to consider when predicting the response. Let’s create the X (explanatory matrix) and y (response vector) to be used in the model.

R-Squared, also known as the Coefficient of Determination, is a value between 0 and 1 that measures how well our regression line fits our data. R-Squared can be interpreted as the percent of variance in our dependent variable that can be explained by our model. The closer R-Squared is to 1 or 100% the better our model will be at predicting our dependent variable.

Since we are using a rich regression model one of the best way to tell which x features matter in our model is to look at the weight of their coefficients. A larger coefficient means a larger influence or impact in the variable we are interested in, in this case average price. we can look at the size of the coefficients in the model as an indication of the impact of each variable on the average price. The larger the coefficient, the larger the expected impact on average price.

The analysis above of the coefficients is again consistent with our results we obtained before from our descriptive analysis. We did say that San Francisco, the month of September and type organic had the highest prices and Houston had the lowest average prices according to the analyzed information of our dataframe.

One of the best out of the box methods for supervised machine learning is known as the RandomForest — let’s see if we can use this model to outperform The linear model from earlier. Let’s see what be the best number of features to use based on the test set performance

We can see that RandomForest did helped us to improve our model. The R-squared values we were able to get before were o.61 and 0.62 in train and test data respectively, also our MSE was 0.056. Now we have values of 0.98 for training and 0.87 for test data and a lower MSE of 0.017.

In simple linear regression, we compare two quantitative variables to one another. In our last analysis we are going to use linear ordinary least squares (OLS) model.

The response variable is what you want to predict, while the explanatory variable is the variable you use to predict the response.

If we try to predict price this is what we get:

Total_volume is statistically significant and the results suggest that as the independent variable increases (total_volume), the dependent variable tends to decrease (average_price). For every unit increase of total volume sales we would predict that the price will decrease by 1.77 USD.

If we tried to predict volume:

As the average price increases, total_volume decreases.

From a simple analysis above we can see that out r-squared value is not that great. Yet, we are able to see that total_volume is statistically significnat for predicting price.

As we can see simple linear regression did not help us too much though since we got a small r-squared. We are going to use now multiple linear regression since we have more variables including categorical that we can check. We will be using both quantitative and categorical x-variables to predict a quantitative response.

Let’s say that we ae interested in avocado businees and we will like to know what is the average price we should charge for the units. If we are going to sale avocados in California State, we can look at the prices we would be able to charge by specifying the city.

To answer this question we will look at the data that we already have. We know that the size of the avocados as well as the locations where avocados are sold can dictate or influence prices.

We can attempt to do this using linear algebra to predict our price using all our variables at the same time (categorical and quantitative).

For this we need to create a matrix of inputs (X) and a vector of the response that we want to predict (y). But since we already built one before, we will be using it here as well.

Let’s try to fit our lineal model using quantitative variables first. We get the following summary result.

total_bags’,’small_bags’,’large_bags’,’xlarge_bags’ showed us to be statistically significant. We could say now that for each additional unit increase of avocados sold, average prices are expected to decrease by 2.09 as long as all the other variables stay the same. There is an issue with this statement though because p value associated with total_volume is slightly greater than 0.05, this will make total_volume not statistically significant.

Since we are not getting a good r-squared value and we know that there are potential problems related to Multiple Linear Regression, we are wondering if our model is accurate. Next we are going to check for multicollinearity, which is one of the five potential problems.

Multicollinearity is when we have predictor variables that are correlated with one another. One of the main concerns of multicollinearity is that it can lead to coefficients being flipped from the direction we expect from simple linear regression.

One of the most common ways to identify multicollinearity is with bivariate plots or with variance inflation factors (or VIFs).

We can see that size small_size and large_size have negative coefficient. The interpretation of these coefficients now is counter intuitive to the relationship we expected from our pairplot results. This is one of the side effects to have multicollinearity in our model. Let’s look at the other way to identify if our predictor is correlated to one another using variance inflation factor (VIF)

Let’s remove total_volume and other variables since the VIF factor is too high. We will also remove bags since they do not seem to be statistically significant.

We do see that r-squared value increased and our VIF Factors now are below 10. All the variables in this model showed us to be statistically significant. We could say now that for each additional small size avocado sold, medium size and large size avocado average prices are expected to decrease by 1.24, 1.10 and 1.28 respectively as long as all the other variables stay the same.

We are also interested to look at the relationship of total_volume and prices.

For the association of total_volume of avocados sold and average_price in our dataset, we get R-squared value of 0.874, It means that the variability in total_volume can explain 87.4% of the variability in average_price.

‘average_price’,’small_size’,’medium_size’,’large_size’ showed us to be statistically significant, meaning that they are statiscally related to the response. We could say now that as average_price increases, total_volume is expected to decrease by 4.04e+04 as long as all the other variables stay the same.

On both of the models we get warnings that point out a flaw in the model: there could be strong multicollinearity — which means the explanatory variables used could be correlated. This can lead to some instability in the parameter estimation, but it is difficult to predict the exact effects of multicollinearity. In this case, both variables are still significantly associated with our dependent variable and the coefficients are places in small confidence intervals.

Next we are going to use our categorical variables in our lineal model. If we are only interested in finding out the predictions for selling avocado in California area, we can analyze below to see what we can deduce regarding the data from the whole country.

Geography -> Let’s look at Los Angeles

Geography -> Let’s look at Sacramento

Geography -> Let’s look at San Diego

Geography -> Let’s look at San Francisco

Los Angeles: the average price to charge for avocado would be 1.4974 USD Sacramento: 1.4918 USD, San Diego: 1.5407 USD, San Francisco: 1.5442 USD as long as all the other variables stay the same.

Since San Francisco is one of the most expensive cities in United States, it makes sense that the Average price predicted here for avocado is higher.

Also, if we are interested to see the predicted average prices for the rest of the cities in the country, we just need to look at the other variables that are being compared to the baseline. For example, in our case first analysis we did it on Los Angeles baseline.

If we observe in the summary above, New York compared to the baseline has a positive higher value. If we take a look at the intercept we can predict the average price of the avocado in New York, again as long as all the other variables stay the same.

# Conclusion

To summarize this project: we started with the problem of finding out tendencies of prices of avocado in the United States. We wanted to be able to predict prices based on historical data if we were going to venture into the avocado business.

# Reflection

We wanted to know as much as we could about the volume as well and the relationship that price, volume and location of avocado sales. Several features of the data were visualized to understand it better and then the data was preprocessed for machine learning. We then then implemented descriptive and inferential statistics as well as three different machine learning models: Linear Regression, Random Forest Regression and OLS linear regression. The R2 metric for each model was recorded, which provides a starting point for improving the models in the future. The goal of this project was to research how regression-based machine learning models can be applied to the avocado business markets.

# Improvement

The possible ways to improve the models include adding more features, and creating a matrix that will handle better the outliers.