Final Project Report

Zach Norman

Published

December 15, 2023

Importing the Dataset

I analyzed this dataset from Kaggle for weekly sales data from 45 Walmart stores between February 5, 2010, and November 1, 2012. In addition to weekly sales, it also includes information on if the week had a holiday (a value of 0 or 1 if Labor Day, Thanksgiving, Christmas, or the Super Bowl occurred), unemployment rate, average temperature in Fahrenheit, fuel price, and consumer price index. The dataset has 8 columns and 6435 rows (one row per store per week).

Total Sales Data

Below are the plots for weekly sales data over time, both over the course of the entire dataset and color-coded by year. There are highs and lows throughout the year, but generally sales are flat throughout the year, with the exception of the end of the year around the holidays, where there is a big increase in sales. There is no upward trend at the end of 2012 due to the dataset cutting off in October 2012.

As seen here, all the years are pretty similar in trends as well, with some peaks happening in slightly different spots. In order to correctly overlay the years (because going off of just the dates, even coloring by year, would just make a color-coded version of the graph above), I made a new column with the same month and day, but made the year for every week 2013. I then plotted with the new column in the x-axis, colored based on the original year, and changed the x-axis labels to exclude the year.

Comparing stores

I also looked at comparing each store and seeing what each store looked like in terms of sales. As you can see, all the stores follow similar trends, and each store has a similar ranking in terms of sales throughout the entire dataset. Stores 20 and 4 have the highest sales, while 33 and 44 have the lowest.

Seasons

I also decided to look at sales by season to see if there was a difference. I decided to group seasons by the following: - Winter: December, January, February - Spring: March, April, May - Summer: June, July, August - Fall: September, October, November

When looking at the plot originally for sales totals, it appeared that summer was highest and winter was lowest, which seems strange as winter should include most of the holiday shopping (though Black Friday would be considered fall). However, since the dataset is missing January 2010 and November-December 2012, it is missing a decent amount (2 months) of winter data. I then decided to look at average sales instead, where winter was the highest, lining up more with what we’d expect from the time series plots.

Holidays

Adding additional holidays

I wondered if other holidays had an impact on sales, so I decided to add other holidays to the dataset that tend to have some shopping associated with them. I marked the holidays I added with 2 (changing my logical variable into more of an ordinal one) so I could more easily differentiate the holidays I added versus the ones that already existed in the dataset.

# Add other holidays to the store_data. 
# Flag with "2" to differentiate from pre-existing holidays in the store_dataset
# Observation: dates are for end of week, not beginning of week 
# (so take date and check week previous, not week after)
holidays = c(
  "2010-05-09", "2011-05-08", "2012-05-13", # Mother's Day
  "2012-06-20", "2012-06-19", "2012-06-17", # Father's Day
  "2010-05-31", "2011-05-30", "2012-05-28", # Memorial Day
  "2010-04-04", "2011-04-24", "2012-04-08", # Easter
  "2010-10-31", "2011-10-31", "2012-10-31", # Halloween
  "2010-07-04", "2011-07-04", "2012-07-04", # 4th of July
  "2010-02-14", "2011-02-14", "2012-02-14"  # Valentine's Day
)
determine_holiday = function(day) {
  for (holiday in holidays) {
    if(between(as.numeric(as.Date(holiday,format="%Y-%m-%d")-day), -6, 0)) {
      return(TRUE)
    }
  }
  return(FALSE)
}

# Allows determine_holiday to take in one date at a time in the ifelse
determine_holiday_vectorized <- Vectorize(determine_holiday)
store_data_with_holidays = store_data %>%
  mutate(
    Holiday_Flag = ifelse(
      Holiday_Flag==0, 
      ifelse(
        determine_holiday_vectorized(Date),
        2,
        Holiday_Flag
      ),
      Holiday_Flag
    )
  )

Looking at holiday impact

I decided to remake some of the time series plots I made earlier, but point out holidays as well. One of the issues I came across is due to the the dates in the dataset being for the last day of the sale week. The way ggplot colors the graphs (as an example):

  • Dec 20 has a holiday flag of 0, as Dec 14-20 have no holidays

  • Dec 27 has a holiday flag of 1, since Christmas is 2 days earlier and contained in the week

    • ggplot colors with a different color going forward
  • Jan 3 has a holiday flag again of 0 since there’s no holiday in the previous week, so then ggplot changes the color back

However, we want it to color previous to the date based on the holiday flag, not previous to it. Therefore, I had to add a lead variable to get the next week’s holiday flag and color off of that instead (code for lead variable below).

holidays = store_data_with_holidays %>% 
  select(Date, Holiday_Flag) %>% 
  distinct() %>% 
  mutate(
    Next_Week_Holiday = lead(
      Holiday_Flag, default = 0
    )
  )

I made a plot that color-coded weeks based on their holiday flag and a plot grouped by year with points for where each holiday occurs. Based on this, it appears that most peaks in sales occur either the week of the holiday or the week before (depending on when in the week the holiday occurred. For example, a holiday occuring on a Thursday would show sales increase the same week, whereas a holiday on a Sunday would likely show the week before). What is interesting is that there appear to be small peaks at the beginning of August and beginning of October that don’t correspond with holidays.

Looking at other factors

I was interested in seeing if the other variables in the dataset affected sales in any way. So I could display the plots using a facet-wrap, I did a pivot longer to have each of the variables as a column I could group on.

As can be seen in the plots, there does not appear to be much relationship between CPI/unemployment/temperature/fuel price and weekly sales. Some things I noticed: - Stores don’t change much in terms of CPI - No matter the fuel price, weekly sales for each store remained consistent/flat - Temperature, CPI, and Unemployment appear to have a slight right skew, but there is generally not much relationship.

I also decided to divide each variable into 3 bins and create boxplots to see if there was a major difference in the median sales. While the range sizes of the middle 50% and extreme values at times differ, the medians appear to not shift across variable values.

I then decided to bin the values by rounding them and creating a histogram for each variable. When looking at the histograms, Unemployment appears to have the most relationship (a downward trend), while Temperature, Fuel Price, and CPI appear to have a bell-curve shape (larger middle, lower ends). This also matches up with the calculated correlation values. Unemployment has the highest correlation but is still very low (-0.1, very close to 0).

[1] "Temperature: -0.0638100131794695"
[1] "CPI: -0.0726341620401763"
[1] "Unemployment: -0.106176089657954"
[1] "Fuel: 0.00946378631447514"

I also looked at how each variable changed for each store throughout the data. As can be seen below, all stores for temperature and fuel price have similar trends and move together throughout the data, while unemployment either stays flat or decreases depending on the store, and CPI either stays flat or increases. These trends make sense as time goes on, getting further away and recovering from the 2008 recession.

Conclusion

Based on the plots, it appears that whether a week contains or is near a holiday had the most impact on sales. All stores follow similar trends and aren’t impacted much by CPI, Temperature, Unemployment, or Fuel Price.

There were a few challenges I came across while working on the project. One issue is that the data is missing Jan 2010 and Nov-Dec 2012. This makes it difficult to, for example, find out what percentage of a year’s sales a particular week makes up, as I only have 52 weeks for 2011. I originally wanted to look at this to control for some stores getting more sales consistently/being larger than others. I also had to account for dates being the last day of the sales week and not before. When I was first adding in holidays (and treated the dates as if they represented the first day of the week), I thought the holidays that were already existing in the dataset were wrong because they seemed to be a week off. I also, again, had to add the lead variable for ggplot to correctly color weeks, because the dates were for the last day and not the first. Finally, I wish I had some more information on each store, such as where stores were located, population of area, distance to other Walmarts, etc. That way, I could look at whether a store has low sales because they have a low population, or if there’s just a lot of Walmarts in the area, which would spread sales out.

I started to look into temperature patterns to see if I could maybe get a sense of region, and all stores have cold winters and warmer summers, suggesting that they are all in a region that at least experiences changes in seasons rather than consistent weather year-round. Many stores are similar in terms of temperatures, but one store reaches a maximum of 60 degrees during the summer, suggesting that maybe they are in the northern part of the Northeast, while another seems to be above 50 degrees the entire year, including the winter, suggesting that they might be in the southern part of the country, or west coast where temperatures tend to be warmer compared to the east coast, However, beyond this, it’s hard to tell where these stores may be.