Best American cities to move to and invest in real estate

I. Introduction
Due to the impact of COVID-19, the world has changed rapidly over the past year. Many working professionals can now choose where to live regardless of where their jobs are located, and the housing market has been turned upside down. Experts predict that the housing market could see large swings in growth, mortgage rates, and available housing in the near future. These rapid changes may have long-lasting consequences for the housing market in many densely populated cities. It’s no secret that areas such as New York City and San Francisco have much higher costs of living, so many Americans are moving to cheaper areas, but moving is about much more than a lower cost of living. Is moving right now a good decision, and if so, which cities would be the best places to buy property?

Buying real estate is a significant and potentially long-term investment, so it’s important to understand which cities have the most potential growth in real estate value as well as the ability to provide a high quality of life. Many factors matter when making a real estate investment decision, and different individuals have different personal values, so I set out to build a model that considers multiple factors and produces an overall weighted score based on each individual’s preferences. From a financial perspective, I wanted to understand the cost of living in each city, how much it would cost to buy property there, and the property’s potential growth over time. Beyond the purchase itself, I also wanted to see what the quality of life is in each city. Since I am also considering moving to and living in the city I invest in, I looked at sentiment factors as well. For example, how safe and happy do people feel while living in a particular city?

Since each user has different preferences, no single city is best for everyone. However, I was able to generate a report with details on each city that a user can draw on when making a decision. I started with a list of questions that I hoped to answer through the analysis:

II. Research Questions

Given different personas, which cities are best for individuals looking to relocate?
What are the best cities to live in the U.S. to maximize your housing investment?
What cities have the highest ROI on their property value?
What cities are the best to live in from a sentimental point of view?
What cities have the highest potential growth over the future?


III. Describing and Preparing the Data

Tweet Data

I wanted to assess how happy people were in the different cities and how they felt about living there. To begin the analysis, I pulled tweets from Twitter users in each city to get raw feedback. Using this social media data, I studied the attitudes of residents in particular regions. I chose this data set in order to create a ‘sentiment’ value for each city so that the cities could be compared with one another.

I narrowed my search to the top 35 metro areas across the United States and pulled 300 tweets for each area using the Twitter API. One challenge was that Twitter only allowed me to pull 100 tweets at a time. To pull the data more efficiently, I wrote a for loop that cycled through all the cities and pulled 100 tweets for each. This let me run a single script that made 35 API calls, and I ran it 3 times. Each API call returned a JSON file with 100 tweets. I then ‘cleaned’ that data by concatenating all of the JSON files, removing the unwanted JSON fields, adding a unique city identifier, and converting the result into a .csv file so that it could be more easily analyzed.
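The cleaning step described above can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not the script used for the project: the field names `id` and `text`, the helper names, and the sample payloads are all assumptions made for the example, and the real payload layout depends on the API version used.

```python
import csv
import io
import json

# Hypothetical field names; the real tweet payload has many more fields.
KEEP_FIELDS = ["id", "text"]


def flatten_batches(batches):
    """Concatenate JSON batches into rows, keeping only KEEP_FIELDS plus a city tag.

    batches: iterable of (city, json_string) pairs, one pair per API call.
    """
    rows = []
    for city, payload in batches:
        for tweet in json.loads(payload):
            row = {field: tweet.get(field) for field in KEEP_FIELDS}
            row["city"] = city  # unique city identifier added to every tweet
            rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the cleaned rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=KEEP_FIELDS + ["city"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample payloads standing in for two API responses.
sample_batches = [
    ("Orlando", json.dumps([{"id": 1, "text": "love it here", "lang": "en"}])),
    ("Pittsburgh", json.dumps([{"id": 2, "text": "cold but nice", "lang": "en"}])),
]
rows = flatten_batches(sample_batches)
csv_text = rows_to_csv(rows)
```

Tagging each row with its city at load time means the later per-city averaging only needs a simple group-by on that column.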

To create a sentiment score for each city, I split every tweet into individual words, keeping each word linked to its tweet ID. I then threw out all of the stop words and any words with a sentiment score of 0. Next, I computed a tweet sentiment score as the average score of the remaining words in the tweet. Finally, I averaged the tweet sentiment scores within each city to create a city-level sentiment score.
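As a rough illustration of this two-level averaging, here is a small Python sketch. The word lexicon, the stop-word list, and the sample tweets are invented for the example; the actual analysis would have used a full sentiment lexicon, and the names `tweet_score` and `city_scores` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical lexicon and stop-word list, for illustration only.
LEXICON = {"love": 3, "great": 3, "nice": 2, "hate": -3, "awful": -3}
STOP_WORDS = {"the", "a", "is", "it", "i", "but", "here"}


def tweet_score(text):
    """Average lexicon score of the scored words in one tweet, or None if none remain."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [
        LEXICON[w]
        for w in words
        if w not in STOP_WORDS and LEXICON.get(w, 0) != 0
    ]
    return mean(scores) if scores else None


def city_scores(tweets):
    """tweets: iterable of (city, text) pairs. Returns {city: mean tweet score}."""
    per_city = defaultdict(list)
    for city, text in tweets:
        score = tweet_score(text)
        if score is not None:  # tweets with no scored words are dropped
            per_city[city].append(score)
    return {city: mean(scores) for city, scores in per_city.items()}


sample = [
    ("Orlando", "I love it here, great weather!"),
    ("Orlando", "The traffic is awful"),
    ("Pittsburgh", "Cold but nice"),
]
result = city_scores(sample)
```

Note that a tweet with no scored words contributes nothing at all, which is one reason the effective sample per city ends up smaller than the number of tweets pulled.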

Overall, over 10,000 tweets were pulled that used city names as keywords. I then combined this dataset with the other ‘quality of life’ data (cost of living and crime data) to help build a quantifiable model of how much people currently living in each city enjoy it.

By classifying the content of these tweets and running sentiment analysis on them, I was able to assign values and derive a weighted metric for how happy people are in each city.

Cost of Living and Crime Data

To obtain the housing information needed for the analysis, I pulled three sets of raw housing data from Zillow: the home value index, the rental index, and home inventory and sales.

The home value index is a smoothed, seasonally adjusted measure of the typical home value across a given region and housing type. This index reflects the typical value for homes in the 35th to 65th percentile range. For this analysis, I reviewed the home value index over the past 30 years for all homes, including single-family homes, condos, and co-ops. The rental index is computed by taking the mean of listed rents in the 40th to 60th percentile range for all homes and apartments across a given region. The home inventory and sales data count the unique listings for all homes that were active at any time in a given month. I narrowed down the Zillow data to the top 35 cities ranked by size and ran a time series on home values over the last 20 years. Additionally, I analyzed the crime rates and cost of living index for the same 35 cities.

I pulled the cost of living dataset from AdvisorSmith and used its most recent cost of living index, which considers the following expenses: food, housing, utilities, transportation, healthcare, and consumer discretionary spending. I also used individual state datasets from the FBI’s Crime Data Explorer, from which I considered both violent and property crime rates. Violent crimes include homicide, rape, robbery, and aggravated assault. Property crimes include arson, burglary, larceny-theft, and motor vehicle theft.

The raw data was not usable in its original form, so I pulled all the raw datasets into R to clean them. In R, I cleaned the Zillow datasets by dropping all columns containing missing values and removing data for the cities excluded from the analysis. The cost of living dataset was already well structured and required little wrangling; I filtered it down to the 35 cities analyzed in the Zillow data. Due to limitations in how each city reports its crime data, I assessed crime rates at the state level rather than at the city level. I found the most consistent crime rate definition in the state-wide data provided by the Crime Data Explorer. In R, I pulled the raw violent crime rate and property crime rate data for each state and cleaned each dataset to reflect the most recent crime rate, from 2019. Afterward, I used a left join to combine the two crime datasets for each state, then used rbind to combine the data for all the states.
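The join-then-stack step was done in R, but the same idea can be sketched in plain Python for readers unfamiliar with left joins and rbind. Everything here is illustrative: the column names, the `left_join` helper, and the placeholder rates are assumptions, not the real Crime Data Explorer layout or figures.

```python
def left_join(left_rows, right_rows, key):
    """Keep every left row; add the right-hand columns, or None when no match exists."""
    right_by_key = {row[key]: row for row in right_rows}  # assumes unique keys on the right
    right_cols = sorted({col for row in right_rows for col in row if col != key})
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


# Made-up placeholder rates, not real FBI figures.
violent = {
    "State A": [{"state": "State A", "violent_rate": 1.0}],
    "State B": [{"state": "State B", "violent_rate": 2.0}],
}
property_crime = {
    "State A": [{"state": "State A", "property_rate": 10.0}],
    "State B": [{"state": "State B", "property_rate": 20.0}],
}

# Left join the two tables for each state, then stack the results (the rbind step).
per_state = [left_join(violent[s], property_crime[s], "state") for s in violent]
all_states = [row for table in per_state for row in table]
```

A left join keeps a state even when its property crime record is missing, so gaps show up as empty values rather than silently dropping the state.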

IV. Analyzing the Data and Running Models

Time Series Analysis on Zillow Data

I used time series analysis to identify trends in, and forecast the future value of, real estate across multiple cities. In real estate valuation, it is used to analyze how prices change over time, along with other factors, to determine market value, which ultimately leads to the question of which cities are the best to move to.

The cleaned Zillow data sets were then used to run three unique time series that correspond to housing and rent prices. This allowed me to see the changes in home values as well as the velocity of those changes in each city. My goal was to evaluate potential property value to determine which cities had the most growth and overall value. To evaluate the potential growth of the properties, I ran a time series on the average property value for each city: an ARIMA model was created for each of the top 35 cities to predict monthly home values for the next 10 years. After cleaning the home values data pulled from Zillow, I separated the data by city so that each city had its own data frame that could be converted into a time series object. Each time series object used a start date of June 1999, an end date of Jan 2021, and a frequency of 12, so that the predictions would be monthly home values. For each city’s time series object, I used the auto.arima function (from R’s forecast package) to select the p, d, and q orders of the model and produce home value predictions for the next 10 years. I then converted the predictions into data frames and exported them to a CSV file.
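To give a feel for what the forecasting step does, here is a deliberately simplified Python sketch. It is not the automated order selection used in the analysis; it implements only the simplest member of the ARIMA family, a random walk with drift (ARIMA(0,1,0) with a constant), in which the series is differenced once and the average monthly change is projected forward. The function name and the sample values are made up for illustration.

```python
def drift_forecast(values, horizon):
    """Forecast `horizon` steps ahead with a random walk plus drift.

    This is ARIMA(0,1,0) with a constant: difference the series once,
    average the differences, and extend the last observation by that average.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    drift = sum(diffs) / len(diffs)
    last = values[-1]
    return [last + drift * step for step in range(1, horizon + 1)]


# Made-up monthly index values, not Zillow data.
monthly_values = [200.0, 202.0, 203.0, 206.0]
forecast = drift_forecast(monthly_values, horizon=3)
```

For a 10-year monthly forecast like the one described above, `horizon` would be 120. A fitted model with nonzero p or q would add autoregressive and moving-average terms on top of this baseline.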

I then imported the CSV file exported from R into Tableau to analyze the trends in growth and the increase in property value.

Time Series over 10 Year Period of Time:

My overall analysis method allowed me to look at the month-to-month details of the growth in property values. The summary focuses mostly on a 10-year long-term growth period; however, this can be adjusted based on individual user preferences.

The model showed that cities in the Pacific Southwest had the highest upside for growth in property value over the next 10 years. Four of the top five cities were located in California, and the fifth was Phoenix, AZ. This ranking was calculated using the total dollar increase in property values in each city.

In Descending order of Highest to Lowest Property Value Growth:

Sentiment Analysis on Twitter Data

Twitter sentiment data improves the accuracy of predictions for socio-economic indicators, price changes, and standards of living, compared to models with exclusively economic and demographic variables. I combined the sentiment analysis with the collected data for cost of living and violent crime to produce a set of quality of life metrics. Because the data types and variables differed, I ran into some limitations. To work around them, I standardized the data.

Model and Data-based Limitations

Standardizing the data (turning it into a 0-1 rating) changes the data. To standardize, I converted all of the raw data into Z-scores ((Actual - Mean) / Standard Deviation), then trimmed every value above 2 down to 2 and every value below -2 up to -2. That allowed me to divide the scores by 5 and add 0.5 to create a value between 0 and 1 (in practice, between 0.1 and 0.9). A limitation of this technique is that large outliers in the data do not have as much impact on the model as they otherwise would. However, this was necessary to make the persona utility functions easy to create and scale.
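The standardization can be written out as a short Python sketch. The description above does not say whether the population or sample standard deviation was used, so the use of the population form here is an assumption, and the sample values are invented to show how a large outlier gets capped.

```python
from statistics import mean, pstdev


def standardize(values, clip=2.0):
    """Z-score each value, clip to [-clip, clip], then map to z / 5 + 0.5."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.5 for _ in values]  # all values identical, so all sit at the midpoint
    scaled = []
    for value in values:
        z = (value - mu) / sigma
        z = max(-clip, min(clip, z))  # trim extreme z-scores to +/- 2
        scaled.append(z / 5 + 0.5)
    return scaled


# Made-up values: nine typical cities and one extreme outlier.
raw = [10.0] * 9 + [1000.0]
scores = standardize(raw)
```

In this example the outlier has a raw Z-score of 3 but ends up at the same 0.9 ceiling as any city with a Z-score of 2, which is exactly the loss of outlier information noted above.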

Additionally, tweet data is limited in its ability to produce a city sentiment score that reflects the people living there. In my data, Orlando had the highest sentiment score; however, I assume that is heavily influenced by people vacationing in the city, whereas a city like Pittsburgh is far less likely to have vacationers tweeting about it. On top of that nuance, 300 tweets per city was nowhere near a large enough sample size. After I threw away words with a sentiment score of 0, there was simply not much data left per city.

Also, the crime statistics used for each city were gathered at the state level. This is not necessarily the best metric because all cities within a single state receive the same crime number, which I know is not accurate. I also know that crime in large cities tends to be higher than crime in rural areas, so a city like DC has outlandishly high crime statistics relative to other cities because it has no rural areas to pull down its crime statistics. The opposite effect occurs for cities in states without large urban populations.

Link between Sentiment and Property Values

I ran correlations on the different factors, and the results showed that while property values were correlated with rent and cost of living, they were not significantly correlated with sentiment values or crime. This gave me confidence that the two analyses were working with largely independent factors and that they did not bias each other.
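The type of correlation used is not stated; assuming it was the usual Pearson coefficient, the check between any two factors could be sketched in Python like this. The function name and the toy inputs are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy inputs: a perfectly increasing and a perfectly decreasing relationship.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

A coefficient near 0 between property values and sentiment would support treating them as separate inputs, although with only 35 cities even moderate correlations can fail to reach significance.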

V. Recommendations/ Conclusion

In the above analysis, I used a time series analysis and a Twitter sentiment analysis to explore which cities are worth moving to and investing in. My goal is to provide insights that will help individuals in their real estate investment and moving decisions.

After analyzing the data and running the models, I was able to measure each city using a few key factors: the average property value, the inflation levels of the property value, the cost of living, the crime rate, and the sentiment value. After compiling this data, I created a model that treats these as individual factors and allows each user to assign weights to them.

If a user cared more about money, they could assign a higher weight to property value, and if a user was very concerned about safety, then they could assign a higher weight to the crime rates. The property values were evaluated over a long period in order to project potential growth and future value, but the quality of life metrics were determined using recent data, since people would be living in that area immediately if they decided to move.

Since each individual will have different values and preferences I created a weighted model of different personas with varying parameters:

*Remember that a negative CoL score means that you are conscious of cost of living, a negative home value score means that you would like to spend less up front for your house, and a negative crime score means you prefer cities with low crime.
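A weighted persona model of this kind can be sketched in a few lines of Python. The factor names, the persona weights, and the two placeholder cities below are all hypothetical; they are not the weights or scores used in the analysis, and they only show how negative weights penalize high cost of living, high home prices, and high crime.

```python
def persona_score(city_metrics, weights):
    """Weighted sum of a city's standardized metrics under one persona's weights."""
    return sum(weights[factor] * city_metrics[factor] for factor in weights)


def rank_cities(cities, weights, top_n=5):
    """Return up to top_n city names, best score first."""
    ranked = sorted(
        cities,
        key=lambda name: persona_score(cities[name], weights),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up standardized metrics for two placeholder cities.
cities = {
    "City A": {"growth": 0.8, "home_value": 0.7, "col": 0.7, "crime": 0.6, "sentiment": 0.5},
    "City B": {"growth": 0.5, "home_value": 0.3, "col": 0.3, "crime": 0.4, "sentiment": 0.6},
}

# Hypothetical persona weights; negative means "lower is better" for that factor.
budget_buyer = {"growth": 0.2, "home_value": -0.3, "col": -0.3, "crime": -0.1, "sentiment": 0.1}
investor = {"growth": 0.6, "home_value": 0.1, "col": -0.1, "crime": -0.1, "sentiment": 0.1}

budget_ranking = rank_cities(cities, budget_buyer)
investor_ranking = rank_cities(cities, investor)
```

With these invented numbers the cheaper City B ranks first for the budget persona, while the faster-growing City A ranks first for the investor persona, which mirrors how the same city data can yield different top five lists per persona.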

Based on these persona archetypes and the overall results of analysis I created a top five list of recommendations for each Persona:

Under the weighted system, home buyers with families should invest in homes in and move to Orlando, Florida; Cincinnati, Ohio; Tampa, Florida; Phoenix, Arizona; and Cleveland, Ohio.

Rich and high-class individuals should invest in and move to Orlando, Florida; Riverside, California; San Francisco, California; San Diego, California; and San Jose, California.

Single investors should invest in and move to Phoenix, Arizona; Riverside, California; Tampa, Florida; Orlando, Florida; and Austin, Texas.

Budget-conscious home buyers should invest in and move to Orlando, Florida; Charlotte, North Carolina; Riverside, California; Tampa, Florida; and Phoenix, Arizona.

Individuals who care highly about safety and good vibes should invest in and move to Orlando, Florida; San Francisco, California; San Jose, California; Seattle, Washington; and Riverside, California.

VI. Citations

Evangelou, N. (2020, December 04). 8.9 million people have relocated since the beginning of the pandemic. Retrieved March 03, 2021, from https://www.nar.realtor/blogs/economists-outlook/8-9-million-people-relocated-since-the-beginningof-the-pandemic

Orton, K. (2021, January 14). Experts predict what the 2021 housing market will bring. Retrieved March 04, 2021, from https://www.washingtonpost.com/business/2021/01/11/2021-housing-market-predictions/

Ramani, A., & Bloom, N. (2021, January 28). The doughnut effect of COVID-19 on cities. VOX, CEPR Policy Portal. voxeu.org/article/doughnut-effect-covid-19-cities

Bennett, S., & Zuelke, C. (2020, October). Reassessing real estate investments during the coronavirus pandemic. The Private Bank, Wells Fargo Bank. www.wellsfargo.com/the-private-bank/insights/reassessing-ream-during-covid/