Topic Modeling on Racing to Space

For this project, I will focus on natural language processing, specifically topic modeling using Latent Dirichlet Allocation (LDA).
Full Code: Topic Modeling to Space FULL HTML

Interest in space has moved from the fringe to the mainstream as big business takes a growing interest in the sector. The space race used to be between countries; now it is between billionaires. Jeff Bezos, Elon Musk, and Richard Branson have increasingly turned their attention to space over the last two decades. All three billionaires have similar but contrasting space ambitions, with the main purpose of transforming the private sector and expanding the commercial space sector. They are among many trying to expand the capabilities of the commercial space industry and create a sustainable business model. The business impact could include more satellites and easier, cheaper transportation into space for both people and cargo. The space economy has an increasingly bright future as barriers to entry decrease and access to satellite technology and innovation increases. My objective is to identify topic clusters that point to segments of interest for future investment in the commercial space sector.

The data source for this project is news feeds collected using Webhose.io. I implemented API calls to obtain all news feeds mentioning four major entities, SpaceX, NASA, Virgin Galactic, and Blue Origin, and stored them as JSON documents. The query parameters were defined to limit results to English-language articles from news websites only. The collection totals 3,327 articles before cleaning and deduplication.

This project performs topic modeling with Latent Dirichlet Allocation (LDA). I chose LDA because it classifies sets of observations into similar groups, creating clusters of the main topics that are most representative of the articles. I first collect news articles related to SpaceX, Blue Origin, Virgin Galactic, and NASA from Webhose, then extract the text from each article. I perform basic data cleaning by tokenizing each sentence into a list of words and removing punctuation and stopwords. Before creating the dictionary and corpus required by the LDA model, I lemmatize the words to map different forms of a word to a single form and increase accuracy. A base model is created with default values from the gensim Python package, and the optimal number of topics is then determined through hyperparameter tuning. The project's deliverables include a visualization of the topics and a trained LDA model that can identify the topics of an article.

The topics are evaluated with two major methods: first, human inspection of the top "N" words in each topic; second, the quantitative topic coherence score, which measures each topic by the degree of semantic similarity between its high-scoring words. The final model provides the optimal number of topics based on the highest coherence score. The sketches below walk through the data collection, preprocessing, and tuning steps, ending with training an LDA model with parameters from the elbow method.
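A minimal sketch of the data collection step, assuming the `webhoseio` Python client; the API token, output filename, and exact filter syntax are placeholders rather than the project's actual values:

```python
# Sketch: pull English news articles mentioning each entity from Webhose.io and store the raw posts.
import json
import webhoseio  # Webhose.io Python client

webhoseio.config(token="YOUR_API_TOKEN")  # hypothetical placeholder token

entities = ["SpaceX", "NASA", "Virgin Galactic", "Blue Origin"]
articles = []

for entity in entities:
    # language/site_type filters follow the Webhose query language; the exact syntax is an assumption.
    params = {"q": f'"{entity}" language:english site_type:news', "sort": "crawled"}
    output = webhoseio.query("filterWebContent", params)
    articles.extend(output["posts"])
    while output.get("moreResultsAvailable", 0) > 0:
        output = webhoseio.get_next()  # paginate through the remaining results
        articles.extend(output["posts"])

with open("space_news_feeds.json", "w") as f:  # hypothetical output file
    json.dump(articles, f)
```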
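A sketch of the cleaning and base-model steps with gensim and NLTK; the variable `texts` (the list of raw article strings) and the starting value of `num_topics` are assumptions:

```python
# Sketch: tokenize, remove punctuation/stopwords, lemmatize, then build the dictionary/corpus
# and a baseline LDA model that otherwise keeps gensim's defaults.
import nltk
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase and tokenize an article, drop punctuation and stopwords, and lemmatize."""
    tokens = simple_preprocess(text, deacc=True)  # deacc=True strips accents and punctuation
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

# `texts` is assumed to hold the raw article strings, e.g. [post["text"] for post in articles].
processed_docs = [preprocess(doc) for doc in texts]

# Dictionary and bag-of-words corpus required by gensim's LDA implementation.
id2word = corpora.Dictionary(processed_docs)
corpus = [id2word.doc2bow(doc) for doc in processed_docs]

# Baseline model; num_topics=10 is an assumed starting point before tuning.
base_model = gensim.models.LdaModel(corpus=corpus, id2word=id2word, num_topics=10, random_state=42)
print(base_model.print_topics())
```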
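A sketch of the hyperparameter search and the elbow-method retraining; the search range, `passes`, and the use of the highest c_v coherence as the elbow are assumptions:

```python
# Sketch: score candidate topic counts with c_v coherence, inspect the curve for an elbow,
# then retrain the LDA model with the chosen number of topics.
from gensim.models import CoherenceModel, LdaModel

coherence_scores = []
for k in range(2, 16):  # candidate topic counts; the range is an assumption
    candidate = LdaModel(corpus=corpus, id2word=id2word, num_topics=k, random_state=42, passes=10)
    cm = CoherenceModel(model=candidate, texts=processed_docs, dictionary=id2word, coherence="c_v")
    coherence_scores.append((k, cm.get_coherence()))
    print(f"num_topics={k}: coherence={coherence_scores[-1][1]:.4f}")

# Choose the elbow of the coherence curve (approximated here by the highest score).
best_k = max(coherence_scores, key=lambda pair: pair[1])[0]
lda_elbow = LdaModel(corpus=corpus, id2word=id2word, num_topics=best_k, random_state=42, passes=10)
```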

THE FINAL MODEL:
The elbow method provides an improved coherence score; however, the visualization shows overlapping words across multiple clusters. To obtain more refined topic clusters for this data set, the model is retrained with 4 topic clusters to avoid the overlap.
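A sketch of how the refined 4-topic model and its interactive visualization could be produced with pyLDAvis; the `passes` value and output filename are assumptions:

```python
# Sketch: retrain with 4 topics to reduce cluster overlap, then build the interactive visualization.
import pyLDAvis
import pyLDAvis.gensim_models  # older pyLDAvis releases expose this as pyLDAvis.gensim
from gensim.models import LdaModel

final_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=4, random_state=42, passes=10)

vis = pyLDAvis.gensim_models.prepare(final_model, corpus, id2word)
pyLDAvis.save_html(vis, "lda_4_topics.html")  # hypothetical output filename
```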
For full interactive versions of the visualizations above, please click here