Project - Covid Tweets

🦠 Gauging New York Mental Health Through Covid-19 Tweets

Contributors: Oviya Adhan, Clarissa Solis, Robin Zhao

Premise:

A year after the outbreak of Covid-19, this project set out to explore the correlation between the general public's mental health, as expressed in public Twitter posts, and the number of Covid cases throughout the first year of the pandemic. Our team members had all personally experienced increased anxiety and depression during the pandemic, and we wanted to explore whether this trend was reflected in the general public.

Hypothesis: With this corpus analysis, we hypothesize that the number of depression- and anxiety-related tweets and the level of negative sentiment are positively correlated with increasing COVID-19 cases.

Methodology:

With Natural Language Processing (NLP) techniques, we aimed to analyze whether this correlation existed by conducting a sentiment analysis on a corpus of tweets mentioning Covid from March 2020 to January 2021. The process included:

Data Wrangling - Tweets were sourced from the COVID-19 Tweets Dataset on IEEE DataPort (1). Tweet content was retrieved using the Twitter API.
Data Pre-Processing - Tokenization, lowercase conversion, punctuation removal, lemmatization, and stop word removal. Filtered the data to tweets made in New York state between 2/28/2020 and 5/01/2021.
Exploratory Data Visualization - Validated that our corpus follows Zipf’s Law. Generated a lexical dispersion plot for the first ten keywords in a custom depression keyword list (including ‘mental’, ‘depression’, ‘worried’, ‘anxious’, ‘scared’, ‘lonely’, ‘depressed’).
Sentiment Analysis - Calculated a depression sentiment score per tweet based on the number of occurrences of keywords from the custom list.
Correlation Analysis - Normalized the average sentiment scores of tweets, depression keyword occurrences, and number of new cases in New York state, then plotted all normalized scores as time series to compare the trends.
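
As a rough illustration of the pre-processing, keyword counting, and normalization steps above, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the project used NLTK for tokenization, lemmatization, and stop words, whereas this sketch splits on whitespace, skips lemmatization, and uses a tiny hand-written stop word list. The function names, the sample tweets, and the choice of min-max normalization are all assumptions made for illustration.

```python
import string

# Hypothetical stand-in for NLTK's English stop word list (assumption).
STOP_WORDS = {"i", "am", "so", "the", "a", "and", "about", "this", "is", "of", "to"}

# Only the seven keywords named in the write-up; the full custom list is not shown here.
DEPRESSION_KEYWORDS = {
    "mental", "depression", "worried", "anxious", "scared", "lonely", "depressed",
}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def keyword_score(tokens):
    """Count how many tokens appear in the depression keyword list."""
    return sum(1 for tok in tokens if tok in DEPRESSION_KEYWORDS)


def min_max_normalize(values):
    """Rescale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up example tweets, not taken from the dataset.
tweets = [
    "I am so worried and anxious about Covid!",
    "Lovely walk in the park today.",
    "Feeling lonely, scared, and depressed this week...",
]

scores = [keyword_score(preprocess(t)) for t in tweets]
normalized = min_max_normalize(scores)
print(scores)
print(normalized)
```

Normalizing each series to a common range is what makes it possible to plot keyword counts and case counts, which differ by orders of magnitude, on the same axis.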

Tools: Python (Pandas, NumPy, NLTK, SciPy, Matplotlib)

Results:

The time series analysis across the normalized average sentiment scores of tweets, depression keyword occurrences, and number of new cases per day in New York state from March 2020 to May 2021 is as follows:

Based on the chart, the sentiment and keyword time series follow a similar pattern: an increase in depression keyword occurrences is often followed by an increase in depression sentiment score. This suggests that our depression keyword analysis has some degree of accuracy. Surprisingly, keyword occurrences are negatively correlated with case numbers. Lag analysis shows that the correlation between the number of new COVID-19 cases and keyword occurrences lagged by one week is -0.77, a moderately strong negative correlation: an increase in Covid-19 cases tended to be followed by a decrease in depression keywords, and vice versa. This contradicts our initial hypothesis; we found no evidence that depression- and anxiety-related tweets and negative sentiment are positively correlated with increasing COVID-19 cases.
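
To make the lag analysis concrete, the sketch below computes a Pearson correlation between case counts and keyword counts shifted by a fixed number of periods. It assumes the series are aggregated weekly and that "lagged by one week" means cases in week t are paired with keywords in week t + 1; the write-up does not spell out either detail. The function names and numbers are invented for illustration and will not reproduce the -0.77 figure. With the project's toolchain, pandas `Series.shift` and `Series.corr` would be a natural alternative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def lagged_correlation(cases, keywords, lag):
    """Correlate cases[t] with keywords[t + lag] (assumed lag direction)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(cases, keywords)
    # Drop the last `lag` case values and the first `lag` keyword values
    # so that each case count is paired with the keyword count `lag` periods later.
    return pearson(cases[:-lag], keywords[lag:])


# Made-up weekly totals, purely illustrative.
weekly_cases = [10, 20, 30, 40, 50, 60]
weekly_keywords = [99, 50, 40, 30, 20, 10]

r = lagged_correlation(weekly_cases, weekly_keywords, lag=1)
print(round(r, 2))  # strongly negative for this toy data
```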

Limitations & Future Work:

This study has two primary limitations:

Limited to tweets in English because of the language limitations of NLP packages, especially their stop word lists, tokenizers, and lemmatizers. Underrepresented communities, often BIPOC and immigrant communities who are more likely to speak other languages, were disproportionately affected by the pandemic, but this study cannot capture that reality when limited to English tweets.
Limited to Twitter, so conclusions cannot be generalized to all of social media; a broader set of platforms would provide a more nuanced view of mental health sentiment.

Check out the work behind this project at https://github.com/oadhan/Covid-Tweet-Sentiment-Analysis

Citations:

(1) Lamsal, R. (2022, May 12). Coronavirus (COVID-19) tweets dataset. IEEE Dataport. https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset