Code Maroon:
Global Edition

The app that sifts through the noise and delivers
the real deal to first responders and relevant authorities.




Motivation

Information retrieval is a critical component of crisis response, but it is not always easy to find reliable sources in the midst of chaos. Rumors, gossip, and wild speculation run rampant, leaving authorities scrambling to sort fact from fiction. In this scenario, there is a need for a global, reliable, and user-friendly application that provides quick and accurate facts to the general public.

The motivation behind this project is to develop a system that can efficiently extract relevant facts and information from unstructured data during a crisis event. This system will use state-of-the-art natural language processing, information retrieval and machine learning techniques to extract accurate and timely information from news articles and social media platforms such as Twitter and Reddit. By developing such a system, we hope to contribute to the field of crisis management and provide a valuable tool for individuals and emergency response teams during crisis events.

This task is from the 2022 TREC CrisisFACTs Track: https://crisisfacts.github.io/

Dataset

CMGE consumes a stream of data about an event from news articles and social media platforms (Twitter, Reddit, and Facebook); users can then query the system and receive answers grounded in the relevant content.
The CrisisFACTs dataset is a collection of news articles and social media posts (Twitter, Reddit, and Facebook) published during various crisis events, such as wildfires, hurricanes, floods, and other emergencies. The dataset was created to help researchers and developers better understand how people use social media and news during times of crisis, and to develop new tools and technologies to support crisis response and recovery efforts.



Analyzing these data streams can provide a more complete picture of the crisis event, including how people are responding, what information they are sharing, and what actions are being taken by authorities and organizations.

Approach

  • The first step in this task is to collect and preprocess the data. This involves gathering the stream of data provided by the dataset, drawn from sources such as news articles and social media (Twitter, Reddit).
  • The data is then preprocessed to remove irrelevant text and keep the relevant context text for fact retrieval. This includes removing media outlet names from news articles, and Twitter handles and hyperlinks from tweets.
  • The extracted relevant context texts ("headlines") are divided into chunks, and these chunks are embedded and stored in Chroma DB. The Sentence-Transformers model "paraphrase-MiniLM-L6-v2" is used to embed the chunks, as it captures sentence semantics and gives effective sentence embeddings of the headline chunks (a minimal retrieval sketch follows this list).
  • The next task is to retrieve the relevant headline chunks from the pool of embedded headline data for a given query. The query is first embedded with the same Sentence-Transformers model ("paraphrase-MiniLM-L6-v2"), and cosine similarity between the embedded query vector and the embedded headline chunks yields similarity scores. The top-k headline chunks by similarity score form the relevant data (called documents) for the query.
  • Given a query and its corresponding relevant documents, we extract the exact facts from those documents using large language models (LLMs), modeling this subtask as a question-answering (QA) task.
  • We chose LLMs based on the criterion that the model is pre-trained for QA on datasets such as SQuAD (the Stanford Question Answering Dataset).
  • We used three models that successfully performed fact extraction in the QA subtask: "deepset/roberta-base-squad2" (a RoBERTa-based model), Google's Flan-T5, and OpenAI's text-davinci-002 (a GPT-3.5-series model).
  • The input to these models is the query together with the retrieved relevant documents; for the RoBERTa model, the inputs are tokenized with the tokenizer of "deepset/roberta-base-squad2". The output of each model is the extracted facts in answer form, reported in the results section (an extractive QA sketch follows this list; generative sketches follow Figs 1 and 2 below).
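
Below is a minimal sketch of the embedding-and-retrieval step described above, assuming the sentence-transformers and chromadb Python packages. The collection name, sample headlines, and the retrieve() helper are illustrative placeholders, not the project's actual code; the Chroma collection is configured for cosine distance so that its query() call reproduces the cosine-similarity ranking described above.

# Retrieval sketch (illustrative; assumes: pip install sentence-transformers chromadb)
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Placeholder headline chunks standing in for the preprocessed CrisisFACTs stream.
headline_chunks = [
    "Hurricane makes landfall along the coast with heavy rain and strong winds.",
    "Authorities order evacuations in low-lying coastal areas.",
    "Widespread power outages reported after the storm passes through.",
]

# Embed the chunks and store them in a Chroma collection that uses cosine distance.
client = chromadb.Client()
collection = client.create_collection(
    name="headlines", metadata={"hnsw:space": "cosine"}
)
collection.add(
    ids=[str(i) for i in range(len(headline_chunks))],
    documents=headline_chunks,
    embeddings=embedder.encode(headline_chunks).tolist(),
)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query with the same model and return the top-k most similar chunks."""
    query_emb = embedder.encode([query]).tolist()
    result = collection.query(query_embeddings=query_emb, n_results=k)
    return result["documents"][0]

print(retrieve("What damage did the storm cause?", k=2))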
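
A companion sketch of the extractive QA step with "deepset/roberta-base-squad2" follows, using the Hugging Face question-answering pipeline (which applies the model's own tokenizer internally). The extract_fact helper and the reuse of retrieve() from the previous sketch are assumptions for illustration, not the project's exact code.

# Extractive QA sketch with deepset/roberta-base-squad2 (assumes: pip install transformers torch)
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def extract_fact(query: str, k: int = 3) -> str:
    """Concatenate the top-k retrieved documents and extract an answer span from them."""
    context = " ".join(retrieve(query, k=k))  # retrieve() from the sketch above
    result = qa(question=query, context=context)
    return result["answer"]

print(extract_fact("Where did the hurricane make landfall?"))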










Fig 1: Flow diagram of the fact extraction approach using Google's Flan-T5 model.
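
A hedged sketch of the generative variant shown in Fig 1: the query and the retrieved documents are formatted into a prompt, and Flan-T5 generates the answer. The checkpoint size (google/flan-t5-base), prompt wording, and generation settings are assumptions, not the project's exact configuration.

# Generative QA sketch with Flan-T5 (checkpoint size and prompt are assumptions).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def flan_t5_answer(query: str, documents: list[str]) -> str:
    """Prompt Flan-T5 with the retrieved documents as context plus the query."""
    prompt = (
        "Answer the question using only the context.\n"
        f"Context: {' '.join(documents)}\n"
        f"Question: {query}"
    )
    inputs = t5_tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = t5_model.generate(**inputs, max_new_tokens=64)
    return t5_tokenizer.decode(output_ids[0], skip_special_tokens=True)

query = "How much rainfall was reported?"
print(flan_t5_answer(query, retrieve(query)))  # retrieve() from the earlier sketch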








Fig 2: Flow diagram of the fact extraction approach using OpenAI's text-davinci-002 model.
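
A similar hedged sketch for the flow in Fig 2, written against the legacy openai Completions API (openai<1.0) through which text-davinci-002 was served; the prompt wording, max_tokens, and temperature values are assumptions rather than the project's exact settings.

# Completion-based QA sketch with text-davinci-002 (legacy openai<1.0 API; settings are assumptions).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def davinci_answer(query: str, documents: list[str]) -> str:
    """Ask the completion model to extract the answer from the retrieved documents."""
    prompt = (
        "Extract the answer to the question from the context.\n"
        f"Context: {' '.join(documents)}\n"
        f"Question: {query}\n"
        "Answer:"
    )
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=64,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()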


Failed case:

We tried the "XLNetForQuestionAnswering" model as an additional LLM for QA, which we fine-tuned on the SQuAD 2.0 dataset on TAMU HPRC. The input to the model was the queries and the retrieved relevant documents (from the cosine-similarity step), tokenized with the "xlnet-base-cased" tokenizer. The results of this model were not satisfactory; the likely reasons are discussed in the evaluation section.

Evaluation Models

We used the following models for the comparative evaluation:
1. RoBERTa model (deepset/roberta-base-squad2)
2. Google Flan-T5
3. OpenAI text-davinci-003


Evaluation Results and Analysis

We created human-annotated queries for fact extraction. There are two categories of generated queries: "numerical and fact-based queries" and "general queries". These queries are used to test the algorithm and evaluate the model output for a crisis event. The extracted facts are evaluated using the metrics average exact match, average F1 score, average recall, and average precision.

General query: a question that requires a broad understanding of a topic, where the answer can be subjective or open to interpretation. In our model testing, a general query was defined as a question related to Hurricane Sally that does not ask for a numerical fact or a specific piece of information.
Factual query: a question that requires a specific piece of information as an answer, usually a numerical fact. In our model testing, a factual query was defined as a question about Hurricane Sally that specifically asks for a numerical fact or a specific piece of information, such as the wind speed, the amount of rainfall, or the number of homes damaged.

The Hurricane Sally crisis event, which struck the US Gulf Coast, is used for the evaluation. Text 1 (PDF) shows example output from the RoBERTa model for general queries.

Table 1 summarizes the evaluation of all models on "numerical and fact-based queries" and "general queries"; Fig 3 shows the same results as a line chart.
Fig 4 shows the performance of the models on "numerical and factual queries".
Fig 5 shows the performance of the models on "general queries".

Evaluating all models on these metrics makes clear that the Google Flan-T5 model performed best for our use case: it achieves the highest average recall on "numerical and fact-based queries", at nearly 60%.

The OpenAI text-davinci-003 model gives higher precision for both types of queries. A key observation is that, being a generative model, it produces many words around the extracted facts that are probably not relevant to them; with this larger number of generated words per query, its precision percentage increases while its recall percentage remains lower.

Since the Flan-T5 model outperformed the other two models, we investigated further by varying parameters of our approach. The "k" value for the top-k documents retrieved as relevant was varied to find the best results and metrics for the T5 model (a sketch of this sweep is given below). The results yielded interesting observations: for "numerical and fact-based queries" the average recall decreases after k = 3 (Fig 6), and for "general queries" it decreases after k = 50 (Fig 7).
Therefore, the best k value is k = 3 for "numerical and fact-based queries" and k = 50 for "general queries".
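
A small sketch of how such a top-k sweep can be run, reusing the retrieve() and flan_t5_answer() helpers from the earlier sketches. The evaluation pairs and the token_recall helper are illustrative placeholders; the actual evaluation used the human-annotated queries and the metrics described in the glossary below.

# Top-k sweep sketch (evaluation pairs and recall helper are placeholders).
def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    ref_tokens = reference.lower().split()
    pred_tokens = set(prediction.lower().split())
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)

eval_set = [
    # human-annotated (query, reference answer) pairs -- placeholders only
    ("example numerical query about the event", "example reference answer"),
]

for k in (1, 3, 5, 10, 50):
    recalls = [
        token_recall(flan_t5_answer(q, retrieve(q, k=k)), ref)
        for q, ref in eval_set
    ]
    print(f"k = {k}: average recall = {sum(recalls) / len(recalls):.2f}")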

Failed case analysis:
The XLNetForQuestionAnswering model was fine-tuned with the following settings:
- Optimizer: Adam
- Learning rate: 1e-5
- Epochs: 3
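
For concreteness, a toy single-example sketch of this fine-tuning configuration is given below (optimizer, learning rate, and epoch count as listed above). The question, context, and answer here are placeholders; the actual run fine-tuned on the full SQuAD 2.0 dataset on TAMU HPRC, so this is a configuration illustration rather than a reproduction.

# Toy fine-tuning sketch for XLNetForQuestionAnswering (single placeholder example;
# the real run used the full SQuAD 2.0 dataset).
import torch
from transformers import XLNetForQuestionAnswering, XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
model = XLNetForQuestionAnswering.from_pretrained("xlnet-base-cased")

question = "Where did the storm make landfall?"
context = "The storm made landfall near the coast on Wednesday morning."
answer = "near the coast"
answer_start_char = context.index(answer)

enc = tokenizer(question, context, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
sequence_ids = enc.sequence_ids(0)

# Map the answer's character span to start/end token positions within the context.
start_tok = end_tok = 0
for i, ((start, end), sid) in enumerate(zip(offsets, sequence_ids)):
    if sid != 1:
        continue
    if start <= answer_start_char < end:
        start_tok = i
    if start < answer_start_char + len(answer) <= end:
        end_tok = i

# Adam optimizer, learning rate 1e-5, 3 epochs -- matching the settings above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    outputs = model(
        **enc,
        start_positions=torch.tensor([start_tok]),
        end_positions=torch.tensor([end_tok]),
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")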

A possible reason for the failure is non-optimal training hyperparameters. Due to time constraints, we could not explore and train the model with different hyperparameter settings to obtain better results.

As future work, multiple candidate hyperparameter settings could be explored when fine-tuning the model. Another possible reason for the poor output is the lack of domain-specific training: if the XLNetForQuestionAnswering model were trained on crisis data with a large set of human-annotated questions and answers, it might perform better, although generating such a large supervised training dataset would require substantial effort.

Metric Glossary:
- Average Precision (AP) measures how many of the predicted positives are actually positive, i.e., the fraction of predicted answer tokens that are correct, averaged over queries. It ranges between 0 and 1, and a higher AP score indicates better performance.
- Average Recall (AR) measures the completeness of the results, i.e., how many of the actual positives were retrieved by the model. It is calculated as the ratio of true positives to the total number of actual positives, averaged over queries, and also ranges between 0 and 1. A higher AR score indicates better performance.
- Average F1 score is the harmonic mean of precision and recall, and is a commonly used evaluation metric. It is calculated as 2 * ((precision * recall) / (precision + recall)) and ranges between 0 and 1. A higher F1 score indicates better performance.
- Average Exact Match (EM) measures the percentage of instances for which the predicted answer matches the reference answer exactly. It is calculated as the number of exact matches divided by the total number of instances. A higher exact match score indicates better performance.
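
As a concrete reference, below is one common way to compute these metrics for extractive QA outputs (per-query exact match and token-level precision, recall, and F1, then averaged over queries). The normalization here is simplified (lowercasing, punctuation stripping, whitespace tokenization) and the sample pairs are placeholders, so exact values may differ from the project's evaluation script.

# Sketch of SQuAD-style metric computation (simplified normalization; sample pairs are placeholders).
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def precision_recall_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between a prediction and its reference."""
    pred, ref = normalize(prediction), normalize(reference)
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred) & set(ref))
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

pairs = [("the gulf coast", "gulf coast"), ("heavy rain", "heavy rainfall")]  # placeholders
ems = [exact_match(p, r) for p, r in pairs]
prf = [precision_recall_f1(p, r) for p, r in pairs]
print("Average EM:", sum(ems) / len(ems))
print("Average Precision:", sum(p for p, _, _ in prf) / len(prf))
print("Average Recall:", sum(r for _, r, _ in prf) / len(prf))
print("Average F1:", sum(f for _, _, f in prf) / len(prf))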


Table 1: Evaluation results of the 3 models for "Numerical and Fact-based queries" and "general queries"





Fig 3: Evaluation results of the 3 models for "Numerical and Fact-based queries" and "general queries"





Fig 4: Performance of Models for Numerical and Factual Queries





Fig 5: Performance of Models for General Queries





Fig 6: Performance of Flan-T5 Model for different Top-n Documents (Numerical and Factual Queries)





Fig 7: Performance of Flan-T5 Model for different Top-n Documents (General Queries)

Conclusion

We have proposed an approach for fact extraction from crisis event data using a combination of natural language processing, information retrieval, and machine learning techniques. Our approach involves data preprocessing, embedding, and retrieval, followed by fact extraction using large language models. We evaluated the approach using human-annotated queries for fact extraction and compared the performance of three models: a RoBERTa-based model (deepset/roberta-base-squad2), the Google Flan-T5 model, and the OpenAI text-davinci-003 model.
Our evaluation results show that the Google Flan-T5 model performs best for our use case, with the highest recall on numerical and fact-based queries. The OpenAI text-davinci-003 model gives higher precision for both types of queries, but generates extraneous words, resulting in lower recall. We also found that the optimal value of the "k" parameter for retrieving relevant documents varies with the type of query.
Although our approach has shown promising results, there are still areas for improvement. For instance, we evaluated it on a single crisis event dataset (Hurricane Sally), and performance may vary for other types of events or datasets. Additionally, the approach can be further improved by fine-tuning hyperparameters and using additional models for fact extraction.
Overall, our approach provides a foundation for fact extraction from crisis event data, which can be useful for disaster response teams, news organizations, and other stakeholders who need to quickly access accurate information during a crisis event.