Conclusion
We have proposed an approach for fact extraction from crisis event data that combines natural language processing (NLP), information retrieval (IR), and machine learning techniques. The approach consists of data preprocessing, embedding, and retrieval, followed by fact extraction using large language models. We evaluated it on human-annotated fact-extraction queries and compared the performance of three models: a BERT-based model, the Google Flan T5 model, and the OpenAI text-davinci-003 model.
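To make the pipeline concrete, the sketch below wires dense embedding, top-k retrieval, and LLM-based extraction together. The model names (all-MiniLM-L6-v2, google/flan-t5-base), the FAISS index, and the prompt format are illustrative assumptions, not our exact implementation.

    # Retrieve-then-extract sketch (model names, index choice, and prompt
    # are assumptions for illustration, not the evaluated configuration).
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from transformers import pipeline

    # Preprocessed crisis-event texts (placeholder corpus).
    documents = ["Hurricane Sally made landfall near Gulf Shores, Alabama.",
                 "Thousands of residents were evacuated ahead of the storm."]

    # Embed the corpus and build a dense index.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(documents, normalize_embeddings=True)
    index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on unit vectors
    index.add(np.asarray(doc_vecs, dtype="float32"))

    # Retrieve the top-k passages for a fact-seeking query.
    query = "Where did the hurricane make landfall?"
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k=2)
    context = " ".join(documents[i] for i in ids[0])

    # Extract the fact with an instruction-tuned LLM.
    extractor = pipeline("text2text-generation", model="google/flan-t5-base")
    prompt = f"Answer using only the context.\nContext: {context}\nQuestion: {query}"
    print(extractor(prompt, max_new_tokens=32)[0]["generated_text"])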
Our evaluation shows that the Google Flan T5 model performs best for our use case, achieving the highest recall on both numerical and fact-based queries. The OpenAI text-davinci-003 model attains higher precision on both query types, but it tends to generate extraneous words, which lowers its recall. We also found that the optimal value of the retrieval parameter k, the number of relevant documents retrieved per query, varies with the query type.
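For reference, one plausible formulation of the token-level precision and recall behind this comparison is sketched below, assuming whitespace tokenization and multiset overlap; the exact matching rules in our evaluation may differ.

    # Token-overlap precision/recall (assumed formulation; tokenization
    # and matching rules are illustrative, not our exact evaluation code).
    from collections import Counter

    def token_precision_recall(prediction: str, reference: str):
        pred = Counter(prediction.lower().split())
        gold = Counter(reference.lower().split())
        overlap = sum((pred & gold).values())             # tokens shared by both
        precision = overlap / max(sum(pred.values()), 1)  # matched fraction of predicted tokens
        recall = overlap / max(sum(gold.values()), 1)     # matched fraction of gold tokens
        return precision, recall

    # A verbose answer can recover every gold token (recall 1.0) while the
    # extra words dilute precision.
    print(token_precision_recall("around 2,000 residents were evacuated", "2,000 evacuated"))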
Although our approach has shown promising results, there is room for improvement. We have evaluated it on a single crisis event dataset (Hurricane Sally), and performance may vary for other types of events or datasets. The approach could also be strengthened by tuning hyperparameters and by incorporating additional models for fact extraction.
Overall, our approach provides a foundation for fact extraction from crisis event data, which can be useful for disaster response teams, news organizations, and other stakeholders who need rapid access to accurate information during a crisis.