Abstract
Identifying duplicate technical documents is crucial for optimizing storage resources and improving document management and retrieval. It helps ensure that the most up-to-date version of each document is the one available, which matters in every industrial domain, including the petroleum sector, to avoid errors and inconsistencies. Deduplication is also a key data pre-processing step when training large machine learning models. This paper tackles duplicate detection in technical reports by proposing a hybrid solution built on open-source computer vision libraries. The proposed solution was evaluated on a large dataset and gave promising results that can help us optimize our storage. We observed that the solution is flexible enough to adapt to different scenarios and different types of reports. Such a solution could also help reduce the carbon footprint of IT infrastructure.