Continuous epidemiological surveillance is extremely resource-intensive. Typically, a team of qualified clinicians continuously tracks reports of hospitalizations and medical activities in the region, along with follow-up information when available. This involves the laborious task of parsing millions of medical records, which can easily require thousands of man-hours. The COVID-19 pandemic made matters worse by causing a massive backlog. We aim to lower this cost and scale these activities: by leveraging machine learning methods for natural language, we can automate most of this work, allowing professionals to focus on analysis rather than on the mechanical task of annotating documents.

To this end, we’ve developed OLIM, a general end-to-end system for fitting models that identify maladies, medication use and more in patients’ medical records. The user starts by creating a label for the feature they want to detect, and can then use textual search tools along with automatic selectors to locate medical record entries to label for the task at hand. Once there are enough labels, the user can fine-tune a model that performs the desired task with high accuracy. With a good model in hand, the system uses conformal prediction tools to automatically label all the available medical records in a matter of seconds, while providing sound uncertainty quantification.

OLIM has already been tested in collaboration with the Municipal Health Department of Florianópolis and Grupo Hospitalar Conceição, where it was used to automate the classification of 13 different symptoms closely associated with long COVID syndrome. We were able to automatically label all but a handful of the relevant medical records in the 2020-2022 range with over 90% confidence.
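To make the automatic labeling step more concrete, here is a minimal sketch of how conformal prediction can decide which records are safe to label automatically. The text above does not say which conformal method OLIM uses, so this assumes plain split conformal prediction on a classifier's predicted probabilities. The function names (`conformal_threshold`, `prediction_sets`, `auto_label`) and the choice of `alpha` are hypothetical and are not taken from OLIM itself.

```python
import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: return the score threshold q_hat.

    cal_probs  : (n, n_classes) predicted probabilities on held-out labeled records
    cal_labels : (n,) true class indices for those records
    alpha      : tolerated miscoverage rate (0.1 targets roughly 90% coverage)
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: k = ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: every class stays in every prediction set.
        return 1.0
    return float(np.sort(scores)[k - 1])


def prediction_sets(probs, q_hat):
    """Boolean matrix: entry (i, c) is True if class c is in record i's set."""
    return (1.0 - np.asarray(probs, dtype=float)) <= q_hat


def auto_label(probs, q_hat):
    """Return the class index when the prediction set has exactly one class.

    Records whose set is empty or contains several classes get -1, meaning
    they should be routed to a human annotator (an assumption of this sketch).
    """
    sets = prediction_sets(probs, q_hat)
    return np.where(sets.sum(axis=1) == 1, sets.argmax(axis=1), -1)
```

In this sketch, a held-out set of manually labeled records calibrates `q_hat` once, after which `auto_label` can be applied to the model's probabilities for every unlabeled record. Only the records marked `-1` would still need a clinician's attention.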