Monday, January 8, 2024

Document AI with Apache Airflow

Summary

In this post, I cover an approach to a document AI problem using a task flow implemented with Apache Airflow's TaskFlow API. The particular problem is de-duplication of invoices, something that comes up in the payment-provider space. I use Azure AI Document Intelligence for OCR, Azure OpenAI for vector embeddings, and Redis Enterprise for vector search.
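
The individual tasks are shown in the snippets below; as a point of reference, here is a minimal sketch of how they could be wired together with the TaskFlow API and dynamic task mapping. The DAG id, schedule, and start date here are assumptions, not from the post; the actual wiring is in the linked source.

from datetime import datetime, timedelta
from airflow.decorators import dag

# Hypothetical wiring: each file detected by the sensor fans out
# through OCR -> embed -> dedup via dynamic task mapping.
@dag(dag_id='invoice_dedup', schedule=timedelta(minutes=5),
     start_date=datetime(2024, 1, 1), catchup=False)
def invoice_dedup():
    inbox_files = check_inbox()
    invoices = parse_invoice.expand(inbox_file=inbox_files)
    embedded = embed_invoice.expand(invoice=invoices)
    dedup_invoice.expand(invoice=embedded)

invoice_dedup()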

Architecture

Files land in an inbox directory. An Airflow file sensor detects them and kicks off the pipeline: OCR with Azure AI Document Intelligence, embedding with Azure OpenAI, and vector-search de-duplication with Redis Enterprise. Duplicates are moved to a 'dups' directory; net-new invoices are indexed in Redis and moved to a 'processed' directory.

Code Snippets


File Sensor DAG


import json
import logging
import os

from airflow.decorators import task
from airflow.models import Variable
from airflow.sensors.base import PokeReturnValue

# executor_config_volume_mount is defined elsewhere in the source repo.
@task.sensor(task_id="check_inbox", mode="reschedule", timeout=10, executor_config=executor_config_volume_mount)
def check_inbox() -> PokeReturnValue:
    """ File sensor for the invoices inbox. If files are detected in the inbox, a cascade of
    processing tasks is triggered: OCR, Embed, Dedup.
    """
    storage_var = Variable.get("storage", deserialize_json=True, default_var=None)
    if not isinstance(storage_var, dict):  # hack for an apparent bug in airflow
        storage_var = json.loads(storage_var)
    inbox_path = storage_var['inbox']
    inbox_files = [os.path.join(inbox_path, file) for file in os.listdir(inbox_path)]
    logging.info(f'Number of files to be processed: {len(inbox_files)}')
    if len(inbox_files) > 0:
        return PokeReturnValue(is_done=True, xcom_value=inbox_files)
    else:
        return PokeReturnValue(is_done=False)
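
The tasks pull their configuration from Airflow Variables stored as JSON. As an illustration, the storage variable could be set like this; the directory paths are assumptions, but the key names match what the tasks read:

import json

from airflow.models import Variable

# Hypothetical directory layout; adjust to your environment.
Variable.set("storage", json.dumps({
    "inbox": "/data/invoices/inbox",
    "processed": "/data/invoices/processed",
    "dups": "/data/invoices/dups"
}))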

OCR DAG


import logging
import pprint

from airflow.decorators import task

@task(task_id='parse_invoice', executor_config=executor_config_volume_mount)
def parse_invoice(inbox_file: str) -> dict:
    """ OCR is performed on each of the invoices in the inbox. The result of OCR is a
    space-delimited string of a configurable number of invoice fields.
    """
    from invoice.lib.ocr import ocr
    invoice = ocr(inbox_file)
    invoice['file'] = inbox_file
    logging.info(f'Invoice: {pprint.pformat(invoice)}')
    return invoice
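
For orientation, here is the rough shape of the dict that flows between the tasks at this point. The values below are invented; the 'ocr' and 'customer_name' keys come from the OCR client (next snippet), 'file' is added above, and 'vector' is appended later by the embedding task:

# Hypothetical XCom payload produced by parse_invoice:
invoice = {
    'ocr': 'Contoso 2024-01-02 INV-100 1137.50',
    'customer_name': 'Contoso',
    'file': '/data/invoices/inbox/inv-100.pdf'
}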

OCR Client (Azure AI Doc Intelligence)


import json

from airflow.models import Variable
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(min=10, max=60), stop=stop_after_attempt(3))
def ocr(filepath: str) -> dict:
    """ Executes Azure Form Recognizer (Document Intelligence) OCR and returns a Python dict
    that includes a text string of space-separated values from the input invoice.
    """
    formrec_var = Variable.get("formrec", deserialize_json=True, default_var=None)
    if not isinstance(formrec_var, dict):  # hack for an apparent bug in airflow
        formrec_var = json.loads(formrec_var)
    key = formrec_var["key"]
    endpoint = formrec_var["endpoint"]
    vector_fields = formrec_var["fields"]
    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    with open(filepath, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-invoice", document=f, locale="en-US")
    invoice = poller.result().documents[0]
    return stringify(invoice, vector_fields)
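
The stringify helper isn't shown in the post; it converts the analyzed document into the dict consumed downstream. A minimal sketch, assuming the configured fields are top-level fields of the prebuilt invoice model and that CustomerName is always among them:

def stringify(invoice, vector_fields: list) -> dict:
    """ Flattens the configured invoice fields into a single space-delimited
    string for embedding; keeps the customer name for the Redis pre-filter. """
    values = []
    for name in vector_fields:
        field = invoice.fields.get(name)
        if field is not None and field.content:
            values.append(field.content)
    return {
        'ocr': ' '.join(values),
        'customer_name': invoice.fields.get('CustomerName').content
    }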

Embedding DAG


import logging

from airflow.decorators import task

@task(task_id='embed_invoice')
def embed_invoice(invoice: dict) -> dict:
    """ Accepts an invoice dict that includes a text field of the OCR output
    and adds an OpenAI embedding (array of floats) to that dict.
    """
    from invoice.lib.embed import get_embedding
    vector = get_embedding(invoice['ocr'])
    invoice['vector'] = vector
    logging.info(f'Invoice: {invoice["file"]}, Vector len: {len(invoice["vector"])}')
    return invoice

Embedding Client (Azure OpenAI)


import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(min=3, max=100), stop=stop_after_attempt(10))
def get_embedding(text: str) -> list[float]:
    response = openai.Embedding.create(
        input=text,
        engine="EmbeddingModel"
    )
    return response['data'][0]['embedding']
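
This uses the pre-1.0 openai SDK, where engine names an Azure OpenAI deployment rather than a model. A minimal sketch of the module-level setup that call assumes; the environment variable names and API version here are assumptions:

import os
import openai

# Pre-1.0 openai SDK, pointed at an Azure OpenAI resource.
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]  # e.g. https://<resource>.openai.azure.com/
openai.api_key = os.environ["AZURE_OPENAI_KEY"]
openai.api_version = "2023-05-15"
# "EmbeddingModel" is the Azure deployment name, typically backed
# by text-embedding-ada-002.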

Vector Search DAG


import logging

from airflow.decorators import task

@task(task_id='dedup_invoice', executor_config=executor_config_volume_mount)
def dedup_invoice(invoice: dict) -> None:
    """ Sends the invoice dict into a Redis VSS lookup to determine its disposition:
    process it or flag it as a duplicate.
    """
    from invoice.lib.vss import dedup
    result = dedup(invoice)
    logging.info(f'Invoice: {invoice["file"]}, Result: {result}')

Vector Search Client (Redis Enterprise)


import json
import logging
import os
import shutil
import uuid

import numpy as np
import redis
from airflow.models import Variable
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

def dedup(invoice: dict) -> str:
    """ Accepts a Python dict that includes the vector of a given invoice file. That vector is sent
    into Redis VSS to determine disposition. If there's another invoice in Redis within a given vector
    distance of the input invoice, this invoice is treated as a duplicate and moved to the 'dups'
    directory. Otherwise, it is treated as a net-new invoice and moved to the 'processed' directory.
    """
    re_var = Variable.get("re", deserialize_json=True, default_var=None)
    if not isinstance(re_var, dict):  # hack for an apparent bug in airflow
        re_var = json.loads(re_var)
    storage_var = Variable.get("storage", deserialize_json=True, default_var=None)
    if not isinstance(storage_var, dict):  # hack for an apparent bug in airflow
        storage_var = json.loads(storage_var)
    creds = redis.UsernamePasswordCredentialProvider(re_var['user'], re_var['pwd'])
    client = redis.Redis(host=re_var['host'], port=re_var['port'], credential_provider=creds)
    try:
        client.ft(re_var['vector_index']).info()
    except redis.exceptions.ResponseError:  # index doesn't exist yet; create it
        idx_def = IndexDefinition(index_type=IndexType.HASH, prefix=[re_var['vector_prefix']])
        schema = [
            TextField('customer_name'),
            VectorField('vector',
                'HNSW',
                {'TYPE': re_var['vector_type'], 'DIM': re_var['vector_dim'], 'DISTANCE_METRIC': re_var['vector_metric']}
            )
        ]
        client.ft(re_var['vector_index']).create_index(schema, definition=idx_def)
    vec = np.array(invoice['vector'], dtype=np.float32).tobytes()
    # Hybrid query: pre-filter on customer name, then a 1-nearest-neighbor vector search.
    q = Query(f'(@customer_name:({invoice["customer_name"]}))=>[KNN 1 @vector $query_vec AS score]')\
        .return_fields('score')\
        .dialect(2)
    results = client.ft(re_var['vector_index']).search(q, query_params={'query_vec': vec})
    docs = results.docs
    # With a COSINE metric, 'score' is the vector distance, so 1 - score is the similarity.
    if len(docs) > 0 and 1 - float(docs[0].score) > re_var['vector_similarity_bound']:
        shutil.move(invoice['file'], storage_var['dups'])
        logging.info(f'Duplicate invoice:{os.path.basename(invoice["file"])}, Similarity:{round(1 - float(docs[0].score), 2)}')
        return 'duplicate'
    else:
        similarity = round(1 - float(docs[0].score), 2) if len(docs) > 0 else 'N/A'
        client.hset(f'invoice:{uuid.uuid4()}',
                    mapping={'customer_name': invoice['customer_name'], 'file': os.path.basename(invoice['file']), 'vector': vec})
        shutil.move(invoice['file'], storage_var['processed'])
        logging.info(f'Processed invoice:{os.path.basename(invoice["file"])}, Similarity:{similarity}')
        return 'processed'
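
dedup pulls its connection and index settings from a second Airflow Variable, re. A hypothetical example with the keys the function reads; all values here are assumptions, with 1536 matching the text-embedding-ada-002 output dimension:

import json

from airflow.models import Variable

# Hypothetical Redis Enterprise connection and index settings.
Variable.set("re", json.dumps({
    "host": "redis.example.com",
    "port": 12000,
    "user": "default",
    "pwd": "secret",
    "vector_index": "invoice_idx",
    "vector_prefix": "invoice:",
    "vector_type": "FLOAT32",
    "vector_dim": 1536,
    "vector_metric": "COSINE",
    "vector_similarity_bound": 0.95
}))

With a COSINE metric, the query's score is a distance, so the 1 - score computed above is the similarity compared against vector_similarity_bound.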

Source


Copyright ©1993-2024 Joey E Whelan, All rights reserved.