Large Language Models (LLMs) have made an indelible mark on text-based use cases. However, the same cannot be said for images. Of course, the models can generate images and, when fine-tuned, solve image-related use cases, but there's still a gap between generative capabilities and practical enterprise solutions.
Consider this: when you prompt an LLM to generate images, say of a car accident, the results can be hit or miss. Some images might look convincing, but others have that distinctive AI-generated surreal quality that makes them unsuitable for professional computer vision applications. As a result, we cannot use these images as-is to solve computer vision use cases.
However, we did use an LLM for a specialized Optical Character Recognition (OCR) task: extracting dimensional data from engineering diagrams. Below is a handwritten engineering diagram with pipes and valves. (While we tested our approach on handwritten diagrams as a proof of concept, real-world applications involve software-generated drawings with precise dimension lines and detailed annotations. Those diagrams are more complex, and their information is more densely packed.)
Our primary focus has been on developing a reliable system for extracting two critical pieces of information: pipe identifiers and their corresponding dimensions. This capability could transform how engineering teams handle document processing and data extraction from technical drawings.
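To make the extraction target concrete, here is a minimal sketch of how the two fields could be represented and parsed from a model's text reply. The reply format (one `<pipe id>: <value> <unit>` record per line), the `PipeDimension` type, the `parse_reply` helper, and the sample pipe names are all assumptions made for illustration; they are not taken from our actual diagrams or prompts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PipeDimension:
    """One extracted record: a pipe identifier and its dimension."""
    pipe_id: str
    value: float
    unit: str


# Hypothetical reply format, one record per line, e.g. "P1: 250 mm".
_LINE = re.compile(r"^\s*([A-Za-z][\w-]*)\s*:\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def parse_reply(reply: str) -> list[PipeDimension]:
    """Parse lines that match the assumed format; skip everything else."""
    records = []
    for line in reply.splitlines():
        match = _LINE.match(line)
        if match:
            pipe_id, value, unit = match.groups()
            records.append(PipeDimension(pipe_id, float(value), unit.lower()))
    return records


# Made-up sample reply (not real diagram data).
sample = "P1: 250 mm\nHere is some commentary\nP2: 12.5 cm"
records = parse_reply(sample)
```

Forcing the reply into a typed record like this makes it easier to compare runs and spot when the model returns a value that does not fit the expected shape.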
The pipe names and dimensions to extract are:
The Prompt Engineering Journey: Our initial success came through careful prompt engineering, which required multiple iterations and adjustments. Through a series of conversational exchanges with the LLM, we gradually refined our prompts, correcting and guiding the model to identify specific elements within the diagrams. This iterative process eventually led to accurate dimension and label extraction.
The Context Dilemma: However, we hit an unexpected roadblock. When we tried to replicate our success in a new chat session using the same diagram and our previously successful prompt, the results were disappointing. Despite identical inputs, the model went back to producing inaccurate values, effectively erasing our previous progress.
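One plausible explanation, which we offer as an assumption and not a confirmed diagnosis, is that the corrections lived in the conversation history and not in the final prompt. The sketch below models a chat as a plain list of messages (no real LLM API is called; `Message`, `fresh_session`, and `consolidated_prompt` are hypothetical names, and the exchange is made up) and shows what a new session loses when only the last prompt is carried over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "user" or "assistant"
    content: str


def fresh_session(history: list[Message]) -> list[Message]:
    """A new chat that reuses only the last user prompt."""
    user_turns = [m for m in history if m.role == "user"]
    return user_turns[-1:]


def consolidated_prompt(history: list[Message]) -> str:
    """Fold every user instruction and correction into one standalone prompt."""
    user_turns = [m.content for m in history if m.role == "user"]
    return "\n".join(f"{i}. {text}" for i, text in enumerate(user_turns, start=1))


# Illustrative, made-up exchange.
history = [
    Message("user", "List each pipe and its dimension."),
    Message("assistant", "(first attempt, partly wrong)"),
    Message("user", "Read dimensions only from the numbers next to each pipe."),
    Message("assistant", "(corrected attempt)"),
    Message("user", "Return the result as a table."),
]

carried_over = fresh_session(history)
standalone = consolidated_prompt(history)
```

Here `carried_over` holds a single message, while `standalone` keeps all three instructions. Whether a consolidated prompt of this kind would have reproduced our earlier results is something we have not tested.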
Understanding the Limitations: Our experience highlighted several challenges in using Generative AI for engineering diagram analysis:
Alternative Approaches: Given these challenges, we've identified more reliable traditional computer vision approaches:
This experience has reinforced that while LLMs show promise, a hybrid approach combining traditional computer vision techniques with newer AI technologies might be the most practical path forward for engineering diagram analysis.
As we continue to explore this space, we're focusing on developing hybrid solutions that combine the interpretative power of LLMs with the reliability of traditional computer vision techniques. This balanced strategy not only addresses current limitations but also positions us to adapt as LLM technology evolves quickly.
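As a rough illustration of what such a hybrid could look like, the sketch below accepts an LLM-extracted dimension only when the same number also appears among the tokens from a traditional OCR pass, and flags the rest for review. The function name, the input shapes, and the sample values are assumptions; the OCR step itself is out of scope and is represented by a plain list of strings.

```python
import re

# A number optionally followed by a unit, e.g. "250", "130mm", "12.5 cm".
_NUMBER = re.compile(r"(\d+(?:\.\d+)?)\s*[A-Za-z]*")


def _to_number(text: str):
    """Return the numeric value of a dimension-like string, or None."""
    match = _NUMBER.fullmatch(text.strip())
    return float(match.group(1)) if match else None


def cross_check(
    llm_values: dict[str, str], ocr_tokens: list[str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Split LLM results into confirmed and flagged, using OCR tokens as evidence.

    llm_values maps a pipe identifier to the dimension text the LLM reported.
    ocr_tokens is the raw text a conventional OCR engine found on the diagram.
    """
    # Numbers the OCR pass saw; labels such as "P1" are ignored.
    seen = {n for n in map(_to_number, ocr_tokens) if n is not None}

    confirmed, flagged = {}, {}
    for pipe_id, text in llm_values.items():
        value = _to_number(text)
        if value is not None and value in seen:
            confirmed[pipe_id] = text
        else:
            flagged[pipe_id] = text
    return confirmed, flagged


# Made-up inputs for illustration only.
llm_values = {"P1": "250 mm", "P2": "180 mm"}
ocr_tokens = ["P1", "250", "P2", "130mm"]
confirmed, flagged = cross_check(llm_values, ocr_tokens)
```

This check only confirms that a number exists somewhere on the diagram, not that it belongs to the right pipe; a fuller version would also compare token positions against the pipe labels.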
Note: As with any AI implementation, data security comes first. Our standard practice includes thoroughly redacting all Personally Identifiable Information (PII) from source materials, including customer references and sensitive data, before processing any engineering diagrams.
Ready to explore AI's transformative power? Visit Quasar to learn more.