Yesterday I showed how we can use MarkItDown to convert multiple document types to markdown to make them easier to consume in an LLM context. I used a simple CV in PDF format as an example.
But what if you have images inside your documents?
No worries! MarkItDown can process images inside documents as well. Although I couldn’t find a way to do this directly through the command line, it is certainly possible by writing some Python code.
First make sure that you have the MarkItDown module installed locally:
pip install 'markitdown[all]'
Now we can first try to recreate our example from yesterday through code:
Remark: The latest version of MarkItDown requires Python 3.10.
from markitdown import MarkItDown

# Instantiate the MarkItDown object
md = MarkItDown()
result = md.convert("cvexample.pdf")
print(result.text_content)
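By the way, if you want to keep the result instead of just printing it, you can simply write the text_content to a file. A minimal sketch (the output filename here is just an example):

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("cvexample.pdf")

# Write the generated markdown to disk instead of only printing it
with open("cvexample.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)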
If that works as expected, we can extend this code to use an LLM for extracting image data. We’ll use Ollama in combination with LLaVA (Large Language and Vision Assistant), a model that combines language and vision capabilities, enabling it to process and understand both text and images.
First we need to install some extra packages:
pip install openai
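Before wiring everything together, it’s worth verifying that Ollama is actually serving the LLaVA model. Here is a minimal standalone sketch, assuming Ollama runs locally on its default port and you’ve already pulled the model (for example with ollama pull llava); the prompt and image file are my own:

import base64
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',  # Ollama's local server URL
    api_key='ollama',  # Required by the client, but ignored by Ollama
)

# Encode a local test image as base64, the format the OpenAI-style
# vision API expects for inline images
with open("puppies.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)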
Now we update our code to use this module:
from markitdown import MarkItDown
from openai import OpenAI

# Instantiate the OpenAI client with Ollama's local server URL
client = OpenAI(
    base_url='http://localhost:11434/v1',  # Ollama's local server URL
    api_key='ollama',  # Required, but not used for local models
)
model = "llava"

# Instantiate the MarkItDown object
md = MarkItDown(llm_client=client, llm_model=model)
result = md.convert("puppies.jpg")
print(result.text_content)
Processing images
I first tried a small example image:
And here is the result I got back after processing the image:
# Description:
Three adorable puppies stand together as if they are part of a larger family or pack, looking out with a hint of anticipation and a strong sense of curiosity. They seem to be young pups, possibly Corgis or Terriers given their short legs and distinctive ears. The backdrop is plain and white, directing all attention to the puppies. In front of them on a tabletop, there appears to be an object that could represent knowledge or learning, perhaps indicating that these puppies are in a comfortable, domestic environment designed for their care and education.
OK, that seems to work.
Processing images in docx
Let’s now try this with a Word document containing images:
from markitdown import MarkItDown
from openai import OpenAI

# Instantiate the OpenAI client with Ollama's local server URL
client = OpenAI(
    base_url='http://localhost:11434/v1',  # Ollama's local server URL
    api_key='ollama',  # Required, but not used for local models
)
model = "llava"

# Instantiate the MarkItDown object
md = MarkItDown(llm_client=client, llm_model=model)
result = md.convert("examplewithimage.docx")
print(result.text_content)
Here is the Word document I tried:
And here is the result:
Unfortunately, MarkItDown didn’t process the embedded images in the docx, although it did keep the document structure intact.
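As a hypothetical workaround (my own sketch, not a MarkItDown feature), you could pull the embedded images out of the docx yourself and describe each one separately. A docx file is just a zip archive, with its images stored under word/media/:

import zipfile
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)
md = MarkItDown(llm_client=client, llm_model="llava")

# A .docx file is a zip archive; embedded images live under word/media/
with zipfile.ZipFile("examplewithimage.docx") as docx:
    for name in docx.namelist():
        if name.startswith("word/media/") and name.lower().endswith((".png", ".jpg", ".jpeg")):
            path = docx.extract(name)  # extracts to ./word/media/...
            result = md.convert(path)
            print(f"Description of {name}:")
            print(result.text_content)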
Processing images in pptx
Let’s give it another try, this time with a PPTX:
And here is the result:
That looks better!
Processing images in PDF
Let’s do one last try, this time with the docx converted to a PDF:
And here is the result:
It didn’t process the embedded image and all the formatting is gone as well. Too bad!
Conclusion
The tool is still in active development, but support for processing embedded images looks rather limited at the moment. I noticed that you can integrate it with Azure AI Document Intelligence, so maybe that gives better results, but I’ll keep that for another post…
More information
microsoft/markitdown: Python tool for converting files and office documents to Markdown
Deep Dive into Microsoft MarkItDown - DEV Community