
MarkItDown with Ollama–Process images inside documents

Yesterday I showed how we can use MarkItDown to convert multiple document types to markdown so that they are easier to consume in an LLM context. I used a simple CV in PDF format as an example.

But what if you have images inside your documents?

No worries! MarkItDown can process images inside documents as well. Although I couldn’t find a way to do this directly from the command line, it is certainly possible by writing some Python code.

First make sure that you have the MarkItDown module installed locally:

pip install 'markitdown[all]'

Now we can first try to recreate our example from yesterday through code:

Remark: The latest version of MarkItDown requires Python 3.10 or higher.

from markitdown import MarkItDown
# Instantiate the MarkItDown object
md = MarkItDown()
result = md.convert("cvexample.pdf")
print(result.text_content)

If that works as expected, we can further extend this code to include an LLM to extract image data. We’ll use Ollama in combination with LLaVA (Large Language and Vision Assistant), a model designed to combine language and vision capabilities, enabling it to process and understand both text and images.
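Before wiring the model into MarkItDown, it can help to confirm that Ollama is running and that the model has been pulled (for example with `ollama pull llava`). The following is a minimal sketch using only the standard library. It assumes Ollama listens on its default local address and relies on the `/api/tags` endpoint, which lists the locally available models. The helper names `model_is_available` and `fetch_local_models` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama runs on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_available(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags response."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Ollama reports names with a tag suffix such as "llava:latest",
    # so also compare against the part before the colon.
    return any(n == model or n.split(":", 1)[0] == model for n in names)


def fetch_local_models(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models it has pulled."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        payload = fetch_local_models()
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        print(f"Could not query Ollama: {exc}")
    else:
        print("llava available:", model_is_available(payload, "llava"))
```

If the check reports that the model is missing, pulling it first avoids a confusing error later when MarkItDown tries to describe an image.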

First we need to install an extra package:

pip install openai

We update our code to use this module:

from markitdown import MarkItDown
from openai import OpenAI
# Instantiate the OpenAI client with Ollama's local server URL
client = OpenAI(
    base_url='http://localhost:11434/v1',  # Ollama's local server URL
    api_key='ollama',  # Required, but not used for local models
)
model = "llava"
# Instantiate the MarkItDown object
md = MarkItDown(llm_client=client, llm_model=model)
result = md.convert("puppies.jpg")
print(result.text_content)

Processing images

I first tried a small example image:

And here is the result I got back after processing the image:

# Description:
Three adorable puppies stand together as if they are part of a larger family or pack, looking out with a hint of anticipation and a strong sense of curiosity. They seem to be young pups, possibly Corgis or Terriers given their short legs and distinctive ears. The backdrop is plain and white, directing all attention to the puppies. In front of them on a tabletop, there appears to be an object that could represent knowledge or learning, perhaps indicating that these puppies are in a comfortable, domestic environment designed for their care and education.

OK, that seems to work.

Processing images in docx

Let’s now try this with a Word document containing images:

from markitdown import MarkItDown
from openai import OpenAI
# Instantiate the OpenAI client with Ollama's local server URL
client = OpenAI(
    base_url='http://localhost:11434/v1',  # Ollama's local server URL
    api_key='ollama',  # Required, but not used for local models
)
model = "llava"
# Instantiate the MarkItDown object
md = MarkItDown(llm_client=client, llm_model=model)
result = md.convert("examplewithimage.docx")
print(result.text_content)

Here is the Word document I tried:

And here is the result:

Unfortunately, it seems that MarkItDown didn’t process the embedded images in the docx, although it did keep the document structure intact.
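One possible workaround, sketched here rather than taken from MarkItDown itself: a .docx file is a ZIP archive that stores its embedded pictures under `word/media/`, so the images can be pulled out with the standard library and then passed to `md.convert()` one by one, just like the standalone image earlier. The function name `extract_docx_images` and the output folder `docx_images` are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path: str, out_dir: str) -> list[Path]:
    """Copy every file stored under word/media/ in a .docx into out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                # Keep only the file name so nothing is written outside out_dir.
                destination = target / Path(name).name
                destination.write_bytes(archive.read(name))
                extracted.append(destination)
    return extracted


if __name__ == "__main__":
    source = Path("examplewithimage.docx")  # the document used in this post
    if source.exists():
        for image in extract_docx_images(str(source), "docx_images"):
            # Each path could then be passed to md.convert(str(image)).
            print(image)
```

This loses the position of each image within the document, so the descriptions would have to be matched back to the text by hand.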

Processing images in pptx

Let’s give it another try, this time with a PPTX:

And here is the result:

That looks better!

Processing images in PDF

Let’s do a last try, this time with the docx converted to a PDF:

And here is the result:

It didn’t process the embedded image and all the formatting is gone as well. Too bad!

Conclusion

The tool is still in active development, and support for processing embedded images still looks limited at the moment. I noticed that you can integrate with Azure AI Document Intelligence, so maybe that gives better results, but I’ll keep that for another post…

More information

microsoft/markitdown: Python tool for converting files and office documents to Markdown.

Deep Dive into Microsoft MarkItDown - DEV Community
