Incorrect handling of JPEG with color space CMYK image extraction #4186

@pdc1

Description of the bug

I have a PDF containing images in the CMYK colorspace. I want to convert the raw image bytes (e.g. from extract_image or get_text("dict")) to an RGB image.

For images with decode filter 'DCTDecode', the colorspace conversion does not appear to work when the Pixmap is created from the raw image bytes. If the Pixmap is loaded directly from the xref, the conversion works.

The document images look like this:

>>> doc.get_page_images(0)
[(44, 0, 1350, 1125, 8, 'DeviceCMYK', '', 'X10', 'DCTDecode'), (46, 45, 1221, 1357, 8, 'DeviceCMYK', '', 'X11', 'FlateDecode'), (52, 51, 500, 500, 8, 'DeviceCMYK', '', 'X7', 'FlateDecode'), (53, 0, 1650, 1275, 8, 'DeviceCMYK', '', 'X9', 'FlateDecode'), (48, 0, 1024, 683, 8, 'DeviceCMYK', '', 'X4', 'FlateDecode')]
>>> doc.get_page_images(1)
[(7, 0, 2848, 4288, 8, 'DeviceCMYK', '', 'X15', 'DCTDecode')]
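
To narrow the listing above down to the affected images, the tuples can be filtered by colorspace and filter. This is only a sketch: it assumes the field order documented for get_page_images (xref, smask, width, height, bpc, colorspace, alt. colorspace, name, filter), and the helper name cmyk_jpeg_xrefs is made up for illustration.

```python
# Tuples copied verbatim from the doc.get_page_images(0) output in this report.
page0_images = [
    (44, 0, 1350, 1125, 8, 'DeviceCMYK', '', 'X10', 'DCTDecode'),
    (46, 45, 1221, 1357, 8, 'DeviceCMYK', '', 'X11', 'FlateDecode'),
    (52, 51, 500, 500, 8, 'DeviceCMYK', '', 'X7', 'FlateDecode'),
    (53, 0, 1650, 1275, 8, 'DeviceCMYK', '', 'X9', 'FlateDecode'),
    (48, 0, 1024, 683, 8, 'DeviceCMYK', '', 'X4', 'FlateDecode'),
]


def cmyk_jpeg_xrefs(image_list):
    """Return xrefs of images that are DeviceCMYK and stored with DCTDecode.

    Hypothetical helper; assumes index 0 is the xref, index 5 the colorspace
    name and index 8 the filter name.
    """
    return [
        item[0]
        for item in image_list
        if item[5] == "DeviceCMYK" and item[8] == "DCTDecode"
    ]


print(cmyk_jpeg_xrefs(page0_images))
```

For page 0 this selects only xref 44, which is the image used in the reproduction code.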

See sample code below.

Sample PDF is Seven Deadly Sins Program-1.pdf

Correct image (using Pixmap(doc, xref)):
[image: temp]

Incorrect image (using extract_image(xref)["image"] bytes):
[image: temp2]

How to reproduce the bug

Here is the code I used to generate the two images:

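import pymupdf

doc = pymupdf.open("Seven Deadly Sins Program-1.pdf")
images = doc.get_page_images(0)
xref = images[0][0]  # first image on page 0 (DeviceCMYK, DCTDecode)

# Path 1 (correct result): build the Pixmap from the xref, then convert to RGB.
pix = pymupdf.Pixmap(doc, xref)
pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
pix.save("temp.jpeg")

# Path 2 (incorrect result): build the Pixmap from the extracted raw bytes, then convert to RGB.
img = doc.extract_image(xref)
pix2 = pymupdf.Pixmap(img["image"])
pix2 = pymupdf.Pixmap(pymupdf.csRGB, pix2)
pix2.save("temp2.jpeg")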
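
One thing that may be worth checking, although this report does not establish it as the cause, is whether the extracted JPEG carries an Adobe APP14 segment. CMYK JPEGs written by Adobe tools commonly store inverted channel values, so a decoder that only sees the raw bytes may interpret them differently from one that also has the PDF image dictionary. The sketch below is plain Python and does not use any PyMuPDF API; the function name find_adobe_transform is hypothetical.

```python
import struct


def find_adobe_transform(jpeg_bytes):
    """Return the colour-transform byte of an Adobe APP14 segment, or None.

    Hypothetical diagnostic helper. In the APP14 layout, 0 means no transform
    (RGB or CMYK), 1 means YCbCr and 2 means YCCK.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            return None  # lost marker sync; give up rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers without a length
            pos += 2
            continue
        if marker in (0xDA, 0xD9):  # start of scan or end of image
            return None
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xEE and payload[:5] == b"Adobe" and len(payload) >= 12:
            return payload[11]
        pos += 2 + length  # the length field counts itself but not the marker
    return None
```

Assuming the variables from the reproduction code, it could be called as find_adobe_transform(img["image"]). A non-None result would only indicate that the Adobe marker is present; it would not by itself confirm where the conversion goes wrong.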

PyMuPDF version

1.25.1

Operating system

Windows

Python version

3.11
