SheetReader DuckDB extension

sheetreader is a DuckDB extension that allows reading XLSX files into DuckDB tables with SheetReader, our blazingly fast XLSX parser (https://github.com/polydbms/sheetreader-core).

This repository is based on https://github.com/duckdb/extension-template.

Usage

Before using SheetReader, you need to install it from the community extensions and load it into your DuckDB-environment:

INSTALL sheetreader FROM community;
LOAD sheetreader;

Now, you can run your first query:

D SELECT *
  FROM sheetreader('test.xlsx');

The sheetreader() function offers further parameters to load the XLSX-file as required:

D CREATE TABLE test AS FROM sheetreader(
    'test.xlsx',
    sheet_index=1,
    threads=16,
    skip_rows=0,
    has_header=TRUE,
    types=[BOOLEAN,VARCHAR],
    coerce_to_string=TRUE,
    force_types=TRUE
  );

Parameters

Name	Description	Type	Default
`sheet_index`	Index of the sheet to read. Starts at 1.	`INTEGER`	`1`
`sheet_name`	Name of the sheet to read. Only either `sheet_index` or `sheet_name` can be set.	`VARCHAR`	`""`
`threads`	Number of threads to use, while parsing	`INTEGER`	Half of available cores; minimum 1
`skip_rows`	Number of rows to skip	`INTEGER`	`0`
`types`	List of types for all columns Types currently available: `VARCHAR`,`BOOLEAN`,`DOUBLE`, `DATE`. Useful in combination with `coerce_to_string` and `force_types`.	`LIST(VARCHAR)`	Uses types determined by first & second row (after skipped rows)
`coerce_to_string`	Coerce all cells in column of type `VARCHAR` to string (i.e. `VARCHAR`).	`BOOLEAN`	`false`
`force_types`	Use `types` even if they are not compatible with types determined by first/second row. Cells, that are not of the column type, are set to `NULL` or coerced to string, if option is set.	`BOOLEAN`	`false`
`has_header`	If set to `true`: Force to treat first row as header row (only works if all cells are of type `VARCHAR`). If successful, the cell contents are used for column names. Will overwrite the default behavior, which doesn't use the first row as headers, if all columns have type `VARCHAR`. If set to `false`: The extension will still try to treat the first row as header row. The difference is that it will not fail, if the first row is not usable (i.e. not all cells are of type `VARCHAR`). The first row won't be used as headers, if all columns have type `VARCHAR`.	`BOOLEAN`	`false`

More information on SheetReader

SheetReader was published in the Information Systems Journal.

@article{DBLP:journals/is/GavriilidisHZM23,
  author       = {Haralampos Gavriilidis and
                  Felix Henze and
                  Eleni Tzirita Zacharatou and
                  Volker Markl},
  title        = {SheetReader: Efficient Specialized Spreadsheet Parsing},
  journal      = {Inf. Syst.},
  volume       = {115},
  pages        = {102183},
  year         = {2023},
  url          = {https://doi.org/10.1016/j.is.2023.102183},
  doi          = {10.1016/J.IS.2023.102183},
  timestamp    = {Mon, 26 Jun 2023 20:54:32 +0200},
  biburl       = {https://dblp.org/rec/journals/is/GavriilidisHZM23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Benchmarks

You can find benchmarks in the above-mentioned paper, comparing SheetReader to other XLSX parsers.

Here is a plot of preliminary benchmarks comparing the sheetreader DuckDB extension to the spatial extension's st_read function:

(System info: 2x Intel(R) Xeon(R) E5530 @ 2.40GHz, 47GiB RAM)

Building yourself

First, clone this repository with the --recurse-submodules flag --- so you get all the needed source files.

To build the extension, run:

GEN=ninja make

The main binaries that will be built are:

./build/release/duckdb
./build/release/extension/sheetreader/sheetreader.duckdb_extension

duckdb is the binary for the DuckDB shell with the extension code automatically loaded.
sheetreader.duckdb_extension is the loadable binary as it would be distributed.

Running the extension

To run the self-built extension code, simply start the shell with ./build/release/duckdb.

Name		Name	Last commit message	Last commit date
Latest commit History 104 Commits
.devcontainer		.devcontainer
.github		.github
.vscode		.vscode
assets		assets
benchmarks		benchmarks
docs		docs
duckdb @ fa5c2fe		duckdb @ fa5c2fe
extension-ci-tools @ 117cfaf		extension-ci-tools @ 117cfaf
scripts		scripts
src		src
test		test
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.clangd		.clangd
.editorconfig		.editorconfig
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
extension_config.cmake		extension_config.cmake
vcpkg.json		vcpkg.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SheetReader DuckDB extension

Table of Contents

Usage

Parameters

More information on SheetReader

Benchmarks

Building yourself

Running the extension

About

Releases

Packages

Contributors 2

Languages

License

polydbms/sheetreader-duckdb

Folders and files

Latest commit

History

Repository files navigation

SheetReader DuckDB extension

Table of Contents

Usage

Parameters

More information on SheetReader

Benchmarks

Building yourself

Running the extension

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages