Data Portraits

This is the code for Data Portraits: Recording Foundation Model Training Data by Marc Marone and Ben Van Durme.

Large models are trained on increasingly immense and opaque datasets, but it can be very difficult to answer a fundamental question: Was this in a model's training set?

We call for documentation artifacts that can answer this membership question and term these artifacts Data Portraits.

This repo implements one tool that can answer this question, based on efficient hash storage with Bloom filters. Many other dataset documentation tools exist; see our paper for details about this method, other tools, and the properties that make ours unique.
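
As a rough illustration of the idea (not the RedisBloom implementation this repo relies on), a Bloom filter stores items as a handful of set bits and answers membership queries with no false negatives and a small chance of false positives. The class name, bit count, and hash construction below are assumptions chosen for the sketch:

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; NOT the RedisBloom implementation."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means definitely absent; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = TinyBloomFilter(num_bits=1 << 16, num_hashes=4)
bf.add("a span of training text")
print("a span of training text" in bf)  # True: added items are always found
print("a span never added" in bf)       # very likely False; false positives are possible
```

The filter never stores the text itself, only bits, which is what makes this kind of membership record compact.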


Warning

Consider this an alpha code release: other portrait files, a MUCH easier redis installation, and the interface will be added soon!

Installation

This system uses several components: a Python library, a bitarray server (redis), and binary files containing data hashes.

Install the library

pip install -e .  # install as an editable package

Install redis and RedisBloom

If you're already familiar with redis and want to use an existing server or a system installation, you can skip this section. Just make sure you know your redis connection details and have RedisBloom loaded into your server (via config files or command line args).

Included launch scripts assume this repo structure but it's easily changed:
```
├── RedisBloom
├── redis_configs
├── redis-stable
├── scripts
└── src
    └── dataportraits
```
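Before launching anything, it can help to confirm that a checkout matches this layout. The helper below is a hypothetical convenience and not part of the repo; the directory names are copied from the tree above:

```python
from pathlib import Path

# Directory names taken from the tree above.
EXPECTED_DIRS = [
    "RedisBloom",
    "redis_configs",
    "redis-stable",
    "scripts",
    "src/dataportraits",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    print("layout looks complete" if not missing else f"missing: {missing}")
```

Run it from the repo root; any directory it reports as missing is one the launch scripts would presumably fail to find.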
1. Install and build Redis
  Build inside redis-stable with make install PREFIX="$(pwd)" so that the binaries are installed to redis-stable/bin, where the launch scripts expect them.
2. Install and build RedisBloom
  
  Our code was tested against version 2.4.3. We assume a certain binary header structure in serialized Bloom filter files, and other versions may change it!
  
  You can use the instructions in the repo or these:
```
git clone https://github.com/RedisBloom/RedisBloom.git
cd RedisBloom
git checkout tags/v2.4.3 
git submodule update --init --recursive
make -j 4
cd ..
```
3. Start Redis

We include a launch script helper. If you're familiar with redis already, you can freely launch servers as you would typically or use an existing server.
```
python easy_redis.py --just-start
```
Note that this starts a persistent daemon server on your system. Loaded sketches consume RAM until you shut the server down: python easy_redis.py --shutdown

4. Fetch Data Files

Fetch the data files from https://huggingface.co/mmarone/portraits-wikipedia on Hugging Face:
```
git lfs install
git clone https://huggingface.co/mmarone/portraits-wikipedia
```
This takes about 2.5 GB of local disk space.

5. Load Files
```
python easy_redis.py --start-from-dir portraits-wikipedia/
```
Note: if redis wasn't already started (e.g. you skipped step 3), this will attempt to start a server for you.

Usage

More examples coming soon!

import dataportraits

# localhost:8899 is the default for the redis server started above
# wiki-demo.50-50.bf is the name of the system - see the easy_redis.py script for more
# change as necessary!
portrait = dataportraits.RedisBFSketch('localhost', 8899, 'wiki-demo.50-50.bf', 50)

text = """
Test sentence about Data Portraits - NOT IN WIKIPEDIA!
Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied
"""
report = portrait.contains_from_text([text])
print(report[0]['chains'])

# [['cations where the amount of source data would requ', 'ire an impractically large amount of memory if "co', 'nventional" error-free hashing techniques were app']]
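
Each piece in the chain above is 50 characters long, and consecutive pieces sit back to back in the query text. The sketch below shows one way such chains could be assembled; it is an assumption for illustration, not the library's actual algorithm. A plain Python set stands in for the Bloom filter, and the `find_chains` helper, the toy strings, and the width of 5 are all hypothetical:

```python
def find_chains(text, known_ngrams, width=50):
    """Chain matching n-grams that sit exactly `width` characters apart."""
    # Offsets whose width-character window is a member of the stored set.
    hits = {
        i
        for i in range(len(text) - width + 1)
        if text[i:i + width] in known_ngrams
    }
    chains = []
    for start in sorted(hits):
        if start - width in hits:
            continue  # already covered by a chain that began earlier
        chain, pos = [], start
        while pos in hits:
            chain.append(text[pos:pos + width])
            pos += width
        chains.append(chain)
    return chains


# Toy "training document", stored as non-overlapping tiles of 5 characters.
source = "the quick brown fox jumps over the lazy dog"
width = 5
stored = {source[i:i + width] for i in range(0, len(source) - width + 1, width)}

query = "He said: quick brown fox jumps, then left."
print(find_chains(query, stored, width))
# [['uick ', 'brown', ' fox ', 'jumps']]
```

Because the query is checked at every offset, overlap is found even when the quoted span does not line up with the tile boundaries used when the document was stored.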

Citing

If you find this repo or our web demo useful, please cite our paper.

@inproceedings{
    marone2023dataportraits,
    title={Data Portraits: Recording Foundation Model Training Data},
    author={Marc Marone and Benjamin {Van Durme}},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2023},
    url={https://openreview.net/pdf?id=ZrNRBmOzwE}
}
