The future of AI relies on a high school teacher’s free database

In a suburb on the outskirts of Hamburg, Germany, a single word, “LAION,” is scrawled in pencil on a mailbox. This is the home of high school teacher Christoph Schuhmann, who is behind a massive data-gathering effort that has become central to the artificial intelligence (AI) boom capturing the world’s attention. LAION, short for “Large-scale AI Open Network,” is Schuhmann’s passion project. When he is not teaching physics and computer science to German teenagers, he works with a small team of volunteers to build the world’s largest free AI training dataset, which has already been used in text-to-image generators such as Google’s Imagen and Stable Diffusion.

AI text-to-image generators rely on databases like LAION for the vast amounts of visual material they need to deconstruct and create images. The debut of these products last year marked a paradigm shift in the AI arms race and raised ethical and legal questions. Lawsuits were quickly filed against the generative AI companies Stability AI and Midjourney for copyright infringement, and critics warned that the violent, sexualized, and otherwise problematic images in their training datasets introduce biases that are almost impossible to mitigate.

However, Schuhmann is not concerned about these issues. He only wants to set the data free.

Large language

Two years ago, the 40-year-old teacher and trained actor co-founded LAION with other AI enthusiasts, inspired by OpenAI’s release of the DALL-E deep learning model. Concerned about big tech companies monopolizing data, they decided to create an open-source dataset to help train text-to-image diffusion models. Using raw HTML code collected by the California nonprofit Common Crawl, the group associated images with descriptive text, producing 3 million image-text pairs within weeks and 400 million after three months. LAION has since grown into the largest free dataset of images and captions, with over 5 billion pairs.
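The pairing step described above can be sketched in a few lines: walk the HTML, collect each image URL together with its alt text, and keep only pairs whose caption looks descriptive. This is an illustrative sketch, not LAION’s actual pipeline; the three-word caption cutoff and the example page are invented for demonstration.

```python
# Sketch: pairing images with alt-text captions from raw HTML, in the spirit
# of mining Common Crawl. Details (the word-count cutoff, the sample page)
# are assumptions for illustration, not LAION's real tooling.
from html.parser import HTMLParser


class ImageAltPairs(HTMLParser):
    """Collect (image URL, alt text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        alt = (attrs.get("alt") or "").strip()
        # Keep only captions that look descriptive (arbitrary 3-word cutoff).
        if src and len(alt.split()) >= 3:
            self.pairs.append((src, alt))


html = """<html><body>
<img src="https://example.com/cat.jpg" alt="a tabby cat on a sofa">
<img src="https://example.com/spacer.gif" alt="">
</body></html>"""

parser = ImageAltPairs()
parser.feed(html)
print(parser.pairs)  # only the descriptively captioned image survives
```

At web scale the same idea runs over billions of pages, with the surviving pairs later re-scored by a model such as CLIP to check that caption and image actually match.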

Although the team worked without pay and received only one donation from a machine-learning company, LAION’s reputation grew, attracting an offer from Emad Mostaque, a former hedge fund manager who proposed covering the cost of computing power while launching his own open-source generative AI business. Initially skeptical, the team eventually accepted, gaining access to cloud GPUs worth roughly $9,000 to $10,000.

Stability AI launched in 2022, using LAION’s dataset for its flagship AI image generator, Stable Diffusion, and hiring two of the organization’s researchers. A year later, the company is seeking a $4 billion valuation, largely thanks to LAION’s data. However, the LAION co-founder, who rejected job offers from various companies to keep the project independent, hasn’t profited from it and isn’t interested in doing so, saying, “I’m still a high school teacher.”

New oil?

LAION is a database used to train AI image generators, and the quality of the images those generators produce depends on the quantity and diversity of the data they learn from. This has raised legal and ethical questions about feeding such databases with publicly available material. LAION’s founders scraped visual data from sources including Pinterest, Shopify, and Amazon Web Services, as well as YouTube thumbnails, portfolio platforms such as DeviantArt and EyeEm, government websites including the U.S. Department of Defense’s, and content from news sites such as The Daily Mail and The Sun.

The lack of AI regulation in the European Union has left the use of copyrighted materials in big data sets unaddressed. The forthcoming AI Act, which is expected to be finalized early this summer, will not rule on the issue. Instead, lawmakers are considering a provision that would require companies behind AI generators to disclose what materials were used in the data sets their products were trained on. This would allow creators of those materials to take action if they choose.

European Parliament Member Dragos Tudorache stated that the provision’s goal is to ensure transparency: “As a developer of generative AI, you have an obligation to document and be transparent about the copyrighted material that you have used in the training of algorithms.” While such regulation wouldn’t affect Stability AI, it could pose a problem for other text-to-image generators. For instance, it is unknown what data OpenAI used to train its DALL-E 2 model.

Schuhmann, for his part, argues that anything freely available online is fair game. The proposed regulation, however, could upend the status quo in data collection, forcing the AI industry to confront the ethical and legal issues raised by building models on copyrighted material.

Worst of the web

As his son played Minecraft, Schuhmann sat in the living room and compared LAION to a “small research boat” riding a “big information technology tsunami.” LAION, he explained, takes samples of what lies beneath the surface to show the world.

Schuhmann noted that the LAION database contains only a tiny fraction of what is publicly available on the internet, and that acquiring the data was easy enough to do on a budget of just $10,000 from donors.

However, the data available publicly isn’t always appropriate or legal for everyone to view. Alongside photos of cats and firetrucks, LAION’s dataset includes millions of images depicting pornography, violence, child nudity, racist memes, hate symbols, copyrighted art, and content scraped from private company websites. Although Schuhmann stated that he wasn’t aware of any child nudity in LAION’s dataset, he admitted to not having reviewed the data in great depth. If alerted about such content, he would remove links to it immediately.

Before assembling the database, Schuhmann sought advice from lawyers and employed an automated tool to filter out illegal content. However, he is more interested in learning from LAION’s holdings than in sanitizing them: the team could have filtered violent content out of the released data, he said, but chose not to because doing so would slow the development of violence-detection software. Although LAION offers a takedown form for requesting a photo’s removal, the dataset has already been downloaded thousands of times.
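The “flag rather than remove” choice described above can be illustrated with a minimal sketch: each image-text pair carries a classifier score, and instead of dropping high-scoring samples, the release marks them so downstream users can filter or study them. The `unsafe_score` field and the 0.9 threshold are hypothetical, invented here for illustration; they are not LAION’s actual schema or tooling.

```python
# Hypothetical sketch of score-and-flag dataset filtering. The field names
# and threshold are assumptions, not LAION's real pipeline: the point is
# that samples are flagged for downstream users rather than deleted.
samples = [
    {"url": "https://example.com/cat.jpg", "caption": "a cat", "unsafe_score": 0.02},
    {"url": "https://example.com/bad.jpg", "caption": "graphic scene", "unsafe_score": 0.97},
]

THRESHOLD = 0.9  # arbitrary cutoff for this sketch


def flag(sample):
    """Keep every sample, but mark likely-problematic ones."""
    return {**sample, "flagged": sample["unsafe_score"] >= THRESHOLD}


flagged = [flag(s) for s in samples]
print([s["flagged"] for s in flagged])  # [False, True]
```

Keeping the flagged samples in the release is what makes the data usable for training detectors of the very content being flagged, which is the trade-off Schuhmann describes.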