Public data warehouse for open data

Backers: 0/5

Stage: Development

📈 Big Data


Our goal is to make public data easier to find and access, whether for human analysis, in apps, or as a source of up-to-date data for retrieval-augmented generation. Inspired by git scraping [1], the core idea is that people don’t upload a snapshot of their dataset directly, as you might on Kaggle or Hugging Face. Instead, anyone can contribute code (connectors), which we run continuously, making the fetched data available to everyone in our shared, public data warehouse. We currently have connectors for 120+ datasets, including an index of YC companies, U.S. house prices, and Wikipedia search volumes.
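To make the connector idea concrete, here is a minimal sketch of what "contribute code, not snapshots" could look like. It is not the actual interface used in subsets-connectors: the `Connector` class, `run_connectors`, and the `example_companies` dataset are hypothetical names, and the fetch function returns a hard-coded placeholder row instead of calling a real source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Connector:
    """Hypothetical connector: a dataset name plus code that fetches its rows."""
    name: str
    fetch: Callable[[], List[Row]]


def run_connectors(connectors: List[Connector],
                   warehouse: Dict[str, List[Row]]) -> Dict[str, List[Row]]:
    """Run every connector once and replace its table with the fresh rows."""
    for connector in connectors:
        warehouse[connector.name] = connector.fetch()
    return warehouse


def fetch_example_companies() -> List[Row]:
    # Placeholder for a real HTTP call or scrape; the row is made up.
    return [{"company": "ExampleCo", "batch": "W24"}]


# A plain dict stands in for the shared warehouse in this sketch.
warehouse = run_connectors(
    [Connector("example_companies", fetch_example_companies)], {}
)
```

A scheduler would call `run_connectors` repeatedly, so the table always reflects the latest fetch rather than a one-off upload.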

Separately, open data portals, such as those run by NGOs, can be hard to use because they follow semantic web principles, i.e., representing data as a graph and adding structured metadata. We’re taking a less structured approach: each dataset is just a table that you can download or query using SQL, and we’re building a machine learning engine for ranking, pre-processing, and generating relevant subsets/views from the data warehouse.
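As an illustration of the "just a table you can query with SQL" approach, the sketch below uses Python's built-in `sqlite3` as a local stand-in for the warehouse. The table name `house_prices`, its columns, and all values are made-up placeholders, not real data or the project's actual schema.

```python
import sqlite3

# In-memory database standing in for the shared warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE house_prices (region TEXT, year INTEGER, median_price REAL)"
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO house_prices VALUES (?, ?, ?)",
    [
        ("North", 2022, 300000.0),
        ("North", 2023, 320000.0),
        ("South", 2023, 250000.0),
    ],
)

# A flat table needs no graph traversal or metadata lookup: one SELECT is enough.
rows = conn.execute(
    "SELECT region, median_price FROM house_prices "
    "WHERE year = 2023 ORDER BY region"
).fetchall()
```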

We use BigQuery as the data warehouse and Dagster for the data pipelines, running on top of Kubernetes. The frontend is Next.js. The data pipelines are currently centralised in our repo, but we’re building our own engine where you can just upload simple scripts. Search is currently basic semantic search, with one big index that stores unique strings across tables, columns, and rows. We previously had better search using LLMs, but the cost, latency, and rate limits mean we’re still investigating the right way to go.
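The single index of unique strings could be sketched roughly as follows. This is an assumption-heavy illustration, not the project's implementation: `build_index` and `search` are hypothetical helpers, the sample table is made up, and plain token overlap is swapped in where a real semantic search would compare embeddings.

```python
from collections import defaultdict


def build_index(tables):
    """Map each unique string to the set of (kind, table) places it occurs."""
    index = defaultdict(set)
    for table_name, rows in tables.items():
        index[table_name].add(("table", table_name))
        for row in rows:
            for column, value in row.items():
                index[column].add(("column", table_name))
                if isinstance(value, str):
                    index[value].add(("row", table_name))
    return index


def search(index, query):
    """Rank indexed strings by token overlap with the query (embedding stand-in)."""
    query_tokens = set(query.lower().split())
    scored = []
    for text, locations in index.items():
        tokens = set(text.lower().replace("_", " ").split())
        overlap = len(query_tokens & tokens)
        if overlap:
            scored.append((overlap, text, sorted(locations)))
    # Highest overlap first, ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Made-up table for illustration.
tables = {"yc_companies": [{"company_name": "ExampleCo", "sector": "open data"}]}
results = search(build_index(tables), "open data")
```

Because each string is stored once regardless of how often it appears, the index stays small relative to the raw tables, and every hit points back to the tables that contain it.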

The project is in its very early stages, but we’d like to get some early feedback and find people who want to help us build connectors or use the data to build something cool. The connectors are available at https://github.com/subsetsio/subsets-connectors, and you can visually explore the datasets and get your own free API key at https://www.subsets.io.

Benefits and perks of signing up as an indiebacker

Early Access to New Features
Priority Customer Support
Access to Beta Versions
Free or Discounted Access

Ways of Collaboration

Discord Meetings
Google Meetings
Dedicated Email Support
Collaboration on Design Decisions
Feedback Sessions and Q&A
Participation in Development Sprints

Become an Indiebacker for Subsets.io

Join as an early adopter and help out

By jyavorska



Made by Socketopp
© 2024 Indiebackers