Static torrent website with peer-to-peer queries over BitTorrent on 2M records

2022-03-07

TL;DR

Thanks to SQLite VFS abstraction, it is possible to implement your own file system on which SQLite parks data and structures. Inspired by Phiresky's sql.js-httpvfs which uses HTTP Range requests to lazy load blocks of storage from a static web server, I changed few lines of code to point the VFS read() calls to a database seeded by peers as a torrent. A 300 MiB db with 2 million records can be queried from seeders for full text searches in less than 1 MiB traffic with the BitTorrent protocol, all inside the browser, in a static website.

LIVE DEMO HERE!

1. On the last demo

In the last post I discussed about a demo of a static torrent website which retrieves a small database (13 MiB with 135,000 records) from the IPFS peer-to-peer network to let users do their searches. The purpose was to not let the database vanish in the void, together with the torrents, in case of service shutdown. The solution was to host it away from servers. Unfortunately this approach showed me the limits from the start: 10 seconds to download the whole database under ideal conditions, a little waste of time and network traffic for just a little database which I would use for a single query to retrieve 40 records out of 135,000. Not efficient and only suitable for small communities and local websites.

After posting it on HN for comments, some users dropped this gem I didn't know about:

Enhance it with the fact you can lazy load sqlite query using HTTP Range as demonstrated in: https://news.ycombinator.com/item?id=27016630

https://github.com/phiresky/sql.js-httpvfs

2. sql.js-httpvfs: under the looking glass

As shown in a post by the author, this project makes possible to do sql queries on a database hosted on a static web server, without a backend, by "just" translating SQLite read() calls to HTTP Range requests for blocks of the .sqlite file. The results are out of this world:

20 GET requests, fetching a total of 270 KiB to do a complex query on a 670 MiB database with 8,000,000 records

33 GET requests, fetching a total of 70 KiB to do a full text search on a 8 MiB table

2.1 SQLite VFS

By doing my searches I discovered the main magic trick behind the project: SQLite's VFS (Virtual File System). Thanks to SQLite design it's possible to attach whatever implementation of a virtual file systems you like under its feet and make it work transparently. To do this you have to code a module that implements some functions like read() and write() the way it's described in the documentation and register the module with a call: sqlite3_vfs_register(module).

The author has implemented a VFS using the good old synchronous XMLHttpRequests (XHR) in javascript by tricking them to get the requested portions of storage with the Range HTTP header from a remote server.

2.2 .wasm

Another magic trick involved is the connection of the .js VFS code with .c source code of SQLite.

At the moment I have no idea how this actually works and I can't reproduce it (I'm busycaveman rn). As far as I know a wrapper function is injected when compiling the .c source code to .wasm with emscripten and then the .js VFS code is hooked at runtime by the browser. Part of the process has been automated by sql.js.

2.3 Lazy loading

The last part of this insane project is the lazy loading. It's not strictly related to the project, rather it's in the nature of DBMS. Thanks to various indexing techniques used by database engines, the amount of I/O involved in a text search query is very low in terms of size. As mentioned before, 270 KiB of reads over a 670 MiB database to complete a complex query seems good compared to the whole 13 MiB database of the last demo.

3. What if sql.js + VFS + torrent?

By doing simple math I quickly realized that the same project would have worked if the whole database was parked on different peers and the XHR requests were translated to equivalent bittorrent piece requests. 270 KiB is absolutely nothing in these days.

The idea is nothing new. BitTorrent Inc already coupled these two pieces of puzzle in https://github.com/bittorrent/sqltorrent. Another developer took inspiration from that to build https://github.com/lmatteis/torrent-net which uses sqltorrent in its backend and exposes a website interface to do text searches like a torrent website. But as I said from the start, I don't want any kind of backend running in the server, but just a static webpage with a bundle.js that works out of the box inside the browser. In this way the website could be copy-pasted easily.

Here's my journey.

4. OUR code

4.1 WebTorrent

WebTorrent is a working bittorrent client library written in javascript that runs entirely on browser. It's actually a composition of low level modules well designed and separated. It uses WebRTC for peer-to-peer communications so it can only communicates with bittorrent clients that support WebRTC.

An interesting API function provided by this library let you select the exact range of bytes you need from a selected file in the torrent and gives you back a ReadableStream, which is good because you don't have to deal with peers/pieces/offsets. This call also prioritize that range. The code is simple:

const client = new WebTorrent()
const torrentId = 'magnet:?xt=urn:btih:whatever+trackers+webseeds'

client.add(torrentId, function (torrent) {
  const stream = torrent.files[0].createReadStream({ start: 0, end: 4096 - 1 })

  stream.on('data', (chunk) => {
    // chunk ready to be used
  })
})

4.2 Fork of Phiresky's sql.js-httpvfs code (FAILED)

I found the original code of the project very difficult to understand, mainly because I have no experience with nodejs and modules management / packaging. After spending a lot of time looking at the code I decided to look around for similiar projects. The main difficult was to find the correct spot to convert the whole chain of synchronous calls into asynchronous calls, starting from the execute() and read() down to the doXHR() (the function in charge of actually reading bytes from somewhere), since the WebTorrent library is event-based and requires promises to deal with it.

Fortunately I didn't have to look past the README.md to find an alternative:

wa-sqlite, which is a much simpler wasm wrapper for SQLite than sql.js a and has different VFSes that don't require an EMScripten dependency. sql.js-httpvfs could easily be reimplemented on top of this.

4.3 Rhashimoto's wa-sqlite

This user did a great job by doing the same work phiresky did just before implementing the httpVFS, to allow any user to implement its own VFS in its own way. The repo has also some working examples with synchronous and asynchronous calls like the wa-sqlite/src/examples/MemoryAsyncVFS.js:

xRead(fileId, pData, iOffset) {
  // just a wrapper around a sync function
  return this.handleAsync(async () => {
    super.xRead(fileId, pData, iOffset)
  })
}

From this point I had all the ingredients, so I started cooking.

5. The demo

LIVE DEMO HERE!

5.1 Building and slicing the database

In the repo of sql.js-httpvfs the author had to balance the number of requests and network traffic by tricking the page_size of the sqlite database and decided to use 1024 bytes instead of 4096 (default). According to the bittorrent protocol specification v1.0 my requests to peers must be of 16 KiB minimum, so I decided to use 16 KiB pages.

Sincerely I don't remember where I got this csv dump of torrents (they are all dead maybe). It's 200 MiB in size containing over 2,000,000 records. I prepared the database according to Phiresky's guide with little changes:

sqlite3> .mode csv
sqlite3> .import dump.csv orig_torrents
sqlite3> CREATE VIRTUAL TABLE torrents USING fts5(magnet UNINDEXED, title, size UNINDEXED);
sqlite3> INSERT INTO torrents SELECT * FROM orig_torrents;
sqlite3> .save example.sqlite3
sqlite3> .quit

sqlite3> DROP TABLE orig_torrents;
sqlite3> VACUUM;
sqlite3> PRAGMA journal_mode = delete;
sqlite3> PRAGMA page_size = 16384;
sqlite3> INSERT INTO torrents(torrents) VALUES ('optimize');
sqlite3> VACUUM;
sqlite3> .quit

Because the full text search module fts5 makes a copy of the values, it is safe to drop the original table. Now, from a 200 MiB csv dump I had a 310 MiB database fully indexed for text searches on titles.

5.2 Repo changes

I then worked on the wa-sql project.

Before running make to compile the SQLite amalgamation source code I added a flag to support the fts5 module:

$ git clone https://github.com/rhashimoto/wa-sqlite
$ cd wa-sqlite
$ yarn install
$ vim Makefile

+  -DSQLITE_ENABLE_FTS5

$ make
$ cd src/
$ curl -O https://cdn.jsdelivr.net/npm/webtorrent@1.8.1/webtorrent.min.js

5.3 Code

It's worth mentioning that WebTorrent is designed as a high-level streaming torrent library composed by low level modules like bittorrent-protocol, bittorrent-dht etc... and because of this I had to deal with the limitation of not being able to ask ONLY for pieces I'm interested in. Since it's not a final product but it's just a demo I'm ok with it. The result is a low/medium waste of traffic which I tried to limit with some tricks.

Here's the main changes I made to wa-sqlite/src/examples/MemoryAsyncVFS.js:

import * as VFS from '../VFS.js'
import WebTorrent from './webtorrent.min.js'

let torrentPieceLength = 0
const client = new WebTorrent()

const torrentPromise = new Promise((resolve, reject) => {
  const torrentId = 'magnet:?xt=urn:btih:whatever+trackers+webseeds'

  try {
    client.add(torrentId, (torrent) => {
      torrent.deselect(0, torrent.pieces.length - 1) // stop download
      torrentPieceLength = torrent.pieceLength
      resolve(torrent)
    })
  } catch (err) {
    reject(err)
  }
})

xOpen(name, fileId, flags, pOutFlags) {
  return this.handleAsync(async () => {
    await torrentPromise
    return super.xOpen(name, fileId, flags, pOutFlags) // I don't care
  });
}

xRead(fileId, pData, iOffset) {
  // exclusive: [from, to)
  const from = iOffset
  const nBytes = pData.size

  const fromPiece = Math.floor(from / torrentPieceLength)
  const nPieces = Math.ceil(nBytes / torrentPieceLength)

  return this.handleAsync(async () => {
    const data = await new Promise((resolve, reject) => {
      torrentPromise.then((torrent) => {
        torrent.critical(fromPiece, fromPiece + nPieces - 1)
        return torrent.files[0].createReadStream({ start: from, end: from + nBytes - 1 })
      }).then((stream) => {
        stream.on('data', (chunk) => {
          resolve(chunk)
        })
      })
    })

    if (data.length) {
      pData.value.set(new Int8Array(data, 0, data.length));
    }

    if (data.length < pData.size) {
      pData.value.fill(0, data.length);
      return VFS.SQLITE_IOERR_SHORT_READ;
    }
    return VFS.SQLITE_OK
  })
}

xFileSize(fileId, pSize64) {
  return this.handleAsync(async () => {      
    await torrentPromise.then((torrent) => {
      pSize64.set(torrent.length)
    })
    return VFS.SQLITE_OK;
  })
}

And here is how I glued all together inside the index.js:

import SQLiteAsyncESMFactory from './src/wa-sqlite-async.mjs';
import * as SQLite from './src/sqlite-api.js';
import { MemoryAsyncVFS } from './src/MemoryAsyncVFS.js';

let sqlite3a = {}
let dbPromise = {}
const initializeDb = (async function() {
  const SQLiteAsyncModule = await SQLiteAsyncESMFactory();

  sqlite3a = SQLite.Factory(SQLiteAsyncModule);
  sqlite3a.vfs_register(new MemoryAsyncVFS());
  
  // sqlite3a doesn't know what's behind xRead()
  // it's opening the database from seeders!
  dbPromise = sqlite3a.open_v2('memory-async', undefined, 'memory-async');
})()

async function search(input) {
  await initializeDb

  dbPromise
    .then((db) => {
      const query = `SELECT * FROM torrents WHERE title MATCH '` + input + `' LIMIT 20;`
      const callback = function(row, column) { results.innerText += row + '\n' }
      sqlite3a.exec(db, query, callback)
    })
}

// DOM code
document.querySelector(...)

After some clean up, the dir looked like this:

$ tree
├── dist
│   ├── wa-sqlite-async.mjs
│   ├── wa-sqlite-async.wasm
├── src
│   ├── MemoryAsyncVFS.js
│   ├── MemoryVFS.js
│   ├── sqlite-api.js
│   ├── sqlite-constants.js
│   └── VFS.js
├── index.html
├── index.js

The last step was to bundle all together, reducing the number of files and size and making it browser compatible with webpack:

$ yarn add webpack webpack-cli
$ ./node_modules/.bin/webpack ./index.js -o dist
$ tree
├── dist
│   ├── 89635f597f687bd55656.wasm
│   ├── bundle.js
├── index.html

You can find all the code at https://gitlab.com/boredcaveman/p2psearch

6. Results

DB Size: 310 MiB
Records: 2,000,000 defined as (magnet, title, size)
- magnets are just hash strings
Loading time: 1/2 seconds to ask trackers for the database torrent
Execution time of a query: 5 to 15 seconds to get 20 matches
- results appears 1 by 1 as they are found, it needs 3 to 5 seconds to get the first matches
Network traffic: 400 KiB to 1 MiB for a text search query limited at 20 results

7. Final thoughts

Even though it's just a demo, this project does actually what I was trying to do from the start, a torrent website, or whatever text search website, that does not store the database on the server, but let the users participate in the hosting effort by making a copy of the public database and serving it from their computers in a peer-to-peer way. With this design, the takedown of a single website is useless since the actual data is not there. Also, hosting a static webpage is cheap and simple as downloading a torrent.

From a different point of view, a website like this acts as a software-server: you just ask for the software, and, once downloaded, the actual work is done locally by the browser.

7.1 Pros ordered by relevance

It resists chensorship
Database size and number of records are not a problem
Text searches on the whole database
Low traffic (ideal)

7.2 Cons ordered by relevance

Privacy problem: everyone with a torrent client can see who is visiting the website, because every client is a peer asking for pieces of database
~~Waste of network traffic:~~
- ~~as said before, this demo uses a high-level bittorrent library, I have no control over the download aside from throttling and pieces prioritization~~
At the moment, to do your part and host a copy of the database, you have to use a WebRTC capable bittorrent client like WebTorrent Desktop or websites like Instant.io. More infos at https://webtorrent.io/faq

Peace!

Hey dudes, since this demo s*cks it only needs 20 concurrent queries to saturate the network cap of my VPS, would you help me for just 1 day? Seed a copy of the database with WebTorrent Desktop, Instant.io or any WebRTC capable client, or just keep the following code running in nodejs:

Magnet:

magnet:?xt=urn:btih:160da04261d7c9544620003c7821ed780b18f7db&dn=example.sqlite3&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2F9.rarbg.com%3A2810%2Fannounce&tr=udp%3A%2F%2Fwww.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fvibe.sleepyinternetfun.xyz%3A1738%2Fannounce&tr=udp%3A%2F%2Ftracker1.bt.moack.co.kr%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.zerobytes.xyz%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=wss%3A%2F%2Ftracker.btorrent.xyz&tr=wss%3A%2F%2Ftracker.openwebtorrent.com

index.js

import WebTorrent from 'webtorrent-hybrid'
const client = new WebTorrent()

var magnetURI = 'magnet:?xt=urn:btih:160da04261d7c9544620003c7821ed780b18f7db&dn=example.sqlite3&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2F9.rarbg.com%3A2810%2Fannounce&tr=udp%3A%2F%2Fwww.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fvibe.sleepyinternetfun.xyz%3A1738%2Fannounce&tr=udp%3A%2F%2Ftracker1.bt.moack.co.kr%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.zerobytes.xyz%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=wss%3A%2F%2Ftracker.btorrent.xyz&tr=wss%3A%2F%2Ftracker.openwebtorrent.com'

client.add(magnetURI, { path: '/path/to/downloads/CHANGE_ME!!!1!' }, function (torrent) {
  torrent.on('done', function () {
    console.log('Torrent download finished. Seeding...')
  })
})

$ vim package.json
{
  "type": "module",
  "dependencies": {
    "webtorrent-hybrid": "^5.0.2"
  }
}
$ yarn install
$ nodejs index.js