5-Sep-2020-brave-websites

There are nearly 60K websites that have signed up as publishers on the Brave Rewards program (including this site), and I think it'd be cool to build out systems for understanding and utilizing those sites more.

I have some ideas about how we'd do that, but first, here's a key resource for those of us interested in building out such things: a list of URLs for all the websites registered as publishers on the Brave Rewards platform.

I extracted this list using the same resource that the BatGrowth.com creator uses. Please don't try to extract the list yourself: the file is a HUGE download, and we don't want to needlessly download big files from Brave. Plus, it takes a bit of work to strip it down to the list of URLs that I've provided (I'll soon provide the full steps that I took to make the list).

My ideas for going forward

Search

I think that searching is the most obvious idea nowadays when we deal with a collection of websites. We could develop a search engine for the network, but that would be really difficult.

Targeted Site Search

I suspect that we can outsource most of what we need for searching the network of Brave publishers to DuckDuckGo, which provides site searches across multiple domains (and happens to be a Brave publisher!). The key issue with the DuckDuckGo approach is that we'd need to refine our idea of which sites we'd want DuckDuckGo to search (especially because I'm pretty sure there's some limit to the number of sites we can search across).

For this, we could filter down the entire list of URLs to only the ones that are likely relevant, then search through those sites.

To do this, we would use two stages of searching:

  1. With a simple database of our own, we would search across something that (hopefully) represents what each website is about, such as the text of each website's homepage (which we'd need to scrape). This search would separate the sites that are relevant to the query from the ones that are not.
  2. Then we would pipe the list of relevant sites to the DuckDuckGo search and execute the full site search from there.
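
Here's a rough sketch of what those two stages might look like in Python. Everything in it is an assumption on my part: the domains and homepage text in HOMEPAGES are made up (the real text would come from scraping), the helper names relevant_sites and duckduckgo_url are hypothetical, and I haven't confirmed that DuckDuckGo accepts several site: filters joined with OR, so that part of the query syntax would need checking.

```python
from urllib.parse import urlencode

# Hypothetical sample data: domain -> scraped homepage text.
HOMEPAGES = {
    "example-cooking.com": "Recipes and cooking tips for busy weeknights",
    "example-crypto.org": "News about crypto, tokens and browsers",
    "example-garden.net": "Gardening guides, plus a few recipes from the garden",
}


def relevant_sites(query, homepages, limit=5):
    """Stage 1: rank domains by how many query words appear on their homepage."""
    words = set(query.lower().split())
    scored = []
    for domain, text in homepages.items():
        tokens = set(text.lower().split())
        score = len(words & tokens)
        if score > 0:
            scored.append((score, domain))
    # Highest score first; ties are broken alphabetically by domain.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [domain for _, domain in scored[:limit]]


def duckduckgo_url(query, domains):
    """Stage 2: build a DuckDuckGo search URL restricted to the given domains."""
    # ASSUMPTION: joining site: filters with OR is unverified query syntax.
    site_filter = " OR ".join(f"site:{d}" for d in domains)
    return "https://duckduckgo.com/?" + urlencode({"q": f"{query} {site_filter}"})


if __name__ == "__main__":
    sites = relevant_sites("recipes", HOMEPAGES)
    print(sites)
    print(duckduckgo_url("recipes", sites))
```

The limit parameter is there because of the suspected cap on how many sites one search can cover. Simple word overlap is only a stand-in for the first stage; a real version would probably want a proper full-text index.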

Text and Network Analysis

It'd be cool to do text and network analysis on the network of Brave publishers. For instance, this could let us see (via diagrams like the ones shown here) which words or phrases are often mentioned on websites that are in the network and how the network's changing over time. Note that this idea also depends on scraping all the websites, but hopefully the homepages alone would be sufficient.
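
As a starting point for the text side of this, here's a minimal sketch that counts how many sites mention each word on their homepage, and that compares two snapshots of the publisher list to see what changed. The function names, the tiny stopword list, and the sample data are all hypothetical; real input would be the scraped homepages and dated copies of the URL list.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "for", "on"}


def document_frequency(homepages):
    """Count how many sites mention each word at least once."""
    counts = Counter()
    for text in homepages.values():
        # Use a set so a word counts once per site, however often it repeats.
        words = set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS
        counts.update(words)
    return counts


def snapshot_diff(old_domains, new_domains):
    """Return (added, removed) domains between two snapshots of the list."""
    old, new = set(old_domains), set(new_domains)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    pages = {
        "a.example": "Privacy news and privacy tools",
        "b.example": "Tools for the open web",
        "c.example": "Open source news",
    }
    print(document_frequency(pages).most_common(3))
    print(snapshot_diff(["a.example", "b.example"], ["b.example", "c.example"]))
```

Counting per site rather than per occurrence keeps one very wordy homepage from dominating the results, which seems closer to the question of which words are common across the network.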