Gnutella Web Caching System

Copyright (c) 2002 Hauke Dämpfling, version 1.1 / 13.5.2002, http://www.zero-g.net/gwebcache/

Introduction

(ripped with many thanks from Info Anarchy's summary)
The goal of the "Gnutella Web Caching System" (the "cache") is to eliminate the "Initial Connection Point Problem" of a fully decentralized network: where do I find a first host to connect to? The cache is a program (script), placed on any web server, that stores IP addresses of hosts in the Gnutella network and URLs of other caches. Gnutella clients connect to a randomly chosen cache from their list and exchange IP addresses and URLs with it. The random selection is meant to ensure that all caches eventually learn about each other and that all caches hold relatively fresh hosts and URLs. The concept is independent of any particular Gnutella client.

Basics (Client-Side)

Interaction with the web server and cache is a series of HTTP GET requests and responses. Support for POST requests is optional and not necessary. The following specifications describe the GET requests and the expected responses, as well as the expected behavior of the script and the client. The notation url?query indicates the URL of a script with the attached query string, where "query" is a series of name=value pairs. These name/value pairs must be "URL-Encoded", as described (for example) in RFC 1738. Due to the differences between operating systems, responses can be LF-, CRLF-, or CR-terminated, but should be of Content-Type "text/*". Responses are interpreted line by line.

Tip: GET requests are easier than they may sound above: the query (the information/request you are sending the script) is simply part of the URL. For example, if the request is url?ip1=192.168.0.1:123, you simply open the following URL using whatever web functions your programming language provides:
http://www.somehost.com/path/to/script.php?ip1=192.168.0.1:123
There are only two tricky parts. The first is the "URL-Encoding": your best bet is to look for existing functions, since they have often already been written by someone and may already be part of your libraries. The second is interpreting the end-of-line characters in the responses: again, libraries often already provide functions that read responses line by line and take the end-of-line characters into account.
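
As an illustration of the tip above, here is a minimal Python sketch. The helper names build_request_url and response_lines are made up for this example and are not part of the specification. It relies on the standard library's urlencode for the "URL-Encoding" (which writes the colon in "ip:port" as %3A) and on str.splitlines, which accepts LF, CRLF, and CR line endings, among other line boundaries.

```python
from urllib.parse import urlencode


def build_request_url(script_url, params):
    """Attach URL-encoded name=value pairs to the script URL."""
    return script_url + "?" + urlencode(params)


def response_lines(body):
    """Split a response body into lines, whatever the line terminator.

    Blank lines are dropped so that an empty response yields an empty list.
    """
    return [line for line in body.splitlines() if line.strip()]


# Hypothetical script location, taken from the example URL in the text.
url = build_request_url(
    "http://www.somehost.com/path/to/script.php",
    {"ip1": "192.168.0.1:123"},
)
print(url)
print(response_lines("first\r\nsecond\rthird\n"))
```

The resulting URL can then be opened with whatever HTTP function is at hand, for example urllib.request.urlopen.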

Client-Script Interaction

Clients generally keep an internal cache of the IP addresses of known Gnutella nodes. In addition to this list, they should also keep an internal list of web caches. When making requests, a client should pick a cache from its internal list (a different one every time). Clients should remove invalid nodes and URLs from their internal caches. Combined with regular update requests, this preserves the integrity of the "network" of web caches: URLs of non-functional scripts will (should) not be submitted to the functional caches by clients, and if a cache still holds the URL of a non-functional script, that URL will soon be "phased out" by the regular update requests.

Of course, all developers should take care that the interactions as described here are strictly followed and should never release scripts or clients without proper in-house testing first, so as not to disrupt the integrity of the network.

Security Note: Clients and/or scripts may not be able to verify that nodes and URLs in their caches are valid. Clients and scripts should therefore have security measures in place against possible errors in scripts/clients, mischief, or DoS (Denial of Service) attacks. Examples include: verification of URLs by sending Ping requests, automatic removal of caches that may be behaving "strangely" (errors/invalid caches and hosts), limiting the time between requests for a single client.
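
One of the measures named above, limiting the time between requests for a single client, might be sketched as follows. This is only an illustration: the class name RequestLimiter, the in-memory dictionary, and the 60-second interval in the usage example are assumptions, since the specification does not prescribe an interval or a storage method.

```python
import time


class RequestLimiter:
    """Reject requests from a client that arrive too soon after its last one."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self.last_seen = {}         # client IP -> time of last accepted request

    def allow(self, client_ip):
        now = self.clock()
        last = self.last_seen.get(client_ip)
        if last is not None and now - last < self.min_interval:
            return False            # too soon; the accepted timestamp is kept
        self.last_seen[client_ip] = now
        return True


# Assumed interval of 60 seconds, purely for illustration.
limiter = RequestLimiter(60)
print(limiter.allow("192.168.0.1"))  # first request from this client
print(limiter.allow("192.168.0.1"))  # immediate repeat
```

A real script would also need to expire old entries from the dictionary, and a script that runs as a fresh process per request would have to keep this state in a file or similar.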

Retrieval

The client wishes to receive a list of Gnutella nodes.
Request: url?hostfile=1
Response: A return-separated list of Gnutella nodes in the format "ip:port" (numerical IPs only). The list should not be very long (around 20 nodes) and should contain only the newest entries.
OR
A redirect (HTTP code 3xx) response, indicating that the client needs to send another HTTP GET request for the file. Clients must support this method. Luckily, many standard HTTP libraries automatically follow redirects. When a client follows the redirect, it should receive a list as described above.
OR
The string "ERROR", possibly followed by more specific error information.
Client-Side: A client should send this request whenever it needs hosts to connect to. Clients should be able to handle variable sizes of lists in responses, including empty responses. Clients should remove web caches from their internal lists in case the caches return ERROR messages (or fail to respond correctly altogether) more than a few times in a row.
Server-Side: As noted above, scripts need not and should not return many hosts (only ~20). See the comment on the ERROR response in the notes for the "Update" request below.
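
A client-side parser for the hostfile response could look roughly like the following Python sketch. The function name parse_hostfile and the choice to silently skip malformed lines are assumptions made for this example; the specification only requires that lists of variable size, including empty ones, be handled.

```python
import ipaddress


def parse_hostfile(body):
    """Return a list of (ip, port) tuples, or None if the cache reported ERROR."""
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    if lines and lines[0].startswith("ERROR"):
        return None
    hosts = []
    for line in lines:
        ip_text, sep, port_text = line.rpartition(":")
        if not sep:
            continue                      # no "ip:port" separator on this line
        try:
            ip = ipaddress.IPv4Address(ip_text)   # numerical IPs only
            port = int(port_text)
        except ValueError:
            continue                      # skip anything that is not ip:port
        if 0 < port <= 65535:
            hosts.append((str(ip), port))
    return hosts


print(parse_hostfile("10.0.0.1:6346\r\n10.0.0.2:6347\n"))
print(parse_hostfile("ERROR: cache is down"))
```

Returning None for ERROR and an empty list for an empty or unusable response lets the caller tell a reported error apart from a cache that simply has no hosts.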

The client wishes to receive a list of alternate web cache URLs.
Request: url?urlfile=1
Response: A return-separated list of alternate web caches' URLs. The list should not be very long (around 20 URLs) and should contain only the newest entries.
OR
A redirect (HTTP code 3xx) response, indicating that the client needs to send another HTTP GET request for the file. Clients must support this method. Luckily, many standard HTTP libraries automatically follow redirects. When a client follows the redirect, it should receive a list as described above.
OR
The string "ERROR", possibly followed by more specific error information.
Client-Side: A client should send this request to build its internal list of caches (such as once on start up). Clients should be able to handle variable sizes of lists in responses, including empty responses. Clients should remove web caches from their internal lists in case the caches return ERROR messages (or fail to respond correctly altogether) more than a few times in a row.
Server-Side: As noted above, scripts need not and should not return many URLs (only ~20). See the comment on the ERROR response in the notes for the "Update" request below.
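
The client-side bookkeeping described above (picking a different cache each time, and dropping caches that fail several times in a row) might be sketched like this. The class name CacheList and the threshold of three consecutive failures are assumptions; the specification only says "more than a few times in a row".

```python
import random


class CacheList:
    """Internal list of web cache URLs with consecutive-failure counting."""

    def __init__(self, urls, max_failures=3):
        self.failures = {url: 0 for url in urls}
        self.max_failures = max_failures   # assumed threshold, not from the spec

    def pick(self, exclude=None):
        """Pick a random cache, avoiding `exclude` (e.g. the last one used)."""
        candidates = [u for u in self.failures if u != exclude]
        if not candidates:
            candidates = list(self.failures)   # fall back to the only cache left
        return random.choice(candidates) if candidates else None

    def record_success(self, url):
        if url in self.failures:
            self.failures[url] = 0             # the failure streak is broken

    def record_failure(self, url):
        if url not in self.failures:
            return
        self.failures[url] += 1
        if self.failures[url] >= self.max_failures:
            del self.failures[url]             # too many failures in a row


# Hypothetical cache URLs, for illustration only.
caches = CacheList(["http://a.example/gcache.php", "http://b.example/gcache.php"])
print(caches.pick(exclude="http://a.example/gcache.php"))
```

Resetting the counter on success keeps a cache that fails only occasionally in the list, which matches the "in a row" wording.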

Update

The client wishes to submit IP addresses and/or alternate web cache URLs to a cache.
Request: url?ip1=XXX.XXX.XXX.XXX:PORT&url1=http://WWW.SOMEHOST.COM/PATH/TO/SCRIPT&ip2=...&url2=...
(Reminder: Requests need to be URL-Encoded - see "Basics")
Response: The first line must be either "OK", "ERROR", or "ERROR: Message".
Following lines can be ignored by the client; the script may use them for warning messages.
Note: These two basic responses let the client know that the script is functional (to a certain extent). In other words: if anything else is returned by the web server (for example, if the response begins with <HTML>), this can be interpreted as a server error of some sort.
Client-Side: A client should send this request periodically (~every hour). For best efficiency, a client should submit only its own IP address and one alternate web cache when it updates. Clients should only send the URLs of web caches that they know to be functional!
Clients can handle the responses silently - however clients should remove web caches from their internal lists in case the caches return ERROR messages (or fail to respond correctly altogether) more than a few times in a row.
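
A minimal Python sketch of the client side of an update follows. The function names and the three result labels ("ok", "error", "invalid") are made up for this example. It submits only the client's own address and one alternate cache, as recommended above, and looks only at the first line of the response.

```python
from urllib.parse import urlencode


def build_update_url(cache_url, own_ip, own_port, other_cache_url=None):
    """Build an update request with the client's own address and one cache URL."""
    params = [("ip1", f"{own_ip}:{own_port}")]
    if other_cache_url:
        params.append(("url1", other_cache_url))
    return cache_url + "?" + urlencode(params)


def classify_update_response(body):
    """Classify a response by its first line; later lines are ignored."""
    lines = body.splitlines()
    first = lines[0].strip() if lines else ""
    if first == "OK":
        return "ok"
    if first.startswith("ERROR"):
        return "error"
    return "invalid"      # e.g. an HTML error page from the web server


# Hypothetical URLs and port, for illustration only.
print(build_update_url("http://a.example/gcache.php", "192.168.0.1", 6346,
                       "http://b.example/gcache.php"))
print(classify_update_response("OK\nWARNING: something minor"))
```

Both "error" and "invalid" would normally count toward the failures in a row after which a cache is dropped from the internal list.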
Server-Side: An OK message usually means that everything went well and the script executed normally. An ERROR message usually indicates some form of fatal error because of which the script could not do what it is supposed to. Since clients will (should) remove scripts that return error messages often, it is advised to return ERRORs only when the script is expected to be down for a while (for example, the script will be or has been removed from the server, server overload, file errors, etc.).
Since scripts only need to return a few of the newest hosts and URLs, the oldest entries should simply be removed when new entries are submitted through an update request.
Scripts may wish to check the validity of submitted URLs by sending a Ping request, but this is not required.
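
The server-side rule of dropping the oldest entries when new ones arrive could be sketched as a small bounded store. This is an illustration under stated assumptions: the class name NewestEntries, the capacity of 20 (taken from the "around 20" guideline in the "Retrieval" section), the newest-first ordering, and the handling of re-submitted entries are choices made for this example, not requirements of the specification.

```python
from collections import OrderedDict


class NewestEntries:
    """Keep only the most recently submitted hosts or URLs."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self.entries = OrderedDict()   # insertion order = age, oldest first

    def add(self, entry):
        # A re-submitted entry is treated as fresh and moved to the newest slot.
        self.entries.pop(entry, None)
        self.entries[entry] = True
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # discard the oldest entry

    def newest_first(self):
        return list(reversed(self.entries))


hosts = NewestEntries(capacity=3)
for host in ["10.0.0.1:6346", "10.0.0.2:6346", "10.0.0.3:6346", "10.0.0.4:6346"]:
    hosts.add(host)
print(hosts.newest_first())
```

A real script would persist this list between requests, for example in a file on the web server.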

Miscellaneous

A ping/pong scheme to verify that caches are active.
Request: url?ping=1
Response: The first four characters of the response are: PONG, followed by a version number string (can be omitted).
Client-Side: This system can and should be used to verify that a URL is valid and that a script is functioning correctly.
Note: Some scripts, when installed by users on their servers, may return pings correctly but fail on other requests (mostly due to file access errors and the like), so verification is not always 100% guaranteed.
Server-Side: ditto
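
Checking a ping response on the client side might look like the following sketch. The function name parse_pong is made up for this example, and the version strings in the usage lines are invented sample data.

```python
def parse_pong(body):
    """Return the version string (possibly empty) of a PONG response.

    Returns None if the response does not begin with the four characters PONG.
    """
    if not body.startswith("PONG"):
        return None
    first_line = body.splitlines()[0]
    return first_line[4:].strip()


print(parse_pong("PONG SomeCache 1.0\n"))   # sample data, not a real version
print(parse_pong("<HTML>Not found</HTML>"))
```

As the note above points out, a valid PONG only shows that the script answers pings; other requests may still fail.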

Other Responses / Extensions

Other responses that a script can send include HTML information pages, statistics, etc. For example, if no request is sent to the script (i.e. the script is simply browsed to), it could display a page informing the user that "this is a Gnutella web cache" or something similar. Or, one could include an extra request, "url?stats=1", which could display an HTML page with some statistics.

In general, script authors can include any extensions they wish, as long as the interaction described above remains unchanged. Clients need not implement any extensions, since the basic interactions will be the same.

Statistics

Statistics are regularly collected on all known GWebCache scripts. If the author of a script would like to make statistics from their script available, the following request should be implemented.

Request: url?statfile=1
Response: Line 1: Total number of requests received.
Line 2: Requests received in the last full hour.
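
Reading the two-line statistics response could be sketched as follows. The function name parse_statfile and the decision to return None for anything that is not two integers are assumptions made for this example; the numbers in the usage line are invented sample data.

```python
def parse_statfile(body):
    """Return (total_requests, requests_last_hour), or None if unparsable."""
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    if len(lines) < 2:
        return None
    try:
        return int(lines[0]), int(lines[1])
    except ValueError:
        return None       # e.g. an ERROR message or an HTML page


print(parse_statfile("1523\r\n42\r\n"))   # sample data
```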

Change Log

v1.1
- Made suggested client- and server-side behavior more specific.
- Added suggested statistics response.

v1.0
- First release.

GWebCache Home
http://www.zero-g.net/gwebcache/
Copyright (c) 2002 Hauke Dämpfling. Licensed under FDL.
See also: http://www.gnucleus.net/