# ALEXANDRIA snapshots server

Simple REST access to Web archive snapshots and their contents.

## Usage

The service currently runs on the Hadoop master on port 8888. Please ensure
that you can access the server through the firewall.

### Snapshots endpoint `/snapshots`

The snapshots endpoint returns the crawl time and archive location for all
snapshots of each given URL. Pass one or more URLs as repeated `url` query
parameters.

    GET /snapshots?url=url_1&url=url_2& ... &url=url_n

The result is a JSON array containing one array for each URL, in the order given
by the parameters. Each of those arrays contains zero or more objects with the
following properties:

| Name           | Description                                             |
|----------------|---------------------------------------------------------|
| url            | The actual URL (string)                                 |
| crawlTime      | As given in the WARC header (ISO date string)           |
| warcFile       | File name of the WARC containing this snapshot (string) |
| warcFileOffset | Start offset into the WARC file (long)                  |
| length         | File size (long)                                        |
| mimeType       | Content MIME type (string)                              |
| signature      | Content signature (string)                              | 
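
As an illustration of the request format described above, here is a minimal Python sketch of a client. The host name `hadoop-master` and the helper names `snapshots_url` and `fetch_snapshots` are assumptions made for this example and are not part of the server; substitute the address of your own installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder host; replace with the address of your Hadoop master.
BASE_URL = "http://hadoop-master:8888"


def snapshots_url(urls, base_url=BASE_URL):
    """Build a /snapshots request with one `url` parameter per requested URL."""
    query = urlencode([("url", u) for u in urls])
    return f"{base_url}/snapshots?{query}"


def fetch_snapshots(urls, base_url=BASE_URL):
    """Return a dict mapping each requested URL to its list of snapshot objects."""
    with urlopen(snapshots_url(urls, base_url)) as response:
        result = json.load(response)
    # The outer array follows the order of the `url` parameters,
    # so it can be paired with the input list position by position.
    return dict(zip(urls, result))
```

Calling `fetch_snapshots(["http://example.org/"])` would then give a dictionary whose single value is the (possibly empty) list of snapshot objects for that URL.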

### Content endpoint `/content`

The content endpoint returns the HTTP headers and content for one or all
snapshots of each given URL. Pass one or more URLs as repeated `url` query
parameters.

    GET /content?url=url_1&url=url_2& ... &url=url_n&crawlTime=timestamp

When `crawlTime` is specified (as `yyyy-MM-dd'T'HH:mm:ss`), only the snapshot
closest to that date is retrieved.
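
The `crawlTime` pattern `yyyy-MM-dd'T'HH:mm:ss` corresponds to `%Y-%m-%dT%H:%M:%S` in Python's `strftime` notation. The following sketch builds a `/content` request from a `datetime`; the helper name `content_url` and the default host are assumptions for illustration only, and the time zone the server applies to the timestamp is not specified here.

```python
from datetime import datetime
from urllib.parse import urlencode

# Python equivalent of the Java-style pattern yyyy-MM-dd'T'HH:mm:ss
CRAWL_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def content_url(urls, crawl_time=None, base_url="http://hadoop-master:8888"):
    """Build a /content request; `crawl_time` is an optional datetime."""
    params = [("url", u) for u in urls]
    if crawl_time is not None:
        # Only add the parameter when a target date is requested.
        params.append(("crawlTime", crawl_time.strftime(CRAWL_TIME_FORMAT)))
    return f"{base_url}/content?{urlencode(params)}"
```

Without `crawl_time`, the URL carries only the `url` parameters, which requests all snapshots.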

The result is a JSON array containing one array for each URL, in the order given
by the parameters. Each of those arrays contains zero or more objects with the
following properties:

| Name           | Description                                                   |
|----------------|---------------------------------------------------------------|
| originalUrl    | The actual URL (string)                                       |
| crawlTime      | As given in the WARC header (ISO date string)                 |
| status         | HTTP status code (integer)                                    |
| mimeType       | Content MIME type as given by the WARC header (string)        |
| headers        | HTTP headers (JSON object with string keys and values)        |
| content        | HTTP payload (string for text types, Base64 string otherwise) |
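
Because `content` is either plain text or Base64, a client has to decide how to decode it. The sketch below shows one way to do that. It assumes that "text types" means MIME types beginning with `text/`, which this document does not state explicitly; check the server's behaviour before relying on it. The function names are hypothetical.

```python
import base64


def is_text(mime_type):
    # Assumption: the server treats MIME types starting with "text/" as text.
    return mime_type is not None and mime_type.lower().startswith("text/")


def payload(snapshot):
    """Return str for text payloads and bytes for Base64-encoded ones."""
    content = snapshot["content"]
    if is_text(snapshot.get("mimeType")):
        return content
    # Everything else is expected to arrive as a Base64 string.
    return base64.b64decode(content)
```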

## Setup

To start your own copy of the server, follow these steps:

1. Get the source:

        git clone https://git.l3s.uni-hannover.de/gossen/snapshots-server.git

2. Create a text file `application.properties` with the following contents (adapt to your data):

        # directory containing a ZipNum index (file/HDFS)
        cdxPath=hdfs:///user/gossen/ia-de-zipnum/
        # root directory of the WARC/ARC files
        warcRoot=hdfs:///data/ia/w/de/
        # port for the REST service
        server.port=8888

3. If necessary, edit `src/main/resources/core-site.xml` to adapt to your HDFS server.

4. Start by running `mvn spring-boot:run`.

The code assumes that WARC/ARC files are partitioned by the first part of their file name. For
example, the file called `TB-151295-000000.arc.gz` is expected to be in `$warcRoot/TB/`. If you
use a different scheme, create an instance of `de.l3s.gossen.snapshots.LocationResolver` and
register it in `de.l3s.gossen.snapshots.Server`.
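
To make the default layout concrete, the mapping from file name to location can be sketched in a few lines of Python. This is not the server's `LocationResolver` code. It only illustrates the rule, and it assumes that the "first part" of the file name is the text before the first hyphen. The function name `resolve_warc_location` is hypothetical.

```python
import posixpath


def resolve_warc_location(warc_root, file_name):
    """Map a WARC/ARC file name to its expected path under `warc_root`."""
    # Assumption: the partition key is everything before the first hyphen.
    prefix = file_name.split("-", 1)[0]
    return posixpath.join(warc_root, prefix, file_name)
```

With the example configuration above, `resolve_warc_location("hdfs:///data/ia/w/de/", "TB-151295-000000.arc.gz")` yields a path inside `hdfs:///data/ia/w/de/TB/`.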