Documentation

Mantabase is a scalable Web Crawling and Scraping API. It handles every step of the web crawling process: Proxies, Rendering, Data Extraction, Navigation, Scheduling, Dataset Storage, and much more!

Getting Started

Auth & API Key

Authentication is done via Bearer authentication, using your Mantabase API key in the Authorization header.
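
For example, every request carries the key in the Authorization header:

Authorization: Bearer $MANTABASE_API_KEY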

How it works

The typical process to build a new Mantabase crawler goes as follows:

  1. Steps: Write the set of instructions on how to crawl and scrape the website
  2. Preview: Send the crawling steps to the preview endpoint while developing to iterate and get feedback
  3. Crawler: Once you're happy with the configuration of your crawler, you can save it
  4. Run: Run the crawler, either programmatically or by setting a frequency
  5. Dataset: Retrieve the dataset, either via API or by setting an export option

Examples

curl 'https://api.mantabase.com/1/preview' \
-H "Authorization: Bearer $MANTABASE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "steps": [
    "url(https://news.ycombinator.com/)",
    "pagination_next(a[rel=\"next\"])",
    "click_links(span.age a)",
    {
      "object(hn_post)": {
        "attributes": {
          "title": ".titleline",
          "link_url": ".titleline a @href",
          "user": "a.hnuser",
          "user_link": "a.hnuser @href",
          "score": "to_number(.score)",
          "nb_comments": "to_number(.subline > a:last-child)",
          "time_ago": ".subline a:last-child"
        }
      }
    }
  ]
}'

Preview

The preview endpoint allows you to iterate quickly and get feedback while you're configuring your crawler.

POST  api.mantabase.com/1/preview
=> { "status": "success", "steps": [...], "messages": [...] }

It takes the same parameters as regular crawlers and:

  • has a more verbose output to help you iterate
  • only runs the first page of each navigation/pagination loop, to return a sample dataset quickly

Once you're happy with the configuration of your crawler, you can save it.

Crawlers

A Crawler is the set of instructions on how to crawl and scrape a website, when to crawl it, and where to export the dataset.

Set

After iterating on your crawler configuration in preview, you can save the crawler

POST  api.mantabase.com/1/crawlers/{crawler_ID?}
      params = { "steps": [...], "frequency": {...}, ... }
=> { "status": "success", "crawler_id": 12345 }

This will create a new crawler if crawler_ID is omitted or new, and update the existing crawler if you provide its crawler_ID.
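
For example, here is a sketch of saving a simplified version of the Hacker News crawler from the Examples section with a daily schedule, then updating it later (the schedule values are illustrative):

# Create a new crawler: no crawler_ID in the path, so a new one is created
curl 'https://api.mantabase.com/1/crawlers' \
-H "Authorization: Bearer $MANTABASE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "steps": [
    "url(https://news.ycombinator.com/)",
    { "object(hn_post)": { "attributes": { "title": ".titleline" } } }
  ],
  "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" }
}'
# => { "status": "success", "crawler_id": 12345 }

# Update that crawler later by passing its crawler_ID in the path
curl 'https://api.mantabase.com/1/crawlers/12345' \
-H "Authorization: Bearer $MANTABASE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{ "steps": [...] }'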

Get

Get a crawler and its configuration

GET  api.mantabase.com/1/crawlers/{crawler_ID}
=> { "crawler_id": 12345, "steps": [...] }

Delete

Delete a crawler

DELETE  api.mantabase.com/1/crawlers/{crawler_ID}
=> { "status": "success" }

List

List all saved crawlers

GET  api.mantabase.com/1/crawlers
=> [{ "crawler_id": 12345, "steps": [...] }]

Runs

A Run of a crawler is one crawl of a website.

Most crawler runs are initiated by the scheduler, but you can also trigger a crawler run programmatically.

Run

POST  api.mantabase.com/1/crawlers/{crawler_ID}/run
=> { "status": "success", "run_id": 12345 }

List Runs

Check all of the past and ongoing runs performed by a crawler

GET  api.mantabase.com/1/crawlers/{crawler_ID}/runs
=> [{ "run_id": 12345, "status": "success", ... }, ...]
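
For example, here is a sketch of triggering a run on demand and then checking on it, using the illustrative crawler_ID 12345 from above:

# Trigger a run of crawler 12345; the response includes the new run_id
curl -X POST 'https://api.mantabase.com/1/crawlers/12345/run' \
-H "Authorization: Bearer $MANTABASE_API_KEY"

# List that crawler's past and ongoing runs to check their status
curl 'https://api.mantabase.com/1/crawlers/12345/runs' \
-H "Authorization: Bearer $MANTABASE_API_KEY"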

Datasets

Every Crawler Run is automatically stored by Mantabase in a Dataset. For example, a crawler with the configuration:

{ "steps": [
    "url(http://wikipedia.org)",
    { "object(articles)[div.article]": { "attributes": { "title": "h2" } } }
  ],
  "frequency": {
    "every": "day",
    "at_time": ["0800"]
  } }

This crawler will generate a dataset called articles, and store a new set of objects for each daily run.
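
Once runs have completed, the stored objects can be retrieved through the dataset endpoints below. For example:

# Fetch the objects extracted by the latest successful run of the "articles" dataset
curl 'https://api.mantabase.com/1/datasets/articles/latest' \
-H "Authorization: Bearer $MANTABASE_API_KEY"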

List Datasets

GET  api.mantabase.com/1/datasets/
=> [{ "name": "products", "runs": 123 }, ...]

List Dataset Runs

List all crawler runs that have retrieved objects for this dataset

GET  api.mantabase.com/1/datasets/{dataset_name}/runs
=> [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...]

Get Latest Objects

Retrieve the objects extracted by the latest successful run of the crawler.

GET  api.mantabase.com/1/datasets/{dataset_name}/latest
=> { "products": [{ "name": "Excalibur", "price": "42$" }, ...] }

Get Objects

Retrieve the objects extracted by a given run of the crawler.

GET  api.mantabase.com/1/datasets/{dataset_name}/{run_id}
=> { "products": [{ "name": "Excalibur", "price": "42$" }, ...] }

Steps

Steps are the instructions telling the crawler what to do on the website you're crawling: which pages to visit, buttons to click, objects to extract...

The steps parameter is formatted as an array; each step is executed consecutively.

"steps": [
  "url(http://wikipedia.org)",
  { "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open the page wikipedia.org
2/ Extract the title (h1) from this page

Each step can represent multiple pages, all running in parallel.

"steps": [
  { "urls()": { "urls": ["wikipedia.org/wiki/Web_crawler", "wikipedia.org/wiki/API"] } },
  { "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open both of the wikipedia pages in parallel browsers
2/ Extract the title (h1) of both pages

All steps

Parameter           Description
url                 Load one page
urls                Load a list of pages
object              Extract one object
objects             Extract a list of objects
click               Click on an element
click_links         Click on a list of links and open all of the pages
pagination_numbers  Paginate over a list of pages using page numbers
pagination_next     Paginate by recursively clicking on a 'next page' button
pagination_load     Load the full page by recursively clicking on a 'load more' button
pagination_scroll   Infinite scroll to load the full page
for                 Provide a list of values to loop over
input               Fill an input or textarea
js_code             Run custom JS code
wait                Wait for a duration in milliseconds

url

Load one page

"url(www.example.com)"

urls

Load a list of pages

{ "urls()": { "urls": ["www.example.com/1", "www.example.com/2"] } }

object

Extract one object

{ "object(object_name)": { "attributes": {...} } }

More on object extraction in the dedicated section.

objects

Extract a list of objects

{ "object(object_name)[scope_selector]": { "attributes": {...} } }

More on object extraction in the dedicated section.

click

Click on an element

"click(css_selector)"

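For example, to dismiss a cookie banner before scraping (the .accept_cookies selector is illustrative):

"click(.accept_cookies)"
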
click_links

Click on a list of links and open all of the pages

"click_links(css_selector)"

pagination_numbers

Paginate over a list of pages using page numbers. There are two ways to use this function, depending on where the page number appears on the URL.

Page number in a URL parameter

{ "pagination_numbers()": {
    "param": "page",
    "from": 1,
    "to": 100
  } } => Will crawl the urls: current_url/?page=1, current_url/?page=2, ..., current_url/?page=100 

Page number as part of the URL

{ "pagination_numbers()": {
    "url": "http://example.com/$page_number",
    "from": 1,
    "to": 100
  } } => Will crawl the urls: example.com/1, example.com/2, ..., example.com/100 

To use the current host or URL, you can use the variables $current_URL or $current_host

"url": "$current_host/$page_number" => example.com/1... 
"url": "$current_URL/$page_number" => example.com/my-product/1... 

In both cases, you can also use CSS selectors to extract the from and to values.

"from": "ul.pagination li:first-child",
"to": "ul.pagination li:last-child"

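For example, a complete step that reads the page range from the pagination widget shown above:

{ "pagination_numbers()": {
    "param": "page",
    "from": "ul.pagination li:first-child",
    "to": "ul.pagination li:last-child"
  } }
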
pagination_next

Paginate by recursively clicking on a "next page" button

"pagination_next(css_selector)"

pagination_load

Load the full page by recursively clicking on a "load more" button

"pagination_load(css_selector)"

pagination_scroll

Infinite scroll to load the full page

"pagination_scroll()"

for

Provide a list of values to loop over

{ "for($variable_name)": { "values": [string|number|objects] } }

You can use it to specify a list of values

{ "for($categories)": { "values": ["Phones", "Laptops"] } },
"input(#search, $categories)"

It also works with objects

{ "for($accounts)": { "values": [{ "login": "a", "pwd": "a" }, { "login": "b", "pwd": "b" }] } },
"input(#login, $accounts.login)",
"input(#password, $accounts.pwd)"

input

Fill an input or textarea

"input(selector, value)"

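For example, to type a query into a search box, reusing the #search selector from the for example above:

"input(#search, Phones)"
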
js_code

Run custom JS code

"js_code(console.log('test');)"

wait

Wait for a duration in milliseconds

"wait(duration)"

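For example, to pause for three seconds before the next step:

"wait(3000)"
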
Extract Objects

One object

"object(object_name)": {
  "attributes": {
    "attribute_name": "selector"
    ...
  }
}

For example:

"object(product)": {
  "attributes": {
    "name": "div.product span.name",
    "price": "div.product span.price",
    ...
  }
}
=> { "name": "Excalibur", "price": "42$" }

A list of objects

"object(object_name)[scope_selector]": { // scope_selector is the selector for one object of the list
  "attributes": {
    "attribute_name": "selector", // Mantabase will search for this selector within scope_selector
    ...
  }
}

For example:

"object(product)[div.product]": {
  "attributes": {
    "name": "span.name",
    "price": "span.price",
    ...
  }
}
=> [{ "name": "Excalibur", "price": "42$" }, { "name": "Durendal", "price": "48$" }, ...]

@attribute

By default, the engine extracts the innerText value of the element matched by the CSS selector provided. You can get an HTML attribute instead by appending @name_of_attribute to the end of the selector.

"logo_image": "img#logo @src"
=> Extract the value of the logo's "src" attribute.

List of values

To capture an array of values for an attribute, suffix the attribute name with []

"titles[]": "span.title" => Extracts all titles: { "titles": ["title1", "title2", ...] }
"title": "span.title"    => Extracts the first title found: { "title": "title1" }

Formatting

To format the value of an attribute you are extracting, use one of the following functions.

Parameter   Description
to_number   Cast to a number
remove      Remove a string
replace     Replace a string
prepend     Add a string at the beginning
append      Add a string at the end

to_number

Captures the first number present in a string, and transforms it into an integer or float automatically.

"price": "to_number(span.price)"
=> "42$" => 42
=> "for sale at 42,50 €" => 42.5

remove

Remove a string

"price": "remove(span.price, $)"
=> "42$" => "42"

replace

Replace a string

"price": "replace(span.price, $, €)"
=> "42$" => "42€"

prepend

Add a string at the beginning

"domain": "prepend(span.domain, www.)"
=> "example.com" => "www.example.com"

append

Add a string at the end

"domain": "append(span.domain, /)"
=> "example.com" => "example.com/"

Chaining

Formatting functions can be chained

"price": "append(to_number(span.price), $)"
=> "price: 42" => "42$" 

Step Conditions

On every step, you can add conditions specifying how it is executed.

Parameter     Description
do_if         Execute this step only if true
fail_if       Fail and retry if true
skip_next_if  Skip the next steps if true
optional      Execute next steps even if this step fails
limit         Limit the size of iteration loops

These parameters take checks that return booleans.

Check     Description
boolean   true|false
visible   A DOM element is visible
!visible  A DOM element is not visible

do_if

Execute this step only if true

{ "click(.show_all)": { "do_if": "!visible(span:text('No more results'))" } }

fail_if

Fail and retry if true

{ "js_code(show_cart_info();)": { "fail_if": "visible(span:text(HTTP Status 405 – Method Not Allowed))" } }

skip_next_if

Skip the next steps if true

{
  "object(product)": {
    "attributes": {
      "name": "div.name"
    },
    "skip_next_if": "visible(div.name)"
  }
}

optional

Execute next steps even if this step fails

{ "click(.accept_cookies)": { "optional": true } },

limit

For all steps that perform a loop, reduce the maximum number of iterations performed by the crawler.

{ "click_links(.category)": { "limit": 100 } },

This applies to: objects, click_links, for, urls, pagination_numbers, pagination_next, pagination_load, and pagination_scroll.

Geoloc

Our IP proxies are entirely automated and configured in the background, but you can set the country of the proxies we use with the geoloc parameter.

Parameter  Description
geoloc     Country of the IP proxies

geoloc

{
  "steps": [...],
  "geoloc": "FR" => Sets the IPs to France only
}

Scheduling

Set a run frequency for the crawler

Parameter  Description
frequency  Set a run frequency for the crawler

Daily

{
  "frequency": {
    "every": "day",
    "at_time": ["0800", "1430"],
    "timezone": "UTC"
  }
}

Weekly

{
  "frequency": {
    "every": "week",
    "at_weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "at_time": ["0800", "1430"],
    "timezone": "UTC"
  }
}

Monthly

{
  "frequency": {
    "every": "month",
    "at_monthday": [1, 2, 3, ..., 31],
    "at_time": ["0800", "1430"],
    "timezone": "UTC"
  }
}

Export

After every successful run of a crawler, the extracted data is stored in a Dataset. But you can also export it.

Parameter  Description
export     Destination to export the dataset after every run

{
  "export": ["gcp(bucket_name)", "s3(bucket_name)"]
}
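
Putting it together, a crawler that runs daily and exports its dataset to a bucket could be saved with a configuration like the following sketch (the bucket name is illustrative):

{
  "steps": [
    "url(http://wikipedia.org)",
    { "object(articles)[div.article]": { "attributes": { "title": "h2" } } }
  ],
  "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" },
  "export": ["s3(my_bucket)"]
}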