Mantabase is a scalable Web Crawling and Scraping API. It handles every step of a web crawling process: Proxies, Rendering, Data Extraction, Navigation, Scheduling, Storing the Datasets, and much more!
Getting Started
Auth & API Key
Authentication is done via Bearer Authentication: pass your Mantabase API Key in the Authorization header of every request.
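For example, every request carries the key in the Authorization header. Here is a minimal sketch, assuming your key is exported as the MANTABASE_API_KEY environment variable and using the crawler list endpoint documented below:
curl 'https://api.mantabase.com/1/crawlers' \
  -H "Authorization: Bearer $MANTABASE_API_KEY"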
How it works
The typical process to build a new Mantabase crawler goes as follows:
1/ Steps: Write the set of instructions on how to crawl and scrape the website
2/ Preview: Send the crawling steps to the preview endpoint while developing, to iterate and get feedback
3/ Crawler: Once you're happy with the configuration of your crawler, you can save it
4/ Run: Run the crawler, either programmatically or by setting a frequency
5/ Dataset: Retrieve the dataset, either via API or by setting an export option
Examples
curl 'https://api.mantabase.com/1/preview' \
  -H "Authorization: Bearer $MANTABASE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "steps": [
      "url(https://news.ycombinator.com/)",
      "pagination_next(a[rel=\"next\"])",
      "click_links(span.age a)",
      {
        "object(hn_post)": {
          "attributes": {
            "title": ".titleline",
            "link_url": ".titleline a @href",
            "user": "a.hnuser",
            "user_link": "a.hnuser @href",
            "score": "to_number(.score)",
            "nb_comments": "to_number(.subline > a:last-child)",
            "time_ago": ".subline a:last-child"
          }
        }
      }
    ]
  }'
Preview
The preview endpoint allows you to iterate quickly and get feedback while you're configuring your crawler.
POST api.mantabase.com/1/preview
=> { "status": "success", "steps": [...], "messages": [...] }
It takes the same parameters as regular crawlers and:
- has a more verbose output to help you iterate
- only runs the first page of each navigation/pagination loop, to return a sample dataset quickly
Once you're happy with the configuration of your crawler, you can save it.
Crawlers
A Crawler is the set of instructions on how to crawl and scrape a website, when to crawl it, and where to export the dataset.
Set
After iterating on your crawler configuration in preview, you can save the crawler.
POST api.mantabase.com/1/crawlers/{crawler_ID?}
params = { "steps": [...], "frequency": {...}, ... }
=> { "status": "success", "crawler_id": 12345 }
This creates a new crawler if crawler_ID is omitted or new, and updates the existing crawler if you provide its crawler_ID.
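For example, here is a sketch of a save request reusing the Hacker News steps from the preview example and adding an optional daily frequency:
curl 'https://api.mantabase.com/1/crawlers' \
  -H "Authorization: Bearer $MANTABASE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "steps": [
      "url(https://news.ycombinator.com/)",
      { "object(hn_post)": { "attributes": { "title": ".titleline" } } }
    ],
    "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" }
  }'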
Get
Get a crawler and its configuration
GET api.mantabase.com/1/crawlers/{crawler_ID}
=> { "crawler_id": 12345, steps: [...] }
Delete
Delete a crawler
DELETE api.mantabase.com/1/crawlers/{crawler_ID}
=> { "status": "success" }
List
List all saved crawlers
GET api.mantabase.com/1/crawlers
=> [{ "crawler_id": 12345, steps: [...] }]
Runs
A Run of a crawler is one crawl of a website.
Most crawler runs are initiated by the scheduler, but you can also trigger a crawler run programmatically.
Run
POST api.mantabase.com/1/crawlers/{crawler_ID}/run
=> [{ "status": "success", "run_id": 12345 }]
List Runs
Check all of the past and ongoing runs performed by a crawler
GET api.mantabase.com/1/crawlers/{crawler_ID}/runs
=> [{ "run_id": 12345, "status": "success", ... }, ...]
Datasets
Every Crawler Run is automatically stored by Mantabase in a Dataset. For example, a crawler with the configuration:
{ "steps": [
"url(http://wikipedia.org)",
{ "object(articles)[div.article]": { "attributes": { "title": "h2" } } }
],
"frequency": {
"every": "day",
"at_time": ["0800"]
} }Will generate a dataset called articles, and store a new set of objects for each daily run.
List Datasets
GET api.mantabase.com/1/datasets/
=> [{ "name": "products", "runs": 123 }, ...]
List Dataset Runs
List all crawler runs that have retrieved objects for this dataset
GET api.mantabase.com/1/datasets/{dataset_name}/runs
=> [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...]
Get Latest Objects
Retrieve the objects extracted by the latest successful run of the crawler.
GET api.mantabase.com/1/datasets/{dataset_name}/latest
=> { "products": [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...] }
Get Objects
Retrieve the objects extracted by a given run of the crawler.
GET api.mantabase.com/1/datasets/{dataset_name}/{run_id}
=> { "products": [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...] }
Steps
Steps are the instructions telling the crawler what to do on the website you're crawling: which pages to visit, buttons to click, objects to extract...
The steps parameter is an array; each step is executed consecutively.
"steps": [
"url(http://wikipedia.org)",
{ "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open the page wikipedia.org
2/ Extract the title (h1) from this page
Each step can represent multiple pages, all running in parallel.
"steps": [
{ "urls()": { urls: ["wikipedia.org/wiki/Web_crawler", "wikipedia.org/wiki/API"] } },
{ "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open both of the wikipedia pages in parallel browsers
2/ Extract the title (h1) of both pages
All steps
| Parameter | Description |
|---|---|
| url | Load one page |
| urls | Load a list of pages |
| object | Extract one object |
| objects | Extract a list of objects |
| click | Click on an element |
| click_links | Click on a list of links and open all of the pages |
| pagination_numbers | Paginate over a list of pages using page numbers |
| pagination_next | Paginate by recursively clicking on a 'next page' button |
| pagination_load | Load the full page by recursively clicking on a 'load more' button |
| pagination_scroll | Infinite scroll to load the full page |
| for | Provide a list of values to loop over |
| input | Fill an input or textarea |
| js_code | Run custom JS code |
| wait | Wait for a duration in milliseconds |
url
Load one page
"url(www.example.com)"
urls
Load a list of pages
{ "urls()": { urls: ["www.example.com/1", "www.example.com/2"] } }object
Extract one object
{ "object(object_name)": { "attributes": {...} } }More on object extraction on the dedicated section.
objects
Extract a list of objects
{ "object(object_name)[scope_selector]": { "attributes": {...} } }More on object extraction on the dedicated section.
click
Click on an element
"click(css_selector)"
click_links
Click on a list of links and open all of the pages
"click_links(css_selector)"
pagination_numbers
Paginate over a list of pages using page numbers. There are two ways to use this function, depending on where the page number appears in the URL.
Page number in a URL parameter
{ "pagination_numbers()": {
"param": "page",
"from": 1,
"to": 100
} }
=> Will crawl the URLs: current_url/?page=1, current_url/?page=2, ..., current_url/?page=100
Page number as part of the URL
{ "pagination_numbers()": {
"url": "http://example.com/$page_number",
"from": 1,
"to": 100
} }
=> Will crawl the URLs: example.com/1, example.com/2, ..., example.com/100
To use the current host or URL, you can use the variables $current_URL or $current_host:
"url": "$current_host/$page_number" => example.com/1... "url": "$current_URL/$page_number" => example.com/my-product/1...
In both cases, you can also use CSS selectors to extract the from and to values.
"from": "ul.pagination li:first-child", "to": "ul.pagination li:last-child"
pagination_next
Paginate by recursively clicking on a "next page" button
"pagination_next(css_selector)"
pagination_load
Load the full page by recursively clicking on a "load more" button
"pagination_load(css_selector)"
pagination_scroll
Infinite scroll to load the full page
"pagination_scroll()"
for
Provide a list of values to loop over
{ "for($variable_name)": { "values": [string|number|objects] } }"You can use it to specify a list of values
{ "for($categories)": { "values": ["Phones", "Laptops"] } },
"input(#search, $categories)"It also works with objects
{ "for($accounts)": { "values": [{ "login": "a", "pwd": "a" }, { "login": "b", "pwd": "b" }] } },
"input(#login, $accounts.login)",
"input(#login, $accounts.pwd)"input
Fill an input or textarea
"input(selector, value)"
js_code
Run custom JS code
"js_code(console.log("test");)"wait
Wait for a duration in milliseconds
"wait(duration)"
Extract Objects
One object
"object(object_name)": {
"attributes": {
"attribute_name": "selector"
...
}
}
For example:
"object(product)": {
"attributes": {
"name": "div.product span.name",
"price": "div.product span.price",
...
}
}
=> { "name": Excalibur", price": "42$" }
A list of objects
"object(object_name)[scope_selector]": { // scope_selector is the selector for one object of the list
"attributes": {
"attribute_name": "selector", // Mantabase will search for this selector within scope_selector
...
}
}
For example:
"object(product)[div.product]": {
"attributes": {
"name": "span.name",
"price": "span.price",
...
}
}
=> [{ "name": Excalibur", price": "42$" }, { "name": Durendal", price": "48$" }, ...]
@attribute
By default, the engine extracts the innerText value of the element matching the CSS selector provided.
You can get an HTML attribute by specifying @name_of_attribute at the end of the selector.
"logo_image": "img#logo @src"
=> Extract the value of the logo's "src" attribute.
List of values
To capture an array of values for an attribute, suffix the attribute name with []
"titles[]": "span.title" => Extracts all titles: { "titles": ["title1", "title2", ...]} "title": "span.title" => Extracts the first title found: { "titles": "title1" }
Formatting
To format the value of an attribute you are extracting, use one of the following functions.
| Parameter | Description |
|---|---|
| to_number | Cast to a number |
| remove | Remove a string |
| replace | Replace a string |
| prepend | Add a string at the beginning |
| append | Add a string at the end |
to_number
Captures the first number present in a string, and transforms it into an integer or float automatically.
"price": "to_number(span.price)"
=> "42$" => 42
=> "for sale at 42,50 €" => 42.5remove
Remove a string
"price": "remove(span.price, $)"
=> "42$" => "42"replace
Replace a string
"price": "replace(span.price, $, €)"
=> "42$" => "42€"prepend
Add a string at the beginning
"domain": "prepend(span.domain, www.)"
=> "example.com" => "www.example.com"append
Add a string at the end
"domain": "append(span.domain, /)"
=> "example.com" => "example.com/"Chaining
Formatting functions can be chained
"price": "append(to_number(span.price), $)"
=> "price: 42" => "42$" Step Conditions
On every step, you can add conditions specifying how it is going to be executed.
| Parameter | Description |
|---|---|
| do_if | Execute this step only if true |
| fail_if | Fail and retry if true |
| skip_next_if | Skip the next steps if true |
| optional | Execute next steps even if this step fails |
| limit | Limit the size of iteration loops |
These parameters take checks that return booleans.
| Parameter | Description |
|---|---|
| boolean | true or false |
| visible | A DOM element is visible |
| !visible | A DOM element is not visible |
do_if
Execute this step only if true
{ "click(.show_all)": { "do_if": "!visible(span:text("No more results"))" } }fail_if
Fail and retry if true
{ "js_code(show_cart_info();)": { "fail_if": "visible(span:text(HTTP Status 405 – Method Not Allowed))" } }skip_next_if
Skip the next steps if true
{
"object(product)": {
"attributes": {
"name": "div.name",
},
"skip_next_if": "visible(div.name)"
}
}
optional
Execute next steps even if this step fails
{ "click(.accept_cookies)": { "optional": true } },limit
For all steps that perform a loop, reduce the maximum number of iterations performed by the crawler.
{ "click_links(.category)": { "limit": 100 } },This applies to: objects click_links for urls pagination_numbers pagination_next pagination_load and pagination_scroll.
Geoloc
Our IP proxies are entirely automated and configured in the background, but you can set the country of the proxies we use with the geoloc parameter.
| Parameter | Description |
|---|---|
| geoloc | Country of the IP Proxies |
geoloc
{
"steps": [...],
"geoloc": "FR" => Sets the IPs to France only
}
Scheduling
Set a run frequency for the crawler
| Parameter | Description |
|---|---|
| frequency | Set a run frequency for the crawler |
Daily
{
"frequency": {
"every": "day",
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Weekly
{
"frequency": {
"every": "week",
"at_weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Monthly
{
"frequency": {
"every": "month",
"at_monthday": [1, 2, 3, ..., 31],
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Export
After every successful run of a crawler, the data extracted will be stored in a Dataset. But you can also export it.
| Parameter | Description |
|---|---|
| export | Destination to export the dataset after every run |
{
"export": ["gcp(bucket_name)", "s3(bucket_name)"]
}
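Putting it all together, here is a sketch of a full crawler configuration combining steps, scheduling, geoloc, and export (the URL, selectors, and bucket name are placeholders):
{
  "steps": [
    "url(https://example.com/products)",
    "pagination_next(a.next)",
    { "object(products)[div.product]": { "attributes": { "name": "span.name", "price": "to_number(span.price)" } } }
  ],
  "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" },
  "geoloc": "FR",
  "export": ["s3(bucket_name)"]
}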