Mantabase is a scalable Web Crawling and Scraping API. It handles every step of a web crawling process: Proxies, Rendering, Data Extraction, Navigation, Scheduling, Storing the Datasets, and much more!
Getting Started
Auth & API Key
Authentication is done via Bearer Authentication: pass your Mantabase API Key in the Authorization header of every request.
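For example, every request carries the key in the Authorization header. Here is a minimal sketch, assuming your key is exported as the MANTABASE_API_KEY environment variable and using the crawler list endpoint documented below:
curl 'https://api.mantabase.com/1/crawlers' \
  -H "Authorization: Bearer $MANTABASE_API_KEY"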
How it works
The typical process to build a new Mantabase crawler goes as follows:
1/ Steps: Write the set of instructions on how to crawl and scrape the website
2/ Preview: Send the crawling steps to the preview endpoint while developing, to iterate and get feedback
3/ Crawler: Once you're happy with the configuration of your crawler, you can save it
4/ Run: Run the crawler, either programmatically or by setting a frequency
5/ Dataset: Retrieve the dataset, either via API or by setting an export option
Examples
curl 'https://api.mantabase.com/1/preview' \
  -H "Authorization: Bearer $MANTABASE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "steps": [
      "url(https://news.ycombinator.com/)",
      "pagination_next(a[rel=\"next\"])",
      "click_links(span.age a)",
      {
        "object(hn_post)": {
          "attributes": {
            "title": ".titleline",
            "link_url": ".titleline a @href",
            "user": "a.hnuser",
            "user_link": "a.hnuser @href",
            "score": "to_number(.score)",
            "nb_comments": "to_number(.subline > a:last-child)",
            "time_ago": ".subline a:last-child"
          }
        }
      }
    ]
  }'
Preview
The preview endpoint allows you to iterate quickly and get feedback while you're configuring your crawler.
POST api.mantabase.com/1/preview
=> { "status": "success", "steps": [...], "messages": [...] }
It takes the same parameters as regular crawlers and:
- has a more verbose output to help you iterate
- only runs the first page of each navigation/pagination loop, to return a sample dataset quickly
Once you're happy with the configuration of your crawler, you can save it.
Crawlers
A Crawler is the set of instructions on how to crawl and scrape a website, when to crawl it, and where to export the dataset.
Set
After iterating on your crawler configuration in preview, you can save the crawler.
POST api.mantabase.com/1/crawlers/{crawler_ID?}
params = { "steps": [...], "frequency": {...}, ... }
=> { "status": "success", "crawler_id": 12345 }
This creates a new crawler if crawler_ID is omitted or new, and updates the existing crawler if you provide its crawler_ID.
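For example, here is a sketch of a save request reusing the Hacker News steps from the preview example and adding an optional daily frequency:
curl 'https://api.mantabase.com/1/crawlers' \
  -H "Authorization: Bearer $MANTABASE_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "steps": [
      "url(https://news.ycombinator.com/)",
      { "object(hn_post)": { "attributes": { "title": ".titleline" } } }
    ],
    "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" }
  }'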
Get
Get a crawler and its configuration
GET api.mantabase.com/1/crawlers/{crawler_ID}
=> { "crawler_id": 12345, steps: [...] }
Delete
Delete a crawler
DELETE api.mantabase.com/1/crawlers/{crawler_ID}
=> { "status": "success" }
List
List all saved crawlers
GET api.mantabase.com/1/crawlers
=> [{ "crawler_id": 12345, steps: [...] }]
Runs
A Run of a crawler is one crawl of a website.
Most crawler runs are initiated by the scheduler, but you can also trigger a crawler run programmatically.
Run
POST api.mantabase.com/1/crawlers/{crawler_ID}/run
=> [{ "status": "success", "run_id": 12345 }]
List Runs
Check all of the past and ongoing runs performed by a crawler
GET api.mantabase.com/1/crawlers/{crawler_ID}/runs
=> [{ "run_id": 12345, "status": "success", ... }, ...]
Datasets
Every Crawler Run is automatically stored by Mantabase in a Dataset. For example, a crawler with the configuration:
{ "steps": [
"url(http://wikipedia.org)",
{ "object(articles)[div.article]": { "attributes": { "title": "h2" } } }
],
"frequency": {
"every": "day",
"at_time": ["0800"]
} }Will generate a dataset called articles, and store a new set of objects for each daily run.
List Datasets
GET api.mantabase.com/1/datasets/
=> [{ "name": "products", "runs": 123 }, ...]
List Dataset Runs
List all crawler runs that have retrieved objects for this dataset
GET api.mantabase.com/1/datasets/{dataset_name}/runs
=> [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...]
Get Latest Objects
Retrieve the objects extracted by the latest successful run of the crawler.
GET api.mantabase.com/1/datasets/{dataset_name}/latest
=> { "products": [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...] }
Get Objects
Retrieve the objects extracted by a given run of the crawler.
GET api.mantabase.com/1/datasets/{dataset_name}/{run_id}
=> { "products": [{ "run_id": 12345, "objects": 12345, "crawled_at": "2022-12-06T13:21:37Z" }, ...] }
Steps
Steps are the instructions telling the crawler what to do on the website you're crawling: which pages to visit, buttons to click, objects to extract...
The steps parameter is an array; each step is executed consecutively.
"steps": [
"url(http://wikipedia.org)",
{ "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open the page wikipedia.org
2/ Extract the title (h1) from this page
Each step can represent multiple pages, all running in parallel.
"steps": [
{ "urls()": { urls: ["wikipedia.org/wiki/Web_crawler", "wikipedia.org/wiki/API"] } },
{ "object(wiki)": { "attributes": { "title": "h1" } } }
]
The crawler will:
1/ Open both of the wikipedia pages in parallel browsers
2/ Extract the title (h1) of both pages
All steps
| Parameter | Description |
|---|---|
| url | Load one page |
| urls | Load a list of pages |
| object | Extract one object |
| objects | Extract a list of objects |
| click | Click on an element |
| click_links | Click on a list of links and open all of the pages |
| pagination_numbers | Paginate over a list of pages using page numbers |
| pagination_next | Paginate by recursively clicking on a 'next page' button |
| pagination_load | Load the full page by recursively clicking on a 'load more' button |
| pagination_scroll | Infinite scroll to load the full page |
| for | Provide a list of values to loop over |
| input | Fill an input or textarea |
| js_code | Run custom JS code |
| wait | Wait for a duration in milliseconds |
url
Load one page
"url(www.example.com)"
urls
Load a list of pages
{ "urls()": { urls: ["www.example.com/1", "www.example.com/2"] } }object
Extract one object
{ "object(object_name)": { "attributes": {...} } }More on object extraction on the dedicated section.
objects
Extract a list of objects
{ "object(object_name)[scope_selector]": { "attributes": {...} } }More on object extraction on the dedicated section.
click
Click on an element
"click(css_selector)"
click_links
Click on a list of links and open all of the pages
"click_links(css_selector)"
pagination_numbers
Paginate over a list of pages using page numbers. There are two ways to use this function, depending on where the page number appears in the URL.
Page number in a URL parameter
{ "pagination_numbers()": {
"param": "page",
"from": 1,
"to": 100
} }
=> Will crawl the URLs: current_url/?page=1, current_url/?page=2, ..., current_url/?page=100
Page number as part of the URL
{ "pagination_numbers()": {
"url": "http://example.com/$page_number",
"from": 1,
"to": 100
} }
=> Will crawl the URLs: example.com/1, example.com/2, ..., example.com/100
To use the current host or URL, you can use the variables $current_URL or $current_host:
"url": "$current_host/$page_number" => example.com/1... "url": "$current_URL/$page_number" => example.com/my-product/1...
In both cases, you can also use CSS selectors to extract the from and to values.
"from": "ul.pagination li:first-child", "to": "ul.pagination li:last-child"
pagination_next
Paginate by recursively clicking on a "next page" button
"pagination_next(css_selector)"
pagination_load
Load the full page by recursively clicking on a "load more" button
"pagination_load(css_selector)"
pagination_scroll
Infinite scroll to load the full page
"pagination_scroll()"
for
Provide a list of values to loop over
{ "for($variable_name)": { "values": [string|number|objects] } }"You can use it to specify a list of values
{ "for($categories)": { "values": ["Phones", "Laptops"] } },
"input(#search, $categories)"It also works with objects
{ "for($accounts)": { "values": [{ "login": "a", "pwd": "a" }, { "login": "b", "pwd": "b" }] } },
"input(#login, $accounts.login)",
"input(#login, $accounts.pwd)"input
Fill an input or textarea
"input(selector, value)"
js_code
Run custom JS code
"js_code(console.log("test");)"wait
Wait for a duration in milliseconds
"wait(duration)"
Extract Objects
One object
"object(object_name)": {
"attributes": {
"attribute_name": "selector"
...
}
}
For example:
"object(product)": {
"attributes": {
"name": "div.product span.name",
"price": "div.product span.price",
...
}
}
=> { "name": Excalibur", price": "42$" }
A list of objects
"object(object_name)[scope_selector]": { // scope_selector is the selector for one object of the list
"attributes": {
"attribute_name": "selector", // Mantabase will search for this selector within scope_selector
...
}
}
For example:
"object(product)[div.product]": {
"attributes": {
"name": "span.name",
"price": "span.price",
...
}
}
=> [{ "name": Excalibur", price": "42$" }, { "name": Durendal", price": "48$" }, ...]
@attribute
By default, the engine extracts the innerText value of the element matching the CSS selector provided.
You can get an HTML attribute by specifying @name_of_attribute at the end of the selector.
"logo_image": "img#logo @src"
=> Extract the value of the logo's "src" attribute.
List of values
To capture an array of values for an attribute, suffix the attribute name with []
"titles[]": "span.title" => Extracts all titles: { "titles": ["title1", "title2", ...]} "title": "span.title" => Extracts the first title found: { "titles": "title1" }
Formatting
To format the value of an attribute you are extracting, use one of the following functions.
| Parameter | Description |
|---|---|
| to_number | Cast to a number |
| remove | Remove a string |
| replace | Replace a string |
| prepend | Add a string at the beginning |
| append | Add a string at the end |
to_number
Captures the first number present in a string, and transforms it into an integer or float automatically.
"price": "to_number(span.price)"
=> "42$" => 42
=> "for sale at 42,50 €" => 42.5remove
Remove a string
"price": "remove(span.price, $)"
=> "42$" => "42"replace
Replace a string
"price": "replace(span.price, $, €)"
=> "42$" => "42€"prepend
Add a string at the beginning
"domain": "prepend(span.domain, www.)"
=> "example.com" => "www.example.com"append
Add a string at the end
"domain": "append(span.domain, /)"
=> "example.com" => "example.com/"Chaining
Formatting functions can be chained
"price": "append(to_number(span.price), $)"
=> "price: 42" => "42$" Step Conditions
On every step, you can add conditions specifying how it is going to be executed.
| Parameter | Description |
|---|---|
| do_if | Execute this step only if true |
| fail_if | Fail and retry if true |
| skip_next_if | Skip the next steps if true |
| optional | Execute next steps even if this step fails |
| limit | Limit the size of iteration loops |
These parameters take checks that return booleans.
| Parameter | Description |
|---|---|
| boolean | true or false |
| visible | A DOM element is visible |
| !visible | A DOM element is not visible |
do_if
Execute this step only if true
{ "click(.show_all)": { "do_if": "!visible(span:text("No more results"))" } }fail_if
Fail and retry if true
{ "js_code(show_cart_info();)": { "fail_if": "visible(span:text(HTTP Status 405 – Method Not Allowed))" } }skip_next_if
Skip the next steps if true
{
"object(product)": {
"attributes": {
"name": "div.name",
},
"skip_next_if": "visible(div.name)"
}
}
optional
Execute next steps even if this step fails
{ "click(.accept_cookies)": { "optional": true } },limit
For all steps that perform a loop, reduce the maximum number of iterations performed by the crawler.
{ "click_links(.category)": { "limit": 100 } },This applies to: objects click_links for urls pagination_numbers pagination_next pagination_load and pagination_scroll.
Geoloc
Our IP proxies are entirely automated and configured in the background, but you can set the country of the proxies we use with the geoloc parameter.
| Parameter | Description |
|---|---|
| geoloc | Country of the IP Proxies |
geoloc
{
"steps": [...],
"geoloc": "FR" => Sets the IPs to France only
}
Scheduling
Set a run frequency for the crawler
| Parameter | Description |
|---|---|
| frequency | Set a run frequency for the crawler |
Daily
{
"frequency": {
"every": "day",
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Weekly
{
"frequency": {
"every": "week",
"at_weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Monthly
{
"frequency": {
"every": "month",
"at_monthday": [1, 2, 3, ..., 31],
"at_time": ["0800", "1430"],
"timezone": "UTC"
}
}
Export
After every successful run of a crawler, the data extracted will be stored in a Dataset. But you can also export it.
| Parameter | Description |
|---|---|
| export | Destination to export the dataset after every run |
{
"export": ["gcp(bucket_name)", "s3(bucket_name)"]
}
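Putting it all together, here is a sketch of a full crawler configuration combining steps, scheduling, geoloc, and export (the URL, selectors, and bucket name are placeholders):
{
  "steps": [
    "url(https://example.com/products)",
    "pagination_next(a.next)",
    { "object(products)[div.product]": { "attributes": { "name": "span.name", "price": "to_number(span.price)" } } }
  ],
  "frequency": { "every": "day", "at_time": ["0800"], "timezone": "UTC" },
  "geoloc": "FR",
  "export": ["s3(bucket_name)"]
}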