LLMs

ScraperAI relies heavily on Large Language Models (LLMs) to analyze webpage content and extract relevant information. To ensure flexibility and quality, we use langchain as the core AI package.

OpenAI

By default, ScraperAI uses OpenAI's latest ChatGPT model (gpt-4-turbo-2024-04-09).

To use another OpenAI model, pass its name during initialization:

from scraperai import JsonOpenAI

model = JsonOpenAI(openai_api_key='sk-...', model_name='gpt-3.5-turbo')

You can read the total amount spent, in USD, from the model's total_cost property.

Custom models

ScraperAI can integrate any other model capable of generating JSON answers.

Note that although ScraperAI minifies HTML before sending it, the LLM must still be able to process large contexts. We recommend a model whose input size is at least roughly 16k byte-pair-encoded (BPE) tokens.
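As a rough illustration of the 16k recommendation, the sketch below estimates whether a minified page is likely to fit. It is not part of ScraperAI: the function name fits_context is hypothetical, and the ratio of about 4 characters per token is only a rule of thumb for English text, so HTML may tokenize less efficiently. The prompt and the model's answer also need room in the same window.

```python
def fits_context(html: str, context_tokens: int = 16_000,
                 chars_per_token: float = 4.0) -> bool:
    """Roughly check whether `html` fits in a model's context window.

    Assumption: ~4 characters per BPE token. This is a heuristic, not
    a real tokenizer, so treat the result as an estimate only.
    """
    estimated_tokens = len(html) / chars_per_token
    return estimated_tokens <= context_tokens
```

For an exact count, use the tokenizer that belongs to your model instead of this estimate.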

To implement a custom model, simply inherit from the following abstract class:

from langchain_core.messages import BaseMessage
from scraperai import BaseJsonLM

class YourJsonModel(BaseJsonLM):
    def invoke(self, messages: list[BaseMessage]) -> dict:
        ...
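Since invoke must return a dict, a custom model usually has to turn the raw text reply into JSON. The helper below is a minimal sketch of one way to do that using only the standard library. The name parse_json_answer is hypothetical and not part of ScraperAI; it assumes the model may wrap its JSON in extra prose or code fences.

```python
import json


def parse_json_answer(text: str) -> dict:
    """Return the first JSON object found in a model's raw text reply."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores trailing text.
            obj, _ = decoder.raw_decode(text[start:])
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace: try the next one.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")
```

Inside invoke, you would call your model with the messages and return parse_json_answer(reply_text).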

Then, pass your model to ParserAI during initialization:

from scraperai import ParserAI

parser = ParserAI(json_lm_model=YourJsonModel(), ...)

We are working on extending the list of models available out of the box.

Vision Models

Vision Models play a crucial role in ScraperAI for determining webpage types and generating descriptions of webpages.

Using a Vision Model is recommended but not mandatory. In our experiments, the vision approach proved more effective than using JSON models with HTML inputs.

OpenAI

By default, ScraperAI uses OpenAI's latest ChatGPT vision model (gpt-4-vision-preview).

To use another OpenAI model, pass its name during initialization:

from scraperai import VisionOpenAI

model = VisionOpenAI(openai_api_key='sk-...', model_name='gpt-4-1106-vision-preview')

You can read the total amount spent, in USD, from the model's total_cost property.

Custom models

You can integrate any other vision model by extending the following abstract class:

from langchain_core.messages import BaseMessage
from scraperai import BaseVision

class YourVisionModel(BaseVision):
    def invoke(self, messages: list[BaseMessage]) -> str:
        ...

Note: ScraperAI passes the image as a base64-encoded string in the image_url field of HumanMessage.
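If your vision backend expects raw image bytes rather than base64 text, you can decode the value first. The sketch below is illustrative only: decode_image_url is a hypothetical helper, and it assumes the string is either bare base64 or a data URL of the form data:image/png;base64,<payload>. Check which form your ScraperAI version actually sends.

```python
import base64


def decode_image_url(image_url: str) -> bytes:
    """Decode a base64 image string into raw bytes."""
    # Assumption: a data URL carries its payload after the first comma.
    if image_url.startswith("data:"):
        _, _, image_url = image_url.partition(",")
    return base64.b64decode(image_url)
```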