Enhanced Scraping & Automated Website Retraining

Samuel Su
on October 11, 2024
We're excited to introduce Enhanced Scraping Options and Automated Website Retraining, making chatbot training more efficient, accurate, and adaptable to dynamic website content.
🚀 Enhanced Scraping Options
We've addressed key scraping challenges to give you greater control and accuracy when training your chatbot:
- Access to Authenticated Pages
  - Train your chatbot on both public pages and authenticated pages that require login credentials.
- Cloudflare CAPTCHA Solver
  - Our new scraper bypasses Cloudflare CAPTCHAs with 100% accuracy, ensuring uninterrupted website scraping.
- Improved Content Filtering
  - Websites often contain irrelevant elements like headers, footers, and cookie pop-ups that clutter training data.
  - Our enhanced scraper now filters out these non-essential elements, so significantly more pages fit within your training limit.
  - Example: One user expanded their chatbot's training set from 600 to 3,500 pages by removing redundant content, without losing critical information.
These upgrades ensure your chatbot can train more effectively on both public and authenticated pages, delivering refined and high-quality responses.
📖 Check the Page Options in our cookbook for detailed configuration steps.
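To make the content-filtering idea above more concrete, here is a minimal Python sketch that drops headers, footers, and cookie pop-ups from a page before the text is used for training. It is an illustration only, not the scraper's actual implementation: the tag list, the `cookie` class/id hint, and the names `MainTextExtractor` and `extract_main_text` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as boilerplate in this sketch.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: cookie pop-ups carry "cookie" somewhere in their class or id.
SKIP_HINTS = ("cookie",)


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, started_skipped_region) per open element
        self.skip_depth = 0  # > 0 while inside any skipped element
        self.chunks = []     # kept text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        hint_text = " ".join(v or "" for k, v in attrs if k in ("class", "id")).lower()
        skip = tag in SKIP_TAGS or any(h in hint_text for h in SKIP_HINTS)
        self.stack.append((tag, skip))
        if skip:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for _, skipped in self.stack[i:]:
                    if skipped:
                        self.skip_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with boilerplate regions removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><header>Site menu</header>"
    '<div id="cookie-banner"><p>We use cookies</p></div>'
    "<main><h1>Pricing</h1><p>Plans start at $10.</p></main>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

Only the heading and paragraph inside `<main>` survive in this sample; the menu, cookie banner, and footer are dropped. Less text per page is what lets more pages fit within a fixed training limit.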
🔄 Automated Website Retraining
Keep your chatbot's knowledge base up to date—automatically!
How It Works
Once your chatbot is trained on website URLs, you can set up a cron job to periodically fetch new content and retrain the chatbot.
- Entry Plan: Monthly retraining
- Standard Plan & Above: Weekly or daily retraining
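As a rough illustration of how the plan tiers above could translate into a cron schedule, the Python sketch below maps a retraining frequency to a standard five-field cron expression and rejects frequencies a plan does not list. The plan keys (`entry`, `standard`), the function name `cron_for`, and the 03:00 run time are assumptions for this example, not product settings.

```python
# Standard five-field cron expressions: minute hour day-of-month month day-of-week.
# Assumption: 03:00 is an arbitrary run time chosen for illustration.
CRON_BY_FREQUENCY = {
    "daily": "0 3 * * *",    # every day at 03:00
    "weekly": "0 3 * * 1",   # every Monday at 03:00
    "monthly": "0 3 1 * *",  # the 1st of each month at 03:00
}

# Frequencies per plan, encoding only what the announcement states.
# The plan keys are hypothetical identifiers.
ALLOWED_FREQUENCIES = {
    "entry": {"monthly"},
    "standard": {"weekly", "daily"},
}


def cron_for(plan: str, frequency: str) -> str:
    """Return a cron expression for the frequency, if the plan allows it."""
    allowed = ALLOWED_FREQUENCIES.get(plan)
    if allowed is None:
        raise ValueError(f"unknown plan: {plan!r}")
    if frequency not in allowed:
        raise ValueError(f"{frequency!r} retraining is not available on the {plan!r} plan")
    return CRON_BY_FREQUENCY[frequency]


print(cron_for("entry", "monthly"))
print(cron_for("standard", "daily"))
```

Validating the frequency before building the schedule keeps a plan mismatch from silently producing a job that would never be honored.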
Customization Options
- Choose Your Crawl Type:
  - List Mode: Keeps the existing structure intact.
  - Automatic Mode & Sitemap Mode: Identify and remove outdated pages.
- Apply Page Options:
  - Extract only relevant content to improve training quality.
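One way to picture the difference between the crawl types is as a set comparison between the URLs the chatbot was last trained on and the URLs found by the latest crawl. The Python sketch below is an interpretation under stated assumptions, not the product's logic: the mode strings, the function name `plan_retraining`, and the idea that Automatic and Sitemap modes also pick up newly discovered pages are all hypothetical.

```python
from typing import Dict, Iterable, List


def plan_retraining(mode: str,
                    trained_urls: Iterable[str],
                    crawled_urls: Iterable[str]) -> Dict[str, List[str]]:
    """Decide which URLs to refresh, add, or remove on a retraining run."""
    trained = set(trained_urls)
    crawled = set(crawled_urls)

    if mode == "list":
        # List Mode: keep the existing URL list intact and only re-fetch it.
        return {"refresh": sorted(trained), "add": [], "remove": []}

    if mode in ("automatic", "sitemap"):
        return {
            "refresh": sorted(trained & crawled),  # still present on the site
            "add": sorted(crawled - trained),      # assumption: new pages are added
            "remove": sorted(trained - crawled),   # outdated pages no longer found
        }

    raise ValueError(f"unknown crawl mode: {mode!r}")


trained = ["https://example.com/", "https://example.com/old-pricing"]
crawled = ["https://example.com/", "https://example.com/pricing"]

print(plan_retraining("list", trained, crawled))
print(plan_retraining("sitemap", trained, crawled))
```

In this sample, List Mode re-fetches both trained URLs unchanged, while Sitemap Mode flags `old-pricing` for removal because the latest crawl no longer finds it.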
Scrape Credit Usage
To maintain efficiency for all users, each scraped page URL will consume scrape credits.
✅ Get Started Today
1️⃣ Enable scraping for authenticated pages in your chatbot settings.
2️⃣ Set up a cron job to automate website retraining.
3️⃣ Optimize training by configuring Page Options for cleaner, more focused content extraction.