3 Commits

Author SHA1 Message Date
Claude
d62867e0cb Add URL endpoint testing script
- Test different Ruliweb URLs (search, board, best, main)
- Result: All endpoints return 403 "Access denied"
- Suggests that Ruliweb blocks plain HTTP clients on every tested endpoint
- Indicates that a real browser driver (Puppeteer or Selenium) is likely required
2025-11-15 17:34:10 +00:00
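
The endpoint test described in this commit could be sketched roughly as follows. This is a minimal illustration, not the committed script: the names `Endpoint`, `ProbeResult`, `probeEndpoints`, and `summarizeProbe` are hypothetical, and the status fetcher and pause function are injected so the sketch does not assume how the project's fetcher works.

```typescript
// Hypothetical types and names; none of these come from the repository.
interface Endpoint {
  label: string; // e.g. "search", "board", "best", "main"
  url: string;
}

interface ProbeResult {
  label: string;
  url: string;
  httpStatus: number; // 0 means the request itself failed (e.g. network error)
}

// Returns the HTTP status code for a URL. Injected so the probe can wrap
// whatever fetcher the project uses, or a stub during testing.
type StatusFetcher = (url: string) => Promise<number>;

// Probe each endpoint in sequence, pausing between requests.
async function probeEndpoints(
  endpoints: Endpoint[],
  getStatus: StatusFetcher,
  pause: () => Promise<void>,
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  let isFirst = true;
  for (const endpoint of endpoints) {
    if (!isFirst) {
      await pause();
    }
    isFirst = false;
    let httpStatus = 0;
    try {
      httpStatus = await getStatus(endpoint.url);
    } catch {
      httpStatus = 0; // record the failure instead of aborting the whole run
    }
    results.push({ label: endpoint.label, url: endpoint.url, httpStatus });
  }
  return results;
}

// Split probe results into endpoints that returned 403 and everything else.
function summarizeProbe(results: ProbeResult[]): { blocked: string[]; other: string[] } {
  const blocked: string[] = [];
  const other: string[] = [];
  for (const result of results) {
    if (result.httpStatus === 403) {
      blocked.push(result.label);
    } else {
      other.push(result.label);
    }
  }
  return { blocked, other };
}
```

If every label ends up in `blocked`, that matches the outcome reported in the commit message. It shows only that this particular client is rejected on those URLs.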
Claude
1ccbc17b79 Improve fetcher with browser-like headers and cookie handling
- Add cookie jar for session management
- Include sec-ch-ua and Sec-Fetch-* headers (Chrome-like)
- Add HTTPS agent with keepAlive
- Log 403 response body for debugging

Result: Still blocked, presumably by TLS fingerprinting (browser-like headers and cookies made no difference)
- Both Ruliweb and Arcalive return "Access denied"
- Need Puppeteer to bypass advanced bot detection
2025-11-15 17:28:23 +00:00
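
For illustration, the header set and cookie handling listed in this commit might look something like the sketch below. The function and class names (`buildBrowserHeaders`, `SimpleCookieJar`) are hypothetical, and the header values (brand list, platform, language) are assumptions chosen as plausible examples, not the strings the commit actually sends.

```typescript
// Build a Chrome-like header set. All values are illustrative assumptions.
function buildBrowserHeaders(userAgent: string, chromeMajor: number): Record<string, string> {
  return {
    "User-Agent": userAgent,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
    "sec-ch-ua": `"Chromium";v="${chromeMajor}", "Google Chrome";v="${chromeMajor}"`,
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
  };
}

// A deliberately small cookie jar: it keeps only name=value pairs and ignores
// attributes such as Path, Domain, and Expires.
class SimpleCookieJar {
  private cookies = new Map<string, string>();

  // Record the cookies from a response's Set-Cookie header values.
  storeFromSetCookie(setCookieHeaders: string[]): void {
    for (const header of setCookieHeaders) {
      const pair = header.split(";")[0] ?? "";
      const eq = pair.indexOf("=");
      if (eq <= 0) {
        continue; // skip malformed entries with no name
      }
      this.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }
  }

  // Produce the value for the Cookie request header.
  toCookieHeader(): string {
    const parts: string[] = [];
    this.cookies.forEach((value, key) => {
      parts.push(`${key}=${value}`);
    });
    return parts.join("; ");
  }
}
```

HTTP headers and cookies are sent only after the TLS handshake completes, so nothing in this sketch changes how the client looks at the TLS layer. That is consistent with the commit's result that requests were still rejected.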
Claude
c5ef580534 Add crawler implementation (Node.js + TypeScript)
- Create crawler project structure
- Implement base crawler class with safety features
- Add crawlers for Ruliweb, Arcalive, DCInside
- Implement utilities: fetcher (with retry logic), logger
- Configure crawling settings (3s delay, max 20 posts/board)
- Add test script and scheduler (30min intervals)
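
As a rough sketch, the settings named above (3 s delay, 20 posts per board, 30-minute schedule) could be captured in a typed config object. The shape of `CrawlConfig`, the field names, and the helper functions are assumptions for illustration. Only the numeric values and the disabled DCInside crawler come from the commit message.

```typescript
// Hypothetical config shape; field names are not taken from the repository.
interface SiteConfig {
  enabled: boolean;
}

interface CrawlConfig {
  requestDelayMs: number;     // pause between consecutive requests
  maxPostsPerBoard: number;   // upper bound on posts collected per board
  scheduleIntervalMs: number; // time between scheduled crawl runs
  sites: Record<string, SiteConfig>;
}

const crawlConfig: CrawlConfig = {
  requestDelayMs: 3000,               // 3 second delay
  maxPostsPerBoard: 20,               // max 20 posts/board
  scheduleIntervalMs: 30 * 60 * 1000, // 30 minute intervals
  sites: {
    ruliweb: { enabled: true },
    arcalive: { enabled: true },
    dcinside: { enabled: false }, // disabled out of respect for robots.txt
  },
};

// List the sites whose crawlers should run.
function enabledSites(config: CrawlConfig): string[] {
  return Object.keys(config.sites).filter((site) => config.sites[site]?.enabled === true);
}

// Truncate a list of posts to the configured per-board limit.
function capPosts<T>(posts: T[], config: CrawlConfig): T[] {
  return posts.slice(0, config.maxPostsPerBoard);
}
```

Keeping the limits in one object means the base crawler class and the scheduler can read the same values instead of each hard-coding its own.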

Safety measures:
- 3 second delay between requests
- Exponential backoff retry logic
- Respect robots.txt (DCInside disabled)
- Send a User-Agent and standard request headers
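
The exponential backoff retry listed above can be sketched as follows. This is one common formulation, not necessarily what the fetcher does: the names `backoffDelayMs` and `withRetry`, the doubling factor, and the delay cap are all assumptions. The sleep function is injected; in Node.js it could be something like `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.

```typescript
// Delay before the next retry: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
// `attempt` is zero-based (0 = delay after the first failure).
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Run `operation` up to `maxAttempts` times, sleeping with exponential
// backoff between failed attempts. Rethrows the last error if all fail.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseMs: number,
  maxMs: number,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No sleep after the final attempt; just fall through and rethrow.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}
```

With a 1 s base and a 30 s cap, the waits are 1 s, 2 s, 4 s, and so on. Retrying is useful for transient failures such as timeouts. A 403 from bot detection is unlikely to change on retry, so a fetcher might reasonably skip retries for that status.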

Current status:
- Structure complete
- Both Ruliweb and Arcalive return 403 (bot detection)
- Need to decide: Puppeteer, switch targets, or use mock data
2025-11-15 17:18:09 +00:00