Gyubin-Han/community-crawler

Commit Graph

Author

SHA1

Message

Date

Claude

dd4345267a

Add HTML structure debugging script

- Create debug-html.ts to save actual HTML from pages
- Add npm run debug command to package.json
- Helps identify correct CSS selectors for crawlers
- Saves HTML to debug-ruliweb.html and debug-arcalive.html

Usage: npm run debug

2025-11-15 17:57:35 +00:00

Claude

ae85dcbd87

Add Puppeteer support to bypass bot detection

Major changes:
- Install puppeteer, puppeteer-extra, puppeteer-extra-plugin-stealth
- Create PuppeteerFetcher class with Stealth plugin
- Update all crawlers to use Puppeteer instead of Axios
- Add browser lifecycle management (init/close)
- Update test.ts and index.ts with browser cleanup
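
The init/close lifecycle mentioned above could look roughly like the sketch below. It is not the repository's `PuppeteerFetcher`. The names `ClosableBrowser` and `withBrowser` are hypothetical, and the launcher is injected so the pattern can be shown without a Chrome installation. The point is only that the browser is closed in a `finally` block, whether the crawl succeeds or throws.

```typescript
// Hypothetical names: the commit does not show the real lifecycle code.
// Minimal shape the helper needs from a browser object. Puppeteer's Browser
// also exposes an async close() method, so it would fit this interface.
interface ClosableBrowser {
  close(): Promise<void>;
}

/** Launches a browser, runs `work`, and closes the browser even if `work` throws. */
async function withBrowser<B extends ClosableBrowser, T>(
  launch: () => Promise<B>,
  work: (browser: B) => Promise<T>,
): Promise<T> {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    // Runs on both the success path and the error path.
    await browser.close();
  }
}
```

With Puppeteer installed, the `launch` argument would presumably be something like `() => puppeteer.launch({ headless: true })`, and `work` would hold the crawl itself.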

Features:
- Real Chrome browser execution (bypasses TLS fingerprinting)
- Stealth plugin to avoid bot detection
- Headless mode for background operation
- Proper error handling and browser cleanup

Limitations:
- Requires Chrome/Chromium installation
- Higher resource usage (~200MB memory)
- Slower than Axios (browser startup time)
- Cannot test in current environment (Chrome install blocked)

Next steps:
- Test in local environment with Chrome installed
- Adjust HTML selectors based on actual page structure
- Monitor for Cloudflare blocks

2025-11-15 17:39:43 +00:00

Claude

c5ef580534

Add crawler implementation (Node.js + TypeScript)

- Create crawler project structure
- Implement base crawler class with safety features
- Add crawlers for Ruliweb, Arcalive, DCInside
- Implement utilities: fetcher (with retry logic), logger
- Configure crawling settings (3s delay, max 20 posts/board)
- Add test script and scheduler (30min intervals)
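
The settings listed above might be collected in one place, as in this sketch. Only the values (3 s, 20 posts, 30 min) come from the commit message. The field names and the `limitPosts` helper are assumptions made for illustration.

```typescript
// Hypothetical field names; only the numeric values come from the commit message.
const crawlConfig = {
  requestDelayMs: 3 * 1000,            // 3 s pause between requests
  maxPostsPerBoard: 20,                // cap on posts collected per board
  scheduleIntervalMs: 30 * 60 * 1000,  // scheduler runs every 30 minutes
};

/** Keeps at most `max` items from a board listing (defaults to the configured cap). */
function limitPosts<T>(posts: T[], max: number = crawlConfig.maxPostsPerBoard): T[] {
  return posts.slice(0, max);
}
```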

Safety measures:
- 3 second delay between requests
- Exponential backoff retry logic
- Respect robots.txt (DCInside disabled)
- User-Agent and proper headers
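
As an illustration of the exponential backoff retry logic named above, here is a minimal sketch. The commit does not show the actual fetcher, so the function names, the 1 s base delay, the 30 s cap, and the retry count are all assumptions.

```typescript
// Hypothetical names and defaults; the real fetcher code is not shown in the commit.

/** Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Runs `task`, retrying after a failure up to `maxRetries` times with growing delays. */
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

With these defaults the waits between attempts are 1 s, 2 s, and 4 s. The fixed 3-second delay between requests is a separate measure and would be applied between successful fetches rather than inside the retry loop.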

Current status:
- Structure complete
- Both Ruliweb and Arcalive return 403 (bot detection)
- Need to decide: Puppeteer, switch targets, or use mock data

2025-11-15 17:18:09 +00:00

3 Commits