Commit Graph

9 Commits

Author SHA1 Message Date
Claude
bae43e4679 Fix HTML selectors based on actual page structure
Ruliweb changes:
- Use tr.table_body instead of .table_body tr
- Correct selectors: .subject .subject_link, .writer a, .hit, .recomd, .num_reply, .time
- Remove icons/reply count from title text
- Add debug logging for parsed posts

Arcalive changes:
- Use a.vrow instead of .vrow
- Correct selectors: .col-title .title, .col-author .user-info span[data-filter],
  .col-view, .col-rate, .col-time time
- Add debug logging for parsed posts

Both crawlers now match actual HTML structure from pages
2025-11-15 18:23:12 +00:00
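
The selectors listed in commit bae43e4679 could be collected into one typed map per site, so that a selector fix touches data rather than parsing code. The following is a minimal sketch under assumptions: the `SelectorMap` type, its field names, and the `scoped` helper are hypothetical and do not come from the repository. Only the selector strings are taken from the commit message, and the pairing of `.hit`/`.col-view` with views and `.recomd`/`.col-rate` with rating is a guess from the class names.

```typescript
// Hypothetical shape; field names are illustrative, not taken from the repository.
interface SelectorMap {
  row: string;      // matches one element per post
  title: string;
  author: string;
  views: string;
  rating: string;
  time: string;
  replies?: string; // only listed for Ruliweb in the commit message
}

const ruliwebSelectors: SelectorMap = {
  row: "tr.table_body",
  title: ".subject .subject_link",
  author: ".writer a",
  views: ".hit",
  rating: ".recomd",
  replies: ".num_reply",
  time: ".time",
};

const arcaliveSelectors: SelectorMap = {
  row: "a.vrow",
  title: ".col-title .title",
  author: ".col-author .user-info span[data-filter]",
  views: ".col-view",
  rating: ".col-rate",
  time: ".col-time time",
};

type Field = Exclude<keyof SelectorMap, "row">;

// Build a page-level selector by scoping a field selector to the row selector.
function scoped(map: SelectorMap, field: Field): string | undefined {
  const selector = map[field];
  return selector === undefined ? undefined : `${map.row} ${selector}`;
}
```

The row change matters because `tr.table_body` matches `tr` elements that carry the class themselves, while `.table_body tr` matches `tr` elements nested inside some other element with that class.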
Claude
dd4345267a Add HTML structure debugging script
- Create debug-html.ts to save actual HTML from pages
- Add npm run debug command to package.json
- Helps identify correct CSS selectors for crawlers
- Saves HTML to debug-ruliweb.html and debug-arcalive.html

Usage: npm run debug
2025-11-15 17:57:35 +00:00
Claude
ae85dcbd87 Add Puppeteer support to bypass bot detection
Major changes:
- Install puppeteer, puppeteer-extra, puppeteer-extra-plugin-stealth
- Create PuppeteerFetcher class with Stealth plugin
- Update all crawlers to use Puppeteer instead of Axios
- Add browser lifecycle management (init/close)
- Update test.ts and index.ts with browser cleanup

Features:
- Real Chrome browser execution (bypasses TLS fingerprinting)
- Stealth plugin to avoid bot detection
- Headless mode for background operation
- Proper error handling and browser cleanup

Limitations:
- Requires Chrome/Chromium installation
- Higher resource usage (~200MB memory)
- Slower than Axios (browser startup time)
- Cannot test in current environment (Chrome install blocked)

Next steps:
- Test in local environment with Chrome installed
- Adjust HTML selectors based on actual page structure
- Monitor for Cloudflare blocks
2025-11-15 17:39:43 +00:00
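
The browser lifecycle management (init/close) and cleanup described in commit ae85dcbd87 can be sketched with a try/finally wrapper. This is not the repository's `PuppeteerFetcher`; `withBrowser`, `fetchHtml`, `BrowserLike`, and `PageLike` are hypothetical names. The two interfaces cover only the methods used here, which Puppeteer's `Browser` and `Page` objects also provide, so the sketch runs without Puppeteer installed.

```typescript
// Minimal structural types covering only the calls used below.
interface PageLike {
  goto(url: string): Promise<unknown>;
  content(): Promise<string>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Hypothetical helper: launch a browser, run the task, and always close the browser.
async function withBrowser<T>(
  launch: () => Promise<BrowserLike>,
  task: (browser: BrowserLike) => Promise<T>,
): Promise<T> {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs whether the task succeeded or threw
  }
}

// Load one URL and return the rendered HTML.
async function fetchHtml(browser: BrowserLike, url: string): Promise<string> {
  const page = await browser.newPage();
  await page.goto(url);
  return page.content();
}
```

With Puppeteer installed, `launch` could plausibly be `() => puppeteer.launch({ headless: true })`. Putting the cleanup in `finally` means a failed crawl does not leave a Chrome process running.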
Claude
d62867e0cb Add URL endpoint testing script
- Test different Ruliweb URLs (search, board, best, main)
- Result: All endpoints return 403 "Access denied"
- Indicates that Ruliweb blocks plain HTTP client requests on every tested endpoint
- Suggests that a real browser (Puppeteer/Selenium) is required
2025-11-15 17:34:10 +00:00
Claude
1ccbc17b79 Improve fetcher with browser-like headers and cookie handling
- Add cookie jar for session management
- Include sec-ch-ua and Sec-Fetch-* headers (Chrome-like)
- Add HTTPS agent with keepAlive
- Log 403 response body for debugging

Result: Still blocked by TLS fingerprinting
- Both Ruliweb and Arcalive return "Access denied"
- Need Puppeteer to bypass advanced bot detection
2025-11-15 17:28:23 +00:00
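
The Chrome-like headers mentioned in commit 1ccbc17b79 might look roughly like the set below. This is a sketch with illustrative values, not the headers the repository actually sends: `browserLikeHeaders` is a hypothetical function, and the User-Agent, Accept-Language, and `sec-ch-ua` strings are assumptions that a real Chrome build would vary by version and platform.

```typescript
// Illustrative values only; real Chrome sends version- and platform-specific strings.
function browserLikeHeaders(chromeMajor: number): Record<string, string> {
  return {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      `(KHTML, like Gecko) Chrome/${chromeMajor}.0.0.0 Safari/537.36`,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
    // Client hints that Chromium-based browsers attach to requests
    "sec-ch-ua": `"Chromium";v="${chromeMajor}", "Google Chrome";v="${chromeMajor}"`,
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    // Fetch metadata for a top-level navigation typed into the address bar
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
  };
}
```

Headers like these change only the HTTP layer. A TLS fingerprint is derived from the handshake produced by the client's TLS stack, which is consistent with the commit's finding that the requests were still blocked.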
Claude
c5ef580534 Add crawler implementation (Node.js + TypeScript)
- Create crawler project structure
- Implement base crawler class with safety features
- Add crawlers for Ruliweb, Arcalive, DCInside
- Implement utilities: fetcher (with retry logic), logger
- Configure crawling settings (3s delay, max 20 posts/board)
- Add test script and scheduler (30min intervals)

Safety measures:
- 3 second delay between requests
- Exponential backoff retry logic
- Respect robots.txt (DCInside disabled)
- User-Agent and proper headers

Current status:
- Structure complete
- Both Ruliweb and Arcalive return 403 (bot detection)
- Need to decide: Puppeteer, switch targets, or use mock data
2025-11-15 17:18:09 +00:00
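
The exponential backoff retry logic named in commit c5ef580534 can be illustrated as follows. This is a generic sketch, not the repository's fetcher: the function names, the 1 s base delay, the 30 s cap, and the default of 3 retries are all assumptions chosen for the example.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Default wait implementation based on a timer.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call `fn` until it succeeds or `maxRetries` retries have been used up.
// `wait` is injectable so the schedule can be checked without real delays.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the last error
      await wait(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits between attempts are 1 s, 2 s, and 4 s. Doubling the delay lowers the request rate against a server that is already refusing or failing requests, which fits the commit's other safety measures.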
Claude
e8ca418817 Fix Tailwind CSS v4 configuration
- Update src/index.css to use @import "tailwindcss" (v4 syntax)
- Remove tailwind.config.js (not needed in v4)
- Tailwind styles now properly applied
2025-11-15 14:51:59 +00:00
Claude
8f7e0ee216 Initial setup: Korean community aggregator web app
- Set up Vite + React + TypeScript project
- Configure Tailwind CSS v4 with PostCSS
- Create project structure (components, types, data)
- Implement core features:
  - Header with search functionality
  - PostCard component for displaying posts
  - PostList with community filtering (All/DC/Ruliweb/Arca)
  - PostModal for detailed post view
- Add mock data for 3 communities (DCInside, Ruliweb, Arcalive)
- Update README with project documentation
2025-11-15 13:22:50 +00:00
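
The header search and community filtering listed in commit 8f7e0ee216 could be combined in one pure function that a PostList component calls before rendering. The sketch below is an assumption about how that might work: the `Post` shape, the `Community` values, and `filterPosts` are hypothetical, and matching the query against the title only is a simplification.

```typescript
// Hypothetical types; the repository's actual Post type is not shown in the log.
type Community = "dcinside" | "ruliweb" | "arcalive";

interface Post {
  id: string;
  community: Community;
  title: string;
  author: string;
}

// Keep posts from the selected community whose title contains the query.
// "all" disables the community filter; an empty query disables the search filter.
function filterPosts(
  posts: Post[],
  community: Community | "all",
  query: string,
): Post[] {
  const q = query.trim().toLowerCase();
  return posts.filter(
    (post) =>
      (community === "all" || post.community === community) &&
      (q === "" || post.title.toLowerCase().indexOf(q) !== -1),
  );
}
```

Keeping the filter free of React state makes it straightforward to check against the mock data before wiring it into a component.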
6e5c108269 First commit
2025-11-15 19:55:24 +09:00