SCRAPING AGENT DOCKER CONTAINER
Date: 2026-01-18 | Server: 46.224.121.179 | Path: /opt/scraping-agent/ | Port: 8094
WHAT IT IS
Docker container with an autonomous scraping agent:
- Performance Guard (RAM ≤ 70%, CPU ≤ 45%)
- 50-100 workers (auto-scaling)
- All the modern scrapers (crawl4ai, scrapling, playwright, selenium)
- API endpoint for control
- Isolated environment (does not affect the main server)
STRUCTURE
/opt/scraping-agent/
├── Dockerfile            # Build definition
├── docker-compose.yml    # Startup
├── scraping_agent.py     # Agent with Performance Guard
├── agent_api.py # FastAPI server
└── data/
    └── MEGA_DATABAZE_FINAL.xlsx  # 112,466 leads
INSTALLED SCRAPERS
crawl4ai # LLM-ready scraping (best for AI)
scrapling # Anti-bot adaptive scraping
playwright # Headless browsing
selenium # Browser automation
beautifulsoup4 # HTML parsing
requests # HTTP
pandas # Data processing
psutil # Resource monitoring
fastapi # API server
API ENDPOINTS
1. Health Check
GET http://46.224.121.179:8094/health
Response: {
  "status": "healthy",
  "cpu_percent": 25.3,
  "ram_percent": 62.1,
  "limits": {"cpu": 45, "ram": 70}
}
2. Status
GET http://46.224.121.179:8094/status
Response: {
  "running": true,
  "processed": 15000,
  "scraped": 12450,
  "failed": 2550,
  "cpu_percent": 42.1,
  "ram_percent": 68.3,
  "workers": 45,
  "eta_minutes": 180.5
}
3. Start Scraping
POST http://46.224.121.179:8094/start
Body: {
  "excel_file": "MEGA_DATABAZE_FINAL.xlsx",
  "max_workers": 50,
  "max_cpu": 45.0,
  "max_ram": 70.0
}
Response: {
  "status": "started",
  "urls_to_scrape": 65516,
  "max_workers": 50,
  "limits": {"cpu": 45, "ram": 70}
}
4. Stop Scraping
POST http://46.224.121.179:8094/stop
Response: {"status": "stopped"}
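The four endpoints above can be driven from a small client script. A stdlib-only sketch (the payload defaults mirror the /start body shown above; the helper names are illustrative):

```python
import json
import urllib.request

BASE = "http://46.224.121.179:8094"

def start_body(excel_file, max_workers=50, max_cpu=45.0, max_ram=70.0):
    """Build the JSON body for POST /start (defaults from this document)."""
    return {"excel_file": excel_file, "max_workers": max_workers,
            "max_cpu": max_cpu, "max_ram": max_ram}

def call(path, body=None):
    """GET when body is None, otherwise POST JSON; returns the decoded reply."""
    data = None if body is None else json.dumps(body).encode()
    req = urllib.request.Request(
        BASE + path, data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (against the live agent):
#   call("/health")
#   call("/start", start_body("MEGA_DATABAZE_FINAL.xlsx"))
#   call("/stop", {})
```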
RUNNING
Build (one-time)
ssh root@46.224.121.179
cd /opt/scraping-agent
docker build -t scraping-agent:latest .
Start
cd /opt/scraping-agent
docker compose up -d
Check logs
docker logs -f scraping-agent
Stop
docker compose down
PERFORMANCE GUARD
The agent automatically:
- monitors CPU every 10 s
- monitors RAM every 10 s
- when CPU > 45% → reduces workers by 20%
- when RAM > 70% → reduces workers by 20%
- when CPU < 30% and RAM < 60% → increases workers by 20%
- minimum 5 workers, maximum 100 workers
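The scaling rules above can be sketched as a pure function (thresholds taken from this section; the integer rounding is an assumption, the real scraping_agent.py may round differently):

```python
MIN_WORKERS, MAX_WORKERS = 5, 100
CPU_LIMIT, RAM_LIMIT = 45.0, 70.0   # shrink above either limit
CPU_LOW, RAM_LOW = 30.0, 60.0       # grow only when both are low

def adjust_workers(workers: int, cpu: float, ram: float) -> int:
    """Apply the Performance Guard scaling rules to the current worker count."""
    if cpu > CPU_LIMIT or ram > RAM_LIMIT:
        workers = workers * 8 // 10     # reduce by 20%
    elif cpu < CPU_LOW and ram < RAM_LOW:
        workers = workers * 12 // 10    # increase by 20%
    return max(MIN_WORKERS, min(MAX_WORKERS, workers))

print(adjust_workers(50, 50.0, 65.0))   # CPU over limit → 40
print(adjust_workers(50, 25.0, 55.0))   # both low → 60
```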
RESOURCE LIMITS (Docker)
deploy:
  resources:
    limits:
      cpus: '8'        # Max 8 CPU cores
      memory: 8G       # Max 8 GB RAM
    reservations:
      cpus: '2'        # Guaranteed 2 cores
      memory: 2G       # Guaranteed 2 GB
NGINX ROUTING (TODO)
Add to /etc/nginx/sites-enabled/czechai.conf:
# Scraping Agent API
location /scraping/ {
    proxy_pass http://localhost:8094/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;
}
Then:
nginx -t && systemctl reload nginx
URL: https://router.czechai.io/scraping/health
MONITORING
Real-time monitoring via the API:
watch -n 2 'curl -s http://46.224.121.179:8094/status | jq'
Or use the HTML monitor (copy agent_monitor.html):
scp C:/Users/info/Desktop/NOVY\ EC/agent_monitor.html root@46.224.121.179:/var/www/router-static/scraping-monitor.html
URL: https://router.czechai.io/web/scraping-monitor.html
TROUBLESHOOTING
Agent not running
docker ps -a | grep scraping
docker logs scraping-agent
CPU/RAM limits exceeded
- The agent reduces workers automatically
- Check other processes: top
- If needed, lower max_workers on the next /start
Build fails
- Package version conflicts
- Fix: remove pinned versions and let pip resolve the dependencies
CHANGELOG
2026-01-18:
- Created Dockerfile with crawl4ai, scrapling, playwright
- Created agent_api.py with Performance Guard
- Uploaded MEGA_DATABAZE_FINAL.xlsx (112,466 leads)
- Build in progress