Optimizing Performance: Speed Tips for Jericho HTML Parser in Large-Scale Projects
Parsing thousands of HTML documents or streaming large pages requires attention to memory, CPU, and I/O when using Jericho HTML Parser. Below are practical, prescriptive optimizations you can apply immediately.
1) Choose the right parsing mode
- Use StreamedSource for very large inputs or continuous streams — it processes events without building whole-document structures and keeps memory low.
- Use Source (in-memory) only when you need random-access element queries or OutputDocument edits.
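As a rough sketch of the streaming mode, the following counts link start tags by iterating a StreamedSource one segment at a time. It assumes Jericho 3.x (package net.htmlparser.jericho) is on the classpath; the class and method names are illustrative only, not part of the library.

```java
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class StreamedLinkCounter {

    /** Counts <a> start tags while reading the input incrementally. */
    static int countLinks(Reader html) {
        int count = 0;
        // StreamedSource yields tags, character references and plain-text
        // segments in document order; no whole-document structure is kept.
        try (StreamedSource streamed = new StreamedSource(html)) {
            for (Segment segment : streamed) {
                // Tag names are reported in lower case by Jericho.
                if (segment instanceof StartTag
                        && HTMLElementName.A.equals(((StartTag) segment).getName())) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p><a href='/one'>one</a> and <a href='/two'>two</a></p>";
        System.out.println(countLinks(new StringReader(html)));
    }
}
```

In a real pipeline the Reader would wrap a file or network stream rather than a String, so memory use stays roughly independent of document size.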
2) Minimize object allocation
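- Reuse supporting objects where possible (e.g., a per-thread char[] buffer handed to StreamedSource.setBuffer when parsing many similar inputs). A Source is bound to a single document, so create a new one per input rather than trying to reuse it.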
- Avoid constructing intermediate Strings from large document regions; prefer Segment/TextExtractor methods that operate on the source directly.
3) Limit scope of searches
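- Narrow your searches: use name- or attribute-specific methods (e.g., getAllStartTags(name) or getAllElements(name) in Jericho 3.x, which replaced the older findAll... names) instead of scanning the entire document with getAllElements().
- When extracting data from a known region, search within that Element or Segment (e.g., call getAllElements(name) on the containing element, or construct new Segment(source, begin, end)) rather than across the whole Source.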
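A minimal sketch of a scoped, name-specific search, assuming Jericho 3.x method names. The container id and the class and method names are hypothetical and chosen only for illustration.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import java.util.ArrayList;
import java.util.List;

public class ScopedLinkExtractor {

    /** Returns the href values of links inside the element with the given id. */
    static List<String> linksInside(String html, String containerId) {
        Source source = new Source(html);
        List<String> hrefs = new ArrayList<>();

        // Locate the region of interest once...
        Element container = source.getElementById(containerId);
        if (container == null) {
            return hrefs;
        }
        // ...then run a name-specific search limited to that region,
        // instead of calling getAllElements() on the whole Source.
        for (Element link : container.getAllElements(HTMLElementName.A)) {
            String href = link.getAttributeValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
        return hrefs;
    }
}
```

Links outside the container (navigation, footers) are never visited, which is where most of the saving comes from on large pages.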
4) Prefer streaming extraction for simple tasks
- For bulk text extraction or simple token processing, use StreamedSource or Source.getTextExtractor() with configured options (e.g., exclude scripts/styles) to avoid building element trees.
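For plain-text extraction, a sketch along these lines may be enough. It assumes Jericho 3.x; the class name is illustrative. Jericho's TextExtractor is documented as skipping SCRIPT and STYLE content, but it is worth confirming that behaviour against the version you deploy.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

public class PlainTextExample {

    /** Extracts readable text from markup without walking elements by hand. */
    static String extractText(String html) {
        Source source = new Source(html);
        TextExtractor extractor = source.getTextExtractor()
                // Leave attribute values (title, alt, ...) out of the output.
                .setIncludeAttributes(false);
        // The text is produced when the extractor is converted to a String.
        return extractor.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```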
5) Use OutputDocument sparingly and batch edits
- OutputDocument creates a mapped representation for replacements — minimize the number of replacements by batching changes (compute string replacements or build fragments then apply a single replace).
- When only reading or extracting, avoid creating OutputDocument entirely.
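The batching idea can be sketched as follows: register every edit against one OutputDocument, then render the result once. This assumes Jericho 3.x; the class and method names are made up for the example.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

public class ScriptStripper {

    /** Removes all script elements using a single OutputDocument. */
    static String stripScripts(String html) {
        Source source = new Source(html);
        OutputDocument output = new OutputDocument(source);

        // Register every removal first; nothing is rendered yet.
        for (Element script : source.getAllElements(HTMLElementName.SCRIPT)) {
            output.remove(script);
        }
        // Render once, after all edits have been registered.
        return output.toString();
    }
}
```

The pattern to avoid is re-parsing the output of one edit to apply the next, which repeats the parsing cost for every change.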
6) Tune concurrency and threading
- Parse documents in parallel using a bounded thread pool sized relative to the number of CPU cores (e.g., cores × 2 for IO-bound fetch+parse). Keep each Source/StreamedSource confined to one thread.
- Avoid shared mutable parser state; instantiate per-thread parser objects or use ThreadLocal caches for reusable items.
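One way to sketch the bounded-pool pattern with only the JDK is shown below. The parseOne function is a hypothetical stand-in for whatever per-document Jericho work you do (it should create its own Source or StreamedSource internally), and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedParsePool {

    /** Applies parseOne to every document on a fixed number of threads. */
    static <R> List<R> parseAll(List<String> documents,
                                Function<String, R> parseOne,
                                int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String doc : documents) {
                // Each task receives its own document; no parser state is shared.
                Callable<R> task = () -> parseOne.apply(doc);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // results keep the input order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parsing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors() * 2;
        // String::length stands in for a real parse function here.
        System.out.println(parseAll(List.of("<a>", "<bb>"), String::length, threads));
    }
}
```

The pool caps the number of threads but not the number of queued tasks; for very large batches, submit documents in slices so the queue does not hold every input at once.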
7) Control logging and diagnostics
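- Disable or limit parser logging in production to avoid I/O overhead. In Jericho 3.x this is done per document with source.setLogger(null) or globally with Config.LoggerProvider = LoggerProvider.DISABLED; the older Source.setLogWriter(null) applies only to 2.x releases.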
- Only enable debug features (e.g., getDebugInfo()) for problematic samples.
8) Optimize I/O and network
- Stream HTTP responses directly into the parser (InputStream → Source/StreamedSource) rather than buffering entire responses to disk or string.
- Use HTTP clients that support streaming and connection reuse (keep-alive) to reduce latency.
9) Manage memory and GC pressure
- Parse in chunks and release references to Source/OutputDocument immediately after use to allow GC.
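- For very large batches, bound the number of documents in flight (for example, by processing fixed-size slices) so parsed data becomes unreachable quickly and is collected cheaply; avoid forcing collections with System.gc().
- If tuning the JVM, start from G1 (the default collector on most current JDKs) and cap the maximum heap (-Xmx) below available physical memory to avoid swapping.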
10) Profile and benchmark
- Measure with realistic inputs: time parsing, memory, and allocation rates (e.g., Java Flight Recorder, async-profiler).
- Benchmark StreamedSource vs Source for your workloads; the best choice depends on required operations (editing vs read-only extraction).
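For a quick first comparison, a small timing helper like the one below can be used before reaching for a full profiler. It is a sketch only: it uses nothing beyond the JDK, the class name is illustrative, and a dedicated harness such as JMH will give more trustworthy numbers.

```java
import java.util.Arrays;
import java.util.function.Supplier;

public class ParseTimer {

    // Written on every run so the JIT cannot discard the measured work.
    static volatile Object sink;

    /** Returns the median wall-clock time of task in nanoseconds. */
    static long medianNanos(Supplier<?> task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        // Warm-up iterations give the JIT a chance to compile the hot path.
        for (int i = 0; i < warmups; i++) {
            sink = task.get();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink = task.get();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        // Middle sample; for an even number of runs this is the upper median.
        return samples[runs / 2];
    }
}
```

To compare the two modes, pass one supplier that parses a representative document with Source and another that iterates the same input with StreamedSource, and run both against real production samples rather than synthetic pages.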
Quick checklist before production
- Use StreamedSource for inputs larger than tens of MB or for continuous streams.
- Restrict search ranges and use targeted find methods.
- Batch OutputDocument edits or avoid OutputDocument when possible.
- Stream HTTP responses directly into the parser.
- Parse concurrently with a bounded thread pool; avoid shared state.
- Disable logging in production; profile with real data.
Applying these steps will reduce memory use, lower allocations, and speed parsing across large-scale workloads while preserving Jericho’s robustness for malformed HTML and HTML that contains server tags.