Building Robust HTML Scrapers with Jericho HTML Parser

Optimizing Performance: Speed Tips for Jericho HTML Parser in Large-Scale Projects

Parsing thousands of HTML documents or streaming large pages requires attention to memory, CPU, and I/O when using Jericho HTML Parser. Below are practical, prescriptive optimizations you can apply immediately.

1) Choose the right parsing mode

Use StreamedSource for very large inputs or continuous streams — it processes events without building whole-document structures and keeps memory low.
Use Source (in-memory) only when you need random-access element queries or OutputDocument edits.
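
As a rough illustration of the streaming mode, the sketch below walks a document with StreamedSource and collects link targets without keeping the whole page in memory. The class and method names (StreamedLinkCollector, collectHrefs) are made up for this example, and it assumes Jericho 3.2 or later on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

// Hypothetical helper: not part of Jericho itself.
final class StreamedLinkCollector {

    // Collects href values of <a> start tags while streaming through the input.
    static List<String> collectHrefs(Reader reader) {
        List<String> hrefs = new ArrayList<>();
        try (StreamedSource streamed = new StreamedSource(reader)) {
            for (Segment segment : streamed) {
                if (segment instanceof StartTag) {
                    StartTag tag = (StartTag) segment;
                    // Tag names are reported in lower case.
                    if (HTMLElementName.A.equals(tag.getName())) {
                        // Copy the value out now: streamed segments are only
                        // valid until the iterator advances.
                        String href = tag.getAttributeValue("href");
                        if (href != null) {
                            hrefs.add(href);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }
}
```

The trade-off is that streamed segments cannot be navigated like elements in a Source, so anything you need later has to be copied out during iteration.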

2) Minimize object allocation

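Reuse supporting objects where possible. A Source or StreamedSource is bound to a single document and cannot be reused, but per-thread buffers can: for example, StreamedSource.setBuffer(char[]) accepts a preallocated buffer when parsing many similar inputs.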
Avoid constructing intermediate Strings from large document regions; prefer Segment/TextExtractor methods that operate on the source directly.

3) Limit scope of searches

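Narrow your searches: use name- or attribute-specific methods such as getAllStartTags(name) or getAllElements(name) in Jericho 3.x (findAllStartTags(name) in 2.x) instead of scanning the entire document with getAllElements().
When extracting data from a known region, run the search on that region's Element, or create a Segment over the character range (new Segment(source, begin, end)), so only that part of the document is searched.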
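
The sketch below shows one way to restrict a search to a known region: find the container element once, then run the targeted query on it instead of on the whole Source. ScopedSearch and linksInside are hypothetical names, and the container id is whatever your pages actually use.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

// Hypothetical helper: not part of Jericho itself.
final class ScopedSearch {

    // Returns href values of links found inside the element with the given id.
    static List<String> linksInside(String html, String containerId) {
        Source source = new Source(html);
        List<String> hrefs = new ArrayList<>();
        Element container = source.getElementById(containerId);
        if (container == null) {
            return hrefs; // region not present in this document
        }
        // The search covers only the container's span, not the full document.
        for (Element anchor : container.getAllElements(HTMLElementName.A)) {
            String href = anchor.getAttributeValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
        return hrefs;
    }
}
```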

4) Prefer streaming extraction for simple tasks

For bulk text extraction or simple token processing, use StreamedSource or Source.getTextExtractor() with configured options (e.g., exclude scripts/styles) to avoid building element trees.
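
For plain-text extraction, a minimal sketch with TextExtractor might look like this. PlainText and extract are illustrative names; the only option set here is setIncludeAttributes(false), and the extractor is left to skip SCRIPT and STYLE content as it does by default.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

// Hypothetical helper: not part of Jericho itself.
final class PlainText {

    // Returns the readable text of the document with tags removed.
    static String extract(String html) {
        Source source = new Source(html);
        TextExtractor extractor = source.getTextExtractor()
                .setIncludeAttributes(false); // skip title/alt/etc. attribute text
        return extractor.toString();
    }
}
```

If the output is large, TextExtractor can also write to a Writer via writeTo(...) instead of building one big String.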

5) Use OutputDocument sparingly and batch edits

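OutputDocument records every modification against the original source and applies them all when the document is serialized. Register all changes on a single OutputDocument and serialize once, and where several edits touch the same region, build the replacement fragment first and apply it with a single replace.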
When only reading or extracting, avoid creating OutputDocument entirely.
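
A small sketch of batched editing: every removal is registered on one OutputDocument and the result is produced in a single pass. ScriptStripper and removeScripts are hypothetical names chosen for this example.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

// Hypothetical helper: not part of Jericho itself.
final class ScriptStripper {

    // Removes all <script> elements and serializes the document once.
    static String removeScripts(String html) {
        Source source = new Source(html);
        OutputDocument output = new OutputDocument(source);
        for (Element script : source.getAllElements(HTMLElementName.SCRIPT)) {
            output.remove(script); // only registers the change
        }
        return output.toString(); // all registered changes are applied here
    }
}
```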

6) Tune concurrency and threading

Parse documents in parallel using a bounded thread pool sized relative to CPU cores (e.g., cores × 2 for I/O-bound fetch+parse, closer to the core count for pure parsing). Keep each Source/StreamedSource confined to one thread.
Avoid shared mutable parser state; instantiate per-thread parser objects or use ThreadLocal caches for reusable items.
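
One possible shape for this, using only the JDK: a fixed-size pool where each task parses one document and shares nothing with the others. BoundedParsePool and parseAll are illustrative names, and the parseOne function stands in for whatever Jericho-based extraction you run per document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: plain JDK, no Jericho types required.
final class BoundedParsePool {

    // Runs parseOne over every document on a pool with a fixed number of threads.
    // Results are returned in the same order as the input list.
    static <T> List<T> parseAll(List<String> documents,
                                Function<String, T> parseOne,
                                int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String document : documents) {
                // Each task creates and uses its own parser objects.
                Callable<T> task = () -> parseOne.apply(document);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call might look like parseAll(pages, html -> new Source(html).getAllElements().size(), Runtime.getRuntime().availableProcessors()). Note that the thread count is bounded here but the work queue is not; for very large inputs you may want to submit in batches.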

7) Control logging and diagnostics

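Disable or limit parser logging in production to avoid I/O overhead. In Jericho 3.x, call source.setLogger(null) for one document or set Config.LoggerProvider = LoggerProvider.DISABLED globally; Jericho 2.x used Source.setLogWriter(null).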
Only enable debug features (e.g., getDebugInfo()) for problematic samples.
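
As a sketch of how logging might be switched off in Jericho 3.x, the helper below shows both the global and the per-document variant. QuietParsing and its method names are invented for this example; check them against the Jericho version you actually run.

```java
import net.htmlparser.jericho.Config;
import net.htmlparser.jericho.LoggerProvider;
import net.htmlparser.jericho.Source;

// Hypothetical helper: not part of Jericho itself.
final class QuietParsing {

    // Turns off logging for Source objects created after this call.
    static void disableGlobally() {
        Config.LoggerProvider = LoggerProvider.DISABLED;
    }

    // Creates a Source whose own logger is disabled, leaving global config alone.
    static Source quietSource(String html) {
        Source source = new Source(html);
        source.setLogger(null);
        return source;
    }
}
```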

8) Optimize I/O and network

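Stream HTTP responses directly into the parser (InputStream → Source/StreamedSource) rather than buffering entire responses to disk or to a String first. Note that Source still reads the whole stream into memory; only StreamedSource keeps memory bounded.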
Use HTTP clients that support streaming and connection reuse (keep-alive) to reduce latency.

9) Manage memory and GC pressure

Parse in chunks and release references to Source/OutputDocument immediately after use to allow GC.
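For very large batches, keep batch sizes bounded so parsed documents become unreachable quickly and are collected while still young. Explicit System.gc() calls or artificial allocation bursts are unreliable and tend to make pauses worse rather than better.
If tuning the JVM, prefer a low-pause collector such as G1 (the default since JDK 9) and cap the maximum heap (-Xmx) below available physical memory to avoid swapping.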

10) Profile and benchmark

Measure with realistic inputs: time parsing, memory, and allocation rates (e.g., Java Flight Recorder, async-profiler).
Benchmark StreamedSource vs Source for your workloads; the best choice depends on required operations (editing vs read-only extraction).
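
For a first rough comparison, a timing helper like the one below can be enough. ParseTimer and medianNanos are made-up names, and this is only a coarse sketch: for numbers you intend to rely on, a proper harness such as JMH is the safer choice.

```java
import java.util.Arrays;

// Hypothetical helper: plain JDK, coarse wall-clock timing only.
final class ParseTimer {

    // Runs the task a few times to warm up, then returns the median
    // duration in nanoseconds over the measured runs.
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Pass one Runnable that parses a representative page with Source and another that does the same work with StreamedSource, then compare the two medians on your real inputs.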

Quick checklist before production

Use StreamedSource for inputs larger than a few tens of MB or for continuous streams.
Restrict search ranges and use targeted find methods.
Batch OutputDocument edits or avoid OutputDocument when possible.
Stream HTTP responses directly into parser.
Parse concurrently with a bounded thread pool; avoid shared state.
Disable logging in production; profile with real data.

Applying these steps will reduce memory use, lower allocation rates, and speed up parsing across large-scale workloads while preserving Jericho’s robustness for malformed HTML and HTML containing server-side tags.
