How we built the kynetranews ingest layer
Forty-nine RSS feeds, six social platforms, four search products, and one knowledge graph. Notes on what we learned shipping it.
By Marcus Yi, Engineering Lead, Ingest
The kynetranews ingest layer is the part of SignalGrid that turns the public web into rows in our entity graph. It is the unglamorous half of the system. It is also where most of the value gets locked in or thrown away.
We built it on three principles, in this order.
First, prefer authoritative sources to convenient ones. The convenient source for film metadata is whichever scraper is easiest to write. The authoritative source is TMDB plus IMDb plus the studio press kit. We pay the cost of pulling from authoritative sources first because every downstream model inherits the noise floor of its inputs. A model trained on a corpus where 4 percent of release dates are wrong learns to discount release dates. A model trained on a corpus where 0.4 percent are wrong does not need to.
Second, normalize at write time, not at read time. Source A says "United States." Source B says "USA." Source C says "us." If you store all three and normalize at query time, every model that uses country has to ship its own normalizer, and they will diverge. We normalize once at write time, store the canonical form, and keep the original raw value in a separate provenance field for audits. Read paths are simple. Write paths absorb the complexity.
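To make the write-time contract concrete, here is a minimal sketch in the spirit of the country example. The lookup table, the UNKNOWN fallback, and the function name are illustrative, not our production schema.

```python
# Write-time normalization sketch: canonicalize once, keep the raw
# value for provenance. Table and names are illustrative.
CANONICAL_COUNTRIES = {
    "united states": "US",
    "usa": "US",
    "us": "US",
}

def normalize_country(raw: str) -> tuple[str, str]:
    """Return (canonical, raw): the canonical form goes in the
    queryable column, the raw value goes in a separate provenance
    field for audits. Read paths only ever see the canonical form."""
    canonical = CANONICAL_COUNTRIES.get(raw.strip().lower(), "UNKNOWN")
    return canonical, raw
```

Unrecognized values fall through to a sentinel rather than passing raw strings downstream, which keeps the divergence problem from reappearing at read time.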
Third, store provenance for everything. Every fact in our entity graph carries a source ID, an extracted-at timestamp, and a confidence. When two sources disagree, we do not pick a winner; we store both with their provenance and let the model see the disagreement. Disagreement is signal. Two trade outlets reporting different opening-weekend numbers tell you something a single number does not.
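A hedged sketch of what a fact record with provenance might look like. The field names, IDs, and dollar figures are invented for illustration; only the three required provenance attributes come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Fact:
    entity_id: str        # stable entity ID attached by the resolver
    attribute: str        # e.g. "opening_weekend_usd" (illustrative)
    value: str
    source_id: str        # which feed or API produced this fact
    extracted_at: datetime
    confidence: float     # extractor confidence in [0, 1]

# Two trade outlets disagree on the opening weekend: store both rows
# with their provenance rather than picking a winner. Numbers invented.
now = datetime.now(timezone.utc)
facts = [
    Fact("film:0001", "opening_weekend_usd", "81500000",
         "trade:outlet_a", now, 0.9),
    Fact("film:0001", "opening_weekend_usd", "83000000",
         "trade:outlet_b", now, 0.9),
]
```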
The implementation, briefly. We run two pools of workers. The polling pool fetches RSS feeds and structured APIs on configured cadences (most feeds at 5 minutes, the slower trades at 30, marketplace endpoints at 15). The streaming pool consumes platform firehoses and partner feeds where available. Both pools write to a typed queue that fans out to extractors, one per source schema. Extractors emit canonicalized facts, which the resolver attaches to entities by stable ID.
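The cadences above translate into something like the following configuration. The class names and the scheduling helper are assumptions for illustration; only the 5/30/15-minute intervals come from the text.

```python
import time

# Polling cadences from the text: most feeds at 5 minutes, the slower
# trades at 30, marketplace endpoints at 15. Class names are invented.
POLL_INTERVAL_SECONDS = {
    "rss_default": 5 * 60,
    "slow_trades": 30 * 60,
    "marketplace": 15 * 60,
}

def due_sources(last_polled: dict[str, float], now: float) -> list[str]:
    """Return the source classes whose cadence has elapsed."""
    return [
        name
        for name, interval in POLL_INTERVAL_SECONDS.items()
        if now - last_polled.get(name, 0.0) >= interval
    ]

# On a cold start, everything is due.
print(due_sources({}, time.time()))
```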
The resolver is the part we rewrote three times. The version we shipped uses a two-stage approach: a fast, cheap blocker that narrows candidates by name and rough type, then a learned ranker that scores the candidate matches using the full feature set. The blocker is responsible for recall. The ranker is responsible for precision. Combining them is what got us under our cross-vertical error budget of one resolution mistake per ten thousand facts.
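Schematically, the two stages compose like this. The blocking key (name prefix plus rough type) and the acceptance threshold are illustrative, and the learned ranker is stood in for by a caller-supplied scoring function.

```python
from collections import defaultdict
from typing import Callable

def build_blocks(entities: list[dict]) -> dict:
    """Index entities by a coarse key so the blocker stays cheap.
    The blocker owns recall: a match missing here is unrecoverable."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[(e["name"][:4].lower(), e["type"])].append(e)
    return blocks

def resolve(fact: dict, blocks: dict,
            score: Callable[[dict, dict], float]) -> dict | None:
    """The blocker narrows candidates; the ranker owns precision by
    rejecting anything below the threshold (0.99 here is invented)."""
    candidates = blocks.get((fact["name"][:4].lower(), fact["type"]), [])
    if not candidates:
        return None
    best_score, best = max(
        ((score(fact, e), e) for e in candidates),
        key=lambda pair: pair[0],
    )
    return best if best_score >= 0.99 else None
```

The division of labor is the point: the blocker may be arbitrarily sloppy about precision as long as the true match is in the candidate set, and the ranker never has to scale past a handful of candidates.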
The unsexy lessons. Latency in ingest is not the bottleneck for prediction freshness; resolution latency is. Storage is cheaper than recomputation, so we keep the raw fetch payloads forever. The hardest entities to resolve are not the famous ones; they are the moderately famous ones whose names collide. And finally: every time we have been tempted to write our own crawler instead of paying for a feed that already exists, we have eventually wished we had paid.