ATA_0001 // MVP v0.2
ATA_0001 // METHODOLOGY

Six stages, in this order.

The sequence is intentional: the primary surface and the differentiating value are decided before the data pipeline is built.

Atalaya is built as a visual memory of what the Colombian press has already published — an editorial dossier first, an engineering pipeline second.

01

Framing the problem

The underlying problem is not technical. It is cultural: Colombia operates with institutional amnesia. The track records of candidates and officials are forgotten between electoral cycles — the same mistakes repeat, the same actors return.

Atalaya starts from a simple premise: the Colombian press has already done the journalism. What is missing is a layer that indexes, deduplicates and connects what was published, so the public can recognize patterns over time.

The scope is narrow: a public-interest tool that helps citizens recognize the people connected to corruption cases and understand that corruption is not the property of any one political wing — it is a systemic, interconnected network.

Result: Cultural diagnosis + audience + expected outcome.

02

The graph as primary surface

The most important design decision is made before any code is written: the primary navigation is a force-directed graph, not a table.

Tables and headlines isolate cases one by one. The graph surfaces what the tabular format hides: the people, institutions and periods that reappear across cases. That is what differentiates Atalaya from existing archives such as OCCRP Aleph or OpenCorporates: the systemic pattern is visible before the individual case is.

Result: Force-directed graph as primary navigation.

03

Source and publication rules

The sole source is the Colombian traditional press: El Tiempo, El Espectador, Semana, La Silla Vacía, Vorágine, Cuestión Pública and regional outlets. No SECOP, no Procuraduría, no judicial bulletins as primary sources. The reason is deliberate: Atalaya indexes what is already public and journalistically verified, not what an official could reclassify tomorrow.

A claim is published when two or more distinct outlets agree on (people + institution + status), with wire-copy deduplication so a single Colprensa story republished by three outlets does not count as three sources. Single-source claims are also published, but with a `Single source` badge that communicates lower confidence to the reader.
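
The publication rule above can be sketched as follows. This is a minimal illustration, not Atalaya's actual code: the `Report` fields (including `wire_origin`), the function names and the exact-match comparison of the people set are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    outlet: str                  # outlet that published the article
    wire_origin: Optional[str]   # hypothetical field: wire agency behind the copy, if any
    people: frozenset
    institution: str
    status: str


def independent_sources(reports):
    # Wire copy republished by several outlets collapses to the agency's name,
    # so it counts as one source no matter how many outlets ran it.
    return {r.wire_origin or r.outlet for r in reports}


def badges(reports):
    """Map each claim (people, institution, status) to its badge, or None."""
    by_claim = defaultdict(list)
    for r in reports:
        by_claim[(r.people, r.institution, r.status)].append(r)
    return {
        claim: None if len(independent_sources(group)) >= 2 else "Single source"
        for claim, group in by_claim.items()
    }
```

A claim carried only by three reprints of the same wire story gets the `Single source` badge; the badge disappears once an outlet with its own reporting agrees on the same triple.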

Cross-source corroboration replaces the human reviewer by design: Atalaya is an aggregator, not an editor. The most recent judicial ruling overrides the case status; the courts have the final word, the press has the first draft.
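
The judicial override can be expressed in a few lines. The sketch below assumes rulings arrive as hypothetical `(ruling_date, status)` pairs; how rulings are actually represented is not specified here.

```python
from datetime import date


def current_status(press_status, rulings):
    """Return the case status, letting the most recent ruling win.

    rulings: list of (ruling_date, status) tuples (hypothetical shape).
    """
    if not rulings:
        return press_status  # no ruling yet: the press-reported status stands
    _, latest_status = max(rulings, key=lambda ruling: ruling[0])
    return latest_status
```

With no rulings the press-reported status is kept; with several, only the latest date matters, regardless of the order in which the rulings were ingested.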

Result: Multi-source as default + single source with badge + judicial override.

04

AI-assisted extraction pipeline

The pipeline has four steps: classification, structured extraction, embedding, and linking into the graph. Classification is the gate — a low-cost classifier rejects articles that are not about Colombian corruption before the extractor ever sees them. Keeping the cost of entry close to zero is what makes it viable to scan the entire press daily.
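
A minimal sketch of the cascade, with the four steps passed in as plain functions. The stand-in stages below are illustrative assumptions only; in particular, the keyword check is a placeholder for whatever low-cost classifier is actually used.

```python
def run_pipeline(article, classify, extract, embed, link):
    """Run one article through the four steps; return None if the gate rejects it."""
    if not classify(article):      # step 1: cheap gate
        return None
    record = extract(article)      # step 2: structured extraction (the expensive call)
    vector = embed(record)         # step 3: embedding
    return link(record, vector)    # step 4: linking into the graph


# Stand-in stages, for illustration only.
extractor_calls = []


def keyword_classifier(article):
    return "corrupción" in article.lower()


def fake_extractor(article):
    extractor_calls.append(article)
    return {"text": article}


def fake_embedder(record):
    return [float(len(record["text"]))]


def fake_linker(record, vector):
    return "linked"
```

The point of the ordering is that a rejected article never reaches `extract`, so the cost of scanning irrelevant articles is the cost of the classifier alone.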

The extractor returns JSON validated against a strict schema: people, institutions, amounts, procedural status, presidential period. If the JSON fails validation, the entry is rejected — we would rather lose a claim than publish a malformed one.
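
As an illustration of the reject-on-failure rule, here is a hand-rolled validator using only the standard library. The field names and types are assumptions derived from the list above, not the real schema, and a production system would more likely use a schema library.

```python
import json

# Hypothetical field names and types, mirroring the fields listed above.
REQUIRED_FIELDS = {
    "people": list,
    "institutions": list,
    "amounts": list,
    "procedural_status": str,
    "presidential_period": str,
}


def parse_extraction(raw):
    """Return the parsed record, or None if it fails validation."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Strict: exactly the expected keys, no more and no fewer.
    if not isinstance(record, dict) or set(record) != set(REQUIRED_FIELDS):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record[field], expected_type):
            return None
    names = record["people"] + record["institutions"]
    if not all(isinstance(n, str) and n.strip() for n in names):
        return None
    for amount in record["amounts"]:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return None
        if not amount >= 0:
            return None
    return record
```

Anything that returns `None` is dropped rather than repaired, which matches the stated preference for losing a claim over publishing a malformed one.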

Every extracted entity is embedded and searched against the existing corpus by semantic similarity. If similarity passes a threshold, the system enriches an existing case instead of creating a duplicate node. This is how the graph grows: each new article adds evidence to a known case, or opens a new one.
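
The enrich-or-create decision can be sketched with cosine similarity over plain lists. The threshold value of 0.85, the `case-N` identifier scheme and the function names are assumptions for the example; the text above does not state which similarity measure or threshold is used.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_or_create(vector, cases, threshold=0.85):
    """Return (case_id, created). `cases` maps case_id -> embedding and is
    updated in place when a new case is opened. The threshold is illustrative."""
    best_id, best_sim = None, -1.0
    for case_id, embedding in cases.items():
        sim = cosine(vector, embedding)
        if sim > best_sim:
            best_id, best_sim = case_id, sim
    if best_id is not None and best_sim >= threshold:
        return best_id, False          # enrich the existing case
    new_id = f"case-{len(cases) + 1}"  # hypothetical id scheme
    cases[new_id] = vector
    return new_id, True                # open a new case
```

A linear scan is enough to show the logic; at corpus scale the same decision would normally sit behind an approximate nearest-neighbour index.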

Result: Cascade of classifier + schema-validated extractor + semantic deduplication.

05

Learning from failure modes

The three most important design patterns came from real production failures, not from the whiteboard.

**Cambio as institution.** An early version of the extractor confused mentions of the outlet Cambio with the common Spanish word «cambio» (change) and created spurious nodes. The fix was not a model patch but an exclusion list (`stoplist`) of ambiguous mentions applied before extraction. The rule we kept: if a token has any semantic-merge risk, filter it at the input; do not trust the model to disambiguate.
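
A minimal sketch of such an input filter. The contents of `STOPLIST`, the function name and the idea of filtering a list of candidate mentions are assumptions for illustration; the text only states that ambiguous mentions are excluded before extraction.

```python
# Hypothetical stoplist contents; entries are stored case-folded.
STOPLIST = {"cambio"}


def filter_mentions(mentions, stoplist=STOPLIST):
    """Drop candidate mentions that are on the stoplist, ignoring case and
    surrounding whitespace, before they reach the extractor."""
    return [m for m in mentions if m.strip().casefold() not in stoplist]
```

The match is deliberately case-insensitive: the filter does not try to tell the outlet from the ordinary word, it removes both and accepts the loss.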

**Syria–narcotics conflation.** The extractor merged two entirely distinct cases because they shared secondary people and both contained the word «narcotics». The fix was multi-signal deduplication: a case identity no longer depends on people alone, but on the combination (people + institution + period + event type). A single axis is not enough to merge.
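
One way to express the multi-signal rule is sketched below. Requiring all four signals to agree is an assumption of this example; the rule above only establishes that a single axis is not enough. The `CaseKey` type and its field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseKey:
    people: frozenset
    institution: str
    period: str
    event_type: str


def same_case(a, b):
    """Merge only when every signal agrees (an assumed, conservative policy)."""
    signals = [
        bool(a.people & b.people),        # at least one person in common
        a.institution == b.institution,
        a.period == b.period,
        a.event_type == b.event_type,
    ]
    return all(signals)
```

Under this policy two cases that share secondary people but differ in institution or event type stay separate, which is the behaviour the failure called for.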

**Naive amount sums.** The first computation of «amount under indictment» summed every figure mentioned in every article, which inflated a case each time a story recapped it. The fix was to take the modal value — the amount that appears most often across sources — instead of the sum. The metric went from noise to signal.
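
The difference between the two computations is easy to see in code. This sketch uses `collections.Counter`; the tie-breaking rule (prefer the smallest of the most frequent amounts) is an assumption of the example and is not stated in the text.

```python
from collections import Counter


def modal_amount(amounts):
    """Return the most frequently reported amount, or None if there are none."""
    if not amounts:
        return None
    counts = Counter(amounts)
    top = max(counts.values())
    # Assumed tie-break: the smallest of the equally frequent amounts.
    return min(a for a, c in counts.items() if c == top)
```

For a case reported as 5,000 in three articles and 12,000 in one, a naive sum gives 27,000 while the modal value gives 5,000, so further recaps of the same figure no longer inflate the total.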

Result: Stoplist + multi-signal dedup + modal amount — three rules born from three failures.

06

Visual register

The visual register is deliberately sober: a dark editorial theme, 1 px technical lines, monospaced type for data and an editorial sans for headings. Functional color (signal red, amber, teal, blue) appears only to differentiate relationship types and confidence levels, never as decoration.

The goal is for the site to read like a surveillance dossier, not a commercial landing page. Information density, grid lines and the absence of decorative imagery reinforce the editorial position: Atalaya shows what the press published, without embellishing or dramatizing.

Result: Editorial dossier: dense information, functional color, zero decoration.