Algo Strategy Analyzer
Complete Guide

Market Data for Algorithmic Trading: The Raw Material

Data types, free vs premium providers, deadly biases, and how to choose the right data for your backtest

Rubén Villahermosa
January 24, 2026 · 22 min read

Your strategy has a Sharpe Ratio of 3.5. A maximum drawdown of 8%. Profit Factor of 2.8. It seems too good to be true. And it probably is.

Garbage In, Garbage Out. If your market data has errors, gaps, or biases, your backtest is fiction. It doesn't matter how sophisticated your algorithm is: if the raw material is contaminated, the final product will be garbage.

"Market data is the foundation upon which you build your entire algorithmic trading system. Without quality data, everything else is irrelevant."

If you're coming from our algorithmic trading guide, you already know you need data for backtesting. And if you've explored the algorithmic trading tools, you know that data is the fuel that powers everything. Now you'll learn exactly what types of data exist, where to get them, how much they cost, and —most importantly— how to avoid the traps that invalidate 90% of backtests.

ALERT ⚠️

Garbage In, Garbage Out: Why Data Is Everything

A backtest is only as good as the data that feeds it. This is not a cliché: it's the first law of algorithmic trading. As Marcos López de Prado explains in Advances in Financial Machine Learning, data quality is the foundation upon which any reliable quantitative system is built. Considering that 60-80% of trading volume in U.S. and European equity markets is generated by algorithmic trading (Select USA, 2024), the competition for quality data is fierce: data is the real competitive advantage.

The Real Cost of Bad Data

Imagine you develop a momentum strategy in stocks. You backtest 10 years of data and get spectacular results. But your data has a problem: it doesn't include companies that went bankrupt or were delisted. You're only seeing the survivors.

💀

Result of Survivorship Bias

Your backtest shows +340% in 10 years. Reality would have been +40% (or losses). You built on incomplete data.
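To see the mechanism in miniature, here is a toy sketch. The tickers and returns are invented for illustration and are not real data; the only point is how the average changes once the delisted names are put back into the universe.

```python
from statistics import mean

# Hypothetical 10-year total returns (as fractions). All values are invented
# for illustration; "delisted" marks companies that dropped out of the data.
universe = [
    {"ticker": "AAA", "total_return": 3.40, "delisted": False},
    {"ticker": "BBB", "total_return": 1.10, "delisted": False},
    {"ticker": "CCC", "total_return": -1.00, "delisted": True},   # went bankrupt
    {"ticker": "DDD", "total_return": -0.85, "delisted": True},   # was delisted
]


def average_return(stocks, include_delisted):
    """Equal-weighted average return, optionally ignoring delisted names."""
    selected = [s for s in stocks if include_delisted or not s["delisted"]]
    return mean(s["total_return"] for s in selected)


survivors_only = average_return(universe, include_delisted=False)
full_universe = average_return(universe, include_delisted=True)

print(f"Survivors only: {survivors_only:+.0%}")
print(f"Full universe:  {full_universe:+.0%}")
```

A dataset without delisted companies can only ever compute the first number.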

Real Cases of Data Failures

| Case | Data Problem | Impact |
|---|---|---|
| Pairs strategy (stocks) | Data not adjusted for splits | False signals, -60% losses |
| Mean reversion (futures) | 2-hour gaps in data | Real DD 3x higher than backtest |
| Breakout (forex) | Single broker data (biased) | Real spread 2x higher, not profitable |
| Momentum (ETFs) | No delisted ETFs | Performance inflated +200% |

Why Free Data Has a Hidden Cost

"Yahoo Finance is free, why pay for data?"

Because free data:

  • Has errors that no one corrects (misapplied splits, gaps)
  • Doesn't include delisted securities (guaranteed survivorship bias)
  • Is often only EOD (end of day), which is useless for intraday strategies
  • Changes retroactively without notice (Yahoo has modified historical data)
  • Usually has no bid/ask, so you can't simulate real slippage
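Two of the problems in this list, gaps and misapplied splits, can be caught with very simple checks before you run any backtest. The sketch below is a minimal example under stated assumptions: the thresholds (4 calendar days, 40% one-day move) and the sample prices are hypothetical, and a real check would use an exchange holiday calendar.

```python
from datetime import date


def find_gaps(bars, max_calendar_days=4):
    """Return (previous_date, next_date) pairs that are too far apart.

    The default of 4 days tolerates weekends and single holidays.
    It is an assumed threshold, not a standard.
    """
    gaps = []
    for (prev_day, _), (next_day, _) in zip(bars, bars[1:]):
        if (next_day - prev_day).days > max_calendar_days:
            gaps.append((prev_day, next_day))
    return gaps


def find_jumps(bars, max_abs_return=0.40):
    """Return dates whose close-to-close move exceeds the threshold."""
    jumps = []
    for (_, prev_close), (day, close) in zip(bars, bars[1:]):
        if abs(close / prev_close - 1) > max_abs_return:
            jumps.append(day)
    return jumps


# Invented daily closes: one missing week and one suspicious 50% drop.
bars = [
    (date(2024, 3, 1), 100.0),
    (date(2024, 3, 4), 101.0),
    (date(2024, 3, 12), 102.0),   # a week of bars is missing before this one
    (date(2024, 3, 13), 51.0),    # looks like an unapplied 2:1 split
]

print(find_gaps(bars))
print(find_jumps(bars))
```

Anything these checks flag deserves a manual look before you trust the series.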

The global real-time financial data market is valued at $28 billion (2024), demonstrating the importance the industry places on information quality. It's no coincidence that Python is used by 74% of algorithmic traders according to the QuantInsti Developer Survey (2024): the Python ecosystem of data libraries and tools makes it easier to access and process market data at scale.

💡

Cost Perspective

A professional provider costs $50-200/month. This is insignificant compared to the cost of developing strategies on garbage data for months and losing real money afterwards.

TYPES 📊

Types of Market Data

Not all data is equal. Depending on your strategy, you'll need different types and granularities.

[Figure: Tick data vs OHLC. Tick data (raw): each point is one transaction, thousands of points per minute, maximum precision. OHLC (aggregated): each bar is one period (1m, 1h, 1D...), intrabar detail is lost, enough for swing trading.]

OHLC (Open, High, Low, Close)

The most common format. Each bar contains:

OHLC Components

  • Open: Opening price of the period
  • High: Maximum price reached
  • Low: Minimum price reached
  • Close: Closing price of the period
  • Volume: Quantity traded

Key Limitation

You don't know the order of movements within the bar. Did the price touch High first then Low, or vice versa? This affects strategies with tight stops.
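A minimal sketch of how a backtest can at least detect this situation instead of silently guessing. The function name and the example prices are hypothetical; the logic only checks whether both levels fall inside the bar's range.

```python
def long_exit_from_bar(high, low, stop, target):
    """Classify what a single OHLC bar can tell you about a long position's exit.

    Returns "stop", "target", "ambiguous" (both levels were touched and the
    bar cannot say which came first) or "none".
    """
    hit_stop = low <= stop
    hit_target = high >= target
    if hit_stop and hit_target:
        return "ambiguous"
    if hit_stop:
        return "stop"
    if hit_target:
        return "target"
    return "none"


# Invented bar: long from 100 with a stop at 99 and a target at 101.
print(long_exit_from_bar(high=101.5, low=98.8, stop=99.0, target=101.0))
```

For ambiguous bars, a conservative choice is to assume the stop was hit first, or to resolve the bar with finer-grained data.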

Tick Data

Each individual transaction is recorded with a millisecond timestamp:

Timestamp,Price,Volume,Side
2026-01-24 09:30:00.123,4520.25,5,BUY
2026-01-24 09:30:00.125,4520.50,3,BUY
2026-01-24 09:30:00.128,4520.25,10,SELL
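Because ticks are the raw record, any bar size can be rebuilt from them. The sketch below aggregates the three sample ticks above into a 1-minute OHLCV bar using only the standard library. It assumes the ticks arrive sorted by time and that the columns are named as in the sample.

```python
import csv
import io
from datetime import datetime

# The three sample ticks shown above.
SAMPLE = """Timestamp,Price,Volume,Side
2026-01-24 09:30:00.123,4520.25,5,BUY
2026-01-24 09:30:00.125,4520.50,3,BUY
2026-01-24 09:30:00.128,4520.25,10,SELL
"""


def ticks_to_minute_bars(csv_text):
    """Aggregate time-sorted ticks into 1-minute OHLCV bars keyed by bar start."""
    bars = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.strptime(row["Timestamp"], "%Y-%m-%d %H:%M:%S.%f")
        minute = ts.replace(second=0, microsecond=0)
        price, volume = float(row["Price"]), int(row["Volume"])
        bar = bars.get(minute)
        if bar is None:
            # First tick of the minute sets every field.
            bars[minute] = {"open": price, "high": price, "low": price,
                            "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price          # last tick seen so far
            bar["volume"] += volume
    return bars


bars = ticks_to_minute_bars(SAMPLE)
for start, bar in bars.items():
    print(start, bar)
```

The reverse is not possible: once ticks are collapsed into a bar, the sequence inside it is gone.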

✅ Advantages

  • Maximum precision for backtesting
  • Allows reconstruction of any timeframe
  • Necessary for high frequency
  • You can see real order flow

❌ Limitations

  • Huge files (GB per day)
  • Requires more processing
  • More expensive and hard to get
  • Overkill for swing trading
TIME ⏱️

Granularity: From Tick to Monthly

The correct granularity depends on your time horizon and trading style.

| Granularity | Typical Use | Data points per year | Size |
|---|---|---|---|
| Tick | HFT, scalping | Millions | 1-10 GB |
| 1 minute | Day trading | ~98,000 | 5 MB |
| 5 minutes | Day/Swing | ~19,600 | 1 MB |
| 1 hour | Swing trading | ~1,600 | 100 KB |
| Daily (EOD) | Position trading | ~252 | 20 KB |
| Weekly | Long-term investing | ~52 | 5 KB |
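The intraday counts in the table are consistent with a US equity calendar. As a quick check, the sketch below reproduces them assuming 252 trading days and a 390-minute regular session (09:30 to 16:00). Both figures are assumptions on my part; futures and forex trade longer hours and would produce more bars.

```python
TRADING_DAYS_PER_YEAR = 252   # assumption: typical US equity calendar
SESSION_MINUTES = 390         # assumption: 09:30-16:00 regular session


def bars_per_year(bar_minutes):
    """Approximate number of intraday bars per year for a given bar length."""
    return round(TRADING_DAYS_PER_YEAR * SESSION_MINUTES / bar_minutes)


for minutes in (1, 5, 60):
    print(f"{minutes:>2}-minute bars: ~{bars_per_year(minutes):,} per year")
```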

📐 Practical Rule

Your data timeframe should be at least 5-10x smaller than your trading timeframe. If you trade on 1-hour charts, use bars of about 5 to 10 minutes or finer.

SPREAD 💰

Bid/Ask and Spread: The Detail that Kills Strategies

Most historical data only shows one price (last, mid, or close). But in reality, there are always two prices: Bid (what you can sell at) and Ask (what you can buy at). The key formulas to understand these prices are:

Mid Price = (Bid + Ask) / 2

Spread Cost = Ask - Bid (implicit cost of each round-trip trade: you buy at the Ask and sell at the Bid)

The Hidden Spread

If your backtest assumes you buy at the "Close" price and the real spread is 2 pips, you're ignoring a cost that eats into your edge.

EUR/USD Example

  • Your backtest uses mid price: 1.0850
  • Real spread: 1 pip
  • Buy at Ask: 1.08505
  • Sell at Bid: 1.08495

In a 50-pip profit trade, spread costs 1 pip (2%). In a 10-pip trade, it costs 1 pip (10%). Spread matters more the shorter your horizon.
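The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch: the function names are hypothetical, and `Decimal` is used only so that the fifth decimal place prints exactly.

```python
from decimal import Decimal

PIP = Decimal("0.0001")   # pip size for EUR/USD


def quote_from_mid(mid, spread_pips):
    """Return (bid, ask) for a mid price and a spread expressed in pips."""
    half_spread = Decimal(spread_pips) * PIP / 2
    return mid - half_spread, mid + half_spread


def spread_share_of_profit(profit_pips, spread_pips):
    """Fraction of a trade's gross profit consumed by the spread."""
    return Decimal(spread_pips) / Decimal(profit_pips)


bid, ask = quote_from_mid(Decimal("1.0850"), spread_pips=1)
print(f"Bid {bid}, Ask {ask}")
print(f"50-pip trade: {spread_share_of_profit(50, 1):.0%} of profit")
print(f"10-pip trade: {spread_share_of_profit(10, 1):.0%} of profit")
```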

SPLITS ✂️

Adjusted vs Unadjusted Data

This is one of the most confusing and most important topics. Using the wrong type can generate completely false signals.

What Are Adjustments?

When a company does a 2:1 split, the price is divided by 2. If yesterday the stock was worth $200 and today it's worth $100, it's not because it fell 50% — it's because there are now twice as many shares.

Adjusted data retroactively modifies all historical prices to reflect these corporate actions, maintaining series continuity. The basic adjustment formula for splits is:

Adjusted Price = Price × (1 / Split Ratio) (retroactive adjustment for a split)

For example, in a 4:1 split, all prices before the split date are divided by 4 to maintain time series coherence.
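Applied to a price series, the formula looks like this. It is a minimal sketch with invented prices and a hypothetical split date, and it handles only a single split on closing prices.

```python
from datetime import date


def back_adjust_for_split(bars, split_date, split_ratio):
    """Divide every close dated before the split by the split ratio.

    bars is a list of (date, close) tuples; split_ratio is 4 for a 4:1 split.
    """
    return [
        (day, close / split_ratio if day < split_date else close)
        for day, close in bars
    ]


# Invented prices around a hypothetical 4:1 split effective 2024-06-10.
raw = [
    (date(2024, 6, 6), 400.0),
    (date(2024, 6, 7), 404.0),
    (date(2024, 6, 10), 101.5),
]

adjusted = back_adjust_for_split(raw, split_date=date(2024, 6, 10), split_ratio=4)
for day, close in adjusted:
    print(day, close)
```

After the adjustment the series moves from 101.0 to 101.5 across the split date, instead of appearing to fall from 404.0 to 101.5. A fuller version would also multiply pre-split volume by the same ratio.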

Types of Corporate Actions

| Action | What Happens | Adjustment Needed |
|---|---|---|
| Split (2:1) | Price ÷2, shares ×2 | Divide historical prices by 2 |
| Reverse split (1:10) | Price ×10, shares ÷10 | Multiply historical prices by 10 |
| Dividend | Price drops by amount | Subtract dividend from historical |
| Spinoff | New company created | Proportional adjustment |

When to Use Each Type

✅ ADJUSTED Data (95% of cases)

  • Momentum, trend following strategies
  • Technical indicators (moving averages, RSI, etc.)
  • Any historical price comparison

⚠️ UNADJUSTED Data (specific cases)

  • Nominal price analysis ("trading above $50?")
  • Options strategies (nominal strikes)
  • Dividend analysis

The Classic Error

AAPL did a 4:1 split in August 2020. Price before: about $500. Price after: about $125.

With unadjusted data: your strategy sees a 75% overnight drop that never happened and generates a massive buy signal on a phantom crash. Catastrophic error.
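If you're not sure whether a series is adjusted, a crude screen is to look for overnight price ratios that sit almost exactly on a whole number. The sketch below is an assumption-heavy heuristic: the ratio list and the 5% tolerance are hypothetical choices, and it only looks for forward splits.

```python
COMMON_SPLIT_RATIOS = (2, 3, 4, 5, 10)   # assumed list, extend as needed


def looks_like_split(prev_close, close, tolerance=0.05):
    """Return N if prev_close / close is within tolerance of N, else None."""
    ratio = prev_close / close
    for n in COMMON_SPLIT_RATIOS:
        if abs(ratio - n) / n <= tolerance:
            return n
    return None


# Invented closes on consecutive days.
print(looks_like_split(200.0, 50.0))    # price fell to a quarter overnight
print(looks_like_split(200.0, 192.0))   # ordinary 4% decline
```

A hit is not proof of a split, since real crashes exist, but it tells you which dates to check against a corporate actions calendar.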

⚠️

Beware of biases in your data

Market data can contain invisible traps like survivorship bias, look-ahead bias, or data snooping. These biases turn winning backtests into real losses. There are also technical errors like trading CFDs with index data or timezone issues that invalidate results. We have a dedicated article on the 7 main backtesting problems where we cover each one in detail.

SOURCES 🏪

Data Providers: Free vs Premium

Free Providers

| Provider | Markets | Granularity | Limitations |
|---|---|---|---|
| Yahoo Finance | Stocks, ETFs, indices | Daily | Errors, no delisted |
| Alpha Vantage | Stocks, Forex, Crypto | Up to 1 min | 25 calls/day free |
| FRED | Macro | Daily/monthly | Macro only |
| Dukascopy | Forex, Index CFDs, Commodities | Tick (bid/ask) | ✓ Free (demo account) |

Premium Providers

| Provider | Markets | Granularity | Price/month |
|---|---|---|---|
| Polygon.io | US Stocks, Options, Forex | Tick | $29-199 |
| IQFeed | US Stocks, Futures | Tick | $80-150 |
| Norgate Data | US/AU Stocks, Futures | Daily (adjusted) | $35-50 |
| CSI Data | Global Futures | Daily | $30-60 |
| Tiingo | US Stocks, Crypto | Daily + IEX intraday | Free limited / $100+ |

Recommendations by Profile

🎓 Beginner

  • Yahoo Finance for learning
  • Tiingo ($10/month) for cleaner data

📈 US Stock Trader

  • Polygon.io ($29/month) for intraday
  • Norgate ($35/month) for EOD survivorship-free

🔮 Futures Trader

  • IQFeed ($80-150/month) — industry standard
  • CSI Data for long history

💱 Forex Trader

  • Dukascopy (free) — excellent tick data
  • TrueFX for long history

The Special Case: TradeStation as a Complete Ecosystem

The TradeStation ecosystem deserves special mention as one of the most popular solutions among algorithmic traders because it solves several problems at once: development platform, quality historical data, and execution.

📊 Data Included with TradeStation Account

US Stocks:

  • Tick-by-tick: 6 months
  • 1-minute data: since 1991
  • Daily data: since 1968

Futures:

  • Tick-by-tick: 6 months
  • 1-minute data: since 1982 (depending on market)
  • Individual contracts by expiration

✅ Advantages

  • All-in-one: data, platform, backtesting, execution
  • High quality: automatic real-time error filtering
  • Deep history: decades of intraday data (minute since 1991)
  • Cost effective: data included with trading account
  • Exportable: you can download data to TXT/CSV
  • EasyLanguage: accessible development language

⚠️ Considerations

  • Requires account: $5,000 minimum for futures
  • API: programmatic access may require additional capital (verify current requirements)
  • Lower reported volume: ~28-30% less volume vs Polygon/Alpaca (per comparative studies)
  • Limited tick data: only 6 months of tick-by-tick history
💡

Who is TradeStation for?

If you're going to develop strategies AND execute them, TradeStation is a very efficient option. You pay for the broker and get quality data included. If you only need data for research (without executing), platforms like Polygon.io or Norgate offer more flexibility. The TradeStation ecosystem is especially popular for US futures where the platform + data + execution combination is hard to beat on cost.

Conclusion

Market data is literally the raw material of your algorithmic trading. Without quality data, everything else —your strategy, your code, your analysis— is built on sand. Once you have quality data, the next step is measuring performance correctly with advanced risk-adjusted metrics and understanding your strategy's drawdown.

The 5 Key Points

  1. Garbage In, Garbage Out: A backtest is only as good as its data. Investing in quality data is the best investment.
  2. Know your biases: Survivorship bias, look-ahead bias, and selection bias invalidate more backtests than any coding error.
  3. Adjusted vs unadjusted matters: For technical strategies, always use adjusted data.
  4. Spread kills short-term strategies: If you trade frequently, you need bid/ask data or simulate spread.
  5. Scale your data investment: Start basic to learn, but invest in professional data when trading real money.
FAQ ❓

Frequently Asked Questions

Can I use free Yahoo Finance data for serious backtesting?

For learning and prototyping, yes. For strategies that will trade real money, it's not recommended. Yahoo Finance has known errors, doesn't include delisted stocks (survivorship bias), and modifies data retroactively. For serious backtesting, invest in Tiingo ($10/month) or Norgate Data ($35/month).

What's the difference between tick data and OHLC data?

Tick data records each individual transaction with a millisecond timestamp. OHLC aggregates transactions into bars showing only open, high, low, and close. Tick data is necessary for high frequency and maximum precision. OHLC is sufficient for swing and position trading.

Do I need adjusted or unadjusted data for my backtest?

Adjusted for 95% of cases: any strategy with technical indicators or that compares historical prices. Unadjusted only if you need specific nominal prices (options with fixed strikes, dividend analysis).

How much does professional data cost?

Depends on market and granularity. EOD: $10-50/month. Intraday: $30-100/month. Tick data: $50-200/month. Some providers like Dukascopy (forex) offer free tick data if you have an account.

What is survivorship bias and how does it affect me?

It's the bias that arises when your data only includes assets that "survived" until today. Companies that went bankrupt or were delisted don't appear. This can artificially inflate results by 50-200% because the losers are missing from the sample. Solution: use providers with "survivorship-free" data like Norgate Data.

What data provider do you recommend for futures?

IQFeed is the industry standard for US futures in real-time ($80-150/month). For long history and well-built continuous contracts, CSI Data ($30-60/month).

Can I trust my broker's data?

With caution. Broker data is useful for execution, but for backtesting it has problems: it may have gaps, its history is often limited, and prices may be broker-specific (especially in forex). Use broker data for confirmation, but backtest with independent providers.
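Using one source to confirm another can be as simple as comparing closes date by date. The sketch below is a minimal example: the prices are invented, both feeds are hypothetical, and the 0.1% tolerance is an assumed value you would tune per market.

```python
def compare_sources(primary, reference, tolerance=0.001):
    """Compare closes from two sources keyed by date string.

    Returns (missing_in_primary, mismatches). A mismatch is a date present in
    both sources whose relative difference exceeds the tolerance.
    """
    missing = sorted(set(reference) - set(primary))
    mismatches = sorted(
        day for day in set(primary) & set(reference)
        if abs(primary[day] - reference[day]) / reference[day] > tolerance
    )
    return missing, mismatches


# Invented closes from a hypothetical broker feed and an independent provider.
broker = {"2024-03-01": 100.00, "2024-03-04": 101.20}
independent = {"2024-03-01": 100.02, "2024-03-04": 100.50, "2024-03-05": 102.00}

missing, mismatches = compare_sources(broker, independent)
print("Missing in broker data:", missing)
print("Prices that disagree:  ", mismatches)
```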

How many years of historical data do I need?

Depends on your strategy. Minimum 3-5 years for intraday, 10-15 years for swing trading, 20+ years for strategies with few trades. You need enough data to include different market regimes (bull, bear, sideways, high/low volatility).

What is look-ahead bias in market data?

Look-ahead bias occurs when you use information in a backtest that wouldn't have been available at the time of the actual trading decision. Examples: using closing prices for pre-close decisions, using fundamental data before publication, or applying retroactive adjustments. It's one of the hardest biases to detect and can completely invalidate your results.
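The first example, acting on a close before it exists, is easy to reproduce. In this sketch the prices are invented and the rule ("hold the day after an up close") is deliberately naive. The only difference between the two results is whether the signal is lagged by one bar.

```python
closes = [100.0, 102.0, 101.0, 104.0, 103.0, 106.0]   # invented daily closes

# returns[i] is the move from closes[i] to closes[i + 1].
returns = [b / a - 1 for a, b in zip(closes, closes[1:])]

# Signal that is only known once a day has finished: 1 if it closed up.
signal = [1 if r > 0 else 0 for r in returns]


def strategy_return(signal, returns, lag):
    """Sum the returns earned on days where the signal from `lag` days earlier was 1."""
    return sum(r for s, r in zip(signal, returns[lag:]) if s)


biased = strategy_return(signal, returns, lag=0)   # trades the day that produced the signal
honest = strategy_return(signal, returns, lag=1)   # trades the following day

print(f"With look-ahead: {biased:+.2%}")
print(f"Without:         {honest:+.2%}")
```

With `lag=0` the strategy captures every up day by construction, which is why results that look perfect are a warning sign.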

How does the bid/ask spread affect backtest results?

The spread is an implicit cost on every trade. On a 50-pip profit trade, a 1-pip spread represents 2% of the profit. On a 10-pip trade, the same spread represents 10%. The shorter your time horizon, the greater the relative impact of spread. That's why high-frequency strategies need real bid/ask data.
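If you only have mid or last prices, one simple way to approximate this is to charge one full spread per round-trip trade. The sketch below uses invented trade results for a hypothetical short-term strategy, and a fixed spread is itself a simplifying assumption since real spreads vary.

```python
def net_pips(gross_pips_per_trade, spread_pips):
    """Subtract one full spread from every round-trip trade and return the total."""
    return sum(gross - spread_pips for gross in gross_pips_per_trade)


# Invented gross results (in pips) for eight short-term trades.
trades = [10, -8, 12, -9, 6, -7, 9, -10]

gross_total = sum(trades)
net_total = net_pips(trades, spread_pips=1)

print(f"Gross: {gross_total:+} pips")
print(f"Net of a 1-pip spread: {net_total:+} pips")
```

A small positive edge measured on mid prices can turn negative once every trade pays the spread.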
