Your strategy has a Sharpe Ratio of 3.5. A maximum drawdown of 8%. Profit Factor of 2.8. It seems too good to be true. And it probably is.
Garbage In, Garbage Out. If your market data has errors, gaps, or biases, your backtest is fiction. It doesn't matter how sophisticated your algorithm is: if the raw material is contaminated, the final product will be garbage.
"Market data is the foundation upon which you build your entire algorithmic trading system. Without quality data, everything else is irrelevant."
If you're coming from our algorithmic trading guide, you already know you need data for backtesting. And if you've explored the algorithmic trading tools, you know that data is the fuel that powers everything. Now you'll learn exactly what types of data exist, where to get them, how much they cost, and —most importantly— how to avoid the traps that invalidate 90% of backtests.
Garbage In, Garbage Out: Why Data Is Everything
A backtest is only as good as the data that feeds it. This is not a cliché: it's the first law of algorithmic trading. As Marcos López de Prado explains in Advances in Financial Machine Learning, data quality is the foundation upon which any reliable quantitative system is built. Considering that 60-80% of trading volume in U.S. and European equity markets is generated by algorithmic trading (Select USA, 2024), the competition for quality data is fierce: data is the real competitive advantage.
The Real Cost of Bad Data
Imagine you develop a momentum strategy in stocks. You backtest 10 years of data and get spectacular results. But your data has a problem: it doesn't include companies that went bankrupt or were delisted. You're only seeing the survivors.
Result of Survivorship Bias
Your backtest shows +340% in 10 years. Reality would have been +40% (or losses). You built on incomplete data.
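A toy illustration of the effect, assuming a hypothetical six-stock universe. Every ticker and return below is invented for the example; only the mechanism matters: dropping the delisted names raises the average.

```python
from statistics import mean

# Hypothetical universe: (ticker, 10-year total return, still_listed).
# All values are invented for illustration only.
UNIVERSE = [
    ("AAA", 4.00, True),
    ("BBB", 2.50, True),
    ("CCC", 1.20, True),
    ("DDD", -1.00, False),  # went bankrupt: -100%
    ("EEE", -0.80, False),  # delisted after an 80% loss
    ("FFF", -0.60, False),  # delisted after a 60% loss
]


def equal_weight_return(rows):
    """Average total return of an equal-weight, buy-and-hold basket."""
    return mean(ret for _, ret, _ in rows)


survivors_only = [row for row in UNIVERSE if row[2]]
biased = equal_weight_return(survivors_only)  # what a survivor-only dataset shows
unbiased = equal_weight_return(UNIVERSE)      # what an investor actually faced

print(f"Survivors only: {biased:+.0%}")
print(f"Full universe:  {unbiased:+.0%}")
```

The gap between the two figures comes entirely from which rows the dataset contains, not from the strategy.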
Real Cases of Data Failures
| Case | Data Problem | Impact |
|---|---|---|
| Pairs strategy (stocks) | Data not adjusted for splits | False signals, -60% losses |
| Mean reversion (futures) | 2-hour gaps in data | Real DD 3x higher than backtest |
| Breakout (forex) | Single broker data (biased) | Real spread 2x higher, not profitable |
| Momentum (ETFs) | No delisted ETFs | Performance inflated +200% |
Why Free Data Has a Hidden Cost
"Yahoo Finance is free, why pay for data?"
Because free data:
- Has errors that no one corrects (misapplied splits, gaps)
- Doesn't include delisted securities (guaranteed survivorship bias)
- Is only EOD (end of day) — useless for intraday strategies
- Changes retroactively without notice (Yahoo has modified historical data)
- Has no bid/ask — impossible to simulate real slippage
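Some of these problems can be caught mechanically before a backtest runs. The sketch below is one possible sanity check, not a complete validator: the function name `find_issues`, the tuple layout, and the 4-day gap threshold are all assumptions made for the example.

```python
from datetime import date


def find_issues(bars, max_gap_days=4):
    """Flag calendar gaps and internally inconsistent OHLC bars.

    bars: list of (date, open, high, low, close), sorted by date.
    max_gap_days=4 tolerates a long weekend; it is an illustrative
    threshold, not a standard.
    """
    issues = []
    prev_day = None
    for day, o, h, l, c in bars:
        if prev_day is not None and (day - prev_day).days > max_gap_days:
            issues.append((day, "gap"))
        # High must bound open/close from above, low from below, prices > 0.
        if h < max(o, c) or l > min(o, c) or l <= 0:
            issues.append((day, "bad_ohlc"))
        prev_day = day
    return issues


# Invented sample data containing one gap and one impossible bar.
bars = [
    (date(2024, 1, 2), 100.0, 101.0, 99.0, 100.5),
    (date(2024, 1, 3), 100.5, 102.0, 100.0, 101.0),
    (date(2024, 1, 12), 101.0, 103.0, 100.5, 102.0),  # sessions missing before this bar
    (date(2024, 1, 16), 102.0, 101.0, 100.0, 101.5),  # high is below the open
]
issues = find_issues(bars)
print(issues)
```

A check like this will not detect survivorship bias or silent retroactive edits; those require comparing against a second source or a saved snapshot.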
The global real-time financial data market is valued at $28 billion (2024), demonstrating the importance the industry places on information quality. It's no coincidence that Python is used by 74% of algorithmic traders according to the QuantInsti Developer Survey (2024): the Python ecosystem of data libraries and tools makes it easier to access and process market data at scale.
Cost Perspective
A professional provider costs $50-200/month. This is insignificant compared to the cost of developing strategies on garbage data for months and losing real money afterwards.
Types of Market Data
Not all data is equal. Depending on your strategy, you'll need different types and granularities.
OHLC (Open, High, Low, Close)
The most common format. Each bar contains:
OHLC Components
- Open: Opening price of the period
- High: Maximum price reached
- Low: Minimum price reached
- Close: Closing price of the period
- Volume: Quantity traded
Key Limitation
You don't know the order of movements within the bar. Did the price touch High first then Low, or vice versa? This affects strategies with tight stops.
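One common way to handle this ambiguity in a bar-based backtest is to assume the worst case. The sketch below does that for a long position; `resolve_long_exit` and the dict-based bar layout are hypothetical names chosen for the example, and the pessimistic rule is a modelling choice rather than the only valid one.

```python
def resolve_long_exit(bar, stop, target):
    """Decide how a long position exits within one OHLC bar.

    Returns (reason, exit_price), where reason is "stop", "target", or None.
    When both levels lie inside the bar's range, OHLC data cannot say which
    was touched first, so this sketch pessimistically assumes the stop.
    """
    hit_stop = bar["low"] <= stop
    hit_target = bar["high"] >= target
    if hit_stop:  # also covers the ambiguous both-hit case
        # If the bar opened below the stop, the fill is the open, not the stop.
        return "stop", min(stop, bar["open"])
    if hit_target:
        # If the bar opened above the target, the fill is the open.
        return "target", max(target, bar["open"])
    return None, None


ambiguous_bar = {"open": 100, "high": 103, "low": 98, "close": 101}
print(resolve_long_exit(ambiguous_bar, stop=99, target=102))
```

If a strategy's results change materially when this rule is flipped to the optimistic case, the bars are too coarse for its stop distance.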
Tick Data
Every individual transaction is recorded with a millisecond timestamp.
Timestamp,Price,Volume,Side
2026-01-24 09:30:00.123,4520.25,5,BUY
2026-01-24 09:30:00.125,4520.50,3,BUY
2026-01-24 09:30:00.128,4520.25,10,SELL
✅ Advantages
- Maximum precision for backtesting
- Allows reconstruction of any timeframe
- Necessary for high frequency
- You can see real order flow
❌ Limitations
- Huge files (GB per day)
- Requires more processing
- More expensive and hard to get
- Overkill for swing trading
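To show what "reconstruction of any timeframe" means in practice, here is a minimal sketch that aggregates the three sample ticks above into a 1-minute OHLCV bar. The function name and the dict-based bar layout are assumptions for the example, and it expects ticks to arrive in time order.

```python
import csv
import io
from datetime import datetime

TICKS_CSV = """Timestamp,Price,Volume,Side
2026-01-24 09:30:00.123,4520.25,5,BUY
2026-01-24 09:30:00.125,4520.50,3,BUY
2026-01-24 09:30:00.128,4520.25,10,SELL
"""


def ticks_to_minute_bars(csv_text):
    """Aggregate time-ordered ticks into 1-minute OHLCV bars keyed by bar start."""
    bars = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.strptime(row["Timestamp"], "%Y-%m-%d %H:%M:%S.%f")
        key = ts.replace(second=0, microsecond=0)  # truncate to the minute
        price, volume = float(row["Price"]), int(row["Volume"])
        bar = bars.get(key)
        if bar is None:
            # First tick of the minute sets every price field.
            bars[key] = {"open": price, "high": price, "low": price,
                         "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far
            bar["volume"] += volume
    return bars


bars = ticks_to_minute_bars(TICKS_CSV)
print(bars)
```

The reverse is not possible: once ticks are collapsed into a bar, the sequence of prices inside it is gone.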
Granularity: From Tick to Monthly
The correct granularity depends on your time horizon and trading style.
| Granularity | Typical Use | Bars/year | Approx. size/year |
|---|---|---|---|
| Tick | HFT, scalping | Millions | 1-10 GB |
| 1 minute | Day trading | ~98,000 | 5 MB |
| 5 minutes | Day/Swing | ~19,600 | 1 MB |
| 1 hour | Swing trading | ~1,600 | 100 KB |
| Daily (EOD) | Position trading | ~252 | 20 KB |
| Weekly | Long-term investing | ~52 | 5 KB |
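The intraday counts in the table appear consistent with a regular US equity session. The short calculation below reproduces them under two assumptions that are not stated in the table: 252 trading days per year and a 390-minute session (09:30 to 16:00). Markets that trade nearly around the clock, such as forex or futures, will produce far more bars.

```python
TRADING_DAYS = 252     # assumption: approximate US trading days per year
SESSION_MINUTES = 390  # assumption: 09:30-16:00 regular US equity session


def bars_per_year(bar_minutes):
    """Approximate number of intraday bars per year for one symbol."""
    return round(SESSION_MINUTES / bar_minutes * TRADING_DAYS)


for minutes in (1, 5, 60):
    print(f"{minutes:>2}-minute bars: {bars_per_year(minutes):,}")
```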
📐 Practical Rule
Your data timeframe should be at least 5-10x smaller than your trading timeframe. If you trade on 1-hour charts, use bars of roughly 5 to 10 minutes, or finer.
Bid/Ask and Spread: The Detail that Kills Strategies
Most historical data only shows one price (last, mid, or close). But in reality, there are always two prices: Bid (what you can sell at) and Ask (what you can buy at). The key formulas to understand these prices are:
Mid Price = (Bid + Ask) / 2
Spread = Ask - Bid (the implicit round-trip cost: relative to mid, you pay half on entry and half on exit)
The Hidden Spread
If your backtest assumes you buy at "Close" price and the real spread is 2 pips, you're ignoring a cost that eats your edge.
EUR/USD Example
- Your backtest uses mid price: 1.0850
- Real spread: 1 pip
- Buy at Ask: 1.08505
- Sell at Bid: 1.08495
In a 50-pip profit trade, spread costs 1 pip (2%). In a 10-pip trade, it costs 1 pip (10%). Spread matters more the shorter your horizon.
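The arithmetic above can be sketched as follows. The helper names are hypothetical, and the pip size of 0.0001 applies to EUR/USD and similar pairs; it differs for others, such as JPY crosses.

```python
PIP = 0.0001  # pip size for EUR/USD (assumption: not valid for JPY pairs)


def quotes_from_mid(mid, spread_pips):
    """Return (bid, ask) for a mid price and a spread expressed in pips."""
    half = spread_pips * PIP / 2
    return mid - half, mid + half


def spread_share_of_profit(gross_pips, spread_pips):
    """Fraction of a trade's gross profit consumed by the round-trip spread."""
    return spread_pips / gross_pips


bid, ask = quotes_from_mid(1.0850, spread_pips=1)
print(f"bid={bid:.5f} ask={ask:.5f}")
for gross in (50, 10):
    print(f"{gross}-pip trade: spread takes {spread_share_of_profit(gross, 1):.0%}")
```

A backtest that fills buys at the ask and sells at the bid, instead of at mid or close, accounts for this cost automatically.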
Adjusted vs Unadjusted Data
This is one of the most confusing and most important topics. Using the wrong type can generate completely false signals.
What Are Adjustments?
When a company does a 2:1 split, the price is divided by 2. If yesterday the stock was worth $200 and today it's worth $100, it's not because it fell 50% — it's because there are now twice as many shares.
Adjusted data retroactively modifies all historical prices to reflect these corporate actions, maintaining series continuity. The basic adjustment formula for splits is:
Adjusted Price = Price x (1 / Split Ratio) (retroactive adjustment for split)
For example, in a 4:1 split, all prices before the split date are divided by 4 to maintain time series coherence.
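A minimal sketch of that retroactive adjustment, using invented prices and dates. The function name and the `(date, close)` layout are assumptions for the example; a real adjustment would also rescale volume and handle several splits in sequence.

```python
from datetime import date


def back_adjust_for_split(bars, split_date, ratio):
    """Return a copy of (date, close) bars with pre-split closes divided by ratio.

    ratio is new shares per old share (4 for a 4:1 split). Bars on or after
    split_date already trade at the new scale and are left untouched.
    """
    return [(d, c / ratio if d < split_date else c) for d, c in bars]


# Hypothetical raw closes around a 4:1 split effective on the third bar.
raw = [
    (date(2024, 3, 7), 400.0),
    (date(2024, 3, 8), 404.0),
    (date(2024, 3, 11), 102.0),  # first session after the split
]
adjusted = back_adjust_for_split(raw, date(2024, 3, 11), ratio=4)
print(adjusted)
```

After adjustment the series moves from 100 to 101 to 102, so the artificial 75% drop disappears while the genuine day-to-day changes remain.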
Types of Corporate Actions
| Action | What Happens | Adjustment Needed |
|---|---|---|
| Split (2:1) | Price ÷2, shares ×2 | Divide historical prices by 2 |
| Reverse split (1:10) | Price ×10, shares ÷10 | Multiply historical prices by 10 |
| Dividend | Price drops by the dividend amount on the ex-date | Subtract dividend from historical prices (or scale them proportionally) |
| Spinoff | New company created | Proportional adjustment |
When to Use Each Type
✅ ADJUSTED Data (95% of cases)
- Momentum, trend following strategies
- Technical indicators (moving averages, RSI, etc.)
- Any historical price comparison
⚠️ UNADJUSTED Data (specific cases)
- Nominal price analysis ("trading above $50?")
- Options strategies (nominal strikes)
- Dividend analysis
The Classic Error
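AAPL's 4:1 split in August 2020. Price before: about $500. Price after: about $125.
With unadjusted data: your strategy sees a 75% overnight drop that never happened. A mean-reversion system reads it as a crash and fires a massive buy signal; a trend-following system dumps its position. Catastrophic error either way.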
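A quick way to spot this kind of artifact is to check whether a one-bar price change lands suspiciously close to 1/N for a common split ratio N. The sketch below is a heuristic under assumed parameters (the ratio list and the 3% tolerance are arbitrary choices); a hit is only a prompt to check the corporate-action record, since a genuine crash can land on the same number.

```python
def looks_like_split(prev_close, close, ratios=(2, 3, 4, 5, 10), tol=0.03):
    """Return the forward split ratio a one-bar change resembles, or None.

    A forward N:1 split makes close / prev_close land near 1/N. tol is the
    allowed relative deviation from 1/N (illustrative value).
    """
    change = close / prev_close
    for n in ratios:
        expected = 1 / n
        if abs(change - expected) <= tol * expected:
            return n
    return None


print(looks_like_split(200.0, 50.0))   # a 75% overnight drop
print(looks_like_split(200.0, 190.0))  # an ordinary 5% decline
```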
Beware of biases in your data
Market data can contain invisible traps like survivorship bias, look-ahead bias, or data snooping. These biases turn winning backtests into real losses. There are also technical errors like trading CFDs with index data or timezone issues that invalidate results. We have a dedicated article on the 7 main backtesting problems where we cover each one in detail.
Data Providers: Free vs Premium
Free Providers
| Provider | Markets | Granularity | Limitations |
|---|---|---|---|
| Yahoo Finance | Stocks, ETFs, indices | Daily | Errors, no delisted |
| Alpha Vantage | Stocks, Forex, Crypto | Up to 1 min | 25 calls/day free |
| FRED | Macro | Daily/monthly | Macro only |
| Dukascopy | Forex, Index CFDs, Commodities | Tick (bid/ask) | Requires a free demo account |
Premium Providers
| Provider | Markets | Granularity | Price/month |
|---|---|---|---|
| Polygon.io | US Stocks, Options, Forex | Tick | $29-199 |
| IQFeed | US Stocks, Futures | Tick | $80-150 |
| Norgate Data | US/AU Stocks, Futures | Daily (adjusted) | $35-50 |
| CSI Data | Global Futures | Daily | $30-60 |
| Tiingo | US Stocks, Crypto | Daily + IEX intraday | Free limited / $100+ |
Recommendations by Profile
🎓 Beginner
- Yahoo Finance for learning
- Tiingo ($10/month) for cleaner data
📈 US Stock Trader
- Polygon.io ($29/month) for intraday
- Norgate ($35/month) for EOD survivorship-free
🔮 Futures Trader
- IQFeed ($80-150/month) — industry standard
- CSI Data for long history
💱 Forex Trader
- Dukascopy (free) — excellent tick data
- TrueFX for long history
The Special Case: TradeStation as a Complete Ecosystem
The TradeStation ecosystem deserves special mention as one of the most popular solutions among algorithmic traders because it solves several problems at once: development platform, quality historical data, and execution.
📊 Data Included with TradeStation Account
US Stocks:
- Tick-by-tick: 6 months
- 1-minute data: since 1991
- Daily data: since 1968
Futures:
- Tick-by-tick: 6 months
- 1-minute data: since 1982 (depending on market)
- Individual contracts by expiration
✅ Advantages
- All-in-one: data, platform, backtesting, execution
- High quality: automatic real-time error filtering
- Deep history: decades of intraday data (minute since 1991)
- Cost effective: data included with trading account
- Exportable: you can download data to TXT/CSV
- EasyLanguage: accessible development language
⚠️ Considerations
- Requires account: $5,000 minimum for futures
- API: programmatic access may require additional capital (verify current requirements)
- Lower reported volume: ~28-30% less volume vs Polygon/Alpaca (per comparative studies)
- Limited tick data: only 6 months of tick-by-tick history
Who is TradeStation for?
If you're going to develop strategies AND execute them, TradeStation is a very efficient option. You pay for the broker and get quality data included. If you only need data for research (without executing), platforms like Polygon.io or Norgate offer more flexibility. The TradeStation ecosystem is especially popular for US futures where the platform + data + execution combination is hard to beat on cost.
Conclusion
Market data is literally the raw material of your algorithmic trading. Without quality data, everything else —your strategy, your code, your analysis— is built on sand. Once you have quality data, the next step is measuring performance correctly with advanced risk-adjusted metrics and understanding your strategy's drawdown.
The 5 Key Points
- Garbage In, Garbage Out: A backtest is only as good as its data. Investing in quality data is the best investment.
- Know your biases: Survivorship bias, look-ahead bias, and selection bias invalidate more backtests than any coding error.
- Adjusted vs unadjusted matters: For technical strategies, always use adjusted data.
- Spread kills short-term strategies: If you trade frequently, you need bid/ask data or simulate spread.
- Scale your data investment: Start basic to learn, but invest in professional data when trading real money.
Frequently Asked Questions
For learning and prototyping, yes. For real strategies with money, not recommended. Yahoo Finance has known errors, doesn't include delisted stocks (survivorship bias), and modifies data retroactively. For serious backtesting, invest in Tiingo ($10/month) or Norgate Data ($35/month).
Tick data records each individual transaction with millisecond timestamp. OHLC aggregates transactions into bars showing only open, high, low, and close. Tick is necessary for high frequency and maximum precision. OHLC is sufficient for swing and position trading.
Adjusted for 95% of cases: any strategy with technical indicators or that compares historical prices. Unadjusted only if you need specific nominal prices (options with fixed strikes, dividend analysis).
Depends on market and granularity. EOD: $10-50/month. Intraday: $30-100/month. Tick data: $50-200/month. Some providers like Dukascopy (forex) offer free tick data if you have an account.
It's the bias when your data only includes assets that "survived" until today. Companies that went bankrupt or were delisted don't appear. This artificially inflates results by 50-200% because you ignore losers. Solution: use providers with "survivorship-free" data like Norgate Data.
IQFeed is the industry standard for US futures in real-time ($80-150/month). For long history and well-built continuous contracts, CSI Data ($30-60/month).
With caution. Broker data is useful for execution, but for backtesting it has problems: may have gaps, limited history, and prices may be broker-specific (especially in forex). Use broker data for confirmation, but backtest with independent providers.
Depends on your strategy. Minimum 3-5 years for intraday, 10-15 years for swing trading, 20+ years for strategies with few trades. You need enough data to include different market regimes (bull, bear, sideways, high/low volatility).
Look-ahead bias occurs when you use information in a backtest that wouldn't have been available at the time of the actual trading decision. Examples: using closing prices for pre-close decisions, using fundamental data before publication, or applying retroactive adjustments. It's one of the hardest biases to detect and can completely invalidate your results.
The spread is an implicit cost on every trade. On a 50-pip profit trade, a 1-pip spread represents 2% of the profit. On a 10-pip trade, the same spread represents 10%. The shorter your time horizon, the greater the relative impact of spread. That's why high-frequency strategies need real bid/ask data.