This page will be updated with interesting tools for data scientists. The tools cover data management, data cleaning, the command line, solutions for tackling large datasets (large, but not quite "big data"), data visualisation, Data Science learning resources, etc.

Learning resources


Statistics & Machine Learning


Data cleaning

  • OpenRefine: a data cleaning tool / open source
  • DataCleaner: data cleaning and quality assessment / open source & commercial
  • Trifacta Wrangler: free and commercial editions exist; nice interface, but it cannot export the cleaning steps as a standard script.


Web scraping

  • R libraries:
    • rvest
    • XML
  • Python packages:
    • urllib2 (Python 2; urllib.request in Python 3) + BeautifulSoup
    • Scrapy
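
The packages above all follow the same pattern: fetch a page, parse the HTML, pull out the elements of interest. Here is a minimal sketch of the parsing step only. To keep it runnable without extra installs, it uses the standard library's html.parser as a stand-in for BeautifulSoup, and the names LinkCollector, extract_links and SAMPLE_HTML are made up for illustration. In real use the HTML would come from urllib.request (or requests) rather than an inline string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, used instead of a real download
SAMPLE_HTML = (
    '<html><body>'
    '<a href="/tools">Tools</a>'
    '<p>some text</p>'
    '<a href="https://example.org/data">Data</a>'
    '<a name="anchor-only">no href here</a>'
    '</body></html>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

With BeautifulSoup the same extraction is usually shorter, and Scrapy adds crawling and scheduling on top; the sketch only shows the idea.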

Parallel computing

  • Spark: fast and general engine for large-scale data processing / open source

Memory bottleneck

  • SQLite: a local, installation-free SQL database; keep the data on disk and query only what you need
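
A rough sketch of how SQLite can help when a dataset does not fit in memory, using Python's built-in sqlite3 module: rows are inserted in small chunks, and the aggregation is done by the database instead of in Python. The table name measurements, the helper functions and the generated rows are all assumptions for illustration, not part of any real dataset.

```python
import sqlite3


def load_in_chunks(conn, rows, chunk_size=1000):
    """Insert rows chunk by chunk so the full dataset never sits in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements (sensor TEXT, value REAL)"
    )
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            conn.executemany("INSERT INTO measurements VALUES (?, ?)", chunk)
            chunk = []
    if chunk:  # flush the last, partially filled chunk
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", chunk)
    conn.commit()


def mean_per_sensor(conn):
    """Let SQLite compute the aggregate; only the small result is returned."""
    cur = conn.execute(
        "SELECT sensor, AVG(value) FROM measurements "
        "GROUP BY sensor ORDER BY sensor"
    )
    return dict(cur.fetchall())


def fake_rows(n):
    """Hypothetical data source: yields (sensor, value) pairs one at a time."""
    for i in range(n):
        yield ("a" if i % 2 == 0 else "b", float(i))


if __name__ == "__main__":
    # ":memory:" keeps the demo self-contained; pass a file path such as
    # "data.db" to store the data on disk, which is the point in real use.
    conn = sqlite3.connect(":memory:")
    load_in_chunks(conn, fake_rows(10), chunk_size=4)
    print(mean_per_sensor(conn))
    conn.close()
```

The source of the rows could just as well be a CSV reader or another generator; the important part is that rows are streamed in and only query results come back.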

Data visualization

Design resources


My Collection of R packages