Do you need large amounts of data, perhaps gigabytes of it, to check the performance of your apps? Do you fret over where to get it? The simplest way is to download sample data from the free data repositories available on the Web. The disadvantage is that such data may contain very little unique content and may not give the desired results. Given below is a list of websites where you can get large data sets free of cost:
- Wikipedia : The data is available in multiple languages. Wikipedia offers free copies of all available content to interested users, and the content, including images, is downloadable.
- BigML : This site hosts thousands of public data sources.
- EDRM File Formats Data Set : It consists of more than 381 files covering over 200 file formats.
- Common Crawl : It builds and maintains an open crawl of the web that is accessible to everyone. The data is stored in an Amazon S3 bucket, and the requester may have to pay to access it.
- Apache Mahout : The Apache Mahout project's goal is to create scalable machine learning algorithms. Mahout provides many links to free and paid corpus data.
- EDRM Enron Email Data Set : It consists of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST.
- ClueWeb09 : It was created to support research on information retrieval and related human language technologies. It consists of over 1 billion web pages in ten languages that were collected during January-February 2009. The data set is used by several tracks of the TREC conference.
- AWS (Amazon Web Services) Public Data Sets : It provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- DMOZ : The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different categories. DMOZ is a major source for internet search engines.
- Million Song Dataset : It provides data related to tracks and artists.
- : This site is about large data sets and caters to the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, and the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and integrate their particular projects.
- Project Gutenberg : It offers more than 36,000 free eBooks to download on your PC, Kindle, Android, iOS or any other portable device.
- Bioassay data : It was described in "Virtual screening of bioassay data" by Amanda Schierz, Journal of Cheminformatics, with 21 bioassay datasets (active / inactive compounds) available for download. The article is Open Access, distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
- Bitly data: It hosts anonymized clicks of government links.
- Canada Open Data : It is a pilot project with many government and geospatial datasets.
- Causality Workbench : It is a data repository.
- Corral Big Data repository : It is a Texas Advanced Computing Center repository that supports data-centric science.
- Data Source Handbook : Data Source Handbook: A Guide to Public Data, by Pete Warden, O'Reilly (Jan 2011), is a concise ebook that covers the most useful sources of public data available today.
- : It hosts a comprehensive list of open data catalogs, with government data from the US, EU, Canada, CKAN, and more.
- : It provides publicly available data from the UK (see also the London Datastore).
- It is a central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
- DataMarket : It hosts and visualizes data on the world's economy, societies, nature, and industries, with 100 million time series from the UN, World Bank, Eurostat, and other important data providers.
- Datamob : It showcases public data put to good use.
- : It is a clearinghouse of datasets available from the City & County of San Francisco, CA.
- DataFerrett : It is a data mining tool that accesses and manipulates TheDataWeb, a collection of many online US Government datasets.
- Delve : Delve stands for Data for Evaluating Learning in Valid Experiments.
- EconData : It hosts thousands of economic time series, produced by a number of US Government agencies.
- Enron Email Dataset : It hosts data from about 150 users, mostly senior management of Enron.
- Europeana Data : It contains open metadata on 20 million texts, images, videos, and sounds gathered by Europeana, a trusted and comprehensive resource for European cultural heritage content.
- FEDSTATS : It is a comprehensive source of US statistics and more.
- FIMI : This website serves as the FIMI repository, containing the source code of all implementations that were accepted at the FIMI workshops, together with several publicly available datasets.
- Hilary Mason research-quality Big Data sets : It provides a collection of many text and image datasets.
- Google ngrams datasets : It has text from millions of books scanned by Google.
- KDD Cup center : It hosts information for the data science, data mining, and analytics community.
- Grain Market Research : It provides users with financial data including stocks, futures, etc.
- Financial Data Finder at OSU : It is a large catalog of financial data sets.
- Kevin Chai list of datasets : This site provides information on data set directories.
- Linking Open Data : It is a project that makes data freely available to everyone.
- GEO (Gene Expression Omnibus) : It is a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for browsing, querying, and retrieving gene expression data.
- GeoDa Center : It hosts geographical and spatial data.
- HitCompanies Datasets : This website provides comprehensive data on 10,000 UK companies randomly sampled from Hit Companies and is updated automatically using AI/machine learning.
- ICWSM-2009 dataset : It contains 44 million blog posts made between August 1st and October 1st, 2008.
- Infochimps : This is an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
- Investor Links : This website includes financial data.
- OpenData from Socrata : This website grants access to over 10,000 datasets including business, education, government, and fun.
- National Space Science Data Center (NSSDC) : It gives access to NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- Open Data Census : This website assesses the state of open data around the world.
- KONECT : The Koblenz Network Collection provides users with large network data sets of all types for research in the area of network mining.
- National Government Statistical Web Sites : It gives readers access to data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- Quandl : This website grants users access to a collaboratively curated portal to millions of financial and economic time-series datasets.
- GDELT : GDELT stands for The Global Data on Events, Location and Tone, described by the Guardian as "a big data history of life, the universe and everything."
- qunb: This is a platform for finding and visualizing quantitative data.
- ML Data : It is the data repository of the EU Pascal2 networks.
- Robert Shiller data : This website provides data on housing, the stock market, and more from Robert Shiller's book Irrational Exuberance.
- Open Source Sports : This website hosts many sports databases, including Baseball, Football, Basketball, and Hockey.
- SMD: Stanford Microarray Database : This website stores raw and normalized data from microarray experiments.
- StatLib : This website hosts the CMU Datasets Archive.
- Peter Skomoroch dataset Bookmarks : This lists genomics-related publication databases.
- NASDAQ Data Store: It provides access to market data.
- Research Data : It includes historic and status statistics on more than 100,000 projects and over 1 million registered users' activities at the project management web site
- Jerry Smith dataset collection : This website lists data collections on finance, government, machine learning, science, and other topics.
- Time Series Data Library : It hosts statistical data related to day-to-day events and a whole range of other datasets.
- UCI KDD Database : This is a repository for large datasets which are used in machine learning and knowledge discovery research.
- UCR Time Series Data Archive : This website offers a variety of data sets, papers, links, and code.
- Yahoo Sandbox datasets: This website provides information and data sets on Language, Graph, Ratings, Advertising and Marketing, Competition etc.
- Wikiposit : This is a virtual amalgamation of mostly financial data from many different sites, allowing users to merge data from different sources.
- Yelp Academic Dataset : This website hosts all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.
- UCI Machine Learning Repository : This website hosts information on machine learning repositories.
- United States Census Bureau : This website provides census data and a variety of other data sets.
- Wolfram Alpha : It provides data sets on diseases and patient-level statistics.
- STATOO Datasets part 1 and STATOO Datasets part 2 : STATOO is a consulting firm specialised in statistical consulting and training, data analysis, and data mining services.
- MIT Cancer Genomics : This website provides cancer programme data sets from the MIT Whitehead Center for Genome Research.
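
Whichever source you pick, it can help to check how much of a downloaded sample is actually unique before relying on it for performance tests, since repetitive content is the main drawback of free sample data. Here is a minimal Python sketch under a few assumptions: the data is line-oriented text, and the function name `unique_line_ratio` and the small sample list are made up for illustration and do not come from any of the sites listed here.

```python
import hashlib
from typing import Iterable


def unique_line_ratio(lines: Iterable[str]) -> float:
    """Return the fraction of distinct non-empty lines, ignoring surrounding whitespace."""
    seen = set()
    total = 0
    for line in lines:
        normalized = line.strip()
        if not normalized:
            continue  # blank lines are not counted at all
        total += 1
        # Store a fixed-size digest instead of the full line to keep memory use bounded per entry.
        seen.add(hashlib.sha1(normalized.encode("utf-8")).digest())
    return len(seen) / total if total else 0.0


# Hypothetical sample: 5 non-empty lines, 3 of them distinct.
sample = ["alpha", "beta", "alpha", "", "gamma", "beta "]
print(f"{unique_line_ratio(sample):.2f}")  # prints 0.60
```

For a real download, you could pass an open file object (opened in text mode) instead of the list, because the function only needs something it can iterate over line by line. A low ratio suggests the sample may not exercise your app the way varied production data would.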