Per Erik Strandberg

Background

I have a challenge at work: we have lots of multi-dimensional data that I want to structure and visualize. We create some plots and look at some trends. But I'd like to be better at really analyzing, structuring and visualizing it.

There are of course many ways to do this. A lot of people use Microsoft Excel. But I have millions of data entries so using that kind of a GUI is not an option for me. Others use the R programming language (W: [1]), or Matlab (W: [2]), Mathematica (W: [3]), Maple (W: [4]) or other solutions (W: [5]). These tools might be all right and perfect for the job -- but I have been working with Python (W: [6]) for more than 10 years and I want to dive into the tools available within the Python ecosystem first.

One goal of this page on my blog is to have a starting point for learning these tools better. Right now I am pretty good at using matplotlib, but I want to learn more about the other tools - so I'll record my findings here, where I can find them again.

Another, perhaps even more important, goal is to learn about data analysis. I studied data mining a bit at university and I want to revisit that topic and others.

Data Analysis Sub-Processes and topics

According to Wikipedia (W: [7]) data analysis follows a process. It starts with collecting raw data from the world and processing this data into a clean dataset. On this dataset one can perform exploratory data analysis. From the clean dataset and the exploratory data analysis we can build models and algorithms. We can then communicate, visualize and report, which supports decision making. Before putting data back into the world we can use a data product - this could for example be an e-book store that recommends books based on what you (and others) have previously bought.

Other topics within Data Analysis include:
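
To make the steps of that process a bit more concrete, here is a minimal sketch in plain Python (standard library only). The raw values, the function names `clean`, `explore` and `fit_line`, and the choice of a straight-line model are all assumptions made up for illustration; they are not taken from Wikipedia or from any of the tools discussed on this page.

```python
from statistics import mean, median

# Hypothetical raw data: (x, y) pairs as strings, some of them unusable.
raw = [("1", "2.1"), ("2", "3.9"), ("3", ""), ("4", "8.2"), ("x", "1.0"), ("5", "9.8")]


def clean(rows):
    """Processing step: keep only rows where both fields parse as floats."""
    cleaned = []
    for x, y in rows:
        try:
            cleaned.append((float(x), float(y)))
        except ValueError:
            continue  # drop rows with missing or malformed values
    return cleaned


def explore(rows):
    """Exploratory step: a few summary statistics of the y values."""
    ys = [y for _, y in rows]
    return {"n": len(ys), "mean": mean(ys), "median": median(ys),
            "min": min(ys), "max": max(ys)}


def fit_line(rows):
    """Modeling step: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


data = clean(raw)
summary = explore(data)
slope, intercept = fit_line(data)

# Communication step: report the findings.
print(summary)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

With millions of entries one would probably reach for the dedicated libraries instead, but the shape of the workflow (clean, explore, model, report) stays the same.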

Commonly mentioned tools

I have looked at the summaries of many books on data analysis and Python, in particular from O'Reilly (see for example [8] or [9]) and Packt (see for example [10], [11], [12], [13], [14], [15], or the book Python Data Analysis, which I bought and reviewed). (This might look like advertising, but it is not. I have not read any of these books but they seem pretty good. I mention them because they in turn reference interesting parts of the Python ecosystem.)

All of these books discuss either data analysis or commonly used Python tools for data analysis. The tools are:

Often these are combined with a way to organize Python sessions. Instead of a sloppy Python command line session we use a notebook metaphor. Some tools that are sometimes mentioned are:

Data, Input & Output

I want to explore these tools by working with data. This data comes in many forms and I typically read from one format and export to another. Sometimes the input comes from database queries, and the output is stored in a CSV or YAML format. Typical data formats I want to explore a bit more are:

Data Analysis Process, Topics, Python Data Analysis Tools and Data Formats
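
The round trip mentioned in this section, from a database query to a CSV file, can be sketched with the standard library modules `sqlite3` and `csv`. The table name `measurements`, its columns and its values are hypothetical, and an in-memory database and text buffer are used so that the example runs without touching any files. YAML output is left out because it would need a third-party package.

```python
import csv
import io
import sqlite3

# Hypothetical table and values, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, day INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("a", 1, 0.5), ("a", 2, 0.7), ("b", 1, 1.5), ("b", 2, 1.1)],
)

# Input: a database query that aggregates per sensor.
cursor = conn.execute(
    "SELECT sensor, AVG(value) AS mean_value "
    "FROM measurements GROUP BY sensor ORDER BY sensor"
)
header = [col[0] for col in cursor.description]  # column names from the query
rows = cursor.fetchall()
conn.close()

# Output: CSV text. Swap the buffer for open("out.csv", "w", newline="")
# to write an actual file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```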


This page has a lot of links. A convention I use is W for Wikipedia, H for what I guess is the official homepage.

The quotes are all from Wikipedia -- The Free Encyclopedia. Most are from the huge article on Data Analysis, which has more than 300 links to other Wikipedia articles.

