Per Erik Strandberg

I just read Python Data Analysis by Ivan Idris (Packt Publishing, 2014, ISBN 9781783553358). The price right now is about 400 SEK (or less than 300 SEK as an e-book) at Adlibris (see [1]), 50 USD at Amazon (see [2]) or 40€ at Packt (see [3]).

http://www.pererikstrandberg.se/blog/python-data-analysis-ivan-idris.jpg

Overview

I like physical books, so I bought it as a 329-page softback instead of an e-book. In addition to the 12 chapters it also has 3 appendices, an about section, a preface and an index:

  1. Getting Started with Python Libraries
  2. NumPy Arrays
  3. Statistics and Linear Algebra
  4. pandas Primer
  5. Retrieving, Processing, and Storing Data
  6. Data Visualization
  7. Signal Processing and Time Series
  8. Working With Databases
  9. Analyzing Textual Data and Social Media
  10. Predictive Analytics and Machine Learning
  11. Environments Outside the Python Ecosystem and Cloud Computing
  12. Performance Tuning, Profiling, and Concurrency
  13. Appendices
    1. Key Concepts
    2. Useful Functions
    3. Online Resources

It covers many Python libraries and related tools. Some of them are: NumPy, SciPy, matplotlib, IPython, pandas, PyTables, JSON, Feedparser, Beautiful Soup, sqlite3, SQLAlchemy, Pony ORM, PyMongo, MongoDB, Redis, Apache Cassandra, NLTK, scikit-learn, ...

Negative

When I try to think of negatives, and read critical reviews of the book, I find it a bit heavy on code snippets and a little light on actual data analysis. I read the book from start to finish, as opposed to using it as a reference book, and I found it very repetitive towards the end, in particular in the areas that interest me less.

One reviewer at Amazon wrote "...Unfortunately, the book is seriously flawed. First, it is a mile wide and barely an inch deep..." and goes on to say "...the book is riddled with errors..." and "I give 2 stars only because there is some useful information throughout the book, as shallow as it may be, and because the author has at least used a variety of datasets in the examples".

I agree that I expected more depth in certain areas, especially the ones I care most about (for example the chapter on Data Visualization). From this perspective the book is a good start, but I expect to have to find more information elsewhere.

Another, unnecessary, issue with the book is that it uses the well-known Lenna standard test image. This is an image from a 1972 Playboy issue. If you don't know what Playboy is, then we can just say that it is the kind of magazine that men only read for the articles. I think it is a shame that the author used this image, since it tells women that data analysis is not for them. Wikipedia phrases it well in its article on the Lenna image: "The use of the image has produced controversy because 'Playboy' is sometimes considered as degrading towards women and the Lenna photo has been pointed to as an example of sexism in the sciences, reinforcing gender stereotypes". (Wikipedia: [4])

Positive

This is a really good sweep of the tools available for data analysis in Python. In some areas, such as the chapters on NumPy and pandas, we also get a little more than the broad overview. This was perfect for me since I have more than 10 years of experience with Python, a fair amount of experience with numerical computing and some with data mining, but very limited knowledge of numerical computing in Python.

Other areas that I found very interesting were the ones on pandas (I recently wrote two texts involving pandas: Python Pandas First Look and Python Data Analysis With Sqlite And Pandas), and on the Natural Language Toolkit, NLTK (W: [5], H: [6]). Back in 2005 I wrote a final thesis that included text mining, and NLTK could perhaps have been a good choice instead of the home-made libs I wrote back then.

Another very positive aspect of the book is that it introduces some data sets that it uses for examples. Sometimes it is sun spots, other times it is Shakespeare, and so on. I get the urge to download these data sets to play with the tools presented in the book -- and that is a good thing!

In addition to the many snippets of code in the book, the code examples are available for download from Packt (I have not tried it), so I guess that getting started with your own data analysis project can be sped up thanks to this.

Summary

I will keep this book as a reference on my desk (so that I don't have to trip over pandas grouping the next time I want to group data). I took some university courses in data mining a long time ago, but that knowledge needs to be refreshed and this book can really help with that.
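
As a reminder to myself of the kind of grouping I mean, here is a minimal sketch, assuming pandas is installed. The column names and numbers are made up for illustration and are not taken from the book.

```python
import pandas as pd

# Hypothetical measurements; the column names are invented for this sketch.
df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split the rows by sensor, then compute the mean and the row count per group.
summary = df.groupby("sensor")["value"].agg(["mean", "count"])
print(summary)
```

The result is a small table indexed by sensor, which is usually what I want before plotting or further filtering.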

I liked the data analysis process with filtering of outliers, and the text mining library NLTK encouraged me to start thinking about more intelligent text mining than just grep when investigating logfiles. I also think that the next time I need exploratory data analysis I can do it faster and better if I take a quick peek in the pandas chapter.
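
To make the logfile idea a bit more concrete, here is a rough sketch of going one step beyond grep. It uses only the Python standard library, not NLTK or the book's own code. The log format, the helper names and the threshold of three median absolute deviations are my own assumptions for illustration.

```python
import re
import statistics
from collections import Counter

# Hypothetical log lines; the format is invented for this sketch.
LOG_LINES = [
    "2015-03-01 12:00:01 INFO request served in 12 ms",
    "2015-03-01 12:00:02 INFO request served in 14 ms",
    "2015-03-01 12:00:03 ERROR database timeout after 900 ms",
    "2015-03-01 12:00:04 INFO request served in 11 ms",
    "2015-03-01 12:00:05 ERROR database timeout after 950 ms",
    "2015-03-01 12:00:06 INFO request served in 13 ms",
]


def durations(lines):
    """Extract the 'NNN ms' durations from the log lines as integers."""
    found = []
    for line in lines:
        match = re.search(r"(\d+) ms", line)
        if match:
            found.append(int(match.group(1)))
    return found


def filter_outliers(values, max_deviations=3.0):
    """Keep values within max_deviations median absolute deviations of the median."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All values are (nearly) identical, so there is nothing to filter.
        return list(values)
    return [v for v in values if abs(v - median) <= max_deviations * mad]


def top_words(lines, level, n=3):
    """Count the most common lowercase words in messages with the given level."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=3)  # date, time, level, message
        if len(parts) == 4 and parts[2] == level:
            counts.update(re.findall(r"[a-z]+", parts[3].lower()))
    return counts.most_common(n)


typical = filter_outliers(durations(LOG_LINES))
print("typical durations:", typical)
print("common words in errors:", top_words(LOG_LINES, "ERROR"))
```

With real logs I would expect to swap the simple word count for proper tokenization and stop-word removal, which is where a library like NLTK should come in handy.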


This page belongs in Kategori Boktips
See also Data Analysis With Python