No study of the history of scientific communication can be complete without mention of Charles Joseph Minard, a 19th-century French civil engineer and cartographer.
At the end of his life, Minard created two very famous examples of statistical charts, called flow maps, that every scientist, engineer and student should be familiar with. The first shows Hannibal’s crossing of the Alps (218 BC, Second Punic War), and the second describes Napoleon’s disastrous invasion of Russia (1812-1813).
Both examples are beautiful works of art and masterful examples of evidence. But they are also more than that: they tell cohesive and interesting stories. In this post, I thought it might be interesting to take a closer look at the history of Hannibal and Napoleon, and highlight the ways in which Minard’s charts help us to explain their eventual outcomes.
(Note: High-resolution PDF versions of the two maps are available for download. These versions have been translated from the original French. To download, either click on the images, or here for Hannibal’s invasion of Northern Italy, and here for the French invasion of Russia.)
Publications are the currency of ideas. Through them the experts, thinkers and dreamers of this world can share their thoughts and insights. A good publication is not only influential, but it’s even capable of shifting the course of a whole society, as Martin Luther King demonstrated with his “Letter from a Birmingham Jail”.
Since publications are so important to the dissemination of knowledge, there is a rather high expectation that an academic author should publish prolifically. The mantra “Publish or Perish” is not just a clever quip, but a very serious way of life.
It is ironic, then, that the most prolific of academic writers can suffer from a surprising problem: it can be very difficult to keep track of all of their work. Yet an up-to-date CV is very important. After all, publishing your work in influential journals is an important first step toward earning tenure!
Members of a research team or those who collaborate outside of their institution experience this same problem, only more so. Such a person may work on many projects at once, but only have direct responsibility for one or two of them. This places the researcher in the unenviable position of trying to track the work of others. This situation becomes even more complicated if the collaborator refuses to play by the rules of common decency.
It would be nice, for example, if the primary author of a publication would notify the co-authors of its progress, or when it has been submitted. But … that doesn’t always happen. Academic researchers are busy people and soliciting feedback from all of your collaborators can be difficult … and there is a tendency for difficult things to go undone. Thus, if you don’t follow what your teammates are working on, it is quite possible that an abstract might get submitted while your back is turned.
To stay on top of the “delightful chaos”, you need to have some kind of system. Personally, I keep my list of projects and publications in three places. The first (and perhaps most important) is the hand-written list in my experimental notebook. Any time I hear about a new project, it gets added to this list. I keep track of what I’ve contributed, what papers or abstracts have been created from the data, and what their status is. When I know that an abstract or paper has been accepted, I then create an entry for the item in my bibliography manager. Once in the bibliography manager, I can cite the reference in other documents such as proposals or related papers.
About once a year, I go through the tedious process of updating my CV. This typically involves manually sorting through both my project list and my reference database to account for new items and reconcile differences. Every time I do this, it’s painful; and because I’ve historically formatted the reference list by hand, it’s not uncommon for a typo to sneak its way in or for an author to accidentally get left off of a citation. These mistakes are never intentional, but they do happen.
When I find such an error in the reference database, I fix it. But since I often import these references from websites, the errors tend to be few and far between. Moreover, my reference database is something that I use every day; as a result, it gets a lot of scrutiny. My CV, on the other hand, gets updated much less frequently and errors tend to persist longer.
For a very long time, I’ve wanted to automate the process. Instead of keeping three separate lists – active projects, reference database, and CV – I’d prefer to keep only one (or two). But I’ve never found a really satisfactory way of doing so. Or at least I hadn’t found a system until quite recently.
In my last review of different ways to typeset a CV, I came across an interesting article by Dario Taraborelli. In it, he described how to create a CV based on the standard “article” document class. It was well designed, elegant, simple and attractive. From his work, I created the xetexCV document class. Additional research turned up an add-on module that makes it convenient to generate a list of publications. So, for the first time in a great while, I have finally found a simple way to generate a publications list automatically. In this article, I will demonstrate how that is done.
Many first-time users of LaTeX mistakenly look at the language as a type of glorified word processing software – albeit a particularly complicated one. While such an analogy may be apt in helping new users become acclimatized to the language, it suffers from a rather nasty problem: LaTeX isn’t a word processor.
If anything, LaTeX shares more in common with a programming language than with any type of application. In fact, the document processing system is really nothing more than a bunch of re-usable pieces of programming called macros. Everything is a macro. That includes the commands that every user is familiar with (\title{}, \section{}, \subsection{}) in addition to the internal formatting commands that allow LaTeX to function. (Most of the macros were originally created or packaged by Leslie Lamport as a way of making TeX – the typesetting system created by Donald Knuth – easier to work with.)
This has some rather practical consequences. Because everything in LaTeX is a macro, it is far more extensible than a word processor could ever hope to be. If you require a feature that doesn’t yet exist, it typically isn’t all that difficult to add it. And when your extension is packaged inside a style or class, you can use those customizations in anything that you want to write.
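To make that concrete, here is a minimal sketch of adding a feature with a macro. The macro names \daterange and \cventry, and the sample entry, are made up for illustration; they are not taken from xetexCV or any existing package.

```latex
\documentclass{article}

% Hypothetical macro: typeset a span of years with an en dash.
\newcommand{\daterange}[2]{#1--#2}

% Hypothetical macro: one CV line with the title in bold on the left
% and the date range pushed to the right margin.
\newcommand{\cventry}[3]{%
  \noindent\textbf{#1}\hfill\daterange{#2}{#3}\par
}

\begin{document}
\section{Education}
% Placeholder content, for illustration only.
\cventry{University of Somewhere}{2001}{2005}
\end{document}
```

Once definitions like these are moved into a style or class file, every document that loads it can use \cventry as if it were a built-in command.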
But though creating macros isn’t particularly complicated, it is a different beast than just using the stock macros for writing. This is not surprising: the craft of design is inherently different than the craft of writing. There are different conventions to follow and different topics to obsess about. In the first article of this series, I introduced the xetexCV document class, which is one example of where I decided to don the designer hat.
But before you get too far down the road of customizing and extending, there are some important things that you need to know. These include the general conventions used when working with document classes, their internal anatomy, an understanding of how macros are created, and how to handle formatting and layout challenges. In this article, I will look at these issues in more detail, particularly as they pertain to xetexCV. In the process of reviewing these topics, I will also explain some of my design choices.
Very few documents are more personal than a curriculum vitae (CV). A CV lists a person’s educational history, who they’ve worked for and what they’ve accomplished. Moreover, a CV is frequently used to judge a person’s inherent worth and value (or at least exploitability). A quality curriculum vitae matters a lot.
For that reason, a CV not only needs to include all the pertinent information of a person’s life, but it also needs to look good. An attractive CV with good spacing and contrast leaves a positive impression and makes it easier to find information. When laid out correctly, a reviewer might just find themselves scouring past accomplishments for interesting tidbits: “I didn’t realize that this applicant organized a lecture series with Patch Adams and other notables, that’s interesting!”
Imagine for a minute that you’re writing a book or technical manual. Let’s say it’s a book on technology, maybe the open source tools used for scientific writing (to randomly pick an example). As you write this book, you realize that you need some way to cue the reader into different parts of the text.
For instance, you might want all definitions to appear in bolded text so that a reader can pick out key terms quickly. Or you might want code examples to appear in a different font than the regular text, again, so they’re easy to find. What’s the best way to do this?
Sure, you could just bold the definitions, or manually change the font for the code examples. But that’s painful! Changing typeface and size every time that you have a section of code will eventually result in a lot of lost time. Moreover, you might make a mistake, which destroys your consistency and makes your writing look unprofessional. There must be a better way!
Thankfully, there is. It’s through the consistent use of styles.
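In LaTeX terms, a style can be as small as a macro named for what the text is rather than how it looks. The sketch below is one possible approach, not a prescribed one; the macro names \term and \code are hypothetical.

```latex
\documentclass{article}

% Hypothetical semantic macros: mark up meaning, not appearance.
\newcommand{\term}[1]{\textbf{#1}}  % definitions appear in bold
\newcommand{\code}[1]{\texttt{#1}}  % code appears in a monospaced font

\begin{document}
A \term{style} is a named set of formatting rules.
Short code such as \code{x = 1} is then set in a different font
than the regular text.
\end{document}
```

If you later decide that definitions should be italic instead of bold, you change the one line that defines \term, and every definition in the book follows.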
When doing math or numerical analysis, the knowledge of the technique is far too often tied to the tool performing the calculation. Consider an engineer whose understanding of the Fast Fourier transform is inseparably tied to the fft function in Matlab. Of course this hypothetical engineer understands what the results mean (more or less) but may not be able to duplicate his analysis if Matlab were taken away.
In most cases, it is likely that no deeper understanding will be required. But what happens if the computer makes a mistake? Or the program becomes unavailable? Both situations are entirely possible. Computer algorithms aren’t perfect and occasionally arrive at results that make little sense; and hardware has been known to fail.
When the engineer understands how the computer arrived at the answer, however, he can recognize, understand, and ultimately correct those cases where the results are unexpected. This is an important reality check that can prevent costly disasters later down the line. Or, if the hardware is unavailable, he can use an alternative tool or software package to duplicate the analysis.
But while such a situation can arise with any type of numerical software, it’s most likely to happen to users of a statistical package. I find this extremely ironic since a proper understanding of statistics is essential to live in the modern world. (Much more so than an understanding of the Fast Fourier transform, at any rate.) The rules of probability, the normal curve, correlation, and multivariate statistics can have a direct impact on how we live our lives. They are used in making important decisions in finance, medicine, science and government. A misunderstanding of stats and the methods of science (from which statistics is inseparable), underlies the most divisive issues of our day: abortion, stem cell research, and global warming.
Moreover, neither side has a monopoly on ignorance or misunderstanding. People fail to distinguish between correlation and causality, or insist on using the word “average” as a slur. Nearly as bad are those who – like the hypothetical engineer described above – only understand statistics within the narrow context of their stats package. Casual statisticians are nearly as dangerous as the wholly uninformed.
The Statistical Package for the Social Sciences (SPSS) is one of the biggest perpetrators of this crisis. Which is hugely ironic, because I happen to love SPSS. SPSS is probably the first statistical package that has placed advanced statistical methods within the grasp of the novice user. I’ve been a happy user for nearly a decade (ever since I was introduced to the program in high school). But there is no doubt that I’ve come to understand statistics within the context of SPSS and its GUI.
Please don’t misunderstand me: I have a pretty good grasp of basic statistics. I can sling probability with the best of them and take relish in describing when to use Fisher’s exact test instead of a chi-square test; but advanced statistics are a completely different matter. Advanced stats scare me. I can certainly use these more complicated methods. I’ve analyzed and written about multivariate models and even ventured into Analysis of Variance (ANOVA). But I have to rely on SPSS and the aid of my institution’s biostatistician to help me recognize when there is a problem.
Which is why, in a time of tight budgets, losing the institution’s SPSS license has been a crushing blow to my productivity. (Whoever made that decision should be hauled out and shot!) Because I don’t have my statistics software any more, there are certain aspects of my job that are much more difficult to do. And unfortunately, there is only one logical conclusion to draw: I’ve become a victim of the statistical ease of SPSS.
LyX is a wonderful writing program. It’s easy to use and produces beautifully typeset output. More importantly, though, it lets an author focus on the content and structure of his writing rather than the formatting. It isn’t so easy to customize, though, which limits its usefulness in a big way. What if you need to create a new layout or take advantage of one of the thousands of specialized LaTeX styles? How, exactly, do you go about doing that?
That’s why this article was written. Recently, I was asked to help with a National Institutes of Health (NIH) R21 grant proposal. After some talk amongst the different investigators, it was decided that we would use LaTeX and LyX to draft it. Unfortunately, we hit a rather substantial hurdle early in the process: LyX doesn’t have an NIH grant template.
After additional debate, we decided to proceed with LyX anyway. But in the process, I found myself saddled with an additional job. In addition to responsibilities as research flunky and copy editor, I was tasked with creating a LyX and LaTeX template for our NIH grant. This article will summarize the steps I took and describe how to create a custom template using an available style on CTAN.
Note: All of the files in this tutorial can be downloaded here (.zip).
I have a serious love-hate relationship with Linux. I love the fact that it’s free and open source. I love the fact that it can breathe new life into old hardware. I love the fact that it’s easy to extend. I love the fact that it has a vibrant and passionate user community.
What I do not love is that many open source programs are incomplete. They can do most everything that you need, but never get around to adding the one or two features that prevent them from being finished, polished and exceptional. I’ve ranted about this before, back when I was trying to find the perfect backup program.
Well … I’m at it again; except this time, I’m looking for the perfect email program.
Imagine how awesome it would be if this announcement read: “Time Drive has been completely rewritten from scratch (yet again) to take better advantage of the paradigms of modern computing! Version 0.3 has hundreds of updates and new features which will make your life easier and more fulfilled!”
There’s just one little problem … such a hyperinflated announcement wouldn’t necessarily be true. (Marketing hyperbole, I never knew thee!) The truth is this: Time Drive is a simple backup program that does a good job of reliably backing up your data. It offers a nice list of potential backup options: to an attached hard drive, to a computer over the network, or across the internet. It makes it easy to search for and restore a lost file. In short, Time Drive seeks to change the world by making an act of computer maintenance more convenient.
But the real test of a program isn’t how well it works, but how easy it is to fix when broken. A good program does what you want, but a better program helps you get back on track when things go wrong. Back when I was looking at other backup programs available for Linux, this was my number one frustration. Most of the applications would work (for the most part), but I could never troubleshoot or repair problems when they happened. There just wasn’t enough information available.
For an example, let’s take SBackup. It’s a lovely little program, except you have no way of knowing if it is working. It doesn’t keep log files, it doesn’t notify you if a backup job failed. It doesn’t let you know if it is running. Its simplicity is actually symptomatic of a flaw: it’s incomplete.
These were problems that I desperately wanted to avoid with Time Drive. And version 0.3 includes a number of refinements that solve these issues while at the same time making it better, easier to use and more polished. In the rest of this post, I’ll explain why.
As much as I love Apple’s Time Machine, it’s a hard drive pig. If not carefully watched, the little porker will use every spare byte of free space it can. What is particularly obnoxious, however, is that you might not realize you have a problem until it is too late and your backup drive is filled to capacity.
Take my situation as an example. I have a single MacBook Pro notebook with a 250 GB hard drive. Most of my files are text based and on the smallish side. In comparison, my networked backup is a hefty 1.5 terabytes. The combination of small hard drive and large backup drive had me thoroughly convinced that I wouldn’t have to worry about freeing up space for years.
I was wrong.
Because of the size of the backup drive, I like to keep other files on it – mostly music and video files – so that I have a duplicate copy. But earlier this week, I got a nasty surprise while trying to add an album I had just downloaded from Amazon Mp3. The Mac informed me the backup drive was full.
As you might guess, I found this to be very confusing. How could the drive be full? Sure … I had three or four hundred gigabytes of music and video files on it, but there was no way that the Time Machine backup could be over a terabyte in size … Could it?
This situation didn’t smell right, so I decided to investigate. I mounted the backup drive, tracked down the Time Machine sparsebundle and confirmed the impossible. My Time Machine backup was taking up a whopping 1.15 terabytes of disk space. “How in the world could the backup be so large?”, I asked myself. “Time Machine is supposed to be an incremental system. 1.15 terabytes is big enough to hold every bit and byte on my computer four and a half times over!”
First, I got annoyed; then, I got angry. What really tipped the scale toward seething fury, however, was failing to find any straightforward way of getting the space back. Yet another spectacular example of Apple’s “simple over useful” approach to computer design!
After the first bout of obscenities, I came to a simple conclusion: I could publicly express my dissatisfaction with Apple’s product line or I could go about trying to find a solution. Publicly spouting off was unlikely to help much, so I opted for the latter option. What follows is a brief summary of what I learned.