# Stata: Data Analysis and Statistical Software

Available as standard software

*by Martin Ejnar Hansen, PhD (Fakultätszentrum für Methoden der Sozialwissenschaften) (Issue 10/1, March 2010)*

*We thank the author for providing this article in English.*

This short introduction to Stata explains the interface, offers a few tips on using the program and lists some websites that are useful for further learning.

Departments and students at the University of Vienna now have access to a new piece of statistical software, Stata. Stata, in its current version 11, provides its users with a standard package well equipped for most statistical analyses, as well as the opportunity to write their own commands. Stata is a statistical package like SAS, SPSS or EViews. It is a command-line-driven program which operates in a windowed environment. For users who prefer point-and-click, Stata also contains such an interface.

However, a word of caution is needed: Stata is just a helpful tool and cannot replace the theoretical considerations which guide empirical research. With that in mind we are ready to start using the powerful tool that Stata is.

Stata has great strengths in data manipulation. Data can be moved from external sources, such as spreadsheets, directly into the program. Cleaning the data and generating new variables is very easy. On the statistical side Stata includes all standard uni-, bi- and multivariate statistical tools, ranging from descriptive statistics and principal component analysis to regression analysis. More specialized statistics are implemented in Stata as well. Time-series models, for example ARCH and VAR, can each be estimated with a single command. In several categories a common command prefix groups the leading techniques:

- "xt" for cross-section/cross-time or panel data (longitudinal data)
- "svy" for survey data with complex sampling designs
- "st" for survival time data in duration models
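The three prefixes follow the same pattern: first declare the structure of the data, then use the commands of that family. The following do-file lines are a minimal sketch of that pattern, not part of the original article. The example datasets are assumptions: xtline1 and cancer are taken to be installed with Stata, and nhanes2 (with its design variables psu, strata and finalwgt) is taken to be downloadable with webuse, which needs an internet connection.

```stata
* Panel data: declare the panel structure, then use xt commands
sysuse xtline1, clear            // assumed example dataset shipped with Stata
xtset person day                 // panel identifier and time variable
xtsum calories                   // overall, between and within summary statistics

* Survival-time data: declare duration and failure, then use st commands
sysuse cancer, clear             // assumed example dataset shipped with Stata
stset studytime, failure(died)   // analysis time and failure indicator
stcox age drug                   // Cox proportional hazards model

* Survey data: declare the sampling design, then prefix estimation with svy:
webuse nhanes2, clear            // assumed example dataset from the Stata website
svyset psu [pweight=finalwgt], strata(strata)
svy: mean height weight          // design-based means
```

The // comments and the layout assume the lines are run from a do-file rather than typed one by one in the Command window.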

## How does it work?

The screen that users meet the first time they open Stata can be seen in figure 1. Four separate windows are present: the Result window, the Review window, the Variable window and the Command window.

### 1) Variable window

This window lists all the variables in the dataset. Next to the name of each variable we can see its label, in case labels are used, as well as its type and format.

### 2) Command window

If point-and-click is cast aside, the Command window is where the commands are given. Whether it is a command to load a dataset, to recode a variable or to run a regression, it can all be done in this window. Note that when a command is entered and the Enter key is hit, the Command window empties. This does not mean that we cannot see which command we actually entered; it is simply shown in another window.

### 3) Review window

The Review window lists all commands called in the Command window or by point-and-click. This is extremely helpful when, after a number of long commands have been run, we need to refer back to a previously called command.

### 4) Result window

When commands are called they also show up in the Result window. If the command includes a statistical request, for instance a regression, the result shows up in this window.

## Why use the command line?

Using the command line might seem to be a lot of trouble. However, there are good reasons for sticking to this approach over the point-and-click approach. First of all, the command line helps replication. In scientific research it is necessary to be able to replicate the results. In the ideal situation anyone with access to your commands and data should be able to replicate the results. In programs where all actions are point-and-click it is impossible to say how a certain result was reached unless every step of manipulating the data can be followed. This problem does not arise with the command-line approach. Keeping the commands also makes it very easy to run alternative analyses, for instance adding a variable or choosing a different estimation technique. We simply go back, find the earlier specification and change it, instead of trying to remember what happened and clicking around in the hope that the memory is right.

### Do-files and log-files

When using Stata for complex analyses, typing in each command or using the point-and-click strategy soon becomes impractical. Luckily there are tools in Stata which can help. The first is the **do-file**, which is a collection of all the commands used for the analysis of the dataset in question. Instead of typing each command in the Command window we type them all in the do-file and, once all the commands needed are included, run the do-file. If the commands are in order, Stata runs through all of them, showing the commands and their results in the Result window.
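As an illustration, a very small do-file could look like the following sketch. The file name analysis.do and the choice of variables are hypothetical, and the auto dataset is assumed to be available as one of the example datasets installed with Stata.

```stata
* analysis.do  (hypothetical file name)
version 11                    // interpret the commands as Stata 11 would
clear
sysuse auto                   // assumed example dataset shipped with Stata
describe                      // overview of the variables
summarize price mpg weight    // descriptive statistics
regress price mpg weight      // linear regression
```

Such a file is run by typing do analysis.do in the Command window, provided the file is in the working directory.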

The other very useful tool is the **log**. While it may be preached that we as quantitative empirical researchers should keep a research log of all the decisions we make, it is often forgotten. Here Stata can help. With the command log using followed by a filename we ask Stata to keep all our commands and results in a separate file. After the filename we can add a comma and an option: append adds the new output to an already existing log, whereas replace overwrites it. The format is chosen in the same way: the option text produces a plain text file, while without it Stata normally writes its own format, the smcl-file. The log is closed either when Stata is closed or with the command log close. Unless a path is given, logs are saved in the working directory and can be opened at any time, either through Stata or, for text logs, in a text editor.
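A short sketch of how these options combine is given below. The file name mylog.txt is hypothetical, the auto dataset is assumed to be installed with Stata, and the lines assume that no other log is open when they are run.

```stata
* Start a plain-text log, overwriting any earlier file of the same name
log using mylog.txt, text replace   // mylog.txt is a hypothetical file name
sysuse auto, clear                  // assumed example dataset
summarize price
log close

* Later: reopen the same log and add further output to it
log using mylog.txt, text append
tabulate foreign
log close
```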

## Estimation

Through the command line we can estimate all sorts of models. We only have to know the command for the particular model we wish to estimate; what follows the command is uniform. Take for instance a multiple linear regression, which is called by the command reg. The dependent variable is the first variable name listed after the command, with the independent variables following, for example: reg y x1 x2 x3. If the model turns out to have been mis-specified, changing it is just a matter of changing the command, for example: logit y x1 x2 x3 will estimate the model as a logistic regression. The logic of the command language is coherent and learning it takes little time; no knowledge of a specific programming language is needed.
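The uniform syntax can be tried out with real variables. The sketch below assumes the auto example dataset installed with Stata; the variables chosen are for illustration only and are not taken from the article.

```stata
sysuse auto, clear            // assumed example dataset shipped with Stata

* Linear regression: dependent variable first, then the independent variables
regress price mpg weight      // reg is an accepted abbreviation of regress

* A binary dependent variable: only the command name changes
logit foreign mpg weight
predict p_foreign             // fitted probabilities from the logit model
```

After either model, predict stores fitted values in a new variable, here under the hypothetical name p_foreign.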

## The data editor

Sometimes we need to inspect the raw data or to edit it directly. This can be done with the command edit, typed in the command line, which opens the data editor shown in figure 2 below.

## Other features

One of the features in which Stata stands out from other statistical software is the **graphical component**. The graphics part of Stata can produce high-quality, publication-ready graphics. Every little aspect of a graph can be customized to fit the particular needs of the user, which makes the package stand out in relation to its competitors.
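To give an impression of how graphs are customized, the following sketch overlays a scatter plot with a fitted line and sets the titles and the legend. It assumes the auto example dataset installed with Stata; the titles and the file name price_mpg.eps are hypothetical.

```stata
sysuse auto, clear                     // assumed example dataset
twoway (scatter price mpg) (lfit price mpg), ///
    title("Price and mileage") ///
    xtitle("Mileage (mpg)") ytitle("Price") ///
    legend(order(1 "Observed" 2 "Linear fit"))
graph export price_mpg.eps, replace    // hypothetical file name
```

The /// at the end of a line continues the command on the next line, which works in a do-file but not in the Command window.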

Since version 9 Stata has a **matrix programming language**, Mata, which can be used like MATLAB or GAUSS. Mata can be used on its own, or Mata functions can be written to be called from within Stata. Mata provides the users with a large library of matrix and mathematical functions, such as equation solvers.
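As a small illustration of an equation solver, the sketch below solves a linear system in Mata with the built-in function lusolve(). The matrix and vector are made-up numbers for illustration only.

```stata
mata:
A = (2, 1 \ 1, 3)      // 2 x 2 coefficient matrix, rows separated by \
b = (3 \ 5)            // right-hand side as a column vector
x = lusolve(A, b)      // solve A*x = b by LU decomposition
x                      // display the solution
A * x                  // check: should reproduce b up to rounding
end
```

The block is entered with mata: and left with end, so it can sit inside an ordinary do-file.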

Users of Stata around the world create routines and add-ons which can help estimate very specific models or be used as diagnostics. It is also possible for users to write their own commands in case the existing ones are not sufficient. This is done by writing **ado-files**, which are added to the command structure and can then be called by the name the user has given to the particular file.
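A minimal sketch of such a user-written command follows. The command name mymean is hypothetical and only wraps an existing command; it is meant to show the structure of an ado-file, not a useful addition to Stata.

```stata
*! mymean.ado  (hypothetical command, for illustration only)
program define mymean
    version 11
    syntax varname(numeric)          // accept exactly one numeric variable
    quietly summarize `varlist'
    display as text "Mean of `varlist': " as result r(mean)
end
```

Saved as mymean.ado in one of the directories listed by the command sysdir (for example the PERSONAL directory), it can then be called like any other command, for instance mymean price.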

Overall, Stata is recommended for all researchers and students dealing with quantitative data. The program is available for both Windows and Mac OS X. All documentation can be found online as well as in the Stata help function. The Stata file format .dta is compatible between platforms, so Windows users can exchange files with Macintosh or Linux users without problems.

## Useful links

- **Learn Stata online:** www.ats.ucla.edu/stat/stata/
- **Follow the latest developments in Stata:** www.stata-journal.com
- **To find answers to questions or to post your own, see the Statalist listserv:** www.stata.com

## Some useful Stata commands

- **help**: online help on a specific command
- **log**: log output to an external file
- **clear**: clear memory
- **quietly**: do not show the results of a command
- **exit**: exit the program (add clear if the dataset is not saved)
- **gen**: create a new variable
- **replace**: modify an existing variable
- **rename**: rename a variable
- **renvars**: rename a set of variables
- **sort**: change the sort order of the dataset
- **drop**: drop certain variables and/or observations
- **keep**: keep only certain variables and/or observations
- **append**: combine datasets by stacking
- **merge**: merge datasets (one-to-one or match merge)
- **encode**: generate a numeric variable from a categorical variable
- **recode**: recode a categorical variable
- **destring**: convert string variables to numeric
- **describe**: describe a dataset or the current contents of memory
- **use**: load a Stata dataset
- **save**: write the contents of memory to a Stata dataset
- **insheet**: load a text file in tab- or comma-delimited format
- **tab**: abbreviation for tabulate: 1- and 2-way tables
- **table**: tables of summary statistics
- **sum**: descriptive statistics
- **corr**: correlation matrices
- **ttest**: perform 1-sample, 2-sample and paired t-tests
- **anova**: 1-, 2-, n-way analysis of variance
- **reg**: least squares regression
- **predict**: generate fitted values, residuals, etc.
- **logit, logistic**: logit model, logistic regression
- **probit**: binomial probit model
- **ologit, oprobit**: ordered logit and probit models
- **mlogit**: multinomial logit model
- **poisson**: Poisson regression
- **arima**: Box-Jenkins models, regressions with ARMA errors
- **arch**: models of autoregressive conditional heteroskedasticity
- **var**: vector autoregressions (basic and structural)
- **xtreg, fe**: fixed effects estimator
- **xtreg, re**: random effects estimator
- **xtlogit**: panel-data logit models
- **xtprobit**: panel-data probit models
- **xtmixed**: linear mixed (multi-level) models
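Several of the data management commands listed above can be combined in a few lines. The following sketch assumes the auto example dataset installed with Stata; the new variable names price_k and repair are hypothetical.

```stata
sysuse auto, clear                       // assumed example dataset
describe                                 // variables, types and labels
gen price_k = price / 1000               // create a new variable
replace price_k = round(price_k, 0.1)    // modify the existing variable
rename rep78 repair                      // rename a variable
sort price                               // change the sort order
keep make price price_k mpg repair foreign
tab foreign                              // one-way table
sum price_k mpg                          // descriptive statistics
```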