Stata: Data Analysis and Statistical Software
Available as Standard Software

by Martin Ejnar Hansen, PhD (Fakultätszentrum für Methoden der Sozialwissenschaften)
(Issue 10/1, March 2010)

We thank the author for providing this article in English.

This short introduction explains the Stata interface, offers a few tips for using Stata, and lists some websites that are useful for further learning.

Departments and students at the University of Vienna now have access to a new piece of statistical software, Stata. In its current version 11, Stata provides a standard package that is well equipped for most statistical analyses and also lets users write their own commands. Stata is a statistical package like SAS, SPSS or EViews. It is a command-line-driven program that operates in a windowed environment. For users who prefer point-and-click, Stata also offers a menu-driven interface.

However, a word of caution is needed: Stata is just a helpful tool and cannot replace the theoretical considerations that guide empirical research. With that in mind, we are ready to start using this powerful tool.

Stata has great strengths in data manipulation. Data can be moved from external sources, such as spreadsheets, directly into the program. Cleaning the data and generating new variables is very easy. On the statistical side, Stata includes all standard uni-, bi-, and multivariate statistical tools, ranging from descriptive statistics and principal component analysis to regression analysis. More specialized methods are implemented as well: time-series models such as ARCH and VAR, for example, can each be estimated with a single command. For several types of data, a family of commands sharing a common prefix provides the leading techniques:

  • "xt" for cross-section/cross-time or panel data (longitudinal data)
  • "svy" for survey data with complex sampling designs
  • "st" for survival time data in duration models
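
The prefix families follow a common pattern: first declare the structure of the data, then call the estimation commands of that family. The following is a minimal sketch for "xt" and "st" only, not a recommended analysis. The datasets are assumptions chosen for illustration: nlswork is downloaded from the Stata website with webuse (an internet connection is needed), and cancer is one of the example datasets shipped with Stata.

```stata
* Panel data: declare panel and time variables, then use an xt command
* (nlswork is an example dataset, assumed here for illustration)
webuse nlswork, clear
xtset idcode year
xtreg ln_wage age tenure, fe

* Survival data: declare time and failure variables, then use an st command
* (cancer is an example dataset shipped with Stata)
sysuse cancer, clear
stset studytime, failure(died)
stcox age drug
```

The "svy" family works analogously, with svyset declaring the sampling design before any svy estimation command is run.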

How does it work?

The screen that users see the first time they open Stata is shown in figure 1. Four separate windows are present: the Result window, the Review window, the Variable window and the Command window.

1) Variable window

This window lists all the variables in the dataset. Next to each variable's name we can see its label, if labels are used, as well as its type and format.

2) Command window

If point-and-click is set aside, the Command window is where commands are entered. Whether it is a command to load a dataset, to recode a variable or to run a regression, it can all be done in this window. Note that when a command is typed and the Enter key is pressed, the Command window empties. This does not mean that we can no longer see the command we entered; it simply appears in another window.

3) Review window

The Review window lists all commands issued from the Command window or through point-and-click. This is extremely helpful when, after running a number of long commands, we need to refer back to an earlier one.

4) Result window

Commands that have been run are also echoed in the Result window. If a command produces statistical output, for instance a regression, the results are shown in this window.

Why use the command line?

Using the command line might seem like a lot of trouble. However, there are good reasons for preferring it to the point-and-click approach. First of all, the command line helps replication. In scientific research it is necessary to be able to replicate results. Ideally, anyone with access to your commands and data should be able to reproduce them. In programs where all actions are point-and-click, it is impossible to say how a certain result was reached unless every step of the data manipulation can be retraced. This problem does not arise with the command-line approach. A record of commands also makes it very easy to run alternative analyses, for instance adding a variable or choosing a different estimation technique. One simply goes back to the earlier specification and changes it, instead of trying to remember what happened and clicking around in the hope that the memory is right.

Do-files and log-files

When Stata is used for complex analyses, typing in each command or using the point-and-click strategy soon becomes impractical. Luckily, there are tools in Stata that can help. The first is the do-file, which is a collection of all the commands used for the analysis of the dataset in question. Instead of typing each command in the Command window, we type them all into the do-file and, once all the necessary commands are included, run the do-file. If the commands are in order, the do-file will run through all of them, displaying the commands and their results in the Result window.
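
As an illustration, a do-file is simply a plain-text file of commands. The sketch below assumes a hypothetical file name, analysis.do, and uses the auto example dataset shipped with Stata in place of real research data.

```stata
* analysis.do (hypothetical file name, for illustration only)
version 11
clear

* Load example data shipped with Stata (stands in for your own dataset)
sysuse auto

* Descriptive statistics, then a regression
summarize price mpg weight
regress price mpg weight
```

With the file saved in the working directory, typing do analysis in the Command window runs all of these commands in order.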

The other very useful tool is the log. Although it is often preached that we as quantitative empirical researchers should keep a research log of all the decisions we make, this is often forgotten. Here Stata can help. With the command log using filename we ask Stata to record all our commands and results in a separate file (filename). After the filename we can add a comma followed by an option: append adds to an already existing log, while replace overwrites it. The format of the log can be chosen in the same way: the option text produces a plain-text file, while the default, if nothing is added, is a smcl-file. The log is closed either when Stata is closed or with the command log close. Unless a path is given, logs are saved in the working directory, and they can be opened at any time either through Stata or, in the case of text logs, with a text editor.
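
The log commands described above might be used as follows. The log name mysession is a hypothetical choice, and the auto example dataset shipped with Stata is assumed only to give the log something to record.

```stata
* Start a plain-text log; replace overwrites an existing log of that name
log using mysession, text replace

* Everything from here on is recorded (example data shipped with Stata)
sysuse auto, clear
summarize price

* Stop recording
log close

* In a later session: continue the same log instead of overwriting it
log using mysession, text append
describe
log close
```

With the text option the file is written as mysession.log in the working directory.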

Estimation

Through the command line we can estimate all sorts of models. We only need to know the command for the particular model we wish to estimate; the syntax that follows the command is uniform across models. Take for instance a multiple linear regression, which is called with the command reg. The dependent variable is the first variable name listed after the command, with the independent variables following, for example: reg y x1 x2 x3. If the model turns out to be mis-specified, changing it is just a matter of changing the command; for example, logit y x1 x2 x3 estimates the same model as a logistic regression. The logic of the command language is coherent and learning it takes little time; no knowledge of a general programming language is needed.
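
The uniform syntax can be tried out on the auto example dataset shipped with Stata. The variables below are assumptions chosen purely for illustration; because logit needs a binary outcome, the dependent variable differs between the two models.

```stata
* Example data shipped with Stata (stands in for y, x1, x2, x3)
sysuse auto, clear

* Linear regression: dependent variable first, then the regressors
regress price mpg weight

* Same syntax, different estimator: logistic regression on a binary outcome
logit foreign mpg weight

* Predicted probabilities from the most recently estimated model
predict p_foreign
```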

The data editor

Occasionally it is necessary to inspect the raw data or to edit them directly. This can be done with the command edit, entered in the Command window, which opens the data editor shown in figure 2 below.

Figure 2: Data editor
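
Besides edit, two related commands are handy for looking at raw data. The sketch below again assumes the auto example dataset shipped with Stata.

```stata
* Example data shipped with Stata
sysuse auto, clear

* Open the data editor in read-only mode, so nothing is changed by accident
browse

* Open the data editor with editing enabled
edit

* Alternatively, print the first five observations in the Result window
list make price mpg in 1/5
```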

Other features

One of the features in which Stata is superior to other statistical software is its graphical component. The graphics part of Stata can produce high-quality, publication-ready graphics. Every little aspect of a graph can be customized to fit the particular needs of the user, which makes the package stand out from its competitors.
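
As a small illustration of this customization, the sketch below overlays a scatter plot and a fitted line and exports the result. The dataset (auto, shipped with Stata), the titles and the output file name mpg_weight.eps are all assumptions made for the example. Because the command is continued across lines with ///, it is meant to be run from a do-file.

```stata
* Example data shipped with Stata
sysuse auto, clear

* Scatter plot with a linear fit, custom titles and no legend
twoway (scatter mpg weight) (lfit mpg weight), ///
    title("Mileage and weight") ///
    xtitle("Weight (lbs.)") ytitle("Mileage (mpg)") ///
    legend(off)

* Save the graph in a format suitable for publication
graph export mpg_weight.eps, replace
```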

Since version 9, Stata has included a matrix programming language, Mata, which can be used like MATLAB or GAUSS. Mata can be used on its own, or Mata routines can be written to be called from within Stata. Mata provides users with a large library of matrix and mathematical functions, such as equation solvers.
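
To give a flavour of Mata, the following sketch solves a small linear system with the built-in solver lusolve(). The matrix and vector are made-up numbers for illustration.

```stata
mata:
// Solve A*x = b for x
A = (2, 1 \ 1, 3)
b = (3 \ 5)
x = lusolve(A, b)
x
end
```

For this system the solution is x = (0.8, 1.4). The backslash separates rows and the comma separates columns when a matrix is typed in.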

Stata users around the world create routines and add-ons that can help to estimate very specific models or serve as diagnostics. Users can also write their own commands if the existing ones do not meet their needs. This is done with ado-files, which are added to the command structure and can then be called by the name the user has given to the file.
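
A user-written command can be very short. The sketch below defines a hypothetical command, showrange, that displays the range of a numeric variable; the name and its purpose are invented for illustration. It assumes the file is saved as showrange.ado in the working directory or in the personal ado directory.

```stata
* showrange.ado (hypothetical command, for illustration only)
program define showrange
    version 11
    * Accept exactly one numeric variable as argument
    syntax varname(numeric)
    * summarize leaves the minimum and maximum in r(min) and r(max)
    quietly summarize `varlist'
    display as text "Range of `varlist': " as result r(max) - r(min)
end
```

Once the file is in place, the command is called like any other, for example showrange price after loading a dataset that contains a variable named price.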

Overall, Stata is recommended for all researchers and students dealing with quantitative data. The program is available for both Windows and Mac OS X. All documentation can be found online as well as in the Stata help function. The Stata file format .dta is compatible across platforms, so Windows users can collaborate with Macintosh or Linux users without difficulty.

Useful links

Learn Stata online: www.ats.ucla.edu/stat/stata/
Follow the latest developments in Stata: www.stata-journal.com
To find answers to questions or to post your own, see the Statalist listserv: www.stata.com

Some useful Stata commands

  • help: online help on a specific command
  • log: log output to an external file
  • clear: clear memory
  • quietly: do not show the results of a command
  • exit: exit the program (add the option clear if the dataset has not been saved)
  • gen: create a new variable
  • replace: modify an existing variable
  • rename: rename a variable
  • renvars: rename a set of variables (user-written add-on)
  • sort: change the sort order of the dataset
  • drop: drop certain variables and/or observations
  • keep: keep only certain variables and/or observations
  • append: combine datasets by stacking
  • merge: merge datasets (one-to-one or match merge)
  • encode: generate a numeric variable from a string variable
  • recode: recode a categorical variable
  • destring: convert string variables to numeric
  • describe: describe a dataset or the current contents of memory
  • use: load a Stata dataset
  • save: write the contents of memory to a Stata dataset
  • insheet: load a text file in tab- or comma-delimited format
  • tab: abbreviation for tabulate, 1- and 2-way tables
  • table: tables of summary statistics
  • sum: descriptive statistics
  • corr: correlation matrices
  • ttest: perform 1-sample, 2-sample and paired t-tests
  • anova: 1-, 2- and n-way analysis of variance
  • reg: least squares regression
  • predict: generate fitted values, residuals, etc.
  • logit, logistic: logit model, logistic regression
  • probit: binomial probit model
  • ologit, oprobit: ordered logit and probit models
  • mlogit: multinomial logit model
  • poisson: Poisson regression
  • arima: Box-Jenkins models, regressions with ARMA errors
  • arch: models of autoregressive conditional heteroskedasticity
  • var: vector autoregressions (basic and structural)
  • xtreg, fe: fixed effects estimator
  • xtreg, re: random effects estimator
  • xtlogit: panel-data logit models
  • xtprobit: panel-data probit models
  • xtmixed: linear mixed (multi-level) models
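
Several of the data management commands listed above can be combined into a short session. The sketch below assumes the auto example dataset shipped with Stata; the new variable names weight_kg, repair and repair3 are invented for illustration.

```stata
* Load example data shipped with Stata and look at its contents
sysuse auto, clear
describe

* Create a new variable (weight is recorded in pounds in this dataset)
generate weight_kg = weight * 0.4536

* Rename a variable, then collapse its five categories into three
rename rep78 repair
recode repair (1 2 = 1) (3 = 2) (4 5 = 3), generate(repair3)

* Sort, keep only the variables of interest, and summarize
sort price
keep make price mpg weight_kg repair3 foreign
summarize price mpg weight_kg
tabulate foreign repair3
```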