Stata quick guide

Lane Kenworthy
January 2020

HELP

Getting help

Type help command (e.g., help regress); then Enter. If you don’t know the command name, type search topic (e.g., search regression).
Email: tech-support@stata.com. Put your Stata serial # in the subject line of the email.

STARTING AND EXITING STATA

To start

Double-click on the Stata icon or on a Stata dataset.

To quit

Stata menu –> Quit Stata.

UPDATING STATA

Type update all; then Enter.

SETTING THE “SCHEME”

Type set scheme s1mono, permanently; then Enter. To see other schemes: help scheme.

COMMANDS

Use commands rather than point-and-click. While nearly everything in Stata can be done via the menus, you’re better off typing commands into a word processing file and saving them, then copying and pasting them into the Stata “Command” window. That way you keep a record of what you did and can rerun it later.

How to enter commands in the “Command” window

Type the command or paste it in; then Enter.
To repeat an earlier command without having to retype it: Look at the list of your previous commands in the “History” window and click on the one you want; then Enter.

DATA FILES

Creating and opening data files

To create a new data file: Click on the “Data Editor” toolbar button.
To get into an existing data file: Double-click on the file. Or in Stata: click on the “Open” toolbar button (or File menu –> Open) and select the file.

Saving a data file

To save a data file: Exit the data editor by clicking on the X in the upper-left corner of the Stata screen, then File menu –> Save.
The extension for Stata data files is .dta.
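
Saving and reopening can also be done with commands. A minimal sketch, assuming Stata's bundled auto dataset as example data and a hypothetical file name (mydata.dta) written to the current working directory:

```stata
* Load an example dataset that ships with Stata
sysuse auto, clear

* Save it under a hypothetical name; replace overwrites an existing file
save mydata.dta, replace

* Reopen the saved file; clear discards whatever is in memory
use mydata.dta, clear
describe, short
```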

Entering data

Type it in manually: In a Stata data file, type the datum in the appropriate cell, then Enter.
Copy and paste from an Excel file: In Excel, highlight the cells you want; then Edit menu –> Copy; then in the Stata Data Editor, put the cursor in the upper-left cell and Edit menu –> Paste.

Listing data

list variablename1 variablename2 variablename3
The option clean leaves out table lines.
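
For example, using Stata's bundled auto dataset (an assumption made only for illustration), the following lists three variables for the first five cases. The in 1/5 range is an addition not covered above; drop it to list every case.

```stata
sysuse auto, clear
* Show three variables for cases 1 through 5, without table lines
list make price mpg in 1/5, clean
```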

Variable names and labels

Variable names: Click on the variable name. Variable names must begin with a letter; they can’t begin with a number. Stata is case sensitive: race is different from Race or RACE. Variable names can’t include a dash; use an underscore instead.
Variable labels: Click on the variable name.
Variable format: Click on the variable name. %8.2g indicates that the variable can stretch for 8 characters, 2 of which follow the decimal point. %8.0gc indicates that a comma will display.
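Value labels: Type label define race 1 "white" 2 "black" 3 "other" and then Enter. Then type label values race race and Enter. In the second command, the first race is the variable and the second is the name of the set of labels just defined. Use straight quotation marks; Stata does not accept curly ones.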
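
The labeling steps can also be typed as commands. A rough sketch, assuming the bundled auto dataset; the variable heavy and the label set heavy_lbl are hypothetical names invented for this example:

```stata
sysuse auto, clear
* Build a 0/1 variable: 1 if the car weighs more than 3,000 lbs
generate byte heavy = weight > 3000 if !missing(weight)

* Variable label (describes the variable)
label variable heavy "Weight above 3,000 lbs"

* Value labels (describe each code), then attach them to the variable
label define heavy_lbl 0 "light" 1 "heavy"
label values heavy heavy_lbl

tabulate heavy
```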

Creating a new variable

To create a new variable: generate variablename = operation (e.g., generate unionization_sqr = unionization^2). Once you execute this command, Stata creates the new variable and adds it to the data set. For any recomputations, use the replace command instead of generate.
To create a new variable that corresponds to observation numbers (i.e., 1, 2, 3, etc.): generate variablename=_n.
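
A short illustration of generate, replace, and _n, assuming the bundled auto dataset (the new variable names are made up for the example):

```stata
sysuse auto, clear
* New variable: weight squared
generate weight_sqr = weight^2

* Recompute the same variable in different units (thousands of lbs, squared)
replace weight_sqr = (weight/1000)^2

* Observation numbers 1, 2, 3, ...
generate obsnum = _n

list make weight weight_sqr obsnum in 1/3
```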

Recoding a variable

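To recode a variable: First make a copy of the variable: generate newvariablename = oldvariablename (e.g., generate race2 = race). Then recode race2 (3=0) (2=1) (1=2). Each rule goes in its own parentheses.
To recode a value into a missing value: recode race (3=.), where the character after the equals sign is a period.
An alternative to the recode command is the replace command: First generate income2=income. Then replace income2=1 if income<=10000. Then replace income2=2 if income>10000 & !missing(income). The !missing(income) condition is needed because Stata treats missing values as larger than any number, so income>10000 alone would also recode cases with missing income.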
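
Both approaches can be sketched on the bundled auto dataset (an assumption; rep3 and price2 are hypothetical names). The rep78 variable has some missing values, which makes it easy to confirm that they stay missing:

```stata
sysuse auto, clear
* Copy rep78 (values 1-5, some missing), then collapse it to three groups
generate rep3 = rep78
recode rep3 (1 2 = 1) (3 = 2) (4 5 = 3)

* Same idea with replace; the missing() check keeps missing values missing
generate price2 = price
replace price2 = 1 if price <= 5000
replace price2 = 2 if price > 5000 & !missing(price)

tabulate rep3, missing
tabulate price2, missing
```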

To change a string variable into a numeric variable

destring variablename, replace

To delete a variable

drop variablename

Ordering variables and cases

To reorder variables in the data file: order city stateabbr year population will put those variables, in the order listed, at the beginning of the data set. Options for the order command include alphabetic, before(), after(), first, last.
To reorder cases in the data file (in ascending order): sort city year.
To reorder cases in descending order: gsort -city.
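
For instance, with the bundled auto dataset (assumed here only as example data):

```stata
sysuse auto, clear
* Move three variables to the front of the dataset
order foreign make price

* Ascending sort: by origin, then by price within origin
sort foreign price
list foreign make price in 1/5

* Descending sort: most expensive cars first
gsort -price
list make price in 1/5
```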

OUTPUT

Pasting output into a word processing document

In the “Results” window, highlight the output you want to copy; then Edit menu –> Copy. In the word processing document, put the cursor where you want the results to appear; then Edit menu –> Paste.

Printing output

To print from the Results window: Highlight the output you want to print; then click on the “Print” toolbar button.

CONDITIONAL EXPRESSIONS, OPERATORS

Conditional expressions

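IF qualifier: command if expression. For example: regress income educ age agesq if race==2.
BY prefix: by variablename: command. For instance: by race: regress income educ age agesq. The data must already be sorted by that variable; bysort race: regress income educ age agesq sorts first and then runs the command for each group.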
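
A small sketch of both forms, assuming the bundled auto dataset, where foreign is coded 0 (domestic) and 1 (foreign):

```stata
sysuse auto, clear
* Regression for foreign cars only
regress price mpg weight if foreign==1

* Separate regression for each value of foreign;
* bysort sorts the data and then applies the by prefix
bysort foreign: regress price mpg weight
```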

Operators and functions

and: &
or: |
equals: ==
does not equal: ~=
greater than or equal to: >=
less than or equal to: <=
addition: +
subtraction: -
multiplication: *
division: /
to the power of: ^
square root: sqrt(variablename)

GRAPHS

Scatterplot

scatter yvariablename xvariablename
To add a regression line: scatter yvariablename xvariablename || lfit yvariablename xvariablename, connect(direct)
To add a loess curve: scatter yvariablename xvariablename || lowess yvariablename xvariablename, connect(direct)
To spread out data points that otherwise would lie on top of each other and thus be undecipherable: scatter yvariablename xvariablename, jitter(number). A good jitter number to start with is 7.
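
Putting these pieces together, here is one possible example using the bundled auto dataset (an assumption). The name(fit, replace) option is an optional extra that gives the graph in memory a name; "fit" is an arbitrary choice.

```stata
sysuse auto, clear
* Scatterplot with jittered points and a fitted regression line
scatter mpg weight, jitter(7) || lfit mpg weight, name(fit, replace)
```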

Line graph

scatter variablename year, connect(direct)

Dot plot

graph dot numericalvariablename, over(groupvariablename, sort(numericalvariablename))

Histogram

histogram variablename, percent bin(numberofdesiredbars). The percent option requests that percentages, rather than densities (the default), be displayed on the vertical axis. The bin option tells Stata the number of bars you want. For example: histogram income, percent bin(8).

Saving a graph in eps (encapsulated postscript) format

Right-click on the graph; choose “Save Graph …”; set the “Save as type:” field to .eps. Or use the command graph export filename.eps after the graph command.

Shading a section of a graph (usually a line graph)

twoway function y=60, range(1975 1985) bcolor(gs8) recast(area) || scatter yvariable xvariable
Here 60 is the largest value on the y axis and 1975-85 is the period on the x axis to be shaded.

MISSING VALUES

How to enter missing values

Stata’s recognized code for missing values is a period (.). Note, however, that Stata treats missing values as larger than any nonmissing value, so beware when using generate, replace, or an if condition with > or >=: a condition such as income>10000 is also true for cases where income is missing. It’s probably best to leave missing values blank.
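
The effect is easy to see in the bundled auto dataset (assumed here as example data), where rep78 is missing for a few cars:

```stata
sysuse auto, clear
* How many cars have a missing repair record?
count if missing(rep78)

* This count includes those cars, because . is treated as larger than 4
count if rep78 > 4

* This count excludes them
count if rep78 > 4 & !missing(rep78)
```

The second count should exceed the third by exactly the number of missing cases.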

STATISTICAL ANALYSIS

Coefficient of variation

summarize variablename. Then display r(sd) / r(mean).
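
For example, with the price variable in the bundled auto dataset (an assumption for illustration):

```stata
sysuse auto, clear
* summarize stores the mean and standard deviation in r(mean) and r(sd)
summarize price

* Coefficient of variation as a ratio, then as a percentage
display r(sd) / r(mean)
display 100 * r(sd) / r(mean)
```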

Correlation

correlate variablename1 variablename2 variablename3
For pairwise deletion: pwcorr variablename1 variablename2 variablename3.

Crosstabs

tabulate rowvariablename columnvariablename, column. The column option requests column percentages. Usually with crosstabs we put the (presumed) y variable as the row variable and the x variable as the column variable, and then examine the column percentages.
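
A quick illustration, assuming the bundled auto dataset, with repair record (rep78) as the row variable and origin (foreign) as the column variable:

```stata
sysuse auto, clear
* Column percentages: distribution of rep78 within each origin group
tabulate rep78 foreign, column
```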

Descriptive statistics

summarize variablename. Shows mean, standard deviation, smallest value, largest value.
summarize variablename, detail. In addition to mean etc., this shows the median (50th percentile), some other percentiles (1st, 5th, 10th, 25th, 75th, 90th, 95th, 99th), variance, skewness statistic, kurtosis statistic.

Frequency distribution

tab1 variablename1 variablename2 variablename3

Regression (OLS)

regress yvariablename x1variablename x2variablename
The option beta shows standardized coefficients.
The option robust adds heteroskedasticity-consistent standard errors.
The option level(80) shows 80% confidence intervals for coefficients (default is 95%).
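
The options might be used as follows, again assuming the bundled auto dataset:

```stata
sysuse auto, clear
* OLS with standardized (beta) coefficients
regress price mpg weight, beta

* OLS with robust standard errors and 80% confidence intervals
regress price mpg weight, robust level(80)
```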

Skewness statistic

summarize variablename, detail

Z-score (standardized score)

summarize variablename. Then generate variablename_zscore = (variablename - r(mean)) / r(sd).
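
As a check, the standardized variable should have a mean of roughly 0 and a standard deviation of roughly 1. A sketch with the bundled auto dataset (an assumption):

```stata
sysuse auto, clear
summarize price
generate price_zscore = (price - r(mean)) / r(sd)

* Mean should be close to 0 and standard deviation close to 1
summarize price_zscore
```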

MISCELLANEOUS

Error messages

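“log file already open”, r(604): Type log close, then Enter.
“no; data in memory would be lost”, r(4): Save the data file before executing this command. Or add the clear option to the command (e.g., use filename, clear).
“no room to add more observations”, r(901): In Stata 11 and earlier, try typing set memory 32m, then Enter. Stata 12 and later allocate memory automatically and ignore this command.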

Growth rate calculation using a starting-value variable and an ending-value variable (non-pooled data set)

generate growthvariable = (((endvaluevariable/startvaluevariable)^(1/numberofperiods))-1)*100
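
A worked example with made-up numbers (the variable names and values are hypothetical): each case grows by a factor of 1.21 over two periods, which works out to about 10 percent per period.

```stata
clear
input startvalue endvalue
100 121
200 242
end

* Two periods between the starting and ending values
generate growth = (((endvalue/startvalue)^(1/2)) - 1) * 100
list
```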

Interpolate missing values

ipolate variablename year if year>=1933, gen(newvariablename)
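
A tiny made-up series (hypothetical values) shows what the command does: the missing 1934 value is filled in by linear interpolation between its neighbors.

```stata
clear
input year value
1933 10
1934 .
1935 20
end

* value_ip copies value and fills the 1934 gap with the midpoint, 15
ipolate value year if year>=1933, gen(value_ip)
list
```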

Lagged variables

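l.variablename is a one-period lag, which is a 1-year lag in annual data (e.g., l.unionden); l2.variablename is a 2-period lag, etc. (The “l” is a lower-case letter L, not the number one.) Before using lags, declare the time variable with tsset year (or tsset country year in a pooled data set, where country must be a numeric variable).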
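
A minimal sketch with a made-up annual series (the values are hypothetical). Lag operators need the time variable to be declared first, which is done here with tsset:

```stata
clear
input year unionden
2000 30
2001 32
2002 31
end

* Declare year as the time variable so that lag operators work
tsset year

* One-year lag; missing for the first year because there is no earlier value
generate unionden_lag1 = l.unionden
list
```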

Notes to self within commands

Begin the command line with an asterisk (*).
For a note in the middle of a command line, use /* note */. This form works in do-files but not in commands typed or pasted into the “Command” window.

Period average calculation (for growth, employment, etc.) in a pooled data set

bysort country: summarize growth if year>=1979 & year<=2007
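
To illustrate with a small made-up pooled data set (hypothetical countries and values): the 1978 observation falls outside the period, so country A's average is based on 1979 and 1980 only.

```stata
clear
input str2 country year growth
"A" 1978 9
"A" 1979 2
"A" 1980 4
"B" 1979 1
"B" 1980 3
end

* Period average for each country, 1979-2007
bysort country: summarize growth if year>=1979 & year<=2007
```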