Chapter 1-8. Operators, Ifs, Dates, and Times: If Expressions
In this section we are going to become expert at if expressions, as well as using operators in
generate commands.
Operators
In Stata, there are four classes of operators: arithmetic, string, relational, and logical.
Arithmetic Operators
+ (addition)
- (subtraction)
* (multiplication)
/ (division)
^ (raised to a power, or exponentiation)
- (negation)
If the arithmetic operation includes a missing value or impossible operation (such as division by
zero), the operation produces a missing value.
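As a quick check of the rule above, the following two display commands can be typed in the command window. This is only an illustrative sketch; the numbers are arbitrary.

```stata
* division by zero is an impossible operation, so the result is missing (.)
display 1/0
* any arithmetic that involves a missing value is also missing (.)
display 2 + .
```

Both commands display a single period, which is how Stata shows a missing value.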
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Arithmetic operators are evaluated in the following order:
1st: ^ (exponentiation)
2nd: - (negation)
3rd: * , / (multiplication or division)
4th: - , + (subtraction or addition)
Operators at the same level are evaluated from left to right. This order can be modified with
parentheses ( ), where the operation in parentheses is done first.
display (1*2)/(3*4)
display 1*2/3*4
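To see the order of evaluation at work, a few more display commands may help. This is a small sketch with arbitrary numbers; the comment lines state what each command should show.

```stata
* exponentiation is done before negation, so this is -(2^2) = -4
display -2^2
* parentheses force the negation first, so this is 4
display (-2)^2
* multiplication is done before addition, so this is 2+(3*4) = 14
display 2+3*4
* same-level operators go left to right: ((1*2)/3)*4, approximately 2.67
display 1*2/3*4
* parentheses change the grouping: 2/12, approximately 0.167
display (1*2)/(3*4)
```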
“In the constant infusion methods, urine and plasma samples are taken at a time when the
rate of excretion equals the rate of infusion so that clearance (C) is given by
C = (U × V) / (P × T)
where V is the volume of urine produced during time T, and U and P are the measured
concentrations of activity in urine and plasma respectively.”
clear
input u v p t
2 4 4 2
2 2 1 4
end
list
Write a generate (gen) statement to compute the variable c, and then list the data with the
new variable computed. The solution should be:
+-------------------+
| u v p t c |
|-------------------|
1. | 2 4 4 2 1 |
2. | 2 2 1 4 1 |
+-------------------+
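One possible solution is sketched below. It assumes the clearance formula is C = (U × V)/(P × T), which is consistent with the listing above (both rows give c = 1), but it is not necessarily the command used in the original course materials.

```stata
clear
input u v p t
2 4 4 2
2 2 1 4
end
* clearance: urine concentration times urine volume,
* divided by plasma concentration times collection time
gen c = (u*v)/(p*t)
list
```

The parentheses around the denominator matter: without them, u*v/p*t would be evaluated left to right as ((u*v)/p)*t.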
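The variable volume is computed as
volume = [TBW × (1 − (Na1 + 23.8)/(Na2 + 23.8)) + Vinput − (Einput × Vinput)/Eurine] / (Einf/Eurine − 1)
Using the following data for the variables TBW, Na1, Na2, Vinput, Einput,
Eurine, and Einf (note: these are just made-up, unrealistic values),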
clear
input tbw na1 na2 vinput einput eurine einf
25 33 49 50 100 5 44
end
list
write a generate (gen) statement to compute the variable volume, and then list the data
with the new variable computed.
If Stata returns an error about unbalanced parentheses, it is because you must have the same
number of left and right parentheses.
The generate command that gives this result is given on the last page of this chapter, if
you cannot get it to work.
String Operators
+ (concatenation)
If the + occurs between two strings, Stata concatenates them. If + appears between two numeric
values, Stata adds them.
Example
When string variables are read with the input command, put “str#” in front of each variable
name, where # is the maximum number of characters.
clear
input str4 a str4 b
this that
here now
end
gen both=a+b
list
+------------------------+
| a b both |
|------------------------|
1. | this that thisthat |
2. | here now herenow |
+------------------------+
+------------------------------------+
| a b both both2 |
|------------------------------------|
1. | this that thisthat this that |
2. | here now herenow here now |
+------------------------------------+
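The command that created both2 does not appear above. A sketch that reproduces the second listing, assuming both2 was built by concatenating a literal space between the two strings, is:

```stata
clear
input str4 a str4 b
this that
here now
end
* + between two strings concatenates them
gen both = a+b
* a quoted space is itself a string, so it can be concatenated too
gen both2 = a+" "+b
list
```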
Relational Operators
> (greater than)
< (less than)
>= (greater than or equal to)
<= (less than or equal to)
== (equal to)
!= or ~= (not equal to)
Note: Stata does not understand “=>” or “=<”, which are not the standard way to write these
operations. You write them just as you learned to say them in elementary school. That is, you
say “greater than or equal to”, not “equal to or greater than”. You say “less than or equal
to”, not “equal to or less than”.
It is natural to think of relational operators as evaluating to true or false. They actually evaluate
to numbers (1 = true, 0 = false).
Example
display 5>4
display 5<4
. display 5>4
1
. display 5<4
0
+-----------------------+
| male age underage |
|-----------------------|
1. | 1 20 1 |
2. | 0 18 1 |
3. | 1 25 0 |
4. | 0 26 0 |
+-----------------------+
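The command that created the underage variable is not shown above. Because a relational expression evaluates to 1 or 0, a listing like this one can be reproduced with a single generate line. The sketch below assumes underage means age less than 21, which matches the values in the listing.

```stata
clear
input male age
1 20
0 18
1 25
0 26
end
* age<21 evaluates to 1 (true) or 0 (false) for each observation
gen underage = age<21
list
```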
Logical Operators
& (and)
| (or)
! or ~ (not)
The logical operators interpret any nonzero value (including missing) as true and zero as false.
Example
clear
input male age
1 20
0 18
1 25
0 26
end
gen underagemale = 1 if age<21 & male==1
list
replace underagemale = 0 if age>=21 & male==1
list
. list
+-----------------------+
| male age undera~e |
|-----------------------|
1. | 1 20 1 |
2. | 0 18 . |
3. | 1 25 . |
4. | 0 26 . |
+-----------------------+
. list
+-----------------------+
| male age undera~e |
|-----------------------|
1. | 1 20 1 |
2. | 0 18 . |
3. | 1 25 0 |
4. | 0 26 . |
+-----------------------+
A more efficient way to create this variable is with the condition function, which is shown
below.
Stata knew what to do in this case, without using parentheses. It used the order of evaluation
rule, or order of operations rule, for all operators, shown below.
! or ~ (not)
^ (exponentiation)
- (negation)
/ , * (division or multiplication)
- , + (subtraction or addition)
~= (or !=) (not equal to)
> , < (greater than or less than)
<= , >= (“less than or equal to” or “greater than or equal to”)
== (equal to)
& (and)
| (or)
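A few display commands illustrate how this ordering applies to relational and logical operators. This is a sketch with arbitrary values; the comment lines give the value each command should display.

```stata
* & is evaluated before |, so this is 1 | (0 & 0), which is 1
display 1 | 0 & 0
* parentheses force the | first, so this is 1 & 0, which is 0
display (1 | 0) & 0
* relational operators are evaluated before &, so this is (5>4) & (3>2), which is 1
display 5>4 & 3>2
```

When in doubt, adding parentheses costs nothing and makes the intended grouping explicit.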
The logical operators “and”, “or”, and “not” (&, |, and ~ or !) follow precisely defined rules,
called Boolean arithmetic. This arithmetic is defined by the following truth tables.
P and Q
P Q P and Q
T T T
T F F
F T F
F F F
Thus both propositions, or expressions, P and Q,
must be true for (P and Q ) to be true.
P or Q
P Q P or Q
T T T
T F T
F T T
F F F
Thus at least one of the propositions, P or Q, must be true for
(P or Q) to be true.
not P
P not P
T F
F T
For any value of P, (not P) is its opposite.
Example
clear
input p q
1 1
1 0
0 1
0 0
end
gen PandQ=(p==1)&(q==1)
gen PorQ=(p==1)|(q==1)
gen notP=~(p==1)
list
+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+
Because the logical operators treat any nonzero value as true, these commands can be abbreviated
as follows:
clear
input p q
1 1
1 0
0 1
0 0
end
gen PandQ=p&q
gen PorQ=p|q
gen notP=~p
list
+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+
This abbreviated approach is not a good idea, however, because a missing value evaluates to
true (1), since it is stored as a “very large” nonzero number, whereas you might expect the
result to be missing, as it is in arithmetic generates.
clear
input p q
1 1
1 0
0 .
. 0
end
gen PandQ=p&q
gen PorQ=p|q
gen notP=~p
list
+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 . 0 1 1 |
4. | . 0 0 1 0 |
+-----------------------------+
Exercise
By applying the rule for “and” in your head, and using the above output, which is replicated
below, fill in the added column:
+-----------------------------+
| p q PandQ PorQ notP | (P and Q) and (P or Q)
|-----------------------------| ----------------------
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+
Chapter 1-8 (revision 16 May 2010)
Practice With Missing Data (Arithmetic Operators)
If arithmetic operators are used, generated variables are set to missing if any of the variables in
the arithmetic expression are missing.
Example
clear
input v1 v2
10 15
. 20
15 .
. .
5 4
end
gen tot = v1+v2
list
+---------------+
| v1 v2 tot |
|---------------|
1. | 10 15 25 |
2. | . 20 . |
3. | 15 . . |
4. | . . . |
5. | 5 4 9 |
+---------------+
Notice the total variable was set to missing in the expected fashion.
With relational operators, the missing values affect the result differently.
Example
Let’s create an indicator, or dichotomous, variable for legal-aged males (males ≥ 21).
clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = 0
replace legalmale = 1 if age>=21 & male==1
list , abbrev(15)
+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . 0 |
3. | 0 18 0 |
4. | 1 . 1 |
5. | 0 26 0 |
|------------------------|
6. | . 21 0 |
+------------------------+
We see that no results are missing, since missing is treated as a very large number when
evaluating relational operators.
A way to get around this is to add another line to set missing back to missing.
clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = 0
replace legalmale = 1 if age>=21 & male==1
replace legalmale = . if age==. | male==.
list , abbrev(15)
+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . . |
3. | 0 18 0 |
4. | 1 . . |
5. | 0 26 0 |
|------------------------|
6. | . 21 . |
+------------------------+
Condition function
The condition function, cond( ), is a concise way to create a categorical variable from a
continuous variable. It tests the condition in its first argument and returns the value of the
second argument if the condition is true, or the value of the third argument if the condition is
false.
In other words, it replaces the “generate” and “replace” lines with one “generate” line.
Example
Again, creating an indicator, or dichotomous, variable for legal-aged males (males ≥ 21).
clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = cond(age>=21 & male==1,1,0)
list , abbrev(15)
+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . 0 |
3. | 0 18 0 |
4. | 1 . 1 |
5. | 0 26 0 |
|------------------------|
6. | . 21 0 |
+------------------------+
Even with the cond( ) function, we must still replace the missing with missing, using one more
line:
replace legalmale = . if age==. | male==.
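If a single line is preferred, the missing-value check can be folded into the cond( ) call itself. The sketch below is one possible way to do this, not the approach used in the original course materials. It relies on Stata's missing( ) function, which returns 1 if any of its arguments is missing.

```stata
clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
* if age or male is missing, return missing;
* otherwise return the 0/1 value of the relational expression
gen legalmale = cond(missing(age, male), ., age>=21 & male==1)
list , abbrev(15)
```

This should give the same legalmale values as the cond( ) line followed by the replace line.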
There is a long list of functions for working with strings, dates, and times. You can see these by
searching for “functions” in Stata’s help, and then clicking on the “string functions” link or “date
and times” link.
A popular format for storing dates and times in hospital databases is the following:
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
list , abbrev(15)
+-----------------------------------------+
| admit_date infection_date |
|-----------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 |
+-----------------------------------------+
To confirm that these dates and times are stored as string variables, we use
describe
Contains data
obs: 3
vars: 2
size: 132 (99.9% of memory free)
------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------
admit_date str20 %20s
infection_date str20 %20s
------------------------------------------------------------
To subtract the two dates, for example, we must first convert them to numeric variables. This is
done by turning them into “elapsed dates”, which are the number of days since
January 1, 1960.
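The generate command for this conversion does not appear here. The sketch below is one way to do it that should produce an elapsed date in days. It uses clock( ) with the “MDYhms” mask to read the string, then dofc( ) to convert the resulting date-time value to a date. This is an assumption about how to get there, not necessarily the command used in the original course materials.

```stata
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
* clock() reads the string as milliseconds since 01jan1960 00:00:00;
* dofc() converts that date-time value to days since 01jan1960
gen admit_date2 = dofc(clock(admit_date, "MDYhms"))
list , abbrev(15)
* display the elapsed date in a readable form
format admit_date2 %d
list , abbrev(15)
```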
. list , abbrev(15)
+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 14447 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 14437 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 14665 |
+-------------------------------------------------------+
. format admit_date2 %d
. list , abbrev(15)
+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 |
+-------------------------------------------------------+
Notice we had to tell Stata the order of the components in the date-time string, using the
“MDYhms” mask, which is Month, Day, Year and hours, minutes, seconds.
+--------------------------------------------------------------------+
| admit_date infection_date admit_date2 admit_year |
|--------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 14447 1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 14437 1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 14665 2000 |
+--------------------------------------------------------------------+
. format admit_date2 %d
. list , abbrev(15)
+--------------------------------------------------------------------+
| admit_date infection_date admit_date2 admit_year |
|--------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 2000 |
+--------------------------------------------------------------------+
Notice that the admit_year variable is not an elapsed date, so it needed no format statement.
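The command that created admit_year is not shown above. A minimal sketch, assuming admit_date2 holds an elapsed date in days, uses the year( ) function. The dofc(clock( )) step used here to build admit_date2 is likewise an assumption.

```stata
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
* elapsed date in days since 01jan1960
gen admit_date2 = dofc(clock(admit_date, "MDYhms"))
* year() extracts the calendar year from an elapsed date
gen admit_year = year(admit_date2)
list , abbrev(15)
```

Similar functions, such as month( ) and day( ), extract the other components of an elapsed date.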
* done with admit_year, so remove it from the variables
capture drop admit_year
*
capture drop admit_date2
gen admit_date2 = clock(admit_date, "MDYhms")
list , abbrev(15)
format admit_date2 %tc
list , abbrev(15)
. list , abbrev(15)
+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 1.25e+12 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 1.25e+12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 1.27e+12 |
+-------------------------------------------------------+
. list , abbrev(15)
+--------------------------------------------------------------+
| admit_date infection_date admit_date2 |
|--------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 06:26:46 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 09:34:12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 10:20:09 |
+--------------------------------------------------------------+
Notice the seconds do not match, which is due to loss of precision with the default “float”
storage type.
This is because a date and time variable, converted with the clock function, is stored as the
elapsed milliseconds from January 1, 1960 at midnight. That is a very large number that cannot
be stored exactly in a “float” variable.
To preserve precision, we must specify the “double” storage type for the new variable.
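A sketch of the corrected commands follows. The only change from the previous attempt is the word double after gen, which asks Stata to store the new variable in double precision.

```stata
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
* double holds the elapsed milliseconds exactly; float does not
gen double admit_date2 = clock(admit_date, "MDYhms")
list , abbrev(15)
* display the date-time value in a readable form
format admit_date2 %tc
list , abbrev(15)
```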
. list , abbrev(15)
+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 1.248e+12 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 1.247e+12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 1.267e+12 |
+-------------------------------------------------------+
. list , abbrev(15)
+--------------------------------------------------------------+
| admit_date infection_date admit_date2 |
|--------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 06:26:00 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 09:35:00 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 10:20:00 |
+--------------------------------------------------------------+
If we want to compute the number of days from July 1, 1999, which is perhaps the study start
date, we can use a date literal. Stata’s d( ) function takes a date written as a day, followed by
a month, followed by a four-digit year, such as d(01jul1999).
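The command that computes daysfromstart is not shown here. A minimal sketch, assuming admit_date2 is an elapsed date in days (rebuilt below with dofc(clock( )), which is itself an assumption), is:

```stata
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
* elapsed date in days since 01jan1960
gen admit_date2 = dofc(clock(admit_date, "MDYhms"))
format admit_date2 %d
* d(01jul1999) is the elapsed date for July 1, 1999,
* so the difference is the number of days since the study start
gen daysfromstart = admit_date2 - d(01jul1999)
list , abbrev(15)
```

Both values are measured in days since January 1, 1960, so subtracting them gives a difference in days.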
+-----------------------------------------------------------------------+
| admit_date infection_date admit_date2 daysfromstart |
|-----------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 21 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 11 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 239 |
+-----------------------------------------------------------------------+
Suppose you believe that if you can show it takes longer for a third-year resident to respond to
a beeper page than a second-year resident, you could surely get this published in the New
England Journal of Medicine.
Data were collected using a computerized system, which records the date and time, down to the
second. The data are in the file residents.dta. Compute the elapsed minutes from page to
response and compare the two groups using an independent groups t test and a Wilcoxon-Mann-
Whitney test.
This is a rather difficult problem for most students. So here is most of the solution (also at the
bottom of the chapter8.do file).
1) All you need to do is fill in the missing line or lines, to create the timeminutes variable
(minutes between page and response).
Hint: there are 1000 milliseconds to one second. The timemilliseconds variable is elapsed time
in milliseconds. You have to get time in minutes.
2) Add “if statements” to the test statistics on the last three rows to eliminate an outlier in one of
the groups. Do you think it would be a legitimate analysis strategy to do this?
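To illustrate the unit conversion in the hint with a made-up number: 90,000 milliseconds is 90 seconds, or one and a half minutes. The value below is arbitrary and is not taken from residents.dta.

```stata
* 1000 milliseconds per second, 60 seconds per minute
display 90000/(1000*60)
* displays 1.5
```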
References
Ellison DH, Berl T. (2007). The syndrome of inappropriate antidiuresis. N Engl J Med
356:2064-72.
Sykes MK, Vickers MD, Hull CJ, Winterburn PJ, Shepstone BJ. (1991). Principles of
Measurement and Monitoring in Anaesthesia and Intensive Care, 3rd ed, Oxford,
Blackwell Scientific Publications.
clear
input tbw na1 na2 vinput einput eurine einf
25 33 49 50 100 5 44
end
list
*
capture drop volume
gen volume=(tbw*(1-(na1+23.8)/(na2+23.8)) ///
+vinput-(einput*vinput)/eurine) ///
/(einf/eurine-1)
list
Note: The “///”, which tells Stata to continue the command on the next
line, only works in the do-file editor. If you are using Stata’s command window, you
must put everything on one line, omitting the two instances of “///”.