/* Literature: see below
Purpose: show how the methods differ in estimation (bias) and precision.
Data are simulated for 4 groups with nominal means 0.1 to 0.4.
The error term is lognormally distributed with mean=0.01 and std=0.1.
Heteroscedasticity arises because the data are truncated to [0;1].
All regressions should be run with the "robust" correction.
*/
* Simulation
clear
input percent group freq
.10 1 64
.20 2 32
.30 3 16
.40 4 8
end
label variable group "group"
expand freq
sort group
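* Lognormal error: target mean e and standard deviation s on the original scale
scalar e=0.01
* NOTE: the header and the plot subtitles state sd=0.1, but s is set to 1 here
scalar s=1
* Parameters of the underlying normal: b = sd of ln(error), a = mean of ln(error)
scalar b=sqrt(ln(s^2/e^2+1))
scalar a=ln(e)-b^2/2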
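* Sketch: check the lognormal parameterization above. If ln(error) ~ N(a, b^2),
* the error has mean exp(a+b^2/2) and sd sqrt(exp(b^2)-1)*exp(a+b^2/2),
* so the two values displayed here should reproduce e and s.
display "implied mean = " exp(scalar(a)+scalar(b)^2/2)
display "implied sd   = " sqrt(exp(scalar(b)^2)-1)*exp(scalar(a)+scalar(b)^2/2)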
set seed 3
* Add a lognormal error to each group mean, then truncate the result to [0,1]
replace percent=min(max(percent+exp(rnormal(a,b)),0),1)
tabstat percent, by(group) stat(N mean median sd min max skew) format(%8.3f)
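* Sketch of the bias check named in the header: compare each simulated group
* mean with its nominal value. "nominal" and "shift" are illustrative names,
* not part of the original analysis. Before truncation the expected shift is
* the error mean e; truncation at 1 can pull it down.
preserve
collapse (mean) percent, by(group)
generate nominal = group/10
generate shift = percent - nominal
list group nominal percent shift, noobs
restore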
graph box percent, over(group) ///
title(percentage data, size(small)) ///
subtitle(means of group: 0.1-0.4 error lognormal µ=0.01 sd=0.1 truncated, size(vsmall))
graph rename data, replace
*********************************************** per ordinary reg
reg percent i.group
* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest
estimates store OrdReg
margins group, mcomp(bon)
marginsplot, title(percentage data per ordinary regression, size(small)) ///
subtitle(means of group: 0.1-0.4 error lognormal µ=0.01 sd=0.1 truncated, size(vsmall))
graph rename per_ordinaryreg, replace
pwcompare group, effects mcomp(bon)
*********************************************** per robust reg
reg percent i.group, robust
estimates store RobReg
margins group, mcomp(bon)
marginsplot, title(percentage data per robust regression, size(small)) ///
subtitle(means of group: 0.1-0.4 error lognormal µ=0.01 sd=0.1 truncated, size(vsmall))
graph rename per_robustreg, replace
pwcompare group, effects mcomp(bon)
*********************************************** per robust logit
gen logitperc=logit(percent)
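* Sketch: logit() returns missing when percent is exactly 0 or 1, and regress
* then drops those observations. Counting them shows how much of the sample
* the logit regression loses to truncation.
count if missing(logitperc)
count if inlist(percent, 0, 1)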
reg logitperc i.group, robust
estimates store RobLogit
margins group, expression(invlogit(predict(xb))) mcomp(bon)
marginsplot, title(percentage data per robust logit, size(small)) ///
subtitle(means of group: 0.1-0.4 error lognormal µ=0.01 sd=0.1 truncated, size(vsmall))
graph rename per_robustlogit, replace
margins group, expression(invlogit(predict(xb))) mcomp(bon) pwcompare(effects)
/* Caution!
Neither here nor in the GLM case below
can the "margins, expression( ) ... pwcompare ..." command be shortened to
margins group, mcomp(bon) pwcompare(effects)
as it can after a logistic regression.
*/
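* Sketch, for contrast only: without expression(), margins after this regress
* works with the linear prediction, i.e. the logit of percent, so the pairwise
* differences shown here are on the logit scale, not the proportion scale.
margins group, mcomp(bon) pwcompare(effects)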
*********************************************** per robust glm
glm percent i.group, family(binomial) link(logit) robust
estimates store RobGLM
margins group, mcomp(bon) /* here, after glm, the short form of margins suffices */
marginsplot, title(percentage data per robust glm, size(small)) ///
subtitle(means of group: 0.1-0.4 error lognormal µ=0.01 sd=0.1 truncated, size(vsmall))
margins group, expression(invlogit(predict(xb))) mcomp(bon) pwcompare(effects)
/* Caution!
Neither here nor in the robust logit regression above
can the "margins, expression( ) ... pwcompare ..." command be shortened to
margins group, mcomp(bon) pwcompare(effects)
as it can after a logistic regression.
*/
graph rename per_robustglm, replace
************************************************ Graphs
graph combine data per_ordinaryreg per_robustreg per_robustlogit per_robustglm, ycommon
graph export "percent ordinary vs robust.wmf", replace
graph combine per_robustreg per_robustlogit per_robustglm, ycommon
graph export "percent various robust methods.wmf", replace
************************************************ Results (reg models only: logit and GLM coefficients are not back-transformed)
* esttab is part of the user-written estout package (ssc install estout)
esttab OrdReg RobReg, b(%10.4f) se mtitles title(Compare various models)
/* if percentages are in the form
#success #pop x1 x2 ...
then use
Logistic regression for grouped data: blogit
Probit regression for grouped data: bprobit
Weighted least-squares logistic regression for grouped data: glogit
Weighted least-squares probit regression for grouped data: gprobit
* For just percentages use the following
http://www.stata.com/support/faqs/stat/logit.html
Question: How do you fit a model when the dependent variable is a proportion?
Title:    Logit transformation
Authors:  Allen McDowell, StataCorp
          Nicholas J. Cox, Durham University, UK
Date:     August 2001; updated August 2004
A traditional solution to this problem is to perform a logit transformation on the data.
Suppose that your dependent variable is called y and your independent variables are called X.
Then, one assumes that the model that describes y is
    y = 1 / (1 + exp(-XB))
If one then performs the logit transformation, the result is
ln( y / (1 - y) ) = XB
We have now mapped the original variable, which was bounded by 0 and 1, to the real line.
One can now fit this model using OLS or WLS, for example by using regress.
Of course, one cannot perform the transformation on observations where the dependent variable is zero or one;
the result will be a missing value, and that observation would subsequently be dropped from the estimation sample.
A better alternative is to estimate using glm with family(binomial), link(logit), and robust;
this is the method proposed by Papke and Wooldridge (1996).
At the time this article was published, Stata’s glm command could not fit such models,
and this fact is noted in the article.
glm has since been enhanced specifically to deal with fractional response data.
In either case, there may well be a substantive issue of interpretation.
Let us focus on interpreting zeros: the same kind of issue may well arise for ones.
Suppose the y variable is proportion of days workers spend off sick.
There are two extreme possibilities.
The first extreme is that all observed zeros are in effect sampling zeros:
each worker has some nonzero probability of being off sick, and it is merely that some workers were not,
in fact, off sick in our sample period.
Here, we would often want to include the observed zeros in our analysis and the glm route is attractive.
The second extreme is that some or possibly all observed zeros must be considered as structural zeros:
these workers will not ever report sick, because of robust health and exemplary dedication.
These are extremes, and intermediate cases are also common.
In practice, it is often helpful to look at the frequency distribution:
a marked spike at zero or one may well raise doubt about a single model fitted to all data.
A second example might be data on trading links between countries.
Suppose the y variable is proportion of imports from a certain country.
Here a zero might be structural if two countries never trade, say on political or cultural grounds.
A model that fits over both the zeros and the nonzeros might not be advisable,
so that a different kind of model should be considered.
Reference
Papke, L. E. and J. Wooldridge. 1996.
Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619–632.
*/
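* A minimal sketch, not part of the FAQ text above: Stata 14 and newer also
* provide fracreg for fractional responses. Assuming such a version is
* installed, this should fit the same mean model as the robust glm above and
* keeps observations with percent equal to 0 or 1 in the estimation sample.
fracreg logit percent i.group, vce(robust)
margins group, mcomp(bon)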