StatTalk: December 2006

Friday, December 15, 2006

A Special Anova

I had a client who came in with an anova question in Stata. She wanted to run a model with one between-subject factor with six levels and three within-subject factors, each with two levels. The model has seven explicit error terms, each with 30 degrees of freedom, not including the residual error, which also has 30 df. There are so many error terms because each within-subject factor has its own error term, as does each interaction among the within-subject factors. In all, the model used 257 degrees of freedom, along with another 30 df for the residual error.
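The degrees of freedom quoted above can be checked with a little arithmetic. The sketch below is in Python rather than Stata, and it rests on assumptions the post does not state: 36 subjects (six per group, inferred from the 30 df per error term) and a third within-subject factor called D (a hypothetical name; A, B, C and blocks are the names used in this post).

```python
from itertools import combinations

# Assumptions (not stated in the post): 36 subjects, six in each of the
# six groups, inferred from 36 - 6 = 30 df per error term.
a_levels = 6
n_subjects = 36
within = ["B", "C", "D"]  # "D" is a hypothetical name for the third factor

# Every non-empty combination of within-subject factors is one effect:
# B, C, D, B*C, B*D, C*D, B*C*D.
effects = ["*".join(combo)
           for r in range(1, len(within) + 1)
           for combo in combinations(within, r)]

df_a = a_levels - 1                  # between-subject factor A
df_subjects = n_subjects - a_levels  # subjects within A (the residual error)
df_effect = 1                        # any product of two-level factors has 1 df

df_within = len(effects) * df_effect                # within-subject effects
df_a_by_within = len(effects) * df_a * df_effect    # A crossed with each effect
df_errors = len(effects) * df_effect * df_subjects  # one error term per effect

df_model = df_a + df_within + df_a_by_within + df_errors
df_total = n_subjects * 2 ** len(within) - 1        # observations minus one

print(effects)
print(df_model, df_subjects, df_total)
```

Under these assumptions the seven effects give seven error terms, the model uses 257 df, and adding the 30 residual df accounts for all 287 total df.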

We started with regular Intercooled Stata and immediately got a "too many variables or values (matsize too small error)." I set the matsize to 800, which is the maximum for Intercooled Stata, and ran it again. This time it just said, "too many variables or values." Nothing about the matsize. I read in the manual that the limit for anova was eight variables in a single term. This model came close but didn't exceed that limit. Nothing we tried could get it to run.

I gave up on Stata for the moment and flipped the data into SAS using StatTransfer. Using proc glm, it ran perfectly the first time (this doesn't happen for me very often). The client was not familiar with SAS and really wanted output in Stata. I thought maybe it would run using Stata/SE. The SE in Stata/SE stands for special edition (I think) and allows a matsize up to 11,000. I got on one of the computers that had SE, set the memory to 100m and matsize to 1200 (a value I thought would be way too big), and ran it again. It worked fine, producing all the F-tests, the conservative p-values, the covariance matrix, everything.

So, what was going on? Why wouldn't it run in Intercooled Stata? A little bit of investigation revealed the answer. When I manually code a design matrix for anova, I use as many columns as there are degrees of freedom. However, Stata uses an overparameterized design matrix; it uses as many columns as there are parameters. Consider one of the within-subjects effects, B*C, and its error term, B*C*blocks nested in A (in Stata written as B*C*blocks|A). I would give one df for B*C and 30 df for B*C*blocks|A. Stata, with its overparameterized model, allocates 4 columns for B*C and 196 for the error term. In total the design needed a matsize of 1184 in order to run. So I really wasn't that far off with my wild guess of 1200.
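To make the two ways of counting columns concrete, here is a minimal sketch in Python (not Stata) for a crossed term such as B*C. The function names are hypothetical. It covers only fully crossed terms; it does not attempt to reproduce the 196 columns of the nested error term, since that count depends on how Stata numbers the nested levels.

```python
from math import prod

def df_columns(levels):
    """Columns needed when coding one column per degree of freedom."""
    return prod(k - 1 for k in levels)

def indicator_columns(levels):
    """Columns needed when every cell of the term gets its own indicator."""
    return prod(levels)

b_by_c = [2, 2]     # B and C each have two levels
three_way = [2, 2, 2]  # the three-way within-subject interaction

print(df_columns(b_by_c), indicator_columns(b_by_c))
print(df_columns(three_way), indicator_columns(three_way))
```

For B*C this gives 1 column under df coding and 4 under the overparameterized coding, matching the counts above.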

This situation shows how quickly the matsize can grow for these mixed-effects models.

pbe

Friday, December 8, 2006

What's in a name?

As you can tell from our web pages, we are the Statistical Consulting Group of UCLA Academic Technology Services. Some parts of our name are obvious, such as UCLA. But the other parts may need some explaining.

"Statistical Consulting Group" -- as this part of our name suggests, we provide statistical consulting for UCLA faculty, staff and students. We provide services to campus members engaged in research. We do not provide direct support for students enrolled in statistics or other methods classes; that is, we do not assist students with homework or course projects. On the other hand, we do not charge campus researchers for any of our services. That's right, it's totally free.

We have three major modes of providing our consulting services. First, we have walk-in consulting sixteen hours per week. People show up during consulting hours and receive assistance on a first-come, first-served basis. On an average day we will see about a dozen clients. Our second mode is email consulting. Campus clients send in questions via email, and if we can understand the question, we email a reply within one working day. If we don't understand what is being asked, we reply asking for clarification or ask the client to visit us during our walk-in consulting hours. When clients visit us, we ask that they bring their data so that we can work on it with them. Our final mode is our web pages. We try to put up pages that cover many of the common questions and problems that our clients encounter. Our web services are very successful, generating nearly one million hits per month, split almost equally between campus and non-campus URLs.

"Academic Technology Services (ATS)" -- this part of our name requires a little explanation. The former name for ATS was the Office of Academic Computing (OAC). It was the campus computing center. I started at UCLA back in the mainframe days. At that time OAC was running an IBM 360 model 91, which was later swapped for an IBM 3033 and finally an IBM 3090. Just about all the statistical computing on campus was done on the central mainframe. The big stat packages in those days were BMD, SAS, and SPSS. Since all this software was run centrally, it made sense to have centralized statistical consultants. In fact, the vast majority of the "stat" questions concerned the JCL (Job Control Language) needed to run statistics software and use tapes and disks.

As the computing environment on campus changed from centralized mainframe systems to distributed departmental and individual systems, it became clear that the name Office of Academic Computing was no longer a good fit for what the organization did. The name change to Academic Technology Services better reflects the breadth of services that are currently provided. One of these services is statistical consulting. So even as computing has become decentralized, statistical consulting has remained an important component of a centralized unit, providing service to the entire campus community.

pbe

Wednesday, December 6, 2006

Citing the End

Most people know that we support numerous statistical software packages. But did you know that we also have a variety of other research-related programs? For example, we have three packages that help manage references: EndNote, ProCite and Reference Manager. These programs (and others like them) help you create the bibliography for your publication or dissertation. Each of them has hundreds of journal styles that you can select so that your references are in the proper format. Using a program like this may save you a headache, so UCLA researchers are welcome to stop by our walk-in consulting and try these out.

crw

Saturday, December 2, 2006

Power Struggle

I got an email from Kevin Cummins (UCSD Medical School), who was playing with my powerlog program. powerlog is supposed to compute the sample size needed to achieve a given power in logistic regression with a continuous predictor. You tell the program the proportion of ones at the mean of a continuous predictor (p1) and the proportion of ones at the mean plus one standard deviation (p2). I basically translated a SAS macro program by Michael Friendly (York University) who, in turn, referenced two formulas from Agresti's An Introduction to Categorical Data Analysis (p. 131).

Kevin's email pointed out that my program produced some strange sample sizes. For some values of p1 and p2 the program was generating negative sample sizes and there were also times when sample size increased when the effect size increased. This, of course, is not a behavior you would expect.

The problem with the negative sample sizes was easy to correct once I spotted the error. There was a calculation that involved exp(-`lambda'^2) (the funny single quotes are Stata's way of indicating a macro variable). The way it was supposed to work was that lambda was squared, then made negative and then exponentiated. The way I wrote it, lambda was first made negative, then squared and finally exponentiated. An extra set of parentheses solved the problem.
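The difference between the two orders of evaluation is easy to see with explicit parentheses. This is a small illustration in Python, not the Stata code from powerlog, and the value of lam is hypothetical.

```python
import math

lam = 1.5  # hypothetical value, for illustration only

# Intended order: square, then negate, then exponentiate.
intended = math.exp(-(lam ** 2))

# Order that actually happened: negate, then square, then exponentiate.
mistaken = math.exp((-lam) ** 2)

print(intended, mistaken)
```

For any nonzero lam the intended value is below 1 while the mistaken value is above 1 (the two are reciprocals), so the error is large enough to throw off whatever is computed downstream.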

The second problem was much more difficult to diagnose. I verified that the SAS macro program had the same problem of increasing sample size with increasing effect size under certain conditions. The algorithm used was from Hsieh (1989). So I did what I should have done in the first place and got a copy of the article from the library. I verified that Agresti, the SAS macro program and my Stata ado implemented the algorithm as it was written. Then I did some online searching and found a much later article, Hsieh et al. (1998), which compared several different methods for estimating sample size for logistic regression. The conclusion section of the article stated that the algorithm should not be used for odds ratios greater than 3 or less than 1/3. With large odds ratios you can get some unbelievable sample size estimates. For example, with an odds ratio of 81 the program produces a suggested sample size of 1.7e+11.

So, right now I'm trying to decide whether to just restrict the program to odds ratios between 1/3 and 3 or to just sit on the program for a bit and search for a better algorithm that I can implement.
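Restricting the program would amount to a check like the one below. This is a sketch in Python, not the powerlog code: the function names are hypothetical, and it assumes the odds ratio is taken as the odds at p2 divided by the odds at p1, with p1 and p2 defined as above.

```python
def odds_ratio(p1, p2):
    """Odds at p2 (mean plus one SD) relative to odds at p1 (the mean)."""
    return (p2 / (1 - p2)) / (p1 / (1 - p1))

def in_recommended_range(p1, p2, low=1 / 3, high=3.0):
    """True when the odds ratio lies between 1/3 and 3, inclusive."""
    return low <= odds_ratio(p1, p2) <= high

# Hypothetical inputs: a modest effect and a very large one.
print(odds_ratio(0.5, 0.6), in_recommended_range(0.5, 0.6))
print(odds_ratio(0.1, 0.9), in_recommended_range(0.1, 0.9))
```

The first pair has an odds ratio of about 1.5 and passes the check. The second has an odds ratio of about 81 and would be rejected.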

There will be a lot more items on power analysis coming up in the near future because we will be putting on a series of presentations on this topic in the Spring. We have an outside speaker (non-UCLA), Jason Cole (Consulting Measurement Group), who volunteered to do presentations on intermediate and advanced issues in power analysis. We will kick things off with an introduction to power analysis by our own Christine Wells. We will announce the dates and times on our ATS Stat web page.

pbe
