Exhibit A.1 The Three Aspects and Major Categories of the Mathematics Frameworks
Exhibit A.2 Distribution of Mathematics Items by Content Reporting Category and Performance Category
Exhibit A.3 Coverage of TIMSS 1999 Target Population – Countries
Exhibit A.4 School Sample Sizes – Countries
Exhibit A.5 Student Sample Sizes – Countries
Exhibit A.6 Overall Participation Rates – Countries
Exhibit A.7 TIMSS 1999 Within-Country Free-Response Scoring Reliability Data for Mathematics Items
Exhibit A.8 Cronbach's Alpha Reliability Coefficient – TIMSS 1999 Mathematics Test
Exhibit A.9 Country-Specific Variations in Mathematics Topics in the Curriculum Questionnaire
History
TIMSS 1999 represents the continuation of a long series of studies
conducted by the International Association for the Evaluation of Educational
Achievement (IEA). Since its inception in 1959, the IEA has conducted
more than 15 studies of cross-national achievement in the curricular
areas of mathematics, science, language, civics, and reading. The
Third International Mathematics and Science Study (TIMSS), conducted
in 1994-1995, was the largest and most complex IEA study, and included
both mathematics and science at third and fourth grades, seventh and
eighth grades, and the final year of secondary school.
In 1999, TIMSS again assessed eighth-grade students in both mathematics
and science to measure trends in student achievement since 1995. TIMSS
1999 was also known as TIMSS-Repeat, or TIMSS-R.(1)
To provide U.S. states and school districts with an opportunity to
benchmark the performance of their students against that of students
in the high-performing TIMSS countries, the International Study Center
at Boston College, with the support of the National Center for Education
Statistics and the National Science Foundation, established the TIMSS
1999 Benchmarking Study. Through this project, the TIMSS mathematics
and science achievement tests and questionnaires were administered
to representative samples of students in participating states and
school districts in the spring of 1999, at the same time the tests
and questionnaires were administered in the TIMSS countries. Participation
in TIMSS Benchmarking was intended to help states and districts understand
their comparative educational standing, assess the rigor and effectiveness
of their own mathematics and science programs in an international
context, and improve the teaching and learning of mathematics and
science.
Participants in TIMSS Benchmarking
Thirteen states availed themselves of the opportunity to participate in the
Benchmarking Study. Eight public school districts and six consortia
also participated, for a total of fourteen districts and consortia.
They are listed in Exhibit
1 of the Introduction, together with the 38 countries that took
part in TIMSS 1999.
Developing the TIMSS 1999 Mathematics Test
The TIMSS curriculum framework underlying the mathematics tests was
developed for TIMSS in 1995 by groups of mathematics educators with
input from the TIMSS National Research Coordinators (NRCs). As shown
in Exhibit
A.1, the mathematics curriculum framework contains three dimensions
or aspects. The content aspect represents the subject matter content
of school mathematics. The performance expectations aspect describes,
in a non-hierarchical way, the many kinds of performances or behaviors
that might be expected of students in school mathematics. The perspectives
aspect focuses on the development of students' attitudes, interest,
and motivation in mathematics. Because the frameworks
were developed to include content, performance expectations, and perspectives
for the entire span of curricula from the beginning of schooling through
the completion of secondary school, some aspects may not be reflected
in the eighth-grade TIMSS assessment.(2) Working
within the framework, mathematics test specifications for TIMSS in
1995 were developed that included items representing a wide range
of mathematics topics and eliciting a range of skills from the students.
The 1995 tests were developed through an international consensus involving
input from experts in mathematics and measurement specialists, ensuring
they reflected current thinking and priorities in mathematics.
About one-third of the items in the 1995 assessment were kept secure
to measure trends over time; the remaining items were released for
public use. An essential part of the development of the 1999 assessment,
therefore, was to replace the released items with items of similar
content, format, and difficulty. With the assistance of the Science
and Mathematics Item Replacement Committee, a group of internationally
prominent mathematics and science educators nominated by participating
countries to advise on subject-matter issues in the assessment, over
300 mathematics and science items were developed as potential replacements.
After an extensive process of review and field testing, 114 items were
selected for use as replacements in the 1999 mathematics assessment.
Exhibit
A.2 presents the five content areas included in the 1999 mathematics
test and the numbers of items and score points in each area. Distributions
are also included for the five performance categories derived from
the performance expectations aspect of the curriculum framework. About
one-fourth of the items were in the free-response format, requiring
students to generate and write their own answers. Designed to take
about one-third of students' test time, some free-response questions
asked for short answers while others required extended responses with
students showing their work or providing explanations for their answers.
The remaining questions used a multiple-choice format. In scoring
the tests, correct answers to most questions were worth one point.
Consistent with the approach of allotting students longer response
time for the constructed-response questions than for multiple-choice
questions, however, responses to some of these questions (particularly
those requiring extended responses) were evaluated for partial credit,
with a fully correct answer being awarded two points (see later section
on scoring). The total number of score points available for analysis
thus somewhat exceeds the number of items.
Every effort was made to help ensure that the
tests represented the curricula of the participating countries and
that the items exhibited no bias towards or against particular countries.
The final forms of the tests were endorsed by the NRCs of the participating
countries.(3)
TIMSS Test Design
Not all of the students in the TIMSS assessment responded to all
of the mathematics items. To ensure broad subject-matter coverage
without overburdening individual students, TIMSS used a rotated design
that included both the mathematics and science items. Thus, the same
students participated in both the mathematics and science testing.
As in 1995, the 1999 assessment consisted of eight booklets, each
requiring 90 minutes of response time. Each participating student
was assigned one booklet only. In accordance with the design, the
mathematics and science items were assembled into 26 clusters (labeled
A through Z). The secure trend items were in clusters A through H,
and items replacing the released 1995 items in clusters I through
Z. Eight of the clusters were designed to take 12 minutes to complete;
10 of the clusters, 22 minutes; and 8 clusters, 10 minutes. In all,
the design provided 396 testing minutes, 198 for mathematics and 198
for science. Cluster A was a core cluster assigned
to all booklets. The remaining clusters were assigned to the booklets
in accordance with the rotated design so that representative samples
of students responded to each cluster.(4)
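The cluster timing arithmetic above can be checked with a short sketch (the grouping of labels into the three duration classes is illustrative; the text specifies only the counts):

```python
# As described: 8 clusters of 12 minutes, 10 of 22 minutes, and
# 8 of 10 minutes -- 26 clusters in all, labeled A through Z.
cluster_minutes = [12] * 8 + [22] * 10 + [10] * 8

total = sum(cluster_minutes)
print(len(cluster_minutes), total, total // 2)  # prints 26 396 198
```

The 396 testing minutes split evenly into 198 for mathematics and 198 for science, as the design specifies.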
Background Questionnaires
TIMSS in 1999 administered a broad array of questionnaires to collect
data on the educational context for student achievement and to measure
trends since 1995. National Research Coordinators, with the assistance
of their curriculum experts, provided detailed information on the
organization, emphases, and content coverage of the mathematics and
science curriculum. The students who were tested answered questions
pertaining to their attitudes towards mathematics and science, their
academic self-concept, classroom activities, home background, and
out-of-school activities. The mathematics and science teachers of
sampled students responded to questions about teaching emphasis on
the topics in the curriculum frameworks, instructional practices,
professional training and education, and their views on mathematics
and science. The heads of schools responded to questions about school
staffing and resources, mathematics and science course offerings, and
teacher support.
Translation and Verification
The TIMSS instruments were prepared in English and translated into
33 languages, with 10 of the 38 countries collecting data in two languages.
In addition, it sometimes was necessary to modify the international
versions for cultural reasons, even in the nine countries that tested
in English. This process represented an enormous effort for the national
centers, with many checks along the way. The translation
effort included (1) developing explicit guidelines for translation
and cultural adaptation; (2) translation of the instruments by the
national centers in accordance with the guidelines, using two or more
independent translations; (3) consultation with subject-matter experts
on cultural adaptations to ensure that the meaning and difficulty
of items did not change; (4) verification of translation quality by
professional translators from an independent translation company;
(5) corrections by the national centers in accordance with the suggestions
made; (6) verification by the International Study Center that corrections
were made; and (7) a series of statistical checks after the testing
to detect items that did not perform comparably across countries.(5)
Population Definition and Sampling
TIMSS in 1995 had as its target population students enrolled in the
two adjacent grades that contained the largest proportion of 13-year-old
students at the time of testing, which were seventh- and eighth-grade
students in most countries. TIMSS in 1999 used
the same definition to identify the target grades, but assessed students
in the upper of the two grades only, which was the eighth grade in
most countries, including the United States.(6)
The eighth grade was the target population for all of the Benchmarking
participants.
The selection of valid and efficient samples was essential to the
success of TIMSS and of the Benchmarking Study. For TIMSS internationally,
NRCs, including Westat, the sampling and data collection coordinator
for TIMSS in the United States, received training in how to select
the school and student samples and in the use of the sampling software,
and worked in close consultation with Statistics Canada, the TIMSS
sampling consultants, on all phases of sampling. As well as conducting
the sampling and data collection for the U.S. national TIMSS sample,
Westat was also responsible for sampling and data collection in each
of the Benchmarking states, districts, and consortia.
To document the quality of the school and student samples in each
of the TIMSS countries, staff from Statistics Canada and the International
Study Center worked with the TIMSS sampling referee (Keith Rust, Westat)
to review sampling plans, sampling frames, and sampling implementation.
Particular attention was paid to coverage of the target population
and to participation by the sampled schools and students. The data
from the few countries that did not fully meet all of the sampling
guidelines are annotated in the TIMSS international reports, and are
also annotated in this report. The TIMSS samples for the Benchmarking
participants were also carefully reviewed in light of the TIMSS sampling
guidelines, and the results annotated where appropriate. Since Westat
was the sampling contractor for the Benchmarking project, the role
of sampling referee for the Benchmarking review was filled by Pierre
Foy, of Statistics Canada.
Although all countries and Benchmarking participants were expected
to draw samples representative of the entire internationally desired
population (all students in the upper of the two adjacent grades with
the greatest proportion of 13-year-olds), the few countries where
this was not possible were permitted to define a national desired
population that excluded part of the internationally desired population.
Exhibit
A.3 shows any differences in coverage between the international
and national desired populations. Almost all TIMSS countries achieved
100 percent coverage (36 out of 38), with Lithuania and Latvia the
exceptions. Consequently, the results for Lithuania are annotated,
and because coverage fell below 65 percent for Latvia, the Latvian
results are labeled "Latvia (LSS)," for Latvian-Speaking
Schools. Additionally, because of scheduling difficulties, Lithuania
was unable to test its eighth-grade students in May 1999 as planned.
Instead, the students were tested in September 1999, when they had
moved into the ninth grade. The results for Lithuania are annotated
to reflect this as well. Exhibit
A.3 also shows that the sampling plans for the Benchmarking participants
all incorporated 100 percent coverage of the desired population. Four
of the 13 states (Idaho, Indiana, Michigan, and Pennsylvania) as well
as the Southwest Pennsylvania Math and Science Collaborative included
private schools as well as public schools.
In operationalizing their desired eighth-grade population, countries
and Benchmarking participants could define a population to be sampled
that excluded a small percentage (less than 10 percent) of certain
kinds of schools or students that would be very difficult or resource-intensive
to test (e.g., schools for students with special needs or schools
that were very small or located in extremely rural areas). Exhibit
A.3 also shows that the degree of such exclusions was small. Among
countries, only Israel reached the 10 percent limit, and among Benchmarking
participants, only Guilford County and Montgomery County did so. All
three are annotated as such in the achievement chapters of this report.
Within countries, TIMSS used a two-stage sample design, in which
the first stage involved selecting about 150 public and private schools
in each country. Within each school, countries were to use random
procedures to select one mathematics class at the eighth grade. All
of the students in that class were to participate in the TIMSS testing.
This approach was designed to yield a representative sample of about
3,750 students per country. Typically, between 450 and 3,750 students
responded to each achievement item in each country, depending on the
booklets in which the items appeared.
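The two-stage design can be sketched as follows. The school frame, class counts, and the simple random school selection here are synthetic assumptions for illustration (the operational TIMSS design selected schools with probability proportional to size):

```python
import random

def two_stage_sample(classes_by_school, n_schools, seed=0):
    """Stage 1: sample schools (simple random here, for illustration).
    Stage 2: pick one intact eighth-grade mathematics class per
    sampled school; every student in that class is tested."""
    rng = random.Random(seed)
    schools = rng.sample(sorted(classes_by_school), n_schools)
    return {s: rng.choice(classes_by_school[s]) for s in schools}

# Hypothetical frame: 200 schools, three classes of 25 students each.
frame = {s: [[f"s{s}c{c}p{p}" for p in range(25)] for c in range(3)]
         for s in range(200)}
picked = two_stage_sample(frame, n_schools=150)
print(sum(len(cls) for cls in picked.values()))  # 3750 students
```

With 150 schools and one class of about 25 students each, the design yields roughly the 3,750 students per country that the text describes.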
States participating in the Benchmarking study were required to sample
at least 50 schools and approximately 2,000 eighth-grade students.
School districts and consortia were required to sample at least 25
schools and at least 1,000 students. Where there were fewer than 25
schools in a district or consortium, all schools were to be included,
and the within-school sample increased to yield the total of 1,000
students.
Exhibits A.4
and A.5
present achieved sample sizes for schools and students, respectively,
for the TIMSS countries and for the Benchmarking participants. Where
a district or consortium was part of a state that also participated,
the state sample was augmented by the district or consortium sample,
properly weighted in accordance with its size. Schools in a state
that were sampled as part of the U.S. national TIMSS sample were also
used to augment the state sample. For example, the Illinois sample
consists of 90 schools, 41 from the state Benchmarking sample (including
five schools from the national TIMSS sample), 27 from the Chicago
Public Schools, 17 from the First in the World Consortium, and five
from the Naperville School District.
Exhibit
A.6 shows the participation rates for schools, students, and overall,
both with and without the use of replacement schools, for TIMSS countries
and Benchmarking participants. All of the countries met the guideline
for sampling participation – 85 percent of both the schools and
students, or a combined rate (the product of school and student
participation) of 75 percent – although Belgium (Flemish), England, Hong Kong,
and the Netherlands did so only after including replacement schools,
and are annotated accordingly in the achievement chapters.
With the exception of Pennsylvania and Texas, all the Benchmarking
participants met the sampling guidelines, although Indiana did so
only after including replacement schools. Indiana is annotated to
reflect this in the achievement chapters, and Pennsylvania and Texas
are italicized in all exhibits in this report.
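The participation guideline reduces to a simple rule, sketched here with rates expressed as proportions:

```python
def meets_guideline(school_rate, student_rate):
    """TIMSS sampling-participation guideline: at least 85 percent
    participation for both schools and students, or a combined rate
    (the product of the two) of at least 75 percent."""
    both_high = school_rate >= 0.85 and student_rate >= 0.85
    return both_high or school_rate * student_rate >= 0.75

print(meets_guideline(0.90, 0.95))  # True
print(meets_guideline(0.80, 0.90))  # False: 0.80 * 0.90 = 0.72
```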
Data Collection
Each participating country was responsible for carrying out all aspects
of the data collection, using standardized procedures developed for
the study. Training manuals were created for school coordinators and
test administrators that explained procedures for receipt and distribution
of materials as well as for the activities related to the testing
sessions. These manuals covered procedures for test security, standardized
scripts to regulate directions and timing, rules for answering students'
questions, and steps to ensure that identification on the test booklets
and questionnaires corresponded to the information on the forms used
to track students. As the data collection contractor for the U.S.
national TIMSS, Westat was fully acquainted with the TIMSS procedures,
and applied them in each of the Benchmarking jurisdictions in the
same way as in the national data collection.
Each country was responsible for conducting quality control procedures
and describing this effort in the NRC's report documenting procedures
used in the study. In addition, the International Study Center considered
it essential to monitor compliance with standardized procedures through
an international program of quality control site visits. NRCs were
asked to nominate one or more persons unconnected with their national
center, such as retired school teachers, to serve as quality control
monitors for their countries. The International Study Center developed
manuals for the monitors and briefed them in two-day training sessions
about TIMSS, the responsibilities of the national centers in conducting
the study, and their own roles and responsibilities. In all, 71 international
quality control monitors participated in this training.
The international quality control monitors interviewed
the NRCs about data collection plans and procedures. They also visited
a sample of 15 schools where they observed testing sessions and interviewed
school coordinators.(7) Quality control monitors
interviewed school coordinators in all 38 countries, and observed
a total of 550 testing sessions. The results of the interviews conducted
by the international quality control monitors indicated that, in general,
NRCs had prepared well for data collection and, despite the heavy
demands of the schedule and shortages of resources, were able to conduct
the data collection efficiently and professionally. Similarly, the
TIMSS tests appeared to have been administered in compliance with
international procedures, including the activities before the testing
session, those during testing, and the school-level activities related
to receiving, distributing, and returning material from the national
centers.
As a parallel quality control effort for the Benchmarking project,
the International Study Center recruited and trained a team of 18
quality control observers, and sent them to observe the data collection
activities of the Westat test administrators in a sample of about
10 percent of the schools in the study (98 schools in all).(8)
In line with the experience internationally, the observers reported
that the data collection was conducted successfully according to the
prescribed procedures, and that no serious problems were encountered.
Scoring the Free-Response Items
Because about one-third of the written test time was devoted to free-response
items, TIMSS needed to develop procedures for reliably evaluating
student responses within and across countries. Scoring used two-digit
codes with rubrics specific to each item. The first digit designates
the correctness level of the response. The second digit, combined
with the first, represents a diagnostic code identifying specific types
of approaches, strategies, or common errors and misconceptions. Although
not used in this report, analyses of responses based on the second
digit should provide insight into ways to help students better understand
mathematics concepts and problem-solving approaches.
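A sketch of how such a two-digit code decomposes. The specific digit-to-points mapping below is an assumption for illustration only, not the published rubric:

```python
def decompose(code):
    """Split a two-digit free-response code into its correctness
    level (first digit) and the full diagnostic code (both digits)."""
    return code // 10, code

def correctness_points(code, max_points=2):
    """Hypothetical mapping from the correctness digit to score
    points: a first digit of 1..max_points earns that many points,
    any other first digit earns zero."""
    first = code // 10
    return first if 1 <= first <= max_points else 0

print(decompose(21))           # (2, 21)
print(correctness_points(21))  # 2 points: fully correct
print(correctness_points(70))  # 0 points under this mapping
```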
To ensure reliable scoring procedures based on the TIMSS rubrics,
the International Study Center prepared detailed guides containing
the rubrics and explanations of how to implement them, together with
example student responses for the various rubric categories. These
guides, along with training packets containing extensive examples
of student responses for practice in applying the rubrics, were used
as a basis for intensive training in scoring the free-response items.
The training sessions were designed to help representatives of national
centers who would then be responsible for training personnel in their
countries to apply the two-digit codes reliably. In the United States,
the scoring was conducted by National Computer Systems (NCS) under
contract to Westat. To ensure that student responses from the Benchmarking
participants were scored in the same way as those from the U.S. national
sample, NCS had both sets of data scored at the same time and by the
same scoring staff.
To gather and document empirical information about the within-country
agreement among scorers, TIMSS arranged to have systematic subsamples
of at least 100 students' responses to each item coded independently
by two readers. Exhibit
A.7 shows the average and range of the within-country percent
of exact agreement between scorers on the free-response items in the
mathematics test for 37 of the 38 countries. A high percentage of
exact agreement was observed, with an overall average of 99 percent
across the 37 countries. The TIMSS data from the reliability studies
indicate that scoring procedures were robust for the mathematics items,
especially for the correctness score used for the analyses in this
report. In the United States, the average percent exact agreement
was 99 percent for the correctness score and 96 percent for the diagnostic
score. Since the Benchmarking data were combined with the U.S. national
TIMSS sample for scoring purposes, this high level of scoring reliability
applies to the Benchmarking data also.
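Percent exact agreement between two independent readers is straightforward to compute; a minimal sketch on synthetic codes:

```python
def percent_exact_agreement(reader_a, reader_b):
    """Share of double-scored responses where both readers assigned
    the identical code, expressed as a percentage."""
    matches = sum(a == b for a, b in zip(reader_a, reader_b))
    return 100.0 * matches / len(reader_a)

a = [20, 10, 70, 21, 10]  # codes from reader 1 (synthetic)
b = [20, 10, 70, 20, 10]  # codes from reader 2 (synthetic)
print(percent_exact_agreement(a, b))  # 80.0
```

Agreement on the correctness score alone (first digit only) would be computed the same way after dividing each code by ten.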
Test Reliability
Exhibit
A.8 displays the mathematics test reliability coefficient for
each country and Benchmarking participant. This coefficient is the
median KR-20 reliability across the eight test booklets. Among countries,
median reliabilities ranged from 0.76 in the Philippines to 0.94 in
Chinese Taipei. The international median, 0.89, is the median of the
reliability coefficients for all countries. Reliability coefficients
among Benchmarking participants were generally close to the international
median, ranging from 0.88 to 0.91 across states, and from 0.84 to
0.91 across districts and consortia.
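The KR-20 coefficient (equivalent to Cronbach's alpha for dichotomous items) can be sketched directly from its formula:

```python
def kr20(responses):
    """KR-20 reliability for right/wrong items. `responses` is a
    list of student rows, each a list of 0/1 item scores."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # item difficulty
        pq += p * (1 - p)                         # item variance
    return (k / (k - 1)) * (1 - pq / var_t)

# Perfectly consistent response pattern -> reliability of 1.0
print(kr20([[1, 1], [1, 1], [0, 0], [0, 0]]))  # 1.0
```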
Data Processing
To ensure the availability of comparable, high-quality
data for analysis, TIMSS took rigorous quality control steps to create
the international database.(9) TIMSS prepared manuals
and software for countries to use in entering their data, so that
the information would be in a standardized international format before
being forwarded to the IEA Data Processing Center in Hamburg for creation
of the international database. Upon arrival at the Data Processing
Center, the data underwent an exhaustive cleaning process. This involved
several iterative steps and procedures designed to identify, document,
and correct deviations from the international instruments, file structures,
and coding schemes. The process also emphasized consistency of information
within national data sets and appropriate linking among the many student,
teacher, and school data files. In the United States, the creation
of the data files for both the Benchmarking participants and the U.S.
national TIMSS effort was the responsibility of Westat, working closely
with NCS. After the data files were checked carefully by Westat, they
were sent to the IEA Data Processing Center, where they underwent
further validity checks before being forwarded to the International
Study Center.
IRT Scaling and Data Analysis
The general approach to reporting the TIMSS
achievement data was based primarily on item response theory (IRT)
scaling methods.(10) The mathematics results were
summarized using a family of 2-parameter and 3-parameter IRT models
for dichotomously-scored items (right or wrong), and generalized partial
credit models for items with 0, 1, or 2 available score points. The
IRT scaling method produces a score by averaging the responses of
each student to the items that he or she took in a way that takes
into account the difficulty and discriminating power of each item.
The methodology used in TIMSS includes refinements that enable reliable
scores to be produced even though individual students responded to
relatively small subsets of the total mathematics item pool. Achievement
scales were produced for each of the five mathematics content areas
(fractions and number sense; measurement; data representation, analysis,
and probability; geometry; and algebra), as well as for mathematics
overall.
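A minimal sketch of the 3-parameter logistic model used for dichotomously-scored items (the D = 1.7 scaling constant is a common convention; the operational TIMSS parameterization may differ in detail):

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability that a student of ability theta answers an item
    correctly, given discrimination a, difficulty b, and a lower
    asymptote c that allows for guessing on multiple-choice items."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# At theta == b the probability is halfway between c and 1.
print(round(p_correct_3pl(0.0, a=1.0, b=0.0, c=0.2), 6))  # 0.6
```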
The IRT methodology was preferred for developing comparable estimates
of performance for all students, since students answered different
test items depending upon which of the eight test booklets they received.
The IRT analysis provides a common scale on which performance can
be compared across countries. In addition to providing a basis for
estimating mean achievement, scale scores permit estimates of how
students within countries vary and provide information on percentiles
of performance. To provide a reliable measure of student achievement
in both 1999 and 1995, the overall mathematics scale was calibrated
using students from the countries that participated in both years.
When all countries participating in 1995 at the eighth grade are treated
equally, the TIMSS scale average over those countries is 500 and the
standard deviation is 100. Since the countries varied in size, each
country was weighted to contribute equally to the mean and standard
deviation of the scale. The average and standard deviation of the
scale scores are arbitrary and do not affect scale interpretation.
When the metric of the scale had been established, students from the
countries that tested in 1999 but not 1995 were assigned scores on
the basis of the new scale. IRT scales were also created for each
of the five mathematics content areas for the 1999 data. Students from
the Benchmarking samples were assigned scores on the overall mathematics
scale as well as in each of the five mathematics content areas using
the same item parameters and estimation procedures as for TIMSS internationally.
To allow more accurate estimation of summary statistics for student
subpopulations, the TIMSS scaling made use of plausible-value technology,
whereby five separate estimates of each student's score were generated
on each scale, based on the student's responses to the items
in the student's booklet and the student's background characteristics.
The five score estimates are known as "plausible values,"
and the variability between them encapsulates the uncertainty inherent
in the score estimation process.
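The way plausible values feed into an estimate and its uncertainty follows the usual multiple-imputation combination rule, sketched here (the exact TIMSS variance bookkeeping, which interacts with the jackknife, differs in detail):

```python
def combine_pvs(pv_estimates, pv_sampling_vars):
    """Average the M plausible-value estimates; total variance adds
    the average sampling variance to the between-PV variance,
    inflated by (1 + 1/M)."""
    m = len(pv_estimates)
    est = sum(pv_estimates) / m
    between = sum((x - est) ** 2 for x in pv_estimates) / (m - 1)
    within = sum(pv_sampling_vars) / m
    return est, within + (1 + 1 / m) * between

# Five plausible values for one (synthetic) group mean.
est, var = combine_pvs([500, 502, 498, 501, 499], [4.0] * 5)
print(est, round(var, 6))  # 500.0 7.0
```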
Estimating Sampling Error
Because the statistics presented in this report are estimates of
performance based on samples of students, rather than the values that
could be calculated if every student in every country or Benchmarking
jurisdiction had answered every question, it is important to have
measures of the degree of uncertainty of the estimates. The
jackknife procedure was used to estimate the standard error associated
with each statistic presented in this report.(11)
The jackknife standard errors also include an error component due
to variation between the five plausible values generated for each
student. The use of confidence intervals, based on the standard errors,
provides a way to make inferences about the population means and proportions
in a manner that reflects the uncertainty associated with the sample
estimates. An estimated sample statistic plus or minus two standard
errors represents a 95 percent confidence interval for the corresponding
population result.
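A delete-one jackknife for a simple mean illustrates the idea (TIMSS actually uses a paired jackknife over sampling zones and folds in the plausible-value component, but the mechanics are analogous):

```python
import math

def jackknife_se(values):
    """Delete-one jackknife standard error of the mean: recompute
    the mean with each observation left out, then measure the
    spread of those replicate estimates."""
    n = len(values)
    total = sum(values)
    full = total / n
    reps = [(total - v) / (n - 1) for v in values]
    return math.sqrt((n - 1) / n * sum((r - full) ** 2 for r in reps))

se = jackknife_se([1, 2, 3, 4, 5])
print(round(se, 4))                 # 0.7071
print(3.0 - 2 * se, 3.0 + 2 * se)   # a 95 percent confidence interval
```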
Making Multiple Comparisons
This report makes extensive use of statistical hypothesis-testing
to provide a basis for evaluating the significance of differences
in percentages and in average achievement scores. Each separate test
follows the usual convention of holding to 0.05 the probability that
reported differences could be due to sampling variability alone. However,
in exhibits where statistical significance tests are reported, the
results of many tests are reported simultaneously, usually at least
one for each country and Benchmarking participant in the exhibit.
The significance tests in these exhibits are based on a Bonferroni
procedure for multiple comparisons that holds to 0.05 the probability
of erroneously declaring a statistic (mean or percentage) for one
entity to be different from that for another entity. In the multiple
comparison charts (Exhibit 1.2 and those in Appendix B), the Bonferroni
procedure adjusts for the number of entities in the chart, minus one.
In exhibits where a country or Benchmarking participant
statistic is compared to the international average, the adjustment
is for the number of entities.(12)
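The adjustment amounts to dividing the overall 0.05 level by the number of comparisons; a sketch:

```python
def per_comparison_alpha(n_entities, overall_alpha=0.05):
    """Bonferroni-adjusted significance level when one entity is
    compared against each of the other n_entities - 1."""
    return overall_alpha / (n_entities - 1)

# With 38 countries in a chart, each pairwise test is held to:
print(round(per_comparison_alpha(38), 6))  # 0.001351
```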
Setting International Benchmarks of Student Achievement
International benchmarks of student achievement were computed at
each grade level for both mathematics and science. The benchmarks
are points in the weighted international distribution of achievement
scores that separate the top 10 percent of students in the
distribution, the top 25 percent of students, the top 50 percent,
and the bottom 25 percent. The percentage of students in each country
and Benchmarking jurisdiction meeting or exceeding the international
benchmarks is reported. The benchmarks correspond to the 90th, 75th,
50th, and 25th percentiles of the international distribution of achievement.
When computing these percentiles, each country contributed as many
students to the distribution as there were students in the target
population in the country. That is, each country's contribution
to setting the international benchmarks was proportional to the estimated
population enrolled at the eighth grade.
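A minimal weighted-percentile sketch of how such benchmark points could be located (the scores and weights are synthetic; the operational computation works on the full weighted distribution of plausible values):

```python
def weighted_percentile(scores, weights, pct):
    """Smallest score at which the cumulative weight reaches pct
    percent of the total weight -- e.g. pct=90 locates the point
    separating the top 10 percent of the weighted distribution."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    cum = 0.0
    for score, w in pairs:
        cum += w
        if cum >= total * pct / 100.0:
            return score
    return pairs[-1][0]

scores = [400, 450, 500, 550, 600]
weights = [1.0] * 5   # equal weights for illustration
print(weighted_percentile(scores, weights, 50))  # 500
```

In the real computation each country's weights sum to its estimated eighth-grade enrollment, so larger countries pull the benchmark points proportionally more.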
In order to interpret the TIMSS scale scores and analyze achievement
at the international benchmarks, TIMSS conducted a scale anchoring
analysis to describe achievement of students at those four points
on the scale. Scale anchoring is a way of describing students'
performance at different points on a scale in terms of what they know
and can do. It involves a statistical component,
in which items that discriminate between successive points on the
scale are identified, and a judgmental component in which subject-matter
experts examine the items and generalize to students' knowledge
and understandings.(13)
Mathematics Curriculum Questionnaire
In an effort to collect information about the content of the intended
curriculum in mathematics, TIMSS asked National Research Coordinators
and Coordinators from the Benchmarking jurisdictions to complete a
questionnaire about the structure, organization, and content coverage
of their curricula. Coordinators reviewed 56 mathematics topics and
reported the percentage of their eighth-grade students for which each
topic was intended in their curriculum. Although most topic descriptions
were used without modification, there were occasions when Coordinators
found it necessary to expand on or qualify the topic description to
describe their situation accurately. The country-specific adaptations
to the mathematics curriculum questionnaire are presented in Exhibit
A.9. No adaptations to the list of topics were necessary for the
U.S. national version, nor were any adaptations made by any Benchmarking
participants.
1. The TIMSS 1999 results for mathematics and science, respectively, are reported in Mullis, I.V.S., Martin, M.O., Gonzalez, E.J., Gregory, K.D., Garden, R.A., O'Connor, K.M., Chrostowski, S.J., and Smith, T.A. (2000), TIMSS 1999 International Mathematics Report: Findings from IEA's Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College, and in Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Gregory, K.D., Smith, T.A., Chrostowski, S.J., Garden, R.A., and O'Connor, K.M. (2000), TIMSS 1999 International Science Report: Findings from IEA's Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College.
2. The complete TIMSS curriculum frameworks can be found in Robitaille, D.F., et al. (1993), TIMSS Monograph No. 1: Curriculum Frameworks for Mathematics and Science, Vancouver, BC: Pacific Educational Press.
3. For a full discussion of the TIMSS 1999 test development effort, please see Garden, R.A. and Smith, T.A. (2000), "TIMSS Test Development" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
4. The 1999 TIMSS test design is identical to the design for 1995, which is fully documented in Adams, R. and Gonzalez, E. (1996), "TIMSS Test Design" in M.O. Martin and D.L. Kelly (eds.), Third International Mathematics and Science Study Technical Report, Volume I, Chestnut Hill, MA: Boston College.
5. More details about the translation verification procedures can be found in O'Connor, K., and Malak, B. (2000), "Translation and Cultural Adaptation of the TIMSS Instruments" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
6. The sample design for TIMSS is described in detail in Foy, P., and Joncas, M. (2000), "TIMSS Sample Design" in M.O. Martin, K.D. Gregory, and S.E. Stemler (eds.), TIMSS 1999 Technical Report, Chestnut Hill, MA: Boston College. Sampling for the Benchmarking project is described in Fowler, J., Rizzo, L., and Rust, K. (2001), "TIMSS Benchmarking Sampling Design and Implementation" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
7. Steps taken to ensure high-quality data collection in TIMSS internationally are described in detail in O'Connor, K., and Stemler, S. (2000), "Quality Control in the TIMSS Data Collection" in M.O. Martin, K.D. Gregory, and S.E. Stemler (eds.), TIMSS 1999 Technical Report, Chestnut Hill, MA: Boston College.
8. Quality control measures for the Benchmarking project are described in O'Connor, K., and Stemler, S. (2001), "Quality Control in the TIMSS Benchmarking Data Collection" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
9. These steps are detailed in Hastedt, D., and Gonzalez, E. (2000), "Data Management and Database Construction" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
10. For a detailed description of the TIMSS scaling, see Yamamoto, K., and Kulick, E. (2000), "Scaling Methods and Procedures for the TIMSS Mathematics and Science Scales" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
11. Procedures for computing jackknifed standard errors are presented in Gonzalez, E., and Foy, P. (2000), "Estimation of Sampling Variance" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
12. The application of the Bonferroni procedures is described in Gonzalez, E., and Gregory, K. (2000), "Reporting Student Achievement in Mathematics and Science" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
13. The scale anchoring procedure is described fully in Gregory, K., and Mullis, I. (2000), "Describing International Benchmarks of Student Achievement" in M.O. Martin, K.D. Gregory, K.M. O'Connor, and S.E. Stemler (eds.), TIMSS 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College. An application of the procedure to the 1995 TIMSS data may be found in Kelly, D.L., Mullis, I.V.S., and Martin, M.O. (2000), Profiles of Student Achievement in Mathematics at the TIMSS International Benchmarks: U.S. Performance and Standards in an International Context, Chestnut Hill, MA: Boston College.