ASSESS: SPSS USERS' GROUP 16th ANNUAL UK MEETING

FRIDAY 8th NOVEMBER 2002, ST WILLIAM'S COLLEGE, YORK, ENGLAND


ABSTRACTS

 

Incorporating CRISP-DM in the Data Matching Process
Ursula Becker, NFO Europe, Munich, Germany
Author's email address: Ursula.Becker@nfoeurope.com
FULL TEXT (PDF format)

In 1997 NFO Infratest was the first market research company to realise the importance of making market research results more actionable, not only at a strategic level but also at the operational end of our clients' needs. To fill this void, the Global NFO EX-A-MINE Centre was founded. Its purpose is to combine information gathered in market research surveys with clients' internal databases in order to provide a truly holistic understanding of consumer attitudes and behaviour. We call this process Data Matching. In essence, the Data Matching process enables our clients to improve their identification of key target audiences, thereby assisting in refining CRM or Direct Marketing programs. Conducting an EX-A-MINE project is not a push-button process. It requires several data analysts working together to merge and analyse data gathered from various sources. Complex algorithms are used to model the data and to produce output that our clients can interpret and act on. Because of the complexity of these projects, a highly structured approach has to be taken to their execution and documentation. To meet this need, every project follows the Cross Industry Standard Process for Data Mining (CRISP-DM). This entails recording each stage of the process, from initial 'business understanding' through to 'final deployment' of the matching model, in a dedicated project database. Such a protocol minimises the opportunity for error and maximises the ability to recreate the analyses at a later date. SPSS serves as the backbone of our analytical toolbox. In my talk I will explain in more detail the advantages of our analysis and documentation approach.


Adjusted Survival Graphs in SPSS
(Gilbert MacKenzie, Centre for Medical Statistics, University of Keele)

As interest in league tables in the National Health Service and Education mounts, the notion of standardizing findings for the effects of 'case mix' becomes ever more relevant. Increasingly, many comparisons involve 'Time to Event' data, for example those involving 'time to discharge', 'survival time' per se, or 'time to completion'. A natural vehicle for such analyses is Cox's PH model, and the SPSS implementation provides graphical output which purports to convey information about the effects of adjustment for relevant covariates. In the course of the talk I shall demonstrate that this graphical output can be misleading in several commonly occurring circumstances. In particular, we shall consider two practical examples from Health Service Research where the graphical information presented by the package is misleading: in the first case the misinformation is minor, but in the second case it is seriously misleading in context. As far as I am aware, this is the first time that the problem, which is inherent in the SPSS algorithm, has been identified. We provide alternative SPSS code which authentically represents the data, reproducing the jumps in the process over the track, and the appropriate covariate adjustments in the contexts in which the problems arise. Meanwhile, we do not recommend the use of the adjusted graphical output generated by the Cox regression module at this time.
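
The "jumps in the process" mentioned above can be illustrated with a minimal sketch. It is not the author's SPSS code and it performs no covariate adjustment; it only computes an unadjusted Kaplan-Meier step function, whose value changes solely at observed event times. The function name `kaplan_meier` and the toy data are assumptions made for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  observed durations (event or censoring time)
    events: 1 if the event was observed, 0 if the case was censored
    """
    # Number of observed events at each distinct time.
    deaths = Counter(t for t, e in zip(times, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Cases still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # The estimate drops only here, at an event time.
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical 'time to discharge' data; 0 marks a censored case.
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Between event times the estimate stays flat, so a plot of this curve is a staircase rather than a smooth line.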


Intelligent OLAP: Getting Away from the Fixed Cube: New Developments in Data Mining and their Impact for SPSS Users
Simon Dunkley, KXEN

Prevalent in data mining is the use of OLAP (On-Line Analytical Processing) 'cubes'. These typically take the form of pre-defined tables of data, which end users then work with in much the same manner as SPSS pivot tables. This should enable them to gain insight into their business drivers rapidly, but there are two problems with this approach. Firstly, the behind-the-scenes files used by OLAP are enormous - typically two to three times larger than the input data file - and secondly, they are by their nature predefined, so data bandings and fields cannot be controlled by the user. Truly what you see is what you get - and nothing more. Recent developments in theoretical statistics and computational science have changed that. By employing a non-parametric approach to data analysis, many independent software vendors are developing flexible, powerful, robust and rapid data mining tools.

KXEN have taken their leading-edge algorithms and applied them to produce IOLAP (Intelligent OLAP), which circumvents the problems inherent in OLAP described above. Based on Vladimir Vapnik's Structural Risk Minimisation (SRM) approach to data analysis, KXEN have developed core algorithms which prepare data for robust analysis and then perform very rapid analysis. These have been used in the development of IOLAP, which enables analytical tables to be produced based on business questions. The output provides two key pieces of information: firstly, it identifies the important drivers (variables) in the data that answer the question; secondly, it indicates how these variables should be treated, i.e. the optimal coding of these variables for this business question. The encoding is produced in KXEN IOLAP as a pivot table, which can then be transposed, edited, and pivoted to gain rapid insight into how, for example, a direct marketing campaign could be improved.

Although the core KXEN software components are designed for integration with other products, a simple Java wizard will be demonstrated to show the capabilities of the SRM approach to data analysis, as well as a little of the background behind this approach to robust modelling.


What is a Random Factor?
(Jeremy Miles, Department of Health Sciences, University of York)
FULL TEXT (Powerpoint format)

In SPSS 7.5 the ANOVA/MANOVA procedures were replaced (at least in the menus) by General Linear Model (GLM) procedures. The GLM approach is more flexible than the ANOVA approach, and one of the key differences is that GLM allows the inclusion of random factors in the model. A random factor is a factor in which the levels are (assumed to be) randomly selected from a population of possible levels. Defining a random factor as a fixed factor has implications for the data analysis; however, the focus of this workshop will be on how to exploit the additional flexibility that random factors allow us.
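
As a minimal sketch of what treating a factor as random can mean in practice, the example below estimates variance components for a balanced one-way design with the textbook method-of-moments (ANOVA) estimator. It is not the SPSS GLM procedure and is not taken from the workshop; the function name `variance_components` and the data are assumptions made for illustration.

```python
from statistics import mean


def variance_components(groups):
    """Estimate (between-level variance, within-level variance).

    groups: list of equal-sized lists, one per sampled level of the
            random factor (balanced design assumed).
    """
    k = len(groups)      # number of sampled levels
    n = len(groups[0])   # observations per level
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes a balanced design")

    level_means = [mean(g) for g in groups]
    grand_mean = mean(level_means)

    # Mean squares from the usual one-way ANOVA table.
    ms_between = n * sum((m - grand_mean) ** 2 for m in level_means) / (k - 1)
    ms_within = sum(
        (y - m) ** 2 for g, m in zip(groups, level_means) for y in g
    ) / (k * (n - 1))

    # For a random factor, E[MS_between] = var_within + n * var_between,
    # so the between-level variance is recovered by subtraction.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_between, ms_within


# Hypothetical scores from three randomly sampled levels (e.g. clinics).
scores = [[10, 12, 11], [14, 15, 16], [9, 8, 10]]
var_between, var_within = variance_components(scores)
icc = var_between / (var_between + var_within)
print(var_between, var_within, icc)
```

With a fixed factor the interest would lie in the individual level means; with a random factor the quantity of interest is the variance of the population of levels, which is what `var_between` estimates here.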


Exploratory Data Analysis (EDA) Using SPSS
(Simon Kometa, Computing Service, Statistics Support, University of Newcastle)

It is always advisable to perform an exploratory data analysis before embarking on formal statistical analyses. Data analysis using SPSS, or indeed any other statistical package, should be thought of as a two-step process: first, the exploration and description of the data to examine its main characteristics, and second, the formal statistical analysis to confirm some of those characteristics. EDA can also help to decide whether parametric or non-parametric analytical techniques are suitable for your data. Many EDA techniques are available in SPSS, including Frequencies, Descriptives, Explore, Crosstabs and Case Summaries. This paper places particular emphasis on the Explore and Descriptives techniques.
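
The exploratory step described above can be sketched outside SPSS as follows. This is an illustrative sketch only, not the Explore or Descriptives procedures themselves: the function name `explore` and the data are assumptions, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ from the definitions SPSS uses.

```python
from statistics import mean, median, quantiles, stdev


def explore(values):
    """Return basic descriptive statistics and boxplot-style outliers."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    # Tukey fences: points beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    m = mean(data)
    # Moment-based skewness; positive values indicate a long right tail.
    m2 = sum((x - m) ** 2 for x in data) / len(data)
    m3 = sum((x - m) ** 3 for x in data) / len(data)

    return {
        "mean": m,
        "median": median(data),
        "stdev": stdev(data),
        "q1": q1,
        "q3": q3,
        "skewness": m3 / m2 ** 1.5,
        "outliers": [x for x in data if x < low or x > high],
    }


# Hypothetical measurements containing one extreme value.
summary = explore([2, 3, 3, 4, 4, 5, 5, 6, 7, 30])
for name, value in summary.items():
    print(name, value)
```

A mean well above the median, positive skewness and flagged outliers are the kind of evidence that might steer the analyst towards a non-parametric technique or a transformation before the formal analysis.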
