
Automated Detection of Puffing and Smoking with Wrist Accelerometers

This repository provides a detailed description of the RealSmoking dataset, along with step-by-step instructions for replicating the results presented in Tang’s paper, published in the proceedings of PervasiveHealth '14.


Real-time, automatic detection of smoking behavior could lead to novel measurement tools for smoking research and “just-in-time” interventions that may help people quit, reducing preventable deaths. This paper discusses the use of machine learning with wrist accelerometer data for automatic puffing and smoking detection. A two-layer smoking detection model is proposed that incorporates both low-level time domain features and high-level smoking topography such as inter-puff intervals and puff frequency to detect puffing then smoking. On a pilot dataset of 6 individuals observed for 11.8 total hours in real-life settings performing complex tasks while smoking, the model obtains a cross validation F1-score of 0.70 for puffing detection and 0.79 for smoking detection over all participants, and a mean F1-score of 0.75 for puffing detection with user-specific training data. Unresolved challenges that must still be addressed in this activity detection domain are discussed.


Please cite the paper in any presentation or publication that uses this dataset or source code.

Tang, Q., Vidrine, D., Crowder, E., and Intille, S. 2014. Automated Detection of Puffing and Smoking with Wrist Accelerometers. 8th International Conference on Pervasive Computing Technologies for Healthcare, ICST (2014).

Repo structure

* Root Path
* data
    * original
    * featureset (very large size ~ 3.0G)
    * pkl
* src
* results
* supplements
    * publication_plots
    * consolidate_plots
    * posture_example_plots
    * puffing_example_plots
  • data/original contains raw sensor data and annotation files. These data may be downloaded here.

  • data/featureset contains all generated feature sets used by the detection model. These data may be downloaded here.

    Note that this folder is not normally included in the version control repository, because its files are intermediate results and can be very large.

  • data/pkl contains all binary intermediate files that can be loaded with the python serialization package pickle. These data may be downloaded here.

    Note that these intermediate files are the feature set files in binary format (about 50% smaller). They are not committed to the version control system.

  • src contains the python source code used in the paper.

  • results stores all results generated by the source code in .csv format.

  • supplements stores all plots.

    • publication_plots stores all plots used in the final publication.
    • consolidate_plots stores supplementary plots that compare model predictions with visual inspection, along with annotations.
    • posture_example_plots stores plots of raw sensor data, along with annotations, for different posture examples.
    • puffing_example_plots stores plots of raw sensor data, along with annotations, for different puffing examples.

Original dataset structure

Each session represents a single subject. There are seven sessions in total. Please refer to Table 2 in Tang’s paper for the statistical details of each session, and to data/statistics_dataset for the original results.

Inside the data/original folder you will find a list of .csv files. All files follow the same filename pattern, which distinguishes the session, sensor location and dataset type:

[session name]_[sensor location].[dataset_type].csv

For example, the file session1_DAK.raw.csv is the raw 3-axis accelerometer data file for session 1 on the dominant ankle, and the file session2.annotation.csv is the annotation file for session 2.

Note that there is only one annotation file for each session so sensor_location is omitted.
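
The naming pattern above can be parsed mechanically. The following is a minimal sketch, not part of the repository's source code: the names `OriginalFile` and `parse_original_filename` are hypothetical, and the regular expression assumes that session names and sensor codes contain no underscores or dots.

```python
import re
from typing import NamedTuple, Optional

# Sensor location codes listed in this README.
SENSOR_LOCATIONS = {
    "DW": "dominant wrist",
    "DAK": "dominant ankle",
    "NDW": "nondominant wrist",
    "DAR": "dominant arm",
}


class OriginalFile(NamedTuple):
    """Parts of a filename in data/original (hypothetical helper type)."""
    session: str
    sensor_location: Optional[str]  # None for annotation files
    dataset_type: str


# [session name]_[sensor location].[dataset_type].csv
# The sensor location part is optional because annotation files omit it.
_PATTERN = re.compile(
    r"^(?P<session>[^_.]+)(?:_(?P<sensor>[^_.]+))?\.(?P<dtype>[^.]+)\.csv$"
)


def parse_original_filename(filename: str) -> OriginalFile:
    """Split a data/original filename into session, sensor and type."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return OriginalFile(match["session"], match["sensor"], match["dtype"])
```

For instance, `parse_original_filename("session1_DAK.raw.csv")` yields the session `session1`, the sensor code `DAK` and the dataset type `raw`, while an annotation filename yields `None` for the sensor location.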

Each session has four raw accelerometer data files, one for each of the four sensor locations:

  • dominant wrist (DW)
  • dominant ankle (DAK)
  • nondominant wrist (NDW)
  • dominant arm (DAR)

Note that only the wrist accelerometers are used in the experiments run in the paper. Please refer to the paper for a detailed explanation.

Raw sensor data format

The accelerometer we used is the Actigraph GT3X, a 3-axis linear accelerometer with a dynamic range of ±4g that stores values using 10 bits. The data are stored in csv format: the first column is the Unix timestamp, and columns 2 to 4 are the x, y and z axes.


There is no header row. The data have already been converted from voltage to acceleration in the range -4g to +4g.
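
As an illustration of this layout, here is a minimal sketch of a reader that uses only the Python standard library. It is not the loader used in src: the names `Sample`, `read_raw_samples` and `out_of_range` are hypothetical, and the sketch assumes the timestamp column is stored as a plain number.

```python
import csv
from typing import Iterable, List, NamedTuple


class Sample(NamedTuple):
    timestamp: float  # Unix timestamp, assumed to be numeric in the file
    x: float          # acceleration in g
    y: float
    z: float


def read_raw_samples(lines: Iterable[str]) -> List[Sample]:
    """Parse header-less rows of: timestamp, x, y, z."""
    samples = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        t, x, y, z = (float(value) for value in row[:4])
        samples.append(Sample(t, x, y, z))
    return samples


def out_of_range(samples: List[Sample], limit_g: float = 4.0) -> List[Sample]:
    """Return samples outside the documented -4g to +4g range."""
    return [s for s in samples if max(abs(s.x), abs(s.y), abs(s.z)) > limit_g]
```

An open file can be passed directly, for example `read_raw_samples(open("data/original/session1_DW.raw.csv", newline=""))`. The range check is only a sanity test against the ±4g range stated above.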

Annotation data format

Each annotation file is in standard csv format, with a header row that names each column.

| Column name | Meaning |
| --- | --- |
| STARTTIME | start time of the current annotation |
| ENDTIME | end time of the current annotation |
| COLOR | unused |
| posture | annotation for body posture: walking, sitting, standing or lying |
| activity | annotation for subject activity |
| smoking | annotation for smoking status: either smoking or not-smoking |
| puffing | annotation for puffing status: no-puff, left-puff or right-puff |
| puff index | index (starting from 0) of the current puff; if two annotations have the same puff index, they belong to the same puff. For a no-puff annotation, the index is -1 |
| prototypical? | binary value (0 or 1) indicating whether this puff is a prototypical puff. For no-puff, the value is -1 |
| potential error? | binary value indicating whether this annotation might be wrong due to human error |
| note | additional comments on this annotation |
| link | filename of the corresponding puff example plot in the puffing_example_plots folder |
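
Because several annotation rows can share one puff index, a reader will usually want to group them before counting puffs. The sketch below shows one way to do this with the standard library. It is an assumption-laden illustration rather than the repository's code: `group_rows_by_puff` is a hypothetical name, and the header is assumed to spell the columns exactly as in the table above.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_rows_by_puff(lines: Iterable[str]) -> Dict[int, List[dict]]:
    """Group annotation rows by their 'puff index' column.

    Rows with puff index -1 (no-puff) are skipped. Timestamps are left
    as strings because their exact format is not described here.
    """
    puffs = defaultdict(list)
    for row in csv.DictReader(lines):
        index = int(row["puff index"])
        if index == -1:  # no-puff annotation
            continue
        puffs[index].append(row)
    return dict(puffs)
```

With an open annotation file `f`, `len(group_rows_by_puff(f))` gives the number of distinct annotated puffs in that session.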

Feature set data format

Inside the data/featureset and data/pkl folders you will find the feature set data files, which may also be reproduced by running the source code. Files in the data/featureset folder are in .csv format and files in the data/pkl folder are in binary .pkl format. Corresponding files share the same filename and have the same content once loaded in the program.

[session name]_[feature set type].[window size].[data type].csv
[session name]_[feature set type].[window size].[data type].pkl

There are two data types: one for feature vectors (.data.csv or .data.pkl) and one for class labels (.class.csv or .class.pkl). Files with the same session, feature set type and window size belong together.

For example, session1_BFW.40.data.csv contains the feature vectors for session 1, computed with data from both wrists and a window size of 40 samples (1 s at a 40 Hz sampling rate).

As another example, session_all_DW.600.class.csv contains the class labels for all sessions; the corresponding .data.csv is computed with dominant wrist accelerometer data and a window size of 600 samples (15 s at a 40 Hz sampling rate).
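
The relation between window size in samples and window length in seconds follows from the 40 Hz sampling rate used in the examples above. The following sketch parses a feature set filename and performs that conversion. The function name `parse_featureset_filename` and the returned dictionary keys are hypothetical, and the sketch assumes the feature set type never contains an underscore.

```python
SAMPLING_RATE_HZ = 40  # sampling rate stated in the examples above


def parse_featureset_filename(filename: str) -> dict:
    """Split [session]_[feature set type].[window].[data type].[ext]."""
    # Split from the right so dots cannot be misread as part of the stem.
    stem, window, data_type, extension = filename.rsplit(".", 3)
    # The session name may itself contain underscores (e.g. session_all),
    # so only the last underscore separates it from the feature set type.
    session, _, feature_set_type = stem.rpartition("_")
    if data_type not in ("data", "class") or extension not in ("csv", "pkl"):
        raise ValueError(f"unexpected filename: {filename!r}")
    window_samples = int(window)
    return {
        "session": session,
        "feature_set_type": feature_set_type,
        "window_samples": window_samples,
        "window_seconds": window_samples / SAMPLING_RATE_HZ,
        "data_type": data_type,
        "extension": extension,
    }
```

For `session_all_DW.600.class.csv` this gives a window of 600 samples, or 15.0 seconds, matching the second example above.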

Feature data format

There is a header row. The first column is the index of the current segment (starting from 0), and the remaining columns are the individual features. Most feature names are self-explanatory; otherwise, refer to the documentation in the source code.

Class label format

There is a header row. Not every column is needed for every experiment, so you can use whichever columns you need.

| Column name | Meaning |
| --- | --- |
| segment | index of the current segment (starting from 0) |
| STARTTIME | start time of the current segment |
| ENDTIME | end time of the current segment |
| seg duration | duration of the current segment in seconds |
| session | number of the session that the current segment belongs to |
| sensor | sensor that the current segment belongs to (DW, NDW or BFW); refer to footnote 5 for an explanation |
| inside puff/puff duration | duration of puffing within the current segment / duration of the current puff |
| inside puff/segment duration | duration of puffing within the current segment / duration of the current segment |
| prototypical? | whether the puff is prototypical (0 or 1); empty string for no-puff |
| puff index | index of the puff for the current segment from the current sensor |
| puff side | left or right |
| car percentage | percentage of the current segment occupied by in-car activity |
| talk percentage | percentage of the current segment occupied by talking activity |
| unknown percentage | percentage of the current segment occupied by unknown-activity |
| drinking percentage | percentage of the current segment occupied by drinking-beverage activity |
| reading percentage | percentage of the current segment occupied by reading-paper activity |
| eating percentage | percentage of the current segment occupied by eating-a-meal activity |
| smoking percentage | percentage of the current segment occupied by smoking activity |
| computer percentage | percentage of the current segment occupied by using-computer activity |
| phone percentage | percentage of the current segment occupied by using-phone activity |
| walking percentage | percentage of the current segment occupied by walking posture |
| superposition percentage | percentage of the current segment during which activity superposition happens |
| name | name of the class assigned to this segment |
| target | index of the class assigned to this segment |
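
The overlap columns make it possible to choose your own labelling rule instead of relying on the name and target columns. The sketch below selects segments by the fraction of the segment covered by puffing. It is purely illustrative: `puff_segments` and the 0.5 threshold are hypothetical and not taken from the paper, and the "inside puff/segment duration" column is assumed to hold a ratio between 0 and 1, with empty values treated as 0.

```python
import csv
from typing import Iterable, List

OVERLAP_COLUMN = "inside puff/segment duration"


def puff_segments(lines: Iterable[str], min_overlap: float = 0.5) -> List[int]:
    """Return segment indices whose puffing overlap is >= min_overlap."""
    selected = []
    for row in csv.DictReader(lines):
        raw = (row.get(OVERLAP_COLUMN) or "").strip()
        overlap = float(raw) if raw else 0.0  # empty value: no puffing
        if overlap >= min_overlap:
            selected.append(int(row["segment"]))
    return selected
```

Raising `min_overlap` keeps only segments dominated by puffing, while lowering it also keeps segments that merely touch a puff.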

Reproduce the results in the publication

Coming soon…

Demo: the complexity of smoking behavior in real world

These videos show real-world cases of smoking behavior. Each video is placed side by side with the corresponding accelerometer signal to give a direct impression of how body movement affects and interferes with the underlying signal.

Separable concurrent activities

In this video, the tester was asked to smoke while walking. As the video shows, the signal contains additive components from both hand movement (puffing) and body movement (walking). These two components are independent and additive because they are produced by different body parts. This offers some inspiration for dealing with concurrent activities.

Click image to view the video

Ambiguous hand gestures

In this video, the tester was asked to puff, eat and drink in a natural way during a smoking session. The signal contains several ups and downs, but none of them shows characteristics that are distinctive of puffing alone. In fact, all of these activities are hand-to-mouth gestures, and the differences between them are quite minor in the signal. This exposes one of the challenges in activity recognition: classifying similar movements.

Click image to view the video

A comprehensive episode

This video shows a comprehensive episode of natural smoking behavior, including a series of complex activities going on at the same time. The signals, shown on the right side, appear intertwined and complex, and lack visibly distinctive characteristics for each type of activity. The activities also change relatively quickly, which makes them even more difficult to capture and recognize in real time.

Click image to view the video

Contact us

If you have any questions about the dataset or source code, please create a GitHub issue.