Setting up Open Data Kit (ODK) for use in a survey of households in urban India
by Emily Kumpel
We are a team of University of California, Berkeley grad students conducting an impact evaluation of the health, economics, and water quality effects on households of a shift from intermittent to continuous water supply in a small town in south India. I wanted to write a narrative of sorts about our experience using Open Data Kit (ODK) for our study. It is a great tool, and a lot of how we went about the process of incorporating ODK into what is normally a paper-based survey system was discovered through trial and error. I thought I’d outline the steps we went through and some comments on what I feel we would’ve done differently if we were to start over, if it helps others get through the process of using ODK for a study.
We had heard about the idea of using mobile phones/PDAs for collecting survey data from friends, who had experience with mostly commercial software for similar household surveys. The most frequently cited advantage was the instantaneous availability of data, which allows research teams to continuously monitor what is happening, fix issues with data collection while it is still occurring, and present data very soon after completion of the study. After our experience running significantly (by an order of magnitude) smaller-scale studies, also in India, and having to deal with pile-ups of paper and data entry woes, we decided to at least consider an electronic method.
- Design your study. I’m presuming this is all being done on a separate track from ODK. The relevant things you need to know about your study before deciding to use ODK: location/context (Is there an easily accessible source of power for charging the phones frequently enough? What are conditions for enumerators going to be like? Is there technical knowledge/ability among the field supervisor staff?), study size and length of time (relevant for hardware selection and budgeting).
- Decide whether ODK is for you. We looked into several other software/hardware options, including Windows Mobile, Nokia, and iPhone – MobileActive’s guide Comparing Mobile Data Collection Tools has a great listing of the options available (this didn’t exist when we were making our decision). We downloaded samples, including ODK, to try them out, and looked at the following factors that were most relevant for us:
- Budget – this was our biggest consideration, since we are operating on a very tight budget. The quick estimate we got for commercial software (Windows Mobile, iPhone) was at least US$7,000. After pricing out the costs of phones and software, ODK was a clear winner. After our sample size doubled from 2000 to 4000 households a few months into setup (after we had already decided to use ODK), we looked at our budget and discovered that the cost of the hardware (Android phones) was less than what the photocopying/data entry costs associated with a paper-based system would have been.
- Language support – Our study is in the state of Karnataka, and our survey is administered in the language Kannada. Currently, Android does not support many complex language fonts, including various Indian languages (see thread in ODK Implementers list). However, we’re in an urban setting and found that most of our enumerator staff would have at least basic knowledge of English letters. Also, many people in India are familiar with transliteration – turning Kannada sounds into English letters – as it’s common in text messages and store signs. We talked with many Kannadigas and decided transliteration of Kannada would work for our enumerators (part of our interview process for hiring involved reading a bit of transliterated Kannada).
- Usability for research team – I was the ‘programmer’ (by programmer, I mean I used to change up the html on geocities pages back in ’97, have taken classes that involved FORTRAN and MATLAB, and am reasonably adept at looking at existing code and using cut-copy-paste). While we were still deciding on whether to use ODK, I tried out the demo forms, made adjustments and used the basic features, loaded it onto a borrowed Android phone, and set up Aggregate. I figured if I could get through this in a day, combined with the helpful team on the ODK listservs and some friends in Berkeley’s CS department, I could manage setting up ODK.
- Features – Our most important requirement was skip logic – in fact, that was one of the main reasons we wanted electronic data collection. Our survey is extremely complicated, as we’re trying to document things like the various sources of water that people use and many other features of such a supply. This translates into >8 different options for a source of water, with 4-20 questions after they say ‘yes’ to any one of them. This level of complexity would have been impossible in paper form, so the main feature requirements for us were skip logic and the ability to input numbers/decimals/text; everything else (constraints, GPS, etc) was icing.
- Make the basics work. Once we made the decision, I needed to teach myself how to use ODK, and also be able to let my other team members know what was and was not possible. To get set up with the basics, I set up a 10 question survey that used all of the types of questions we intended to use. The steps I went through:
- Find an Android phone to borrow or use the emulator (I followed these instructions to the letter and it worked).
- Download ODK Collect on the phone/emulated phone through the marketplace or the website directly (often the marketplace didn’t work on my emulator, so I just navigated to the website on the phone’s browser and downloaded) using these instructions.
- Try out the sample forms on the ODK appspot server. Download them to your device, fill them out, send them back to the server, view the data on the server. Get a feel for how they work.
- Set up ODK Aggregate (optional). This wasn’t necessary, but I was in the US when I got started on this, so I had plentiful internet and thought I’d be using Aggregate. I set it up and tested it using Google appspot (we never even ended up using anything that involves the internet for our actual study).
- Download one of the sample forms. Upload it to your Appspot site by using the ‘Upload form’ option. Change the server on your ODK Collect (phone or emulator) to your own appspot site address (press ‘menu’ when you’re on the main screen), find your form, and bring it to your device. Try out the form, send data back, view it on your appspot. Congrats! The groundwork for everything you’ll do is set.
- Read up on the Xform tutorial and follow the examples. Try your own forms through the whole system. There’s now ODK Build, but I knew we’d be using very complicated logic which is not supported in Build at this time, so I decided to code from scratch.
- Edit your sample forms, incrementally. Since I didn’t really know what I was doing, I found it was best to change just one question at a time, constraint, question type, etc, so when something went wrong (which it, of course, did nearly every step of the way as I made mistake after mistake), I knew exactly where to find the problem. Don’t worry, there will be lots of crashes and problems, but that’s learning! Just take it slowly and be patient. And use the Validator.
- Make a 10-question form that includes all of your survey question types. I went through the pilot questionnaire (at this point, the final one was not yet finished by my colleagues) and chose 10 questions that incorporated all the types of questions and constraints I was likely to have (multiple choice, text/integer/decimal entry, GPS, skip logics with 4 relevancies, various constraints etc.). I worked on getting a form to work that incorporated all of these. From then on, I just copied and pasted from this form to make the full survey.
- Decide on your hardware. I looked through the relevant question in the FAQs and at pricing and, based on others’ experiences with ODK in the field and on cost, decided to go with MyTouch phones. We ended up needing more phones when we were in India, so we got one of the 3 Android options readily available in our town – the Sony Ericsson Xperia x8. Again, cost was our biggest concern. It’s worked fine, though recently we’re having some trouble with the GPS.
- Decide how data will be transferred between Androids and a computer. This is dependent on the design of your field logistics. Are the enumerators bringing the phones back to a central point daily? Weekly? Will you have a computer at that point? How often do you want data transferred? Also, are there privacy concerns for your data? Who will be there when something goes wrong with whatever system you have set up?
- No internet?: As a team we decided on logistics: that the phones would be handed in by the enumerators each evening and given back each morning. This meant that there was no reason for us to even have SIM cards in the phones or use the internet for our transfer of data at all.
- Aggregate: Aggregate is not (yet) secure on its own, so we would’ve needed to add security to it ourselves and figure out how to put it on a computer. I wasn’t sure I was easily capable of this, so I’d need to call on a friend to help. While possible, this was not ideal – since I was the one in the field, I wanted to be capable of handling any issue that came up.
- KoboPostProcessor: I discovered the best solution for us was KoboPostProcessor. This takes the forms produced by ODK Collect and transcribes them into a .csv file. I made the decision to use this and shelved the actual setup for later (of course, it would have been better to at least have gone through the process of getting it to work a few times before deciding to use it!).
- Convince your survey writers to give you the ‘final final’ survey questionnaire. Our survey went through an extensive 2-3 months of piloting – all in paper form – before it even got to a ‘final form’. Our survey is extremely long and divided into 10 sections, and some sections were finished before others. I required at least 95% completion of sections before I started programming, as I knew it would be a lot more work for me to make changes later. It’s tempting when things are not printed out in paper to keep making changes until the last minute, but I think it’s important to set deadlines and constraints about how much can be changed after a version is given to the ‘programmer’.
- Code your survey.
- Decide how you’ll handle language issues. Since the survey was still undergoing piloting and the translation was not yet done, and I don’t speak Kannada, everything was coded initially in English. We decided that our survey was incredibly long, and since others had reported running into crashes with long surveys, we wanted to keep the English and Kannada versions entirely separate. Also, we knew almost all of our surveys would be in Kannada, so there likely wasn’t going to be a need for a separate English survey. ODK has great functionality for changing languages within the form, but we didn’t use any of that. I completed everything in English and copied and pasted the Kannada language in over the English later.
- Keep things simple. We wanted the screens to be as uncluttered as possible, and the questions to be as simple and plain as possible. A lot of this is survey craft, but relevant for ODK:
- Question numbering. All of the questions in the paper survey were numbered, since it’s necessary for paper-based skip logic. However, we left the numbering out of the ODK form because there was really no need for it – the skip logic would be taking care of itself, and it just added more text to the screen that there was no reason for the enumerators to need to see or keep track of.
- Groupings. While for ourselves, we had sections like ‘Child Health’ or ‘Water Treatment,’ and our first version of the ODKized survey had these as ‘groups’ that then showed at the top of the screen, we realized there was really no reason for the enumerators to need this information and it just added more things that cluttered up the screen. It was useful for us, early on in the programming and testing phase, but we got rid of it for the final form.
- Hints. Many of our questions had hints like ‘Enter 99 if don’t know’ or ‘DON’T read the options’. The first version was full of long, convoluted various forms of these, so we standardized the 4-5 basic messages that appeared throughout the survey and made them as short as possible.
- Constraint messages. While it’s great that you can put in specific constraint messages, we felt it flashed for too short of a time for our enumerators to read and process (especially considering they would have been in transliterated Kannada, which is slower for reading). We started out with specific constraint messages but ended up simplifying and just leaving the default there for everything except one question where we wanted to be specific.
- Annotate your survey document. This is likely a personal preference and I don’t know if this will work for everyone, but it certainly did for me. Our survey was being written and piloted by printing it out in Word. I took the word document and started making my own ‘coding’ annotations to it. In almost all questions, my colleagues had put in the skip logic and constraints. Skip logic in a paper questionnaire is generally backwards from the way ODK thinks about it. In a paper questionnaire, the answer to question 4.1.1 might be: Yes (if yes, skip to question 4.1.3). ODK works backwards, and instead you need to append to question 4.1.3 that it is only a relevant question if 4.1.1=yes. Annotating in Word was necessary for me to keep all of this straight. Before each question in the Word doc, I’d write (in a different font) notes for myself: the name of the variable (CostWater), the relevant parameters (int, constraint <10000; relevant if PayWater=1, hint: ‘99 if don’t know’), and the values that would be given for each multi-choice. We also knew we’d be teaching the enumerators that any ‘don’t know’ in a numerical input question would be ‘99’, so I made sure our constraints accounted for this. In retrospect, while annotating was a great idea, doing so in Word and doing so while the survey was still being amended was not. It made changes hard (if my colleagues wanted to make a change to a question, they’d make it on their version, do ‘track changes’ and then I’d have to go back to my version and make the same change; vice versa for when I found mistakes in their skip logic, etc.). I couldn’t annotate directly to their document, since it was in a formatting useful for printing it out daily and using it in the field for piloting. I think what I would recommend is using either Excel or at least a table in Word to keep track of it. 
I haven’t thought this through, but in the next month I’ll be programming again an updated survey document for Round 2 and I’ll append this if I find a better system. In the end, it was great we had this annotated document, and I made sure that everything that ended up in the coding of the form was also reflected in the document. This made pulling out the information to make a codebook easy.
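As a rough illustration of how one of these annotations maps onto a form, here is a minimal Python sketch that turns the CostWater example above into an XForms bind element. The Annotation class, the to_bind function, and the "data" root node name are all hypothetical names made up for this sketch; only the variable names and the values (int, <10000, relevant if PayWater=1) come from the annotation described above. Keeping annotations in a structured form like this, rather than free text in Word, is one way the same record could feed both the form and the codebook.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Annotation:
    """One annotated question, mirroring the notes written before each question."""
    name: str                         # variable name, e.g. CostWater
    qtype: str                        # data type, e.g. int, decimal, string
    relevant: Optional[str] = None    # condition under which the question is shown
    constraint: Optional[str] = None  # condition a valid answer must satisfy


def to_bind(q: Annotation, root: str = "data") -> str:
    """Render an annotation as the text of an XForms <bind> element.

    `root` is the (assumed) name of the form's top-level instance node.
    """
    attrs = {"nodeset": f"/{root}/{q.name}", "type": q.qtype}
    if q.relevant:
        attrs["relevant"] = q.relevant
    if q.constraint:
        attrs["constraint"] = q.constraint
    # ElementTree escapes '<' in attribute values, which XML requires.
    return ET.tostring(ET.Element("bind", attrs), encoding="unicode")


# The example from the text: an integer under 10000, asked only if PayWater=1.
# The 'don't know' code 99 already passes the "< 10000" constraint.
cost_water = Annotation(
    name="CostWater",
    qtype="int",
    relevant="/data/PayWater = '1'",
    constraint=". < 10000",
)
print(to_bind(cost_water))
```

Note that the relevance lives on the later question (CostWater), not on the question that triggers it (PayWater), which is the 'backwards' part compared with a paper skip instruction.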
- Adjust all questions to work with ODK. Through the annotations process, you’ll find some questions whose format or skip logic won’t really work with ODK.
- Range answers. One that came up many times in our survey was ‘range’ answers. For example, a question like ‘how many days ago did you last collect borewell water’ might have an answer of 3, 4, or 3-4 days. We would have to do this using a text input, so the enumerators would have to switch from the text to the numerical keyboard to type “3-4”. Adding symbols complicates matters too, and since most of the answers to this question were not a range, we didn’t want to complicate things by having them switch between keyboards. We came up with a few solutions: 1) Turn it into a multi-select (with the values of 1, 2, 3, 4, etc.) and instruct the enumerators to select both ‘3’ and ‘4’ if the answer was 3-4; 2) Make single-select ‘buckets’ like ‘less than 2 days, 2-4 days’ etc. or 3) Use a decimal input and instruct enumerators to use ‘3.5’ in the event they reported 3-4 days. The solution we used varied depending on the question and the resolution of answers that we got.
- Tables. One of our questions in the paper format involved a table, where the enumerators would fill in information about water containers, including sizes, materials, shape, etc. A table worked great on paper, but is hard to translate into ODK. I first started with a repeating group, where each of the questions (size, material, shape, etc) was asked in turn. This worked well until I discovered that KoboPostProcessor would not work with repeats (see my comment above that I should’ve tried it out much, much earlier). I then made it into a forced repeat (rather than using a loop, just asking the questions again and again). But since we were allowing for up to 15 different storage container combinations, this grew quickly into an extra 75 questions, which led to worries about crashes due to survey length (we were also using the previous version of ODK Collect – maybe the new one could handle this). We ended up deciding that 1) we didn’t need this information from all 4000 households; 2) it was much, much easier to fill this information out in table form; and 3) we were nervous about the enumerators holding the phones over the water storage containers as they measured the tops and sides of them. So we decided to have them fill out this information in a 1-page paper format at only their first house of the day. Some of our other info for houses is on paper (name and address information and consent scripts (an ethics committee protocol), sheets that household IDs are checked off of, etc.), so the enumerators were used to having some paper out during the survey, and we were set up for a small amount of data entry.
- Work on small sections first. Luckily our survey was already broken into smaller sections, so I coded forms individually by section. It made it much easier to debug than a huge long form would have been. If there were relevancies that depended on previous sections, I left a commented note to do this when I integrated them. I also named each with different groupings (Section 5, Section 6, etc) to make it easy to navigate. For our complicated issues with relevancy and constraints, I mostly borrowed from the great example of the icmi form (on the ‘example forms’ page on the ODK site). Some of the things I tried worked, some didn’t; when I came upon something that I wasn’t sure was possible, or that left me feeling like it was a huge hassle/danger of leading to fatal crashes, I asked my friends to change up the survey to accommodate easier logic (adding a new question, breaking up one question into several, etc).
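For the range answers described above, the multi-select option (1) leaves some work for analysis time, since ‘3’ and ‘4’ ticked together has to become one number. Below is a minimal Python sketch of that conversion. It assumes the selected values arrive as a single space-separated string and that taking the midpoint is an acceptable reading of a range, which matches the ‘3.5’ convention in option 3; the function name days_from_multiselect is made up for this example.

```python
def days_from_multiselect(raw: str) -> float:
    """Turn a multi-select answer such as '3 4' into a single number of days.

    Assumes a range was recorded by ticking both ends of it, and that the
    midpoint is what the analysis wants (so '3 4' is treated like 3.5).
    """
    values = [int(v) for v in raw.split()]
    if not values:
        raise ValueError("no option was selected")
    return (min(values) + max(values)) / 2


for answer in ["3", "3 4", "2 5"]:
    print(answer, "->", days_from_multiselect(answer))
```

A single tick comes back unchanged (as a float), so the same function can be applied to every row without first checking whether the household gave a range.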
- Extensively test sections. Our survey includes complicated skip logics. I enlisted the help of my colleagues and everyone else who was around to test every possible permutation. When there were crashes (there always were), I’d go to the code of the screen that it had crashed near and try to find the problem.
- Slowly integrate sectional forms together. I started to bring sections into one ‘master’ form and tested extensively at each step of the way. No, testing again and again is not fun, but every time I’d get ambitious and confident, I’d end up with a bug I couldn’t trace.
- Get your translation right. This isn’t an ODK issue, but we wished we had had this advice beforehand to make sure we put the time in our schedule. We sent our survey out for help with translation to some students who had done some translation work for us in the past. We also got it independently back-translated. When our field supervisor came on board, we had to spend many, many hours going question by question with him making sure the survey questions were asking what we wanted them to ask (my favorite: the answer ‘buried’ for the question ‘where do you put your child’s feces’ had been translated into the Kannada ‘on your ancestor’s grave’). While sending it off and not being a part of the process is a good first cut, make sure you account for the copious one-on-one time needed with someone who speaks both languages and understands what it is you’re trying to get at with your questions.
- Extensively test final form. My colleagues made a flow chart of the especially tricky skip logic sections and went through step by step to make sure everything checked out with ODK.
- Set up data transfer. We used KoboPostProcessor. First I hooked up the phone, navigated to the ODK>forms folder on the phone, and copied the forms to a folder on my computer that I called “raw data”. I then set the KoboPostProcessor to Transcribe from the “raw data” folder to a “transcribed data” folder.
- I didn’t get the Sync feature of KoboPP to work, so a friend wrote a [[simple python script]] which could pull data from multiple phones at once to the computer.
- KoboPP does a great job pulling the data from the xml forms to a .csv; however, for some reason it re-aligns all the columns in a strange order. To alleviate this, I added numbers in front of all of the variables, e.g. A01, A02, B01 etc (a fair amount of work, but I think it was better to do this after the form was programmed, since re-numbering after changes to the survey would have been worse). We would then just re-sort the columns in Excel.
- We also found that KoboPP on my computer (Windows XP) would break the answers of multi-select questions into separate columns (with associated ‘1’s and ‘0’s), while our field supervisor’s computer (Windows 7) would not. My only guess is that this is a Windows issue, since we tried it on several other computers and came up with the same results. We ended up having to re-transcribe the first week’s worth of data, since we had switched the computer doing this operation a week into the survey.
- KoboPP doesn’t work with repeats in the forms, so we got rid of the few of these we had.
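The two clean-up steps above (re-sorting the prefixed columns, and getting multi-select answers into ‘1’/‘0’ columns regardless of which computer did the transcribing) can also be done in a few lines of Python instead of Excel. This is only a sketch under assumptions: the column names and the option list are invented for the example, and it assumes a multi-select answer is stored as one space-separated string in the .csv.

```python
import csv
import io


def resort_columns(rows):
    """Return (header, rows) with columns in sorted order, e.g. A01, A02, B01.

    Zero-padded prefixes matter here: 'A10' sorts after 'A09', but a bare
    'A10' would sort before 'A2'.
    """
    header = sorted(rows[0].keys())
    return header, [[row[col] for col in header] for row in rows]


def split_multiselect(value, options):
    """Expand a space-separated multi-select answer into 1/0 flags per option."""
    chosen = set(value.split())
    return {opt: 1 if opt in chosen else 0 for opt in options}


# Stand-in for a transcribed .csv with its columns out of order.
sample = "B01_Source,A02_Rooms,A01_HouseID\ntap well,3,H001\n"
rows = list(csv.DictReader(io.StringIO(sample)))

header, data = resort_columns(rows)
print(header)
print(data)
print(split_multiselect(rows[0]["B01_Source"], ["tap", "well", "tanker"]))
```

Doing the multi-select split yourself means the output no longer depends on which machine ran the transcription.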
- Prepare the phones. We went through the phones and put ODK collect on all of them, changed settings (simple background, not allowing a rotating screen, hiding all apps on the home page, putting ODK Collect on every single page, airplane mode, etc). We then put our forms on all of them by hooking them up to a computer and manually putting the form into the ODK>forms folder. I’ve found that sometimes the form doesn’t entirely get onto the phone properly the first time – I have no idea why, but sometimes, after putting a form onto the SD card and trying to open the form in ODK, it would crash. So whenever we loaded a new form to the SD cards of the phones, we’d run through each survey at least once and save it to make sure it was working OK before handing them to the enumerators.
- Train the enumerators. Refer to Neil’s excellent guide. We followed this exactly, though with the first few days set aside for teaching the enumerators to read transliterated Kannada.
- Set up a daily system. The procedure for the end of each day: 1) connect 4 phones through a USB hub; 2) test that they’re all there using the android debug bridge (adb); 3) run the script to pull the data; 4) check that the number of forms in the folder matches what was expected (each day the enumerator teams hand in a paper that states the number of households each visited); 5) repeat for the other 4 phones, and check numbers; 6) transcribe the data and re-sort it; 7) check that household IDs in the transcribed .csv match what was expected, and check through the data for errors; 8) charge phone batteries and delete data from the phones; 9) password protect all data (complies with our CPHS protocols).
- Bi-monthly. Every 2 weeks the enumerator teams finish a ‘ward’ (geographically specified area). At this time we go through the data to check that things are making sense and have enumerators return to households to fix issues (the wonderful thing about electronic data collection!), check the accuracy of GPS (enumerators re-take coordinates if accuracy is >20m), and make maps of coordinates to make sure they stayed within boundaries (and re-do households if they haven’t).
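Several of the daily and ward-level checks above are mechanical enough to script. The Python sketch below shows three of them as small functions: confirming which phones adb can see, finding household IDs that were expected but are missing from the transcribed data, and flagging GPS readings with accuracy worse than 20m. The function names, phone serial numbers, and household IDs are all made up for illustration; the only outside fact relied on is that the `adb devices` command prints one line per phone with its serial number followed by its state.

```python
def attached_phones(adb_devices_output):
    """Parse the text printed by `adb devices` into a list of ready serials.

    A phone that is connected and usable shows up as '<serial> device';
    header lines and phones in other states (e.g. 'offline') are ignored.
    """
    serials = []
    for line in adb_devices_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def missing_households(expected_ids, transcribed_ids):
    """Household IDs on the day's paper list that are absent from the .csv."""
    return sorted(set(expected_ids) - set(transcribed_ids))


def needs_gps_retake(accuracy_m, limit_m=20.0):
    """True if the reported GPS accuracy is worse than the limit (>20m)."""
    return accuracy_m > limit_m


# Example with invented values: one of two phones is not ready,
# one household was not surveyed, and one GPS reading is too coarse.
output = "List of devices attached\nPHONE01\tdevice\nPHONE02\toffline\n"
print(attached_phones(output))
print(missing_households(["H001", "H002", "H003"], ["H001", "H003"]))
print([acc for acc in [8.0, 25.0, 20.0] if needs_gps_retake(acc)])
```

In practice the text passed to attached_phones would come from running adb itself (for instance through Python's subprocess module), and comparing the length of the returned list against the 4 phones on the hub covers step 2 of the daily procedure.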
Our first 4-month round is almost done, and we have 3 more rounds – 9 months left.
Questions? Email me at ekumpel (at) berkeley (dot) edu