Cleaning and Fixing Bilingual Data for MT


  • ISO 9001 quality certified
  • Excellent customer service
  • Expertise in more than 120 languages


Natural Language Processing Engineers

You need high-quality bilingual data to train your Machine Translation (MT) engine. The more data you have, the better. But it’s no good if it isn’t processed in the right way, or if it contains errors or “noise” which will mislead your machine. That’s why Asian Absolute’s service for cleaning and fixing bilingual data is used by our clients in the UK and globally.

So whatever data you have ready for MT engine training, whether it’s for Statistical (SMT) or Neural (NMT) Machine Translation, it’s going to get you the best possible results when you get started.

Why choose Asian Absolute?

You’ll find it simple to clean and fix your bilingual data, ready for it to be pre-processed and used to train your MT engine. This helps you train a better engine without increasing the volume of data.

Get all of the expertise you don’t already have in-house. Need to get that list of actions defined? Ramp up capacity to get those fixes done fast? Need in-domain linguists to check your data too?

You can rely on us for whatever skills you prefer to outsource.

  • Award-winning project management, certified to the ISO 9001 quality standard
  • Rely on specialist engineers with extensive experience delivering Machine Translation services
  • Get a list of actions with clear steps to take to fix your data
  • Count on linguists with in-domain experience in every field – banking, legal, manufacturing and many more – available now
  • Make sure your data is clean in more than 120 languages


Asian Absolute helped in the challenging task of building a world-class translation service. They provide top quality, personal service.

Financial Times

I was extremely impressed by Asian Absolute’s hard work to complete the project to our high standards and within a very tight timeframe.

Global Witness

Many thanks for your help and also for providing an interpreter for the week, she was absolutely fantastic and a real life-saver!

Guinness World Records

What data do I need for bilingual machine translation engine training?

The data you need to train your MT engine should be:

In-domain

All of your data should relate to your subject area.

Clean

Your data should contain as close to zero “noise” as possible. Noise, in this case, can be thought of as any errors which might “distract” your MT engine from achieving an accurate translation of your text.

Of sufficient quantity

The lower the volume of data you have, the more important it is that you ensure it is as clean as it can be.

Note: You might have already used dirty data to train your SMT engine. If so, you can clean it and improve your engine’s performance without having to train a new machine from scratch.

How do you clean and fix bilingual data?

Asian Absolute’s Machine Translation engineers have multiple tools at their disposal to fix your data. Text cleaning tools and custom scripts maximise efficiency. Specialist in-domain linguists are also available should you need a truly comprehensive improvement of your data.

You can choose for us to go about cleaning your data in one of two ways:

1. Rely on the experts for a full investigation

A full examination should always be a precursor to using your bilingual data to train your MT engine. You’ll need highly experienced and qualified engineers to do this in-house. If you don’t have them, this is the level of service you’re looking for.

A full examination will provide you with a list of suggested actions. It’s a little like a bug report. It will list all of the problems with your data. And what needs to be done to resolve them.

  • Your personal project manager takes down all of your specific requirements.
  • Asian Absolute’s data processing engineers create a list of actions showing the categories of issues in your data, any identified patterns between them, how they can be fixed and the results you can expect by doing so.
  • Your team of linguists and engineers can then clean your data up to a standard which is optimal to use in training your MT engine. Or you can rely on us to do so.

2. Provide us with a list of issues

Alternatively, when cost-effectiveness is high on your list of objectives, you can provide us with your own list of specific issues. This will minimise the workload of our engineers cleaning your data, and thus the cost of each round of cleaning.
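
Whether it comes from a full investigation or from your own team, a list of issues works best when it is structured: one row per category, with a count and a suggested fix. The Python sketch below shows one way such a list might be assembled. It is an illustration only, not the format Asian Absolute delivers. The category names, the fixes in SUGGESTED_FIXES and the function build_action_list are all hypothetical.

```python
from collections import Counter

# Hypothetical category names and fixes, chosen only for illustration.
SUGGESTED_FIXES = {
    "misaligned": "Re-align the segment pair or remove it",
    "untranslated": "Remove the pair or send it for translation",
    "duplicate": "Keep a single copy of the repeated pair",
}


def build_action_list(issues):
    """Group (segment_id, category) findings into one row per category."""
    counts = Counter(category for _segment_id, category in issues)
    total = sum(counts.values())
    return [
        {
            "category": category,
            "segments": count,
            "share": round(count / total, 2),
            # Categories without a known fix are left for a person to review.
            "suggested_fix": SUGGESTED_FIXES.get(category, "Review manually"),
        }
        for category, count in counts.most_common()  # largest category first
    ]


if __name__ == "__main__":
    findings = [(12, "duplicate"), (40, "duplicate"), (7, "untranslated"), (93, "encoding")]
    for row in build_action_list(findings):
        print(row)
```

Sorting by count puts the categories that affect the most segments at the top, which is usually where a round of cleaning pays off first.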

What sort of noise can you fix?

The main goal when cleaning bilingual data is to fix or outright eliminate noise. Some of the most common types of MT noise which our engineers will identify and fix include:

  • Misaligned sentences
  • Untranslated sections in the source or target text
  • Instances of the wrong language used in the source or target text
  • Issues relating to named entities
  • Short segments
  • Insertion errors, where one segment contains repetitions of whole sentences
  • Duplication errors, where words, phrases or entire sentences are repeated
  • Issues relating to case sensitivity

But some noise requires the attention of a skilled linguist to identify. We recommend our service for cleaning and fixing bilingual data (linguists) to maximise the value of your data.
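
Several of the simpler checks listed above can be automated with a short script. The Python sketch below is a minimal illustration under stated assumptions, not a description of Asian Absolute’s own tools. The function names and thresholds (min_tokens, max_ratio) are hypothetical, and splitting on whitespace does not suit languages written without spaces, such as Chinese or Japanese. Wrong-language detection and named-entity checks need dedicated tooling and are left out.

```python
import re


def has_repeated_sentence(text):
    """True if the same sentence appears more than once within one segment."""
    parts = re.split(r"[.!?。！？]+", text)
    sentences = [p.strip().casefold() for p in parts if p.strip()]
    return len(sentences) != len(set(sentences))


def flag_pair(source, target, min_tokens=3, max_ratio=2.5):
    """Return the names of the heuristic checks a segment pair fails."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return ["empty_segment"]
    flags = []
    if src.casefold() == tgt.casefold():
        flags.append("untranslated")  # target is just a copy of the source
    # Assumption: whitespace tokenisation is good enough for this language pair.
    if len(src.split()) < min_tokens or len(tgt.split()) < min_tokens:
        flags.append("short_segment")
    # A large character-length ratio is a common hint of misalignment.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
        flags.append("possible_misalignment")
    for text, side in ((src, "source"), (tgt, "target")):
        if has_repeated_sentence(text):
            flags.append("repeated_sentence_" + side)
    return flags


def clean_corpus(pairs):
    """Split (source, target) pairs into kept and rejected lists."""
    seen = set()
    kept, rejected = [], []
    for source, target in pairs:
        flags = flag_pair(source, target)
        key = (source.strip().casefold(), target.strip().casefold())
        if key in seen:
            flags.append("duplicate_pair")  # exact repeat, ignoring case
        seen.add(key)
        (rejected if flags else kept).append((source, target, flags))
    return kept, rejected


if __name__ == "__main__":
    sample = [
        ("The invoice is due on Monday.", "La facture est due lundi."),
        ("The invoice is due on Monday.", "La facture est due lundi."),
        ("Terms and conditions apply here.", "Terms and conditions apply here."),
        ("Please sign here. Please sign here.", "Veuillez signer ici."),
    ]
    kept, rejected = clean_corpus(sample)
    print(len(kept), "kept;", len(rejected), "rejected")
    for source, _target, flags in rejected:
        print(flags, "<-", source)
```

Rejected pairs are returned together with their flags rather than silently dropped, so a linguist can review borderline cases before anything is removed from the training data.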

Get free advice about machine translation services 24/7

Contact us today. Ask any questions you might have about cleaning and fixing bilingual data. And then get a free, no-obligation quote before you choose to go ahead.
